ArticlePDF Available

Abstract and Figures

Background: Today a variety of phylogenetic file formats exists, some of which are well-established but limited in their data model, while other more recently introduced ones offer advanced features for metadata representation. Although most currently available software only supports the classical formats with a limited metadata model, it would be desirable to have support for the more advanced formats. This is necessary for users to produce richly annotated data that can be efficiently reused and make underlying workflows easily reproducible. A programming library that abstracts over the data and metadata models of the different formats and allows supporting all of them in one step would significantly simplify the development of new and the extension of existing software to address the need for better metadata annotation. Results: We developed the Java library JPhyloIO, which allows event-based reading and writing of the most common alignment and tree/network formats. It allows full access to all features of the nine currently supported formats. By implementing a single JPhyloIO-based reader and writer, application developers can support all of these formats. Due to the event-based architecture, JPhyloIO can be combined with any application data structure, and is memory efficient for large datasets. JPhyloIO is distributed under LGPL. Detailed documentation and example applications (available on significantly lower the entry barrier for bioinformaticians who wish to benefit from JPhyloIO's features in their own software. Conclusion: JPhyloIO enables simplified development of new and extension of existing applications that support various standard formats simultaneously. This has the potential to improve interoperability between phylogenetic software tools and at the same time motivate usage of more recent metadata-rich formats such as NeXML or phyloXML.
UML diagram showing the data adapter interfaces providing access to the application model for JPhyloIO writers. From top to bottom the object relation (indicated by compositions) is shown, while the class hierarchy can be read from bottom to top. Note that not all but only exemplary methods are shown for each interface. The DocumentDataAdapter is the main adapter that provides access to other adapters modelling OTU lists, matrices and phylogenetic trees or networks. Not all application models will provide all these datatypes and therefore not need to implement all types of adapters. The format specific writer classes in JPhyloIO can access the data either by event getter methods (e.g., MatrixDataAdapter.getSequenceStartEvent()) with an event ID as its parameter or by writeXXX() methods (e.g., MatrixDataAdapter.writeSequencePart-ContentData()), which write a whole subsequence of the event stream to a special receiver object provided by the application. To simplify the adapter implementation for application developers only frequently used events are provided by getter methods, while the others can directly be written in a sequence by implementing an appropriate writer method. (Getter methods were introduced for cases where random access to events with known IDs is frequently necessary for writers, to avoid requesting a whole sequence, if only one event is needed. Providing some events by getter and some by writer methods in the data adapter model is a compromise between ease of implementation and runtime performance.). Some adapters share common functionality, which is modelled by common superinterfaces, such as AnnotatedDataAdapter or ElementDataAdapter
Content may be subject to copyright.
S O F T W A R E Open Access
JPhyloIO: a Java library for event-based
reading and writing of different
phylogenetic file formats through a
common interface
Ben C. Stöver
, Sarah Wiechers and Kai F. Müller
Background: Today a variety of phylogenetic file formats exists, some of which are well-established but limited in
their data model, while other more recently introduced ones offer advanced features for metadata representation.
Although most currently available software only supports the classical formats with a limited metadata model, it
would be desirable to have support for the more advanced formats. This is necessary for users to produce richly
annotated data that can be efficiently reused and make underlying workflows easily reproducible. A programming
library that abstracts over the data and metadata models of the different formats and allows supporting all of them
in one step would significantly simplify the development of new and the extension of existing software to address
the need for better metadata annotation.
Results: We developed the Java library JPhyloIO, which allows event-based reading and writing of the most common
alignment and tree/network formats. It allows full access to all features of the nine currently supported formats. By
implementing a single JPhyloIO-based reader and writer, application developers can support all of these formats. Due
to the event-based architecture, JPhyloIO can be combined with any application data structure, and is memory efficient
for large datasets. JPhyloIO is distributed under LGPL. Detailed documentation and example applications (available on significantly lower the entry barrier for bioinformaticians who wish to benefit from
Conclusion: JPhyloIO enables simplified development of new and extension of existing applications that support
various standard formats simultaneously. This has the potential to improve interoperability between phylogenetic
software tools and at the same time motivate usage of more recent metadata-rich formats such as NeXML or phyloXML.
Keywords: Phylogenetic metadata, Data reuse, Data annotation, NeXML,PhyloXML,NEXUS, Phylogenetic tree, Multiple
sequence alignment
The amount of available data in organismic and
biodiversity-related disciplines, such as phylogenetics,
taxonomy or ecology [1,2] as well as related fields of
molecular biology, especially genomics or genome
evolution has been growing and continues to grow at
an accelerated rate. Among other factors, increasingly
cheaper high-throughput sequencing technologies [3],
data collected in the context of barcoding initiatives
[46], the ongoing digitization of biological collec-
tions [7,8], and large-scale data acquisition (e.g., related
to monitoring biodiversity) in citizen-science [912]con-
tribute to this increasing amount of primary data. On top
of that, the availability of faster processing units allows for
increasingly advanced downstream analyses and the paral-
lel application of multiple alternative methods and para-
meter sets, which in turn leads to even more (derived)
data, potentially multiplying the amount with value for
reuse in subsequent studies.
© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (, which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
( applies to the data made available in this article, unless otherwise stated.
* Correspondence:
Institute for Evolution and Biodiversity, WWU Münster, Hüfferstraße 1, 48149
Münster, Germany
Stöver et al. BMC Bioinformatics (2019) 20:402
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
While these developments open up new perspectives
for studies and applications that make use of big data,
the practical reusability of data continues to be an issue.
Primary sequence data tend to be reused on a regular
basis, but the accessibility of other derived data types
like phylogenetic trees is still low [13]. Phylogenetic hy-
potheses, for example, are often provided as image files
only. Although software exists to reconstruct phylogen-
etic trees from images [1417], making trees easily re-
usable would require representing tree topologies and
branch lengths in defined phylogenetic file formats
[13,18]. Public availability and searchability of scien-
tific data are necessary to foster reuse [19,20]and
can be addressed by respective policies of funding
agencies and scientific journals [21] together with
cyberinfrastructure (e.g., [2226]) for the long-term
storage. Data annotation is an equally important re-
quirement to easily and unambiguously identify and
understand relevant data that fits the need for a con-
crete project [13,27,28]. Metadata annotation is ne-
cessary to, e.g., unambiguously link tree nodes to
sequences in a multiple sequence alignment that was
used to generate the tree, or to link tree nodes and
sequences to taxonomic information, ideally also
using a taxonomic ID (e.g., NCBI Taxonomy [29]) -
or even better linking a sequence back to the individ-
ual specimen it was derived from [30]. Additionally,
linking relevant external resources (e.g., voucher in-
formation, digitized specimens or sequencing raw
data) and providing metadata that reliably identifies
the methods that were used to generate data (e.g., the
software and parameters used for a phylogenetic infer-
ence) would further improve reusability of data and repro-
ducibility of studies. More extensive lists of information to
be provided for better reusability and reproducibility can
be found in [31,32]. Storing the results of phylogenetic
analyses using metadata-rich formats is an ideal basis to
link all necessary metadata and resources.
The big data-driven development of databases to reuse
phylogenetic information (e.g., [2426]) or the applica-
tion of deep learning approaches (e.g., [3335]) would
be significantly simplified if more data were annotated
with more structured metadata and could be automatic-
ally collected from a wide variety of studies. Beyond that,
making phylogenetic data more reusable is important for
every discipline where data is analyzed in a phylogenetic
context, evolutionary aspects are part of studies, or
alignments by homology or phylogenetic trees are
needed at some point during analysis.
Phylogenetic file formats to store taxon or OTU lists,
character matrices, multiple sequence alignments or
phylogenetic trees and networks, are an integral part of
phylogenetic workflows and the basis for interoperability
between the software tools and databases involved.
Several formats exist, which can be grouped into two
categories: (i) classical formats, such as FASTA,PHYLIP
[36], Newick [37]orNEXUS [38] and (ii) more recently
developed formats with significantly more advanced
metadata models, such as NeXML [39]orphyloXML
[40] (See also Table 1). Due to their ability to store rich
metadata, the more advanced formats are much better
suited to annotate data in a way that allows efficient
reuse and clear reproducibility, as argued above. The
metadata model of NeXML, for example,uses RDF-
predicates from externally defined ontologies to link ex-
ternal resources, allowing maximal reusability of phylo-
genetic data (RDF =Resource Description Framework,a
semantic web technology to formulate logical state-
ments, [41,42]). These modern formats are XML-based
and therefore well-defined by XML schema definitions.
XML libraries available for nearly all programming lan-
guages can be used to process them. In contrast, the
classical formats are plain text-based and usually do not
have a formal definition (e.g., a grammar), resulting in
different dialects and incomplete reader implementa-
tions. This frequently leads to interoperability issues
when exchanging data between different software, which
is another downside of using the classical formats
instead of the more recent ones. In practice though, the
classical formats dominate and only a small number of
researchers actually make use of these newer formats
and annotation standards. One important reason for this
is that the majority of widely used computer applications
only support the classical formats.
To help with the transition of the community to pro-
viding richly annotated data, software is required that
simplifies such annotation and supports the modern for-
mats. This software should allow importing from and
exporting to the classical formats to provide downwards
compatibility and be interoperable with applications that
do not (yet) support metadata-rich formats. The devel-
opment of such software can be costly and complicated
since different readers and writers for all formats need
to be implemented and the representation of the appli-
cationsdata in the different formats must be developed,
which usually requires detailed knowledge on all for-
mats. Since the necessary resources would have to be
subtracted from working on the core functionality of the
application, this is usually omitted. To foster the deve-
lopment of new and the extension of existing software
to support the necessary variety of formats, we created
the Java programming library JPhyloIO. It allows access
to phylogenetic data through one common interface that
fully models the data and metadata concepts of all men-
tioned formats. This enables Java application developers
to support both modern metadata-rich formats (to
produce easily reusable data) and classical formats
(for larger interoperability) by implementing only one
Stöver et al. BMC Bioinformatics (2019) 20:402 Page 2 of 15
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
JPhyloIO-based reader and writer. JPhyloIO models
data and metadata in all formats (as far as they allow
this) without the need for the developer to explicitly
deal with this problem, which we hope will help
encourage the use of modern metadata-rich formats.
JPhyloIO offers a general way for reading and writing vari-
ous phylogenetic file formats, without imposing any con-
straints on the data model of applications using the library.
(See Fig. 1and Fig. 2.) An event-based architecture (similar
to iterator-based StAX for XML parsing that is common in
Java) was chosen over a model-based approach (that
would define its own data model classes), because repre-
senting phylogenetic data as a sequence of event objects
allows compatibility with all application data structures
and memory efficient processing. The JPhyloIO readers
and writers for different formats all implement a common
interface (using the strategy pattern [43]) and instances of
them are created using a factory implementation [43]that
can guess the format from a file or input stream.
Event streams for reading documents
JPhyloIOs reader classes translate the hierarchical data
structure of a document with phylogenetic data into a
linear sequence of event objects, which is formally
described by the grammar in Fig. 3. Events representing
data elements that consist of smaller parts (e.g., an align-
ment that consists of sequences) are modelled as a pair
of a start and an end event. The subsequence between
such two consists of events that model the content of
the data element at hand. By applying this recursively,
the hierarchical structure of a document can be serialized
to a linear event stream, as shown in the example in
Fig. 4. Applications using JPhyloIO need to implement a
reader for processing the encountered events and storing
relevant information in their data structure (Fig. 1, Fig. 2).
This can be done by iterating over the event stream using
(StaX-like) pull parsing, which allows the application to
actively request events one by one and therefore keep the
control flow. Classes for (SAX-like) push parsing are
additionally available, if an inversion of control is benefi-
cial, e.g., if multiple event listeners need to be present on
the application side. One such application reader imple-
mentation allows access to all supported formats, even
when additional formats are added in future releases of
JPhyloIO, without the need for any format-specific logic.
All created event objects have a string ID, which is
unique inside a documents event stream and allows to
link events to one another (e.g., a tree node to an OTU).
References will only be made to previous events that are
already known to the application. To reduce the amount
of work for application developers, the minimum infor-
mation is always directly contained in an event object,
so that only more complex application models will need
to resolve such ID dependencies.
A web interface that generates lists of JPhyloIO events
from any example file is available at http://r.bioinfweb.
info/JPhyloIOEventLister. It allows to fold and unfold
the output of subsequences and helps to get started with
the way JPhyloIO translates phylogenetic data and meta-
data into events. Besides using custom files, users can
also choose from a set of predefined example files in
different formats.
Table 1 Formats supported by JPhyloIO
Format Type OTUs MSAs Trees Networks Simple Metadata
Complex metadata
Read Write
Relaxed PHYLIP Text X XX
Newick Text X X X X
eNewick (Newick, NEXUS) Text X X X X
phyloXML XML X X X X 3 X X
A variety of file formats used in phylogenetics are supported. These can either be based on XML, or define custom types of structured text which is indicated in the
second column. The central columns show whether a format supports taxon/OTU lists, multiple sequence alignments, phylogenetic trees or networks and what type of
metadata can be attached to at least some of these elements or their subelements. As shown in the two rightmost columns, JPhyloIO can read and write many of the
common formats, while formats specific to single applications can only be read
Attaching simple numeric or textual values to data elements
Attaching complex metadata elements that may be represented as XML structures
Simple annotations can be linked using CURIE-like identifiers and custom XML tags can be added to all elements but no explicit reference to external ontologies
is currently supported in phyloXML
Stöver et al. BMC Bioinformatics (2019) 20:402 Page 3 of 15
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Data adapters for writing documents
Format-independent writing of phylogenetic data can-
not be implemented in as straightforward a manner
as, e.g., StAX writing for XML, since the required
order of the data elements varies between the differ-
ent target formats, and direct writing of an event
stream (as defined by the grammar in Fig. 3)isnot
possible without having to buffer large amounts of
data in some cases. Therefore, we provide adapter in-
terfaces to be implemented by an application. These
bridge between the application data model and JPhy-
loIO writers (Fig. 2,Fig.5), which then can request
certain subsequences of the event stream (which
correspond to a grammar node in Fig. 3)intheorder
required by their target format.
Implementing such data adapters may be slightly more
effort for application developers than just writing a
method that directly creates an event stream from their
data model, but it has the advantage of allowing un-
buffered access in the required order for all formats.
Therefore, writing becomes much more memory effi-
cient, especially for large datasets.
Generalization over different metadata concepts
A key feature of JPhyloIO is that it provides a general
way of attaching metadata to any element in a phylogen-
etic data set, thereby abstracting over different metadata
concepts found in the supported formats. In our opin-
ion, the RDF-based metadata tags used by NeXML [39]
represent the most powerful way of modelling metadata
and therefore were chosen as the foundation of our
general concept. It allows to link external resources and
to represent trees of hierarchical metadata that can be
attached to any element of a phylogenetic document.
Predicates from externally defined ontologies are used to
link data and metadata, which ensures both maximal
flexibility regarding the type of metadata and unambigu-
ous and machine-readable descriptions of the relation-
ships at the same time. Technically, RDF distinguishes
between resource metadata elementsthat link external
Fig. 1 Data flow diagram showing how data is read into and written from an application data model. JPhyloIO contains a reader for each format
that translates the contents of a file to a sequence of events that are then processed by the custom reader of an application. This reader has
knowledge of the specific application data model and stores relevant information there. The writers available in JPhyloIO access the contents of
that model using data adapters provided by the application that allow random access to the applications data model. (For supported formats specific
for a single application, only readers are provided)
Stöver et al. BMC Bioinformatics (2019) 20:402 Page 4 of 15
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
resources or a set of other internal metadata elements
(forming a subtree within a metadata tree) and literal
metadata elementsthat link concrete values (e.g.,
strings or integers).
Following this structure, JPhyloIO provides a resource-
and a literal metadata event class. As shown in Fig. 3,
the event grammar allows nesting sequences of metadata
events (represented by the grammar node MetaInfor-
mation) in all data elements. Metadata can either be
represented as a resource- or a literal metadata event.
The literal metadata objects may be simple values (e.g.,
numbers or strings) or complex XML data, modelled by a
sequence of respective events. As an alternative to pro-
cessing the JPhyloIO event stream directly, our library pro-
vides adapter classes between JPhyloIO readers and both
iterator- and cursor-based StAX readers and writers to
empower application developers to directly reuse possibly
existing code for StAX-based reading and writing of
respective data without the need for any adaptation
to JPhyloIO.
Ways to extend JPhyloIO
All readers and writers in JPhyloIO implement common
interfaces and several abstract implementations of these
are available that provide shared functionality, e.g.,
specific for processing text or XML formats. It is
therefore simplified for developers to add new readers
and writers for additional (custom) formats that inte-
grate seamlessly with the architecture of the library
and can directly be used with all JPhyloIO-dependent
code in other software.
Fig. 2 UML class diagram showing the relation between JPhyloIO and an application based on it. All readers and writers implement a common
interface to be easily exchangeable in the application. Event readers produce a sequence of events (see Fig. 1) processed by an application reader
class that acts as an adapter between JPhyloIO and the application data model. Conversely, a set of data adapter implementations of the application
allows the JPhyloIO writers to access the data. Writing needs a slightly more complex architecture than reading, because writers need to access that
data in different orders depending on the target format. To achieve this, a set of data adapters (see Fig. 5For details) is necessary, each providing a
subsequence of the whole event stream modelling a document
Stöver et al. BMC Bioinformatics (2019) 20:402 Page 5 of 15
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
For creating complex Java objects from metadata
event sequences or to write them back, an interface with
a set of default implementations for common types is
provided, which can be used for additional custom
NEXUS-related classes are designed to use individual
handlers for all NEXUS blocks and commands, allowing
to easily add support for new or custom NEXUS ele-
ments in third party modules.
JPhyloIO is an open-source programming library that
allows to read and write different phylogenetic file for-
mats using a single event-based interface as described
above. It covers taxon- or OTU lists, character matrices
or multiple sequence alignments, phylogenetic trees or
networks and sets of elements (e.g., character sets).
Simple annotations and more complex metadata can be
attached to all elements of a document (e.g., trees, tree
branches, sequences) and JPhyloIO translates these using
the available features of each supported format.
Source codes and binary distributions are available
under the terms of the GNU Lesser General Public
License 3 from This web-
site also provides an extensive documentation, including a
detailed JavaDoc and a set of example applications.
Supported formats
As shown in Table 1,JPhyloIO supports reading and
writing the majority of phylogenetic file formats, includ-
ing common extensions of these. Additionally, reading
of some application-specific alignment and tree formats
is possible. The library imposes no restrictions on alpha-
bets used in molecular, morphological and other charac-
ter matrices, but guarantees that no invalid output for
any of the target formats can be written.
Sequence data, including optional comments, can be
read from and written to the FASTA format, with op-
tional column indices at the beginning of each line being
processed correctly. Writing of sequences and optional
comments is supported, but generated files will never
contain column indices, since these are not widely
supported and may cause problems in other software.
The PHYLIP format exists in a standard [36] and a re-
laxed [44] variant, which can both be read in interleaved
and non-interleaved forms (the non-interleaved form is
Fig. 3 Grammar describing the event sequence generated by JPhyloIO readers. These readers translate the hierarchical data structure of a phylogenetic file
(e.g., a NeXML file consisting of an alignment and a tree, which again consist of sequences or nodes and edges, and so forth) into a sequence of events as
defined by this grammar in extended Backus-Naur form (EBNF). The terminal symbols (in green) represent the types of events, each of which either has
a single SOLE or a START and END version, depending on whether additional data can be nested or not
Stöver et al. BMC Bioinformatics (2019) 20:402 Page 6 of 15
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
written for both). The PHYLIP format allows sequence
names only up to a certain length (which can be longer
in the relaxed variant), resulting in the need to shorten
them by JPhyloIO writers. In contrast to many other
available software tools, this implementation ensures
that all written names are unique, even if the full names
only differ in characters behind the cut-off position. If
sequence names were edited, the application will be
informed by a translation object, mapping old to new
The NEXUS format [38] is a text format consisting of
blocks that contain different types of data. Each block
consists of a set of NEXUS commands. JPhyloIO offers
readers and writers that support commands of the TAXA
block containing taxon lists, the DATA,CHARACTERS
and UNALIGNED blocks containing sequence and align-
ment data, the TREES block containing phylogenetic
trees and the SETS block, containing sets of other items.
One type of custom NETWORKS blocks containing
phylogenetic networks in eNewick format is also sup-
ported (see below). Sequence data can be in standard or
interleaved format with both single character and longer
tokens and ambiguous character definitions being sup-
ported. Tree nodes can be referenced by the taxon label,
the taxon index, or by using a separate translation table.
In contrast to other software, JPhyloIO supports all three
methods both for internal and terminal tree nodes, while
translation can be switched off for internal nodes if
necessary, e.g., to avoid conflicts between support values
and taxon list indices. For the SETS block, character,
taxon and tree sets are currently supported. The DIS-
TANCES,ASSUMPTIONS and NOTES blocks are cur-
rently not supported. As NEXUS files identify all
elements by a unique label (instead of distinguishing
between labels and IDs as, e.g., in NeXML), the re-
spective JPhyloIO writer edits labels to be unique if
necessary, and reports such changes using the same
translation object as the PHYLIP writer described above.
In addition to the initial NEXUS standard, the TITLE
and LINK commands from Mesquite [45] that allow
linking between blocks (e.g., TAXA blocks can be re-
ferenced by CHARACTERS or TREES blocks) and the
MIXED sequence datatype extension [46] from MrBayes
[47] are recognized.
Phylogenetic trees are represented as Newick strings
[37] in the TREES block of a NEXUS document or in
Fig. 4 Example document with its respective event sequence. The
document contains an OTU list and an alignment, which references
this list. The event sequence is generated by a JPhyloIO reader (see
also Fig. 2), where each box represents one event. Each has an ID in
order to be referenced by subsequent events, as exemplarily shown
by the OTU list and OTU start events, which are referenced by the
related alignment and sequence start events
Stöver et al. BMC Bioinformatics (2019) 20:402 Page 7 of 15
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
separate text files containing a set of Newick strings sep-
arated by semicolons, which are sometimes referred to
as Newick files and are, e.g., used by MEGA [48]. Such
Newick filesaremodelledasaseparateformatin
JPhyloIO that can be read and written. Newick tree
definitions (in Newick and NEXUS files) may contain
metadata in hot comments, which can also be read
and written. (See below.)
The readers for both NEXUS and Newick can also read
definitions of phylogenetic networks in the Extended
Newick or eNewick format [49] and model its crosslink
type (if specified) as metadata.
NeXML [39] is a more recent XML format that is
inspired by NEXUS but allows a more advanced way
of linking different phylogenetic data elements (e.g., a
tree node to an OTU). Additionally, it uses RDFa to
attach metadata to all elements (trees, alignments,
nodes, sequences, ), which provides the basis for
the general metadata model used in JPhyloIO.(See
below.) Readers and writers supporting all features of
the format, including its full metadata concept and
automated handling of custom sequence tokens, are
provided by our library.
phyloXML [40] also models complex metadata using a
different concept than NeXML.Itisusedtostore
phylogenetic trees and is fully supported by JPhyloIO.
Although phyloXML uses a hierarchical tree represen-
tation, it allows to specify additional clade relation
tags to define phylogenetic networks that are used by
JPhyloIOs reader and writer.
In addition, readers for some application specific for-
mats are available. For the MEGA format [48], a reader
provides access to its alignment data and character sets
(attached by the LABEL,GENE or DOMAIN commands
of the MEGA format). Multiple sequence alignments and
attached metadata from PDE files produced by the align-
ment editor PhyDE [50] and trees, including their meta-
data, from XTG files used by the phylogenetic tree
editor TreeGraph 2 [51] can be read as well.
Supported metadata models
Whereas the metadata representation in NeXML is by
definition identical to JPhyloIOsRDF-based metadata
model (as described in Implementationabove), reading
and writing of other formats requires a translation to the
respective format-specific model. FASTA and PHYLIP do
Fig. 5 UML diagram showing the data adapter interfaces providing access to the application model for JPhyloIO writers. From top to bottom the object relation
(indicated by compositions) is shown, while the class hierarchy can be read from bottom to top. Note that not all but only exemplary methods are shown for
each interface. The DocumentDataAdapter is the main adapter that provides access to other adapters modelling OTU lists, matrices and phylogenetic trees
or networks. Not all application models will provide all these datatypes and therefore not need to implement all types of adapters. The format specificwriter
classes in JPhyloIO can access the data either by event getter methods (e.g., MatrixDataAdapter.getSequenceStartEvent()) with an event ID as
its parameter or by writeXXX() methods (e.g., MatrixDataAdapter.writeSequencePart-ContentData()), which write a whole subsequence
of the event stream to a special receiver object provided by the application. To simplify the adapter implementation for application developers only frequently
used events are provided by getter methods, while the others can directly be written in a sequence by implementing an appropriate writer method. (Getter
methods were introduced for cases where random access to events with known IDs is frequently necessary for writers, to avoid requesting a whole sequence, if
only one event is needed. Providing some events by getter and some by writer methods in the data adapter model is a compromise between ease of
implementation and runtime performance.). Some adapters share common functionality, which is modelled by comm on super in ter fa ces , such as
AnnotatedDataAdapter or ElementDataAdapter
Stöver et al. BMC Bioinformatics (2019) 20:402 Page 8 of 15
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
not support metadata, so the respective writers ignore
provided attachments and log warnings.
phyloXML does not use an RDF-like concept, but of-
fers a fixed set of metadata elements, stored in special
XML tags. To access such metadata in JPhyloIO, we de-
fined RDF predicates for each predefined metadata elem-
ent for internal use in JPhyloIO, to allow identifying the
phyloXML tags in our RDF-based model. In addition,
phyloXML offers ways to freely attach metadata by (i)
property tags to attach simple annotations (e.g.,
strings, numeric values or URIs) to trees, clades or se-
quences and (ii) custom XML structures added to a
whole document, a tree, a clade or some of the prede-
fined annotation tags. JPhyloIO makes use of all these
features to attach metadata not linked using phyloXML-
specific predicates. In combination, this allows to read
and write all modelled metadata. Since representing cus-
tom hierarchical RDF metadata (different from the pre-
defined phyloXML annotation types) is not possible in
this format, parts of it will be ignored during writing
and respective warnings (similar to FASTA and PHYLIP)
will be logged. Different strategies on how to translate a
full RDFa annotation tree into phyloXML are offered by
JPhyloIO and can be selected using a writer parameter.
For attaching metadata to nodes and branches in New-
ick strings [37], two extensions that make use of hot
comments (comments that contain actual metadata) are
supported by JPhyloIO. One is New Hampshire eX-
tendedor NHX [52,53], a precursor of phyloXML that
allows to use a limited set of its predefined annotations,
identified by the respective phyloXML predicates in
The other extension, used by, e.g., TreeAnnotator from
the BEAST package [54] and recent versions of MrBayes
[47], allows to attach numeric or string values (or arrays
of these) to nodes and branches using a free string iden-
tifier. These identifiers differ from the RDF predicates
(used in NeXML), since they can have any form and do
not need to be URIs. To solve this, all meta-events in
JPhyloIO can carry a string identifier and an RDF predi-
cate as alternative descriptions of their relation to their
subject. If a string representation is needed for writing
and was not provided, the local part of the predicate
CURIE will be used.
By supporting these two annotation concepts, JPhy-
loIO allows to read and write metadata from and to
Newick and NEXUS files. As in phyloXML, hierarchical
metadata cannot be written and warnings will be logged.
JPhyloIO also reads metadata from the application-
specific XTG and PDE formats. Both formats may con-
tain a fixed set of metadata for some of their elements
and according predicates in namespaces for internal use
are defined to identify these (the same way as for
phyloXML). The XTG format and TreeGraph 2 [51]
additionally provide the functionality to attach numeric
or string annotations to each node or branch of a tree
using any string identifier, which are also supported.
Basic annotations present in the MEGA format (e.g., a
description text for a matrix) are read as well.
Code example
Figure 6provides example code for writing simple and
nested metadata attached to one node and one branch
of a tree and shows the output in three different formats
that result from it. In addition, the documentation on
the JPhyloIO website contains further code examples
that are documented in detail and can be downloaded to
test and run them. These are available at http://r.bioinf-
Ways to get involved
Feedback and contributions of the community to the
project are made possible by the public bug tracking sys-
tem on the JPhyloIO website, which is also open for fea-
ture requests. In addition users can ask questions in the
JPhyloIO ResearchGate project, using the support e-mail
address on the website or via the bioinfweb Twitter
account. Code contributions are possible within the
JPhyloIO GitHub repository. (URLs can be found below
and on the JPhyloIO website.)
Classic formats like FASTA,PHYLIP or NEXUS still play
an important role when working with sequences, align-
ments and phylogenetic trees, mainly because widely
used applications often solely rely on these formats
(until now). More modern formats with advanced meta-
data models like NeXML or phyloXML allow to unam-
biguously describe and link data and therefore increase
its reusability and the reproducibility of underlying stud-
ies [13,32,55] in a way that would never be possible
using only classical format to represent data. JPhyloIO
generalizes over these different types of file formats
while still supporting their individual feature sets. There-
fore, it allows bioinformaticians to efficiently develop ap-
plications that address both the need for interoperability
with widely used software by supporting classical for-
mats and the need for better metadata annotation using
modern formats at the same time. This can be achieved
by just implementing one interface without having to
invest additional resources into each supported format.
Developers do not even require detailed knowledge
about all formats their applications support, since JPhy-
loIO takes care of which format features to use to
optimally represent the content to be written.
Extending existing applications with support for
additional, in particular for modern metadata-rich for-
mats, is also simplified by JPhyloIO. It integrates well into
Stöver et al. BMC Bioinformatics (2019) 20:402 Page 9 of 15
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Fig. 6 (See legend on next page.)
Stöver et al. BMC Bioinformatics (2019) 20:402 Page 10 of 15
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
any existing application data model due to its event-
based architecture, which does not impose any require-
ments on the way data is modelled and stored by an
Comparison with other libraries
Other libraries exist for the Java programming language
that support reading or writing of alignment or tree for-
mats. Forester [56] allows to read and write alignments
in FASTA ,PHYLIP and NEXUS and phylogenetic trees
in phyloXML,NEXUS,andNHX. Phylogenies in the
Tree Of Life Response Format [57] can be read. The
NeXML format with its powerful metadata model is not
supported and no generalization over the different meta-
data models exists. The tree readers implement a com-
mon interface, but there is no such interface for reading
or writing trees and alignments together. As a conse-
quence, NEXUS files containing sequence and tree data
need to be processed multiple times independently.
Unlike JPhyloIO,Forester enforces its own predefined
data model, which can have disadvantages for certain
use cases as discussed below. The NEXUS TAXA block is
only supported when writing trees but not considered
for reading trees or for reading and writing alignments,
while NEXUS sets are not modelled at all.
In its current version 4.2.7, BioJava [58] includes only
readers and writers for sequence data from the FASTA
and the GenBank format. The BioJava legacy version
1.9.2 [59] provides an event/call-back based API through
a common interface for some sequence formats, among
them the alignment formats FASTA and MSF, but none
of the other formats supported by JPhyloIO. Independent
readers and writers for PHYLIP and NEXUS (including
support for trees but not for sets) are available, which
cannot be accessed through the event-based API.
There is no support for NeXML,phyloXML or complex
NeXML can of course also be read and written using
its reference Java API [60] implemented together with
the publication of the format [39] but this library is not
intended to support other formats and abstract over
their features.
In other languages, multiple format-specific APIs are
available (e.g., [61,62] and many unpublished ones),
some of which also generalize over different formats
(e.g., [6365]).
BIO::Phylo [66]isaPerl library that supports a num-
ber of alignment and tree formats, among them 6 of the
9 formats supported by JPhyloIO. Reading and writing is
possible through a common interface but a predefined
data structure is enforced. Metadata connected using
RDF predicates is modelled. phyloXML-specific predi-
cates are used in a similar way as in JPhyloIO, while the
set of supported elements is less complete (property
and clade_relation tags are not, and legal custom
tags are only partly supported). BIO::Phylo is able to read
(but not write) some types of hot comment tree annota-
tions from NEXUS, while JPhyloIO supports to read and
write a larger set of these.
NCL for C++ [64] supports FASTA ,Newick,NEXUS
and PHYLIP. Plans to support NeXML and phyloXML
were announced in 2010, but have not been imple-
mented as of this writing, and therefore complex meta-
data is not modeled. Hooks for the application to
directly process a whole alignment or a whole tree are
provided, but these data elements are much larger than in
JPhyloIO (where event objects only model, e.g., a short
sequence part or a single tree node) and processing of
large alignments or trees can be less efficient in NCL.
By making use of the Java Native Interface (JNI)itis
possible in principle to access the functionality provided
by JPhyloIO from nearly all other programming lan-
guages. For some languages, special packages are avail-
able to make this more convenient, e.g., Py4J [67] for
Python or rJava [68] for R. It should be noted that
making API calls this way is usually more intricate than
working with JPhyloIO in Java directly.
Compared to the existing Java libraries and even
libraries in other languages, JPhyloIO supports a large
(See figure on previous page.)
Fig. 6 Example code for format-independent metadata writing. The code examples in the two boxes on top show how metadata can be written
in JPhyloIO. In the lower of the two boxes two support values are attached to a node as literal metadata elements using predicates from a fictional
ontology The used convenience method writeSimpleLiteralMetadata internally writes a
literal metadata start event followed by a respective content and end event as defined by the grammar node LiteralMeta in Fig. 3. The upper box
contains an example where literal metadata elements are nested within resource metadata elements. In the concrete example respective predicates
for phyloXML are used to write an NCBI taxonomy ID. The metadata trees attached to the node (in purple) and the branch (in red) are shown on top of
the figure. All predicates linking metadata are shown in green, while the actual metadata values are shown in blue. Below the resulting output is
shown for three formats. The first box contains NeXML,whichusesitsmeta tags to represent the metadata and the linking predicates. The second
contains phyloXML that uses its specific taxonomy-related elements to represent the node metadata and its property tags to model the branch
metadata. (In contrast to NeXML, using fully qualified predicates with namespace declarations is not supported.) The box on the bottom contains the
NEXUS output, where only the metadata from the lowest level is represented using hot comments within the Newick string. (Using fully qualified
predicates is not possible here either). The full source code and output of this example can be found at
JPhyloIODemoSimpleMetadata. Additional examples for processing data and metadata are also available on the JPhyloIO website
Stöver et al. BMC Bioinformatics (2019) 20:402 Page 11 of 15
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
number of formats with a more complete coverage of
their feature sets. It does that through a single common
interface, while allowing memory efficient event-based
processing independent of the applications data struc-
ture. JPhyloIOs generalization over different metadata
models, which allows full access to such data from all
formats, is currently not offered by any other Java
library. (As mentioned above, BIO::Phylo allows access
to a comparable range of formats in Perl but is not
Event-based processing versus predefined library data
With an event-based architecture as implemented in
JPhyloIO, application developers can decide for each
event how long it should be kept in memory or not.
Libraries with predefined data structures load all data
from a file into memory at the same time, regardless of
the application requirements. This is especially in-
efficient for use cases that do not need random access to
all data (e.g., determining the GC-content of large
sequence data sets, searching for certain repeat motives
in them or counting the occurrences of a certain node in
a large set of trees, e.g., taken as samples from Bayesian
phylogenetic inference). Event-based processing reduces
the amount of memory needed in such cases from O(n)
(linear to the dataset size, e.g., the number of nucleo-
tides) to O(1) (constant, independent of the dataset size),
since only the current or a few recent events need to be
in memory at once to perform such tasks.
For applications that need random access (e.g., align-
ment or tree editors), the event-based architecture is still
beneficial, because these are often not interested in the
whole content of a file (e.g., only in sequence data but
not trees) and therefore can directly discard unused
events, which they could not do when using library-
specific data structures. Even more relevant for complex
applications may be the flexibility regarding the data
structure. Providing concrete data storage classes with a
library, forces applications that need a more advanced or
specific model to load the data into instances of library
classes first and then copy it into their own specific data
structure. This way, the data of at least one file will be in
memory twice, which may become a problem for large
data sets. Such a problem does not occur with JPhyloIO,
since event data can directly be stored into any
application-specific data structure.
With this in mind, we acknowledge that predefined
model implementations may be beneficial for simpler
scripts and tools, because developers will not have to
deal with implementing their own data structure. Fur-
thermore, creating an application reader for an event
stream will usually require a little more effort than sim-
ply fetching information from a predefined library data
structure, since the reader will need to keep track of the
current state, e.g., the current alignment and sequence,
in order to process a sequence tokens event. However,
this additional effort will not be significant for the ma-
jority of more complex applications that benefit from
the memory efficiency and flexibility of the data struc-
ture and are the main target for JPhyloIO. It is also easy
to combine JPhyloIO with established model standards
like the sequence model of BioJava, while it still allows
to access data not modelled by such third-party libraries.
Current usage
JPhyloIO was developed closely together with LibrAlign
[69], a Java library providing powerful and reusable GUI
components for displaying and editing multiple se-
quence alignments and attached raw- and metadata. To
back its GUI components, LibrAlign provides a fully
implemented data model for sequence and alignment
data including ready-to-use reader and writer implemen-
tations for JPhyloIO.
The Taxonomic Editor of the EDIT platform for Cyber-
taxonomy [70] manages taxonomic workflows and their
data, while persistently linking character data to pre-
served individual specimens [30]. AlignmentComparator
[71] compares alternative multiple sequence alignments
of the same dataset. Both make use of JPhyloIO and
LibrAlign for reading and writing alignments and at-
tached metadata. LibrAlign and JPhyloIO also provide
the basis for the alignment editor PhyDE 2 [72] and they
are currently used by our group in the development of
tools for the evaluation of automated multiple sequence
alignments for phylogenetic purposes and the analysis of
The tree-related functionality of JPhyloIO is the basis
in the ongoing metadata model extension in the phylo-
genetic tree editor TreeGraph 2 [51]. Versions 2.11.0
and later already use JPhyloIO for importing phylogen-
etic trees and their metadata from NeXML. Future
versions will adopt the RDF-based metadata model into
the core data model of the application and simplify
meaningful annotation of phylogenetic trees and their
nodes and branches using metadata linked with predi-
cates from externally defined ontologies. The generalized
metadata model of JPhyloIO simplifies importing and
exporting metadata for future versions of TreeGraph 2
Future development
JPhyloIO will remain under active development in the
future and community contributions are easily possible
using GitHub and other platforms, as mentioned above.
According to the needs of depending software, the li-
brary will be adjusted to future changes of the supported
formats and be extended to support additional formats.
Stöver et al. BMC Bioinformatics (2019) 20:402 Page 12 of 15
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
API stability is a key aspect and releases follow the
established standard of semantic versioning [73].
As mentioned, the NOTES block of the NEXUS format
is currently not supported by JPhyloIO, although it is
used by some applications to store metadata related to
data from other NEXUS blocks. While the future of
metadata representation probably lies in the usage of
more advanced formats like NeXML or phyloXML,
having downwards compatibility to the NOTES block
might still be beneficial. In contrast the metadata models
of the other supported formats (including the hot
comments in NEXUS), the NOTES block represents a set
of metadata elements that usually trail the actual data as
a whole and can reference a variety of the previous data
elements. In order to model this in JPhyloIO, the event
grammar would have to be extended to allow metadata
events related to a data event at any later position in the
stream instead of requiring it to be nested within the
respective data events. This would have made developing
application readers for the event stream more compli-
cated. The only option to avoid this is to buffer all data
until the NOTES block is read, which would destroy the
memory efficiency of event-based processing. Based on
the requirements of the applications currently based on
JPhyloIO, we preferred a more concise event grammar
and memory efficiency over supporting the NEXUS
NOTES block. Should the future usage of JPhyloIO
impose different requirements we would consider to
change that strategy and possibly extend the event
grammar, ideally in a downwards compatible way.
In addition to the current abstraction over different for-
mats, the abstraction over (future) metadata ontologies
relevant for phylogenetics (e.g., possible in NeXML)can
become a focus. If a critical number of established ontol-
ogies will be present, it may be interesting to extend JPhy-
loIO to model equivalent or similar predicates in different
ontologies to allow translating between them and to ac-
cess knowledge in a general way.
The field of phylogenetics as well as biological sciences
as a whole would strongly benefit from a more wide-
spread use of data annotation and respective formats.
Unambiguously describing and processing morpho-
logical characters and states, documenting voucher
information in collections, linking raw data or providing
information on the workflow that generated the data are
some of many examples where metadata annotation
(e.g., using RDF) and externally defined ontologies can
lead to increased reproducibility of workflows and
reusability of data. JPhyloIO simplifies writing new and
extending existing software that is aimed at achieving this
goal by fully supporting metadata-rich formats. Maximum
interoperability to older software and downwards
compatibility is guaranteed by the parallel support for
both advanced and more traditional formats, enabled
by the single, format-independent interface of JPhy-
loIO. Developers may support all formats in one step
without the need for detailed knowledge on all of
them. JPhyloIOs event-based architecture makes inte-
gration with any existing application data structure
easy and allows very memory-efficient processing even
of very large data sets.
Availability and requirements
Project name: JPhyloIO.
Project home page:
GitHub Repository:
ResearchGate project page:
Operating system(s): Platform independent.
Programming language: Java.
Other requirements: Java Runtime Environment 8
(or higher).
License: GNU Lesser General Public License Version 3
Any restrictions to use by non-academics: The re-
strictions specified in the LGPL apply. (See http://bioinf
API: Application programming interface; CURIE: Compact Uniform Resource
Identifier; DFG: Deutsche Forschungsgemeinschaft (German Research
Fundation); DNA: Deoxyribonucleic acid; FASTA: Fast Adaptive Shrinkage
Thresholding Algorithm (This abbreviation is used here for FASTA alignment
format.); GNU: Gnus Not Unix (Recursive acronym. Uses a wildebeest
(Connochaetes) as its icon, which is GnuinGerman.);GUI:Graphical
User Interface; I/O: Input/Output; ID: Identifier; LGPL: Lesser General
Public License; NCBI: National Center for Biotechnology Information;
OTU: Operational Taxonomic Unit; OWL: Web Ontology Language (The
letter reversal is intended by the authors.); PhyDE: Phylogenetic Data Editor;
RDF: Resource Description Framework; UML: Unified Modeling Language;
XML: Extensible Markup Language; XTG: Extensible TreeGraph format
We are grateful to the anonymous reviewers for their helpful comments. The
developers and users of the EDIT platform for Cybertaxonomy at the Berlin
Botanical Garden and Botanical Museum tested the integration of JPhyloIO
with the Tax o no m ic Ed i tor , which is highly appreciated. We thank the contributors
to the open source projects used by JPhyloIO (Apache commons,OWL API,JUnit,
Parts of this work came from chapter 2 of the PhD thesis of BCS with the
title Software Components for increased Data Reuse and Reproducibility in
Phylogenetics and Phylogenomicsthat is available at http://nbn-resolving.
de/urn:nbn:de:hbz:6-96159516963. Contributions also came from the master
thesis of SW with the title Development and implementation of software
components increasing the accessibility of phylogenetic metadatathat is
available at An earlier
version of JPhyloIO was presented in a poster by BCS, SW and KFM with the
title JPhyloIO - A Java library for event-based reading and writing of
different alignment and tree formats through one common interfaceat the
European Conference on Computational Biology (ECCB); The Hague, The
Netherlands; 2016 that is available at
992.1 .JPhyloIO was also subject of a poster by SW, KFM and BCS with the title
Increasing data accessibility and reuse in phylogenetics by employing exter-
nally defined ontologiesthat was presented at the 6th annual Symposium of
Stöver et al. BMC Bioinformatics (2019) 20:402 Page 13 of 15
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
the Münster Graduate School of Evolution; Münster, Germany; 2017 and is avail-
able at
BCS conceived the concept, implemented the software and wrote the
manuscript; SW contributed to the implementation the software; KFM
and SW contributed to the manuscript. All authors gave final approval
for publication.
Funded in part by grant MU 2875/31 to Kai Müller by the German research
foundation (DFG). The authors acknowledge support from the Open Access
Publication Fund of the University of Münster. The funding bodies did not
influence the design of the study and collection, analysis, and interpretation
of data or writing the manuscript.
Availability of data and materials
Binary distributions of JPhyloIO are available at
JPhyloIO/Download. Source codes are available at
JPhyloIO/SourceCode or at the GitHub mirror at
bioinfweb/JPhyloIO. Unit tests and test data that was used in the
development of JPhyloIO is available at
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Received: 17 May 2019 Accepted: 2 July 2019
1. Kelling S, Hochachka WM, Fink D, Riedewald M, Caruana R, Ballard G, et al.
Data-intensive science: a new paradigm for biodiversity studies. BioScience.
2. Michener WK, Jones MB. Ecoinformatics: supporting ecology as a data-intensive
science. Trends Ecol Evol. 2012;27:8593.
3. McKain MR, Johnson MG, Uribe-Convers S, Eaton D, Yang Y. Practical
considerations for plant phylogenomics. Appl Plant Sci. 2018;6:e1038.
4. Geiger MF, Astrin JJ, Borsch T, Burkhardt U, Grobe P, Hand R, et al. How to
tackle the molecular species inventory for an industrialized nation - lessons
from the first phase of the German barcode of life initiative GBOL (2012
2015). Genome. 2016;59:66170.
5. Ratnasingham S, Hebert PDN. Bold: the barcode of life data system (http:// Mol Ecol Notes 2007;7:355364.
6. KerrKC,StoeckleMY,DoveCJ,WeigtLA,FrancisCM,HebertPD.Comprehensive
DNA barcode coverage of north American birds. Mol Ecol Resour. 2007;7:53543.
7. Global Plants on JSTOR. Accessed 18 Jun 2019.
8. Smith V, Blagoderov V. Bringing collections out of the dark. ZooKeys.
9. Dickinson JL, Zuckerberg B, Bonter DN. Citizen science as an ecological
research tool: challenges and benefits. Annu Rev Ecol Evol Syst. 2010;41:
10. Hochachka WM, Fink D, Hutchinson RA, Sheldon D, Wong W-K, Kelling S.
Data-intensive science applied to broad-scale citizen science. Trends Ecol
Evol. 2012;27:1307.
11. Joly A, Goëau H, Bonnet P, BakićV, Barbe J, Selmi S, et al. Interactive plant
identification based on social image data. Ecol Inform. 2014;23:2234.
12. Schröter M, Kraemer R, Mantel M, Kabisch N, Hecker S, Richter A, et al.
Citizen science for assessing ecosystem services: status, challenges and
opportunities. Ecosyst Serv. 2017;28:8094.
13. Stoltzfus A, OMeara B, Whitacre J, Mounce R, Gillespie EL, Kumar S, et al.
Sharing and re-use of phylogenetic trees (and associated data) to facilitate
synthesis. BMC Res Notes. 2012;5:574.
14. Rambaut A. TreeThief: a tool for manual phylogenetic tree entry from
scanned images. Department of Zoology, University of Oxford. Oxford.
Available from: http://microbe. bio. indiana. edu; 1999
15. Hughes J. TreeRipper web application: towards a fully automated optical
tree recognition software. BMC Bioinformatics. 2011;12:178.
16. Laubach T, von Haeseler A, Lercher MJ. TreeSnatcher plus: capturing
phylogenetic trees from images. BMC Bioinformatics. 2012;13:110.
17. Murray-Rust P, Smith-Unna R, Mounce R. AMI-diagram: mining facts from
images. -Lib Mag. 2014;20 11/12. doi:
murray-rust .
18. Cranston K, Harmon LJ, OLeary MA, Lisle C. Best practices for data sharing
in phylogenetic research. PLOS Curr Tree Life. 2014.
currents.tol.bf01eff4a6b60ca4825c69293dc59645 .
19. Poisot TE, Mounce R, Gravel D. Moving toward a sustainable ecological
science: dont let data go to waste! Ideas Ecol Evol. 2013;6.
20. Parr CS, Guralnick R, Cellinese N, Page RDM. Evolutionary informatics:
unifying knowledge about the diversity of life. Trends Ecol Evol. 2012;
21. Kenall A, Harold S, Foote C. An open future for ecological and evolutionary
data? BMC Evol Biol. 2014;14:66.
22. Piel WH, Chan L, Dominus MJ, Ruan J, Vos RA, Tannen V. TreeBASE v. 2: a
database of phylogenetic knowledge. In: e-BioSphere 2009. London; 2009. Accessed 28 Feb 2017
23. National Evolutionary Synthesis Center, UNC-CH Metadata Research Center,
Oxford University, The British Library. Dryad. Dryad.
Accessed 18 Jun 2019.
24. Stoltzfus A, Lapp H, Matasci N, Deus H, Sidlauskas B, Zmasek CM, et al.
Phylotastic! Making tree-of-life knowledge accessible, reusable and
convenient. BMC Bioinformatics. 2013;14:158.
25. Hinchliff CE, Smith SA, Allman JF, Burleigh JG, Chaudhary R, Coghill LM, et
al. Synthesis of phylogeny and taxonomy into a comprehensive tree of life.
Proc Natl Acad Sci. 2015;112:127649.
26. Kumar S, Stecher G, Suleski M, Hedges SB. TimeTree: a resource for timelines,
Timetrees, and divergence times. Mol Biol Evol. 2017;34:18129.
27. White EP, Baldridge E, Brym ZT, Locey KJ, McGlinn DJ, Supp SR. Nine simple
ways to make it easier to (re) use your data. Ideas Ecol Evol. 2013;6. https:// .
28. Whitlock MC. Data archiving in ecology and evolution: best practices.
Trends Ecol Evol. 2011;26:615.
29. Federhen S. The NCBI taxonomy database. Nucleic Acids Res. 2012;40:D13643.
30. Kilian N, Henning T, Plitzner P, Müller A, Güntsch A, Stöver BC, et al. Sample
data processing in an additive and reproducible taxonomic workflow by
using character data persistently linked to preserved individual specimens.
Database. 2015;2015:bav094.
31. Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten simple rules for reproducible
computational research. PLoS Comput Biol. 2013;9:e1003285.
32. Leebens-Mack J, Vision T, Brenner E, Bowers JE, Cannon S, Clement MJ, et al.
Taking the first steps towards a standard for reporting on phylogenies:
minimum information about a phylogenetic analysis (MIAPA). OMICS J
Integr Biol. 2006;10:2317.
33. Hang ST, Tatsuma A, Aono M. Bluefield (kde tut) at lifeclef 2016 plant
identification task. In: Working Notes of CLEF 2016 conference; 2016. Accessed 14 Sep 2016.
34. Wilf P, Zhang S, Chikkerur S, Little SA, Wing SL, Serre T. Computer vision
cracks the leaf code. Proc Natl Acad Sci. 2016;113:330510.
35. BarréP,StöverBC,MüllerKF,Steinhage V. LeafNet: a computer vision
system for automatic plant species identification. Ecol Inform. 2017;
36. Felsenstein J. Phylip input files. 2013. http://evolution.genetics.washington.
edu/phylip/doc/main.html#inputfiles. Accessed 18 Jun 2019.
37. Felsenstein J. The Newick tree format. 1993. http://evolution.genetics. Accessed 18 Jun 2019.
38. Maddison DR, Swofford D, Maddison WP. NEXUS: an extensible file format
for systematic information. Syst Biol. 1997;46:590621.
39. Vos RA, Balhoff JP, Caravas JA, Holder MT, Lapp H, Maddison WP, et al.
NeXML: rich, extensible, and verifiable representation of comparative data
and metadata. Syst Biol. 2012;61:67589.
40. Han MV, Zmasek CM. phyloXML: XML for evolutionary biology and
comparative genomics. BMC Bioinformatics. 2009;10:356.
41. RDF - Semantic Web Standards.
Jun 2019.
42. Wang X, Gorlitsky R, Almeida JS. From XML to RDF: how semantic web
technologies will change the design of omicstandards. Nat Biotechnol.
Stöver et al. BMC Bioinformatics (2019) 20:402 Page 14 of 15
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
43. Gamma E, Helm R, Johnson RE, Vlissides J. Design patter ns. Elements of
reusable object-oriented software. 1st ed. Reprint. Reading, Mass:
Prentice Hall; 1994.
44. Miller M, Schwartz T, Pfeiffer W. Relaxed PHYLIP format. Relaxed PHYLIP
format documentation at CIPRES. 2016.
help/relaxed_phylip. Accessed 18 Jun 2019.
45. Maddison WP, Maddison DR. Mesquite: a modular system for evolutionary
analysis. 2016. Accessed 18 Jun 2019.
46. Ronquist F, Teslenko M, van der Mark P, Larget B, Donald S, Huelsenbeck J.
The MrBayes Input Format. 2016.
format.html. Accessed 18 Jun 2019.
47. Ronquist F, Teslenko M, van der MP, Ayres DL, Darling A, Höhna S, et al.
MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice
across a large model space. Syst Biol. 2012;61:53942.
48. Kumar S, Stecher G, Tamura K. MEGA7: molecular evolutionary genetics
analysis version 7.0 for bigger datasets. Mol Biol Evol. 2016:msw054.
49. Cardona G, Rosselló F, Valiente G. Extended Newick: it is time for a
standard representation of phylogenetic networks. BMC Bioinformatics.
50. Müller J, Müller K, Neinhuis C, Quandt D. PhyDE - Phylogenetic Data Editor.
2006. Accessed 18 Jun 2019.
51. Stöver BC, Müller KF. TreeGraph 2: combining and visualizing evidence from
different phylogenetic analyses. BMC Bioinformatics. 2010;11:7.
52. Zmasek CM, Eddy SR. ATV: display and manipulation of annotated phylogenetic
trees. Bioinformatics. 2001;17:3834.
53. Zmasek CM. NHX - New Hampshire eXtended, version 2.0. 2014. https:// Acces sed
18 Jun 2019.
54. Bouckaert R, Heled J, Kühnert D, Vaughan T, Wu C-H, Xie D, et al. BEAST 2: a
software platform for Bayesian evolutionary analysis. PLoS Comput Biol.
55. Vogt L. eScience and the need for data standards in the life sciences: in
pursuit of objectivity rather than truth. Syst Biodivers. 2013;11:25770.
56. Zmasek CM. forester: Software libraries for evolutionary biology and
comparative genomics research. 2015.
cmzmasek/home/software/forester. Accessed 18 Jun 2019.
57. Maddison DR, Schulz K-S, Maddison WP. The tree of life web project.
Zootaxa. 2007;1668 Linnaeus Tercentenary: Progress in Invertebrate
58. PrlićA,YatesA,BlivenSE,RosePW,JacobsenJ,TroshinPV,etal.
BioJava: an open-source framework for bioinformatics in 2012.
Bioinformatics. 2012;28:26935.
59. Holland RCG, Down TA, Pocock M, Prlic A, Huen D, James K, et al. BioJava:
an open-source framework for bioinformatics. Bioinformatics. 2008;24:20967.
60. Vos RA, Huang D, Midford PE. Balhoff J. Sukumaran J. - Java API
for NeXML. GitHub: Holder MT; 2016.
Accessed 18 Jun 2019
61. Hladish T, Gopalan V, Liang C, Qiu W, Yang P, Stoltzfus A. Bio::NEXUS: a Perl
API for the NEXUS format for comparative biological data. BMC Bioinformatics.
62. Boettiger C, Chamberlain S, Vos R, Lapp H. RNeXML: a package for reading
and writing richly annotated phylogenetic, character and trait data in r.
Methods Ecol Evol. 2016;7:3527.
63. Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, et
al. The Bioperl toolkit: Perl modules for the life sciences. Genome Res.
64. Lewis PO. NCL: a C++ class library for interpreting data files in NEXUS format.
Bioinformatics. 2003;19:23301.
65. Sukumaran J, Holder MT. DendroPy: a Python library for phylogenetic
computing. Bioinformatics. 2010;26:156971.
66. Vos RA, Caravas J, Hartmann K, Jensen MA, Miller C. BIO::Phylo-phyloinformatic
analysis using perl. BMC Bioinformatics. 2011;12:63.
67. Py4J - A Bridge between Python and Java.
Accessed 18 Jun 2019.
68. rJava - Low-level R to Java interface. Accessed
18 Jun 2019.
69. Stöver BC, Müller KF. LibrAlign: a GUI library for displaying and editing
multiple sequence alignments and attached data. 2016. http://bioinfweb.
info/LibrAlign/. Accessed 18 Jun 2019.
70. Berendsohn WG. Devising the EDIT platform for Cybertaxonomy. In: Tools
for identifying biodiversity: Progress and problems. Proceedings of the
international congress, Paris, September 2022, 2010. EUT Edizioni Università
di Trieste; 2010.
Accessed 18 Jun 2019.
71. Stöver BC, Müller KF. AlignmentComparator: an application to efficiently
visualize and annotate differences between alternative multiple
sequence alignments. 2014.
Accessed 18 Jun 2019.
72. Stöver BC, Bohn J, van Groen S, Kösters L, Quandt D, Müller KF. PhyDE 2 -
an alignment editor for phylogenetic purposes. PhyDE 2 - an alignment
editor for phylogenetic purposes. 2019.
Accessed 18 Jun 2019.
73. Preston-Werner T. Semantic Versioning 2.0.0. 2013.
Accessed 18 Jun 2019.
Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
Stöver et al. BMC Bioinformatics (2019) 20:402 Page 15 of 15
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Terms and Conditions
Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center GmbH (“Springer Nature”).
Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers and authorised users (“Users”), for small-
scale personal, non-commercial use provided that all copyright, trade and service marks and other proprietary notices are maintained. By
accessing, sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of use (“Terms”). For these
purposes, Springer Nature considers academic use (by researchers and students) to be non-commercial.
These Terms are supplementary and will apply in addition to any applicable website terms and conditions, a relevant site licence or a personal
subscription. These Terms will prevail over any conflict or ambiguity with regards to the relevant terms, a site licence or a personal subscription
(to the extent of the conflict or ambiguity only). For Creative Commons-licensed articles, the terms of the Creative Commons license used will
We collect and use personal data to provide access to the Springer Nature journal content. We may also use these personal data internally within
ResearchGate and Springer Nature and as agreed share it, in an anonymised way, for purposes of tracking, analysis and reporting. We will not
otherwise disclose your personal data outside the ResearchGate or the Springer Nature group of companies unless we have your permission as
detailed in the Privacy Policy.
While Users may use the Springer Nature journal content for small scale, personal non-commercial use, it is important to note that Users may
use such content for the purpose of providing other users with access on a regular or large scale basis or as a means to circumvent access
use such content where to do so would be considered a criminal or statutory offence in any jurisdiction, or gives rise to civil liability, or is
otherwise unlawful;
falsely or misleadingly imply or suggest endorsement, approval , sponsorship, or association unless explicitly agreed to by Springer Nature in
use bots or other automated methods to access the content or redirect messages
override any security feature or exclusionary protocol; or
share the content in order to create substitute for Springer Nature products or services or a systematic database of Springer Nature journal
In line with the restriction against commercial use, Springer Nature does not permit the creation of a product or service that creates revenue,
royalties, rent or income from our content or its inclusion as part of a paid for service or for other commercial gain. Springer Nature journal
content cannot be used for inter-library loans and librarians may not upload Springer Nature journal content on a large scale into their, or any
other, institutional repository.
These terms of use are reviewed regularly and may be amended at any time. Springer Nature is not obligated to publish any information or
content on this website and may remove it or features or functionality at our sole discretion, at any time with or without notice. Springer Nature
may revoke this licence to you at any time and remove access to any copies of the Springer Nature journal content which have been saved.
To the fullest extent permitted by law, Springer Nature makes no warranties, representations or guarantees to Users, either express or implied
with respect to the Springer nature journal content and all parties disclaim and waive any implied warranties or warranties imposed by law,
including merchantability or fitness for any particular purpose.
Please note that these rights do not automatically extend to content, data or other material published by Springer Nature that may be licensed
from third parties.
If you would like to use or distribute our Springer Nature journal content to a wider audience or on a regular basis or in any other manner not
expressly permitted by these Terms, please contact Springer Nature at
Full-text available
Phylogenetic trees are an important analytical tool for evaluating community diversity and evolutionary history. In the case of microorganisms, the decreasing cost of sequencing has enabled researchers to generate ever-larger sequence datasets, which in turn have begun to fill gaps in the evolutionary history of microbial groups. However, phylogenetic analyses of these types of datasets create complex trees that can be challenging to interpret. Scientific inferences made by visual inspection of phylogenetic trees can be simplified and enhanced by customizing various parts of the tree. Yet, manual customization is time-consuming and error prone, and programs designed to assist in batch tree customization often require programming experience or complicated file formats for annotation. Iroki, a user-friendly web interface for tree visualization, addresses these issues by providing automatic customization of large trees based on metadata contained in tab-separated text files. Iroki's utility for exploring biological and ecological trends in sequencing data was demonstrated through a variety of microbial ecology applications in which trees with hundreds to thousands of leaf nodes were customized according to extensive collections of metadata. The Iroki web application and documentation are available at or through the VIROME portal Iroki's source code is released under the MIT license and is available at
Full-text available
We present the latest version of the Molecular Evolutionary Genetics Analysis (MEGA) software, which contains many sophisticated methods and tools for phylogenomics and phylomedicine. In this major upgrade, MEGA has been optimized for use on 64-bit computing systems for analyzing bigger datasets. Researchers can now explore and analyze tens of thousands of sequences in MEGA. The new version also provides an advanced wizard for building timetrees and includes a new functionality to automatically predict gene duplication events in gene family trees. The 64-bit MEGA is made available in two interfaces: graphical and command line. The graphical user interface (GUI) is a native Microsoft Windows application that can also be used on Mac OSX. The command line MEGA is available as native applications for Windows, Linux, and Mac OSX. They are intended for use in high-throughput and scripted analysis. Both versions are available from free of charge.
Full-text available
The past decade has seen a major breakthrough in our ability to easily and inexpensively sequence genome‐scale data from diverse lineages. The development of high‐throughput sequencing and long‐read technologies has ushered in the era of phylogenomics, where hundreds to thousands of nuclear genes and whole organellar genomes are routinely used to reconstruct evolutionary relationships. As a result, understanding which options are best suited for a particular set of questions can be difficult, especially for those just starting in the field. Here, we review the most recent advances in plant phylogenomic methods and make recommendations for project‐dependent best practices and considerations. We focus on the costs and benefits of different approaches in regard to the information they provide researchers and the questions they can address. We also highlight unique challenges and opportunities in plant systems, such as polyploidy, reticulate evolution, and the use of herbarium materials, identifying optimal methodologies for each. Finally, we draw attention to lingering challenges in the field of plant phylogenomics, such as reusability of data sets, and look at some up‐and‐coming technologies that may help propel the field even further.
Full-text available
Citizen science approaches provide opportunities to support ecosystem service assessments. To evaluate the recent trends, challenges and opportunities of utilizing citizen science in ecosystem service studies we conducted a systematic literature and project review. We reviewed the range of ecosystem services and formats of participation in citizen science in 17 peer-reviewed scientific publications and 102 ongoing or finished citizen science projects, out of over 500 screened publications and over 1400 screened projects. We found that citizen science is predominantly applied in assessing regulating and cultural services. The assessments were often performed by using proxy indicators that only implicitly provide information on ecosystem services. Direct assessments of ecosystem services are still rare. Participation formats mostly comprise contributory citizen science projects that focus on volunteered data collection. However, there is potential to increase citizen involvement in comprehensive ecosystem service assessments, including the development of research questions, design, data analysis and dissemination of findings. Levels of involvement could be enhanced to strengthen strategic knowledge on the environment, scientific literacy and the empowerment of citizens in helping to inform and monitor policies and management efforts related to ecosystem services. We provide an outlook how to better operationalise citizen science approaches to assess ecosystem services.
Full-text available
Several applications currently developed in our group and by our cooperators deal with multiple sequence alignments (MSA) or associated raw and meta data, and allow the user to view and edit it in a graphical user interface (GUI). Instead of implementing independent solutions for these different tasks, we decided to create a library containing powerful and reusable common GUI components. Since this library is open source (GNU GPL 3) it can be used and extended by other researchers, who are then able to focus on the core functionality of their applications, but can still provide a user-friendly GUI. Besides components allowing the displaying and editing of MSAs, several types of data (e.g. trace files, comments, statistical sequence information, positions of tandem repeats, hairpins or inversions) can be attached either to single sequences or to the alignment as a whole. All these data views implement a common interface that makes it easy for developers to create new custom views. Several components from our library (e.g. displaying different types of data) can be connected to each other in the application they are embedded in, so that the user can scroll through one of them while all others will automatically display the data associated with the current position. LibrAlign is fully interoperable with the BioJava API and all components are provided in a native Swing and a native SWT version (the two major GUI frameworks for Java), so that they can be integrated into any Java GUI application, including projects based on the Eclipse Rich Client Platform or Bioclipse. Several software projects based on LibrAlign are currently in development in- and outside our group. Among those are (i) the Taxonomic Editor of the EDIT platform which is extended to support sequence and alignment associated data for the Campanula portal of EDIT, (ii) a new version of the alignment editor PhyDE, (iii) AlignmentComparator (an application to visualize differences between alternative automatic and manual alignments, which we currently use in study investigating the influence of manual alignment corrections on phylogenetic studies), and (iv) HIR-Finder (an application which locates microstructural mutations like tandem repeats possibly associated with hairpins). LibrAlign download:
Full-text available
With a growing number of alternative algorithms for automated multiple sequence alignment (MSA) and different strategies for manual alignment corrections it becomes more and more relevant to visualize the differences between alternative MSAs of the same data set. This allows the researcher to decide which alternative alignments to take into account as the bases of e.g. a phylogenetic study or numerous other tasks in biological research. Furthermore, manual alignment corrections can be visualized or a bioinformatician can determine the effect of changes made to a MSA algorithm, which both is very relevant in our current research focusing on the improvement of MSA for phylogenetic purposes. Here we present AlignmentComparator, a platform independent open-source GUI application that reveals the differences and shifts in alternative MSAs by calculating and displaying a super alignment (similar to a profile-profile-alignment) that distinguishes between gaps previously contained in a single MSA and super gaps that originated from the comparison. This super alignment allows to directly identify matching regions in the alternative alignments, which otherwise would be very time-consuming with a conventional alignment editor, especially with an increasing column count and therefore a potentially increasing column shift. All differences can be commented and annotated with the AlignmentComparator GUI and saved to a XML file. Software and poster download:
Full-text available
Evolutionary information on species divergence times is fundamental to studies of biodiversity, development, and disease. Molecular dating has enhanced our understanding of the temporal patterns of species divergences over the last five decades, and the number of studies is increasing quickly due to an exponential growth in the available collection of molecular sequences from diverse species and large number of genes. Our TimeTree resource is a public knowledge-base with the primary focus to make available all species divergence times derived using molecular sequence data to scientists, educators, and the general public in a consistent and accessible format. Here, we report a major expansion of the TimeTree resource, which more than triples the number of species (>97,000) and more than triples the number of studies assembled (>3,000). Furthermore, scientists can access not only the divergence time between two species or higher taxa, but also a timetree of a group of species and a timeline that traces a species' evolution through time. The new timetree and timeline visualizations are integrated with display of events on earth and environmental history over geological time, which will lead to broader and better understanding of the interplay of the change in the biosphere with the diversity of species on Earth. The next generation TimeTree resource is publicly available online at
Full-text available
Understanding the extremely variable, complex shape and venation characters of angiosperm leaves is one of the most challenging problems in botany. Machine learning offers opportunities to analyze large numbers of specimens, to discover novel leaf features of angiosperm clades that may have phylogenetic significance, and to use those characters to classify unknowns. Previous computer vision approaches have primarily focused on leaf identification at the species level. It remains an open question whether learning and classification are possible among major evolutionary groups such as families and orders, which usually contain hundreds to thousands of species each and exhibit many times the foliar variation of individual species. Here, we tested whether a computer vision algorithm could use a database of 7,597 leaf images from 2,001 genera to learn features of botanical families and orders, then classify novel images. The images are of cleared leaves, specimens that are chemically bleached, then stained to reveal venation. Machine learning was used to learn a codebook of visual elements representing leaf shape and venation patterns. The resulting automated system learned to classify images into families and orders with a success rate many times greater than chance. Of direct botanical interest, the responses of diagnostic features can be visualized on leaf images as heat maps, which are likely to prompt recognition and evolutionary interpretation of a wealth of novel morphological characters. With assistance from computer vision, leaves are poised to make numerous new contributions to systematic and paleobotanical studies.
Aims Taxon identification is an important step in many plant ecological studies. Its efficiency and reproducibility might greatly benefit from partly automating this task. Image-based identification systems exist, but mostly rely on hand-crafted algorithms to extract sets of features chosen a priori to identify species of selected taxa. In consequence, such systems are restricted to these taxa and additionally require involving experts that provide taxonomical knowledge for developing such customized systems. The aim of this study was to develop a deep learning system to learn discriminative features from leaf images along with a classifier for species identification of plants. By comparing our results with customized systems like LeafSnap we can show that learning the features by a convolutional neural network (CNN) can provide better feature representation for leaf images compared to hand-crafted features. Methods We developed LeafNet, a CNN-based plant identification system. For evaluation, we utilized the publicly available LeafSnap, Flavia and Foliage datasets. Results Evaluating the recognition accuracies of LeafNet on the LeafSnap, Flavia and Foliage datasets reveals a better performance of LeafNet compared to hand-crafted customized systems. Conclusions Given the overall species diversity of plants, the goal of a complete automatisation of visual plant species identification is unlikely to be met solely by continually gathering assemblies of customized, specialized and hand-crafted (and therefore expensive) identification systems. Deep Learning CNN approaches offer a self-learning state-of-the-art alternative that allows adaption to different taxa just by presenting new training data instead of developing new software systems.
Biodiversity loss is mainly driven by human activity. While concern grows over the fate of hot spots of biodiversity, contemporary species losses still prevail in industrialized nations. Therefore, strategies were formulated to halt or reverse the loss, driven by evidence for its value for ecosystem services. Maintenance of the latter through conservation depends on correctly identified species. To this aim, the German Federal Ministry of Education and Research is funding the GBOL project, a consortium of natural history collections, botanic gardens, and universities working on a barcode reference database for the country's fauna and flora. Several noticeable findings could be useful for future campaigns: (i) validating taxon lists to serve as a taxonomic backbone is time-consuming, but without alternative; (ii) offering financial incentives to taxonomic experts, often citizen scientists, is indispensable; (iii) completion of the libraries for widespread species enables analyses of environmental samples, but the process may not hold pace with technological advancements; (iv) discoveries of new species are among the best stories for the media; (v) a commitment to common data standards and repositories is needed, as well as transboundary cooperation between nations; (vi) after validation, all data should be published online via the BOLD to make them searchable for external users and to allow cross-checking with data from other countries.