
JPhyloIO - A Java library for event-based reading and writing of different alignment and tree formats through one common interface

Abstract

Today a variety of alignment and tree file formats exist. Some are well established but limited in their data model, while others, proposed more recently, offer advanced, future-oriented features for metadata representation. Most phylogenetic and other bioinformatic software currently supports only one or a few formats, although supporting many widely used standards simultaneously would be desirable to achieve optimal interoperability and to prevent data loss through external conversions. We developed JPhyloIO, which allows reading and writing of alignment and tree formats (NeXML, PhyloXML, Nexus, Newick, FASTA, Phylip, MEGA, XTG, PDE) through a common interface. It is the only currently available Java library that generalizes between the different data and metadata concepts of all formats while still allowing access to their individual features. By implementing a single JPhyloIO-based reader and writer, application developers can support all formats in one step. The event-based architecture allows the library to be combined with any application business model design while remaining memory efficient for large datasets. We provide JPhyloIO as a service to the scientific community, which will benefit from simplified development of software that supports various standards simultaneously. Our aims are to increase the interoperability between different (phylogenetic) software tools and to foster the use of more recently proposed formats that provide a powerful metadata concept. JPhyloIO is currently integrated in a number of applications and is fully interoperable with our Java library LibrAlign, which offers powerful components for multiple sequence alignments and attached raw data and metadata. Download and documentation: http://bioinfweb.info/JPhyloIO/
Ben C. Stöver1, Sarah Wiechers1, Kai F. Müller1
1) Evolution and Biodiversity of Plants Group, Institute for Evolution and Biodiversity, WWU Münster, Hüfferstr. 1, 48149 Münster, Germany
Event based document reading
In JPhyloIO, documents are represented as sequences of events that model the contained elements. Figure 1 shows the grammar that defines how documents can be represented as event sequences, while figure 2 shows an example document and its translation into events. All elements can carry metadata.
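The idea of consuming a document as a flat sequence of events can be sketched as follows. This is a minimal illustration only: the `Event` class, the `Topology` enum and the `collectSequences` method are hypothetical stand-ins written for this example and are not JPhyloIO's actual classes or method signatures. Only the event names (`SEQUENCE`, `SEQUENCE_TOKENS`, and so on) are taken from the grammar in figure 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for illustration; NOT the JPhyloIO API.
class EventStreamSketch {
    enum Topology { START, END, SOLE }

    static final class Event {
        final String contentType;  // e.g. "ALIGNMENT", "SEQUENCE", "SEQUENCE_TOKENS"
        final Topology topology;
        final String payload;      // sequence label or token string; may be null

        Event(String contentType, Topology topology, String payload) {
            this.contentType = contentType;
            this.topology = topology;
            this.payload = payload;
        }
    }

    // Walks the stream once and collects "label=tokens" for every sequence.
    // Only the current sequence is buffered, never the whole document.
    static List<String> collectSequences(Iterator<Event> events) {
        List<String> result = new ArrayList<>();
        String label = null;
        StringBuilder tokens = new StringBuilder();
        while (events.hasNext()) {
            Event e = events.next();
            if (e.contentType.equals("SEQUENCE")) {
                if (e.topology == Topology.START) {
                    label = e.payload;
                    tokens.setLength(0);
                } else if (e.topology == Topology.END) {
                    result.add(label + "=" + tokens);
                    label = null;
                }
            } else if (e.contentType.equals("SEQUENCE_TOKENS") && label != null) {
                tokens.append(e.payload);
            }
            // All other event types are simply skipped by this consumer.
        }
        return result;
    }

    // A small hand-written event sequence standing in for a parsed file.
    static List<Event> exampleDocument() {
        return Arrays.asList(
            new Event("DOCUMENT", Topology.START, null),
            new Event("ALIGNMENT", Topology.START, null),
            new Event("SEQUENCE", Topology.START, "A"),
            new Event("SEQUENCE_TOKENS", Topology.SOLE, "ACGT"),
            new Event("SEQUENCE_TOKENS", Topology.SOLE, "TT"),
            new Event("SEQUENCE", Topology.END, null),
            new Event("SEQUENCE", Topology.START, "B"),
            new Event("SEQUENCE_TOKENS", Topology.SOLE, "ACG-"),
            new Event("SEQUENCE", Topology.END, null),
            new Event("ALIGNMENT", Topology.END, null),
            new Event("DOCUMENT", Topology.END, null));
    }

    public static void main(String[] args) {
        // prints [A=ACGTTT, B=ACG-]
        System.out.println(collectSequences(exampleDocument().iterator()));
    }
}
```

Because the consumer only reacts to the event types it cares about, the same loop would work regardless of which file format produced the events, which is the property the common interface relies on.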
http://www2.ieb.uni-muenster.de/EvolBiodivPlants
Poster download: http://go.wwu.de/po5ac
Citations:
Han, M.V. & Zmasek, C.M. (2009). phyloXML: XML for evolutionary biology and comparative genomics. BMC Bioinformatics, 10, 356.
Kumar, S., Stecher, G. & Tamura, K. (2016). MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets. Molecular Biology and Evolution, msw054.
Maddison, D.R., Swofford, D. & Maddison, W.P. (1997). NEXUS: an extensible file format for systematic information. Systematic Biology, 46, 590-621.
Stöver, B.C. & Müller, K.F. (2010). TreeGraph 2: Combining and visualizing evidence from different phylogenetic analyses. BMC Bioinformatics, 11, 7.
Vos, R.A., Balhoff, J.P., Caravas, J.A., Holder, M.T., Lapp, H., Maddison, W.P., Midford, P.E., Priyam, A., Sukumaran, J., Xia, X. & Stoltzfus, A. (2012). NeXML: Rich, Extensible, and Verifiable Representation of Comparative Data and Metadata. Systematic Biology, 61, 675-689.
Figure 2 The example document contains an OTU list and an alignment that references this list. The sequence of events shown is generated from it according to the grammar in figure 1, where each box represents one event object. Each object has an ID so that it can be referenced by subsequent events. This is shown, for example, for the OTU list and OTU start events, which are referenced by the corresponding alignment and sequence start events.
Document = "DOCUMENT.START", {DocumentContent,} "DOCUMENT.END";
DocumentContent = OTUSet | Matrix | TreeNetworkGroup | CharacterSetPart | TreeNetworkSet | MetaInformation;
OTUList = "OTUS.START", {OTUListContent,} {OTUSet,} "OTUS.END";
OTUListContent = OTU | MetaInformation;
OTU = "OTU.START", {MetaInformation,} "OTU.END";
OTUSet = "OTU_SET.START", {SetContent,} "OTU_SET.END";
Matrix = "ALIGNMENT.START", {MatrixContent,} "ALIGNMENT.END";
MatrixContent = CharacterDefinition | TokenSetDefinition | SequencePart | CharacterSetPart | SequenceSet | MetaInformation;
CharacterDefinition = "CHARACTER_DEFINITION.START" {MetaInformation,} "CHARACTER_DEFINITION.END";
SequenceSet = "SEQUENCE_SET.START" {SetContent,} "SEQUENCE_SET.END";
TokenSetDefinition = "TOKEN_SET_DEFINITION.START", {TokenSetDefinitionContent,} "TOKEN_SET_DEFINITION.END";
TokenSetDefinitionContent = SingleTokenDefinition | MetaInformation;
SingleTokenDefinition = "SINGLE_TOKEN_DEFINITION.START", {MetaInformation,} "SINGLE_TOKEN_DEFINITION.END";
SequencePart = "SEQUENCE.START", {SequencePartContent,} "SEQUENCE.END";
SequencePartContent = "SEQUENCE_TOKENS.SOLE" | SingleSequenceToken | MetaInformation;
SingleSequenceToken = "SINGLE_SEQUENCE_TOKEN.START", {MetaInformation,} "SINGLE_SEQUENCE_TOKEN.END";
CharacterSetPart = "CHARACTER_SET.START", {CharacterSetPartContent,} "CHARACTER_SET.END";
CharacterSetPartContent = "CHARACTER_SET_PART.SOLE" | SetContent;
(* In character sets only references to other character sets (and not single character definitions) are using "SET_ELEMENT.SOLE". *)
TreeNetworkGroup = "TREE_NETWORK_GROUP.START", {TreeNetworkGroupContent,} "TREE_NETWORK_GROUP.END";
TreeNetworkGroupContent = Tree | Network | TreeNetworkSet;
Tree = "TREE.START", {TreeOrNetworkContent,} ["ROOT_EDGE.START",] {TreeOrNetworkContent,} {NodeEdgeSet,} "TREE.END";
Network = "NETWORK.START", {TreeOrNetworkContent,} {NodeEdgeSet,} "NETWORK.END";
TreeOrNetworkContent = Node | Edge | MetaInformation;
Node = "NODE.START", {MetaInformation,} "NODE.END";
Edge = "EDGE.START", {MetaInformation,} "EDGE.END";
TreeNetworkSet = "TREE_NETWORK_SET.START" {SetContent,} "TREE_NETWORK_SET.END";
NodeEdgeSet = "NODE_EDGE_SET.START" {SetContent,} "NODE_EDGE_SET.END";
SetContent = "SET_ELEMENT.SOLE" | MetaInformation;
(* Single elements and other sets of the same type can be linked using "SET_ELEMENT.SOLE". *)
MetaInformation = ResourceMeta | LiteralMeta;
ResourceMeta = "RESOURCE_META.START", {MetaInformation,} "RESOURCE_META.END";
LiteralMeta = "LITERAL_META.START", {"LITERAL_META_CONTENT.SOLE",} "LITERAL_META.END";
Figure 1 EBNF grammar describing JPhyloIO
event sequences. The terminal symbols (in
green) represent the types of events, each of
which either has a single SOLE or a START and
END version, depending on whether additional
data can be nested or not.
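One property that follows from the START/END/SOLE scheme is that every event sequence is properly nested, much like tags in an XML document. The sketch below checks only this nesting property with a stack; it does not validate the full grammar (for instance, it does not check which event types may occur inside which others). The class and method names are invented for this example, and events are represented as plain strings in the "TYPE.TOPOLOGY" spelling used by the terminal symbols.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Illustrative helper, not part of JPhyloIO.
class EventNestingSketch {
    // Returns true if every START has a matching END of the same type in the
    // correct order and every SOLE event occurs inside some open element.
    static boolean isWellNested(List<String> events) {
        Deque<String> open = new ArrayDeque<>();
        for (String event : events) {
            int dot = event.lastIndexOf('.');
            if (dot < 0) {
                return false;  // not of the form TYPE.TOPOLOGY
            }
            String type = event.substring(0, dot);
            String topology = event.substring(dot + 1);
            switch (topology) {
                case "START":
                    open.push(type);
                    break;
                case "END":
                    if (open.isEmpty() || !open.pop().equals(type)) {
                        return false;  // END without a matching START
                    }
                    break;
                case "SOLE":
                    if (open.isEmpty()) {
                        return false;  // SOLE events are always nested in the grammar
                    }
                    break;
                default:
                    return false;      // unknown topology
            }
        }
        return open.isEmpty();         // nothing may remain open at the end
    }

    public static void main(String[] args) {
        List<String> document = Arrays.asList(
            "DOCUMENT.START",
            "OTUS.START", "OTU.START", "OTU.END", "OTUS.END",
            "ALIGNMENT.START",
            "SEQUENCE.START", "SEQUENCE_TOKENS.SOLE", "SEQUENCE.END",
            "ALIGNMENT.END",
            "DOCUMENT.END");
        System.out.println(isWellNested(document));  // prints true
    }
}
```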
Availability
JPhyloIO is distributed under GNU
General Public License Version 3 at the
BioInfWeb Software portal:
http://bioinfweb.info/JPhyloIO
Aims and concept
To date many bioinformatic software tools support only a single format. Applications based on JPhyloIO need to implement just one reader and one writer to support all formats, without needing detailed knowledge of any of them. This should increase interoperability and foster the use of more recently proposed powerful formats, such as NeXML. Our library provides access to nine phylogenetic file formats through one common interface while exposing all features of each format (including the complex metadata of NeXML and PhyloXML). Documents are translated to a stream of events (see figures 1 and 2), allowing memory-efficient processing independent of the application business model.
Acknowledgements
Parts of the development of JPhyloIO were funded by the DFG (German Research Foundation) through grant MU 2875/3-1 to KFM, which is gratefully acknowledged. BCS thanks the European Conference on Computational Biology (ECCB) and the International Society for Computational Biology (ISCB) for partly financing the presentation of this poster at ECCB 2016. Furthermore, the authors are very thankful to the developers of the other open source projects JPhyloIO uses (Apache Commons, OWL API, JUnit, Hamcrest).
Writing events using data adapters
Since different formats require the data in different orders, a simple event stream is not efficient for writing. A number of data adapters have been defined instead, each of them providing a subsequence of the event stream.
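Why an adapter is preferable to a fixed event stream for writing can be illustrated with a simplified sketch. Here a hypothetical `MatrixAdapter` interface (invented for this example, not one of JPhyloIO's data adapter interfaces) exposes an application's alignment model, and two writers request the same data in different orders: one sequence at a time, as in FASTA, and in column blocks, as in interleaved formats. For brevity the sketch returns token strings directly, whereas the library's adapters provide subsequences of events.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration; the interface below is an assumption, not JPhyloIO's API.
class DataAdapterSketch {
    // Implemented by the application on top of its own business model.
    interface MatrixAdapter {
        List<String> labels();
        int columnCount();
        String tokens(String label, int start, int end);  // end is exclusive
    }

    // Sequential order: each sequence is requested completely before the next.
    static String writeSequential(MatrixAdapter m) {
        StringBuilder out = new StringBuilder();
        for (String label : m.labels()) {
            out.append('>').append(label).append('\n')
               .append(m.tokens(label, 0, m.columnCount())).append('\n');
        }
        return out.toString();
    }

    // Interleaved order: a block of columns is requested from every sequence
    // before moving on to the next block.
    static String writeInterleaved(MatrixAdapter m, int blockWidth) {
        StringBuilder out = new StringBuilder();
        for (int start = 0; start < m.columnCount(); start += blockWidth) {
            int end = Math.min(start + blockWidth, m.columnCount());
            for (String label : m.labels()) {
                out.append(label).append(' ')
                   .append(m.tokens(label, start, end)).append('\n');
            }
        }
        return out.toString();
    }

    // Example adapter backed by a map from label to equal-length sequences.
    static MatrixAdapter fromMap(final Map<String, String> model) {
        return new MatrixAdapter() {
            public List<String> labels() {
                return new ArrayList<>(model.keySet());
            }
            public int columnCount() {
                return model.isEmpty() ? 0 : model.values().iterator().next().length();
            }
            public String tokens(String label, int start, int end) {
                return model.get(label).substring(start, end);
            }
        };
    }

    public static void main(String[] args) {
        Map<String, String> model = new LinkedHashMap<>();
        model.put("A", "ACGT");
        model.put("B", "TGCA");
        MatrixAdapter adapter = fromMap(model);
        System.out.print(writeSequential(adapter));
        System.out.print(writeInterleaved(adapter, 2));
    }
}
```

Both writers work against the same adapter, so the application implements access to its model once, and each format-specific writer pulls the parts it needs in the order its format requires.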
Figure 3 The data adapters of JPhyloIO are implemented by
an application to provide access to its business model.
Figure 4 A UML class diagram showing the inheritance and relations between the different data adapter interfaces of JPhyloIO. DocumentDataAdapter is the starting point that provides access to the others.
Supported formats
NeXML (Vos et al., 2012)
Nexus (Maddison et al., 1997)
PhyloXML (Han & Zmasek, 2009)
FASTA
Newick tree format
Phylip and extended Phylip (also sequential)
MEGA (Kumar et al., 2016)
PDE used by the alignment editor PhyDE
XTG used by TreeGraph 2 (Stöver & Müller, 2010)