Background: Today a variety of phylogenetic file formats exists, some of which are well-established but limited in their data model, while other more recently introduced ones offer advanced features for metadata representation. Although most currently available software only supports the classical formats with a limited metadata model, it would be desirable to have support for the more advanced formats. This is necessary for users to produce richly annotated data that can be efficiently reused and make underlying workflows easily reproducible. A programming library that abstracts over the data and metadata models of the different formats and allows supporting all of them in one step would significantly simplify the development of new and the extension of existing software to address the need for better metadata annotation. Results: We developed the Java library JPhyloIO, which allows event-based reading and writing of the most common alignment and tree/network formats. It allows full access to all features of the nine currently supported formats. By implementing a single JPhyloIO-based reader and writer, application developers can support all of these formats. Due to the event-based architecture, JPhyloIO can be combined with any application data structure, and is memory efficient for large datasets. JPhyloIO is distributed under LGPL. Detailed documentation and example applications (available on http://bioinfweb.info/JPhyloIO/) significantly lower the entry barrier for bioinformaticians who wish to benefit from JPhyloIO's features in their own software. Conclusion: JPhyloIO enables simplified development of new and extension of existing applications that support various standard formats simultaneously. This has the potential to improve interoperability between phylogenetic software tools and at the same time motivate usage of more recent metadata-rich formats such as NeXML or phyloXML.
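The event-based reading described in this abstract can be illustrated with a minimal sketch. All names below (`SketchEvent`, `SketchEventReader`, `FastaSketchReader`, `EventReadingSketch`) are hypothetical and are not the JPhyloIO API; the sketch only assumes a simplified FASTA input. It shows how application code that depends on a format-independent event interface could fill its own data structure while the reader holds only one line of the file in memory at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Queue;

// Hypothetical event model for illustration; these are NOT JPhyloIO classes.
enum SketchEventType { SEQUENCE_START, SEQUENCE_TOKENS, SEQUENCE_END }

final class SketchEvent {
    final SketchEventType type;
    final String text; // sequence name or a chunk of tokens; null for SEQUENCE_END

    SketchEvent(SketchEventType type, String text) {
        this.type = type;
        this.text = text;
    }
}

// Format-independent pull interface that the application code depends on.
interface SketchEventReader {
    boolean hasNext();
    SketchEvent next();
}

// One format-specific implementation (simplified FASTA), reading line by line.
final class FastaSketchReader implements SketchEventReader {
    private final BufferedReader in;
    private final Queue<SketchEvent> pending = new ArrayDeque<>();
    private boolean inSequence = false;
    private boolean eof = false;

    FastaSketchReader(Reader reader) {
        this.in = new BufferedReader(reader);
    }

    // Reads lines until at least one event is queued or the input ends.
    private void fill() {
        try {
            while (pending.isEmpty() && !eof) {
                String line = in.readLine();
                if (line == null) {
                    eof = true;
                    if (inSequence) {
                        pending.add(new SketchEvent(SketchEventType.SEQUENCE_END, null));
                        inSequence = false;
                    }
                } else if (line.startsWith(">")) {
                    if (inSequence) {
                        pending.add(new SketchEvent(SketchEventType.SEQUENCE_END, null));
                    }
                    pending.add(new SketchEvent(SketchEventType.SEQUENCE_START, line.substring(1).trim()));
                    inSequence = true;
                } else if (inSequence && !line.trim().isEmpty()) {
                    pending.add(new SketchEvent(SketchEventType.SEQUENCE_TOKENS, line.trim()));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        fill();
        return !pending.isEmpty();
    }

    @Override
    public SketchEvent next() {
        fill();
        if (pending.isEmpty()) {
            throw new NoSuchElementException();
        }
        return pending.remove();
    }
}

// Application side: events are mapped to whatever data structure the application uses.
final class EventReadingSketch {
    static Map<String, String> readAlignment(SketchEventReader reader) {
        Map<String, String> result = new LinkedHashMap<>();
        String name = null;
        StringBuilder tokens = new StringBuilder();
        while (reader.hasNext()) {
            SketchEvent event = reader.next();
            switch (event.type) {
                case SEQUENCE_START:
                    name = event.text;
                    tokens.setLength(0);
                    break;
                case SEQUENCE_TOKENS:
                    tokens.append(event.text);
                    break;
                case SEQUENCE_END:
                    result.put(name, tokens.toString());
                    break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder input, not real data.
        String fasta = ">A\nACGT\nAC\n>B\nTT\n";
        System.out.println(readAlignment(new FastaSketchReader(new StringReader(fasta))));
    }
}
```

In this sketch, `readAlignment` knows only the event interface, so a reader for another format could be substituted without changing the application code. That is the design idea the abstract attributes to the event-based architecture.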
Specimens form the falsifiable evidence used in plant systematics. Derivatives of specimens (including the specimen as the organism in the field) such as tissue and DNA samples play an increasing role in research. The EDIT Platform for Cybertaxonomy is a specialist's tool that allows users to document and sustainably store all data used in the taxonomic work process, from field data to DNA sequences. The types of data stored can be very heterogeneous, consisting of specimens, images, text data, primary data files, taxon assignments, etc. The EDIT Platform organizes the linking between such data by using a generic data model for representing the research process. Each step in the process is regarded as a derivation step and generates a derivative of the previous step. This could be a field unit having a specimen as its derivative or a specimen having a tissue sample as its derivative. Each derivation step also produces metadata recording who performed the derivation, when, and how. The Platform's Common Data Model (CDM) and the applications built on the CDM library thus represent the first comprehensive implementation of the largely theoretical models developed in the late 1990s (Berendsohn et al. 1999). In a pilot project, research data about the genus Campanula (Kilian et al. 2015, FUB, BGBM 2012) were gathered and used to create a hierarchy of derivatives reaching from field data to DNA sequences. Additionally, the open source library for multiple sequence alignments LibrAlign (Stöver and Müller 2015) was used to integrate an alignment editor into the EDIT Platform that allows generating consensus sequences as derivatives of DNA sequences. The persistent storage of each link in the derivation process and the degree of detail with which the data and metadata are stored will speed up the research process, ease the reproducibility of research results and enhance sustainability of collections.
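The derivation model described in this abstract can be sketched as a small tree structure. The class names (`DerivationStep`, `ResearchUnit`, `DerivationSketch`) and all labels, names and dates are hypothetical placeholders, not the CDM library API or real data. The sketch assumes only what the text states: each step yields a derivative of the previous unit and records who performed it, when, and how.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

// Hypothetical names for illustration only; this is NOT the CDM library API.

// Metadata of one derivation step: who performed it, when, and how.
final class DerivationStep {
    final String performedBy;
    final LocalDate performedOn;
    final String method;

    DerivationStep(String performedBy, LocalDate performedOn, String method) {
        this.performedBy = performedBy;
        this.performedOn = performedOn;
        this.method = method;
    }
}

// One unit in the research process (field unit, specimen, tissue sample, ...).
final class ResearchUnit {
    final String kind;
    final String label;
    final ResearchUnit origin;      // unit this one was derived from; null for the root
    final DerivationStep derivedBy; // metadata of the producing step; null for the root
    private final List<ResearchUnit> derivatives = new ArrayList<>();

    private ResearchUnit(String kind, String label, ResearchUnit origin, DerivationStep derivedBy) {
        this.kind = kind;
        this.label = label;
        this.origin = origin;
        this.derivedBy = derivedBy;
    }

    // The root of a derivation hierarchy has no origin and no derivation step.
    static ResearchUnit fieldUnit(String label) {
        return new ResearchUnit("field unit", label, null, null);
    }

    // Creates a derivative of this unit and stores the link in both directions.
    ResearchUnit derive(String kind, String label, DerivationStep step) {
        ResearchUnit derivative = new ResearchUnit(kind, label, this, step);
        derivatives.add(derivative);
        return derivative;
    }

    List<ResearchUnit> getDerivatives() {
        return Collections.unmodifiableList(derivatives);
    }

    // Kinds of all units on the path from the root down to this unit.
    List<String> provenance() {
        LinkedList<String> path = new LinkedList<>();
        for (ResearchUnit unit = this; unit != null; unit = unit.origin) {
            path.addFirst(unit.kind);
        }
        return path;
    }
}

final class DerivationSketch {
    public static void main(String[] args) {
        // All labels, names and dates below are placeholders.
        LocalDate day = LocalDate.of(2000, 1, 1);
        ResearchUnit field = ResearchUnit.fieldUnit("field-unit-1");
        ResearchUnit specimen = field.derive("specimen", "specimen-1",
                new DerivationStep("person A", day, "collecting"));
        ResearchUnit tissue = specimen.derive("tissue sample", "tissue-1",
                new DerivationStep("person B", day, "sampling"));
        ResearchUnit dna = tissue.derive("DNA sequence", "sequence-1",
                new DerivationStep("person C", day, "sequencing"));
        System.out.println(dna.provenance());
    }
}
```

Because every unit keeps a reference to its origin and to the step that produced it, the full chain from a DNA sequence back to the field unit can be recovered by walking the `origin` links, as `provenance()` does.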
We present the model and implementation of a workflow that blazes a trail in systematic biology for the re-usability of character data (data on any kind of characters of pheno- and genotypes of organisms) and their additivity from specimen to taxon level. We take into account that any taxon characterization is based on a limited set of sampled individuals and characters, and that consequently any new individual and any new character may affect the recognition of biological entities and/or the subsequent delimitation and characterization of a taxon. Taxon concepts thus frequently change during the knowledge generation process in systematic biology. Structured character data are therefore needed not only for the knowledge generation process but also for easily adapting characterizations of taxa. We aim to facilitate the construction and reproducibility of taxon characterizations from structured character data of changing sample sets by establishing a stable and unambiguous association between each sampled individual and the data processed from it. Our workflow implementation uses the European Distributed Institute of Taxonomy Platform, a comprehensive taxonomic data management and publication environment, to: (i) establish a reproducible connection between sampled individuals and all samples derived from them; (ii) stably link sample-based character data with the metadata of the respective samples; (iii) record and store structured specimen-based character data in formats allowing data exchange; (iv) reversibly assign sample metadata and character datasets to taxa in an editable classification and display them; and (v) organize data exchange via standard exchange formats and enable the link between the character datasets and samples in research collections, ensuring high visibility and instant re-usability of the data. The implemented workflow will contribute to organizing the interface between phylogenetic analysis and revisionary taxonomic or monographic work.
Database URL: http://campanula.e-taxonomy.net/
Today a variety of alignment and tree file formats exists, some of which are well-established but limited in their data model, while others, more recently proposed, offer advanced future-oriented features for metadata representation. Most phylogenetic and other bioinformatic software currently supports only one or a few formats, although supporting many widely used standards simultaneously would be desirable to achieve optimal interoperability and prevent data loss through external conversions. We developed JPhyloIO, which allows reading and writing of alignment and tree formats (NeXML, PhyloXML, Nexus, Newick, FASTA, Phylip, MEGA, XTG, PDE) using a common interface. It is the only currently available Java library that generalizes between the different data and metadata concepts of all formats, while still allowing access to their individual features. By implementing a single JPhyloIO-based reader and writer, application developers can support all formats in one step, and the event-based architecture allows the library to be combined with any application business model design, while still being memory efficient for large datasets. We provide JPhyloIO as a service to the scientific community, which will benefit from simplified development of software that supports various standards simultaneously. Our aims are to increase the interoperability between different (phylogenetic) software tools and to foster usage of more recently proposed formats providing a powerful metadata concept. JPhyloIO is currently integrated in a number of applications and is fully interoperable with our Java library LibrAlign, which offers powerful components for multiple sequence alignments and attached raw and metadata. Download and documentation: http://bioinfweb.info/JPhyloIO/