Sample data processing in an additive and reproducible taxonomic workflow by using character data persistently linked to preserved individual specimens

Abstract

We present the model and implementation of a workflow that blazes a trail in systematic biology for the re-usability of character data (data on any kind of characters of pheno- and genotypes of organisms) and their additivity from specimen to taxon level. We take into account that any taxon characterization is based on a limited set of sampled individuals and characters, and that consequently any new individual and any new character may affect the recognition of biological entities and/or the subsequent delimitation and characterization of a taxon. Taxon concepts thus frequently change during the knowledge generation process in systematic biology. Structured character data are therefore not only needed for the knowledge generation process but also for easily adapting characterizations of taxa. We aim to facilitate the construction and reproducibility of taxon characterizations from structured character data of changing sample sets by establishing a stable and unambiguous association between each sampled individual and the data processed from it. Our workflow implementation uses the European Distributed Institute of Taxonomy Platform, a comprehensive taxonomic data management and publication environment to: (i) establish a reproducible connection between sampled individuals and all samples derived from them; (ii) stably link sample-based character data with the metadata of the respective samples; (iii) record and store structured specimen-based character data in formats allowing data exchange; (iv) reversibly assign sample metadata and character datasets to taxa in an editable classification and display them and (v) organize data exchange via standard exchange formats and enable the link between the character datasets and samples in research collections, ensuring high visibility and instant re-usability of the data. The workflow implemented will contribute to organizing the interface between phylogenetic analysis and revisionary taxonomic or monographic work. 
Original article
Norbert Kilian*, Tilo Henning, Patrick Plitzner, Andreas Müller, Anton Güntsch, Ben C. Stöver, Kai F. Müller, Walter G. Berendsohn and Thomas Borsch
Botanic Garden and Botanical Museum Berlin-Dahlem, Dahlem Centre of Plant Sciences, Freie
Universität Berlin, Königin-Luise-Str. 6–8, 14195 Berlin, Germany and
Institut für Evolution und Biodiversität und Botanischer Garten Münster, Westfälische
Wilhelms-Universität Münster, Hüfferstr. 1, 48149 Münster, Germany
*Corresponding author: Tel: 0049 30 83850129; Fax: 0049 30 838450129; Email:
Citation details: Kilian,N., Henning,T., Plitzner,P., et al. Sample data processing in an additive and reproducible taxonomic
workflow by using character data persistently linked to preserved individual specimens. Database (2015) Vol. 2015: article
ID bav094; doi:10.1093/database/bav094
Received 19 June 2015; Revised 1 September 2015; Accepted 2 September 2015
© The Author(s) 2015. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits
unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Database, 2015, 1–19
doi: 10.1093/database/bav094
Biological systematics, referred to as systematics in this
study, aims to assess organismic diversity by attempting to
identify natural biological entities above the individual level
(taxa), to uncover their relationships and to characterize,
classify and name them (1). All analyses in systematics
(Figure 1) are based on ‘samples’, a term used in this study
in the unspecified sense of a probe or examination object
taken from an individual organism. Examination of these
samples produces ‘character data’—often named ‘descrip-
tive data’ (2,3) and sometimes ‘comparative data’ (1)—a
class of data referring to ‘taxonomic characters’ (4), which
each have two or more states and can cover all data suitable
to characterize a taxon in comparison with related or simi-
lar taxa. Character data that are suitable for use in evolu-
tionary analyses are processed in order to group sampled
individuals into natural biological entities. Evolutionary
analyses may include the study of tokogenetic relationships
within a species, or the study of sampled individuals as
representatives of species in a phylogenetic context. The
character data may also be analysed using a phenetic or other
approach. The results in each case are initially unclassified
entities, which in subsequent steps can be assigned to taxa
and then be named (Figure 1) (5,6). The taxon assignment
of unclassified entities revealed from evolutionary analyses
translates evolutionary relationships into a classification.
This translation essentially employs decisions on appropri-
ate circumscriptions and ranks of taxa, guided by certain
sets of criteria, which may be subject to debate. Additional
individuals that match these taxa can also be assigned to
them. Taxon assignment of individuals, i.e. the process of
matching sampled individuals with taxa, thus of their identi-
fication, uses a subset of character data as indicators that
are considered diagnostic for a taxon and for its distinction
from similar or related taxa. The available character data
obtained from all sampled individuals of a taxon are finally
‘aggregated’, thus summed up, into a comprehensive ‘taxon
characterization’ (7) (frequently but less appropriately
referred to as ‘description’, see section two of this study).
Taxon characterizations are thus the product of the taxon
delimitation (5) and may vary in so far as different
taxon delimitations are applied (‘taxon concepts’) (5,8–10)
or different geographic scopes may be considered. The
characterizations of higher taxa (taxa that include subordin-
ate taxa) are in the same way the product of taxon delimita-
tion and are the sum of their included subordinate taxa. The
taxon characterizations of all subordinate taxa making up a
higher taxon are thus to be aggregated into the characteriza-
tion of the corresponding higher taxon.
The generalized scheme in Figure 1 of steps from the in-
vestigation of organism individuals to the characterization
of taxa also illustrates the interface between evolutionary
analysis and taxonomy: both share step 1 (sampling and
examination of samples), while step 2 (analysis of relation-
ships) is the core domain of evolutionary analysis, and
steps 3 and 4 are the core domains of taxonomy. If the evo-
lutionary analysis in step 2 is replaced by an evaluation of
morphological similarities and discontinuities, the result is
a so-called ‘alpha-taxonomic’ classification. This article
addresses the taxonomic part of that work process, thus
step 1, and, taking up the results of evolutionary or other
analyses from step 2, it also addresses steps 3 and 4. We
are aware that taxon characterizations of
microorganisms and fungi may call for different emphases in the
taxonomic work process (11).
Usually, a taxon characterization is based on the exam-
ination of a very limited set of individual representatives
of the taxon and on a set of character data limited by the
selection of examination methods applied. Consequently,
any new, sampled individual as well as any new character
data may affect the taxon characterization and/or the
taxon delimitation. Moreover, in the vast majority of cases
the evaluation of the sampled and examined individuals
is still based just on morphological similarities and discon-
tinuities (alpha-taxonomy) and remains to be confirmed by
phylogenetic reconstructions. In fact, our understanding
of evolutionary history as well as the classification and
naming of taxa is necessarily an iterative process of
approximation to reality, often triggered by methodological
innovations. The need for minor or major revisions
or adjustments of established classifications and taxon
characterizations, also affecting their names, is thus both
pervasive and continuous (12–14).
The process of synthesizing our growing knowledge of
biodiversity is challenging. Integrative taxonomic treat-
ments in general, and monographs as a final product of
systematics (15,16) in particular, consequently represent
the approximate knowledge at a given point of time.
Societal demands for reliable, up-to-date, and authoritative
products, such as biodiversity inventories, identification
aids and encyclopaedic works on groups of organisms
(17), call for name stability, while progress in systematics
may affect established classifications and names.
One of the major problems involved is that print publi-
cations are too static to function as knowledge bases of
organismic diversity. To address this, biodiversity informatics has
developed solutions to design synthesizing works in
biodiversity research as dynamic ventures (18–22) and to
facilitate data exchange by providing unified and conveni-
ent query mechanisms for distributed and often highly
heterogeneous data repositories.
However, in order to organize dynamic approaches to
such syntheses, they need to be generated from data that
are structured in a standardized form and stored in an
underlying database. Character data structured in charac-
ter and state matrices (3,4,23) and data aggregation
procedures for taxon-based character data are well estab-
lished, although still applied by a limited number of
workers. Several applications are available for storing
structured taxon-based character data in order to generate
identification keys and natural language descriptions, and
to aggregate them from lower to higher taxa. The DELTA
(DEscription Language for TAxonomy) (24) system (2) was
the pioneer, followed by others such as Lucid (25), Delta
Access (26) and Xper2 (27). With the development of the
XML-based SDD (Structure of Descriptive Data) standard
(28), data in the DELTA and Nexus (29) standards (30) are
becoming fully exchangeable, provided that they comply
with SDD. More recently, the NeXML exchange standard (31)
has been developed as an XML-based successor to Nexus for
representing taxa, phylogenetic trees, character matrices
and associated metadata.

Fig. 1. Generalized scheme of the steps in systematics from the investigation of organism individuals to the characterization of taxa. The first column
lists the processes (lower case letters + italics) and products (upper case letters + normal style), the diagram illustrates the data flow and the last column
numbers the steps as explained in the following: (1) samples of individuals are examined, providing different types of character data (green,
blue, yellow), not all of them necessarily available for all samples. (2) Analysis of relationships (e.g. phylogenetic or tokogenetic), using e.g. available
molecular character data (blue), reveals evolutionary relationships among the sampled individuals, grouping them into unclassified entities such as
clades. In a phenetic approach, the evolutionary analysis in this step is replaced by an evaluation of morphological similarities and discontinuities. (3)
In order to translate inferred (from whatever analysis) relationships into classification, the unclassified entities with the included samples and character
data are assigned to taxa, also employing further character data types (yellow, green). (4) Naming of taxa and aggregating (summing up) of the
character data from the individuals included results in named and characterized taxa. Further sampled individuals not included in the evolutionary
analysis but matching the taxon characterization can be included, their data adding to the characterization.

The implicit conclusions for the association between
character data from sampled individuals and taxon characterization
have, however, hitherto hardly been drawn with
the necessary rigor. The Prometheus Model (3,5,32), an
approach based on taxonomic working practices rather
than on taxonomic outputs, is a remarkable exception. In
order to make an investigation both transparent and repro-
ducible, vouchers (commonly termed specimens) allowing
an assured identification of the sampled individuals are
permanently preserved. Consequently, the Prometheus
Model emphasizes that the research process in systematic
biology at the species level and below is specimen based,
and the taxon characterization is the product of the
included specimens, and the taxa above the species level
are circumscribed by the subordinate taxa. The taxon
characterization can thus only be determined and repro-
duced in an objective way by the included specimens.
The Prometheus Model takes a specimen-oriented rather
than a taxon-oriented approach. With the Prometheus
Description Model, Pullan et al. (3) moreover perspica-
ciously addressed the need for the re-use and exchange of
character data between different research projects, and
modelled pioneering solutions for the main problems
involved. This includes a solution for compatibility issues
of character datasets from different sources and also the
possibility of recording character data at various levels of
concreteness, ranging from a single instance of a structure
on a specimen to the individual specimen as such. Yet, the
Prometheus Model was never developed into a tool available
for taxonomic work.
Therefore, common taxonomic working practice to this day
is that the characterization of a taxon refers only
collectively to a set of included specimens so that the char-
acter data are not associated with the individual specimens
they were taken from. In this way, the only accurate way
of achieving adjustments with respect to taxon delimitation
and consequently to taxon characterization is the most
laborious: re-examining the characters and specimens.
For a sound foundation of the character data aggrega-
tion procedure and in order to streamline taxon
characterization, a reversible generation of a taxon char-
acterization from the character data of the sampled indi-
viduals is necessary. The prerequisite for this foundation
is to establish a persistent and unambiguous connection
between each sampled individual and the data processed
from it. Specimens remain the representatives of the
sampled individuals after the conclusion of the systematic
research process and are preserved and curated in corres-
ponding research collections. The obvious conclusion
should therefore be the establishment of an unambiguous
association between the character states and ranges re-
corded for each specimen, or for each sample substanti-
ated by a specimen, and their persistent connection with
the specimen metadata. Any newly examined individual
assigned to a certain taxon may then confirm or modify
the taxon characterization upon re-aggregation of the
character data. Once evolutionary analysis of character
data reveals a change in the delimitation of a taxon, its
characterization can be regenerated by aggregating the
character data from the altered sample set. The necessity
to document the character data for individual specimens
rather than for taxa similarly applies to phylogenetic
analyses, in particular to those based on morphological
characters, where the corresponding problems have been
clearly addressed (33).
This article presents the concept of a workflow and
dataflow that blazes a trail in systematic biology for the
re-usability of character data and their additivity from
specimen to taxon level, and its implementation, using the
EDIT (European Distributed Institute of Taxonomy)
Platform (22). We first (part 2) explain our concept for the
implementation of a persistent and unambiguous connec-
tion between character data and samples in the systematic
research process. Subsequently (parts 3 and 4), we describe
the implementation of the single steps of the workflow
using the EDIT Platform.
Our solution aims to (i) establish a reproducible connec-
tion between sampled individuals and all types of samples
derived from them during the research process; (ii) persist-
ently link the metadata of all types of samples with the
respective character data; (iii) record and store specimen-
based phenotypic, geographic and environmental as well as
molecular character data in formats suitable for data
exchange; (iv) reversibly assign sample metadata and char-
acter datasets to taxa in an editable classification and
display them and (v) organize the exchange of sample data
sets via standard exchange formats. Finally, we discuss the
opportunities that our solution opens up for the preserva-
tion of raw data and for the deposition of character datasets
along with samples in research collections, and we identify
fields where further developmental work is needed.
Conceptual foundations of integrated sample data processing
Organismic samples, their associations and data
In systematics, the analysed samples each directly or in-
directly originate from a population of organisms in the
field. Collecting samples of such a population creates a
‘gathering’, a term used here in the sense of the
‘International Code of Nomenclature for algae, fungi and
plants’ (ICN) (34, p. 156) for ‘a collection of one or
more specimens made by the same collector(s) at one
place and time’. The ‘gathering event’ (35) is thus connected
to a specific time and location. The single gathering,
to which usually a unique ‘field number’ or
‘collecting number’ is assigned, is a data object termed
‘field unit’ (35). We here use this term to refer to a single
(named or unnamed) taxon, and either to a single individ-
ual, of which it may include one or more samples (de-
pending on the size of the individual), or to a population,
of which it consequently includes a number of individuals
or parts of them. Therefore, the field unit can consist of
one ‘specimen’ or a number of ‘specimens’ and in the lat-
ter case they are commonly considered duplicates of one
another and are thus principally exchangeable with re-
spect to their essential information content. This depends,
of course, on the research context: population genetic
analyses, for example, require that duplicates stem from the
same individual. However, the concept of the ‘field unit’
also allows the handling of multitaxa gatherings; as
taxon-ambiguous field units are permitted, it is, therefore,
also applicable to the study of microorganisms.
All further samples taken from a specimen of a field
unit are termed ‘derivatives’, more precisely ‘specimen
derivatives’ (‘derived units’) (35). Based on the field unit,
derivation events can create a series of derivatives. Being
products of derivation events, derivatives are usually
hierarchically structured [e.g. specimen → pollen sample
→ scanning electron microscope (SEM) micrographs].
Both the derivative hierarchy and any single derivative are
rooted to the field unit, ensuring that each derivative is
rooted even if an intermediate derivative is lost, of ephem-
eral nature or has never been recorded. A first derivation
step from a taxon-specific field unit is the individualization
of specimens, the specimen thus constitutes a first deriva-
tive of the field unit (Figure 2).
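The rooting rule described above can be illustrated with a minimal Python sketch. It is not the EDIT Platform data model: the class names `FieldUnit` and `Derivative`, the method names and the field number `FU-0001` are assumptions made purely for illustration. Each derivative stores a direct reference to its field unit in addition to an optional parent, so the root stays reachable even when an intermediate derivative is lost or was never recorded.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldUnit:
    """Root of a derivative hierarchy (hypothetical class, one gathering)."""
    field_number: str


@dataclass
class Derivative:
    """A sample derived from a field unit or from another derivative."""
    label: str
    root: FieldUnit                       # direct link to the field unit
    parent: Optional[Derivative] = None   # None if the intermediate is lost or unrecorded

    def derive(self, label: str) -> Derivative:
        # A derivation event: the new derivative keeps the same root.
        return Derivative(label=label, root=self.root, parent=self)

    def lineage(self) -> list[str]:
        # Walk up the recorded parents, then finish at the field unit.
        labels = []
        node: Optional[Derivative] = self
        while node is not None:
            labels.append(node.label)
            node = node.parent
        labels.append(self.root.field_number)
        return list(reversed(labels))


# Usage: specimen -> pollen sample -> SEM micrograph, as in the example above.
unit = FieldUnit("FU-0001")
specimen = Derivative("specimen", root=unit)
pollen = specimen.derive("pollen sample")
micrograph = pollen.derive("SEM micrograph")
```

Storing the root on every derivative duplicates information that could be recovered by walking the parents, but it is this redundancy that keeps a derivative traceable when the chain of parents is incomplete.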
The taxon assignment of a taxon-specific field unit is
normally inherited (in terms of data processing) by, and
valid for, all the field unit’s derivatives. Similarly, a taxon
assignment to a derivative is inherited by all other elements
of the hierarchy and by the field unit. Erroneous assignments of samples
to a taxon-specific field unit may result from misidentification
in the field or become apparent in light of novel insights from later
analyses that lead to the re-circumscription of a taxon.
Another possibility is the consideration of a taxon which
was outside the scope of the original gathering (e.g. epi-
phytic lichens, parasites) (36). Consequently, such samples
need to be separated (at least in their virtual representa-
tion) and assigned to the correct newly developed taxon-
specific field units.
Once a derivative becomes part of a collection (e.g. a
herbarium), and thus a collection object, a metadata type
termed ‘collection unit’ can be assigned to the derivative.
Any sample that is examined, regardless of whether it is
newly collected in the course of the research process or
taken from a research collection, which, in the latter case,
may be from a living collection (e.g. botanical or zoo-
logical garden) or museum collection, is assigned to a
specimen derivative hierarchical level. Two types of data
are principally associated with each sample:
i. ‘Sample metadata’ predominantly include the event-
related information, including sample origin, collecting
locality, observations in the field, gathering method,
preparation process, derivation events in the examin-
ation, position in the derivative hierarchy, accession
and storage place in a collection and more. The main
functions of the sample metadata are to give the sample
a unit identity and to make it reproducible or at least
traceable. The core of the sample metadata is found on
labels attached to a collection object, which may be
supplemented, in the case of poorly labelled ‘historical’
specimens, by data from related sources, such as pub-
lished reports on expeditions and laboratory protocols.
The ‘taxon assignment data’ are a particular type of
sample metadata, which indicate the taxonomic identi-
fication of a (taxon-specific) field unit and all its deriva-
tives, including the taxon name, typification, name of
identifying scientist, date of determination, synonym-
izations and determination history. The taxon name
connects the sample and its data to a certain taxon in
the classification. One type of sample metadata has a
double nature: data related to the gathering event in the
field, such as locality data, gathering date and observa-
tions on the gathered organism, will also contribute to
the characterization of that taxon (described in detail
below), by information such as distribution, ecology or phenology.
ii. ‘Character data’ include all primary (raw) and second-
ary (edited or derived) data gained through the examin-
ation of a sample. They can theoretically comprise
the entire phenome (the entirety of a taxon’s ‘traits’ or
‘features’), genome information plus all related geo-
graphical and environmental data. If character data
have an unambiguous connection to a single docu-
mented sample, they are referred to as specimen-based
as opposed to merely taxon-based character data.
‘Structured character data’ are organized in a matrix
distinguishing characters and two or more states, in
contrast to ‘textual character data’ (e.g. in a natural
language description).
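As an illustration of the distinction between specimen-based and structured character data, the following Python sketch records each observation together with the identifier of the sample it was taken from and arranges the observations in a character-by-sample matrix. The character names, states and sample identifiers are invented for the example, and the structure is an assumption for illustration, not a format defined by SDD, DELTA or the EDIT Platform.

```python
from dataclasses import dataclass

# Hypothetical character definitions: each character has two or more states.
CHARACTERS = {
    "leaf_margin": {"entire", "dentate", "lobed"},
    "flower_colour": {"yellow", "white"},
}


@dataclass(frozen=True)
class Observation:
    sample_id: str   # link to one documented sample makes the record specimen-based
    character: str
    state: str


def make_observation(sample_id: str, character: str, state: str) -> Observation:
    """Create an observation only if the character and state are defined."""
    if character not in CHARACTERS:
        raise KeyError(f"unknown character: {character}")
    if state not in CHARACTERS[character]:
        raise ValueError(f"unknown state {state!r} for character {character}")
    return Observation(sample_id, character, state)


def to_matrix(observations):
    """Arrange observations as {sample_id: {character: set of states}}."""
    matrix = {}
    for obs in observations:
        matrix.setdefault(obs.sample_id, {}).setdefault(obs.character, set()).add(obs.state)
    return matrix
```

A cell holds a set of states because a single specimen can show more than one state of a character, for example entire and dentate leaves on the same plant.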
The term ‘trait’ is conceptually narrower than ‘character
data’. Trait refers to phenotypic variation relative to genetic
and environmental factors for particular phenotypes.
However, it has been used ambiguously either
corresponding with a character or, more commonly, with a
state. The definition of the term trait has been widened in
ecology to functional and physiological traits. The term
character data is inclusive of these as well. The terms
‘descriptive data’ (3,27) and ‘comparative data’ (1) are
largely synonymous to character data. However, the former
in particular has often been used in the narrow sense, refer-
ring to the data of the ‘taxon description’, which historically
ranges from a brief morphological differential diagnosis to a
more or less comprehensive morphological description of a
taxon. The term ‘factual data’ (37), coined in the context of
modelling data relations of taxon concepts and names, is
wider than the above mentioned terms. It refers to any fac-
tual information that is connected to a taxon and thus also
includes information about human uses or the conservation
status of a taxon, which is too extensive to be included with
the character data.

Fig. 2. Exemplar scheme of samples with metadata and character data in a derivative hierarchy.

Derivation events frequently lead to samples that are either
not preserved as physical objects, or they lose their physical
concreteness and then are merely present as digital objects.
Examples include the SEM analysis of pollen samples, where
only the digital SEM micrographs remain, or the amplifica-
tion and sequencing of markers from a DNA isolate, where
Where derivation events transform physical into digital
objects, the digital objects can, with similar justification, be
treated as sample derivatives or as data gained from samples.
Generalizing this, one could consider the generation of char-
acter data from a sample as a derivation event, and the
obtained character dataset as a further derivative instead of
a sample-based characterization item. We have decided, however,
to treat only derivatives in the narrower sense as derivatives
in the data model, i.e. character data are not modelled as
derivatives. For user convenience, a joint visualization of
derivatives and character datasets in the user interface is
nevertheless possible independently of this modelling decision
(Figure 2, and see below).
Processing sample metadata
Usually, the samples in a research project in systematic biology
are partly newly collected and partly obtained
from research collections, either as physical objects or
as digital representations. A required functionality is therefore
the communication with research collection databases or
corresponding aggregators in order to search for and import digital
sample representations and sample metadatasets. The standard
exchange formats ABCD (Access to Biological Collections
Data) (38) and Darwin Core (39) should be supported.
Imported metadatasets may need, at some stage, to be
edited. Editing may include the following: (i) completion
of label data; (ii) addition of relevant metadata from other
sources, such as duplicate samples and itineraries, for insuffi-
ciently labelled historical collection items; (iii) standardiza-
tions, such as making collector names unambiguous and
conversion of data into standard units; (iv) completion or
correction of the parsing of the metadata into the relevant
data fields; (v) clarification of toponyms and georeferencing
localities and (vi) fixed associations of taxon names with
specimens following nomenclatural typification. Editing
with respect to (iii), (iv) and (v) is essential for the processing
of metadata elements in the context of taxon characteriza-
tion, such as georeferenced localities for distribution map-
ping or collecting dates for phenology. Type information (vi)
is to be processed in order to fix the application of a name to
the taxon containing this specimen.
In the case of collection items or their derivatives for
which no digital metadatasets are available, these need to
be newly entered into the data store. For material newly
collected for an investigation, the procedure depends on
institutional workflows: the material may first be accessioned
by the research collection and its metadata then
imported from the institutional collection database, or vice
versa. Exporting the newly entered and the edited sample
metadatasets to institutional research collections is possible
using standard exchange formats ABCD (38) and Darwin
Core (39). Furthermore, this can be done in a way that
clearly distinguishes original and edited data.
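One way to keep original and edited metadata distinguishable is to store both values side by side, as in the following Python sketch. The class `MetadataField`, the function names and the dictionary keys are hypothetical; they are not ABCD or Darwin Core terms and do not reproduce the EDIT Platform implementation. The unit conversion stands in for the standardization step (iii) listed above.

```python
from dataclasses import dataclass
from typing import Optional

FEET_TO_METRES = 0.3048  # definition of the international foot


@dataclass
class MetadataField:
    """One metadata element that keeps the original value next to any edit."""
    original: str
    edited: Optional[str] = None
    edit_note: Optional[str] = None

    @property
    def current(self) -> str:
        # The edited value takes precedence; the original is never overwritten.
        return self.edited if self.edited is not None else self.original


def standardize_altitude(field: MetadataField) -> None:
    """Convert an altitude written as '<number> ft' into metres; leave other values alone."""
    parts = field.original.split()
    if len(parts) == 2 and parts[1] == "ft":
        metres = round(float(parts[0]) * FEET_TO_METRES)
        field.edited = f"{metres} m"
        field.edit_note = "converted from feet"


def export_record(record: dict) -> dict:
    """Export every field with its original and edited value kept apart."""
    return {
        key: {"original": f.original, "edited": f.edited}
        for key, f in record.items()
    }
```

A receiving collection database could then decide for each field whether to adopt the edited value or retain its own original.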
Linking specimen-based character data to sample
Sample examination produces specimen-based character
datasets of various types and formats. These datasets are
characterization items to be persistently linked to the ana-
lysed samples (represented by their metadataset) and via
the derivative hierarchy also to the individual specimens
documenting the individual organismic source of these
data. For all stages of the research process the correspond-
ing character datasets should be available, visible and
easily accessible. An export of the sample metadatasets to
the respective research collection should contain a stable
link to the existing character datasets, or even be directly
associated with the available character datasets.
Taxon assignment of samples and their data
Through assignment to a taxon, the field unit as the root of
the specimen derivative hierarchy becomes connected to
the taxonomic classification of a group of organisms. As a
consequence, all connected derivatives, the sample meta-
data corresponding to the gathering and the character data
resulting from the examination of a sample also become
assigned to that taxon. The taxon assignment is thus effect-
ive for all levels in the derivative hierarchy and is revers-
ible. Samples and character datasets assigned to taxa
should be easily visible and accessible.
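A minimal Python sketch of such a reversible assignment follows. It assumes, for illustration only, that the taxon is recorded once at the field unit and looked up from there for every derivative; the class `TaxonAssignments`, the sample identifiers and the taxon names are hypothetical and do not describe the EDIT Platform API.

```python
class TaxonAssignments:
    """Records taxon assignments at the field unit and keeps an undo history."""

    def __init__(self):
        self._taxon_by_field_unit = {}
        self._history = []  # (field_unit, previous taxon or None)

    def assign(self, field_unit: str, taxon: str) -> None:
        # Remember the previous state so that the assignment can be reversed.
        self._history.append((field_unit, self._taxon_by_field_unit.get(field_unit)))
        self._taxon_by_field_unit[field_unit] = taxon

    def undo(self) -> None:
        # Restore the state before the most recent assignment.
        field_unit, previous = self._history.pop()
        if previous is None:
            self._taxon_by_field_unit.pop(field_unit, None)
        else:
            self._taxon_by_field_unit[field_unit] = previous

    def taxon_of(self, sample_id: str, root_of: dict):
        # Every derivative resolves its taxon through its field unit.
        return self._taxon_by_field_unit.get(root_of[sample_id])


# Usage: both derivatives are rooted in the same (hypothetical) field unit.
root_of = {"specimen-1": "FU-0001", "sem-1": "FU-0001"}
assignments = TaxonAssignments()
assignments.assign("FU-0001", "Taxon A")
```

Because the assignment is stored in one place, a re-assignment or its reversal takes effect at all levels of the derivative hierarchy at once.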
Simply moving a taxon within a classification or renaming
it does not affect the connection between samples and taxa.
In contrast, re-delimitation of a taxon, which involves a
re-evaluation of the included samples and/or character data,
also requires the taxon assignment of the samples to be adjusted.
Aggregating specimen-based character data at the taxon level
The essential procedure for any taxon characterization is
the aggregation of the specimen-based character data to
taxon character data according to the delimitation of the
taxon. The extent and type of the aggregation depend on
Database, Vol. 2015, Article ID bav094 Page 7 of 19
at FU BerlinFB Humanmedizin on October 1, 2015 from
the data type and structure, and the means and purposes of
their use at the taxon level. This may include an ‘append-
ing aggregation’ (leaving the appended data unchanged),
as for DNA sequence data, or a ‘merging aggregation’
(yielding statistical values), as for measurements of floral fea-
tures or altitudinal distribution ranges.
It is necessary for data aggregation to be designed as an
iterative and automated procedure, permitting changes in
the sample basis of the data, due to changing taxon delimi-
tation or data availability. This would trigger a new round
of aggregation, which replaces the results of the preceding
one. The prerequisites are that the data are structured and
compatible. Taking the domain of morphological data as
an example, it becomes evident that the main obstacle is to
ensure that sets of characters and states are compatible
during specimen investigation across a larger group of
organisms. When specimen-based data are aggregated at the
lowest taxon rank applied, even distant taxa of the same larger
group must use compatible matrices; otherwise subsequent
aggregations at higher ranks cannot succeed.
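The difference between appending and merging aggregation, and the replacement of earlier results by a new aggregation round, can be sketched as follows. This is an illustration under stated assumptions, not the procedure implemented in the EDIT Platform: the function `aggregate`, the record keys `sequences` and `corolla_length_mm`, and all values are invented for the example.

```python
from statistics import mean


def aggregate(specimen_records):
    """Aggregate specimen-level records into one taxon-level record."""
    sequences = []   # appending aggregation: items are kept unchanged
    lengths = []     # merging aggregation: values are reduced to statistics
    for record in specimen_records:
        sequences.extend(record.get("sequences", []))
        lengths.extend(record.get("corolla_length_mm", []))
    merged = None
    if lengths:
        merged = {
            "n": len(lengths),
            "min": min(lengths),
            "max": max(lengths),
            "mean": mean(lengths),
        }
    return {"sequences": sequences, "corolla_length_mm": merged}


# Invented sample set for a taxon
samples = [
    {"id": "SPEC-1", "sequences": ["ACGT"], "corolla_length_mm": [10.0, 14.0]},
    {"id": "SPEC-2", "sequences": ["ACGA"], "corolla_length_mm": [12.0]},
]
taxon_record = aggregate(samples)

# A changed taxon delimitation excludes SPEC-2: a new round of aggregation
# is run and its result replaces that of the preceding one.
taxon_record = aggregate(samples[:1])
```

Because the taxon-level record is always recomputed from the specimen records currently assigned, no manual correction of the taxon characterization is needed when the sample basis changes.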
A number of applications exist to create taxon-based
character and state matrices and to further process them for
the generation of identification keys and natural language
descriptions, and to aggregate them from lower to higher
taxa (2,25–27). As long as compliance with the XML-based
SDD standard (28) is provided, the data are exchangeable
between the applications. Problems regarding exchangeabil-
ity of structured data matrices, term ontologies including
addressing homology issues and the character data model
(3) remain to be addressed in future work.
Fortunately, the aggregation of character data from lower
to higher taxa is principally the same as the primary aggrega-
tion of specimen-based character data at the taxon level,
with respect to data structure and aggregation algorithms.
The same applications can thus be employed in order to re-
cord and aggregate specimen-based character data.
Workflow implementation using the EDIT Platform
Extending the EDIT Platform to handle the variety
of sample data
Our concept for an integrated workflow for sample data
spans from the selection of sampled individuals to the
aggregation of character data for named taxa (Figure 1),
but intentionally it excludes the capacity to conduct evolu-
tionary analysis of sampled individuals. However, it aims
to include the entire data recording for the examined sam-
ples (metadata and character data) and to hold and provide
the specimen-based structured character data (morpho-
logical and molecular) of the sampled individuals for any
evolutionary analysis, such as phylogenetic reconstruction.
The datasets for the sampled individuals can be assigned to
taxa according to the results of the analysis and the character
data can be aggregated to add to the taxon characterization.
The implementation of this workflow requires a web-
enabled working platform that readily allows networking of
distributed team workers, supports the pertinent data
exchange standards for collection data, offers suitable inter-
faces to handle character data, and is capable of handling taxo-
nomic classifications. Therefore, the ‘EDIT Platform for
Cybertaxonomy’ (22,40,41), or ‘EDIT Platform’ for short,
has been selected for development of our workflow model.
The EDIT Platform provides the necessary basic function-
alities which require minimal extensions, especially in the
specimen module. The EDIT Platform is based on the
‘Common Data Model’ (CDM) (42), which is a compre-
hensive object-oriented taxonomic information model cov-
ering the flow of taxonomic information from fieldwork to
data publication. The pivot of this model is the ‘taxonomic
concept’ (or ‘potential taxon’) being strictly separated
from scientific names. This approach was originally de-
veloped by Berendsohn (43) and later refined and imple-
mented in the ‘Berlin Model’ e-Platform (44,45). Added to
this was a rule-based ‘transmission engine’ for the transfer
of character and other taxon-related ‘factual data’ between
concepts in a network of taxonomic concepts (46,47). The
CDM complies with the relevant data standards of biodiversity
informatics (Biodiversity Information Standards [TDWG],
also known as Taxonomic Databases Working Group) (48),
including ABCD (38), Taxon Concept Schema (49), SDD
(28) and Darwin Core (39). Besides the EDIT Platform it is
also the basis for Creating a Taxonomic E-Science (19).
An outstanding feature of the EDIT Platform is its
connectivity and interoperability among the emerging interna-
tional biodiversity informatics infrastructures through standar-
dized web service layers. Data exchange interfaces to various
biodiversity e-infrastructures have been implemented including
the GBIF (Global Biodiversity Information Facility) Checklist
Bank (50), Biowikifarm (51), Scratchpads (52), Plazi (53),
BioVeL (54) and Biodiversity Heritage Library (55).
The EDIT Platform is open source and applicable to all
groups of organisms, in particular those covered by the ICN
(34) and the International Code of Zoological Nomenclature
(56). Current applications are monographic in approach
(Cichorieae Portal; CLD-CoW Portal; Palmweb) (57–59), re-
gional checklists (60)o
Basic functionalities of the EDIT Platform,
scalability and use cases
The EDIT Platform can be employed to handle and con-
nect the different data types associated with the samples
right from the start of a research process in systematic
biology. It provides three main components:
i. Data repository and server: The CDM store hosts the
taxonomic classification, the metadata and character
data for samples, and also links to external web re-
sources. All data objects are accessible through Java
and web service interfaces.
ii. Taxonomic Editor: The core application of the work-
ing platform functionality is the Taxonomic Editor.
Among other things, it allows searching for, importing,
entering and editing all taxon- and specimen-related
information stored in the CDM.
iii. Data Portal: The portal provides a dynamic visual user
interface for online publication. It gives access to all
publication-relevant data objects stored in the CDM.
Classifications are represented by a taxon tree, which
allows users to navigate through multiple hierarchies.
The portal links out to biodiversity e-infrastructures such
as BHL (Biodiversity Heritage Library) (55) and GBIF
(50) and has advanced functions for visualizing species
distributions and multimedia files.
Although the EDIT Platform is being designed to sup-
port the distributed research process in systematic biology
from sample acquisition to the publication of a mono-
graph, more frequent use cases are taxonomic revisions or
phylogenetic analysis of smaller groups of organisms, and
in some cases a combination of both. Such work is fre-
quently conducted by an individual scientist or a working
group and usually without a long-term dedication to a par-
ticular group of organisms. Cases like these often lack
active institutional support, in particular IT infrastructure.
Instead of the fully operational ‘community installation’,
they may use the easy to install ‘individual installation’,
which allows a single worker to edit and maintain an indi-
vidual dataset on a personal desktop, and a ‘group installa-
tion’ for a working group with a shared data repository
within an institutional intranet. In contrast to the individ-
ual installation scheme, the group installation comes with
a data portal to publish the data electronically (http:// An installation with the
full implementation of the workflow described in this art-
icle is expected to be available for download by the end of
the project in December 2015.
Steps of the integrated sample data
Scope of the workflow
Here we outline the steps of the integrative processing of
sample metadata, sample character data and their taxon as-
signment, as it has been developed and is being implemented
in the EDIT Platform. Its aim is to create the prerequisites for
a consistently specimen-based research process in systematic
biology. This includes the following: (i) establishing a repro-
ducible connection between sampled individuals and all types
of samples derived from them, which allows instant sample
metadata processing, including de novo input, retrieval, im-
port, documented (for potential synchronization with exter-
nal sources) editing, display and export, within the research
process; (ii) stably linking the metadata of all sample types
with the respective character data gathered from them, by
providing means for handling specimen-based character data
(morphological and molecular) and for firmly linking them
to the sampled individual; (iii) recording structured speci-
men-based character data in formats allowing data exchange
and easy retrieval; (iv) reversibly assigning sample metadata and
character datasets to taxa in an editable classification, allow-
ing optional publication of the investigated samples in the
context of taxon-based information portals and (v) organiz-
ing data exchange via standard exchange formats and ena-
bling persistent, specimen-linked storage effectively accessible
for humans and machines in research collections, ensuring
high visibility and instant re-usability of the data.
The workflow described is still a work-in-progress.
Although the foundations were laid for the implementation
of the entire workflow in the EDIT Platform, its single
steps have been elaborated so far to different depths. It will
be workable throughout by the end of 2015, but the handling
of structured (morphological) character data in particular
will have to be considerably improved to meet all essential
needs; this is the subject of a corresponding follow-up project
proposal submitted to the German Research Foundation.
Establishing a reproducible connection between
sampled individuals and all types of samples
derived from them
Searching, retrieving and importing of sample metadata
The ‘specimen search’ in the Taxonomic Editor is defined
by the search parameters and the query interface supported
by a specimen data provider. The implemented system sup-
ports a list of specimen data providers and allows users to
decide which provider to query. It converts the query to
the required format for the provider’s interface. One
option currently implemented is GBIF (50), which is
queried via web services; the other option is BioCASE (62),
the providers of which are queried with a specific XML-
based query protocol (63) (Figure 3). Common search par-
ameters are taxon name, collector, collector’s number and
country. Specifying two or three of these is usually suffi-
cient to narrow the search results to a manageable number.
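The conversion of one user query into the format required by the selected provider can be sketched as a simple dispatch. The sketch is purely illustrative: the provider names, the functions `to_key_value`, `to_xml` and `build_request`, and in particular the XML element names are placeholders and do not reproduce the GBIF web services or the BioCASE query protocol.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape, quoteattr


def to_key_value(query):
    """Placeholder for a web-service style request: key=value pairs."""
    return urlencode(sorted(query.items()))


def to_xml(query):
    """Placeholder for an XML-based request (element names are invented)."""
    terms = "".join(
        f"<term name={quoteattr(name)}>{escape(value)}</term>"
        for name, value in sorted(query.items())
    )
    return f"<query>{terms}</query>"


# Hypothetical registry: each provider is paired with its query converter.
PROVIDERS = {
    "web-service provider": to_key_value,
    "xml-protocol provider": to_xml,
}


def build_request(provider, query):
    """Convert the same search parameters into the provider's format."""
    return PROVIDERS[provider](query)


query = {"collector": "Smith", "country": "Greece"}  # invented example values
print(build_request("web-service provider", query))
print(build_request("xml-protocol provider", query))
```

Keeping the search parameters independent of the provider format means that a further provider can be supported by adding one converter to the registry.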
The Taxonomic Editor provides an import routine
that can both convert the different formats returned
(ABCD and Darwin Core) to display the results in a
CDM-unique, standardized format and provide the func-
tionality to store the specimen data in the CDM, merging
it with existing data. The imported data are stored with
the provider’s original unique identifiers to enable data
Editing metadatasets
The specimen module of the Taxonomic Editor has been
extended to provide full user interface functionality for dis-
playing and editing all levels of the derivative hierarchy.
The tissue and molecular sample modules of the CDM
have been extended to enable full data coverage. Fields
with pre-defined or user-defined elements have been se-
lected to avoid redundancy and ensure coherent use of
terms and names, e.g. for primers and DNA markers.
Building and editing specimen derivative hierarchies
The derivative hierarchy is displayed as a tree in a separ-
ate interface, the ‘derivative search view’ (Figure 4A).
‘Derivative view’ (Figure 4B) and ‘details view’ (Figure
4C) form a functional unit that allows the convenient ac-
cess to, and the creation and processing of, derivatives
and their data. The field unit element is obligatory be-
cause it is the root of the derivative hierarchy and ap-
pears, if not manually created, automatically once a
specimen or any other sample is entered or imported. All
subsequent derivation steps and derivative types are pre-
arranged in a hierarchical order according to the typical
research workflow.
According to our concept of the derivative hierarchy,
the derivative view holds a central position in the specimen
module of the Taxonomic Editor. It is used to build the
Fig. 3. Taxonomic Editor of the EDIT Platform, derivatives perspective: screenshot of the specimen query and import interface. The black arrows
indicate the single menu steps that specify the import. After the import form has been sent out, the search results are listed in a separate tab. The
specimen can then be chosen (A) and the import of the datasets can be completed (B).
Fig. 4. Taxonomic Editor of the EDIT Platform, derivatives perspective: screenshot of the derivative view displaying the derivative search (A), the de-
rivative hierarchy (B) and a details view for the corresponding metadata (C). Screenshots illustrate the stepwise establishment of a derivative hier-
archy by successive creation of derivatives and insertion of their data: (a) addition (1) of a tissue sample and (2) of a DNA sample; (b) addition (3) of a
consensus sequence with links to one of the INSDC (International Nucleotide Sequence Database Collaboration) databases, (4) of single reads
(Sanger sequencing trace files) and/or a contig file.
derivative hierarchy, thus to select derivatives, to
visualize associated character datasets of the respective
samples, to add and edit sample metadata and to display
the hierarchy with all its data types in the Taxonomic
Editor as they may also appear in the Data Portal. An ex-
ample of such a prearranged derivative hierarchy
(Figure 2) in the Taxonomic Editor is as follows: field unit
→ specimen collected → tissue sample taken → DNA
isolated → DNA trace file created by the sequencer →
consensus sequence generated from the contigs (Figure 4).
In all such cases, the full sequence of the derivatives is not
mandatory and can be applied as appropriate. For ex-
ample, if no tissue sample and DNA isolate are stored, the
trace file or consensus sequence can directly be attached to
the specimen.
By selecting the details view, input options are provided
for the essential metadata of each derivative. In the case of
molecular data, the necessary terms and input options are
matched with those compiled for the GGBN (Global
Genome Biodiversity Network) network (64). The full ex-
tent of data covered by other repositories can be accessed
via links in the details views (Figure 4B).
Versioning, synchronizing and exchanging metadatasets
The editing process (adding, deleting or changing data)
will enrich and refine the original metadataset. These
changes are separated from the original dataset, resulting
in two semantic parts of sample metadata: (i) the ‘core
copy’, the original dataset from an external provider and
(ii) ‘enrichment and refinement’, the edited data which
can be subject to a manual versioning by employing the
auditing functionality of the EDIT Platform (Figure 5).
On this basis, a special ‘Diff-Viewer’ can be implemented
in the future to visualize the differences between versions
and additionally allow the user to revert changes to an older
version (Figure 5). Edited and newly entered specimen meta-
datasets can be provided to the corresponding research col-
lections using AnnoSys (65) as a back-end service for storing
and communicating the annotations. AnnoSys enables users
to annotate publicly displayed specimen records, keeps
track of the annotations and informs data providers
about them. AnnoSys exposes the annotations in the
ABCD standard exchange format. The exposed dataset will
include the documentation of the editing of the core copy to
give the providers the opportunity to update their data.
Conversely, the researchers can ask providers for a possible
update of an earlier imported core copy and manually update
their local copy.
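The separation of the core copy from the enrichment and refinement, and the comparison that a diff viewer would perform, can be sketched as follows. This is a minimal illustration, not the auditing functionality of the EDIT Platform; the functions `effective_record` and `diff`, the field names and all values are assumptions made for the example.

```python
def effective_record(core_copy, enrichment):
    """Overlay the researcher's edits on the unchanged core copy."""
    return {**core_copy, **enrichment}


def diff(old, new):
    """Return {field: (old value, new value)} for every field that differs."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changes[key] = (old.get(key), new.get(key))
    return changes


# Invented metadata: the core copy as imported from an external provider ...
core_copy = {"collector": "Smith", "country": "Greece", "locality": "coast"}
# ... and the edits, which are stored separately from it.
enrichment = {"locality": "coast, N of the harbour", "latitude": "37.9"}

current = effective_record(core_copy, enrichment)

# What could be reported back to the provider: only the documented edits.
print(diff(core_copy, current))

# A later query at the provider yields a new core copy, which can be
# compared with the first one before the local copy is updated manually.
new_core_copy = dict(core_copy, collector="Smith, J.")
print(diff(core_copy, new_core_copy))
```

Since the core copy is never overwritten by edits, original and edited data remain distinguishable in any export.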
Stably linking character datasets to the sample
derivative hierarchy
The central interface for linking specimen-based character
datasets to the sample derivative hierarchy is the ‘factual
data view’ of the Taxonomic Editor. The addition of such
character data is displayed in the derivative hierarchy,
where the derivative symbol is then replaced by a ‘deriva-
tive + character data’ symbol (Figure 2). Storage of the
sample derivative hierarchy data in the CDM is configured
to include the information about and, optionally, a stable
link to external character datasets, or the stored character
datasets themselves.
Fig. 5. Scheme of the envisaged versioning functionality for sample metadata. The core copy is a copy of an imported dataset of an external provider,
which is edited (green data). The versioning support of the CDM database, reporting every single change in the data, is used at certain intervals to cre-
ate versions of the data, which can be compared using a diff viewer. The result of a subsequent query at the provider is stored as a new core copy,
which can be compared with the latest version based on the first core copy and subsequently be edited.
Recording and storing specimen-based
morphological and molecular character data
Specimen-based character data can be stored and curated
in the EDIT Platform using the Taxonomic Editor, inde-
pendent of their format. Data available in files from exter-
nal applications can be stored and linked via the Web. For
storing data files of various types in a working environ-
ment, we are using a server with Apache Subversion
(svn,, which combines
convenient accessing of the file repository (e.g. using
TortoiseSVN) with the advantages of a versioning and re-
vision control system. The files are publicly available via a
URI (uniform resource identifier). Purely textual data can
be stored in free text fields of the CDM data store; some
types of structured data can be directly mapped to the cor-
responding CDM classes for structured factual data.
A fully functional data management, however, requires
structured data in the supported exchange formats (see
For recording and editing character datasets, the
Taxonomic Editor provides the ‘Factual Data View’ and
specialized views for different data types, in which seam-
less integrations of otherwise independent applications are
Structured morphological character data
For the recording and processing of structured morpho-
logical and related types of character data, the Xper² soft-
ware (27) is used. This software enables free creation of
matrices of characters and character states and the record-
ing of qualitative and quantitative character data of
specimens and derivatives. In a recent paper (6, p. 295–
296) we have outlined an approach, employing a termin-
ology server and semantic web technology to ensure the
compatibility of characters and states taken across a larger
group of organisms, which we identified as a main chal-
lenge in part 3, above. There, we have also proposed a
strategy as to how the wealth of unstructured textual
descriptions in the literature can, in a controlled way, be
employed in the frame of an otherwise specimen-based
approach relying on structured data for taxon character-
izations at lower taxonomic rank. Implementation of these
approaches is subject to a corresponding follow-up project
Molecular character data
For recording and processing molecular (DNA) data, the
Taxonomic Editor has been extended using several GUI
(graphical user interface) components that display phero-
grams (trace files from Sanger sequencing) imported from
AB1 or SCF files with their base call sequences and allow
the combination of these in contig alignments and the cre-
ation of consensus sequences. The user can manually
correct the base calls or edit the contig alignment and the
consensus sequences. To achieve this, a new open source
Java library called LibrAlign (66) has been developed. It
provides powerful and flexible GUI components for dis-
playing and editing raw data and metadata for sequences
and alignments. Although LibrAlign was mainly developed
for use in the Taxonomic Editor, its components have been
designed to be of general use for other developers in the
scientific community and it may be integrated into any
Java GUI application, based on Swing, SWT, Eclipse RCP
and Bioclipse (67), and it is interoperable with the CDM
Library (42) and BioJava API (application programming
interface) (68). Furthermore, support for importing and ex-
porting whole contig alignments in various formats, such
as FASTA, Nexus (29), MEGA (69) or NeXML (31), is
currently implemented using JPhyloIO (70) in combination
with LibrAlign. JPhyloIO is another general purpose Java
Library developed for the Taxonomic Editor that provides
event-based format-independent access to different se-
quence and alignment file formats. It is closely integrated
with LibrAlign, but can also be used in the development of
any application that does not use LibrAlign.
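How a consensus sequence can be derived from the aligned reads of a contig is illustrated by the following sketch. It uses a plain majority vote per alignment column and is not the method of LibrAlign or the Taxonomic Editor; the function `consensus`, the tie rule (`N`) and the example reads are assumptions.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority-vote consensus of equally long, gap-padded reads.

    Gaps ('-') are ignored in the vote; a column consisting only of gaps
    yields '-', and a tie between the two most frequent bases yields 'N'.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("aligned reads must have equal length")
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "-")
        if not counts:
            result.append("-")
            continue
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            result.append("N")       # no majority: mark as ambiguous
        else:
            result.append(ranked[0][0])
    return "".join(result)


# Invented contig alignment of three reads
reads = ["ACGT-A", "ACCTTA", "A-GTTC"]
print(consensus(reads))  # ACGTTA
```

In practice base call quality would also be taken into account, which is why manual correction of base calls and of the consensus remains part of the workflow.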
Taxon assignment of sample metadata and
character datasets
Adding sample data to a classification
The classification used for an investigated group of organ-
isms can be displayed and edited in the ‘taxonomic
perspective’ of the Taxonomic Editor, a pre-defined and
pre-ordered set of graphical interfaces. This enables taxo-
nomic hierarchies with synonymies to be imported, created
and edited, including complex re-classification operations.
The taxon assignment of a specimen or derivative hier-
archy is effected in the details views of the derivative view
or in the factual data view, where a taxon of the stored
classification can be selected (and deselected). In this way,
the derivative hierarchy with all linked character data be-
comes assigned to a certain taxon. If the status or position
of a taxon is changed during the revision of a taxonomic
classification in the taxonomic perspective of the Editor,
all appended sample metadata and character data remain
with the taxon. If one taxon is united with another one,
the appended sample metadata and character data are syn-
chronously moved with the taxon and their former place-
ment is recorded. If a changed circumscription of a taxon
requires the moving of specimens to another taxon, their
former placement is also recorded.
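The recording of former placements when taxa are united or specimens are moved can be sketched with a small bookkeeping structure. It is an illustration only and does not reflect the CDM data model; the class `Assignments`, its methods and the identifiers used are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Assignments:
    """Taxon assignments of specimens with a record of former placements."""
    current: Dict[str, str] = field(default_factory=dict)        # specimen -> taxon
    former: Dict[str, List[str]] = field(default_factory=dict)   # specimen -> earlier taxa

    def assign(self, specimen: str, taxon: str) -> None:
        """Assign a specimen to a taxon, recording any previous placement."""
        previous = self.current.get(specimen)
        if previous is not None and previous != taxon:
            self.former.setdefault(specimen, []).append(previous)
        self.current[specimen] = taxon

    def unite(self, absorbed: str, target: str) -> None:
        """Unite one taxon with another: all its specimens move with it."""
        for specimen, taxon in list(self.current.items()):
            if taxon == absorbed:
                self.assign(specimen, target)


assignments = Assignments()
assignments.assign("SPEC-1", "Taxon A")
assignments.assign("SPEC-2", "Taxon B")
assignments.unite("Taxon B", "Taxon A")    # Taxon B is united with Taxon A
assignments.assign("SPEC-1", "Taxon C")    # changed circumscription moves SPEC-1
```

Renaming or moving a taxon within the classification would not touch this structure at all, because the assignment refers to the taxon and not to its name or position.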
Aggregating specimen-based character data at the taxon level
Iterative character data aggregation procedures are being
implemented in the EDIT Platform for two different data types:
i. Occurrence data: primary aggregation of geographical
coordinates will result in dot distribution maps in the
Data Portal. Aggregation of combined area unit distri-
bution and occurrence status data at the same or from
lower to higher taxon ranks is currently operated using
a corresponding transmission engine. This rule-based
engine aggregates distribution information (including
occurrence status data) for a given taxon and region,
recursively using its subtaxa and subregions. In the case
of conflicting status values, decisions are made on the
basis of defined priority rules.
ii. Character data stored in SDD-compliant character-
state-matrices: the Xper² software for character data
management and interactive identification (27), which
is integrated into the EDIT Platform, provides algo-
rithms for data aggregation, merging numerical data
while appending categorical data. The primary aggre-
gation of the specimen-based character data at taxon
rank currently only tentatively allows the automated
generation of a natural language taxon description
from the matrix. However, a workaround is the manual
editing of the data using a description template. The
storage of structured character data also enables the
use of the data matrix for interactive taxon identifica-
tion with the aid of multi access keys accessible through
the data portal’s Keys Tab which we describe below.
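The rule-based, recursive aggregation of occurrence status described under (i) can be sketched as follows. The sketch is not the transmission engine of the EDIT Platform: the function `aggregate_status`, the status terms, their priority order and the example taxa and regions are assumptions chosen for illustration.

```python
# Hypothetical priority rule: earlier entries win in case of conflict.
STATUS_PRIORITY = ["native", "introduced", "doubtful"]


def aggregate_status(taxon, region, subtaxa, subregions, records):
    """Aggregate the occurrence status of a taxon in a region.

    The status is collected recursively from the subregions of the region
    and from the subtaxa of the taxon; conflicts are resolved by priority.
    """
    found = []
    own = records.get((taxon, region))
    if own is not None:
        found.append(own)
    for subregion in subregions.get(region, []):
        status = aggregate_status(taxon, subregion, subtaxa, subregions, records)
        if status is not None:
            found.append(status)
    for subtaxon in subtaxa.get(taxon, []):
        status = aggregate_status(subtaxon, region, subtaxa, subregions, records)
        if status is not None:
            found.append(status)
    if not found:
        return None
    return min(found, key=STATUS_PRIORITY.index)


# Invented example: a genus with two species, a country with two subregions
subtaxa = {"Genus": ["sp1", "sp2"]}
subregions = {"Country": ["North", "South"]}
records = {("sp1", "North"): "introduced", ("sp2", "South"): "native"}

print(aggregate_status("sp1", "Country", subtaxa, subregions, records))    # introduced
print(aggregate_status("Genus", "Country", subtaxa, subregions, records))  # native
```

Only the primary records are stored; the status for higher taxa and larger regions is derived, so it can be recomputed whenever a record or the classification changes.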
Publishing sample metadata and character data with the
CDM Data Portal
The EDIT Platform, except in the individual installation of
the software, allows the visualization of the data through
its online Data Portal, which is customizable in its basic
structure according to one of the principal aims: (i) a sys-
tematic revision or monograph providing maximum data,
(ii) a flora or (iii) a checklist with the most restricted array
provided. Classifications and taxon-related data are visual-
ized in the portal and are accessible through a navigable
taxon tree or via taxon name, area and subject searches. A
data portal with the function of a systematic revision or
monograph presents the information for each taxon, inde-
pendent of its rank, in five basic tabs: (i) the ‘general tab’
displays the summarized taxon-based character data
organized in feature chapters; (ii) the ‘synonymy tab’ dis-
plays the detailed synonymy and typification data organ-
ized in blocks of homotypic synonyms; (iii) the ‘image tab’
displays stored images; (iv) the ‘key tab’ offers
identification keys (interactive or single access) optionally
for taxa including subordinate taxa and (v) the ‘specimen
tab’ finally displays the investigated or determined speci-
mens with their derivative hierarchies and available char-
acter datasets, as well as a dot distribution map for the
taxon based on the georeferenced specimens (Figure 6).
Setting the ‘publish’ flag in the Taxonomic Editor for a spe-
cimen derivative hierarchy and the appended character
datasets displays these data in the Data Portal. A search
function, still in preparation, will allow users to filter cer-
tain derivative types and their data in the specimen tab of
the Data Portal. For each specimen and its derivatives, be-
sides the expanded table view, a separate page with meta-
data, character datasets and links to other available
character datasets can be opened (Figure 6). The
Campanula Portal (71) (see, under ‘Preview’ on the
‘Welcome’ page, the exemplar taxa listed) is being used to
visualize exemplars of taxa with various types of specimen-
based datasets.
Using the publication services of the EDIT Platform,
more specific outputs can be designed for publication of
subsets of data in print or electronic publication media.
Data exchange via standard exchange formats
and enabling persistent, specimen-linked storage
in research collections
Exchange of sample metadata between the EDIT Platform,
research collections, biodiversity networks and collabor-
ators is managed using ABCD (38) and Darwin Core (39)
as the standard exchange formats. Both formats allow the
import and export of the combined sample metadatasets of
entire derivative hierarchies, such as represented, e.g. by
the specimen with its scan and tissue sample collection for
DNA extraction. Moreover, data import of such a deriva-
tive hierarchy further extended for isolated DNA sources
plus marker consensus sequences with their contig files
and corresponding pherograms has successfully been tested
from the GGBN network (64) to the CDM Platform.
Even our more far-reaching concept of persistently
and stably linking morphological and other types of char-
acter data with sample metadata and combined sample
metadatasets of entire derivative hierarchies can already
be realized. ABCD currently offers a container element
(‘<MeasurementsOrFacts>’), which can be used as a
workaround to store atomized data, a complete character
data matrix or a link to such a matrix. In this way, the ex-
change of the sample metadata with the respective research
collection can include the information about and, option-
ally, a stable link to existing character datasets, or even the
stored character datasets themselves. In the proposed fol-
low up project and in connection with the development of
ABCD 3.0, we envisage a more straightforward implemen-
tation for the exchange of associated structured character
datasets. This will lay the foundation to popularize the as-
sociation of (structured) character data with sample meta-
data, as well as their display and effective accessibility for
humans and machines in research collections, ensuring
high visibility and instant re-usability of character data
through research collections.
Our solution emphasizes the editing and enrichment of
specimen metadata (e.g. taxon identifications, nomencla-
tural type status designations, georeferencing) by the re-
searchers in the course of their examination of the
material, as well as the synchronization of edited data
with the existing datasets. In doing so, it takes into account
that the rapid advancements in the digitization of research
collections have split the work and data flows related
to collections into an analogue and a digital branch.
Consequently, solutions have to be designed for the various
use cases to ensure that revised and enriched metadatasets
can conveniently be connected to the collections (65,72).
Moreover, our solution, which streamlines the taxon
characterization through establishing a persistent unam-
biguous relation between each sampled individual and the
corresponding data, also opens new opportunities for the
old problem of securing raw data associated with the re-
search process in systematic biology. Primary research data
do not only include pure data but also digital representa-
tions of preparations from specimens, ranging from light
or scanning electron micrographs to sequencing trace files.
Currently, if specimen-based character data are recorded,
these are frequently treated as raw data, not usually
included in publications, or, e.g. micrographs, published in
a very limited selection only. At best they have, in more re-
cent times, been deposited in repositories (73–75); other-
wise they are still frequently considered only worth short-
term preservation and disposed of after the compulsory peri-
ods of record keeping, if not earlier (76). The deeper reason
for not preserving raw data is often the lack of appropriate
means to document, persistently link and visibly store
them. Additionally, individual research databases are often
not integrated in institutional data management strategies
(77). National research funding bodies increasingly recog-
nize the need for permanent storage facilities for primary
Fig. 6. Data Portal of the EDIT Platform: screenshot of the Campanula data portal displaying the specimen tab visualizing the specimens and their de-
rivatives available for a taxon. The Derivatives column indicates availability of additional datasets by displaying the respective icons. Clicking on a
row (A) folds out the table cell and the listed items (here, specimen scan, DNA sequence contig and trace files) can be accessed by following the links
given. The specimen ID functions as a link (B) to a separate specimen page where all derivatives of this specimen are clearly arranged, character data-
sets are provided and respective files are linked; clicking on the specimen scan thumbnail (C) opens the specimen scan in a separate browser
Database, Vol. 2015, Article ID bav094 Page 15 of 19
research data (75,78). However, the extra work invested in the
meaningful long-term storage of specimen-based character data is not
economical as long as their re-use is not well organized. Primary
research data must therefore be effectively visible in a potential use
context and must be technically compatible with it. Evidently, the mere
presence of data in some sort of public repository does not ensure their
actual availability in a relevant research context. One way to make the
data effectively visible is a firm, persistent link from the metadata of
the deposited specimen to the respective character data in a repository.
Such links can be stored and conveniently exchanged in the standard
metadata exchange formats for specimens (ABCD or Darwin Core). When such
a specimen is accessed, e.g. via online specimen catalogues, the link to
existing character datasets becomes readily available. Alternatively,
the array of specimen-associated data can be extended to include the
character datasets themselves. Recently, a system of persistent HTTP URI
identifiers for collection items associated with the digital
representation of a specimen was suggested by Hyam et al. (79); it
immediately gained wide acceptance and has been further elaborated since
(80). With this system, the inclusion of character data in the array of
specimen-associated data would be an attractive functional solution,
facilitating brief, precise and convenient reference in scientific
publications to a specimen with its digital image (if available), its
metadata and existing character datasets. Such a solution would
certainly help to increase significantly the visibility and re-usability
of character datasets. Research collections are currently in a
far-reaching process of transformation from curating purely analogue
collections to curating sizable and complex collections of analogue and
digital objects with the related datasets. Extending curation to
specimen-based character data may ensure that research collections play
an appropriate key role in current and future research in systematic
biology and thus in biodiversity assessment and analysis.
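The persistent link described above, from a specimen's metadata to its character datasets, can be illustrated with a minimal Python sketch. The class `SpecimenRecord`, its method names and the example.org URIs are hypothetical and do not come from the EDIT Platform. `occurrenceID`, `institutionCode`, `catalogNumber` and `associatedReferences` are Darwin Core terms, but using `associatedReferences` to carry dataset links, and " | " as the list separator, are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class SpecimenRecord:
    """Metadata of one preserved specimen plus links to data derived from it."""

    occurrence_id: str  # stable HTTP URI identifying the specimen
    institution_code: str
    catalog_number: str
    character_datasets: list[str] = field(default_factory=list)

    def link_dataset(self, uri: str) -> None:
        """Attach a character dataset URI; linking the same URI twice is a no-op."""
        if uri not in self.character_datasets:
            self.character_datasets.append(uri)

    def to_darwin_core(self) -> dict[str, str]:
        """Flatten the record to Darwin Core-style term/value pairs."""
        return {
            "occurrenceID": self.occurrence_id,
            "institutionCode": self.institution_code,
            "catalogNumber": self.catalog_number,
            # Assumption: dataset links travel in associatedReferences.
            "associatedReferences": " | ".join(self.character_datasets),
        }
```

A catalogue that resolves the specimen URI could then expose the dataset links together with the remaining metadata, so that citing the specimen identifier in a publication also leads to its character data.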
Our solution blazes a trail in systematic biology research for a
streamlined process of taxon characterization and for the additivity and
re-usability of character data. The implementation is expected to be
operational and available for download by the end of the project in
December 2015. We have started to use this implementation in the
integrative and dynamic approach for monographing the angiosperm order
Caryophyllales (6). The current implementation has focused on various
aspects of sample and data associations, while it has relied on
available software for the handling of morphological character data and
for their aggregation from specimens to taxon characterization as well
as from lower to higher taxon levels. The entire field of morphological
character data aggregation, however, still awaits further development
work. This concerns three areas in particular: the modelling of
character data (3); semantic web solutions for ontologies of descriptive
terms (3,81); and the exchangeability of data and the interoperability
of different character data matrices (e.g. merging procedures for data
matrices).
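The aggregation and merging steps named above can be sketched as follows. This is not the EDIT Platform implementation or any of the software it relies on: the `Matrix` layout, the function names and the rule of pooling states by set union are assumptions chosen to keep the example small.

```python
from collections import defaultdict

# Hypothetical layout: specimen ID -> character name -> set of observed states
Matrix = dict[str, dict[str, set[str]]]


def merge_matrices(a: Matrix, b: Matrix) -> Matrix:
    """Union two specimen-by-character matrices keyed on specimen ID."""
    merged: Matrix = {}
    for source in (a, b):
        for specimen, characters in source.items():
            row = merged.setdefault(specimen, {})
            for character, states in characters.items():
                # New sets are built here, so the input matrices stay unchanged.
                row.setdefault(character, set()).update(states)
    return merged


def aggregate_to_taxon(matrix: Matrix, assignment: dict[str, str]) -> Matrix:
    """Pool specimen-level states into one characterization per taxon."""
    taxa: defaultdict = defaultdict(lambda: defaultdict(set))
    for specimen, characters in matrix.items():
        taxon = assignment.get(specimen)
        if taxon is None:  # unassigned specimens contribute to no taxon
            continue
        for character, states in characters.items():
            taxa[taxon][character] |= states
    return {taxon: dict(chars) for taxon, chars in taxa.items()}
```

Because the taxon characterization is recomputed from the specimen rows, moving a specimen to another taxon only requires a changed `assignment` and a new call to `aggregate_to_taxon`; the specimen-level data themselves are never altered.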
Acknowledgements
The concept of the modules for handling metadata of DNA-related
derivatives has been developed in coordination with and based on work by
Gabriele Droege (BGBM, Berlin) for the GGBN network. The concept of
linking derivatives and their visualization in the EDIT Data Portal has
been developed in discussion with and using the work by Wolf-Henning
Kusber (BGBM, Berlin) for AlgaTerra. The distributed database query
addressing specimen data providers via standardized protocols was
implemented in cooperation with Jörg Holetschek (BioCASe network) and
Patricia Kelbert (DFG project BinHum). The work done by the developers
of the other open source projects on which our software is built (from
the Eclipse and Apache foundations, BioJava, NeXML) is highly
appreciated. We thank Katy Jones (Berlin) for critically reading and
linguistically polishing an earlier version of the text and two
anonymous reviewers for their valuable comments on the original
submission.
Funding
This work was supported by the German Research Foundation (DFG, Deutsche
Forschungsgemeinschaft) within the Scientific Library Services and
Information Systems programme (KI 1175/1-1, MU 2875/3-1).
Funding for open access charge: DFG-funded Open Access
Publication Fund of the Freie Universität Berlin.
Conflict of interest. None declared.
References
1. Stuessy,T.F., Crawford,D.J., Soltis,D.E. and Soltis,P.S. (2014)
Plant systematics: the origin, interpretation, and ordering of
plant biodiversity. In: Gradstein SR (ed.), Regnum Vegetabile,
Vol. 156. Koeltz Scientific Books, Koenigstein, p. 425.
2. Dallwitz,M.J. (1980) User’s guide to the DELTA system—a gen-
eral system for coding taxonomic descriptions. CSIRO Australian
Division of Entomology Report No. 13, Canberra, Australia.
3. Pullan,M.R., Armstrong,K.E., Paterson,T. et al. (2005) The
Prometheus Description Model: an examination of the taxo-
nomic description-building process and its representation.
Taxon,54, 751–765.
4. Sokal,R.R. and Sneath,P.H.A. (1963) Principles of Numerical
Taxonomy. Freeman & Co., San Francisco, CA.
5. Pullan,M.R., Watson,M.F., Kennedy,J.B. et al. (2000) The
Prometheus Taxonomic Model: a practical approach to repre-
senting multiple classifications. Taxon,49, 55–75.
6. Borsch,T., Hernandez-Ledesma,P., Berendsohn,W.G. et al. (2015)
An integrative and dynamic approach for monographing species-rich
plant groups—building the global synthesis of the angiosperm order
Caryophyllales. Perspect. Plant Ecol. Evol. Syst.,17, 284–300.
7. Radford,A.E., Dickison,W.C., Massey,J.R. and Bell,C.R. (1974)
Vascular Plant Systematics. Harper & Row, New York.
8. Berendsohn,W.G. (1995) The concept of “potential taxa” in
databases. Taxon,44, 207–212.
9. Franz,N.M. and Peet,R.K. (2009) Towards a language for mapping
relationships among taxonomic concepts. Syst. Biodiv.,7, 5–20.
10. Franz,N.M. and Cardona-Duque,J. (2013) Description of two
new species and phylogenetic reassessment of Perelleschus
O’Brien & Wibmer, 1986 (Coleoptera: Curculionidae), with a
complete taxonomic concept history of Perelleschus sec. Franz &
Cardona-Duque, 2013. Syst. Biodiv.,11, 209–236.
11. Schoch,C.L., Robbertse,B., Robert, V. et al. (2014) Finding
needles in haystacks: linking scientific names, reference speci-
mens and molecular data for Fungi. Database (Oxford),
2014, 1–21.
12. Bachmann,K. (1998) Species as units of diversity: an outdated
concept. Theor. Biosci.,117, 213–230.
13. Judd,W.S., Campbell,C.S., Kellogg,E.A. et al. (2007) Plant
Systematics: A Phylogenetic Approach, 3rd edn. Sinauer
Associates, Sunderland, MA.
14. Stuessy,T.F. (2009) Paradigms in biological classification (1707–
2007): has anything really changed? Taxon,58, 68–76.
15. Stuessy,T.F. and Lack,H.W. (2011) Monographic plant systematics:
fundamental assessment of plant biodiversity. In: Stuessy TF, Lack HW
(eds). Regnum Vegetabile, Vol. 153. Gantner, Rugell, pp. 179–190.
16. Marhold,K., Stuessy,T., Agababian,M. et al. (2013) The future of
botanical monography: report from an international workshop,
12–16 March 2012, Smolenice, Slovak Republic. Taxon,62, 4–20.
17. Samper,C. (2004) Taxonomy and environmental policy. Philos.
Trans. R. Soc. Lond. B,359, 721–728.
18. Godfray,H.C.J., Clark,B.R., Kitching,I.J. et al. (2007) The web
and the structure of taxonomy. Syst. Biol.,56, 943–955.
19. Scoble,M.J., Clark,B.R., Godfray,H.C.J. et al. (2007)
Revisionary taxonomy in a changing e-landscape. Tijdschr.
Entomol.,150, 305–317.
20. Mayo,S.J., Allkin,R., Baker,W. et al. (2008) Alpha e-taxonomy:
responses from the systematics community to the biodiversity
crisis. Kew Bull.,63, 1–16.
21. Smith,V.S., Rycroft,S.D., Harman,K.T. et al. (2009) Scratchpads:
a data-publishing framework to build, share and manage informa-
tion on the diversity of life. BMC Bioinformatics,10,S6.
22. Berendsohn,W.G. (2010) Devising the EDIT Platform for
Cybertaxonomy. In: Nimis PL, Vignes-Lebbe R (eds). Tools for
Identifying Biodiversity: Progress and Problems. Proceedings of
the International Congress, Paris, 20–22 September 2010. EUT
Edizioni Università di Trieste, Trieste, pp. 1–6.
23. Diederich,J. (1997) Basic properties for biological databases:
character development and support. Math. Comput. Model.,25,
24. Dallwitz,M.J. (1980) A general system for coding taxonomic de-
scriptions. Taxon,29, 41–46.
25. Lucidcentral (1999+) Lucidcentral.
(3 March 2014, date last accessed).
26. Hagedorn,G. and Rambold,G. (2000) A method to establish and
revise descriptive data sets over the internet. Taxon,49, 517–528.
27. Ung,V., Causse,F. and Vignes-Lebbe,R. (2010) Xper2: managing
descriptive data from their collection to e-monographs. In:
Nimis PL, Vignes-Lebbe R (eds). Tools for Identifying
Biodiversity: Progress and Problems. Proceedings of the
International Congress, Paris, 20–22 September 2010, EUT
Edizioni Università di Trieste, Trieste, pp. 113–120.
28. Hagedorn,G., Thiele,K., Morris,R. and Heidorn,P.B. (ed) (2006)
The Structured Descriptive Data (SDD) w3c-xml-schema,
Version 1.1.
SchemaVersion (3 March 2014, date last accessed).
29. Maddison,D.R., Swofford,D.L. and Maddison,W.P. (1997)
NEXUS: an extensible file format for systematic information.
Syst. Biol.,46, 590–621.
30. Dallwitz,M.J. (2010) A Comparison of Formats for Descriptive
Data. Institute of Botany, Chinese Academy of Sciences. http://delta- (3 March 2014, date last accessed).
31. Vos,R.A., Balhoff,J.P., Caravas,J.A. et al. (2012) NeXML: rich,
extensible, and verifiable representation of comparative data and
metadata. Syst. Biol.,61, 675–689.
32. Raguenaud,C., Pullan,M.R., Watson,M.F. et al. (2002)
Implementation of the Prometheus Taxonomic Model: a com-
parison of database models and query languages and an intro-
duction to the Prometheus Object-Oriented Model. Taxon,
33. Stevens,P.F. (1996) On phylogenies and data bases—where are
the data, or are there any? Taxon,45, 95–98.
34. McNeill,J., Barrie,F.R., Buck,W.R. et al. (2012) International
Code of Nomenclature for algae, fungi, and plants (Melbourne
Code) adopted by the Eighteenth International Botanical
Congress Melbourne, Australia, July 2011. In: Gradstein SR
(ed.), Regnum Vegetabile, Vol. 154. Koeltz Scientific Books,
Koenigstein, 208 p.
(20 May 2015, date last accessed).
35. Berendsohn,W.G., Anagnostopoulos,A., Hagedorn,G. et al.
(1999) A comprehensive reference model for biological collec-
tions and surveys. Taxon,48, 511–562.
36. Berendsohn,W.G. and Nimis,P.L. (2000) The complexity of col-
lection information. In: Berendsohn WG (ed). Resource
Identification for a Biological Collection Information Service in
Europe (BioCISE). BGBM, Berlin, pp. 13–18.
37. Berendsohn,W.G. (2003) MoReTax—handling factual informa-
tion linked to taxonomic concepts in biology. Schriftenreihe
38. Berendsohn,W.G. (ed). (2007) Access to Biological Collection
Data. ABCD Schema 2.06—ratified TDWG Standard.
Botanischer Garten und Botanisches Museum Berlin-Dahlem,
Freie Universität Berlin, Berlin.
CODATA/Schema/default.htm (3 March 2014, date last
39. Robertson,T., Döring,M., Wieczorek,J. et al. (2009) Darwin
Core Text Guide. Biodiversity Information Standards (TDWG). (3 March
2014, date last accessed).
40. Ciardelli,P., Kelbert,P., Kohlbecker,A. et al. (2009) The EDIT
platform for cybertaxonomy and the taxonomic workflow: se-
lected components. In: Fischer S, Maehle E, Reischuk R (eds).
Informatik 2009—Im Focus das Leben,Vol.154,Lecture
Notes in Informatics (LNI), Gesellschaft für Informatik,
Bonn, pp. 625–638.
Proceedings154/article4943.html (20 May 2015, date last
41. Berendsohn,W.G., Güntsch,A., Hoffmann,N. et al. (2011)
Biodiversity information platforms: from standards to interoper-
ability. In: Smith V, Penev L (eds). e-Infrastructures for Data
Publishing in Biodiversity Science. ZooKeys, Vol. 150,
pp. 71–87.
42. Anonymous. (2008) Common Data Model. http://dev.e-tax (3 March 2014, date
last accessed).
43. Berendsohn,W.G. (1997) A taxonomic information model for
botanical databases: the IOPI model. Taxon,46, 283–309.
44. Berendsohn,W.G., Döring,M., Geoffroy,M. et al. (2003) The
Berlin Taxonomic Information Model. Schr.reihe Veg.kd.,39,
45. Geoffroy,M. and Berendsohn,W.G. (2003) The concept
problem in taxonomy: importance, components, approaches.
Schriftenreihe Vegetationsk,39, 5–14.
46. Geoffroy,M. and Berendsohn,W.G. (2003) Transmission of
taxon-related factual information. Schriftenreihe Vegetationsk,
39, 83–86.
47. Berendsohn,W.G. and Geoffroy,M. (2007) Networking taxo-
nomic concepts—uniting without “unitary-ism”. In: Curry G,
Humphries C (eds). Biodiversity Databases—Techniques,
Politics, and Applications. CRC Taylor & Francis, Baton Rouge,
LA, pp. 13–22.
48. TDWG. (2007) Biodiversity Information Standards. TDWG
Secretariat, Hobart, Tasmania. (3 March
2014, date last accessed).
49. Hyam,R. and Kennedy,J. (2006) The Taxon Concept Schema. (11
February 2015, date last accessed).
50. GBIF. (2013) Species API.
cies (4 February 2014, date last accessed).
51. Metawiki Contributors. (2011) Biowikifarm. http://biowiki (3 March 2014, date last accessed).
52. Anonymous. (2006) Scratchpads. (3
March 2014, date last accessed).
53. Agosti,D., Klingenberg,C., Catapano,T. and Sautter,G. (2008)
Plazi, Taking Care of Freedom.
54. Vicario,S., Hardisty,A. and Haitas,N. (2011) BioVeL:
Biodiversity Virtual e-Laboratory. EMBnet.journal,17, 5–6.
55. BHL. (2014) Biodiversity Heritage Library. http://biodiversity (20 May 2015, date last accessed).
56. Ride,W.D.L., Cogger,H.G., Dupuis,C. et al. (1999)
International Code of Zoological Nomenclature, 4th edn. The
International Trust for Zoological Nomenclature, The Natural
History Museum, London.
iczn/code/ (4 February 2014, date last accessed).
57. Hand,R., Kilian,N. and von Raab-Straube,E. (2009+) (continuously
updated) International Cichorieae Network: Cichorieae
Portal. (20 May 2015,
date last accessed).
58. Droege,G. (2013+) (continuously updated) CLD-CoW Portal—
Corvids Literature Database: Corvids of the World
Portal. (20 May 2015, date last
59. Palmweb. (2014) Palmweb: Palms of the World Online. www. (3 March 2014, date last accessed).
60. Hand,R., Hadjikyriakou,G.N. and Christodoulou,C.S. (2011+)
(continuously updated) Flora of Cyprus—A Dynamic Checklist. (20 May 2015, date last
61. Hamann,T.D., Müller,A., Roos,M.C. et al. (2014) Detailed
mark-up of semi-monographic legacy taxonomic works using
FlorML. Taxon,63, 377–393.
62. BioCASE. (2005+) Biological Collection Access Service for
Europe (BioCASE). (4 February 2014, date
last accessed).
63. Berendsohn,W.G. and Nimis,P.L. (2000) The complexity of
collection information. In: Berendsohn WG (ed). Resource
Identification for a Biological Collection Information Service in
Europe (BioCISE). BGBM, Berlin, pp. 13–18.
64. Droege,G., Barker,K., Astrin,J.J. et al. (2013) The Global
Genome Biodiversity Network (GGBN) data portal. Nucleic
Acids Res.,42, 607–612.
65. Tschöpe,O., Macklin,J.A., Morris,R.A. et al. (2013) Annotating
biodiversity data via the internet. Taxon,62, 1248–1258.
66. Stöver,B.C. and Müller,K.F. (2014) LibrAlign—A GUI Library
for Displaying and Editing Multiple Sequence Alignments and
Attached Data. (20 May 2015,
date last accessed).
67. Spjuth,O., Alvarsson,J., Berg,A. et al. (2009) Bioclipse 2: a
scriptable integration platform for the life sciences. BMC
Bioinformatics,10, 397.
68. Prlić,A., Yates,A., Bliven,S.E. et al. (2012) BioJava: an
open-source framework for bioinformatics in 2012. Bioinformatics,
28, 2693–2695.
69. Tamura,K., Stecher,G., Peterson,D. et al. (2013) MEGA6:
Molecular Evolutionary Genetics Analysis Version 6.0. Mol.
Biol. Evol. 30, 2725–2729.
70. Stöver,B.C. and Müller,K.F. (2015) JPhyloIO—Event Based
Reading and Writing of Multiple Sequence Alignment File
Formats. (20 May 2015, date
last accessed).
71. Campanula Portal. (2013+) (continuously updated) The
Campanula Data Portal.
tal (18 June 2015, date last accessed).
72. Güntsch,A., Berendsohn,W.G., Ciardelli,P. et al. (2009) Adding
content to content, a generic annotation system for biodiversity
data. Studi Trent. Sci. Nat.,84, 123–128.
73. Vogt,L. and Grobe,P. (2010) MorphDBase—Eine online
Datenbank für morphologische Daten und Metadaten. GfBS
Newsletter,24, 29–34.
74. Dryad. (2014) Dryad Digital Repository.
(20 May 2015, date last accessed).
75. Diepenbroek,M., Glöckner,F., Grobe,P. et al. (2014) Towards
an integrated biodiversity and ecological research data manage-
ment and archiving platform: The German Federation for the
Curation of Biological Data (GFBio). In: Plödereder E, Grunske
L, Schneider E (eds). Informatik 2014—Big Data Komplexität
meistern, Vol. 232, Lecture Notes in Informatics (LNI),
Gesellschaft für Informatik, Bonn, pp. 1711–1724. http://subs. (20
May 2015, date last accessed).
76. Vines,T.H., Albert,A.Y., Andrew,R.L. et al. (2014) The avail-
ability of research data declines rapidly with article age. Curr.
Biol.,24, 94–97.
77. Güntsch,A., Fichtmüller,D., Kirchhoff,A. and Berendsohn,W.G.
(2012) Efficient rescue of threatened biodiversity data using
reBiND workflows. Plant Biosyst.,146, 752–755.
78. Bach,K., Schäfer,D., Enke,N. et al. (2012) A comparative
evaluation of technical solutions for long-term data repositories in
integrative biodiversity research. Ecol. Inform.,11, 16–24.
79. Hyam,R., Drinkwater,R.E. and Harris,D.V. (2012) Stable cit-
ations for herbarium specimens on the internet: an illustration
from a taxonomic revision of Duboscia (Malvaceae). Phytotaxa,
3, 17–30.
80. Güntsch,A. and Hagedorn,G. (2013) Stable Identifiers for
Specimens—A CETAF ISTC Initiative Supported by Pro-
4296 (20 May 2015, date last accessed).
81. GFBio Terminology Server. (2015) Terminology Server of the
German Federation for Biological Data (GFBio). http://termi (20 May 2015, date last accessed).
... This was of particular importance in S. sect. Lupulinaria where species limits are unclear, in order to make sure that the morphological data correspond to the same individual represented in the tree (see Kilian & al. 2015). For the other species, this procedure was applied as far as possible and complemented with information form the literature such as Flora treatments [Flora of Bhutan (Grierson & Long (1983-2002; Flora zambesiaca (Flora Zambesica Managing Committee 1950+)]. ...
... Despite the resolution obtained with markers also used in other DNA barcoding projects (e.g. the German Barcode of Life, GBOL; Geiger & al. 2016), the still largely unclear species limits, with the need to link samples represented in molecular datasets to formally described taxa (including type specimens), hamper attempts of molecular species identification in the Scutel laria orientalis group at the moment. What is needed is an integrative taxonomic approach, using structured character data, both molecular and morphological (Kilian & al. 2015), and also testing geographical and ecological scenarios relevant for speciation. In the example of S. galericulata, which is a morphologically well-characterized species, the matK-trnK and rpl16 plastid genomic regions reveal several substitutions that resolve the species as monophyletic but show the individual from the German state of Bavaria (Scu001) to possess a haplotype closer related to that in the individual from the Caucasus (Scu024) than to the sampled individual from the German state of Saxony-Anhalt (Fig. 1). ...
Full-text available
Scutellaria is one of the largest genera in the Lamiaceae with an estimated 400–500 species with a nearly worldwide distribution. Most species occur in the N hemisphere, with the Caucasus and the wider Irano-Turanian region housing a large number of taxa, many of them considered endemic. We present an overall phylogeny of the monophyletic genus Scutellaria based on rapidly evolving plastid regions (matK-trnK, rpl16, trnL-F). Three well-supported clades are evident, which render the currently accepted S. subg. Scutellaria paraphyletic to S. subg. Apeltanthus, which appears nested in “clade A”, in which the African S. schweinfurthii is sister to all remaining taxa, followed by other lineages of S. subg. Scutellaria. Ancestral states of 12 morphological characters frequently used as diagnostic from subgenus to species level were reconstructed with BayesTraits. The S. orientalis group appears as a major radiation in the Caucasus area and the Irano-Turanian region that may comprise up to a quarter of the species in the genus. This radiation corresponds to a monophyletically defined S. sect. Lupulinaria, characterized by decussate inflorescences and specialized (e.g. cucullate) bracts. Our phylogenetic data present significant resolution at the species level within the S. orientalis group, indicating complex geographically centred patterns of speciation in adaptation to steppe and high mountain habitats, including multiple evolution of pinnate and tomentose leaves. The detailed infrageneric classification of Juzepczuk (1951, 1954) mostly does not reflect natural groups.
... Data annotation is an equally important requirement to easily and unambiguously identify and understand relevant data that fits the need for a concrete project [13,27,28]. Metadata annotation is necessary to, e.g., unambiguously link tree nodes to sequences in a multiple sequence alignment that was used to generate the tree, or to link tree nodes and sequences to taxonomic information, ideally also using a taxonomic ID (e.g., NCBI Taxonomy [29])or even better linking a sequence back to the individual specimen it was derived from [30]. Additionally, linking relevant external resources (e.g., voucher information, digitized specimens or sequencing raw data) and providing metadata that reliably identifies the methods that were used to generate data (e.g., the software and parameters used for a phylogenetic inference) would further improve reusability of data and reproducibility of studies. ...
... The Taxonomic Editor of the EDIT platform for Cybertaxonomy [70] manages taxonomic workflows and their data, while persistently linking character data to preserved individual specimens [30]. AlignmentComparator [71] compares alternative multiple sequence alignments of the same dataset. ...
Full-text available
Background: Today a variety of phylogenetic file formats exists, some of which are well-established but limited in their data model, while other more recently introduced ones offer advanced features for metadata representation. Although most currently available software only supports the classical formats with a limited metadata model, it would be desirable to have support for the more advanced formats. This is necessary for users to produce richly annotated data that can be efficiently reused and make underlying workflows easily reproducible. A programming library that abstracts over the data and metadata models of the different formats and allows supporting all of them in one step would significantly simplify the development of new and the extension of existing software to address the need for better metadata annotation. Results: We developed the Java library JPhyloIO, which allows event-based reading and writing of the most common alignment and tree/network formats. It allows full access to all features of the nine currently supported formats. By implementing a single JPhyloIO-based reader and writer, application developers can support all of these formats. Due to the event-based architecture, JPhyloIO can be combined with any application data structure, and is memory efficient for large datasets. JPhyloIO is distributed under LGPL. Detailed documentation and example applications (available on significantly lower the entry barrier for bioinformaticians who wish to benefit from JPhyloIO's features in their own software. Conclusion: JPhyloIO enables simplified development of new and extension of existing applications that support various standard formats simultaneously. This has the potential to improve interoperability between phylogenetic software tools and at the same time motivate usage of more recent metadata-rich formats such as NeXML or phyloXML.
... Geometric morphometrics makes progress in the exact, objective, and fine-scale assessment of differences in leaf shape via landmark methods and appropriate multivariate statistics (Hodač & al., 2014(Hodač & al., , 2018. Geometric morphometrics can support final taxonomic decisions based on next-generation sequencing data (Kilian & al., 2015), particularly in complexes where morphological differences are hard to assess with traditional morphological classification. So far, there have been only a few studies applying both genomic and morphometric approaches to disentangle phylogenetic relationships in plants (e.g., Jiang & al., 2019). ...
Full-text available
Species are the basic units of biodiversity and evolution. Nowadays, they are widely considered as ancestor-descendant lineages. Their definition remains a persistent challenge for taxonomists due to lineage evolutionary role and circumscription, i.e., persistence in time and space, ecological niche, or a shared phenotype. Recognizing and delimiting species is particularly methodically challenging in fast-evolving, evolutionary young species complexes often characterized by low genetic divergence, hybrid origin, introgression, and incomplete lineage sorting. Ranunculus auricomus is a large Eurasian apomictic polyploid complex that probably has arisen from the hybridization of a few sexual progenitor species. However, even delimitation of and relationships among diploid sexual progenitors are unclear, ranging from 2 to 12 species. Here, we present an innovative workflow combining phylogenomic methods based on 86,782 parameter-optimized RADseq loci and target enrichment of 663 nuclear genes accompanied by geometric morphometrics to delimit sexual species in this evolutionary young complex (<1 Mya). For the first time, we revealed a fully resolved and well-supported maximum likelihood tree phylogeny congruent to neighbor-net network and STRUCTURE results based on RADseq data. In a few clades, we found evidence of discordant patterns indicated by quartet sampling, and reticulation events in the neighbor-net network probably caused by introgression and incomplete lineage sorting. Together with coalescent-based species delimitation approaches based on target enrichment data, we found five main genetic lineages, with an allopatric distribution in central and southern Europe. A concatenated geometric morphometric dataset including data from basal and stem leaves, as well as receptacles, revealed the same five main clusters. We accept those five morphologically differentiated, geographically isolated, genetic main lineages as species: R. cassubicifolius s.l. (incl. R. 
carpaticola), R. envalirensis s.l. (incl. R. cebennensis), R. flabellifolius, R. marsicus, and R. notabilis s.l. (incl. R. austroslovenicus, R. calapius, R. mediocompositus, R. peracris, and R. subcarniolicus). Our comprehensive workflow combing phylogenomic methods supported by geometric morphometrics proved to be successful in delimiting closely related sexual taxa and applying an evolutionary species concept. This workflow is also applicable to other evolutionarily young complexes.
... In such cases, a practical remedy is to post pre-publication manuscripts in a free repository, such as bioRxiv (Sever et al. 2019), so that users can freely access the information while citing the original paper. Unified digital protologues with semantic standardization can be a further step towards automated collection, structuring and analysis of taxonomic data, based on both specimens and species (Kilian et al. 2015;Triebel et al. 2016;Plitzner et al. 2019;Dallwitz et al. 2020). However, this approach is challenging due to terminological ambiguity and the large set of characters required to cover all fungi, only a fraction of which is typically used in a particular lineage. ...
Full-text available
True fungi (Fungi) and fungus-like organisms (e.g. Mycetozoa, Oomycota) constitute the second largest group of organisms based on global richness estimates, with around 3 million predicted species. Compared to plants and animals, fungi have simple body plans with often morphologically and ecologically obscure structures. This poses challenges for accurate and precise identifications. Here we provide a conceptual framework for the identification of fungi, encouraging the approach of integrative (polyphasic) taxonomy for species delimitation, i.e. the combination of genealogy (phylogeny), phenotype (including autecology), and reproductive biology (when feasible). This allows objective evaluation of diagnostic characters, either phenotypic or molecular or both. Verification of identifications is crucial but often neglected. Because of clade-specific evolutionary histories, there is currently no single tool for the identification of fungi, although DNA barcoding using the internal transcribed spacer (ITS) remains a first diagnosis, particularly in metabarcoding studies. Secondary DNA barcodes are increasingly implemented for groups where ITS does not provide sufficient precision. Issues of pairwise sequence similarity-based identifications and OTU clustering are discussed, and multiple sequence alignment-based phylogenetic approaches with subsequent verification are recommended as more accurate alternatives. In metabarcoding approaches, the trade-off between speed and accuracy and precision of molecular identifications must be carefully considered. Intragenomic variation of the ITS and other barcoding markers should be properly documented, as phylotype diversity is not necessarily a proxy of species richness. 
Important strategies to improve molecular identification of fungi are: (1) broadly document intraspecific and intragenomic variation of barcoding markers; (2) substantially expand sequence repositories, focusing on undersampled clades and missing taxa; (3) improve curation of sequence labels in primary repositories and substantially increase the number of sequences based on verified material; (4) link sequence data to digital information of voucher specimens including imagery. In parallel, technological improvements to genome sequencing offer promising alternatives to DNA barcoding in the future. Despite the prevalence of DNA-based fungal taxonomy, phenotype-based approaches remain an important strategy to catalog the global diversity of fungi and establish initial species hypotheses.
Cryptic species are organisms which look identical, but which represent distinct evolutionary lineages. They are an emerging trend in organismal biology across all groups, from flatworms, insects, amphibians, primates, to vascular plants. This book critically evaluates the phenomenon of cryptic species and demonstrates how they can play a valuable role in improving our understanding of evolution, in particular of morphological stasis. It also explores how the recognition of cryptic species is intrinsically linked to the so-called 'species problem', the lack of a unifying species concept in biology, and suggests alternative approaches. Bringing together a range of perspectives from practicing taxonomists, the book presents case studies of cryptic species across a range of animal and plant groups. It will be an invaluable text for all biologists interested in species and their delimitation, definition, and purpose, including undergraduate and graduate students and researchers.
The species of classical taxonomy are examined with a view to their future role in integrative taxonomy. Taxonomic species are presented as the products of a cyclic workflow between taxonomists and biologists in general, and as the essential means to express the results of evolutionary biological research in a cognitive form which can be widely understood outside the systematics research community. In the first part, the procedures underlying the formation and structure of classical species taxon concepts are analysed and discussed, and this involves some passing reference to mental concepts as understood by cognitive psychologists. The second part considers the need for methodological advances in classical taxonomy in the form of computational modelling. It is argued that in order to accomplish this, species taxon concepts will need to be expressed as computable matrices in parallel to their conventional form, expanding their role in integrative taxonomy, facilitating the feedback from evolutionary biological research and potentially accelerating the update and modification of their delimitation as knowledge increases. The third part treats another, more immediate methodological issue: some kinds of data already produced by taxonomic revisions could be provided as standard online outputs but are not yet part of the canonical published format. The final part consists of a discussion of the gradually emerging global online framework of taxonomic species and its importance as a general reference system. A glossary of terms is provided.
This data paper presents a largely phylogeny-based online taxonomic backbone for the Cactaceae compiled from literature and online sources using the tools of the EDIT Platform for Cybertaxonomy. The data will form a contribution of the Caryophyllales Network for the World Flora Online and serve as the base for further integration of research results from the systematic research community. The final aim is to treat all effectively published scientific names in the family. The checklist includes 150 accepted genera, 1851 accepted species, 91 hybrids, 746 infraspecific taxa (458 heterotypic, 288 with autonyms), 17,932 synonyms of accepted taxa, 12 definitely excluded names, 389 names of uncertain application, 665 unresolved names and 454 names belonging to (probably artificial) named hybrids, totalling 22,275 names. The process of compiling this database is described and further editorial rules for the compilation of the taxonomic backbone for the Caryophyllales Network are proposed. A checklist depicting the current state of the taxonomic backbone is provided as supplemental material. All results are also available online on the website of the Caryophyllales Network and will be constantly updated and expanded in the future.
It is time to synthesize the knowledge that has been generated through more than 260 years of botanical exploration, taxonomic and, more recently, phylogenetic research throughout the world. The adoption of an updated Global Strategy for Plant Conservation (GSPC) in 2011 provided the essential impetus for the development of the World Flora Online (WFO) project. The project represents an international, coordinated effort by the botanical community to achieve GSPC Target 1, an electronic Flora of all plants. It will be a first‐ever unique and authoritative global source of information on the world's plant diversity, compiled, curated, moderated and updated by an expert and specialist‐based community (Taxonomic Expert Networks – “TENs” – covering a taxonomic group such as family or order) and actively managed by those who have compiled and contributed the data it includes. Full credit and acknowledgement will be given to the original sources, allowing users to refer back to the primary data. A strength of the project is that it is led and endorsed by a global consortium of more than 40 leading botanical institutions worldwide. A first milestone for producing the World Flora Online is to be accomplished by the end of 2020, but the WFO Consortium is committed to continuing the WFO programme beyond 2020 when it will develop its full impact as the authoritative source of information on the world's plant biodiversity.
Plants, fungi and algae are important components of global biodiversity and are fundamental to all ecosystems. They are the basis for human well-being, providing food, materials and medicines. Specimens of all three groups of organisms are accommodated in herbaria, where they are commonly referred to as botanical specimens. The large number of specimens in herbaria provides an ample, permanent and continuously improving knowledge base on these organisms and an indispensable source for the analysis of the distribution of species in space and time critical for current and future research relating to global biodiversity. In order to make full use of this resource, a research infrastructure has to be built that grants comprehensive and free access to the information in herbaria and botanical collections in general. This can be achieved through digitization of the botanical objects and associated data. The botanical research community can count on a long-standing tradition of collaboration among institutions and individuals. It agreed on data standards and standard services even before the advent of computerization and information networking, an example being the Index Herbariorum as a global registry of herbaria helping towards the unique identification of specimens cited in the literature. In the spirit of this collaborative history, 51 representatives from 30 institutions advocate starting the digitization of botanical collections with the wall-to-wall digitization of the flat objects stored in German herbaria. Germany has 70 herbaria holding almost 23 million specimens according to a national survey carried out in 2019. 87% of these specimens are not yet digitized. Experiences from other countries like France, the Netherlands, Finland, the US and Australia show that herbaria can be comprehensively and cost-efficiently digitized in a relatively short time due to established workflows and protocols for the high-throughput digitization of flat objects. 
Most of the herbaria are part of a university (34), fewer belong to municipal museums (10) or state museums (8), six herbaria belong to institutions also supported by federal funds such as Leibniz institutes, and four belong to non-governmental organizations. A common data infrastructure must therefore integrate different kinds of institutions. Making full use of the data gained by digitization requires the set-up of a digital infrastructure for storage, archiving, content indexing and networking as well as standardized access for the scientific use of digital objects. A standards-based portfolio of technical components has already been developed and successfully tested by the Biodiversity Informatics Community over the last two decades, comprising among others access protocols, collection databases, portals, tools for semantic enrichment and annotation, international networking, storage and archiving in accordance with international standards. This was achieved through funding from national and international programs and initiatives, which also paved the way for the German contribution to the Global Biodiversity Information Facility (GBIF). Herbaria constitute a large part of the German botanical collections that also comprise living collections in botanical gardens and seed banks, DNA and tissue samples, specimens preserved in fluids or on microscope slides and more. Once the herbaria are digitized, these resources can be integrated, adding to the value of the overall research infrastructure. The community has agreed on tasks that are shared between the herbaria, as the German GBIF model already successfully demonstrates. We have compiled nine scientific use cases of immediate societal relevance for an integrated infrastructure of botanical collections. 
They address accelerated biodiversity discovery and research, biomonitoring and conservation planning, biodiversity modelling, the generation of trait information, automated image recognition by artificial intelligence, automated pathogen detection, contextualization by interlinking objects, enabling provenance research, as well as education, outreach and citizen science. We propose to start this initiative now in order to valorize German botanical collections as a vital part of a worldwide biodiversity data pool.
Several applications currently developed in our group and by our cooperators deal with multiple sequence alignments (MSAs) or associated raw data and metadata, and allow the user to view and edit them in a graphical user interface (GUI). Instead of implementing independent solutions for these different tasks, we decided to create a library containing powerful and reusable common GUI components. Since this library is open source (GNU GPL 3) it can be used and extended by other researchers, who are then able to focus on the core functionality of their applications, but can still provide a user-friendly GUI. Besides components allowing the display and editing of MSAs, several types of data (e.g. trace files, comments, statistical sequence information, positions of tandem repeats, hairpins or inversions) can be attached either to single sequences or to the alignment as a whole. All these data views implement a common interface that makes it easy for developers to create new custom views. Several components from our library (e.g. displaying different types of data) can be connected to each other in the application they are embedded in, so that the user can scroll through one of them while all others will automatically display the data associated with the current position. LibrAlign is fully interoperable with the BioJava API and all components are provided in a native Swing and a native SWT version (the two major GUI frameworks for Java), so that they can be integrated into any Java GUI application, including projects based on the Eclipse Rich Client Platform or Bioclipse. Several software projects based on LibrAlign are currently in development inside and outside our group. 
Among those are (i) the Taxonomic Editor of the EDIT platform, which is being extended to support sequence- and alignment-associated data for the Campanula portal of EDIT, (ii) a new version of the alignment editor PhyDE, (iii) AlignmentComparator (an application to visualize differences between alternative automatic and manual alignments, which we currently use in a study investigating the influence of manual alignment corrections on phylogenetic studies), and (iv) HIR-Finder (an application which locates microstructural mutations like tandem repeats possibly associated with hairpins). LibrAlign download:
Taxonomic information models
The first step in the implementation of a database-driven application is the definition of an appropriate information model, which has to be complex enough to meet the needs of the application and at the same time simple enough to be usable (GÜNTSCH & al., 2002). The taxonomic model has to incorporate nomenclatural rules and the traditional taxonomic relationships (synonymy, taxonomic hierarchy, etc.). In addition, it has to be capable of representing different taxonomic views in order to enable the system to express arbitrary relationships between potential taxa. The solution presented here is based on the IOPI model (BERENDSOHN, 1997), but the process of implementation has led to several changes in the overall design. Other concept-oriented models published over the past 6 years are cited by GEOFFROY & BERENDSOHN (2003a). The Berlin Model addresses botanical data, but should serve for zoology as well, with some changes in the names section and the composition of nomenclatural reference citations. Because this is a physical model (i.e. the actual database design used in the implementation), the possibility of future changes to the design presented here cannot be excluded. These are and will be documented in the databased documentation attached to BERENDSOHN & al. (2002) on the WWW. That documentation also provides links to the different projects using the Berlin Model (among others, Euro+Med, IOPI / EuroCAT, Med-Checklist, the Dendroflora of El Salvador and AlgaTerra). The core model covers nomenclatural relationships, potential taxa and their relationships, bibliographical information, and a general structure for factual data. The core model is extensible in order to meet specific project requirements by means of adding further entities and relationships. Nomenclatural type designation, for example, is a central subject of the AlgaTerra project and is thus covered in a model extension (see KUSBER & al., 2003). 
For pragmatic reasons it was decided to base further specification on a relational model for the underlying database. There are clear advantages in other data models, but with the general aim of realising an implementation in the near future, the choice of using a relational model was based on the assumption that – for some time to come – relational database management systems (DBMS) will remain the standard tool for data storage. The DBMS used must be capable of processing stored procedures, functions, and triggers so that maximum integrity of taxonomic data can be achieved at database level. An MS SQL-Server 2000 database has been implemented as a documentation database, serving to store a model implementation of all core and extension tables, a reservoir for program elements related to the model (triggers, user defined functions, stored procedures), and to manage the documentation of the tables and attributes. Documentation of the core model as well as existing extensions is available on-line (BERENDSOHN & al., 2002), the list of tables and attributes being generated dynamically from the documentation database.
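The relational, concept-oriented design described above can be illustrated with a minimal sketch. It is not the Berlin Model itself: the table names, column names and relationship kinds are hypothetical, the data are placeholders, and SQLite is used only because it ships with Python (the abstract names MS SQL-Server 2000). A potential taxon is read here as a name as used in one particular reference, which is what allows two treatments to disagree about the same name.

```python
import sqlite3

# Hypothetical, heavily simplified schema. Table and column names are
# illustrative assumptions and do not reproduce the Berlin Model.
SCHEMA = """
CREATE TABLE reference (
    ref_id   INTEGER PRIMARY KEY,
    citation TEXT NOT NULL
);
CREATE TABLE name (
    name_id   INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL
);
-- A potential taxon: one name as used in one reference.
CREATE TABLE potential_taxon (
    taxon_id INTEGER PRIMARY KEY,
    name_id  INTEGER NOT NULL REFERENCES name(name_id),
    ref_id   INTEGER NOT NULL REFERENCES reference(ref_id),
    UNIQUE (name_id, ref_id)
);
CREATE TABLE taxon_relationship (
    from_taxon INTEGER NOT NULL REFERENCES potential_taxon(taxon_id),
    to_taxon   INTEGER NOT NULL REFERENCES potential_taxon(taxon_id),
    kind       TEXT NOT NULL
               CHECK (kind IN ('is_synonym_of', 'is_included_in'))
);
"""


def build_demo_db():
    """Create an in-memory database holding two placeholder treatments."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO reference VALUES (?, ?)",
                     [(1, "Treatment A"), (2, "Treatment B")])
    conn.executemany("INSERT INTO name VALUES (?, ?)",
                     [(1, "Aus bus"), (2, "Aus cus")])
    # Both names are used in both treatments, giving four potential taxa.
    conn.executemany("INSERT INTO potential_taxon VALUES (?, ?, ?)",
                     [(1, 1, 1), (2, 2, 1), (3, 1, 2), (4, 2, 2)])
    # Only Treatment B sinks "Aus cus" into the synonymy of "Aus bus".
    conn.execute(
        "INSERT INTO taxon_relationship VALUES (4, 3, 'is_synonym_of')")
    conn.commit()
    return conn


def synonyms_in_view(conn, citation):
    """Return (synonym, accepted name) pairs asserted by one reference."""
    rows = conn.execute(
        """
        SELECT syn_name.full_name, acc_name.full_name
        FROM taxon_relationship AS rel
        JOIN potential_taxon AS syn ON syn.taxon_id = rel.from_taxon
        JOIN potential_taxon AS acc ON acc.taxon_id = rel.to_taxon
        JOIN name AS syn_name ON syn_name.name_id = syn.name_id
        JOIN name AS acc_name ON acc_name.name_id = acc.name_id
        JOIN reference AS ref ON ref.ref_id = syn.ref_id
        WHERE rel.kind = 'is_synonym_of' AND ref.citation = ?
        ORDER BY syn_name.full_name
        """,
        (citation,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    for view in ("Treatment A", "Treatment B"):
        print(view, synonyms_in_view(db, view))
```

Keeping the relationship between potential taxa rather than between bare names is the design choice that lets differing taxonomic views coexist in one database. The integrity rules that the abstract assigns to triggers and stored procedures are reduced here to simple column constraints.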
Technical Report
Version 2.06 at: Archived version in Github:
Monographs are fundamental for progress in systematic botany. They are the vehicles for circumscribing and naming taxa, determining distributions and ecology, assessing relationships for formal classification, and interpreting long-term and short-term dimensions of the evolutionary process. Despite their importance, fewer monographs are now being prepared by the newer generation of systematic botanists, who are understandably involved principally with DNA data and analysis, especially for answering phylogenetic, biogeographic, and population genetic questions. As monographs provide hypotheses regarding species boundaries and plant relationships, new insights in many plant groups are urgently needed. Increasing pressures on biodiversity, especially in tropical and developing regions of the world, emphasize this point. The results from a workshop (with 21 participants) reaffirm the central role that monographs play in systematic botany. But, rather than advocating abbreviated models for monographic products, we recommend a full presentation of relevant information. Electronic publication offers numerous means of illustration of taxa, habitats, characters, and statistical and phylogenetic analyses, which previously would have been prohibitively costly. Open Access and semantically enhanced linked electronic publications provide instant access to content from anywhere in the world, and at the same time link this content to all underlying data and digital resources used in the work. Resources in support of monography, especially databases and widely and easily accessible digital literature and specimens, are now more powerful than ever before, but interfacing and interoperability of databases are much needed. Priorities for new resources to be developed include an index of type collections and an online global chromosome database. Funding for sabbaticals for monographers to work uninterrupted on major projects is strongly encouraged. 
We recommend that doctoral students be assigned smaller genera, or natural portions of larger ones (subgenera, sections, etc.), to gain the necessary expertise for producing a monograph, including training in a broad array of data collection (e.g., morphology, anatomy, palynology, cytogenetics, DNA techniques, ecology, biogeography), data analysis (e.g., statistics, phylogenetics, models), and nomenclature. Training programs, supported by institutes, associations, and agencies, provide means for passing on procedures and perspectives of challenging botanical monography to the next generation of young systematists.
This website contains information on all species and subspecies of vascular plants occurring in Cyprus, one of the hotspots of Mediterranean biodiversity. The flora comprises 1649 indigenous taxa (species and subspecies), 254 introduced taxa occurring in the wild, 43 hybrids and 81 species with unclear status (as at March 2019). The website brings together data from authoritative sources and will be updated continuously.
Currently available:
• complete checklist of vascular plants occurring in Cyprus
• selection of recently used synonyms
• basic information on endemism and status of occurrence
• distribution maps showing occurrence in the 8 phytogeographical divisions
• documentation of specimen-based literature records in the 8 divisions
• data on the altitudinal range
• threat categories according to the Red Data Book of the flora of Cyprus
• photographs for the majority of taxa
• data on chromosome numbers
• morphological descriptions of selected taxa
• keys for all species and subspecies
• bibliography
• common names of selected taxa
Planned for the near future:
• specimen data
• data on traits
• statistics
• a print version of the checklist
Long-term perspective:
• complete synonymy and nomenclatural references for all taxa
• developing an online flora of Cyprus
How to cite us: Hand R., Hadjikyriakou G. N. & Christodoulou C. S. (ed.) 2011– (continuously updated): Flora of Cyprus – a dynamic checklist. Published at; accessed [date]