A Clustering Optimization Strategy for Molecular Taxonomy Applied to Planktonic Foraminifera SSU rDNA.

Markus Göker, Guido W Grimm, Alexander F Auch, Ralf Aurahs, Michal Kučera

German Collection of Microorganisms and Cell Cultures (DSMZ), Inhoffenstraße 7B, 38124 Braunschweig, Germany.

Journal Article: Evolutionary bioinformatics online (impact factor: 1.89). 01/2010; 6:97-112.

Abstract

Identifying species is challenging in the case of organisms for which primarily molecular data are available. Even if morphological features are available, molecular taxonomy is often necessary to revise taxonomic concepts and to analyze environmental DNA sequences. However, clustering approaches to delineate molecular operational taxonomic units often rely on arbitrary parameter choices. Also, distance calculation is difficult for highly alignment-ambiguous sequences. Here, we applied a recently described clustering optimization method to highly divergent planktonic foraminifera SSU rDNA sequences. We determined the distance function and the clustering setting that result in the highest agreement with morphological reference data. Alignment-free distance calculation, when adapted to the use with partly non-homologous sequences caused by distinct primer pairs, outperformed multiple sequence alignment. Clustering optimization offers new perspectives for the barcoding of species diversity and for environmental sequencing. It bridges the gap between traditional and modern taxonomic disciplines by specifically addressing the issue of how to optimally account for both genetic divergence and given species concepts.

Source: PubMed

Evolutionary Bioinformatics 2010:6 97–112
This article is available from http://www.la-press.com.
© the author(s), publisher and licensee Libertas Academica Ltd.
This is an open access article. Unrestricted non-commercial use is permitted provided the original work is properly cited.
A Clustering Optimization Strategy for Molecular Taxonomy Applied to Planktonic Foraminifera SSU rDNA
Markus Göker1, Guido W. Grimm2, Alexander F. Auch3, Ralf Aurahs4 and Michal Kučera4
1German Collection of Microorganisms and Cell Cultures (DSMZ), Inhoffenstraße 7B, 38124 Braunschweig, Germany.
2Swedish Museum of Natural History, Box 50007, Stockholm, Sweden.
3Center for Bioinformatics Tübingen, Eberhard Karls University of Tübingen, Sand 14, 72076 Tübingen, Germany.
4Institute of Geosciences, Eberhard Karls University of Tübingen, Sigwartstraße 10, 72076 Tübingen, Germany.
Corresponding author email: markus.goeker@dsmz.de
Keywords: automated taxonomy, linkage clustering, parameter optimization, planktonic foraminifera, SSU rDNA
Introduction
A reliable taxonomy is crucial for the assessment of biodiversity and for the comparison of habitats based on their species composition. However, delimiting taxa is challenging in the case of organisms for which (almost) exclusively molecular data are available, even in the case where robust phylogenetic hypotheses can be inferred. For the species delimitation in microorganisms such as bacteria, fungi, and many other unicellular eukaryotes, only a few diagnostic characters may be present, and an increasing number of such organisms are only known by their DNA sequences.1–9 Even in the case of organisms with well-established phenotypic characteristics, molecular taxonomy is necessary to validate established species concepts and identify those that require a taxonomic revision. Molecular data are also essential to detect so-called cryptic species (or pseudocryptic species),10 ie, species for which no morphological differences exist (or have not been determined so far). Finally, molecular taxonomy is needed to analyze sequences that have been directly sampled from their natural environment, eg, in the context of metagenomics projects.11,12 Despite its obvious utility in a number of cases, the entire concept of molecular taxonomy has been intensively debated in the literature, particularly regarding DNA barcoding.13–18 The basic question is whether or not morphological and molecular data can be combined in an objective and reproducible way for taxonomic purposes. Is it possible to devise tools for (molecular) identification of taxonomic units that reflect morphology-based taxonomic concepts?
For sequence data-based species delimitation, researchers mostly use a predefined threshold T for pairwise genetic distances in clustering algorithms to assign sequences to molecular operational taxonomic units.1–3,5,8,9 Values of T used for clustering differ in the literature, even if applied to the same groups of organisms and molecular markers,4,6,7,9 and are often based on subjective criteria or on a recently emerged tradition for the sake of comparability between studies. However, the number and the content of the obtained clusters vary greatly with T (see ref. 19 and below). In addition to T, the clustering algorithm also affects the circumscription and the shape of the clusters formed.20(192) In the context of linkage clustering, a link is defined as a pairwise distance shorter than or equal to the chosen threshold T. To add a new object to a given cluster, one can require that at least one distance to the cluster members is a link (single linkage; F = 0.0), that all distances are links (complete linkage; F = 1.0), or that any proportion F of the distances in between are links (see overview in ref. 20). However, F has hardly been addressed in the recent literature on molecular taxonomy. For instance, Meier et al21 regarded the clustering of triplets of sequences as “logically inconsistent” if only two of the three distances are links. However, values of F smaller than 1.0 are well established in the clustering literature.19 An apparent solution for this inconsistency was to explicitly specify F.19 For a given T, mean and maximum within-cluster distances may be much larger for small values of F,20(192) even though this becomes relevant in cases where genetic divergence differs between morphologically defined lineages.22 Methods more advanced than linkage clustering have been suggested,23,24 but these focus on identification, ie, the assignment of query sequences to predefined groups, and thus require a correct reference taxonomy. However, misidentifications even of organisms with well-established microscopical characteristics are possible, and sequences in public databases are frequently mislabeled.25 Thus, it is obvious that methods are needed that can adapt molecular taxonomy to reference data based on traditional taxonomy, without requiring that the latter is 100% correct.
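The roles of T and F in linkage clustering can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the greedy merge order, the function name `linkage_clustering` and the rule that at least one link is always required are assumptions made for illustration only.

```python
from itertools import combinations


def linkage_clustering(dist, threshold, fraction):
    """Greedy non-hierarchical linkage clustering (illustrative sketch).

    dist      -- symmetric matrix (list of lists) of pairwise distances
    threshold -- T: a pair is a "link" if its distance is <= T
    fraction  -- F: required proportion of links between two clusters
                 (0.0 = single linkage, 1.0 = complete linkage)
    """
    clusters = [[i] for i in range(len(dist))]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(clusters)), 2):
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            links = sum(1 for i, j in pairs if dist[i][j] <= threshold)
            # Assumption: at least one link is always needed; F raises the bar.
            if links >= 1 and links >= fraction * len(pairs):
                clusters[a] = clusters[a] + clusters[b]
                del clusters[b]
                merged = True
                break  # restart the scan after every merge
    return sorted(sorted(c) for c in clusters)


# Toy example: four objects placed on a line at 0, 1, 2 and 10.
points = [0, 1, 2, 10]
toy_dist = [[abs(p - q) for q in points] for p in points]
single = linkage_clustering(toy_dist, threshold=1, fraction=0.0)
complete = linkage_clustering(toy_dist, threshold=1, fraction=1.0)
```

With T = 1, single linkage chains the first three objects into one cluster, whereas complete linkage keeps the third object apart because its distance to the first object is not a link. For intermediate values of F the result of such a greedy scheme can depend on the merge order.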
A recently introduced method, clustering optimization,19 allows one to obtain taxonomic units from non-hierarchical clustering that are in optimal agreement with a given reference dataset. Reference data can be derived from traditional taxonomy. For instance, the morphology-based species identification of specimens results in a partition, ie, a non-overlapping, non-hierarchical division, of the objects (specimens). In fact, every biological classification which comprises only a single taxonomic rank represents a partition. Because the non-hierarchical clustering of the sequences also results in a partition, a metric for the disagreement between two partitions allows one to determine those clustering parameters T and F that result in the highest agreement between the clustering partition and the reference data. Because clustering optimization does not require full agreement between the partitions, it is suitable for biological datasets in which the reference partition may contain errors due to misidentification or current taxon boundaries that do not fully reflect the natural history of the organisms. This principle can be extended to more than two parameters to be optimized, for instance by also optimizing the inference of the distance matrix to which the clustering is applied.
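The principle can be sketched in Python as an exhaustive search over distance matrices, T and F that scores each clustering against the reference partition. This is a hedged illustration only: it assumes the chance-corrected Rand index of Hubert and Arabie as the agreement measure, and the names `adjusted_rand_index` and `grid_search`, as well as the `cluster(dist, t, f)` callable interface, are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must cover the same objects")
    total = comb(len(labels_a), 2)
    if total == 0:
        return 1.0  # fewer than two objects: trivially identical
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / total
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are all singletons
    return (sum_cells - expected) / (maximum - expected)


def grid_search(cluster, distance_matrices, reference, t_values, f_values):
    """Return (score, matrix name, T, F) with the best agreement.

    cluster           -- callable (dist, t, f) -> list of cluster labels,
                         one per object (hypothetical interface)
    distance_matrices -- dict mapping a name to a distance matrix
    reference         -- list of reference labels, one per object
    """
    best = None
    for name, dist in distance_matrices.items():
        for t in t_values:
            for f in f_values:
                score = adjusted_rand_index(cluster(dist, t, f), reference)
                if best is None or score > best[0]:
                    best = (score, name, t, f)
    return best
```

Ties are resolved in favour of the first setting encountered, which is a simplification; an actual optimization may report the whole range of equally good settings.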
Calculating distances may be difficult because of alignment ambiguity,26,27 particularly in the case of highly divergent markers. For example, in the case of our target organisms, planktonic foraminifera, approximately 50% of the 3’ part of the small subunit ribosomal DNA (SSU rDNA) represented in most published fragments can be aligned across all lineages but comprises limited phylogenetic signal; the signal contained in the highly length-polymorphic, extremely divergent and generally “nonalignable” regions of the multiple sequence alignment (MSA) is lost.28 MSA-free sequence comparison methods have been suggested (which may be based on pairwise alignment).29–31 Even though some of them are very fast, they have never been used in molecular taxonomy. This is despite the advantages of such methods in an era of rapidly advancing DNA sequencing technologies and the thus exponentially increasing amount of molecular data.32
Modern planktonic foraminifera (PF) are classified into about 50 species based on the morphology of their calcite shells (termed “morphotaxonomy” in the following), so that the paleontological taxonomy of this group is consistent with that of the living species.33 Their shells accumulate in huge quantities on the sea floor, making their fossil record one of the most complete and continuous of all organisms and PF one of the most important proxies in paleoclimatology (eg, refs. 34,35). However, proxies for past-ocean properties are empirically derived and require species-specific calibrations. Therefore, correct assessment of species taxonomy, ecology and biogeography is essential for reliable reconstructions. PF SSU rDNA is characterized by generally higher substitution rates than in many other groups, making it, unusually, a suitable marker for genetic diversity below the level of morphological species (eg, ref. 36). The distinct genetic types found within many PF morphospecies (reviewed in refs. 37–39) could be considered as biological species, since they do not show any signs of introgression or interbreeding, and are often restricted to certain oceanic regimes and areas.38,40–42 However, until now this cryptic diversity has been used to arbitrarily define and label “genotypes” (eg, ref. 37). Because established morphospecies are not genetically uniform, there is an urgent need for standardization.
We here apply clustering optimization in three dimensions (T, F, and distance function) to PF SSU rDNA sequences and their currently accepted taxonomy. To cope with alignment ambiguity, we apply MSA-free distance methods, which we improve for use with partial sequences. Optimal settings for both clustering parameters and distance functions are then used to define taxonomic units. As in a previous study,19 resampling and permutation techniques are applied to determine the robustness of the optimization regarding taxon sampling and errors in the reference partition. The outcome is discussed regarding current species concepts for PF and the general applicability of our methods for combining morphological and molecular data in an objective and reproducible way for taxonomic purposes and for automated, sequence-based identification.
Material and Methods
Data sources and data preparation
The dataset comprised 299 (mostly partial) sequences from the 3’ end of the PF SSU rDNA. 146 of these sequences were recently published and have been obtained from specimens collected in the Northeast Atlantic Ocean and the Mediterranean Sea in the course of the study of Aurahs et al.28 The remaining ones were downloaded on 28/01/2008 from the GenBank database (http://www.ncbi.nlm.nih.gov/). Taxonomic information for clustering optimization was extracted from the GenBank flat files using the program GBK2FAS,19 which is freely available at http://www.goeker.org/mg/clustering/. To optimize for the agreement with morphotaxonomy, the species affiliations of corresponding specimens were taken as reference partition. The PF taxonomy used in the “organism” identifiers of the GenBank accessions follows morphotaxonomy, with two exceptions. In the case of Globigerinella siphonifera and Orbulina universa s.l. (including “Orbulina sp.”), a unique “organism” is not present, but highly similar or identical sequences are partly denoted with different names for genotypes (G. siphonifera) or individuals (Orbulina) although they belong to the same morphospecies in the original literature.38,41,43 Vice versa, some “organism” names include significantly divergent SSU fragments. Therefore, to obtain a reference partition consistently based on morphospecies, we assigned these accessions to either “G. siphonifera” or “Orbulina sp.” The downloaded GenBank data contained 60 distinct “organism” names, which we corrected down to 23 distinct morphospecies by removing the parts after the epithet. The morphotaxonomic reference data of the 146 specimens published in ref. 28 rely on the expertise of the original collectors. The total reference partition comprised 27 reference taxa, as documented in detail in the electronic supplementary material (ESM; File 1).
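The reduction of “organism” names to morphospecies can be sketched as follows. This is only an illustration of the "remove the parts after the epithet" step, not the GBK2FAS program: the function names, the accession identifiers and the genotype suffixes in the example records are hypothetical, and the manual reassignment of the G. siphonifera and Orbulina accessions described above is not covered.

```python
def morphospecies(organism):
    """Keep genus and epithet, dropping anything after the epithet."""
    parts = organism.split()
    if len(parts) < 2:
        raise ValueError(f"no epithet in organism name: {organism!r}")
    return " ".join(parts[:2])


def reference_partition(organisms):
    """Map each accession to its morphospecies label."""
    return {acc: morphospecies(name) for acc, name in organisms.items()}


# Hypothetical records for illustration; not taken from the dataset.
records = {
    "ACC0001": "Globigerina bulloides type Ia",
    "ACC0002": "Globigerina bulloides type IIb",
    "ACC0003": "Orbulina universa",
}
partition = reference_partition(records)
```

In this toy input, three distinct “organism” strings collapse into two reference taxa.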
Distance calculation
Distances between SSU rDNA sequences were computed using the MSA-free method GBDP (“genome BLAST distance phylogeny”),44–46 which had been applied to whole genomes of prokaryotes45,46 and organelles44 and here was adapted for use with single sequence regions (“gene BLAST distance phylogeny”). GBDP applies BLAST47 to identify local regions of high sequence similarity, “high-scoring segment pairs” (HSPs), between two sequences. Among the formulae for inferring distances from BLAST results,44–46 the following one performed best in recovering evolutionary relationships:44,46

$$D(x,y) := 1 - \frac{I_{xy} + I_{yx}}{\lambda(x,y)} \qquad (1)$$
D(x,y) denotes the distance between sequences x and y, and Ixy denotes the sum of the number of identical base pairs over all HSPs obtained by using x as the query and y as the subject sequence for BLAST. In the case of whole genomes, the denominator λ(x,y) can correspond to the average length of both sequences45 (Formula 2), but here was corrected for the use with single gene regions.
MSA-independent phylogenetic inferences can suffer from limitations that are not present in MSA-based approaches. Most importantly, evolutionary relationships should be inferred from homologous characters only (refs. 48(96), 49(63), 50(120), among others). Let “fragment homology” denote the situation in which two sequence fragments are, as a whole, homologous to each other. In the case of MSA, fragment homology is established implicitly by establishing the homology of individual nucleotide (or amino acid) residues and their non-homology by the insertion of gaps. Although single-gene data have been amplified from homologous gene regions, they can violate the fragment homology condition. For instance, Figure 1 shows three HSPs (gray boxes) between sequences x and y. While nucleotide (or amino acid) homology is established within these HSPs, the fragment homology of a non-HSP region enclosed between two adjacent HSPs is established as long as the insertion of, eg, a novel protein domain can be ruled out. In the case of foraminifer SSU rDNA, regions between HSPs are, of course, much more likely to be due to high evolutionary rates.28,36,51 In contrast, leading and trailing non-HSP regions may as well be caused by the use of distinct primer pairs; fragment homology is not necessarily given in this case. Leading and trailing gaps in an alignment often represent missing data and not evolutionary events.52 Because MSA-independent methods treat entire sequences as single characters, fragment homology is likely to matter. The robustness of MSA-free sequence comparison against the violation of the fragment homology assumption has, to the best of our knowledge, not yet been examined in simulation or empirical studies. The problem is also present in PF SSU data; for instance, most short amplicons of “Orbulina sp.” from GenBank corresponded only to parts of the few long amplicons of “Orbulina universa”.
Figure 1. Corrected alignment-free distance formulae. [Schematic of sequences x and y with the positions fx, Fx, Lx, lx and fy, Fy, Ly, ly marked.]
Notes: How to correct GBDP for the violation of fragment homology (trimmed sequence ends). Symbols used: x and y, sequences; grey boxes, location of HSPs; fx, first position; Fx, first position within the first HSP; Lx, last position within the last HSP; and lx, globally last position, within sequence x; fy, Fy, Ly, and ly are defined analogously. Without background information, fragment homology is only established explicitly between the Fx–Lx part of x and the Fy–Ly part of y. If the sequences violate the fragment homology condition, using the full sequence lengths in the denominator (λ0; Formula 2) will thus overestimate the number of base pairs that can be compared in a biologically meaningful way. The corresponding distances that use λ0 will thus be overestimated (Formula 1). The modifications of the denominator in formulae 3 and 4 correct the distances that use λ1 and λ2 downwards.
To correct for a potential length artifact on D(x,y), two modifications of the denominator λ(x,y) in Formula 1 (see also Formula 2) were applied (Fig. 1; Formulae 3 and 4). The uncorrected mean length λ0 of two sequences, x and y, is given by:
$$\lambda_0(x,y) := \frac{l_x + l_y}{2} \qquad (2)$$
which has been applied to complete genomes. We hypothesize that this denominator must be corrected downwards in the case of strong deviations from fragment homology because otherwise the range of base pairs that can be compared in a biologically sensible way is overestimated.44–46 Correction λ1 is shown in Formula 3:
$$\lambda_1(x,y) := \frac{L_x - F_x + L_y - F_y}{2} + 1 \qquad (3)$$
The meaning of Lx and Fx is explained in Figure 1. That is, the lengths of the fragment-homologous parts of the sequences are estimated as the range between the first position in the first HSP and the last position in the last HSP (inclusively). Here, character homology, which is established within HSPs, is used to estimate the homology of whole sequence fragments (both within and outside HSPs). Formula 4 introduces a correction (λ2), which is intermediary between Formulae 2 and 3:
$$\lambda_2(x,y) := \frac{L_x - F_x + L_y - F_y}{2} + 1 + \min(F_x, F_y) + \min(l_x - L_x,\, l_y - L_y) \qquad (4)$$
Here, the shorter of the sequence sections before the first HSP in each sequence is considered as part of the homologous fragment, as well as the shorter of the sequence sections after the last HSP (see Fig. 1). Importantly, we here do not attempt to demonstrate that the above-mentioned formulae necessarily result in distance metrics in a mathematical sense; for instance, they might violate the triangle inequality.53 However, the same holds for distances derived from MSA, which apparently does not limit their usability for phylogenetic inference.54(158) W. Gish’s implementation of BLAST (http://blast.wustl.edu/) was run with a word length of 4 and without the use of the low-complexity filter. The GBDP program is freely available at http://www.auch-edv.de/GBDP/.
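The three denominators can be illustrated with a short Python sketch that follows the verbal definitions above rather than the GBDP program itself. It covers only the length terms, not the distance D. The class `HspSpan`, the function names and the use of inclusive coordinates with an explicit first position f are assumptions; the additive constants in the printed formulae depend on the coordinate convention, which is why flank lengths are computed relative to f here.

```python
from dataclasses import dataclass


@dataclass
class HspSpan:
    """Positions of one sequence as in Figure 1 (inclusive coordinates assumed)."""
    first: int      # f: first position of the sequence
    hsp_start: int  # F: first position within the first HSP
    hsp_end: int    # L: last position within the last HSP
    last: int       # l: globally last position


def full_length(s):
    return s.last - s.first + 1


def hsp_range(s):
    return s.hsp_end - s.hsp_start + 1


def lambda0(x, y):
    """Uncorrected mean length of both sequences."""
    return (full_length(x) + full_length(y)) / 2


def lambda1(x, y):
    """Mean length of the ranges spanned by the first to the last HSP."""
    return (hsp_range(x) + hsp_range(y)) / 2


def lambda2(x, y):
    """lambda1 plus the shorter leading and the shorter trailing section."""
    leading = min(x.hsp_start - x.first, y.hsp_start - y.first)
    trailing = min(x.last - x.hsp_end, y.last - y.hsp_end)
    return lambda1(x, y) + leading + trailing


# Toy example: y extends further than x beyond the HSPs on both sides.
x = HspSpan(first=1, hsp_start=11, hsp_end=90, last=100)
y = HspSpan(first=1, hsp_start=21, hsp_end=100, last=130)
values = (lambda1(x, y), lambda2(x, y), lambda0(x, y))
```

Under this convention the intermediate correction always lies between the other two, and all three coincide when the HSPs cover both sequences completely.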
The eleven MSAs from Aurahs et al28 were used, inferred with six different software packages, CLUSTALW version 2.0 (refs. 55,56), KALIGN v. 2.03 (ref. 57), MAFFT v. 6.24 (ref. 58), MUSCLE v. 3.7 (ref. 59), the NRALIGN derivative of MUSCLE (ref. 60), and POA v. 2.0 (ref. 61), using the respective default parameters. POA was also run in global scoring mode (command-line switch -do_global; henceforth referred to as POAGLO), CLUSTALW also with the gap parameters optimized for RNA alignments (abbreviated CLWOPT; ref. 62), and MAFFT also in EINSI, GINSI and LINSI running modes. Distances were inferred from the alignments with PAUP* version 4b10 (ref. 63), using the following formulae: uncorrected (“P”) distances; JC; F81; K2P; F84; K3P; TamNei; GTR; and LogDet (see ref. 64 for a survey of these distance methods). As far as possible (ie, except for P and LogDet distances), we combined the formulae not only with equal, but also with gamma (Γ) distributed substitution rates, using an alpha parameter of 0.5 (ref. 64). Distances were also calculated under the maximum likelihood (ML) criterion with RAxML version 7.04 (refs. 65,66) and GTR+Γ as model. Accordingly, 198 MSA-based distance approaches were subjected to clustering optimization in the same way as the GBDP formulae.
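As an illustration of the simplest of these alignment-based distances, the following Python sketch computes the uncorrected P distance and the Jukes-Cantor (JC) correction with equal or gamma-distributed rates. It uses the standard textbook forms of the JC formulae and is not PAUP*; skipping all columns that contain a gap (pairwise deletion) and the function names are assumptions.

```python
import math


def p_distance(a, b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap in either sequence are skipped (an assumption).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(s, t) for s, t in zip(a, b) if s != gap and t != gap]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(s != t for s, t in compared) / len(compared)


def jukes_cantor(p, alpha=None):
    """JC distance from a P distance; alpha is the gamma shape, None for equal rates."""
    base = 1.0 - 4.0 * p / 3.0
    if base <= 0.0:
        return math.inf  # saturation: the correction is undefined
    if alpha is None:
        return -0.75 * math.log(base)
    return 0.75 * alpha * (base ** (-1.0 / alpha) - 1.0)


p = p_distance("ACGT-A", "ACGAGA")
d_equal = jukes_cantor(p)
d_gamma = jukes_cantor(p, alpha=0.5)
```

With an alpha parameter of 0.5, as used above, the gamma-corrected distance is larger than the equal-rates distance for the same P distance.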
Clustering optimization
Clustering optimization was conducted as previously described19 and as implemented in the program OPTSIL, freely available at http://www.goeker.org/mg/clustering/. Values of T and F were varied between 0.0 and 1.0, with a step width of 0.0001 (T) or 0.05 (F), and the resulting agreement with the reference partition, measured using the Modified Rand Index (MRI), was recorded. Cluster affiliations from the globally optimal clustering were mapped on a NJ tree (ref. 67) inferred from the best distances with PAUP* version 4b10 (ref. 63). To assess the stability of our method, we applied taxon jackknifing.19 Within each jackknifing replicate, a defined proportion of randomly selected sequences is removed before optimizing the parameters. We here assessed removal of 5% to 50% of the sequences, using a step width of 5%, and reported the respective range of optimal clustering parameters. In theory, each resulting cluster defines a TU equaling a morphospecies. However, because
