Digital DNA-DNA hybridization for microbial species delineation by means of genome-to-genome sequence comparison.

Alexander F Auch, Mathias von Jan, Hans-Peter Klenk, Markus Göker

Journal Article: Standards in genomic sciences 01/2010; 2(1):117-34. DOI: 10.4056/sigs.531120

Abstract

The pragmatic species concept for Bacteria and Archaea is ultimately based on DNA-DNA hybridization (DDH). While enabling the taxonomist, in principle, to obtain an estimate of the overall similarity between the genomes of two strains, this technique is tedious and error-prone and cannot be used to incrementally build up a comparative database. Recent technological progress in the area of genome sequencing calls for bioinformatics methods to replace the wet-lab DDH by in-silico genome-to-genome comparison. Here we investigate state-of-the-art methods for inferring whole-genome distances in their ability to mimic DDH. Algorithms to efficiently determine high-scoring segment pairs or maximally unique matches perform well as a basis of inferring intergenomic distances. The examined distance functions, which are able to cope with heavily reduced genomes and repetitive sequence regions, outperform previously described ones regarding the correlation with and error ratios in emulating DDH. Simulation of incompletely sequenced genomes indicates that some distance formulas are very robust against missing fractions of genomic information. Digitally derived genome-to-genome distances show a better correlation with 16S rRNA gene sequence distances than DDH values. The future perspectives of genome-informed taxonomy are discussed, and the investigated methods are made available as a web service for genome-based species delineation.

Source: PubMed

Standards in Genomic Sciences (2010) 2:117-134 DOI:10.4056/sigs.531120
The Genomic Standards Consortium
Alexander F. Auch1, Mathias von Jan2, Hans-Peter Klenk2*, Markus Göker2
1 Center for Bioinformatics Tübingen, Eberhard-Karls-Universität, Tübingen, Germany
2 DSMZ – German Collection of Microorganisms and Cell Cultures GmbH, Braunschweig,
Germany
* Corresponding author: Hans-Peter Klenk.
Keywords: Archaea, Bacteria, BLAST, GBDP, genomics, MUMmer, phylogeny, species con-
cept, taxonomy.

Introduction
Macroscopic organisms, such as animals, plants
and fungi, are generally easy to distinguish for
species classification by an abundance of morpho-
logical differences, behavioral traits, or by inter-
breeding barriers. For microorganisms belonging
to the two ‘prokaryotic’ domains of life, Archaea
and Bacteria [1], species delineation is a much
more challenging task. Morphological features and
metabolic peculiarities can be used to classify mi-
croorganisms to a certain degree of confidence,
but the number of features and peculiarities that
can easily be recognized for differentiation is ra-
ther limited. Consideration of genetic – and nowa-
days increasingly genomic – features often enables
a deeper resolution for the differentiation, placing
DNA-DNA hybridizations (DDH) in a key position
as a major tool in microbial species delineation [2-
4]. Starting in the early 1970s [5-7], several me-
thods to determine DDH values have been devel-
oped [8]. The general principle of DNA-DNA re-
association requires (i) shearing the gDNA of the
assayed organism and the gDNA of the reference
organism(s) (type strain(s)) into small fragments
of 600-800 bp; (ii) heating the mixture of DNA
fragments from both strains to dissociate the DNA
double-strands; and (iii) subsequently decreasing
the temperature until the fragments reanneal. Because
the melting temperature of a
double-strand depends on the degree of match-
ing base pairings between both strands, genomic
(dis-)similarity can be inferred from the melting
temperature. The hybrid DDH value is usually
specified relative to the DDH value obtained by
hybridizing a reference genome with itself. DDH
values ≤70% are considered as an indication that
the tested organism belongs to a different species
than the type strain(s) used as reference(s) [2,4].
All established variations of DDH determination
are technically demanding, labor-intensive and
time-consuming procedures. Therefore, DDH determination
is now performed by only a few specialized
laboratories, and microbial taxonomists
apply DDH only in cases where the strains to be
differentiated have previously been shown to be
closely related in terms of their 16S rRNA gene
sequences [3,4]. In practice, the distinct DDH de-
termination methods are all based on the same
principle, but frequently lead to different results
[8,9]. Accordingly, there is increasing interest in
replacing DDH with more reproducible and absolute
methods that do not require the repeated use of
reference strains. Enabled by
the automation of Sanger sequencing (in the
1990s) and the now dominating pyrosequencing
methods [10], the rapid technical progress in se-
quencing technology lets us envision that genome
sequencing will very soon become a routine ana-
lytical method for microbial species delineation.
This situation resembles events in the mid 1980s
when the traditional DNA:rRNA hybridization [11]
was first technically improved [12] but then rapid-
ly and completely replaced by the use of 16S rRNA
sequences [1], which could be stored in databases
that (apart from resequencing to resolve artifacts
and 16S rRNA heterogeneity) require only one
experiment (sequence) per type strain to fix it for-
ever in a rapidly growing and seemingly unlimited
database [13]. The availability of whole genome
sequences has dramatically changed the way mi-
crobiologists formulate and answer questions
about subjects of interest, often termed the '-omics
revolution' [14]; the time has now arrived to use
genome sequences in the daily routine of microbi-
al taxonomists for the purpose of species delinea-
tion.
Some in silico methods based on the comparison
of completely sequenced genomes have already
been suggested as an alternative to DDH [15,16].
Goris et al. [17] applied BLAST [18] to determine
high-scoring segment pairs (HSPs) between ge-
nome sequences after cutting them into small
1000 bp-long pieces to emulate the DDH proce-
dure (see above). The 'Average Nucleotide Identi-
ty' (ANI) and the 'Percentage Conserved DNA'
were then calculated from the sets of HSPs. The
method was implemented in a Perl script that
could be obtained from its authors on request. Re-
gression analyses of the data suggested that the
resulting in-silico genomic similarity measures
were in good agreement with DDH values deter-
mined for the same pairs of strains in the wet lab
[17].
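The fragmentation step described for the ANI approach can be illustrated with a minimal sketch. It assumes consecutive, non-overlapping 1000 bp pieces and keeps a shorter trailing piece; the text does not state how the original Perl script handled the genome end or which matches entered the average, so `fragment_genome` and `average_identity` are hypothetical helper names, and the per-fragment identities are expected to come from an external search tool.

```python
def fragment_genome(sequence, size=1000):
    """Cut a genome sequence into consecutive pieces of `size` bp.

    Assumption: the last piece may be shorter than `size`.
    """
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def average_identity(identities):
    """Arithmetic mean of per-fragment percent identities (supplied by the caller)."""
    if not identities:
        raise ValueError("no identities given")
    return sum(identities) / len(identities)


# Usage with made-up values: a 2500 bp toy genome yields three pieces.
pieces = fragment_genome("A" * 2500)
mean_identity = average_identity([98.0, 96.0, 100.0])
```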
Here we try to expand this sequence-based ap-
proach. A variety of similarity search methods
have been established in addition to BLAST for the
analysis of HSPs [19,20], along with a number of
algorithms to calculate genome-to-genome dis-
tances (GGD) that can be used to infer phylogenies
[21-25]. What remains to be established is wheth-
er these methods/algorithms could turn out to be
more suitable to mimic DDH in silico. For instance,
experience has been gained on how to adapt GGD
approaches such as genome BLAST distance phy-
logenies (GBDP) to conditions such as large num-
bers of genomic repeats and heavily reduced ge-
nomes [21,24]. GGD methods have already been
shown to be rather valuable tools in reconstruct-
ing whole-genome based trees of Archaea and
Bacteria [25]. Therefore it would be very interest-
ing to see if the same methods could also be used
to estimate species boundaries, much like the 16S
rRNA gene sequences are used for both inferring
phylogenies and calculating pairwise dissimilari-
ties between strains, in order to assess whether
they need to be subjected to DDH for drawing
conclusions about their species status [4].
In the present study, we compare the major state-
of-the-art programs for determining high-scoring
segment pairs (HSPs) and maximally unique
matches (MUMs) [20], as well as previously de-
scribed approaches for calculating GGD from such
sets of HSPs or MUMs, regarding their perfor-
mance in an in-silico framework to replace DDH in
comparison to ANI [17]. We also aim at enlarging
the empirical set of data and at improving the sta-
tistics used for assessing the performance of such
methods. As a further important selection crite-
rion for GGD approaches, we also examine their
relative computational running times and memory
requirements. Correlation between 16S rRNA and
DDH data is of practical interest because 16S se-
quencing is a less tedious and error-prone task
than DDH, and sufficiently high 16S distances can
predict DDH similarities below 70% [26]. Moreo-
ver, while 16S rRNA gene sequences are them-
selves limited in estimating evolutionary distances
(after all they represent only about 0.1% of the
coding part of microbial genomes), they can nevertheless
be used according to the ceteris paribus
principle to assess the precision of either GGD or
DDH. We thus were interested in the correlation of
either method with 16S rRNA distances. Finally,
we investigate the performance of GGD on artifi-
cially incomplete genomes. This is of considerable
practical relevance, because gap closure in draft
genome sequences is a very time-consuming
process involving primer walking to create finish-
ing reads, as well as frequent rounds of re-
assemblies [27,28]. Therefore, it is of interest to
determine which minimal fraction of a genome sequence
might be required for a reliable estimation
of GGD.
This work is the basis for an accompanying stan-
dard operating procedure for conducting HSP- or
MUM-based genomic comparisons [29] and for a
web service that implements this procedure
(http://ggdc.gbdp.org/).
Material and Methods
Empirical data
The first part of the dataset used in the empirical
tests, i.e. pairs of completely sequenced genomes
and corresponding DDH values, is the one used in
[17], which comprises distinct 'hybridization
groups', i.e. sets of strains from one or few genera
that have been compared to each other. Additional
data were obtained by (i) determining a set of
type strains for which whole genomes are availa-
ble. This was done by reconciling the Genomes On
Line Database [30] and the DSMZ database
(http://www.dsmz.de/microorganisms/); and by (ii)
screening the International Journal of Systematic and
Evolutionary Microbiology (http://ijs.sgmjournals.org/)
for articles containing DDH values of these strains.
As a result, the size of the dataset
could be increased by 50%, and the final list of genomes
and DDH values comprised 93 genome/DDH
pairs. Some of the additional genomes
were not completely sequenced at the time of
downloading but comprised distinct contigs from
shotgun sequencing. These details are included in
the full genomes list contained in the Electronic
Supplementary Material (ESM). Unfortunately,
DDH information is usually only available for type
strains, whose genomes comprise only a minor
proportion of the currently available fully se-
quenced microbial genomes [31,32].
Determining HSPs and MUMs
The software packages used for determining HSPs
were NCBI-BLAST version 2.2.18, WU-BLAST ver-
sion 2.0MP-WashU (04-May-2006) [18], BLAT
version 34 [19] and BLASTZ version 7 [33]; MUMs
were determined with MUMmer version 3.0 [20].
In the case of BLASTZ, we additionally investi-
gated alternative settings of the “K” parameter
(2000, 2500, 3000, 3500); this parameter deter-
mines the minimum raw score required for a HSP
for further consideration. In the case of BLAT, we
used either 0%, 50%, or 90% (default) as mini-
mum sequence identity required within HSPs and
0 (for 0%) or 30 (for 50% and 90% minimum
identity) as corresponding minimum scores. Lo-
wering these values was expected to provide more
accurate results. For the most sensitive setting, we
additionally lowered the tile size from 12 (default)
to 8; the tile size approximately behaves like the
word length parameter of the BLAST programs.
Regarding the settings of MUMmer, we applied
minimum match lengths ranging between 16 and
50 and the three possible settings for the treat-
ment of matches in both forward and reverse
strand (command-line switches -mum,
-mumreference and -maxmatch). The modified
command-line switches are also shown in Table 1.
In either case, the resulting data were stored in
CGVIZ format [34] for further processing with
GBDP as described below. For MUMmer, we used
the MUM length as a replacement for the HSP
score, which is, of course, not available from that
program.
Distance calculation
Pairwise distances between genomes were calcu-
lated with GBDP. All programs determining HSPs
were run with or without HSP filtering, i.e. removing
all HSPs with an e-value larger than 10^-2 prior
to calculating distances. The next step is to re-
move overlapping parts of HSPs in either genome
using the so-called greedy-with-trimming algo-
rithm [24]. This procedure proved to be valuable
in phylogenetic inference from genomes with
large numbers of repeats, but trimming may also
be omitted (the resulting distance formula being
called 'coverage distance') [24]. Finally, distances
are calculated from the sets of (remaining) HSPs
using one of several approaches.
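The two preprocessing steps can be sketched as follows. This is only an illustration under stated assumptions: the `HSP` record and its fields are hypothetical, and overlaps are resolved by a plain interval union, which counts every genome position at most once but is not the greedy-with-trimming algorithm of [24].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HSP:
    start: int      # 0-based start in the genome (inclusive); hypothetical layout
    end: int        # end in the genome (exclusive)
    evalue: float   # e-value reported by the search tool


def filter_hsps(hsps, max_evalue=1e-2):
    """Drop all HSPs whose e-value is larger than the cutoff."""
    return [h for h in hsps if h.evalue <= max_evalue]


def covered_length(hsps):
    """Number of genome positions covered by at least one HSP (interval union)."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted((h.start, h.end) for h in hsps):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Usage with made-up HSPs: the third one fails the e-value filter.
hsps = [HSP(0, 100, 1e-30), HSP(50, 150, 1e-10), HSP(300, 320, 0.5)]
kept = filter_hsps(hsps)
coverage = covered_length(kept)
```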
Let $H_{xy}$ denote the total length of all HSPs and $I_{xy}$
denote the sum of the number of identical base
pairs over all HSPs found by BLASTing genome x
against genome y, whereas $H_{yx}$ and $I_{yx}$ are obtained
by using y as the query and x as the subject sequence.
GGD can then be defined as follows
[21,24,35]:

$$d_1(x,y) = 1 - \frac{H_{xy} + H_{yx}}{\lambda(x,y)} \tag{1}$$

$$d_2(x,y) = 1 - \frac{I_{xy} + I_{yx}}{H_{xy} + H_{yx}} \tag{2}$$

$$d_3(x,y) = 1 - \frac{I_{xy} + I_{yx}}{\lambda(x,y)} \tag{3}$$

Here $\lambda(x,y)$ is a function of the lengths of the two
genomes; in the simplest case, $\lambda$ is equal to
the sum of the genome lengths.
Other distance functions can be derived from the
previously mentioned ones by applying a loga-
rithmic transformation or by using twice the
length of the shorter genome instead of the sum of
the genome lengths, resulting in a total number of
ten distance functions to be tested. (The internally
applied numbering of the GBDP software is: 0-3:
formula (1); 4-5: formula (2); 6-9: formula (3); 0,
2, 6, 8: sum of genome lengths in denominator; 1,
3, 7, 9: twice the minimum genome length in de-
nominator; 0, 1, 4, 6, 7: no logarithm; 2, 3, 5, 8, 9:
logarithm. Further details are provided in the
ESM.) Using the minimum genome length im-
proved phylogenetic accuracy if heavily reduced
genomes were considered [21,24]; logarithmic
transformation of the data was also useful in such
cases, but had no effect on the non-parametric
correlation analyses in the present study (see be-
low; one of the advantages of our approach). Note
that we here examine methods based on the com-
parisons of the underlying nucleotide sequence
only; while distance approaches using the trans-
lated amino acids may perform better in phyloge-
netic inference of deep nodes [21], DDH is con-
cerned with closely related organisms only, and
mimicking it in silico based on direct genome se-
quence comparisons is straightforward.
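Formulas (1) to (3) and the two choices for the denominator can be written down in a few lines. The sketch below is not the GBDP implementation; the function name and argument names are hypothetical, the logarithmic variants are left out because their exact form is not given here, and returning the maximum distance for formula (2) when no HSPs were found is an assumption.

```python
def ggd_distances(h_xy, h_yx, i_xy, i_yx, len_x, len_y, use_min_length=False):
    """Return (d1, d2, d3) for one pair of genomes x and y.

    h_xy, h_yx: total HSP lengths for x vs. y and y vs. x
    i_xy, i_yx: numbers of identical base pairs within those HSPs
    len_x, len_y: genome lengths
    use_min_length: if True, lambda is twice the shorter genome length,
                    otherwise the sum of both genome lengths
    """
    lam = 2 * min(len_x, len_y) if use_min_length else len_x + len_y
    h = h_xy + h_yx
    i = i_xy + i_yx
    d1 = 1.0 - h / lam
    # Assumption: without any HSP, formula (2) is set to the maximum distance.
    d2 = 1.0 - i / h if h > 0 else 1.0
    d3 = 1.0 - i / lam
    return d1, d2, d3


# Usage with made-up numbers: two 1000 bp genomes, 800 bp of HSPs in each
# direction, 720 identical positions in each direction.
d1, d2, d3 = ggd_distances(800, 800, 720, 720, 1000, 1000)
```

With a heavily reduced genome as one partner, `use_min_length=True` keeps the denominator from being dominated by the larger genome, which is the motivation given in the text for that variant.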
Quantifying method performance
Goris et al. [17] used linear regression to determine
the suitability of their ANI algorithm to mimic
DDH values. This regression procedure has two
disadvantages. First, it presupposes a linear rela-
tionship between DDH values and genome dis-
tances, which may or may not hold for current
GGD approaches. For instance, some GGD formu-
las result in distances more rapidly saturated than
others, that is, distances that do not show signifi-
cant additional increase in spite of further increas-
ing genomic differences [21,24]. This problem can
be overcome by replacing regression with correla-
tion and Pearson's with Kendall's non-parametric
correlation coefficient [36,37] which uses the val-
ues' ranks only (in the following, Pearson's coeffi-
cients are included for selected values for compar-
ison; full results for either coefficient are available
in the ESM). Moreover, non-parametric statistics
are more robust against outliers. In contrast to
Goris et al. [17], we calculate distance functions;
that is, a correlation of -1.0 is optimal. Here and in the
following, all correlations, as well as all plots (see
below), were computed with R [38].
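The rank-based argument can be made concrete with a small sketch of Kendall's coefficient. It is written in plain Python rather than R, uses the simple variant that ignores ties in the denominator (an assumption; the text does not say which variant was used), and runs on made-up values. It also shows why a monotone transformation of the distances, such as a logarithm, leaves the coefficient unchanged.

```python
import math
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's rank correlation (tau-a): (concordant - discordant) / pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical genome distances and DDH values (in percent) for five pairs.
distances = [0.02, 0.05, 0.10, 0.20, 0.35]
ddh = [95, 85, 72, 40, 20]

tau_raw = kendall_tau(distances, ddh)
# A monotone transformation changes the values but not their ranks.
tau_log = kendall_tau([-math.log(1.0 - d) for d in distances], ddh)
```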
Secondly, many users will only be interested in the
error ratio of the whole-genome distance regard-
ing the question whether the DDH value is lower
than 70%. This problem can be solved by applying
a two-step procedure: (i) determining, for each
GGD approach, the distance threshold T resulting
in the smallest error ratio, and (ii) reporting this
optimal error ratio. Here, error ratio is defined as
the sum of the number of false positives (distances
at most as large as T corresponding to DDH values
lower than 70%) and false negatives (distances
larger than T corresponding to DDH values at least
as large as 70%) divided by the total number of
pairwise distances. The optimal T can then be
used in real genome comparisons for replacing the
DDH approach, too. We estimated the optimal T by
assessing all values between the maximum and
the minimum for each GGD variant, applying a
step width of 1/1000 of the range.
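The two-step procedure can be sketched as a simple scan over candidate thresholds. The function names are hypothetical, the data are made up, and keeping the first threshold that reaches the minimum error ratio is an assumption, since the text does not say how ties between equally good thresholds were resolved.

```python
def error_ratio(distances, ddh_values, threshold, ddh_cutoff=70.0):
    """(false positives + false negatives) / number of pairwise distances."""
    pairs = list(zip(distances, ddh_values))
    # False positive: distance at most T although DDH is below the cutoff.
    false_pos = sum(1 for d, h in pairs if d <= threshold and h < ddh_cutoff)
    # False negative: distance above T although DDH is at least the cutoff.
    false_neg = sum(1 for d, h in pairs if d > threshold and h >= ddh_cutoff)
    return (false_pos + false_neg) / len(pairs)


def optimal_threshold(distances, ddh_values, steps=1000):
    """Scan from the minimum to the maximum distance in `steps` equal steps."""
    low, high = min(distances), max(distances)
    width = (high - low) / steps
    best_t = low
    best_err = error_ratio(distances, ddh_values, low)
    for k in range(1, steps + 1):
        t = low + k * width
        err = error_ratio(distances, ddh_values, t)
        if err < best_err:  # assumption: keep the first optimum found
            best_t, best_err = t, err
    return best_t, best_err


# Hypothetical genome distances and DDH values (in percent) for five pairs.
distances = [0.02, 0.05, 0.10, 0.20, 0.35]
ddh = [95, 85, 72, 40, 20]
best_t, best_err = optimal_threshold(distances, ddh)
```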
To compare the results obtained with GBDP to
those obtained with the ANI and 'Percentage Con-
served DNA' methods as reported by Goris et al.
[17], we reduced the dataset to the genome pairs
examined in the latter study. We could not apply
ANI to the full dataset, because when applying the
ANI Perl script of Konstantinidis to the genome
pairs analyzed in the latter study, we could not
adequately corroborate the results reported in
[17]. For the sake of convenience we used the re-
sults for the 62 genome pairs analyzed with ANI
directly as published in [17] to compare the per-
formance of ANI with the GGD methods assessed
in the present study. Deloger et al. [16] apparently
reimplemented the ANI method but did not dis-
close whether they obtained the same results as in
[17] when applied to the same strains.
In order to correlate GGD and DDH with pairwise
distances inferred from the 16S rRNA, this gene
was extracted from all completed and annotated
genomes under study, resulting in a set of 59 pairs
of genomes. The 16S rRNA gene sequences were
aligned with Poa v2 in progressive alignment
mode [39], and uncorrected (“p”) distances were
calculated from the aligned sequences using
PAUP* v4b10 [40] under the MISSDIST=IGNORE
setting. Correlations were calculated as described
above.
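An uncorrected ("p") distance is simply the proportion of differing sites among the compared sites of two aligned sequences. The sketch below is not PAUP* code; skipping every column in which either sequence has a gap or an ambiguous character is used here as a stand-in for the MISSDIST=IGNORE setting, which is an assumption about its effect.

```python
def p_distance(seq_a, seq_b, valid="ACGTU"):
    """Proportion of mismatches among columns where both sequences have a base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in valid or b not in valid:
            continue  # gap or ambiguity code: column is ignored for this pair
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


# Usage with two made-up aligned fragments: 8 comparable sites, 1 mismatch.
dist = p_distance("ACGT-ACGTN", "ACGTAACCTA")
```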
Run-time and memory consumption measurements
Computation time of the programs determining
HSPs or MUMs as well as of GBDP applied to these
data was measured using a reduced dataset comprising
a selection of eight genomes of the Escherichia/Shigella
group. Plasmids were removed
from the dataset, so that only chromosomal data were
used. On the one hand, this allowed us to compare
job running times, since all FASTA files had ap-
proximately the same size (5 Mbp). On the other
hand, using closely related strains leads to the de-
tection of a considerable amount of HSPs by the
different local alignment search tools, thus allow-
ing estimates of an upper bound for the search
time. These measurements were performed on an
AMD Quad-Core Opteron system equipped with a
2.3 GHz CPU and 20 GB RAM.
Simulation of incomplete genome sequencing
In order to measure method performance on in-
completely sequenced genomes, artificial gaps
were incorporated into the fully sequenced ge-
nomes of the empirical datasets (that is, 62 pairs
of complete genomes formed the basis of our si-
mulation). This was based on the well-known
Lander-Waterman formula [41], which is usually
applied to estimate the sequencing effort neces-
sary to obtain a given coverage, as follows. Based
on a realistic value of 700 bp as the fixed read
length (http://www.jgi.doe.gov/sequencing/stat-
istics.html), the real length of the fully sequenced
chromosome or plasmid, and the proportion of the
genome to be retained, the number of reads ne-
cessary to achieve this proportion is calculated
using the Lander-Waterman approach [41]. An
array of all positions in the original genome is
created, and all positions are marked as 'not se-
quenced'. A starting position in the genome for
each of the calculated number of reads is then
drawn at random, and this one as well as the cor-
responding 699 downstream array positions are
marked as 'sequenced'. After all reads have been
considered, positions remaining in 'not se-
quenced' state are then removed from the input
genome, creating disjoint contigs, which are out-
put. Applied several times, this procedure creates
modifications of input genomes whose lengths are
dispersed around an expected value equal to the
original genome length times the input sequencing
proportion.
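The gap-insertion procedure can be sketched as follows. The read count uses the standard Lander-Waterman relation between the expected covered fraction p and the number of reads N, p = 1 - exp(-N L / G) for read length L and genome length G; that this exact form was used is an assumption. Reads that start near the end of the sequence are truncated here, which is also an assumption, and the function names are hypothetical.

```python
import math
import random


def reads_needed(genome_length, proportion, read_length=700):
    """Reads required so that the expected covered fraction equals `proportion`.

    Solves p = 1 - exp(-N * L / G) for N; requires 0 < proportion < 1.
    """
    return math.ceil(-genome_length * math.log(1.0 - proportion) / read_length)


def simulate_incomplete(genome, proportion, read_length=700, rng=None):
    """Return the contigs left after removing all positions hit by no read."""
    rng = rng or random.Random()
    n = len(genome)
    sequenced = [False] * n  # every position starts as 'not sequenced'
    for _ in range(reads_needed(n, proportion, read_length)):
        start = rng.randrange(n)
        # Assumption: reads running past the end of the sequence are truncated.
        for pos in range(start, min(start + read_length, n)):
            sequenced[pos] = True
    contigs, current = [], []
    for base, keep in zip(genome, sequenced):
        if keep:
            current.append(base)
        elif current:
            contigs.append("".join(current))
            current = []
    if current:
        contigs.append("".join(current))
    return contigs


# Usage on a random 20 kbp toy sequence with a fixed seed for repeatability.
toy_rng = random.Random(1)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(20000))
toy_contigs = simulate_incomplete(toy_genome, 0.80, rng=random.Random(42))
```

Because of the truncation at the sequence end, the retained length in this sketch is on average slightly below the requested proportion for short sequences.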
Based on this algorithm, a total number of 100
simulation runs was conducted for sequencing
proportions of 0.99, 0.95, 0.90, 0.85, 0.80, 0.75,
0.70, 0.60, 0.50, 0.40, 0.30, 0.20, and 0.10. Note
that this approach corresponds to simulating ge-
nomes that are incompletely sequenced, but are
nevertheless lacking sequencing errors in all reads
and are correctly assembled. A simulation includ-
ing an assembly of artificially created reads in
each replicate has been rejected for reasons of
running time. On account of this, only a single,
reasonably fast and well-performing HSP deter-
mination approach was examined in simulation.
Method performance on incomplete genomes was
quantified in two ways: First, error ratios were
determined after applying the optimal threshold
as determined for the corresponding distance
function and complete genomes (see above).
Second, Euclidean distances between the GGD cal-
culated from the original genomes and the GGD
inferred from the respective incomplete genomes
were calculated using the Eukdis program [42].
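One plausible reading of the second measure is the Euclidean distance between two vectors of GGD values, one from the complete genomes and one from their incomplete counterparts, with the genome pairs in the same order. The sketch below follows that reading; it is not the Eukdis program, whose exact input and output are not described here, and the values are made up.

```python
import math


def euclidean_distance(reference, perturbed):
    """Euclidean distance between two equally long vectors of GGD values."""
    if len(reference) != len(perturbed):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(reference, perturbed)))


# Usage with made-up distances for two genome pairs.
deviation = euclidean_distance([0.1, 0.2], [0.4, 0.6])
```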