Standards in Genomic Sciences (2010) 2:142-148 DOI:10.4056/sigs.541628
The Genomic Standards Consortium
Standard operating procedure for calculating
genome-to-genome distances based on high-scoring
segment pairs
Alexander F. Auch1, Hans-Peter Klenk2*, Markus Göker2
1 Center for Bioinformatics Tübingen, Eberhard-Karls-Universität, Tübingen, Germany
2 DSMZ – German Collection of Microorganisms and Cell Cultures GmbH, Braunschweig,
Germany.
* Corresponding author: Hans-Peter Klenk.
Keywords: BLAST, GBDP, GGDC web server, genomics, MUMmer, phylogeny, species de-
lineation, microbial taxonomy.
DNA-DNA hybridization (DDH) is a widely applied wet-lab technique to obtain an estimate
of the overall similarity between the genomes of two organisms. Microbiologists chose to base
the prokaryotic species concept ultimately on DDH as a pragmatic approach for deciding whether to
recognize novel species; this choice also allowed a relatively high degree of standardization
compared to other areas of taxonomy. However, DDH is tedious and error-prone and, first and
foremost, cannot be used to incrementally establish a comparative database. Recent studies have
shown that in-silico methods for the comparison of genome sequences can be used to replace DDH.
Considering the ongoing rapid technological progress of sequencing methods, genome-based
prokaryote taxonomy is coming within reach. However, calculating distances between genomes
depends on multiple choices of software and program settings. We here provide an overview of
the modifications that can be applied to distance methods based on high-scoring segment pairs
(HSPs) or maximally unique matches
(MUMs) and that need to be documented. General recommendations on determining HSPs
using BLAST or other algorithms are also provided. As a reference implementation, we intro-
duce the GGDC web server (http://ggdc.gbdp.org).
Introduction
In a recent study [1], we investigated state-
of-the-art methods for inferring whole-genome
distances with respect to their ability to emulate
DNA-DNA hybridization (DDH), which is the current
major technique in microbiology for assessing whether a
novel strain can be classified as a species of its
own. In almost all groups of Archaea and Bacteria,
a DDH similarity below 70% is required
to justify the establishment of a new species (see
references in [1]). The replacement of DDH by ge-
nome-to-genome distances (GGD) is of interest
because (i) DDH is cumbersome and is currently
carried out in relatively few specialized molecular
laboratories only; (ii) distinct DDH methods may
differ in their results; (iii) DNA-DNA re-
association does not grant access to any informa-
tion other than the calculated similarity value. In
contrast, genome sequence information can of
course be re-used in any subsequent comparisons
and be explored in multiple ways beyond mere
taxonomy.
Algorithms to efficiently determine high-scoring
segment pairs (HSPs) or maximally unique
matches (MUMs) are valuable tools for inferring
intergenomic distances for species delimitation
(see [1] and references therein). They correlate
well with DDH, are able to cope with heavily re-
duced genomes and repetitive sequence regions,
are very robust against missing fractions of ge-
nomic information (depending on the distance
formula used), and show a better correlation with
16S rRNA gene sequence distances than do DDH
values. The methods work in three main steps,
namely the determination of a set of HSPs or
MUMs between two genomes, the calculation of
distances from these sets, and the conversion of
these distances into percent-wise similarities analogous
to DDH. The Genome-To-Genome Distance
Calculator (GGDC) is a web tool to apply these
techniques. It has been devised for, but its use is
not restricted to, genome-based species delinea-
tion. In the following guideline for conducting and
documenting genome distance calculation from
sets of HSPs or MUMs, GGDC will serve as a refer-
ence implementation.
Requirements
The GGDC web server (http://ggdc.gbdp.org) uses
multi-FASTA files as input. One file per genome is
expected, containing each chromosome or plasmid
as a single FASTA entry. Alternatively, users can
provide a set of GenBank accession numbers. A
single query genome can be compared to several
reference genomes; organism names can be en-
tered separately. The user can choose between
several similarity search tools. The results
are currently sent by e-mail to a user-
specified address. The message also contains a
brief explanation of the results.
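
Before uploading, it can be useful to check that a genome file really contains one FASTA entry per chromosome or plasmid. The following is a minimal sketch of such a check, not part of GGDC; the function name `replicon_lengths` and the example sequences are hypothetical.

```python
from typing import Dict, Iterable


def replicon_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {entry identifier: sequence length} for a multi-FASTA stream.

    Hypothetical helper for illustration; it only counts characters and
    does not validate the nucleotide alphabet.
    """
    lengths: Dict[str, int] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without an identifier")
            current = fields[0]  # identifier = first word of the header
            if current in lengths:
                raise ValueError(f"duplicate FASTA identifier: {current}")
            lengths[current] = 0
        elif current is None:
            raise ValueError("sequence data before the first FASTA header")
        else:
            lengths[current] += len(line)
    return lengths


# Made-up genome with one chromosome and one plasmid.
example = """>chromosome
ACGTACGTAC
GTACGT
>plasmid_p1
ACGT
""".splitlines()

for name, length in replicon_lengths(example).items():
    print(name, length)
```

Each reported identifier should correspond to exactly one replicon; an unexpectedly large number of entries may indicate that the file holds unassembled contigs rather than complete chromosomes and plasmids.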
Procedure
Similarity search
Similarities between query and reference ge-
nomes are determined by using well-known tools
for nucleotide-based sequence similarity search.
Currently, NCBI-BLAST [2], WU-BLAST [2], BLAT
[3], BLASTZ [4], and MUMmer [5] are available on
the web server. Command line parameters for
these programs were carefully optimized as do-
cumented in [1]. Currently, it is not possible to
modify the parameters of these tools via the web
interface. An overview of the calculation of in-
silico DDH values is provided in Fig. 1.
Figure 1. Flowchart outlining the steps required to calculate in-silico DDH values. Either GenBank accession
numbers or FASTA files are uploaded to the server. The final values are received via e-mail.
While we recommend the use of the default settings
in general, 'power users' who are interested
in establishing their own analysis pipelines
may want to apply distinct settings. Despite the
differences between command line parameters
and the algorithms behind those tools, some gen-
eral propositions can be made as a guideline for
advanced users (Table 1, Table 2). Parameters
that increase sensitivity also increase run time
and memory consumption. Such parameters are
the minimum length for a stretch of DNA used as
starting point (seed word), the minimal number
(or percentage) of identical characters within a
match, and the score (or e-value) threshold.
A peculiarity of NCBI-BLAST and WU-BLAST is the
usage of filters to mask out regions of low com-
plexity (i.e., repeat filtering) during the seed phase
as well as during the extend-phase (when short
matches are prolonged) of the algorithm. While it
is highly advisable to use filters in the seed phase,
resulting in greatly reduced run time, high-scoring
pairs may break apart when using filtering during
the extend phase. The resulting HSPs have a
smaller score (and higher e-value) than a corres-
ponding single HSP would have, and thus, the
HSPs may be discarded depending on any score
(or e-value) threshold. Accordingly, for the detection
of orthologous genes with BLAST, it has been shown that
using the filter only in the seed phase ('soft filtering')
increases sensitivity [6]. Even when calculating
intergenomic distances, a noticeable influence
cannot be ruled out since some distance functions
use the HSP length as an implicit filtering criterion
(trimming procedure, see [7]). The default of
NCBI-BLAST is to use the filter for both steps
('hard filtering'), so it is recommended to use the
parameters '-F "m D"' ('soft filtering', the filter is
only used during the initial phase) when using
NCBI-BLASTN, or '-F "m S"' with BLASTP,
BLASTX and TBLASTX. Corresponding options for
WU-BLAST are 'wordmask=dust' (BLASTN) and
'wordmask=seg' (protein BLAST). Furthermore,
NCBI-BLAST and WU-BLAST limit the number of
HSPs that are reported for a given query sequence,
by default. This may be acceptable for small queries,
but it is not when using whole genomic sequences.
In contrast to NCBI-BLAST, WU-BLAST
allows the limit on the number of HSPs to be
removed entirely, but it has to be considered that
this leads to a severe increase in memory consumption.
Hence, we propose to set a limit of
100,000 HSPs, which should be sufficient to cover
all matches even for highly similar genomes, while
memory usage remains feasible (NCBI-BLAST: '-b
100000', WU-BLAST: 'B=100000 hspmax=100000').
When using WU-BLAST, the parameters
'hspsepSmax' and 'hspsepQmax' should be set to
avoid the linkage of distant HSPs. This improves
running time without affecting sensitivity. A thre-
shold of 50 is sufficient for genomic sequences.
HSPs (or MUMs) are determined by performing
similarity searches for each combination of query
genome and reference genome. Due to the asym-
metric nature of heuristic similarity search strate-
gies, the search is performed twice, first using the
reference genome as 'subject sequence' and the
query genome as 'query sequence', and second,
using the reference genome as 'query sequence'
and the query genome as 'subject sequence'. The
HSPs (or MUMs) are stored in condensed form
using the CGVIZ format [8], which comprises the
start and stop coordinates of the matches together
with statistical data (e-value, score, alignment
length, and percentage identical characters for
HSPs, alignment length for MUMs, see Figure 2).
These data are sufficient for the distance
calculation while saving storage space.
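
The two search directions can be scripted. The sketch below only assembles the NCBI-BLAST command lines with the parameters used by the web server (Table 2) and does not execute them. The function names and file names are hypothetical, and it is assumed that each genome has already been formatted as a BLAST database under the name passed to '-d'.

```python
import shlex
from typing import List


def blastn_command(query_fasta: str, subject_db: str) -> List[str]:
    """Build one blastall call with the web server settings from Table 2."""
    return [
        "blastall", "-p", "blastn",
        "-i", query_fasta,   # query sequence file
        "-d", subject_db,    # subject database (assumed to exist already)
        "-m", "7", "-a", "1", "-S", "3", "-e", "10",
        "-F", "m D",         # soft filtering: filter in the seed phase only
        "-b", "100000",      # report up to 100,000 HSPs
    ]


def bidirectional_commands(query_fasta: str, query_db: str,
                           reference_fasta: str, reference_db: str) -> List[List[str]]:
    """Return both searches: query vs. reference, then reference vs. query."""
    return [
        blastn_command(query_fasta, reference_db),
        blastn_command(reference_fasta, query_db),
    ]


# Hypothetical file and database names.
for command in bidirectional_commands("query.fna", "query_db",
                                      "reference.fna", "reference_db"):
    print(shlex.join(command))
```

Keeping the commands as argument lists avoids quoting problems with the '-F' value, which contains a space.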
Distance calculation
Distances between genomes are calculated using
GBDP as described in [7,9,10]. When using NCBI-
BLAST, WU-BLAST, BLAT, and BLASTZ, the gree-
dy-with-trimming algorithm [7] is applied using
distance functions (1), (2), and (3) (see [1]). Dis-
tances for MUMmer are calculated using the cov-
erage algorithm [7] with distance function (1).
These settings currently cannot be modified via
the web interface. Considering error ratios and
correlation with DDH (see [1]), we recommend
distance functions (2) or (3) for all similarity
search algorithms except MUMmer, for which dis-
tance function (1) should be used. For the 'power
user', an overview of our propositions regarding
HSP/MUM overlap filtering and distance calcula-
tion is provided in Table 3.
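
The distance functions (1) to (3) are defined in [1] and are not restated here. As a rough illustration only, the sketch below assumes the simplified forms in which they are commonly summarised: (1) HSP length relative to total genome length, (2) identities relative to HSP length, and (3) identities relative to total genome length, each subtracted from 1. The class, its field names, and the numbers are hypothetical, and the overlap filtering of the greedy-with-trimming and coverage algorithms [7] is assumed to have been applied beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HspSummary:
    """Totals for one genome pair, summed over both search directions.

    Hypothetical container; not a GGDC or CGVIZ data structure.
    """
    hsp_length: int     # total length covered by HSPs after overlap filtering
    identities: int     # identical positions within those HSPs
    genome_length: int  # summed length of both genomes


def distance_1(s: HspSummary) -> float:
    """Assumed form of (1): 1 - HSP length / total genome length."""
    return 1.0 - s.hsp_length / s.genome_length


def distance_2(s: HspSummary) -> float:
    """Assumed form of (2): 1 - identities / HSP length."""
    if s.hsp_length == 0:
        return 1.0  # assumption: no HSPs at all means maximal distance
    return 1.0 - s.identities / s.hsp_length


def distance_3(s: HspSummary) -> float:
    """Assumed form of (3): 1 - identities / total genome length."""
    return 1.0 - s.identities / s.genome_length


# Made-up totals for two genomes of 1 Mbp each.
pair = HspSummary(hsp_length=1_600_000, identities=1_520_000,
                  genome_length=2_000_000)
print(distance_1(pair), distance_2(pair), distance_3(pair))
```

Under these assumed forms, only function (2) is independent of the total genome length, which would be consistent with its being the least sensitive to missing sequence data.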
Table 1. HSP determination and filtering
Algorithm | WU BLAST (a) | NCBI BLAST (b) | BLAT (c) | MUMmer (d) | BLASTZ (e)
Run time | Very high [M] | Low [M] | High [M] | Very low [M] | Moderate [M]
Memory consumption and output size | High [M] | Moderate [M] | Moderate [M] | Very low [M] | Low [M]
Typical effect on correlation with DDH values | decrease [M] | increase [M] | increase [M] | moderate increase [M] | decrease [M]
Seed parameter | W= | -W | -tileSize | -l | T=0 W=
Typical effect on runtime, RAM usage and file size | higher → speedup, smaller output files [E] | higher → speedup, smaller output files [E] | higher → speedup; lower → significant increase of memory consumption [M] | higher → speedup, smaller output files [M] | higher → speedup, smaller output files [E]
Typical effect on correlation with DDH values | N/A | N/A | lower → decrease of correlation [M] | higher → increase of correlation [M] | N/A
Identity parameter | score based, i.e., identical to initial word length | score based, i.e., identical to initial word length | -minIdentity | 100% (fixed) | score based, i.e., identical to initial word length
Typical effect on runtime, RAM usage and file size | N/A | N/A | insignificant [M] | (none) | N/A
Typical effect on correlation with DDH values | N/A | N/A | lower → increase of correlation [M] | N/A | N/A
Measure of HSP quality used for filtering | e-value | e-value | substitution score | (makes no sense) | substitution score
Typical effect on subsequent runtime and RAM usage | insignificant [E] | insignificant [E] | lower → small increase of runtime and memory consumption [M] | (none) | lower → small increase of runtime and memory consumption [M]
Typical effect on correlation with DDH values | insignificant [E] | insignificant [E] | lower → small increase of correlation | N/A | higher → slight increase of correlation
The table shows different parameters of the similarity search algorithms and their influence on the correlation with DDH values (for details, see [1]). Note that the best possible
correlation of DDH values (similarities) with GGD (dissimilarities) is -1.0; that is, 'high' correlations indicate more negative ones. Seed parameter: Minimum length for a stretch
of DNA used as HSP starting point. Identity parameter: Minimum identity within HSP for prolongation. Evidence codes: [M] measured; [E] extrapolated.
(a) Version 2.0MP-WashU [04-May-2006], website http://blast.wustl.edu/ [2]
(b) Version 2.2.18, website ftp://ftp.ncbi.nlm.nih.gov/blast/executables/ [2]
(c) Version 34, website http://users.soe.ucsc.edu/~kent/src/ [3]
(d) Version 3.0, website http://mummer.sourceforge.net [5]
(e) Version 7, website http://www.bx.psu.edu/miller_lab/ [4]
Table 2: Command line parameters for similarity search tools as used by the web server. Recommended parameters are in bold.
Similarity search tool | Command line parameter
NCBI BLAST | blastall -p blastn -i QUERY -d SUBJECT -m 7 -a 1 -S 3 -e 10 -F 'm D' -b 100000
WU-BLAST | blastn SUBJECT QUERY mformat=7 cpus=1 E=10 wordmask=dust B=100000 hspmax=100000 hspsepSmax=50 hspsepQmax=50
BLAT | blat SUBJECT QUERY OUTFILE -t=dna -q=dna -out=blast -minScore=30 -minIdentity=50
BLASTZ | blastz QUERY SUBJECT B=2 C=2 K=3500 Y=700
MUMmer | mummer -b -c -F -l 44 -mum SUBJECT QUERY
Figure 2. Example of a CGVIZ file. The e-value is stored using its logarithmic value (base 10).
Filtering of HSPs having an e-value above 10^-2
should be applied for BLAT, NCBI-BLAST and
MUMmer prior to distance calculation, while it is
not necessary for BLASTZ and WU-BLAST. A
downstream filtering step has the advantage that
it can easily be changed without the necessity to
re-run the costly similarity search with adapted
parameters. This enables one to reuse the data for
further processing.
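
A downstream e-value filter of this kind takes only a few lines. The sketch below assumes that each stored match carries the base-10 logarithm of its e-value, as stated for the CGVIZ format (Figure 2); the record layout and field names are hypothetical and do not reproduce the actual CGVIZ file structure.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hsp:
    """Hypothetical in-memory record for one stored HSP."""
    query_start: int
    query_stop: int
    subject_start: int
    subject_stop: int
    log10_evalue: float      # e-value stored as its base-10 logarithm
    score: float
    identity_percent: float


def filter_by_evalue(hsps: List[Hsp], max_evalue: float = 1e-2) -> List[Hsp]:
    """Keep HSPs whose e-value does not exceed max_evalue."""
    if max_evalue <= 0:
        raise ValueError("max_evalue must be positive")
    cutoff = math.log10(max_evalue)  # compare on the stored log scale
    return [h for h in hsps if h.log10_evalue <= cutoff]


# Made-up matches: one strong, one weak.
stored = [
    Hsp(1, 500, 10, 509, log10_evalue=-50.0, score=900.0, identity_percent=98.0),
    Hsp(700, 760, 900, 960, log10_evalue=-1.0, score=40.0, identity_percent=85.0),
]
print(len(filter_by_evalue(stored)))
```

Because the filter operates on stored matches, the threshold can be changed and the filter re-applied without repeating the similarity search.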
Conversion to percent-wise similarities
The obtained distance values d are converted into
percent-wise similarities s(d) by using the corres-
ponding values for intercept c and slope m accord-
ing to Table 4:
$$s(d) = m \cdot d + c$$
The percent-wise similarity s(d) can be used analogously
to a DDH value. Values for intercept and
slope are determined by applying the robust line
fitting procedure as implemented in the R package
(Version 2.6.2 [11]) to the dataset described in
[1] (or any subsequently enlarged collection of
DDH values and corresponding genomes).
Additionally, the corresponding distance thre-
shold as determined in [1] can be used for species
delimitation. Any distance value above the threshold
can be regarded as an indication that the two
genomes analyzed represent two distinct species.
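
The conversion and the threshold test can be sketched as follows. The slope, intercept and threshold in the example are placeholders chosen for illustration only; they are not the values from Table 4 or from [1], which depend on the similarity search tool and distance function. The function names are hypothetical.

```python
def ddh_estimate(distance: float, slope: float, intercept: float) -> float:
    """Convert a distance d into a percent-wise similarity s(d) = m * d + c."""
    return slope * distance + intercept


def indicates_distinct_species(distance: float, threshold: float) -> bool:
    """True if the distance lies above the species delimitation threshold."""
    return distance > threshold


# Placeholder parameters, NOT taken from Table 4 or from [1].
SLOPE = -150.0      # m: similarity decreases as distance increases
INTERCEPT = 100.0   # c: similarity at distance zero
THRESHOLD = 0.2     # distance threshold for species delimitation

for d in (0.05, 0.30):
    print(d, ddh_estimate(d, SLOPE, INTERCEPT),
          indicates_distinct_species(d, THRESHOLD))
```

The slope is negative because larger distances correspond to lower DDH similarities.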
Recommended use of the server and
interpretation of the results
The default similarity search program on the web
server is currently NCBI-BLAST, which appears
both reasonably fast and reasonably accurate (Table 1).
Use of BLAT resulted in somewhat higher
correlations with DDH values from the literature [1]
but took more time to complete. We thus recommend
NCBI-BLAST for testing and for large datasets
and BLAT for the final analysis of a small
number of genomes.
The e-mail sent to the user includes the results for
all three distance formulas. Considering error ra-
tios at 70% DDH, we recommend formula (2). This
formula must be used if incomplete genome se-
quences are submitted to the server [1]. If the
overall correlation with DDH is of interest, and
particularly if the 70% threshold is less relevant,
we strongly recommend formula (3) and the cor-
relations-based DDH estimates.