
Improved metagenomic analysis with Kraken 2

SHORT REPORT, Open Access
Derrick E. Wood (1,2), Jennifer Lu (2,3) and Ben Langmead (1,2,*)
Abstract
Although Kraken's k-mer-based approach provides a fast taxonomic classification of metagenomic
sequence data, its large memory requirements can be limiting for some applications. Kraken 2
improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference
genomic data to be used, while maintaining high accuracy and increasing speed fivefold. Kraken 2
also introduces a translated search mode, providing increased sensitivity in viral metagenomics
analysis.
Keywords: Metagenomics, Metagenomics classification, Microbiome, Probabilistic data structures,
Alignment-free methods, Minimizers
Assigning taxonomic labels to sequencing reads is an important part of many computational genomics
pipelines for metagenomics projects. Recent years have seen several approaches to accomplish this
task in a time-efficient manner [1–3]. One such tool, Kraken [4], uses a memory-intensive algorithm
that associates short genomic substrings (k-mers) with the lowest common ancestor (LCA) taxa.
Kraken and related tools like KrakenUniq [5] have proven highly efficient and accurate in
independent tool comparisons [6,7]. But Kraken's high memory requirements force many researchers
either to use a reduced-sensitivity MiniKraken database [8,9] or to build and use many indexes over
subsets of the reference sequences [10,11]. Its memory requirements can easily exceed 100 GB [7],
especially when the reference data includes large eukaryotic genomes [12,13]. Here, we introduce
Kraken 2, which provides a major reduction in memory usage as well as faster classification, a
spaced seed searching scheme, a translated search mode for matching in amino acid space, and
continued compatibility with the Bracken [14] species-level sequence abundance estimation
algorithm.
Kraken 2 addresses the issue of large memory requirements through two changes to Kraken 1's data
structures and algorithms. While Kraken 1 used a sorted list of k-mer/LCA pairs indexed by
minimizers [15], Kraken 2 introduces a probabilistic, compact hash table to map minimizers to LCAs.
This table uses one third of the memory of a standard hash table, at the cost of some specificity
and accuracy. Additionally, Kraken 2 only stores minimizers (of length ℓ, ℓ ≤ k) from the reference
sequence library in its data structure, whereas Kraken 1 stored all k-mers. This change means that,
during classification, the minimizer (ℓ-mer) is the substring compared against a reference set in
Kraken 2, while Kraken 1 compared k-mers (Fig. 1a, b). Kraken 2's index for a specific reference
database with 9.1 Gbp of genomic sequences uses 10.6 GB of memory when classifying. Kraken 1's
index for the same reference uses 72.4 GB of memory for classification (Fig. 2a, Additional file 1:
Table S1). In general, a Kraken 2 database is about 85% smaller than a Kraken 1 database over the
same references (Additional file 2: Figure S1).
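The 85% figure can be checked against the two index sizes quoted above. The short calculation below is only an arithmetic illustration; the variable names are ours, and the "capacity ratio" is simply the inverse of the size ratio, not a measured quantity.

```python
# Index sizes quoted in the text for the same 9.1 Gbp reference.
kraken1_gb = 72.4
kraken2_gb = 10.6

# Relative reduction in memory when moving from Kraken 1 to Kraken 2.
reduction = 1 - kraken2_gb / kraken1_gb

# How many Kraken 2 indexes of this size fit in Kraken 1's footprint.
capacity_ratio = kraken1_gb / kraken2_gb

print(f"reduction: {reduction:.1%}, capacity ratio: {capacity_ratio:.1f}x")
```

The ratio is in line with the later statement that Kraken 2 allows six to seven times more reference data in the same memory.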
Kraken 2's approach is faster than Kraken 1's because only distinct minimizers from the query
(read) trigger accesses to the hash table. A similar minimizer-based approach has proven useful in
accelerating read alignment [16]. Kraken 2 additionally provides a hash-based subsampling approach
that reduces the set of minimizer/LCA pairs included in the table, allowing the user to specify a
target hash table size; smaller hash tables yield lower memory usage and higher classification
throughput at the expense of lower classification accuracy (Fig. 1d, Additional file 1: Table S2).
© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
* Correspondence: langmea@cs.jhu.edu
1 Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
2 Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
Full list of author information is available at the end of the article
Wood et al. Genome Biology (2019) 20:257
https://doi.org/10.1186/s13059-019-1891-0
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Kraken 2 also features other improvements to accuracy and runtime. A new translated search mode
(Kraken 2X) uses a reduced amino acid alphabet and increases sensitivity on viral datasets compared
to nucleotide-based search. Block- and batch-based parsing within the critical section is used to
improve thread scaling, in a manner similar to that used in recent versions of Bowtie 2 [17]. We
also added a form of spaced seed search and automated masking of low-complexity reference sequences
to improve accuracy.
To assess the accuracy and performance of Kraken 2, we selected 40 prokaryotic and 10 viral genomes
for which we had reference genomes for at least 2 sister subspecies and at least 2 sister species
(Additional file 1: Table S3). We then created a reference genome (or protein) set that excluded
the 50 taxa for the genomes we selected. This reference set and taxonomy were held constant between
the various classifiers we examined, avoiding any confounding due to differences in the reference
database. A similar approach was recently used for the same purpose in another study [7].
We simulated 1 million Illumina 100 × 100 nt paired-end reads from each of the 50 selected genomes,
for a total of 50 million reads (25 million fragments). We processed these data with 4 nucleotide
search-based sequence classification programs (Centrifuge [1], CLARK [2], Kraken 1 [4], and
KrakenUniq [5]) and a translated search classifier (Kaiju [3]). We additionally processed these
data with Kraken 2, using several different databases created with different parameters (see the
"Methods" section).
Fig. 1 Differences in operation between the two versions of Kraken. (a) Both versions of Kraken
begin classifying a k-mer by computing its ℓ bp minimizer (highlighted in magenta). The default
values of k and ℓ for each version are shown in the figure. (b) Kraken 2 applies a spaced seed mask
of s spaces to the minimizer and calculates a compact hash code, which is then used as a search
query in its compact hash table; the lowest common ancestor (LCA) taxon associated with the compact
hash code is then assigned to the k-mer (see the "Methods" section for full details). In Kraken 1,
the minimizer is used to accelerate the search for the k-mer, through the use of an offset index
and a limited-range binary search; the association between k-mer and LCA is directly stored in the
sorted list. (c) Kraken 2 also achieves lower memory usage than Kraken 1 by using fewer bits to
store the LCA and storing a compact hash code of the minimizer rather than the full k-mer.
(d) Impact on speed, memory usage, and prokaryotic genus F1-measure in Kraken 2 when changing k
with respect to ℓ (ℓ = 31, s = 7 for all three graphs). (e) Impact on prokaryotic genus sensitivity
and positive predictive value (PPV) when changing the number of minimizer spaces s (k = 35, ℓ = 31
for both graphs). In (d) and (e), the data are from our parameter sweep results in Additional
file 1: Table S2, and the default values of the independent variables for Kraken 2 are marked with
a circle.
This strain-exclusion approach mimics the real-world scenario where reads likely originate from
strains that are genetically distinct from those in the database. The addition of simulated
sequencing errors also provides further genetic distance between the test data and the reference
sequences. Through this approach, we sought to avoid overly optimistic estimates of a classifier's
performance.
We found that Kraken 2 exhibited similar, and often superior, per-sequence accuracy to the other
nucleotide classifiers and that Kraken 2X provided similar (though slightly lower) accuracy
compared to Kaiju (Fig. 2b, Additional file 1: Table S1). The nucleotide-based classifiers
exhibited lower accuracy on the viral read data than did the translated search classifiers,
demonstrating the advantage of translated search in scenarios marked by high genetic variability
and sparsity of available reference genomes [3].
In some cases, we found that Kraken 2 would not classify a large proportion of reads correctly at
the species level, despite the presence of at least two sister strains in the reference database
(Additional file 2: Figure S2). This was often the result of classifications that were either
incorrect at the species level or correct but only made at the genus level (or higher). Such
classifications can occur when genomes from different species or genera share a high genomic
identity, which is the case in multiple places of the taxonomy, including the Shigella [18],
Bacillus [19], and Pseudomonas [20] genera. A redefinition of the taxonomy based on the phylogeny,
as recently proposed [21], would likely improve sensitivity at the species level.
Following our evaluation of the classifiers' accuracy, we then examined the runtime and memory
requirements of each program. Kraken 2 provided substantial increases in processing speed,
classifying paired-end data at over 93 million reads per minute while using 16 threads, a speed
over 5 times faster than Kraken 1, the next-fastest classifier (Fig. 2a, Additional file 1:
Table S1). Additionally, Kraken 2 exhibited superior thread scaling to Kraken 1 (Additional file 1:
Table S4). Kraken 2's memory requirement is also 15% of Kraken 1's, and only 2.5 times as much as
that of the least memory-intensive classifier we examined, Centrifuge. With respect to the
translated search programs, Kraken 2X is over 3 times faster and uses 47% less memory than Kaiju.
Fig. 2 Comparison between Kraken 2 and other sequence classification tools. (a) Processing speed
(in millions of reads per minute) and memory usage (measured by maximum resident set size, in
gigabytes) are shown for each classifier, as evaluated on 50 million paired-end simulated reads
with 16 threads. Accuracy results are shown for (b) 40 prokaryotic genomes and (c) 10 viral
genomes. The results here are shown for sensitivity, positive predictive value (PPV), and
F1-measure as evaluated on a per-fragment basis at the genus rank, with 1000 reads simulated from
each genome. The strains from which reads were simulated were excluded from the reference libraries
for each classification tool. "Kraken 2X" is Kraken 2 using translated search against a protein
database. Full results for these strain-exclusion experiments are available in Additional file 1:
Table S1.
To determine whether Kraken 2 exhibited similar analytical performance on real sequencing data, we
classified read data from the FDA-ARGOS project [22]. We compared the fragment classifications
obtained by the various classification programs to the taxonomic labels attached to the
corresponding ARGOS experiment. Kraken 2 exhibits similar genus-level concordance and discordance
statistics to the other nucleotide search classifiers, while Kraken 2X exhibits similar, though
slightly lower, agreement with the ARGOS labels than does Kaiju (Additional file 1: Table S5).
These results agree with those obtained in the strain-exclusion experiment on simulated data.
As a continuation of the strain-exclusion experiments, we applied Bracken [14] to the Kraken 1 and
Kraken 2 results, estimating species- and genus-level sequence abundance for prokaryotic species.
Bracken uses a Bayesian algorithm to integrate reads that Kraken classified at higher taxonomic
levels into the abundance estimates. Although the true strain-level taxa are excluded from the
database, Bracken recaptured most of the true genus-level and species-level sequence abundances
using both Kraken 2 and Kraken 1 classification results. Comparing the results, the Bracken
estimates were more accurate with Kraken 2 than with Kraken 1 at both the genus and species levels,
likely owing to Kraken 2's higher sensitivity (Additional file 2: Figure S2). Bracken ran in less
than 1 s, a minute fraction of the runtime of any of the classification programs we examined.
As databases of assembled genomes continue to grow, databases of reference sequences used for
metagenomics studies will also grow [21,23]. We presented Kraken 2, an extremely memory-efficient
metagenomics classification tool that replaces Kraken 1's k-mer database with a probabilistic data
structure that is substantially smaller, allowing six to seven times more reference data compared
to Kraken 1. The algorithms introduced in Kraken 2 to subsample the set of genomic substrings also
provide Kraken 2 with the ability to further reduce the size of its database and accelerate the
processing of sequencing data. We showed that Kraken 2's accuracy is comparable to that of Kraken 1
and other competing tools, consistent with other studies [6,7]. We also showed that its new
translated search mode has accuracy approaching that of the protein-focused Kaiju tool, while using
less memory and runtime. Also, Kraken 2 is compatible with the Bracken software for species-level
quantification, making Kraken 2 straightforwardly usable for that application.
In the future, it will be important to consider additional use cases for Kraken 2. For example,
other data structures similar to our compact hash table, such as the counting quotient filter [24],
could be implemented and used in computing environments and applications that may benefit from a
particular data structure's design and properties. Additionally, the KrakenUniq [5] tool uses the
HyperLogLog sketch [25] to estimate the number of distinct k-mers matched at each node of the
taxonomy, a statistic that is used in turn to better determine the presence or absence of
individual genomes. We plan to add this functionality in the future, as it enables applications in
the diagnosis of infections where the infectious agent is present at low abundance.
Methods
Compact hash table
The hash table used by Kraken 2 to store minimizer/LCA key-value pairs is very similar to a
traditional hash table that would use linear probing for collision resolution, with some
modifications. Kraken 2's compact hash table (CHT) uses a fixed-size array of 32-bit hash cells to
store key-value pairs. Within a cell, the number of bits used to store the value of the key-value
pair will vary depending on the number of bits needed to represent all unique taxonomy ID numbers
found in the reference sequence library; this was 17 bits with the standard Kraken 2 database in
September 2018. The value is stored in the least significant bits of the hash cell and must be a
positive integer. Values of 0 represent empty cells. Within the remaining bits of the hash cell,
the most significant bits of the key's hash code (a compact hash code) are stored. Searching for a
key K in the CHT is done by computing the hash code of the key, h(K), then linearly scanning the
table array starting at position h(K) mod |T| (where |T| is the number of cells in the array) for a
matching key. Examples of this search process, including both key/value insertion and querying, are
shown in Additional file 2: Figure S3. In Kraken 2, the hash function h used is the finalization
function from MurmurHash3 [26].
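The cell layout and probing scheme described above can be illustrated with a small Python sketch. This is not Kraken 2's implementation: the class and method names are hypothetical, the table simply overwrites the value when a matching compact hash code is found (how Kraken 2 combines values on a match is not modeled here), and the 64-bit MurmurHash3 finalizer is assumed to be the variant in use.

```python
MASK64 = (1 << 64) - 1

def fmix64(key: int) -> int:
    """64-bit finalization function of MurmurHash3."""
    key &= MASK64
    key ^= key >> 33
    key = (key * 0xFF51AFD7ED558CCD) & MASK64
    key ^= key >> 33
    key = (key * 0xC4CEB9FE1A85EC53) & MASK64
    key ^= key >> 33
    return key

class CompactHashTable:
    """Toy CHT: 32-bit cells = [compact hash code | value]. Hypothetical API."""

    def __init__(self, capacity: int, value_bits: int = 17):
        self.capacity = capacity
        self.value_bits = value_bits          # low bits hold the value
        self.key_bits = 32 - value_bits       # high bits hold the compact hash code
        self.value_mask = (1 << value_bits) - 1
        self.cells = [0] * capacity           # a value of 0 marks an empty cell

    def _find(self, key: int):
        """Linear probe from h(K) mod |T| to an empty cell or a matching code."""
        code = fmix64(key)
        compact = code >> (64 - self.key_bits)   # most significant bits of h(K)
        idx = code % self.capacity
        for _ in range(self.capacity):
            cell = self.cells[idx]
            if cell == 0 or (cell >> self.value_bits) == compact:
                return idx, compact
            idx = (idx + 1) % self.capacity
        raise RuntimeError("table is full")

    def set(self, key: int, value: int) -> None:
        if not 0 < value <= self.value_mask:
            raise ValueError("value must be positive and fit in value_bits")
        idx, compact = self._find(key)
        self.cells[idx] = (compact << self.value_bits) | value

    def get(self, key: int) -> int:
        """Return the stored value, or 0 if the probe ends at an empty cell."""
        idx, _ = self._find(key)
        return self.cells[idx] & self.value_mask

table = CompactHashTable(capacity=16)
table.set(12345, 7)
print(table.get(12345))
```

Because only the compact hash code is compared, `get` can return a value for a key that was never inserted. That is the probabilistic behaviour this section goes on to discuss.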
Compacting hash codes in this way allows Kraken 2 to use 32 bits for a key-value pair, a reduction
compared to the 96 bits used by Kraken 1 (64 bits for key, 32 for value) (Fig. 1c). But it also
creates a new way in which keys can "collide," in turn impacting the accuracy of queries to the
CHT. Two distinct keys can be treated as identical by the CHT if they share the same compact hash
code and their starting search positions are close enough to cause a linear probe to encounter a
stored matching compact hash code before an empty cell is found. This property gives the CHT its
probabilistic nature, in that two types of false-positive query results are possible: either (a) a
key that was not inserted can be reported as present in the table or (b) the values of two keys can
be confused with each other. In Kraken 2, the former error is indeed a false positive, whereas the
latter results in a less specific LCA being assigned to the minimizer (Additional file 2:
Figure S3). The probability of either of these errors is < 1% with Kraken 2's default load factor
of 70% (Additional file 2: Figure S4). The adverse effect on read-level classification is further
mitigated by the algorithm Kraken 2 uses to combine information from across the read, which is
unchanged from Kraken 1 and utilizes information from all k-mers in a sequence to counteract
low-frequency erroneous LCA values that could be returned by a key-value store.
The probabilistic nature and comparisons involving parts of a key's hash code make the CHT similar
to the counting quotient filter (CQF) described by Pandey et al. [24]. Like the CQF, Kraken 2's CHT
features high locality of memory access during an individual query due to the linear probing that
the CHT employs. Unlike the CQF, however, our CHT does not allow the full hash code to be recovered
from a stored value (the CQF's "remainder"), and so we are unable to resize a CHT once it is
instantiated. Additionally, our CHT has an additional possibility of error compared to the CQF,
where two keys that do not have the same full hash code but share a truncated hash code will be
treated as identical. The CQF can avoid such "soft" hash collisions.
Internal taxonomy of a Kraken 2 database
While Kraken 1 used the taxonomy provided by the user without modification, Kraken 2 makes some
modifications to its internal representation of the taxonomy that cause that representation to
differ from the user-provided taxonomy. First, Kraken 2 finds a minimal set of nodes in the
user-provided taxonomy. This minimal set consists of all nodes to which a reference sequence is
assigned, as well as all of those nodes' ancestors; edges between nodes in this set remain as they
were in the user-provided taxonomy, maintaining the tree structure in the internal representation.
Kraken 2 then assigns nodes in the minimal set sequentially increasing internal taxonomy ID numbers
using a breadth-first search (BFS) beginning at the root, with the root having an internal ID
number of 1. This BFS provides a guarantee that ancestor nodes will have smaller internal ID
numbers than their descendants; an example of this numbering is shown in Additional file 2:
Figure S3. Kraken 2 stores a mapping of its internal taxonomy numbers to the external taxonomy ID
numbers to make its results more easily interpretable, and performs all output using the external
taxonomy ID numbers.
Kraken 2's use of this internal taxonomy representation allows for the easier computation of the
LCA of two nodes because the ID numbers themselves give information as to their relative depths in
the tree, while the National Center for Biotechnology Information (NCBI) taxonomy IDs lack this
property. The internal taxonomy representation also allows Kraken 2 to use the minimal number of
bits for storage of taxonomy ID numbers, giving maximal space for the compact hash codes and
reducing the probability of CHT errors (or "hash table collisions," as we describe elsewhere in
this paper).
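To make the renumbering concrete, here is a minimal Python sketch of the minimal-set construction, BFS numbering, and an LCA computation that exploits the "ancestors have smaller IDs" guarantee. The function names, the toy taxonomy, and the convention that the root's parent is `None` are assumptions for illustration; the LCA loop is one straightforward way to use the ordering, not necessarily the exact procedure in Kraken 2.

```python
from collections import deque

def build_internal_taxonomy(parent_of, assigned):
    """Return (external -> internal ID map, internal child -> internal parent map).

    parent_of: external ID -> parent external ID (None for the root).
    assigned:  external IDs that have at least one reference sequence.
    """
    # Minimal set: assigned nodes plus all of their ancestors.
    keep = set()
    for node in assigned:
        while node is not None and node not in keep:
            keep.add(node)
            node = parent_of.get(node)

    children, root = {}, None
    for node in keep:
        parent = parent_of.get(node)
        if parent is None:
            root = node
        else:
            children.setdefault(parent, []).append(node)

    # BFS from the root; IDs increase sequentially, starting at 1.
    internal = {}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        internal[node] = len(internal) + 1
        queue.extend(sorted(children.get(node, [])))

    parent_internal = {
        internal[n]: internal[parent_of[n]]
        for n in keep if parent_of.get(n) is not None
    }
    return internal, parent_internal

def lca(a, b, parent_internal):
    """LCA of two internal IDs: the larger ID can never be an ancestor of the
    smaller one, so repeatedly move the larger ID up to its parent."""
    while a != b:
        if a > b:
            a = parent_internal[a]
        else:
            b = parent_internal[b]
    return a

# Toy taxonomy (hypothetical external IDs); node 999 has no sequence assigned.
parent_of = {1: None, 10: 1, 20: 1, 100: 10, 101: 10, 200: 20, 999: 20}
internal, parent_internal = build_internal_taxonomy(parent_of, {100, 101, 200})
print(internal)
print(lca(internal[100], internal[101], parent_internal))
```

In this toy example the unused node 999 is dropped, so six internal IDs suffice and would fit in 3 bits.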
A Kraken 2 database consists of a CHT and this internal taxonomy representation. Typical databases
will be built using the NCBI taxonomy [27], but users can override this default to create custom
databases for atypical use cases.
Minimizer-based subsampling
In contrast to Kraken 1's use of all k-mers in the standard use case, Kraken 2 subsamples the set
of genomic substrings and inserts only the distinct minimizers into its database (Fig. 1b). We
define the ℓ bp minimizer of a k-mer (ℓ ≤ k) to be the lexicographically smallest canonical ℓ-mer
found within the k-mer. An ℓ-mer is called canonical if it is lexicographically less than or equal
to its reverse complement. Note that if k = ℓ, no subsampling occurs and Kraken 2 inserts the same
substrings into its data structure that Kraken 1 would. Additionally, as the difference between k
and ℓ grows, fewer substrings are inserted into the CHT, reducing its size along with Kraken 2's
memory usage and runtime (Fig. 1d, Additional file 1: Table S2). The default values for Kraken 2,
k = 35 and ℓ = 31, were determined after the analysis of the parameter sweep results we show in
Additional file 1: Table S2.
Kraken 2 determines which ℓ-mers are minimizers by the use of a sliding window minimum algorithm,
in contrast to Kraken 1's implementation, which examined each k-mer anew. This allows for a faster
determination of minimizers, as less work is required when moving from one k-mer to the next
overlapping k-mer (in terms of computational complexity, the new approach uses an average of O(1)
time to calculate a new minimizer vs. Θ(k) time with the older algorithm). The sliding window
minimum calculation uses a double-ended queue (or deque) in which canonicalized candidate ℓ-mers
are inserted in the back, along with the candidate's position in the original sequence. As a new
candidate is encountered, enqueued candidates are removed from the back of the deque until the
candidate at the back has a lesser value than the new candidate (as determined by lexicographical
ordering). The new candidate is then pushed onto the back of the deque.
Once a k-mer's worth of ℓ-mers has been processed in this way, the front of the deque contains the
minimizer of that k-mer. This property is then maintained while scanning subsequent bases by
removing the front element in the deque if it is from a position in the original sequence that is
not in the current k-mer. In this way, the front element of the deque holds the minimizer of the
k-mer currently being examined.
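The deque-based procedure can be written down compactly. The sketch below works on plain strings rather than bit-packed ℓ-mers and omits the XOR shuffling and spaced seed masking described elsewhere in the Methods; the function names are hypothetical and the code is meant as an illustration of the sliding window minimum, not a copy of Kraken 2's scanner.

```python
from collections import deque

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(lmer: str) -> str:
    """Lexicographically smaller of an l-mer and its reverse complement."""
    return min(lmer, lmer.translate(COMPLEMENT)[::-1])

def minimizers(seq: str, k: int, l: int):
    """Yield the minimizer of every k-mer in seq, left to right."""
    window = k - l + 1                 # number of l-mers inside one k-mer
    candidates = deque()               # (l-mer, position), increasing front to back
    for pos in range(len(seq) - l + 1):
        cand = canonical(seq[pos:pos + l])
        # Drop candidates at the back that cannot beat the new one.
        while candidates and candidates[-1][0] >= cand:
            candidates.pop()
        candidates.append((cand, pos))
        # Drop the front candidate once it falls outside the current k-mer.
        if candidates[0][1] <= pos - window:
            candidates.popleft()
        if pos >= window - 1:          # a full k-mer's worth of l-mers seen
            yield candidates[0][0]

def brute_force(seq: str, k: int, l: int):
    """Reference implementation that re-examines each k-mer from scratch."""
    return [
        min(canonical(seq[i + j:i + j + l]) for j in range(k - l + 1))
        for i in range(len(seq) - k + 1)
    ]

seq = "ACGTTGCAAGGCTA"
print(list(minimizers(seq, k=5, l=3)))
```

Each ℓ-mer is pushed and popped at most once, which is where the amortized O(1) cost per k-mer comes from; `brute_force` corresponds to recomputing the minimum for every k-mer.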
We further augmented the sliding window algorithm to include the exclusive or (XOR) shuffling
operation from Kraken 1. This operation serves to permute the ordering of the ℓ-mers when
calculating minimizers and helps to avoid a bias toward low-complexity ℓ-mers when selecting the
minimizer of a k-mer [4,15]. To shuffle, we calculate the XOR value of the ℓ-mer and a predefined
constant and use this value as the candidate that is put in the deque. When the original ℓ-mer
value is needed again, the operation is reversed by XORing a second time with the same constant.
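A small example shows why XORing with a constant changes which ℓ-mer is selected while remaining reversible. The 2-bit encoding (A=0, C=1, G=2, T=3) and the 8-bit constant `TOGGLE` are assumptions chosen for a 4-mer toy case; they are not the values used by Kraken.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(lmer: str) -> int:
    """Pack an l-mer into an integer, 2 bits per base (assumed encoding)."""
    value = 0
    for base in lmer:
        value = (value << 2) | ENCODE[base]
    return value

TOGGLE = 0b10_01_11_00  # hypothetical shuffling constant for l = 4

def shuffle(value: int) -> int:
    """XOR with a fixed constant; applying it twice restores the input."""
    return value ^ TOGGLE

candidates = ["AAAA", "ACGT", "GCTA"]

# Plain ordering always favours the low-complexity AAAA (encoded as 0).
plain_min = min(candidates, key=encode)

# Shuffled ordering picks a different l-mer for the same candidate set.
shuffled_min = min(candidates, key=lambda m: shuffle(encode(m)))

print(plain_min, shuffled_min)
```

Since XOR with a constant is a bijection on ℓ-mer codes, it only permutes the ordering used to pick the minimum; no information is lost.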
Spaced seed usage
Spaced k-mers, a similar concept to spaced seeds, have been shown to improve the ability to
classify reads within the Kraken framework [28]. Kraken 2 uses a simple spaced seed approach where
a user specifies an integer s when building a database that indicates how many positions in the
minimizer will be masked (i.e., not considered when searching). Beginning with the
next-to-rightmost position in the minimizer, every other position is masked until s positions have
been masked. For example, if s = 3 and ℓ = 12, the positions in the bit string 1111 1101 0101 with
a "0" would be masked. When using Kraken 2, Kraken 1's classification results can be most closely
approximated by setting k = ℓ = 31 and s = 0, as these settings will avoid any minimizer-based
subsampling and spaced seed usage. Kraken 2's default value for s is 7 and was determined after the
analysis of the parameter sweep results we show in Additional file 1: Table S2.
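The masking rule is easy to express in code. The sketch below builds a position-level mask with one bit per minimizer position, as in the example bit string above; an implementation that packs bases into 2 bits each would need to widen every position accordingly. The function names are hypothetical.

```python
def spaced_seed_mask(l: int, s: int) -> int:
    """Mask with bit i (counted from the right) set to 1 if position i is compared.

    Starting at the next-to-rightmost position, every other position is
    cleared until s positions have been masked.
    """
    if s < 0 or 2 * s > l:
        raise ValueError("cannot mask s alternating positions in an l-mer")
    mask = (1 << l) - 1
    for i in range(s):
        mask &= ~(1 << (2 * i + 1))
    return mask

def apply_mask(lmer: str, mask: int) -> str:
    """Replace masked positions of an l-mer with '-' for display."""
    l = len(lmer)
    return "".join(
        base if (mask >> (l - 1 - i)) & 1 else "-"
        for i, base in enumerate(lmer)
    )

mask = spaced_seed_mask(l=12, s=3)
print(format(mask, "012b"))
print(apply_mask("ACGTACGTACGT", mask))
```

With the default ℓ = 31 and s = 7, the mask leaves ℓ − s = 24 compared positions.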
The canonical ℓ-mers that are minimizer candidates are masked with the spaced seed mask prior to
their insertion into the deque for the sliding window calculation. By performing canonicalization
of the minimizer candidates prior to applying the spaced seed mask, we ensure the result is the
same whether applied to the ℓ-mer or its reverse complement.
Kraken 1's sensitivity performance was governed by the value of k (the length of the searched
substring). By comparison, the use of spaced seeds and minimizer-based subsampling means that
Kraken 2's sensitivity performance will be largely governed by ℓ − s (the number of compared bases
in Kraken 2's searched substring). Thus, increasing s will generally increase sensitivity while
decreasing positive predictive value (Fig. 1e, Additional file 1: Table S2).
Hash-based subsampling
Kraken 2 estimates the required capacity of the hash table given the k, ℓ, and s values chosen
along with the sequence data in a database's reference genomic library. Some users will not have
access to large memory computers, and therefore, this estimate may be greater than the maximum
possible hash table size that they can work with. To aid such users, Kraken 2 allows them to
specify a maximum size when building a database. If the estimated required capacity is larger than
the maximum requested size, then the minimizers will be subsampled further using a hash function.
Given an estimated required capacity S′ and a maximum user-specified capacity of S (S < S′), we can
calculate the value f = S/S′, which is the fraction of available minimizers that the user will be
able to hold in their database. A minimum allowable hash value of v = (1 − f)M can also be
calculated, where M is the maximum value output by hash function h. Any minimizer in the reference
library with a hash code less than v will not be inserted into the hash table. This value v is also
provided to the classifier so that only minimizers with hash codes greater than or equal to v will
be used to probe the hash table, saving the search failure runtime penalty that would be incurred
by searching for minimizers guaranteed not to be in the hash table.
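As a worked illustration of the threshold v = (1 − f)M, the sketch below computes v with integer arithmetic and applies the same filter at build time and at query time. The function names and the toy numbers (a hash range of 0 to 1000) are assumptions chosen for readability; a real hash function would have M = 2^64 − 1 or similar.

```python
def min_allowed_hash(required: int, maximum: int, max_hash: int) -> int:
    """Threshold v = (1 - f) * M with f = maximum / required.

    required: estimated capacity needed for all minimizers (S' in the text).
    maximum:  user-specified capacity limit (S in the text).
    max_hash: largest value the hash function can output (M in the text).
    """
    if maximum >= required:
        return 0  # everything fits, so no hash-based subsampling is needed
    return max_hash - (max_hash * maximum) // required

def is_kept(hash_code: int, v: int) -> bool:
    """Same test used when inserting and when probing the table."""
    return hash_code >= v

# Toy numbers: room for a quarter of the minimizers.
v = min_allowed_hash(required=100, maximum=25, max_hash=1000)
kept_fraction = sum(is_kept(h, v) for h in range(1001)) / 1001
print(v, round(kept_fraction, 3))
```

Because the hash codes are roughly uniform, about a fraction f of distinct minimizers pass the filter, and a query minimizer that fails it can be skipped without touching the table.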
Evaluation of k-mer level discordance rates
At a k-mer level, there are two main types of discordance between Kraken 1 and Kraken 2's results:
those caused by two distinct k-mers sharing the same minimizer (a "minimizer collision") and those
caused by two distinct minimizers being indistinguishable by the CHT (a "hash table collision").
Minimizer collisions are not always damaging. When one occurs between k-mers from very closely
related genomes, such a collision might detect true homology even in the face of single nucleotide
polymorphisms and/or sequencing error. That said, minimizer collisions between k-mers from
distantly related genomes could produce either elevated LCA values (if both genomes are in the
reference library) or incorrectly classified k-mers (if one of the genomes is not in the reference
library). Hash table collisions are a consequence of the probabilistic nature of the CHT and can
also cause either elevated LCA values or incorrectly classified k-mers (Additional file 2:
Figure S3). We note that these different discordant results are all at a k-mer level and may not
always affect a query sequence's classification due to the many k-mers' worth of data that are used
to classify a query sequence; aside from slight modifications to handle the subsampling methods we
use in Kraken 2, the classification method of Kraken 2 is identical to Kraken 1.
We wished to estimate the rate at which these collisions would cause discordance at a k-mer level
between the Kraken 1 and Kraken 2 results. To do so, we selected a specific bacterial genome for
which we had neighboring genomes at each taxonomic rank from species to phylum. The selected genome
was our "reference sequence," and eight others were progressively more taxonomically distant from
the reference sequence. We list the nine genomes used in these experiments in Additional file 1:
Table S6. We additionally created a synthetic genome with 4 Mbp of uniformly random DNA. Together,
these ten sequences formed a set of "query sequences" and were the basis for our evaluation of
collision rates. For these experiments, we used the default Kraken 2 values of k = 35, ℓ = 31, and
s = 7, unless otherwise noted.
To determine the rates of discordance caused by minimizer collisions, we compared each of the ten
query sequences' k-mers to the set of reference sequence k-mers. For each sequence, the minimizer
collision rate is the proportion of distinct k-mers in a query sequence that (a) are not in the set
of reference sequence k-mers and (b) share a minimizer with a reference sequence k-mer. The various
sequences' minimizer collision rates are summarized in Additional file 1: Table S7. We hypothesized
that the minimizer collision rate would be influenced by the length of the minimizer used, due to
the length's direct relationship to the number of possible minimizers. To test this, we repeated
the minimizer collision rate estimation experiment focusing on the reference genome and using the
random synthetic genome as the sole query sequence. Setting k = 35 and s = 0, we varied the ℓ
parameter from 8 to 31. Minimizer lengths greater than 15 had collision rates under 1%. Minimizer
lengths greater than 22 had 0 collisions. The full results are shown in Additional file 2:
Figure S5.
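The minimizer collision rate defined above can be computed directly from its two conditions. The following sketch does so for short strings with s = 0; it ignores XOR shuffling, treats k-mers as strand-specific, and uses hypothetical function names, so it illustrates the definition rather than reproducing the experiment.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(lmer: str) -> str:
    """Lexicographically smaller of an l-mer and its reverse complement."""
    return min(lmer, lmer.translate(COMPLEMENT)[::-1])

def distinct_kmers(seq: str, k: int) -> set:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minimizer(kmer: str, l: int) -> str:
    """Smallest canonical l-mer inside a k-mer (no spaced seed, no shuffling)."""
    return min(canonical(kmer[i:i + l]) for i in range(len(kmer) - l + 1))

def minimizer_collision_rate(query: str, reference: str, k: int, l: int) -> float:
    """Proportion of distinct query k-mers that (a) are absent from the
    reference k-mer set and (b) share a minimizer with a reference k-mer."""
    ref_kmers = distinct_kmers(reference, k)
    ref_minimizers = {minimizer(kmer, l) for kmer in ref_kmers}
    query_kmers = distinct_kmers(query, k)
    collisions = sum(
        1 for kmer in query_kmers
        if kmer not in ref_kmers and minimizer(kmer, l) in ref_minimizers
    )
    return collisions / len(query_kmers)

# AAACT is not a reference k-mer but shares the minimizer AAA with AAACC.
print(minimizer_collision_rate("AAACT", "AAACC", k=5, l=3))
```

Shorter minimizers leave fewer possible ℓ-mers, so unrelated k-mers are more likely to share one; this matches the trend reported for ℓ between 8 and 31.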
To determine the rates of discordance caused by hash table collisions, we compared each of the ten
query sequences' minimizers to a CHT populated with the reference sequence minimizers. The CHT was
created with a load factor of 70% and 15 bits reserved for the truncated hash code (the same
parameters used in Kraken 2's standard database in September 2018). For each sequence, the hash
table collision rate is the proportion of distinct minimizers in a query sequence that (a) are not
minimizers in the set of reference sequence minimizers and (b) are reported by the CHT as being
inserted in the hash table. The various sequences' hash table collision rates are summarized in
Additional file 1: Table S8. To investigate the impact of load factor and truncated hash code size
on hash table collision rates, we repeated the hash table collision rate experiment, but focused
only on the reference genome and used the random synthetic genome as the sole query sequence. We
used the same default values of k, ℓ, and s as before (35, 31, and 7, respectively) and calculated
hash table collision rates while varying both the load factor and truncated hash code size. The
impact of these two parameters on hash table collision rates is shown in Additional file 2:
Figure S4. The parameters adopted for Kraken 2's default mode had an error rate of 0.016%,
consistent with the results seen when comparing genomes of different species (Additional file 1:
Table S8).
Processing of a standard genomic reference library
The CHT's modest memory requirements, and the additional savings yielded by minimizer-based
subsampling, allow more reference genomic data to be included in Kraken 2's standard reference
library. Whereas Kraken 1's default database had data from archaeal, bacterial, and viral genomes,
Kraken 2's default database additionally includes the GRCh38 assembly of the human genome [29] and
the "UniVec_Core" subset of the UniVec database [30]. We include these in Kraken 2's default
database to allow for easier classification of human microbiome reads and more accurate
classification of reads containing vector sequences.
Additionally, we have implemented masking of low-
complexity sequences from reference sequences in Kraken 2
by using the "dustmasker" [31] (for nucleotide sequences)
and "segmasker" [32] (for protein sequences)
tools from NCBI. Using the tools' default settings, nucleotide
and protein sequences are checked for low-
complexity regions, and those regions identified are
masked and not processed further by the Kraken 2 data-
base building process. In this manner, we seek to reduce
false positives resulting from these low-complexity se-
quences, similar to the build process for Centrifuge [1].
Populating the Kraken 2 hash table
Kraken 2 begins building a CHT by first estimating the
number of distinct minimizers present in the reference
library for the selected values of k, ℓ, and s. This is done
through a form of zeroth frequency moment estimation
[33] where Kraken 2 creates a small set structure implemented
with a traditional hash table. In this set Q, we
insert only the distinct minimizers that satisfy the criterion
h(m) mod F < E, where h(m) is the hash code of the
minimizer m and E ≪ F (in practice, Kraken 2 uses E = 4
and F = 1024). We then find the estimate of the total
number of distinct minimizers by multiplying the num-
ber of satisfactory distinct minimizers (|Q|) by F/E. This
form of estimation requires storing in memory only a
fraction of all distinct minimizers (approximately E/F)
and allows us to quickly set the capacity of our CHT
properly without needing to first store all elements in it.
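The estimation step above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Kraken 2's code: minimizers are taken to be integers, and `mix64` (the MurmurHash3 64-bit finalizer) stands in for whatever hash function h is actually used. Only minimizers with h(m) mod F < E are kept in the set Q, and |Q| is scaled by F/E.

```python
MASK64 = (1 << 64) - 1


def mix64(x):
    """Stand-in for the hash function h; an assumption for illustration."""
    x &= MASK64
    x ^= x >> 33
    x = (x * 0xFF51AFD7ED558CCD) & MASK64
    x ^= x >> 33
    x = (x * 0xC4CEB9FE1A85EC53) & MASK64
    x ^= x >> 33
    return x


def estimate_distinct(minimizers, E=4, F=1024):
    """Estimate the number of distinct minimizers in a stream while
    storing only those whose hash code satisfies h(m) mod F < E."""
    Q = set()
    for m in minimizers:
        if mix64(m) % F < E:
            Q.add(m)  # duplicates collapse, so only distinct values count
    return len(Q) * F / E
```

With E = 4 and F = 1024, roughly 1 in 256 distinct minimizers is held in memory, and repeated occurrences of a minimizer do not change the estimate because the same hash test always gives the same answer.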
After estimating the number of distinct minimizers
D= |Q|(F/E) present in the reference library, Kraken 2
then allocates memory for a CHT containing D/0.7 hash
table cells. We selected the divisor of 0.7 so that the re-
sultant hash table will have approximately 30% of its
cells remain empty after the population of the CHT (i.e.,
the CHT will have a load factor of 70%). As stated earl-
ier, the cells of this table are 32 bits each, and so the
total memory required for Kraken 2's CHT is 32D/0.7 bits
or 4D/0.7 bytes.
Kraken 2 then proceeds to scan each genome in the
reference library. Each genome must be associated with
a taxonomic ID number so that Kraken 2 can calculate
LCA values; genomes without associated taxonomy IDs
are therefore not processed by Kraken 2. For a
minimizer M in a genome G, Kraken 2 attempts to insert
a key-value pair containing M (key) and the taxonomic
ID T (value) associated with G into the CHT. If
the CHT does not report that M was previously inserted,
then the <M, T> key-value pair will be inserted, indicating
that the LCA of M is currently T. If M was previously
inserted into the CHT, with LCA value T*, then its
associated LCA value is updated to equal the LCA of T
and T*. All minimizers are processed in this way; once
the reference library's minimizers are all processed, the
LCA values are properly set for each of the minimizers
and the database build is complete. The LCA operation
is both commutative and associative, facilitating parallel
index construction.
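The insert-or-update rule described above can be sketched as follows. This is a simplified illustration: an ordinary Python dict stands in for the CHT, the taxonomy is a hypothetical child-to-parent mapping in which the root is its own parent, and the function names are ours.

```python
def lca(parent, a, b):
    """Lowest common ancestor of taxa a and b in a rooted tree given as a
    child -> parent mapping (the root is its own parent)."""
    ancestors = {a}
    while parent[a] != a:
        a = parent[a]
        ancestors.add(a)
    while b not in ancestors:
        b = parent[b]
    return b


def build_minimizer_table(genomes, parent):
    """genomes: iterable of (taxonomy ID, minimizers) pairs.
    Returns a minimizer -> LCA taxonomy ID mapping."""
    table = {}
    for taxid, minimizers in genomes:
        for m in minimizers:
            if m in table:
                # Seen before: merge the stored LCA with this genome's taxon.
                table[m] = lca(parent, table[m], taxid)
            else:
                table[m] = taxid
    return table


# Toy taxonomy: 1 is the root; 4 and 5 are siblings under 3; 3 and 6 sit under 2.
TOY_PARENT = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3, 6: 2}
TOY_GENOMES = [(4, ["AAC", "ACG"]), (5, ["ACG", "CGT"]), (6, ["CGT"])]
```

Because the LCA operation is commutative and associative, processing `TOY_GENOMES` in any order yields the same table, which is what makes parallel construction possible.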
Classification of a sequence fragment with Kraken 2
Kraken 2 classifies sequence fragments similarly to Kra-
ken 1, with modifications to facilitate minimizer- and
hash-based subsampling. For each k-mer in an input se-
quence, Kraken 2 finds its minimizer and, if it is distinct
from the previous k-mer's minimizer, uses it as a key to
probe the CHT. If the minimizer matches a key in the
CHT, Kraken 2 considers the associated LCA value to
be the k-mer's LCA (Fig. 1b). Classification then proceeds
in the same manner as Kraken 1, taking note of
how many k-mer hits mapped to each taxon, construct-
ing a pruned classification tree, and using the leaf of the
maximally scoring root-to-leaf path of that tree to clas-
sify the sequence [4]. If hash-based subsampling was
used to build the CHT, each minimizer has its hash code
compared against the table's maximum allowable hash
code, and minimizers with higher-than-allowed hash
codes are not searched against the CHT. Any k-mer
containing an ambiguous nucleotide code is also not
searched against the CHT.
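The root-to-leaf scoring step can be sketched as below, given per-taxon hit counts and a toy taxonomy. This is an illustrative reading of the procedure, not the tool's source code: the child-to-parent mapping, the return value 0 for an unclassified fragment, and the tie-break (taking the LCA of equally scoring taxa) are assumptions of this sketch, and any confidence thresholding is omitted.

```python
def lineage(parent, taxon):
    """Taxon followed by its ancestors up to the root (root is its own parent)."""
    path = [taxon]
    while parent[taxon] != taxon:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lca(parent, a, b):
    """Lowest common ancestor of two taxa."""
    ancestors = set(lineage(parent, a))
    while b not in ancestors:
        b = parent[b]
    return b


def classify(hits, parent):
    """hits: taxon -> number of k-mer hits. Each hit taxon is scored by the
    total hits along its path to the root; the best-scoring taxon is returned."""
    if not hits:
        return 0  # unclassified (assumed sentinel)
    scores = {t: sum(hits.get(a, 0) for a in lineage(parent, t)) for t in hits}
    best = max(scores.values())
    tied = [t for t, s in scores.items() if s == best]
    result = tied[0]
    for t in tied[1:]:
        result = lca(parent, result, t)  # assumed tie-break
    return result


# Toy taxonomy: 1 is the root; 4 and 5 are siblings under 3; 3 and 6 sit under 2.
TOY_PARENT = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3, 6: 2}
```

With positive counts, a taxon that has a hit descendant always scores lower than that descendant, so the maximum is attained at a leaf of the pruned tree.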
We note that although Kraken 2 only uses the
minimizer to query the CHT, the LCA found via this
query is assigned by Kraken 2 to the k-mer rather than
only the minimizer. This means that a stretch of n overlapping
k-mers that share a minimizer will all be
assigned the same LCA value by Kraken 2 and that n
hits to that LCA will be part of the classification tree,
even though only one distinct minimizer was present
among the k-mers.
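A small sketch makes the distinction between lookups and hits concrete. It simplifies heavily and should not be read as Kraken 2's algorithm: the minimizer here is the lexicographically smallest ℓ-mer of each k-mer, reverse complements and spaced seeds are ignored, and a dict stands in for the CHT.

```python
def count_kmer_hits(seq, k, l, table):
    """Return (hits, lookups): hits maps taxon -> number of k-mers assigned
    to it; lookups counts how many times the table was actually probed."""
    hits = {}
    lookups = 0
    prev_minimizer = None
    prev_taxon = None
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in "ACGT" for base in kmer):
            prev_minimizer = None  # ambiguous k-mers are skipped entirely
            continue
        minimizer = min(kmer[j:j + l] for j in range(k - l + 1))
        if minimizer != prev_minimizer:
            # Only probe the table when the minimizer changes.
            prev_taxon = table.get(minimizer)
            prev_minimizer = minimizer
            lookups += 1
        if prev_taxon is not None:
            # Every k-mer sharing the minimizer is credited with the hit.
            hits[prev_taxon] = hits.get(prev_taxon, 0) + 1
    return hits, lookups
```

For example, `count_kmer_hits("ACGTACGTAC", 5, 3, {"ACG": 4})` credits four of the six k-mers to taxon 4 while probing the table four times, since consecutive k-mers with an unchanged minimizer reuse the previous result.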
Parsing of input files
Previous work by Langmead et al. [17] has shown the
importance of removing parsing work from critical sec-
tions, i.e., portions of the program that can be executed
by only 1 thread at a time. Kraken 2 uses 2 different
methods to defer a majority of parsing work from the
critical section to thread-local execution. The first
method (referred to as "batch deferred" parsing by Langmead
et al.) reads a set number of lines (40,000 in Kraken 2)
of input into a thread-local buffer within the critical
section and then parses the input within a single thread's
execution. This method is used to perform reading of
paired-end FASTQ input, where the lengths of a fragment's
mates can be different and reading a consistent
number of lines from both input files is necessary to en-
sure a thread is working with complete mate pairs. For
FASTA or single-end FASTQ input, Kraken 2 instead
uses a more efficient method that reads in a set number
of bytes (3 MB in Kraken 2) of input into a thread-local
buffer within the critical section and continues reading
input into that buffer until a record boundary is found,
at which point a thread leaves the critical section and
parses its input. These modifications allow Kraken 2 to
more efficiently use multiple threads than did Kraken 1
(Additional file 1: Table S4).
Translated search
To perform a translated search, Kraken 2X first builds a
database from a set of reference proteins in the same
manner that Kraken 2 does for nucleotide sequences.
The usual alphabet of 20 amino acids is reduced to 15
using the 15-character alphabet of Solis [34]; we add a
single additional value representing selenocysteine, pyr-
rolysine, and translation termination (stop codons). This
gives us 16 characters in our reduced alphabet, allowing
us to represent a character with 4 bits. Minimizers of
reference proteins are calculated using the same
methods for nucleotide sequences (i.e., using spaced
seeds if requested and a sliding window minimum algo-
rithm), but reverse complements are not calculated and
by default k = 15, ℓ = 12, and s = 0.
When searching against a protein minimizer database,
Kraken 2X translates all six reading frames of the input
query DNA sequences into the reduced amino acid al-
phabet. Minimizers from all six frames are pooled and
used to query the CHT, and therefore, all contribute to
the Kraken 2X classification of a query sequence.
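The six-frame translation step can be sketched as follows. The sketch uses the standard genetic code and emits ordinary one-letter amino acid codes; the further mapping to the 15-character reduced alphabet is deliberately omitted because its exact groupings are not given here. Writing "X" for codons that contain an ambiguous base is an assumption of this sketch.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons -> 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames
```

Minimizers would then be computed on each of the six translated strings and pooled before querying the protein database.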
Generation of data for strain exclusion experiments
We downloaded the reference genome and protein data
used for the clade exclusion experiments from NCBI in
January 2018 from the archaeal, bacterial, and viral do-
mains. We also downloaded the taxonomy from NCBI at
this same time. Using the taxonomy ID information for
each sequence, we obtained a set of all taxonomy IDs
represented by the reference genomes. From this set, we
selected a subset of "eligible strains" that had both two
sister sub-species taxa present and two sister species
taxa present in the set of reference genomes. We se-
lected this subset by examining only those nucleotide
sequences with the phrase "complete genome" in their
FASTA record header but excluding those that were
plasmids or second or third chromosomes. In this man-
ner, we sought to ensure we did not count a genome
multiple times due to multiple sequences being associ-
ated with that genome. From the eligible strain subset,
40 prokaryotic taxonomy IDs and 10 viral taxonomy IDs
were selected arbitrarily to be the strains of origin for
our experiments. The strains selected are listed in Add-
itional file 1: Table S3.
After selecting the taxonomy IDs that represented the
strains of origin, we gathered all of the nucleotide sequences
we had downloaded (including chromosome
and plasmid sequences excluded from our examination
when creating the eligible strain subset) into a single file
and did the same for the protein sequences. For both
the nucleotide and protein files, we placed sequences
with taxonomy IDs that were outside the strains of ori-
gin into a strain exclusion reference file. Then, for each
taxonomy ID in our strain of origin set, we created a single
"strain reference" file containing all nucleotide sequences
that were associated with that taxonomy ID.
We used Mason 2 [35] to simulate 100-bp paired-end
Illumina sequence data from our strains of origin, with
500,000 fragments being simulated from each strain.
When simulating the reads, we used the default options
for simulating sequencing errors with Mason 2's
"mason_simulator" command. These defaults caused the
simulator to simulate sequencing errors at rates of 0.4%
for mismatches, 0.005% for insertions, and 0.005% for
deletions. We combined simulated reads from the
strains of origin into a single set of read data. We also
shuffled the order of the fragments in this set to control
for ordering effects that might affect runtime.
Execution of strain exclusion experiments
To evaluate the accuracy and computational perform-
ance of Kraken 2, we compared it to Kraken 1 and sev-
eral other programs. In selecting these programs, we
concentrated on three main properties. First, because
Kraken's principal aim is to provide high-speed taxonomic
sequence classification, we looked for taxonomic
sequence classification tools that were high in classifica-
tion speed (within approximately an order of magnitude
of Kraken 1). Secondly, because our experiments rely on
holding fixed the reference data between programs, we
selected tools which had the ability to customize the
underlying reference sequence set and taxonomy using
whole-genome reference data. These two requirements
led to our selection of KrakenUniq, CLARK, Centrifuge,
and Kaiju as comparator programs. We note that these
requirements exclude an accuracy evaluation against
programs that are not taxonomic sequence classifiers
(programs that output a mapping of sequences to taxa).
Sequence abundance estimation programs (which map
taxa to sequence counts or frequencies), such as
Bracken, and population abundance estimation pro-
grams (which map taxa to organism counts or frequen-
cies), such as MetaPhlAn [36], address problems that are related to,
but different from, those addressed by the programs in our comparator set. For
example, Bracken does not actually change any of the
taxonomic labels associated with the sequenced frag-
ments but rather adjusts the fragment counts associated
with low-rank taxa. We also note that although MetaPh-
lAn does, as part of its operation, classify a small propor-
tion of reads that map to marker genes, this proportion
can be less than 10% of reads [6] in whole-genome shot-
gun metagenomic experiments (such as ours), and thus,
MetaPhlAn would yield far lower per-sequence sensitiv-
ity relative to the tools in our comparison.
In brief, we used the nucleotide search-based classifi-
cation programs (Kraken 1, KrakenUniq, Kraken 2,
CLARK, and Centrifuge) to build a strain-exclusion
database from reference genomes, and we used the
translated search-based classification programs (Kraken
2X and Kaiju) to build a strain-exclusion database from
reference protein sequences. We compared Kraken 2
and Kraken 2X (both using the code base from Kraken
2.0.8) against Kraken 1.1.1, KrakenUniq 0.5.6, CLARK
1.2.4, Centrifuge 1.0.3-beta, and Kaiju 1.5.0. Because
CLARK requires a rank to be specified at the time of
building a database, and our evaluations center on
genus-rank accuracy, we built a CLARK database for the
genus rank for our evaluation work in this paper.
Classifiers received the simulated read data as paired-
end FASTQ input. To evaluate runtime and memory
usage, we sought to eliminate the performance impact of
reading or writing from disk or from a network storage
location. To accomplish this, we copied simulated read
data and classifier databases onto a random access mem-
ory (RAM) filesystem and directed the classifiers to read
input from and write output to that RAM filesystem.
Accuracy was evaluated on a smaller subset of the
simulated data containing 1000 fragments per genome
of origin or 50,000 fragments in total. To obtain process-
ing speed and memory usage information, we ran each
classifier using 16 threads on 25 million sequences'
worth of simulated read data. We used the taskset command
to restrict each classifier to the appropriate number
of processors (e.g., "taskset -c 0-15" was used with
our 16-thread experiments); this ensures that a classifier
that uses an external process to aid in its execution has
that process's runtime properly counted against its runtime
here. The "/usr/bin/time -v" command provided us
with elapsed wall clock time and maximum resident set
size data (memory usage) for each experiment and
allowed us to verify that no major page faults were in-
curred by a classifier during its execution (the absence
of which indicates minimal disk- or network-related in-
put/output effects on the runtime). Classifiers were run
on a computer with 32 Xeon 2.3 GHz CPUs (16 hyper-
threaded cores) and 244 GB of RAM.
Evaluation of accuracy in strain exclusion experiments
We evaluated the accuracy of each classifier at a per-
fragment level, with respect to a particular taxonomic
rank. Each fragment had a known true subspecies taxon
of origin, which implied a true taxon of origin at both
the species and genus ranks, which is where we mea-
sured accuracy. We now describe how we counted true-
positive (TP), false-negative (FN), vague positive (VP),
and false-positive (FP) results at the genus and species
levels. We describe this at the genus level specifically,
but the analogous procedure was also used at the species
level. For a given true genus of origin, a TP classification
is a classification at that genus or at a descendant of that
genus. Because we excluded the strains of origin from
our reference databases, we expected all classifiers to
make incorrect strain-level classifications and so allow
classifications of descendants of the true genus to be
judged as TP. We define an FN classification as a failure
of a classifier to assign any classification to a sequence
and a VP classification as a classification at an ancestor
of the true genus of origin. Finally, we define an FP clas-
sification as a classification that is incorrect, that is, not
at the true genus of origin nor an ancestor or descend-
ant of that true genus. These four categories are mutu-
ally exclusive, and all fragments run through a classifier
will have their classification (or lack thereof) categorized
by one of these categories.
These categories are different from those typically
used for binary classification problems; they are used
here because these methods can make classifications that
are not at leaves of the taxonomic tree but are still cor-
rect. For example, a classification of an Escherichia coli
fragment as Escherichia would be evaluated as TP for
genus-rank accuracy, but as VP for species-rank accur-
acy. Classification of that same fragment as Vibrio would
be evaluated as FP at any rank below class (because the
LCA of Vibrio and Escherichia is the class taxon Gam-
maproteobacteria) and would be evaluated as TP for the
class rank and above.
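The four categories can be expressed as a short function. The sketch below uses a toy taxonomy with intermediate ranks omitted and string names in place of taxonomy IDs; `TOY_PARENT`, the function names, and the use of `None` for an unclassified fragment are ours and serve only to illustrate the rules stated above.

```python
def ancestors_or_self(parent, taxon):
    """Taxon followed by all of its ancestors (the root is its own parent)."""
    lineage = [taxon]
    while parent[taxon] != taxon:
        taxon = parent[taxon]
        lineage.append(taxon)
    return lineage


def categorize(call, truth, parent):
    """call: taxon assigned by the classifier (None if unclassified).
    truth: the true taxon of origin at the rank being evaluated."""
    if call is None:
        return "FN"
    if truth in ancestors_or_self(parent, call):
        return "TP"  # call is the true taxon or one of its descendants
    if call in ancestors_or_self(parent, truth):
        return "VP"  # call is an ancestor of the true taxon
    return "FP"


# Toy taxonomy for the Escherichia/Vibrio example (intermediate ranks omitted).
TOY_PARENT = {
    "root": "root",
    "Gammaproteobacteria": "root",
    "Escherichia": "Gammaproteobacteria",
    "Escherichia coli": "Escherichia",
    "Vibrio": "Gammaproteobacteria",
}
```

Under these rules, `categorize("Escherichia", "Escherichia coli", TOY_PARENT)` gives "VP" (species rank), while `categorize("Escherichia", "Escherichia", TOY_PARENT)` gives "TP" (genus rank), matching the example in the text.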
Using these categories, we define rank-level sensitivity
as the proportion of input fragments that were true-
positive classifications, or TP/(TP + VP + FN + FP). We
define rank-level positive predictive value (PPV) as the
proportion of classifications that were true positives (ex-
cluding vague positives), or TP/(TP + FP). Along with
these definitions of rank-level sensitivity and PPV, we
also define an F1-measure as the harmonic mean of
those two values.
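These definitions translate directly into code. The following sketch computes the three quantities from the four category counts; the function name and the choice to return 0.0 when a denominator would be zero are ours.

```python
def rank_metrics(tp, vp, fn, fp):
    """Return (sensitivity, PPV, F1) for one taxonomic rank."""
    total = tp + vp + fn + fp
    sensitivity = tp / total if total else 0.0
    # Vague positives are excluded from the PPV denominator.
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    if sensitivity + ppv == 0:
        return sensitivity, ppv, 0.0
    f1 = 2 * sensitivity * ppv / (sensitivity + ppv)  # harmonic mean
    return sensitivity, ppv, f1
```

For instance, with 80 TP, 10 VP, 5 FN, and 5 FP fragments, sensitivity is 80/100 and PPV is 80/85, so vague positives lower sensitivity without lowering PPV.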
Evaluation of thread scaling efficiency
To evaluate Kraken 1's and Kraken 2's ability to efficiently
use multiple threads, we performed an experiment
using the strain exclusion databases and simulated
read data we described previously in this section. We ran
both Kraken 1 and Kraken 2 on the same data using 1,
4, and 16 threads. The 2 programs were run once on the
data as paired-end read data and once as single-end read
data. Read data and Kraken database files were all placed
on a RAM filesystem, and the "taskset" command was
used to limit the classifier programs to only as many
cores as the number of threads being used. These condi-
tions mirror those of our main strain exclusion experi-
ments, only varying the number of threads between the
various runs of the classifiers. The results for this experi-
ment are shown in Additional file 1: Table S4. In short,
Kraken 2 exhibits superior speedup with respect to the
number of threads allocated compared to Kraken 1. This
is especially true for paired-end reads.
FDA-ARGOS experimental concordance evaluation
The FDA-ARGOS (dAtabase for Reference Grade mi-
crObial Sequences) project provides sequencing experi-
ments for many microbial isolates [22]. We used the
NCBI Sequence Read Archive [37] to find all 1392 experiments
related to the FDA-ARGOS project (accession
PRJNA231221). Because some tools are unable to prop-
erly process reads of differing lengths, we selected only
those 263 experiments that were run on an Illumina
HiSeq 4000 instrument and produced 151-bp reads. We
then randomly selected 1 experiment from each genus
to download and used reservoir sampling to select a sub-
set of 10,000 paired-end fragments from each selected
experiment. We also removed experiments for which
our strain-exclusion reference genome set did not have a
reference genome of the same species as the sequenced
isolate. These steps yielded 25 experiments' worth of
data, for 250,000 paired-end fragments in total. Using
the strain-exclusion databases created earlier, we then
used each classifier to classify the data and examined the
percentage of each experiment's fragments that were
classified.
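Reservoir sampling draws a fixed-size uniform sample from a stream in a single pass without knowing the stream's length in advance. The sketch below shows the classic form of the technique (often called Algorithm R); whether this exact variant was used for the subsampling described above is not stated, so it should be read as an illustration only.

```python
import random


def reservoir_sample(items, n, rng=None):
    """Return n items chosen uniformly at random from an iterable of
    unknown length (or all items if there are fewer than n)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(items):
        if i < n:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < n:
                reservoir[j] = item  # replace with probability n / (i + 1)
    return reservoir
```

For paired-end data, each item would be a whole fragment (both mates together), so that mate pairs stay intact in the sample.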
Because the FDA-ARGOS data are from real sequencing
experiments, several factors could explain discordance
between a classifier's results and the experiment's
assigned taxa, including the evolutionary distance be-
tween sequences and reference data, low-quality sequen-
cing runs, and contamination. The true causes of such
discordance may not be discernable, and even when they
are, they often require an in-depth examination of the
sequencing and reference data. For these reasons, we do
not report sensitivity and PPV for these data because we
cannot be certain of the true taxonomic origin of each
individual fragment of real sequencing data. Rather, we
evaluated the concordance of the SRA-assigned taxa
with the fragments' classifications at the genus rank and
report for each classifier the following quantities: (a) the
percentage of fragments with a concordant classification
at the genus rank, (b) the percentage of fragments with a
discordant classification at the genus rank, (c) the per-
centage of fragments with a classification of an ancestor
of the SRA-assigned genus taxon, and (d) the percentage
of fragments that were not classified. The results of this
concordance evaluation are provided in full in Add-
itional file 1: Table S5.
Parameter sweeps
We examined various values for parameters to ensure
Kraken 2's default parameters would provide an advantageous
balance of accuracy, classification speed, and
memory usage. Specifically, we looked at parameters relating
to minimizer-based subsampling (k and ℓ), hash-based
subsampling (f = S′/S), and spaced seed usage (s).
For Kraken 2, we performed two parameter sweeps, with
one focused on minimizer-based subsampling and one
focused on hash-based subsampling. The first parameter
sweep looked at values for ℓ in the interval [25, 31],
values for k in the interval [ℓ, ℓ + 10], and values for s in
the interval [0, 7]; the second parameter sweep looked at
values of ℓ in the interval [25, 31], fixed k = ℓ, values for
f in the set {0.125, 0.25, 0.5}, and values for s in the
interval [0, 7]. We also performed a third parameter
sweep, focused on translated search (Kraken 2X), where
we looked at values for ℓ in the interval [11, 15], values
for k in the interval [ℓ, ℓ + 3], and values for s in the
interval [0, 3].
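For orientation, the three grids can be enumerated as below. The sketch assumes the intervals are inclusive integer ranges and that the full Cartesian product was explored; whether every combination was actually run is not stated here, so the counts are upper bounds under that assumption. In the tuples, `l` denotes the minimizer length ℓ.

```python
def minimizer_sweep():
    """(k, l, s) combinations for the minimizer-based subsampling sweep."""
    return [
        (k, l, s)
        for l in range(25, 32)      # l in [25, 31]
        for k in range(l, l + 11)   # k in [l, l + 10]
        for s in range(0, 8)        # s in [0, 7]
    ]


def hash_sweep():
    """(k, l, s, f) combinations for the hash-based subsampling sweep (k = l)."""
    return [
        (l, l, s, f)
        for l in range(25, 32)
        for f in (0.125, 0.25, 0.5)
        for s in range(0, 8)
    ]


def translated_sweep():
    """(k, l, s) combinations for the translated search (Kraken 2X) sweep."""
    return [
        (k, l, s)
        for l in range(11, 16)      # l in [11, 15]
        for k in range(l, l + 4)    # k in [l, l + 3]
        for s in range(0, 4)        # s in [0, 3]
    ]
```

Under these assumptions the grids contain 616, 168, and 80 combinations, respectively.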
Each parameter sweep used the strain exclusion data
that we previously created to build databases, and we
used the same accuracy and timing methods for these
databases that we did in the cross-classifier comparison.
The results of the first two parameter sweeps, run on
nucleotide databases, are provided in Additional file 1:
Table S2, while the results of the third parameter sweep,
run on protein databases, are provided in Add-
itional file 1: Table S9. We note that the parameter
sweeps yielded a large number of parameter combina-
tions giving approximately the same, near-optimal levels
of accuracy. This suggests performance is not overly sen-
sitive to particular parameter settings.
Evaluation of database sizes of Kraken 1 and Kraken 2
We began by shuffling the reference DNA sequences in
our strain exclusion set and recorded the total number
of bases in each sequence. We modified Kraken 2's capacity
estimator to report an estimate of the number of
distinct minimizers after each sequence processed, ra-
ther than only after all sequences are processed. Finally,
we ran the capacity estimator twice on the shuffled
genomic data, once with k = 31, ℓ = 31, s = 0 (corresponding
to Kraken 1's defaults, effectively counting the
number of distinct k-mers) and again with k = 35, ℓ = 31,
s = 7 (Kraken 2's defaults).
The size of a Kraken 1 database is a function of the
number of distinct k-mers in the reference data. If there
are X distinct k-mers, the size of Kraken 1's database.kdb
(sorted list of k-mer/LCA pairs) file will be 1072 + 12X
bytes; the 1072-byte term is the size of the Jellyfish/Kra-
ken header data, and 12 bytes are used for each k-mer/
LCA pair. The database.idx (minimizer offset index) file
is 8,589,934,608 bytes, a function of Kraken 1's default
minimizer length of 15. The full database size is the sum
of the sizes of those two files.
Similarly, the size of a Kraken 2 hash table is a func-
tion of the estimate of the number of distinct minimizers
in the reference data. If there are an estimated Y distinct
minimizers, Kraken 2's hash table will be 32 + 4Y/0.7
bytes in size (representing 32 bytes of metadata and
using 4 bytes per cell and a load factor of 0.7).
We used the estimates of the numbers of distinct k-
mers and distinct minimizers to calculate the database
sizes of Kraken 1 and Kraken 2 for successively larger
subsets of the strain exclusion set. The results of this
evaluation are shown in Additional file 2: Figure S1, with
raw data available in Additional file 1: Table S10.
Reviewing the results when all genomic sequences
were added, our results indicate that the number of dis-
tinct k-mers is approximately 3.1 times the number of
distinct minimizers for the settings we have selected for
Kraken 1 and Kraken 2. It is not possible to draw a dir-
ect relationship between the number of distinct k-mers
or minimizers and the number of sequence bases proc-
essed. For example, homology between similar strains
and species will cause the number of distinct k-mers/
minimizers to grow slower than the total number of
bases. Examining the linear-term coefficients from the
database-size expressions (12X and 4Y/0.7) indicates a
Kraken 2 database will be approximately 15% of the size
of a Kraken 1 database of the same reference data; this is
because X ≈ 3.1Y, and (4/0.7)/(12 × 3.1) ≈ 0.15. When we
examine the full reference set, the 15% estimate is consistent
with the ratio of Kraken 2's hash table size
(10.456 GB) to Kraken 1's database.kdb file size (77.490 −
8.589 = 68.901 GB), which is 10.456/68.901 = 0.152.
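The size expressions and the 15% figure can be checked with a few lines of arithmetic. The function and constant names below are ours; the numeric constants are those quoted in the text. As an aside that is our own observation rather than a statement from the text, the database.idx size equals 8 × 4^15 + 16 bytes.

```python
KRAKEN1_HEADER_BYTES = 1072
KRAKEN1_IDX_BYTES = 8_589_934_608  # database.idx for a minimizer length of 15
KRAKEN2_METADATA_BYTES = 32


def kraken1_kdb_bytes(x):
    """database.kdb size for x distinct k-mers (12 bytes per k-mer/LCA pair)."""
    return KRAKEN1_HEADER_BYTES + 12 * x


def kraken1_total_bytes(x):
    """Full Kraken 1 database size: database.kdb plus database.idx."""
    return kraken1_kdb_bytes(x) + KRAKEN1_IDX_BYTES


def kraken2_table_bytes(y, load_factor=0.7):
    """Hash table size for an estimated y distinct minimizers (4-byte cells)."""
    return KRAKEN2_METADATA_BYTES + 4 * y / load_factor


def linear_term_ratio(kmers_per_minimizer=3.1, load_factor=0.7):
    """Ratio of the per-element costs when X = kmers_per_minimizer * Y."""
    return (4 / load_factor) / (12 * kmers_per_minimizer)
```

Evaluating `linear_term_ratio()` gives about 0.154, in line with the observed ratio 10.456/68.901 of about 0.152.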
Bracken experiments on strain exclusion data
We first generated Bracken metadata from each of the
Kraken 1 and Kraken 2 reference libraries used in the
strain exclusion experiments. We then used Bracken to
estimate genus- and species-level abundance from the
Kraken 1 and Kraken 2 classification results on the pro-
karyotic strain exclusion read data. Due to the low se-
quence similarity between our simulated viral reads and
the strain-exclusion reference data, none of the nucleo-
tide search programs exhibited high sensitivity on these
reads, including Kraken 1 and Kraken 2. Such low classi-
fication rates prevent Bracken from inferring taxonomy
for a large proportion of the viral reads. Additionally, the
taxonomy for viruses has several examples where species
are not grouped by ancestry and lack similarity in both
gene organization and genomic sequence [38]. For these
reasons, we chose to exclude the simulated viral reads
from our analysis of Bracken.
For overall evaluation of the accuracy of Bracken in
these strain exclusion experiments, we calculated the
mean absolute percentage error (MAPE):
\[ \mathrm{MAPE} = \frac{100\%}{n} \sum_{x=1}^{n} \frac{|T_x - S_x|}{T_x} \]
where S_x is the estimated number of reads and T_x is the
true number of reads for taxon x. In this strain exclusion
experiment, n = 40, the total number of distinct prokaryotic
species and genera in the sample, and T_x = 1000 for
each taxon.
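As a small worked illustration, the mean absolute percentage error averages the per-taxon relative error between true and estimated read counts and expresses it as a percentage. The function name and the example counts below are hypothetical.

```python
def mape(true_counts, estimated_counts):
    """Mean absolute percentage error between true and estimated read counts.
    Both sequences are ordered by taxon; true counts must be nonzero."""
    if len(true_counts) != len(estimated_counts):
        raise ValueError("count lists must have the same length")
    n = len(true_counts)
    total = sum(abs(t - s) / t for t, s in zip(true_counts, estimated_counts))
    return 100.0 * total / n
```

For two taxa with 1000 true reads each and estimates of 900 and 1100, the per-taxon errors are both 10%, so the MAPE is 10%.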
Supplementary information
Supplementary information accompanies this paper at https://doi.org/10.1186/s13059-019-1891-0.
Additional file 1: Table S1. Comparison of accuracy and computational
performance. Table S2. Comparison of Kraken 2 with other classifiers,
using various parameter values. Table S3. Genomes excluded in strain-
exclusion simulation. Table S4. Thread scaling evaluation results. Table
S5. Evaluation of FDA-ARGOS sequencing data. Table S6. Sequences
used for evaluation of collision rates. Table S7. Minimizer collision
evaluation results. Table S8. Hash table collision evaluation results. Table
S9. Comparison of Kraken 2X with other classifiers, using various
parameter values. Table S10. Database size evaluation results.
Additional file 2: Figure S1. Estimation of database sizes for Kraken 1
and Kraken 2 as sequences are added to the reference set. Figure S2.
Bracken performance on strain exclusion simulated prokaryotic data.
Figure S3. Examples of compact hash table usage with Kraken 2. Figure
S4. Evaluation of compact hash table error rates as a function of two
variables. Figure S5. Evaluation of minimizer collision rates as a function
of minimizer length.
Additional file 3: Review history.
Acknowledgements
The authors would like to thank James R. White and Steven Salzberg for the
helpful discussions about the manuscript.
Peer review information
Barbara Cheifet was the primary editor of this article and managed its
editorial process and peer review in collaboration with the rest of the
editorial team.
Review history
The review history is available as Additional file 3.
Authors' contributions
DEW and BL designed the algorithms for Kraken 2. DEW developed the
Kraken 2 software. DEW, JL, and BL designed the experiments. DEW and JL
performed the experiments. DEW, JL, and BL prepared and reviewed the
manuscript. All authors read and approved the final manuscript.
Funding
BL and DEW were supported by NSF grant IIS-1349906. BL was additionally
supported by NIH grant R01-GM118568. JL was supported by NIH grant R35-
GM130151.
Availability of data and materials
We have made the data for our strain exclusion experiments publicly
available for download, including all reference sequences, taxonomy, and
simulated read data [39]. Code to generate all databases from these
reference sequences, to generate simulated read data, and to run the
comparison of classifiers' accuracy and performance is also available for
public download in a GitHub repository [40] and via permanent storage at
https://doi.org/10.5281/zenodo.3520278.
Kraken 2's source code is open-source, licensed under the MIT License, and
available in a GitHub repository [41]. The specific version of Kraken 2 evaluated
here, version 2.0.8, is also permanently available at
https://doi.org/10.5281/zenodo.3520272.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Author details
1 Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA.
2 Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA.
3 Department of Biomedical Engineering, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA.
Received: 20 September 2019 Accepted: 18 November 2019
References
1. Kim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive
classification of metagenomic sequences. Genome Res. 2016;26:1721–9.
2. Ounit R, Wanamaker S, Close TJ, Lonardi S. CLARK: fast and accurate
classification of metagenomic and genomic sequences using discriminative
k-mers. BMC Genomics. 2015;16:236.
3. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for
metagenomics with Kaiju. Nat Commun. 2016;7:11257.
4. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence
classification using exact alignments. Genome Biol. 2014;15:R46.
5. Breitwieser FP, Baker DN, Salzberg SL. KrakenUniq: confident and fast
metagenomics classification using unique k-mer counts. Genome Biol. 2018;
19:198.
6. Lindgreen S, Adair KL, Gardner PP. An evaluation of the accuracy and speed
of metagenome analysis tools. Sci Rep. 2016;6:19233.
7. Ye SH, Siddle KJ, Park DJ, Sabeti PC. Benchmarking metagenomics tools for
taxonomic classification. Cell. 2019;178:779–94.
8. Eyice Ö, et al. SIP metagenomics identifies uncultivated Methylophilaceae as
dimethylsulphide degrading bacteria in soil and lake sediment. ISME J. 2015;
9:2336.
9. Merelli I, et al. Low-power portable devices for metagenomics analysis: fog
computing makes bioinformatics ready for the Internet of Things. Futur
Gener Comput Syst. 2018;88:467–78.
10. Lu J, Salzberg SL. Removing contaminants from databases of draft
genomes. PLoS Comput Biol. 2018;14:e1006277.
11. Donovan PD, Gonzalez G, Higgins DG, Butler G, Ito K. Identification of fungi
in shotgun metagenomics datasets. PLoS One. 2018;13:e0192898.
12. Meiser A, Otte J, Schmitt I, Grande FD. Sequencing genomes from mixed
DNA samples - evaluating the metagenome skimming approach in
lichenized fungi. Sci Rep. 2017;7:14881.
13. Knutson TP, Velayudhan BT, Marthaler DG. A porcine enterovirus G
associated with enteric disease contains a novel papain-like cysteine
protease. J Gen Virol. 2017;98:1305–10.
14. Lu J, Breitwieser FP, Thielen P, Salzberg SL. Bracken: estimating species
abundance in metagenomics data. PeerJ Comput Sci. 2017;3:e104.
15. Roberts M, Hayes W, Hunt B, Mount S, Yorke J. Reducing storage
requirements for biological sequence comparison. Bioinformatics. 2004;20:
3363–9.
16. Li H. Minimap2: pairwise alignment for nucleotide sequences.
Bioinformatics. 2018;34:3094–100.
17. Langmead B, Wilks C, Antonescu V, Charles R. Scaling read aligners to
hundreds of threads on general-purpose processors. Bioinformatics. 2018;
35(3):421–32.
18. Pettengill EA, Pettengill JB, Binet R. Phylogenetic analyses of Shigella and
enteroinvasive Escherichia coli for the identification of molecular
epidemiological markers: whole-genome comparative analysis does not
support distinct genera designation. Front Microbiol. 2016;6:1573.
19. Helgason E, et al. Bacillus anthracis, Bacillus cereus, and Bacillus
thuringiensis - one species on the basis of genetic evidence. Appl Environ
Microbiol. 2000;66:2627–30.
20. Gomila M, Peña A, Mulet M, Lalucat J, García-Valdés E. Phylogenomics and
systematics in Pseudomonas. Front Microbiol. 2015;6:214.
21. Parks DH, et al. A standardized bacterial taxonomy based on genome
phylogeny substantially revises the tree of life. Nat Biotechnol. 2018;36:996.
22. Sichtig H, et al. FDA-ARGOS: a public quality-controlled genome database
resource for infectious disease sequencing diagnostics and regulatory
science research. bioRxiv. 2018;482059. https://doi.org/10.1101/482059.
23. Stewart RD, et al. Assembly of 913 microbial genomes from metagenomic
sequencing of the cow rumen. Nat Commun. 2018;9:870.
24. Pandey P, Bender MA, Johnson R, Patro R. A general-purpose
counting filter: making every bit count. In: Proc 2017 ACM Int Conf Manag
Data; 2017. p. 775–87. https://doi.org/10.1145/3035918.3035963.
25. Flajolet P, Fusy É, Gandouet O, Meunier F. Hyperloglog: the analysis of a
near-optimal cardinality estimation algorithm. Discret Math Theor Comput
Sci Proc. 2007;AH:127–46.
26. Appleby A. SMHasher GitHub repository. https://github.com/aappleby/smhasher.
27. Federhen S. The NCBI Taxonomy database. Nucleic Acids Res. 2011;40:
D136–43.
28. Břinda K, Sykulski M, Kucherov G. Spaced seeds improve k-mer-based
metagenomic classification. Bioinformatics. 2015;31:3584–92.
29. Church DM, et al. Extending reference assembly models. Genome Biol. 2015;
16:13.
30. The UniVec Database. https://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/.
31. Morgulis A, Gertz EM, Schäffer AA, Agarwala R. A fast and symmetric DUST
implementation to mask low-complexity DNA sequences. J Comput Biol.
2006;13:1028–40.
32. Wootton JC, Federhen S. Analysis of compositionally biased regions in
sequence databases. Methods Enzymol. 1996;266:554–71.
33. Flajolet P, Martin GN. Probabilistic counting algorithms for data base
applications. J Comput Syst Sci. 1985;31:182–209.
34. Solis AD. Amino acid alphabet reduction preserves fold information
contained in contact interactions in proteins. Proteins Struct Funct
Bioinforma. 2015;83:2198–216.
35. Holtgrewe M. Mason - a read simulator for second generation sequencing
data. Technical Report TR-B-10-06; 2010.
36. Segata N, et al. Metagenomic microbial community profiling using unique
clade-specific marker genes. Nat Methods. 2012;9:811–4.
37. Kodama Y, et al. The sequence read archive: explosive growth of
sequencing data. Nucleic Acids Res. 2011;40:D54–6.
38. Lawrence JG, Hatfull GF, Hendrix RW. Imbroglios of viral taxonomy: genetic
exchange and failings of phenetic approaches. J Bacteriol. 2002;184:
4891–905.
39. Wood DE. Kraken 2 Manuscript Data. https://doi.org/10.5281/zenodo.3365797.
40. Wood DE. Kraken 2 Experiment GitHub repository.
https://github.com/DerrickWood/kraken2-experiment-code.
41. Wood DE. Kraken 2 GitHub repository. https://github.com/DerrickWood/kraken2.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.