Critical Assessment of Metagenome Interpretation - A benchmark of metagenomics software

Abstract and Figures

Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.
This content is subject to copyright. Terms and conditions apply.
ANALYSIS
NATURE METHODS|  VOL.14  NO.11  |  NOVEMBER 2017  |1063
The biological interpretation of metagenomes relies on sophisti-
cated computational analyses such as read assembly, binning and
taxonomic profiling. Tremendous progress has been achieved1,
but there is still much room for improvement. The evaluation of
computational methods has been limited largely to publications
presenting novel or improved tools. These results are extremely
difficult to compare owing to varying evaluation strategies,
benchmark data sets and performance criteria. Furthermore, the
state of the art in this active field is a moving target, and the
assessment of new algorithms by individual researchers consumes
substantial time and computational resources and may introduce
unintended biases.
We tackle these challenges with a community-driven initia-
tive for the Critical Assessment of Metagenome Interpretation
(CAMI). CAMI aims to evaluate methods for metagenome anal-
ysis comprehensively and objectively by establishing standards
through community involvement in the design of benchmark data
sets, evaluation procedures, choice of performance metrics and
questions to focus on. To generate a comprehensive overview,
we organized a benchmarking challenge on data sets of unprece-
dented complexity and degree of realism. Although benchmarking
has been performed before2,3, this is the first community-driven
effort that we know of. The CAMI portal is also open to submis-
sions, and the benchmarks generated here can be used to assess
and develop future work.
We assessed the performance of metagenome assembly, bin-
ning and taxonomic profiling programs when encountering major
challenges commonly observed in metagenomics. For instance,
microbiome research benefits from the recovery of genomes for
individual strains from metagenomes4–7, and many ecosystems
have a high degree of strain heterogeneity8,9. To date, it is not clear
how much assembly, binning and profiling software are influ-
enced by the evolutionary relatedness of organisms, community
complexity, presence of poorly categorized taxonomic groups
(such as viruses) or varying software parameters.
Critical Assessment of Metagenome Interpretation—a
benchmark of metagenomics software
Alexander Sczyrba1,2,48, Peter Hofmann3–5,48, Peter Belmann1,2,4,5,48, David Koslicki6, Stefan Janssen4,7,8,
Johannes Dröge3–5, Ivan Gregor3–5, Stephan Majda3,47, Jessika Fiedler3,4, Eik Dahms3–5, Andreas Bremges1,2,4,5,9,
Adrian Fritz4,5, Ruben Garrido-Oter3–5,10,11, Tue Sparholt Jørgensen12–14, Nicole Shapiro15, Philip D Blood16,
Alexey Gurevich17, Yang Bai10,47, Dmitrij Turaev18, Matthew Z DeMaere19, Rayan Chikhi20,21,
Niranjan Nagarajan22, Christopher Quince23, Fernando Meyer4,5, Monika Balvočiūtė24, Lars Hestbjerg Hansen12,
Søren J Sørensen13, Burton K H Chia22, Bertrand Denis22, Jeff L Froula15, Zhong Wang15, Robert Egan15,
Dongwan Don Kang15, Jeffrey J Cook25, Charles Deltel26,27, Michael Beckstette28, Claire Lemaitre26,27,
Pierre Peterlongo26,27, Guillaume Rizk27,29, Dominique Lavenier21,27, Yu-Wei Wu30,31, Steven W Singer30,32,
Chirag Jain33, Marc Strous34, Heiner Klingenberg35, Peter Meinicke35, Michael D Barton15, Thomas Lingner36,
Hsin-Hung Lin37, Yu-Chieh Liao37, Genivaldo Gueiros Z Silva38, Daniel A Cuevas38, Robert A Edwards38,
Surya Saha39, Vitor C Piro40,41, Bernhard Y Renard40, Mihai Pop42,43, Hans-Peter Klenk44, Markus Göker45,
Nikos C Kyrpides15, Tanja Woyke15, Julia A Vorholt46, Paul Schulze-Lefert10,11, Edward M Rubin15,
Aaron E Darling19, Thomas Rattei18 & Alice C McHardy3–5,11
A full list of affiliations appears at the end of the paper.
Received 29 December 2016; accepted 25 August 2017; published online 2 October 2017; doi:10.1038/nmeth.4458
Content courtesy of Springer Nature, terms of use apply. Rights reserved
RESULTS
We generated extensive metagenome benchmark data sets from
newly sequenced genomes of ~700 microbial isolates and 600
circular elements that were distinct from strains, species, genera
or orders represented by public genomes during the challenge.
The data sets mimicked commonly used experimental settings
and properties of real data sets, such as the presence of multiple,
closely related strains, plasmid and viral sequences and realis-
tic abundance profiles. For reproducibility, CAMI challenge
participants were encouraged to provide predictions along with
an executable Docker biobox10 implementing their software
and specifying the parameter settings and reference databases
used. Overall, 215 submissions, representing 25 programs and 36
biobox implementations, were received from 16 teams worldwide,
with consent to publish (Online Methods).
Assembly challenge
Assembling genomes from metagenomic short-read data is very
challenging owing to the complexity and diversity of micro-
bial communities and the fact that closely related genomes may
represent genome-sized approximate repeats. Nevertheless,
sequence assembly is a crucial part of metagenome analysis,
and subsequent analyses—such as binning—depend on the
assembly quality.
Overall performance trends
Developers submitted reproducible results for six assemblers:
MEGAHIT11, Minia12, Meraga (Meraculous13 + MEGAHIT),
A* (using the OperaMS Scaffolder)14, Ray Meta15 and Velour16.
Several are dedicated metagenome assemblers, while others are
more broadly used (Supplementary Tables 1 and 2). Across all
data sets (Supplementary Table 3) the assembly statistics (Online
Methods) varied substantially by program and parameter settings
(Supplementary Figs. 1–12). The gold-standard co-assembly of
the five samples constituting the high-complexity data set has
2.80 Gbp in 39,140 contigs. The assembly results ranged from
12.32 Mbp to 1.97 Gbp in size (0.4% and 70% of the gold stand-
ard co-assembly, respectively), 0.4% to 69.4% genome fraction,
11 to 8,831 misassemblies and 249 bp to 40.1 Mbp unaligned
contigs (Supplementary Table 4 and Supplementary Fig. 1).
MEGAHIT11 produced the largest assembly, of 1.97 Gbp, with
587,607 contigs, 69.3% genome fraction and 96.9% mapped
reads. It had a substantial number of unaligned bases (2.28 Mbp)
and the most misassemblies (8,831). Changing the parameters
of MEGAHIT (Megahit_ep_mtl200) substantially increased the
unaligned bases, to 40.89 Mbp, whereas the total assembly length,
genome fraction and fraction of mapped reads remained almost
identical (1.94 Gbp, 67.3% and 97.0%, respectively, with 7,538
misassemblies). Minia12 generated the second largest assembly
(1.85 Gbp in 574,094 contigs), with a genome fraction of 65.7%,
only 0.12 Mbp of unaligned bases and 1,555 misassemblies. Of all
reads, 88.1% mapped to the Minia assembly. Meraga generated an
assembly of 1.81 Gbp in 745,109 contigs, to which 90.5% of reads
mapped (2.6 Mbp unaligned, 64.0% genome fraction and 2,334
misassemblies). Velour (VELOUR_k63_C2.0) produced the most
contigs (842,405) in a 1.1-Gbp assembly (15.0% genome fraction),
with 381 misassemblies and 56 kbp unaligned sequences. 81% of
the reads mapped to the Velour assembly. The smallest assembly
was produced by Ray6 using a k-mer size of 91 (Ray_k91), with 12.3 Mbp
assembled into 13,847 contigs (genome fraction <0.1%). Only
3.2% of the reads mapped to this assembly.
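The size percentages quoted above follow directly from the length of the gold-standard co-assembly. A minimal sketch of that arithmetic, using only figures from the text (the function and variable names are illustrative, not part of any CAMI tool):

```python
# Sizes quoted in the text, in base pairs.
GOLD_STANDARD_BP = 2.80e9  # gold-standard co-assembly (high-complexity data set)


def percent_of_gold_standard(assembly_bp, gold_bp=GOLD_STANDARD_BP):
    """Return an assembly's total length as a percentage of the gold standard."""
    return 100.0 * assembly_bp / gold_bp


smallest = percent_of_gold_standard(12.32e6)  # Ray_k91
largest = percent_of_gold_standard(1.97e9)    # MEGAHIT

print(f"smallest assembly: {smallest:.1f}% of the gold standard")
print(f"largest assembly:  {largest:.1f}% of the gold standard")
```

Total length alone says nothing about correctness, which is why the text reports it alongside genome fraction, misassemblies and unaligned bases.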
Altogether, MEGAHIT, Minia and Meraga produced results
of similar quality when considering these various metrics;
they generated a higher contiguity than the other assemblers
(Supplementary Figs. 10–12) and assembled a substantial fraction
of genomes across a broad abundance range. Analysis of the
low- and medium-complexity data sets delivered similar results
(Supplementary Figs. 4–9).
Closely related genomes
To assess how the presence of closely related genomes affects
assemblies, we divided genomes according to their average nucle-
otide identity (ANI)17 into ‘unique strains’ (genomes with <95%
ANI to any other genome) and ‘common strains’ (genomes with an
ANI ≥95% to another genome in the data set). Meraga, MEGAHIT
and Minia recovered the largest fraction of all genomes (Fig. 1a).
For unique strains, Minia and MEGAHIT recovered the highest
percentages (median over all genomes 98.2%), followed by Meraga
(median 96%) and VELOUR_k31_C2.0 (median 62.9%) (Fig. 1b).
Notably, for the common strains, all assemblers recovered a
substantially lower fraction (Fig. 1c). MEGAHIT (Megahit_ep_
mtl200; median 22.5%) was followed by Meraga (median 12.0%)
and Minia (median 11.6%), whereas VELOUR_k31_C2.0 recov-
ered only 4.1% (median). Thus, the metagenome assemblers pro-
duced high-quality results for genomes without close relatives,
while only a small fraction of the common strain genomes was
assembled, with assembler-specific differences. For very high
ANI groups (>99.9%), most assemblers recovered single genomes
(Supplementary Fig. 13). Resolving strain-level diversity posed
a substantial challenge to all programs evaluated.
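The division into unique and common strains can be sketched as follows. This is a minimal illustration that assumes pairwise ANI values are already available; the genome names and ANI values are hypothetical, and treating unlisted pairs as below the threshold is an assumption of the sketch.

```python
ANI_THRESHOLD = 95.0  # percent, the cut-off used in the text


def split_by_ani(genomes, pairwise_ani, threshold=ANI_THRESHOLD):
    """Partition genomes into (unique, common) strain sets.

    pairwise_ani maps frozenset({genome_a, genome_b}) to ANI in percent.
    Pairs missing from the mapping are treated as below the threshold
    (an assumption of this sketch).
    """
    common = set()
    for pair, ani in pairwise_ani.items():
        if ani >= threshold:
            common.update(pair)
    common &= set(genomes)
    unique = set(genomes) - common
    return unique, common


# Hypothetical genome names and ANI values, for illustration only.
genomes = ["strain_A1", "strain_A2", "strain_B", "strain_C"]
pairwise_ani = {
    frozenset({"strain_A1", "strain_A2"}): 99.2,
    frozenset({"strain_A1", "strain_B"}): 81.5,
    frozenset({"strain_B", "strain_C"}): 94.9,
}

unique, common = split_by_ani(genomes, pairwise_ani)
print("unique strains:", sorted(unique))
print("common strains:", sorted(common))
```

A genome counts as a common strain as soon as one other genome in the data set reaches the threshold, so a single close relative is enough to move it out of the unique group.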
Effect of sequencing depth
To investigate the effect of sequencing depth on the assemblies,
we compared the genome recovery rate (genome fraction) to the
genome sequencing coverage (Fig. 1d and Supplementary Fig. 2
for complete results). Assemblers using multiple k-mers (Minia,
MEGAHIT and Meraga) substantially outperformed single k-
mer assemblers. The chosen k-mer size affects the recovery rate
(Supplementary Fig. 3): while small k-mers improved recovery
of low-abundance genomes, large k-mers led to a better recovery
of highly abundant ones. Most assemblers except for Meraga and
Minia did not recover very-high-copy circular elements (sequenc-
ing coverage >100×) well, though Minia lost all genomes with
80–200× sequencing coverage (Fig. 1d). Notably, no program
investigated contig topology to determine whether these were
circular and complete.
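One general factor behind the interaction of k-mer size and sequencing depth is that a read of length L contains only L − k + 1 k-mers, so the effective k-mer coverage shrinks as k grows. The sketch below illustrates this textbook relation; it is not an analysis from the challenge, and the read length of 150 bp is an assumed value chosen for illustration.

```python
def kmer_coverage(base_coverage, read_length, k):
    """Expected k-mer coverage for a given per-base sequencing coverage.

    A read of length L contains L - k + 1 k-mers, so k-mer coverage is
    base_coverage * (L - k + 1) / L.
    """
    if not 1 <= k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    return base_coverage * (read_length - k + 1) / read_length


READ_LENGTH = 150  # assumed read length in bp; not taken from the text

for coverage in (5, 50):
    for k in (21, 91):
        value = kmer_coverage(coverage, READ_LENGTH, k)
        print(f"{coverage}x coverage, k={k}: {value:.1f}x k-mer coverage")
```

Under these assumptions a low-coverage genome loses more than half of its effective coverage at k = 91, which is consistent with small k-mers favoring low-abundance genomes, whereas a high-coverage genome retains ample k-mer coverage at large k.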
Binning challenge
Metagenome assembly programs return mixtures of variable-
length fragments originating from individual genomes. Binning
algorithms were devised to classify or bin these fragments—contigs
or reads—according to their genomic or taxonomic origins, ideally
generating draft genomes (or pan-genomes) of a strain (or higher-
ranking taxon) from a microbial community. While genome bin-
ners group sequences into unlabeled bins, taxonomic binners
group the sequences into bins with a taxonomic label attached.
Results were submitted together with software bioboxes for five
genome binners and four taxonomic binners: MyCC18, MaxBin
2.0 (ref. 19), MetaBAT20, MetaWatt 3.5 (ref. 21), CONCOCT22,
PhyloPythiaS+23, taxator-tk24, MEGAN6 (ref. 25) and Kraken26.
Submitters ran their programs on the gold-standard co-assem-
blies or on individual read samples (MEGAN6), according to
their suggested application. We determined their performance
for addressing important questions in microbiome studies.
Recovery of individual genome bins
We investigated program performance when recovering indi-
vidual genome (strain-level) bins (Online Methods). For the
genome binners, average genome completeness (34% to 80%)
and purity (70% to 97%) varied substantially (Supplementary
Table 5 and Supplementary Fig. 14). For the medium- and low-
complexity data sets, MaxBin 2.0 had the highest values (70–80%
completeness, >92% purity), followed by other programs with com-
parably good performance in a narrow range (completeness rang-
ing with one exception from 50–64%, >75% purity). Notably, other
programs assigned a larger portion of the data sets than MaxBin 2.0
measured in bp, though with lower adjusted Rand index (ARI; Fig.
2a). For applications where binning a larger fraction of the data set
at the cost of some accuracy is important, MetaWatt 3.5, MetaBAT
and CONCOCT could be good choices. The high-complexity data
set was more challenging to all programs, with average completeness
decreasing to ~50% and more than 70% purity, except for MaxBin 2.0
and MetaWatt 3.5, which showed purity of above 90%. The programs
either assigned only a smaller data set portion (>50%, in the case of
MaxBin 2.0) with high ARI or a larger fraction with lower ARI (more
than 90% with less than 0.5 ARI, all except MaxBin and MetaBat).
The exception was MetaWatt 3.5, which assigned more than 90%
of the data set with an ARI larger than 0.8, thus best recovering
abundant genomes from the high-complexity data set. Accordingly,
MetaWatt 3.5, followed by MaxBin 2.0, recovered the most genomes
with high purity and completeness from all data sets (Fig. 2b).
[Figure 1 image omitted. Panels a–c: box plots of genome fraction (%), scale 0 to 100, per assembly for all genome groups (a), unique strains (b) and common strains (c). Assemblies shown: Gold standard, Minia k21–k91, Ray k51, Ray blacklight k64, Ray k71, Ray k91, Megahit k21–k91, Megahit ep k21–k91, Megahit ep mtl200 k21–k91, Velour k31 c2.0, Velour k31 c4.01, Velour k63 c2.0, Velour k63 c4.01, Meraga k33–k63, A* k63. Panel d: genome fraction (%) versus sequencing coverage (10 to 1,000) for the gold standard, Megahit (k21–k91), Meraga (k33–k63), Minia (k21–k91), Velour (k63, c2.0), A* (k63) and Ray (k51); legend: unique (ANI <95%), strains (ANI ≥95%), circular elements.]
Figure 1 | Assembly results for the CAMI high-complexity data set. (a–c) Fractions of reference genomes assembled by each assembler for all genomes
(a), genomes with ANI <95% (b) and genomes with ANI ≥95% (c). Colors indicate results from the same assembler incorporated in different pipelines
or parameter settings (see Supplementary Table 2 for details). Dots indicate individual data points (genomes); boxes, interquartile range; center lines,
median. (d) Genome recovery fraction versus genome sequencing depth (coverage). Data were classified as unique genomes (ANI <95%, brown), genomes
with related strains present (ANI ≥95%, blue) or high-copy circular elements (green). The gold standard includes all genomic regions covered by at least
one read in the metagenome data set.
Effect of strain diversity
For unique strains, the average purity and completeness per
genome bin was higher for all genome binners (Fig. 2c). For
the medium- and low-complexity data sets, all had a purity of
above 80%, while completeness was more variable. MaxBin 2.0
performed best across all data sets, with more than 90% purity
and completeness of 70% or higher. MetaBAT, CONCOCT and
MetaWatt 3.5 performed almost as well for two data sets.
For the common strains, however, completeness decreased sub-
stantially (Fig. 2d), similarly to purity for most programs. MaxBin
2.0 still stood out, with more than 90% purity on all data sets.
Notably, when we considered the value of taxon bins for genome
reconstruction, taxon bins had lower completeness but reached
a similar purity, thus delivering high-quality, partial genome bins
(Supplementary Note 1 and Supplementary Fig. 15). Overall,
very high-quality genome bins were reconstructed with genome
binning programs for unique strains, whereas the presence of
closely related strains presented a notable hurdle.
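Purity and completeness per genome bin can be illustrated with a short sketch. The exact definitions are given in the Online Methods, which are not reproduced here; the version below assumes a common convention in which each bin is mapped to the genome contributing most of its base pairs, and all contig data are hypothetical.

```python
from collections import defaultdict


def bin_purity_completeness(contigs):
    """Compute purity and completeness per predicted bin, measured in bp.

    contigs: iterable of (length_bp, true_genome, predicted_bin);
    predicted_bin is None for unassigned sequences.
    Each bin is mapped to the genome contributing the most bp to it
    (an assumption of this sketch).
    """
    genome_size = defaultdict(int)
    bin_content = defaultdict(lambda: defaultdict(int))
    for length, genome, bin_id in contigs:
        genome_size[genome] += length
        if bin_id is not None:
            bin_content[bin_id][genome] += length

    results = {}
    for bin_id, content in bin_content.items():
        majority_genome, majority_bp = max(content.items(), key=lambda kv: kv[1])
        bin_bp = sum(content.values())
        results[bin_id] = {
            "genome": majority_genome,
            "purity": majority_bp / bin_bp,
            "completeness": majority_bp / genome_size[majority_genome],
        }
    return results


# Hypothetical contigs: (length in bp, true genome, predicted bin).
contigs = [
    (6000, "g1", "bin1"),
    (2000, "g2", "bin1"),
    (2000, "g1", "bin2"),
    (8000, "g2", "bin2"),
    (2000, "g1", None),  # unassigned sequence lowers completeness only
]

for bin_id, stats in sorted(bin_purity_completeness(contigs).items()):
    print(bin_id, stats)
```

Contamination, as used in Figure 2b, is 1 − purity. In this convention unassigned sequence reduces completeness but leaves purity untouched, which is why a binner can be highly pure while recovering only part of each genome.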
Performance in taxonomic binning
We next investigated the performance of taxonomic binners in
recovering taxon bins at different ranks (Online Methods). These
results can be used for taxon-level evolutionary or functional pan-
genome analyses and conversion into taxonomic profiles.
For the low-complexity data set, PhyloPythiaS+ had the highest
sample assignment accuracy, average taxon bin completeness and
purity, which were all above 75% from domain to family level.
Kraken followed, with average completeness and accuracy still
above 50% to the family level. However, purity was notably lower,
owing mainly to prediction of many small false bins, which affects
purity more than overall accuracy (Supplementary Fig. 16).
Removing the smallest predicted bins (1% of the data set)
increased purity for Kraken, MEGAN and, most strongly, for taxa-
tor-tk, for which it was close to 100% until order level, and above
75% until family level (Supplementary Fig. 17). Thus, small bins
predicted by these programs are not reliable, but otherwise, high
purity can be reached for higher ranks. Below the family level, all
programs performed poorly, either assigning very little data (low
completeness and accuracy, accompanied by a low misclassifica-
tion rate) or assigning more, with substantial misclassification.
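The effect of removing the smallest predicted bins can be sketched as below. The exact removal rule is not spelled out in this section, so the sketch assumes bins are dropped smallest first for as long as their cumulative size stays within 1% of the total; bin names, sizes and purities are hypothetical.

```python
def drop_smallest_bins(bin_sizes, fraction=0.01):
    """Drop the smallest bins while their cumulative size stays within
    `fraction` of the total size of all bins (assumed rule)."""
    budget = fraction * sum(bin_sizes.values())
    removed_bp = 0
    kept = dict(bin_sizes)
    for bin_id, size in sorted(bin_sizes.items(), key=lambda kv: kv[1]):
        if removed_bp + size > budget:
            break
        removed_bp += size
        del kept[bin_id]
    return kept


def average_purity(purities, bins):
    """Unweighted mean purity over the given bins."""
    return sum(purities[b] for b in bins) / len(bins)


# Hypothetical bin sizes (bp) and purities, for illustration only.
bin_sizes = {"A": 600_000, "B": 391_000, "C": 5_000, "D": 3_000, "E": 1_000}
purities = {"A": 0.95, "B": 0.90, "C": 0.20, "D": 0.10, "E": 0.00}

kept = drop_smallest_bins(bin_sizes)
print("bins kept:", sorted(kept))
print(f"average purity, all bins:  {average_purity(purities, bin_sizes):.3f}")
print(f"average purity, kept bins: {average_purity(purities, kept):.3f}")
```

Because average purity gives every bin equal weight regardless of size, a handful of small false bins can lower it sharply while hardly changing accuracy measured over the whole sample, which matches the behavior described above.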
Notably, Kraken and MEGAN performed similarly. These
programs utilize different data properties (Supplementary
Table 1) but rely on similar algorithms.
The results for the medium-complexity data set agreed quali-
tatively with those for the low-complexity data set, except that
Kraken, MEGAN and taxator-tk performed better (Fig. 2e).
With the smallest predicted bins removed, both Kraken and
PhyloPythiaS+ reached above 75% for accuracy, average
completeness and purity until family rank (Fig. 2f). Similarly,
taxator-tk showed an average purity of almost 75% even at genus
level (almost 100% until order level), and MEGAN showed an
average purity of more than 75% at order level while maintaining
accuracy and average completeness of around 50%. The results of
high-purity taxonomic predictions can be combined with genome
bins to enable their taxonomic labeling. The performances
on the high-complexity data set were similar (Supplementary
Figs. 18 and 19).
Analysis of low-abundance taxa
We determined which programs delivered high completeness for
low-abundance taxa. This is relevant when screening for pathogens
in diagnostic settings27 or for metagenome studies of ancient DNA
samples. Even though PhyloPythiaS+ and Kraken had high com-
pleteness until family rank (Fig. 2e,f), completeness degraded at
lower ranks and for low-abundance bins (Supplementary Fig. 20),
which are most relevant for these applications. It therefore remains
a challenge to further improve predictive performance.
Deep branchers
Taxonomic binners commonly rely on comparisons to reference
sequences for taxonomic assignment. To investigate the effect of
increasing evolutionary distances between a query sequence and
available genomes, we partitioned the challenge data sets by their
taxonomic distances to public genomes as genomes of new strains,
species, genus or family (Supplementary Fig. 21). For new strain
genomes from sequenced species, all programs performed well,
with generally high purity and, often, high completeness, or with
characteristics also observed for other data sets (such as low com-
pleteness for taxator-tk). At increasing taxonomic distances to the
reference, both purity and completeness for MEGAN and Kraken
dropped substantially, while PhyloPythiaS+ decreased most nota-
bly in purity, and taxator-tk, in completeness. For genomes at
larger taxonomic distances (‘deep branchers’), PhyloPythiaS+
maintained the best purity and completeness.
Influence of plasmids and viruses
The presence of plasmid and viral sequences had almost no
effect on binning performance. Although the copy numbers were
high, in terms of sequence size, the fraction was small (<1.5%,
Supplementary Table 6). Only Kraken and MEGAN made pre-
dictions for the viral fraction of the data or predicted viruses to be
present, albeit with low purity (<30%) and completeness (<20%).
Profiling challenge
Taxonomic profilers predict the taxonomic identities and relative
abundances of microbial community members from metagen-
ome samples and are used to study the composition, diversity
and dynamics of microbial communities in a variety of environ-
ments28–30. In contrast to taxonomic binning, profiling does not
assign individual sequences. In some use cases, such as identifica-
tion of potentially pathogenic organisms, accurate determination
of the presence or absence of a particular taxon is important.
In comparative studies (such as quantifying the dynamics of a
microbial community over an ecological gradient), accurately
determining the relative abundance of organisms is paramount.
Challenge participants submitted results for ten profilers:
CLARK31; Common Kmers (an early version of MetaPalette)32;
DUDes33; FOCUS34; MetaPhlAn 2.0 (ref. 35); MetaPhyler36;
mOTU37; a combination of Quikr38, ARK39 and SEK40 (abbrevi-
ated Quikr); Taxy-Pro41 and TIPP42. Some programs were sub-
mitted with multiple versions or different parameter settings,
bringing the number of unique submissions to 20.
[Figure 2 image omitted. Panel a: ARI (%) versus assigned base pairs (%) for the genome binners MyCC, MetaWatt 3.5, MetaBAT, CONCOCT and MaxBin 2.0 on the low-, medium- and high-complexity data sets. Panel b: number of recovered genomes per genome binner, by completeness and contamination:]
Genome binner   Contamination   >50%   >70%   >90%
Gold standard                   753    753    753
MyCC            <10%            275    272    262
MyCC            <5%             267    265    256
MetaWatt 3.5    <10%            500    475    405
MetaWatt 3.5    <5%             476    452    393
MetaBAT         <10%            247    228    195
MetaBAT         <5%             234    216    186
CONCOCT         <10%            250    240    197
CONCOCT         <5%             220    211    173
MaxBin 2.0      <10%            390    385    343
MaxBin 2.0      <5%             356    352    316
[Panels c,d: precision (%) versus recall (%) for the genome binners on unique strains (c) and common strains (d), for the low-, medium- and high-complexity data sets. Panels e,f: accuracy, misclassification, average precision and average recall from superkingdom to species rank on the medium-complexity data set, for 100% of the data (e) and 99% of the data (f); programs shown: Gold standard, Taxator-tk 1.4pre1e, Taxator-tk 1.3.0e, MEGAN 6.4.9, PhyloPythiaS+, PhyloPythiaS+ mg, Kraken 0.10.5, Kraken 0.10.6-unreleased.]
Figure 2 | Binning results for the CAMI data sets. (a) ARI in relation to the fraction of the sample assigned (in bp) by the genome binners. The ARI
was calculated excluding unassigned sequences and thus reflects the assignment accuracy for the portion of the data assigned. (b) Number of genomes
recovered with varying completeness and contamination (1 − purity). (c,d) Average purity (precision) and completeness (recall) for genomes reconstructed
by genome binners for genomes of unique strains with ANI <95% to others (c) and common strains with ANI ≥95% to each other (d). For each program
and complexity data set (Supplementary Table 2), the submission with the largest sum of purity and completeness is shown. In each case, small
bins adding up to 1% of the data set size were removed. Error bars, s.e.m. (e,f) Taxonomic binning performance metrics across ranks for the
medium-complexity data set, with results for the complete data set (e) and smallest predicted bins summing up to 1% of the data set (f) removed. Shaded areas,
s.e.m. in precision (purity) and recall (completeness) across taxon bins.
Performance trends
We employed commonly used metrics (Online Methods) to
assess the quality of taxonomic profiling submissions with regard
to the biological questions outlined above. The reconstruction
fidelity for all profilers varied markedly across metrics, taxonomic
ranks and samples. Each had a unique error profile and
different strengths and weaknesses (Fig. 3a,b), but the profilers
fell into three categories: (i) profilers that correctly predicted
relative abundances, (ii) precise profilers and (iii) profilers with
high recall. We quantified this with a global performance summary
score (Online Methods, Fig. 3c, Supplementary Figs. 22–28
and Supplementary Table 7).
Quikr, CLARK, TIPP and Taxy-Pro had the highest recall,
indicating their suitability for pathogen detection, where failure
to identify an organism can have severe negative consequences.
These were also among the least precise (Supplementary
Figs. 29–33), typically owing to prediction of a large number
of low-abundance organisms. MetaPhlAn 2.0 and Common
Kmers were most precise, suggesting their use when many false
positives would increase cost and effort in downstream analysis.
MetaPhyler, FOCUS, TIPP, Taxy-Pro and CLARK best recon-
structed relative abundances. On the basis of the average of pre-
cision and recall, over all samples and taxonomic ranks, Taxy-Pro
version 0 (mean = 0.616), MetaPhlAn 2.0 (mean = 0.603) and
DUDes version 0 (mean = 0.596) performed best.
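The global performance summary score is defined in the Online Methods, which are not reproduced here. Going by the Figure 3 legend (a lower score means a method was ranked highly more often, and the worst possible score equals the number of submissions summed over data sets and ranks), one plausible reading is a rank sum. The sketch below implements that assumed reading with hypothetical method names and metric values; only the submission counts and the maximum scores are taken from the legend.

```python
def rank_sum_scores(metric_values, higher_is_better=True):
    """Sum each method's rank (1 = best) over all conditions.

    metric_values: {condition: {method: value}}.
    Ties are broken by input order, and methods absent from a condition
    receive no rank there (simplifications of this sketch).
    """
    totals = {}
    for values in metric_values.values():
        ordered = sorted(values, key=values.get, reverse=higher_is_better)
        for position, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + position
    return totals


# Hypothetical recall values for three methods under two conditions.
recall = {
    "low complexity, phylum": {"method_A": 0.90, "method_B": 0.80, "method_C": 0.50},
    "low complexity, class": {"method_A": 0.70, "method_B": 0.75, "method_C": 0.40},
}
scores = rank_sum_scores(recall)
print(scores)

# Maximum (worst) scores as stated in the Figure 3 legend.
submissions = {"low": 18, "medium": 11, "high": 9}
max_unifrac = sum(submissions.values())  # one rank per data set
max_other = 5 * max_unifrac              # five taxonomic ranks per data set
print(max_unifrac, max_other)
```

A rank-based summary of this kind rewards consistency across samples and ranks rather than a large margin in any single comparison.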
Performance at different taxonomic ranks
Most profilers performed well only until the family level
(Fig. 3a,b and Supplementary Figs. 29–33). Over all samples
and programs at the phylum level, recall was 0.85 ± 0.19 (mean
± s.d.), and L1 norm, assessing abundance estimate quality at
a particular rank, was 0.38 ± 0.28, both close to these metrics’
optimal values (ranging from 1 to 0 and from 0 to 2, respectively),
whereas precision was highly variable, at 0.53 ± 0.55. Precision and
recall were high for several methods (DUDes, Common Kmers,
mOTU and MetaPhlAn 2.0) until order rank. However, accurately
reconstructing a taxonomic profile is still difficult below family
level. Even for the low-complexity sample, only MetaPhlAn 2.0
maintained its precision at species level, while the largest recall
at genus rank for the low-complexity sample was 0.55, for Quikr.
Across all profilers and samples, there was a drastic decrease in
average performance between the family and genus levels, of
0.48 ± 0.15% and 0.52 ± 0.18% for recall and precision, respec-
tively, but comparatively little change between order and fam-
ily levels, with a decrease of only 0.1 ± 0.07% and 0.1 ± 0.26%,
for recall and precision, respectively. The other error metrics
showed similar trends for all samples and methods (Fig. 3a and
Supplementary Figs. 34–38).
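The three metrics discussed above can be illustrated at a single taxonomic rank. This is a minimal sketch assuming that precision and recall are computed from the presence or absence of taxa and that the L1 norm is the summed absolute difference between two relative-abundance profiles that each sum to 1; the taxon abundances are hypothetical.

```python
def profile_metrics(gold, predicted):
    """Compare two taxonomic profiles at one rank.

    gold, predicted: {taxon: relative abundance}, each summing to 1.
    Returns precision and recall on taxon presence, plus the L1 norm
    of the abundance difference.
    """
    gold_taxa = {t for t, a in gold.items() if a > 0}
    pred_taxa = {t for t, a in predicted.items() if a > 0}
    true_positives = len(gold_taxa & pred_taxa)
    precision = true_positives / len(pred_taxa) if pred_taxa else 0.0
    recall = true_positives / len(gold_taxa) if gold_taxa else 0.0
    l1_norm = sum(
        abs(gold.get(t, 0.0) - predicted.get(t, 0.0))
        for t in gold_taxa | pred_taxa
    )
    return {"precision": precision, "recall": recall, "l1_norm": l1_norm}


# Hypothetical phylum-level profiles, for illustration only.
gold = {"Firmicutes": 0.5, "Proteobacteria": 0.3, "Bacteroidetes": 0.2}
predicted = {"Firmicutes": 0.6, "Proteobacteria": 0.3, "Actinobacteria": 0.1}

metrics = profile_metrics(gold, predicted)
print({name: round(value, 3) for name, value in metrics.items()})

# Two profiles with no taxa in common reach the worst L1 norm of 2.
worst = profile_metrics({"taxon_x": 1.0}, {"taxon_y": 1.0})
print(worst)
```

The second call shows why the L1 norm ranges from 0 to 2: when the profiles share no taxa, every unit of abundance is counted once as missing and once as falsely predicted.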
Parameter settings and software versions
Several profilers were submitted with multiple parameter settings
or versions (Supplementary Table 2). For some, this had little
effect: for instance, the variance in recall among seven versions
of FOCUS on the low-complexity sample at family level was only
0.002. For others, this caused large performance changes: for
instance, one version of DUDes had twice the recall of
another at phylum level on the pooled high-complexity sample
(Supplementary Figs. 34–38). Notably, several developers submitted
no results beyond a fixed taxonomic rank, as was the case
for Taxy-Pro and Quikr. These submissions performed better than
default program versions submitted by the CAMI team—indicating,
not surprisingly, that experts can generate better results.
Performance for viruses and plasmids
We investigated the effect of including plasmids, viruses and other
circular elements (Supplementary Table 6) in the gold-standard
taxonomic profile (Supplementary Figs. 3941). Here, the term
Best method
(score)
2nd best method
(score)
3rd best method
(score)
Recall Quikr
(35)
CLARK
(43)
TIPP
(46)
Precision MetaPhlAn 2.0
(16)
Common Kmers v0
(25)
mOTU
(41)
L1 Norm MetaPhyler
(28)
FOCUS v5
(45)
TIPP
(66)
UniFrac MetaPhyler
(4)
Taxy-Pro v0
(4)
CLARK
(5)
CLARK
D_v0
D_v1
FS_v0
FS_v1
FS_v2
FS_v3
FS_v4FS_v5
FS_v6
MP2.0
MPr
Quikr
TIPP
T-P_v0
T-P_v1
mOTU
Genus SpeciesFamily
Genus SpeciesFamily
Phylum Class Order
Phylum Class Order
Recall Precision
CK_v0
CK_v1
0.20.4 0.6 0.8 1.0
CLARK
D_v0
D_v1
FS_v0
FS_v1
FS_v2
FS_v3
FS_v4FS_v5
FS_v6
MP2.0
MPr
Quikr
TIPP
T-P_v0
T-P_v1
mOTU
CK_v0
CK_v1
0.20.4 0.6 0.8 1.0
CLARK*
D_v0
D_v1
FS_v0
FS_v1
FS_v2
FS_v3
FS_v4FS_v5
FS_v6
MP2.0
MPr
Quikr
TIPP
T-P_v0*
T-P_v1
mOTU
CK_v0
CK_v1
0.20.4 0.60.8 1.0
CLARK
CLARK
D_v0
D_v1
FS_v0
FS_v1
FS_v2
FS_v3
FS_v4FS_v5
FS_v6
MP2.0
MPr
Quikr
TIPP
T-P_v0
T-P_v0
FS_v5
MP2.0
MPr TIPP
Quikr
T-P_v1
T-P_v1
mOTU
mOTU
CK_v0
CK_v0
CLARK
T-P_v0
FS_v5
MP2.0
MPr TIPP
Quikr
T-P_v1
mOTU
CK_v0
CLARK
T-P_v0
FS_v5
MP2.0
MPr TIPP
Quikr
T-P_v1
mOTU
CK_v0
CLARK
T-P_v0
FS_v5
MP2.0
MPr TIPP
Quikr
T-P_v1
mOTU
CK_v0
CLARK
T-P_v0
FS_v5
MP2.0
MPr TIPP
Quikr
T-P_v1
mOTU
CK_v0
CLARK
T-P_v0
FS_v5
MP2.0
MPr TIPP
Quikr
T-P_v1
mOTU
CK_v0
CK_v1
0.20.4 0.6 0.8 1.0
CLARK
D_v0
D_v1
FS_v0
FS_v1
FS_v2
FS_v3
FS_v4FS_v5
FS_v6
MP2.0
MPr
Quikr
TIPP
T-P_v0
T-P_v1
mOTU
CK_v0
CK_v1
0.20.4 0.6 0.8 1.0
CLARK*
D_v0
D_v1
FS_v0
FS_v1
FS_v2
FS_v3
FS_v4FS_v5
FS_v6
MP2.0
MPr
Quikr
TIPP
T-P_v0*
T-P_v1
mOTU
CK_v0
CK_v1
0.2 0.40.6 0.8 1.0
UniFrac error Recall L1norm error Precision False positives
a
b
c
Figure 3 | Profiling results for the CAMI data sets. (a) Relative
performance of profilers for different ranks and with different error
metrics (weighted UniFrac, L1 norm, recall, precision and false positives)
for the bacterial and archaeal portion of the first high-complexity sample.
Each error metric was divided by its maximal value to facilitate viewing
on the same scale and relative performance comparisons. (b) Absolute
recall and precision for each profiler on the microbial (filtered) portion
of the low-complexity data set across six taxonomic ranks. Red text and
asterisk indicate methods for which no predictions at the corresponding
taxonomic rank were returned. FS, FOCUS; T-P, Taxy-Pro; MP2.0, MetaPhlAn
2.0; MPr, MetaPhyler; CK, Common Kmers; D, DUDes. (c) Best scoring
profilers using different performance metrics summed over all samples
and taxonomic ranks to the genus level. A lower score indicates that
a method was more frequently ranked highly for a particular metric.
The maximum (worst) score for the UniFrac metric is 38 (18 + 11 + 9
profiling submissions for the low, medium and high-complexity data sets,
respectively), while the maximum score for all other metrics is 190 (5
taxonomic ranks × (18 + 11 + 9) profiling submissions for the low, medium
and high-complexity data sets, respectively).
‘filtered’ indicates the gold standard without these data. The
affected metrics were the abundance-based metrics (L1 norm
at the superkingdom level and weighted UniFrac) and precision
and recall (at the superkingdom level): all methods correctly
detected Bacteria and Archaea, but only MetaPhlAn 2.0 and
CLARK detected viruses in the unfiltered samples. Averaging
over all methods and samples, L1 norm at the superkingdom
level increased from 0.05, for the filtered samples, to 0.29, for the
unfiltered samples. Similarly, the UniFrac metric increased from
7.21, for the filtered data sets, to 12.36, for the unfiltered data sets.
Thus, the fidelity of abundance estimates decreased notably when
viruses and plasmids were present.
Taxonomic profilers versus profiles from taxonomic binning
Using a simple coverage-approximation conversion algorithm,
we derived profiles from the taxonomic binning results
(Supplementary Note 1 and Supplementary Figs. 42–45).
Overall, precision and recall of the taxonomic binners were comparable
to those of the profilers. At the order level, the mean precision
over all taxonomic binners was 0.60 (versus 0.40 for the profil-
ers), and the mean recall was 0.82 (versus 0.86 for the profilers).
MEGAN6 and PhyloPythiaS+ had better recall than the profil-
ers at family level, though PhyloPythiaS+ precision was below
that of Common Kmers and MetaPhlAn 2.0 as well as the bin-
ner taxator-tk (Supplementary Figs. 42 and 43), and—similarly
to the profilers—recall also degraded below family level.
Abundance estimation at higher ranks was more problematic
for the binners, as L1 norm error at the order level was 1.07 when
averaged over all samples, whereas for the profilers it was only
0.68. Overall, though, the binners delivered slightly more accurate
abundance estimates, as the binning average UniFrac metric was
7.03, whereas the profiling average was 7.23. These performance
differences may be due in part to the input data: the binners
(except for MEGAN6) used the gold-standard contigs, whereas the
profilers used the raw reads. Kraken, however, is also often
applied to raw reads.
DISCUSSION
A lack of consensus about benchmarking data and evaluation
metrics has complicated metagenomic software comparisons and
their interpretation. To tackle this problem, the CAMI challenge
engaged 19 teams with a series of benchmarks, providing per-
formance data and guidance for applications, interpretation of
results and directions for future work.
Assemblers using a range of k-mers clearly outperformed sin-
gle k-mer assemblers (Supplementary Table 1). While the latter
reconstructed only low-abundance genomes (with small k-mers)
or high-abundance genomes (with large k-mers), using multiple
k-mers substantially increased the recovered genome fraction.
Two programs also reconstructed high-copy circular elements
well, although none detected their circularities. An unsolved
challenge for all programs is the assembly of closely related
genomes. Notably, poor or failed assembly of these genomes will
negatively affect subsequent contig binning and further compli-
cate their study.
All genome binners performed well when no closely related
strains were present. Taxonomic binners reconstructed taxon bins
of acceptable quality down to the family rank (Supplementary
Table 1). This leaves a gap in species and genus-level reconstruc-
tion—even when taxa are represented by single strains—that
needs to be closed. Notably, taxonomic binners were more precise
when reconstructing genomes than for species or genus bins, indi-
cating that the decreased performance for low ranks is due partly
to limitations of the reference taxonomy. A sequence-derived
phylogeny might thus represent a more suitable reference frame-
work for “phylogenetic” binning. When comparing the average
taxon binner performance for taxa with similar surroundings in
the SILVA and NCBI taxonomies to those with less agreement,
we observed significant differences—primarily as a decrease in
performance for low-ranking taxa in discrepant surroundings
(Supplementary Note 1 and Supplementary Table 8). Thus, the
use of SILVA might further improve taxon binning, but the lack
of associated genome sequences represents a practical hurdle43.
Another challenge for all programs is to deconvolute strain-level
diversity. For the typical covariance of read coverage–based
genome binners, it may require many more samples than those
analyzed here (up to five) for satisfactory performance.
Despite variable performance, particularly for precision
(Supplementary Table 1), most taxonomic profilers had good
recall and low error in abundance estimates until family rank. The
use of different classification algorithms, reference taxonomies,
databases and information sources (for example, marker genes,
k-mers) probably contributes to observed performance differ-
ences. To enable systematic analyses of their individual impacts,
developers could provide configurable rather than hard-coded
parameter options. Similarly to taxonomic binners, performance
across all metrics dropped substantially below family level. When
including plasmids and viruses, all programs gave worse abun-
dance estimates, indicating a need for better analysis of data sets
with such content, as plasmids are likely to be present, and viral
particles are not always removed by size filtration44.
Additional programs can still be submitted and evaluated with
the CAMI benchmarking platform. Currently, we are further auto-
mating the benchmarking and comparative result visualizations. As
sequencing technologies such as long-read sequencing and metage-
nomics programs continue to evolve rapidly, CAMI will provide
further challenges. We invite members of the community to con-
tribute actively to future benchmarking efforts by CAMI.
METHODS
Methods, including statements of data availability and any associated
accession codes and references, are available in the online version
of the paper.
Note: Any Supplementary Information and Source Data files are available in the
online version of the paper.
ACKNOWLEDGMENTS
We thank C. Della Beffa, J. Alneberg, D. Huson and P. Grupp for their input,
and the Isaac Newton Institute for Mathematical Sciences for its hospitality
during the MTG program (supported by UK Engineering and Physical Sciences
Research Council (EPSRC) grant EP/K032208/1). Sequencing at the US
Department of Energy Joint Genome Institute was supported under contract
DE-AC02-05CH11231. R.G.O. was supported by the Cluster of Excellence on Plant
Sciences program of the Deutsche Forschungsgemeinschaft; A.E.D. and M.Z.D.,
through the Australian Research Council’s Linkage Projects (LP150100912);
J.A.V., by the European Research Council advanced grant (PhyMo); D.B., B.K.H.C.
and N.N., by the Agency for Science, Technology and Research (A*STAR),
Singapore; T.S.J., by the Lundbeck Foundation (project DK nr R44-A4384);
L.H.H. by a VILLUM FONDEN Block Stipend on Mobilomics; and P.D.B. by the
National Science Foundation (NSF, grant DBI-1458689). This work used the
Bridges and Blacklight systems, supported by NSF awards ACI-1445606 and
ACI-1041726, respectively, at the Pittsburgh Supercomputing Center (PSC), under
the Extreme Science and Engineering Discovery Environment (XSEDE), supported
by NSF grant OCI-1053575.
AUTHOR CONTRIBUTIONS
R.C., N.N., C.Q., B.K.H.C., B.D., J.L.F., Z.W., R.E., D.D.K., J.J.C., C.D., C.L., P.P.,
G.R., D.L., Y.-W.W., S.W.S., C.J., M.S., H.K., P.M., T.L., H.-H.L., Y.-C.L., G.G.Z.S.,
D.A.C., R.A.E., S.S., V.C.P., B.Y.R., D.K., J.D. and I.G. participated in challenge;
P.S.-L., J.A.V., Y.B., T.S.J., L.H.H., S.J.S., N.C.K., E.M.R., T.W., H.-P.K., M.G. and
N.S. generated and contributed data; P.H., S.M., J.F., E.D., D.T., M.Z.D., A.S.,
A.B., A.E.D., T.R. and A.C.M. generated benchmark data sets; P.B. implemented
the benchmarking platform; M. Beckstette and P.D.B. provided computational
support; D.K., P.H., S.J., J.D., I.G., R.G.-O., C.Q., A.F., F.M., P.B., M.D.B.,
M. Balvočiūtė and A.G. implemented benchmarking metrics and bioboxes and
performed evaluations; D.K., A.S., A.C.M., C.Q., J.D., P.H., T.S.J., D.D.K.,
Y.-W.W., A.B., A.F., R.C., M.P. and P.B. interpreted results with comments from
many authors; A.C.M., A.S. and D.K. wrote the paper with comments from many
authors; A.S., T.R. and A.C.M. conceived research with input from many authors.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Reprints and permissions information is available online at
http://www.nature.com/reprints/index.html. Publisher’s note: Springer Nature remains neutral
with regard to jurisdictional claims in published maps and institutional
affiliations.
This work is licensed under a Creative Commons Attribution 4.0
International (CC BY 4.0) license. The images or other third party
material in this article are included in the article’s Creative Commons
license, unless indicated otherwise in the credit line; if the material is not
included under the Creative Commons license, users will need to obtain permission
from the license holder to reproduce the material. To view a copy of this
license, visit http://creativecommons.org/licenses/by/4.0/.
1. Turaev, D. & Rattei, T. High definition for systems biology of microbial
communities: metagenomics gets genome-centric and strain-resolved. Curr.
Opin. Biotechnol. 39, 174–181 (2016).
2. Mavromatis, K. et al. Use of simulated data sets to evaluate the fidelity of
metagenomic processing methods. Nat. Methods 4, 495–500 (2007).
3. Lindgreen, S., Adair, K.L. & Gardner, P.P. An evaluation of the accuracy
and speed of metagenome analysis tools. Sci. Rep. 6, 19233 (2016).
4. Marx, V. Microbiology: the road to strain-level identification. Nat. Methods
13, 401–404 (2016).
5. Sangwan, N., Xia, F. & Gilbert, J.A. Recovering complete and draft
population genomes from metagenome datasets. Microbiome 4, 8 (2016).
6. Yassour, M. et al. Natural history of the infant gut microbiome and impact
of antibiotic treatment on bacterial strain diversity and stability. Sci.
Transl. Med. 8, 343ra81 (2016).
7. Bendall, M.L. et al. Genome-wide selective sweeps and gene-specific
sweeps in natural bacterial populations. ISME J. 10, 1589–1601
(2016).
8. Bai, Y. et al. Functional overlap of the Arabidopsis leaf and root
microbiota. Nature 528, 364–369 (2015).
9. Kashtan, N. et al. Single-cell genomics reveals hundreds of coexisting
subpopulations in wild Prochlorococcus. Science 344, 416–420 (2014).
10. Belmann, P. et al. Bioboxes: standardised containers for interchangeable
bioinformatics software. Gigascience 4, 47 (2015).
11. Li, D., Liu, C.M., Luo, R., Sadakane, K. & Lam, T.W. MEGAHIT: an ultra-fast
single-node solution for large and complex metagenomics assembly via
succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).
12. Chikhi, R. & Rizk, G. Space-efficient and exact de Bruijn graph
representation based on a Bloom filter. Algorithms Mol. Biol. 8, 22 (2013).
13. Chapman, J.A. et al. Meraculous: de novo genome assembly with short
paired-end reads. PLoS One 6, e23501 (2011).
14. Gao, S., Sung, W.K. & Nagarajan, N. Opera: reconstructing optimal
genomic scaffolds with high-throughput paired-end sequences. J. Comput.
Biol. 18, 1681–1691 (2011).
15. Boisvert, S., Laviolette, F. & Corbeil, J. Ray: simultaneous assembly of
reads from a mix of high-throughput sequencing technologies. J. Comput.
Biol. 17, 1519–1533 (2010).
16. Cook, J.J. Scaling Short Read de novo DNA Sequence Assembly to Gigabase
Genomes. PhD thesis, Univ. Illinois at Urbana–Champaign (2011).
17. Konstantinidis, K.T. & Tiedje, J.M. Genomic insights that advance the
species definition for prokaryotes. Proc. Natl. Acad. Sci. USA 102, 2567–
2572 (2005).
18. Lin, H.H. & Liao, Y.C. Accurate binning of metagenomic contigs via
automated clustering sequences using information of genomic signatures
and marker genes. Sci. Rep. 6, 24175 (2016).
19. Wu, Y.W., Simmons, B.A. & Singer, S.W. MaxBin 2.0: an automated binning
algorithm to recover genomes from multiple metagenomic datasets.
Bioinformatics 32, 605–607 (2016).
20. Kang, D.D., Froula, J., Egan, R. & Wang, Z. MetaBAT, an efficient tool for
accurately reconstructing single genomes from complex microbial
communities. PeerJ 3, e1165 (2015).
21. Strous, M., Kraft, B., Bisdorf, R. & Tegetmeyer, H.E. The binning of
metagenomic contigs for microbial physiology of mixed cultures. Front.
Microbiol. 3, 410 (2012).
22. Alneberg, J. et al. Binning metagenomic contigs by coverage and
composition. Nat. Methods 11, 1144–1146 (2014).
23. Gregor, I., Dröge, J., Schirmer, M., Quince, C. & McHardy, A.C.
PhyloPythiaS+: a self-training method for the rapid reconstruction of low-
ranking taxonomic bins from metagenomes. PeerJ 4, e1603 (2016).
24. Dröge, J., Gregor, I. & McHardy, A.C. Taxator-tk: precise taxonomic
assignment of metagenomes by fast approximation of evolutionary
neighborhoods. Bioinformatics 31, 817–824 (2015).
25. Huson, D.H. et al. MEGAN community edition—interactive exploration and
analysis of large-scale microbiome sequencing data. PLoS Comput. Biol.
12, e1004957 (2016).
26. Wood, D.E. & Salzberg, S.L. Kraken: ultrafast metagenomic sequence
classification using exact alignments. Genome Biol. 15, R46 (2014).
27. Miller, R.R., Montoya, V., Gardy, J.L., Patrick, D.M. & Tang, P.
Metagenomics for pathogen detection in public health. Genome Med. 5,
81 (2013).
28. Arumugam, M. et al. Enterotypes of the human gut microbiome. Nature
473, 174–180 (2011).
29. Human Microbiome Project Consortium. Structure, function and diversity of
the healthy human microbiome. Nature 486, 207–214 (2012).
30. Koren, O. et al. A guide to enterotypes across the human body: meta-
analysis of microbial community structures in human microbiome datasets.
PLOS Comput. Biol. 9, e1002863 (2013).
31. Ounit, R., Wanamaker, S., Close, T.J. & Lonardi, S. CLARK: fast and
accurate classification of metagenomic and genomic sequences using
discriminative k-mers. BMC Genomics 16, 236 (2015).
32. Koslicki, D. & Falush, D. MetaPalette: a k-mer Painting Approach for
Metagenomic Taxonomic Profiling and Quantification of Novel Strain
Variation. mSystems 1, e00020–16 (2016).
33. Piro, V.C., Lindner, M.S. & Renard, B.Y. DUDes: a top-down taxonomic
profiler for metagenomics. Bioinformatics 32, 2272–2280 (2016).
34. Silva, G.G., Cuevas, D.A., Dutilh, B.E. & Edwards, R.A. FOCUS: an
alignment-free model to identify organisms in metagenomes using non-
negative least squares. PeerJ 2, e425 (2014).
35. Segata, N. et al. Metagenomic microbial community profiling using unique
clade-specific marker genes. Nat. Methods 9, 811–814 (2012).
36. Liu, B., Gibbons, T., Ghodsi, M., Treangen, T. & Pop, M. Accurate and fast
estimation of taxonomic profiles from metagenomic shotgun sequences.
BMC Genomics 12 (Suppl. 2), S4 (2011).
37. Sunagawa, S. et al. Metagenomic species profiling using universal
phylogenetic marker genes. Nat. Methods 10, 1196–1199 (2013).
38. Koslicki, D., Foucart, S. & Rosen, G. Quikr: a method for rapid
reconstruction of bacterial communities via compressive sensing.
Bioinformatics 29, 2096–2102 (2013).
39. Koslicki, D. et al. ARK: Aggregation of Reads by k-Means for estimation of
bacterial community composition. PLoS One 10, e0140644 (2015).
40. Chatterjee, S. et al. SEK: sparsity exploiting k-mer-based estimation of
bacterial community composition. Bioinformatics 30, 2423–2431 (2014).
41. Klingenberg, H., Aßhauer, K.P., Lingner, T. & Meinicke, P. Protein
signature-based estimation of metagenomic abundances including all
domains of life and viruses. Bioinformatics 29, 973–980 (2013).
42. Nguyen, N.P., Mirarab, S., Liu, B., Pop, M. & Warnow, T. TIPP: taxonomic
identification and phylogenetic profiling. Bioinformatics 30, 3548–3555
(2014).
43. Balvočiūtė, M. & Huson, D.H. SILVA, RDP, Greengenes, NCBI and OTT—how
do these taxonomies compare? BMC Genomics 18 (Suppl 2), 114 (2017).
44. Thomas, T., Gilbert, J. & Meyer, F. Metagenomics—a guide from sampling
to data analysis. Microb. Inform. Exp. 2, 3 (2012).
1Faculty of Technology, Bielefeld University, Bielefeld, Germany. 2Center for Biotechnology, Bielefeld University, Bielefeld, Germany. 3Formerly Department of
Algorithmic Bioinformatics, Heinrich Heine University (HHU), Duesseldorf, Germany. 4Department of Computational Biology of Infection Research, Helmholtz
Centre for Infection Research (HZI), Braunschweig, Germany. 5Braunschweig Integrated Centre of Systems Biology (BRICS), Braunschweig, Germany. 6Mathematics
Department, Oregon State University, Corvallis, Oregon, USA. 7Department of Pediatrics, University of California, San Diego, California, USA. 8Department of
Computer Science and Engineering, University of California, San Diego, California, USA. 9German Center for Infection Research (DZIF), partner site Hannover-
Braunschweig, Braunschweig, Germany. 10Department of Plant Microbe Interactions, Max Planck Institute for Plant Breeding Research, Cologne, Germany. 11Cluster
of Excellence on Plant Sciences (CEPLAS). 12Department of Environmental Science, Section of Environmental microbiology and Biotechnology, Aarhus University,
Roskilde, Denmark. 13Department of Microbiology, University of Copenhagen, Copenhagen, Denmark. 14Department of Science and Environment, Roskilde University,
Roskilde, Denmark. 15Department of Energy, Joint Genome Institute, Walnut Creek, California, USA. 16Pittsburgh Supercomputing Center, Carnegie Mellon University,
Pittsburgh, Pennsylvania, USA. 17Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg,
Russia. 18Department of Microbiology and Ecosystem Science, University of Vienna, Vienna, Austria. 19The ithree institute, University of Technology Sydney, Sydney,
New South Wales, Australia. 20Department of Computer Science, Research Center in Computer Science (CRIStAL), Signal and Automatic Control of Lille, Lille, France.
21National Centre of the Scientific Research (CNRS), Rennes, France. 22Department of Computational and Systems Biology, Genome Institute of Singapore, Singapore.
23Department of Microbiology and Infection, Warwick Medical School, University of Warwick, Coventry, UK. 24Department of Computer Science, University of
Tuebingen, Tuebingen, Germany. 25Intel Corporation, Hillsboro, Oregon, USA. 26GenScale—Bioinformatics Research Team, Inria Rennes—Bretagne Atlantique Research
Centre, Rennes, France. 27Institute of Research in Informatics and Random Systems (IRISA), Rennes, France. 28Department of Molecular Infection Biology, Helmholtz
Centre for Infection Research, Braunschweig, Germany. 29Algorizk–IT consulting and software systems, Paris, France. 30Joint BioEnergy Institute, Emeryville, California,
USA. 31Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan. 32Biological Systems and
Engineering Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA. 33School of Computational Science and Engineering, Georgia Institute of
Technology, Atlanta, Georgia, USA. 34Energy Engineering and Geomicrobiology, University of Calgary, Calgary, Alberta, Canada. 35Department of Bioinformatics,
Institute for Microbiology and Genetics, University of Goettingen, Goettingen, Germany. 36Genevention GmbH, Goettingen, Germany. 37Institute of Population Health
Sciences, National Health Research Institutes, Zhunan Town, Taiwan. 38Computational Science Research Center, San Diego State University, San Diego, California,
USA. 39Boyce Thompson Institute for Plant Research, New York, New York, USA. 40Research Group Bioinformatics (NG4), Robert Koch Institute, Berlin, Germany.
41Coordination for the Improvement of Higher Education Personnel (CAPES) Foundation, Ministry of Education of Brazil, Brasília, Brazil. 42Center for Bioinformatics
and Computational Biology, University of Maryland, College Park, Maryland, USA. 43Department of Computer Science, University of Maryland, College Park, Maryland,
USA. 44School of Biology, Newcastle University, Newcastle upon Tyne, UK. 45Leibniz Institute DSMZ—German Collection of Microorganisms and Cell Cultures,
Braunschweig, Germany. 46Institute of Microbiology, ETH Zurich, Zurich, Switzerland. 47Present addresses: Department of Biodiversity, University of Duisburg-Essen,
Essen, Germany (S.M.); State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Beijing, China, and Centre of Excellence for Plant and
Microbial Sciences (CEPAMS), Beijing, China (Y.B.). 48These authors contributed equally to this work. Correspondence should be addressed to A.C.M.
(alice.mchardy@helmholtz-hzi.de) or A.S. (asczyrba@cebitec.uni-bielefeld.de).
NATURE METHODS doi:10.1038/nmeth.4458
ONLINE METHODS
Community involvement. We organized public workshops,
roundtables, hackathons and a research program around
CAMI at the Isaac Newton Institute for Mathematical Sciences
(Supplementary Fig. 46) to decide on the principles realized in
data set generation and challenge design. To determine the most
relevant metrics for performance evaluation, a meeting with
developers of evaluation software and commonly used binning,
profiling and assembly software was organized. Subsequently we
created biobox containers implementing a range of commonly
used performance metrics, including the ones decided to be most
relevant in this meeting (Supplementary Table 9). Computational
support for challenge participants was provided by the Pittsburgh
Supercomputing Center.
Standardization and reproducibility. For performance assessment,
we developed several standards. First, we defined output formats
for profiling and binning tools, for which no widely accepted
standard existed. Second, we defined standards for submitting
the software itself, along with parameter settings and required
databases and implemented them in Docker container templates
(bioboxes)10. These enable the standardized and reproducible
execution of submitted programs from a particular category.
Participants were encouraged to submit results together with their
software in a Docker container following the bioboxes standard.
In addition to 23 bioboxes submitted by challenge participants,
we generated 13 other bioboxes and ran them on the challenge
data sets (Supplementary Table 2), working with the developers
to define the most suitable execution settings, if possible. For
several submitted programs, bioboxes using default settings were
created to compare performance with default and expert chosen
parameter settings. If required, the bioboxes can be rerun on the
challenge data sets.
Genome sequencing and assembly. Draft genomes of 310 type
strain isolates were generated for the Genomic Encyclopedia
of Type Strains at the DOE Joint Genome Institute (JGI) using
Illumina standard shotgun libraries and the Illumina HiSeq
2000 platform. All general aspects of library construction and
sequencing performed at the JGI can be found at http://www.jgi.doe.gov.
Raw sequence data were passed through DUK, a filtering
program developed at JGI, which removes known Illumina
sequencing and library preparation artifacts (L. Mingkun,
A. Copeland and J. Han, Department of Energy Joint Genome
Institute, personal communication). The genome sequences of
isolates from culture collections are available in the JGI genome
portal (Supplementary Table 10). Additionally, 488 isolates from
the root and rhizosphere of Arabidopsis thaliana were sequenced8.
All sequenced environmental genomes were assembled using the
A5 assembly pipeline (default parameters, version 20141120)45
and are available for download at https://data.cami-challenge.org/participate.
A quality control of all assembled genomes
was performed on the basis of tetranucleotide content analysis
and taxonomic analyses (Supplementary Note 1), resulting in
689 genomes that were used for the challenge (Supplementary
Table 10). Furthermore, we generated 1.7 Mbp of sequence, comprising
598 novel circular sequences of plasmids, viruses and other circular
elements, from multiple microbial community samples of rat cecum
(Supplementary Note 1 and Supplementary Table 11).
Challenge data sets. We simulated three metagenome data sets of
different organismal complexities and sizes by generating 150-bp
paired-end reads with an Illumina HiSeq error profile from the
genome sequences of 689 newly sequenced bacterial and archaeal
isolates and 598 sequences of plasmids, viruses and other circular
elements (Supplementary Note 1, Supplementary Tables 3, 6 and
12 and Supplementary Figs. 47 and 48). These data sets represent
common experimental setups and specifics of microbial communi-
ties. They consist of a 15-Gbp single sample data set from a low-
complexity community with log normal abundance distribution (40
genomes and 20 circular elements), a 40-Gbp differential log nor-
mal abundance data set with two samples of a medium-complexity
community (132 genomes and 100 circular elements) and long and
short insert sizes, as well as a 75-Gbp time series data set with five
samples from a high-complexity community with correlated log
normal abundance distributions (596 genomes and 478 circular
elements). The benchmark data sets had some notable properties;
all included (i) species with strain-level diversity (Supplementary
Fig. 47) to explore its effect on program performance; (ii) viruses,
plasmids and other circular elements, for assessment of their
impact on program performances; and (iii) genomes at different
evolutionary distances to those in reference databases, to explore
the effect of increasing taxonomic distance on taxonomic binning.
Gold-standard assemblies, genome bin and taxon bin assignments
and taxonomic profiles were generated for every individual metage-
nome sample and for the pooled samples of each data set.
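As a rough illustration of a log normal abundance distribution such as the one described for the low-complexity community, the following Python sketch draws relative abundances for 40 genomes. The parameters mu and sigma, the seed and the function name are assumptions chosen for illustration only; they are not the values or the code used to generate the CAMI data sets.

```python
import random


def lognormal_abundances(n_genomes, mu=1.0, sigma=2.0, seed=0):
    """Draw relative abundances for n_genomes from a log normal distribution.

    mu and sigma are illustrative assumptions, not the CAMI settings.
    """
    rng = random.Random(seed)
    # One positive raw value per genome.
    raw = [rng.lognormvariate(mu, sigma) for _ in range(n_genomes)]
    total = sum(raw)
    # Normalize so that the abundances sum to 1.
    return [value / total for value in raw]


# 40 genomes, as in the low-complexity data set.
abundances = lognormal_abundances(40)
```

A larger sigma yields a more skewed community, with a few dominant genomes and many rare ones.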
Challenge organization. The first CAMI challenge benchmarked
software for sequence assembly, taxonomic profiling and (taxo-
nomic) binning. To allow developers to familiarize themselves
with the data types, biobox containers and in- and output formats,
we provided simulated data sets from public data together with
a ‘standard of truth’ before the start of the challenge. Reference
data sets of RefSeq, NCBI bacterial genomes, SILVA46 and the
NCBI taxonomy from 30 April 2014 were prepared for taxonomic
binning and profiling tools, to enable performance comparisons
for reference-based tools based on the same reference data sets.
For future benchmarking of reference-based programs with the
challenge data sets, it will be important to use these reference
data sets, as the challenge data have subsequently become part of
public reference data collections.
The CAMI challenge started on 27 March 2015 (Supplementary
Figs. 46 and 49). Challenge participants had to register on the
website to download the challenge data sets, and 40 teams regis-
tered at that time. They could then submit their predictions for
all data sets or individual samples thereof. They had the option
of providing an executable biobox implementing their software
together with specifications of parameter settings and reference
databases used. Submissions of assembly results were accepted
until 20 May 2015. Subsequently, a gold-standard assembly was
provided for all data sets and samples, which was suggested as
input for taxonomic binning and profiling. This includes all
genomic regions from the genome reference sequences and cir-
cular elements covered by at least one read in the pooled metage-
nome data sets or individual samples (Supplementary Note 1).
Provision of this assembly gold standard allowed us to decouple
the performance analyses of binning and profiling tools from
assembly performance. Developers could submit their binning
and profiling results until 18 July 2015.
Overall, 215 submissions representing 25 different programs
were obtained for the three challenge data sets and samples,
from initially 19 external teams and CAMI developers, 16 of
which consented to publication (Supplementary Table 2). The
genome data used to generate the simulated data sets was kept
confidential until the end of the challenge and then released8.
To ensure a more unbiased assessment, we required that chal-
lenge participants had no knowledge of the nature of the challenge
data sets. Program results displayed in the CAMI portal were
given anonymous names in an automated manner (only known
to the respective challenge submitter) until a first consensus on
performances was reached in the public evaluation workshop. In
particular, this was considered relevant for evaluation of taxa-
tor-tk and PhyloPythiaS+, which were from the lab of one of the
organizers (A.C.M.) but submitted without her involvement.
Evaluation metrics. We briefly outline the metrics used to evaluate
the four software categories. All metrics discussed, and several
others, are described in more depth in Supplementary Note 1.
Assemblies. The assemblies were evaluated with MetaQUAST47
using a mapping of assemblies to the genome and circular ele-
ment sequences of the benchmark data sets (Supplementary
Table 4). As metrics, we focused on genome fraction and
assembly size, the number of unaligned bases and misassem-
blies. Genome fraction measures the assembled percentage of
an individual genome, assembly size denotes the total assembly
length in bp (including misassemblies), and the number of mis-
assemblies and unaligned bases are error metrics reflective of
the assembly quality. Combined, they provide an indication of
the program performance. For instance, although assembly size
might be large, a high-quality assembly also requires the number
of misassemblies and unaligned bases to be low. To assess how
much metagenome data was included in each assembly, we also
mapped all reads back to them.
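The genome fraction described above can be sketched in Python as the percentage of reference positions covered by at least one aligned contig. This is a simplified illustration, not MetaQUAST's implementation; the half-open interval representation and the function names are assumptions.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval when the new one touches it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def genome_fraction(aligned_intervals, genome_length):
    """Percentage of reference bases covered by at least one alignment."""
    covered = sum(end - start for start, end in merge_intervals(aligned_intervals))
    return 100.0 * covered / genome_length


# Three hypothetical contig alignments on a 1,000-bp reference.
alignments = [(0, 400), (300, 600), (800, 1000)]
fraction = genome_fraction(alignments, 1000)
```

Overlapping alignments are merged first, so a position covered by several contigs is counted only once.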
Genome binning. We calculated completeness and purity for
every bin relative to the genome with the highest number of base
pairs in that bin. We measured the assignment accuracy for the
portion of the data assigned by the programs with the adjusted
Rand index (ARI). This complements completeness and purity averaged
over genome bins irrespective of their sizes (Fig. 2c,d), because
in the ARI large bins contribute more than smaller bins.
As not all programs assigned all data to genome bins, the ARI
should be interpreted under consideration of the fraction of data
assigned (Fig. 2a).
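The per-bin calculation described above can be sketched as follows. Each bin is mapped to the genome that contributes the most base pairs; purity is that genome's share of the bin, and completeness is the share of the genome recovered in the bin. The input layout (base pairs per genome in each bin, plus gold-standard genome sizes) and all names are assumptions made for illustration.

```python
def bin_quality(bin_contents, genome_sizes):
    """Completeness and purity of each bin relative to its majority genome.

    bin_contents: {bin_id: {genome_id: base pairs of that genome in the bin}}
    genome_sizes: {genome_id: base pairs of that genome in the gold standard}
    """
    results = {}
    for bin_id, bp_by_genome in bin_contents.items():
        # Genome with the highest number of base pairs in this bin.
        best_genome = max(bp_by_genome, key=bp_by_genome.get)
        best_bp = bp_by_genome[best_genome]
        results[bin_id] = {
            "genome": best_genome,
            "purity": best_bp / sum(bp_by_genome.values()),
            "completeness": best_bp / genome_sizes[best_genome],
        }
    return results


# Hypothetical bins and genome sizes.
bins = {"bin1": {"gA": 900, "gB": 100}, "bin2": {"gB": 400}}
sizes = {"gA": 1000, "gB": 500}
quality = bin_quality(bins, sizes)
```

Averaging these values over bins treats every bin equally, which is why a size-sensitive measure such as the ARI is a useful complement.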
Taxonomic binning. As performance metrics, the average
purity (precision) and completeness (recall) per taxon bin were
calculated for individual ranks under consideration of the taxon
assignment. In addition, we determined the overall classification
accuracy for each data set, as measured by total assigned sequence
length, and misclassification rate for all assignments. Whereas the
former two measures assess performance averaged over bins, with
all bins treated equally irrespective of their size, the latter
two are influenced by the taxonomic constitution of the data set,
with large bins having a proportionally larger influence.
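The contrast between bin-averaged and data-set-wide measures can be made concrete with a short Python sketch for a single taxonomic rank. The record format, the function name `taxonomic_binning_metrics`, and the exact denominators (accuracy over all base pairs in the data set, misclassification rate over assigned base pairs) are assumptions made for this example rather than the definitions used in the evaluation.

```python
from collections import defaultdict


def taxonomic_binning_metrics(records):
    """Bin-averaged and overall metrics at one taxonomic rank.

    records: iterable of (true_taxon, predicted_taxon_or_None, length_bp);
             None marks a sequence left unassigned (hypothetical format).
    """
    true_bp = defaultdict(int)     # bp belonging to each true taxon
    pred_bp = defaultdict(int)     # bp assigned to each predicted taxon
    correct_bp = defaultdict(int)  # correctly assigned bp per taxon
    total = assigned = 0

    for true_taxon, predicted, length in records:
        total += length
        true_bp[true_taxon] += length
        if predicted is None:
            continue
        assigned += length
        pred_bp[predicted] += length
        if predicted == true_taxon:
            correct_bp[predicted] += length

    # Every bin counts equally here, whatever its size.
    precisions = [correct_bp[t] / bp for t, bp in pred_bp.items()]
    recalls = [correct_bp[t] / bp for t, bp in true_bp.items()]
    correct_total = sum(correct_bp.values())

    return {
        "avg_precision": sum(precisions) / len(precisions) if precisions else 0.0,
        "avg_recall": sum(recalls) / len(recalls) if recalls else 0.0,
        # Large bins dominate these two data-set-wide measures.
        "accuracy": correct_total / total if total else 0.0,
        "misclassification_rate": (assigned - correct_total) / assigned if assigned else 0.0,
    }


if __name__ == "__main__":
    records = [("A", "A", 900), ("A", "B", 100), ("B", "B", 100), ("C", None, 100)]
    for name, value in taxonomic_binning_metrics(records).items():
        print(f"{name}: {value:.3f}")
```

In the toy input, the large taxon A pulls accuracy up to about 0.83, while the unrecovered small taxon C lowers the average recall to about 0.63, showing how the two views can diverge.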
Taxonomic profiling. We determined abundance metrics (L1
norm and weighted UniFrac)48 and binary classification measures
(recall and precision). The first two assess how well a particular
method reconstructs the relative abundances in comparison to
the gold standard, with the L1 norm using the sum of differ-
ences in abundances (ranges between 0 and 2) and UniFrac using
differences weighted by distance in the taxonomic tree (ranges
between 0 and 16). The binary classification metrics assess how
well a particular method detects the presence or absence of an
organism in comparison to the gold standard, irrespective of
its abundance. All metrics except the UniFrac metric (which
is rank independent) are defined at each taxonomic rank. We
also calculated the following summary statistic: for each metric,
on each sample, we ranked the profilers by their performance.
Each was assigned a score for its ranking (0 for first place among
all tools at a particular taxonomic rank for a particular sample,
1 for second place, etc.). These scores were then summed over the
taxonomic ranks down to the genus level and over the samples,
to give a global performance score.
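The L1 norm, the binary classification measures and the rank-based summary score can be sketched in Python as follows. Weighted UniFrac is omitted because it requires the taxonomic tree. The dictionary-based profile format and the function names are assumptions for this example, and ties between profilers are not treated specially (they fall back to input order), which may differ from the original scoring.

```python
from collections import defaultdict


def l1_norm(gold, predicted):
    """Sum of absolute differences in relative abundance (0 to 2 for normalized profiles)."""
    taxa = set(gold) | set(predicted)
    return sum(abs(gold.get(t, 0.0) - predicted.get(t, 0.0)) for t in taxa)


def presence_precision_recall(gold, predicted):
    """Precision and recall of taxon detection, ignoring abundances."""
    gold_taxa = {t for t, a in gold.items() if a > 0}
    pred_taxa = {t for t, a in predicted.items() if a > 0}
    true_positives = len(gold_taxa & pred_taxa)
    precision = true_positives / len(pred_taxa) if pred_taxa else 0.0
    recall = true_positives / len(gold_taxa) if gold_taxa else 0.0
    return precision, recall


def summary_scores(results, lower_is_better=True):
    """Sum placement scores (0 = first place) over samples and ranks.

    results: {(sample, rank): {profiler: metric_value}} for one metric
             (hypothetical layout). A lower total means better performance.
    """
    totals = defaultdict(int)
    for values in results.values():
        ordered = sorted(values, key=values.get, reverse=not lower_is_better)
        for place, profiler in enumerate(ordered):
            totals[profiler] += place
    return dict(totals)


if __name__ == "__main__":
    gold = {"A": 0.5, "B": 0.5}
    predicted = {"A": 0.75, "C": 0.25}
    print(l1_norm(gold, predicted))
    print(presence_precision_recall(gold, predicted))

    l1_results = {
        ("sample1", "phylum"): {"profiler1": 0.1, "profiler2": 0.3},
        ("sample1", "genus"): {"profiler1": 0.2, "profiler2": 0.4},
    }
    print(summary_scores(l1_results))
```

For metrics where higher values are better, such as recall and precision, `lower_is_better=False` reverses the ordering before the placement scores are assigned.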
Data availability. A Life Sciences Reporting Summary for this
paper is available. The plasmid assemblies, raw data and metadata
have been deposited in the European Nucleotide Archive (ENA)
under accession number PRJEB20380. The challenge and toy data
sets including the gold standard, the assembled genomes used
to generate the benchmark data sets (Supplementary Table 10),
NCBI and ARB public reference sequence collections without
the benchmark data and the NCBI taxonomy version used for
taxonomic binning and profiling are available in GigaDB under
data set identifier (100344) and on the CAMI analysis site for
download and within the benchmarking platform
(https://data.cami-challenge.org/participate). Further information on the
CAMI challenge, results and scripts is provided at
https://github.com/CAMI-challenge/. Supplementary Tables 2 and 9 specify the
Docker Hub locations of bioboxes for the evaluated programs and
the metrics used. Source data for Figures 1–3 are available online.
45. Coil, D., Jospin, G. & Darling, A.E. A5-miseq: an updated pipeline to
assemble microbial genomes from Illumina MiSeq data. Bioinformatics 31,
587–589 (2015).
46. Pruesse, E. et al. SILVA: a comprehensive online resource for quality
checked and aligned ribosomal RNA sequence data compatible with ARB.
Nucleic Acids Res. 35, 7188–7196 (2007).
47. Mikheenko, A., Saveliev, V. & Gurevich, A. MetaQUAST: evaluation of
metagenome assemblies. Bioinformatics 32, 1088–1090 (2016).
48. Lozupone, C. & Knight, R. UniFrac: a new phylogenetic method
for comparing microbial communities. Appl. Environ. Microbiol. 71,
8228–8235 (2005).