
Greengenes2 unifies microbial data in a single reference tree

Studies using 16S rRNA and shotgun metagenomics typically yield different results, usually attributed to PCR amplification biases. We introduce Greengenes2, a reference tree that unifies genomic and 16S rRNA databases in a consistent, integrated resource. By inserting sequences into a whole-genome phylogeny, we show that 16S rRNA and shotgun metagenomic data generated from the same samples agree in principal coordinates space, taxonomy and phenotype effect size when analyzed with the same tree.
Nature Biotechnology
https://doi.org/10.1038/s41587-023-01845-1
Brief Communication
Daniel McDonald1, Yueyu Jiang2, Metin Balaban3, Kalen Cantrell4,
Qiyun Zhu5,6, Antonio Gonzalez1, James T. Morton7, Giorgia Nicolaou8,
Donovan H. Parks9, Søren M. Karst10, Mads Albertsen11,
Philip Hugenholtz9, Todd DeSantis12, Se Jin Song13, Andrew Bartko13,
Aki S. Havulinna14,15, Pekka Jousilahti14, Susan Cheng16,17, Michael Inouye18,19,
Teemu Niiranen14,20, Mohit Jain21, Veikko Salomaa14, Leo Lahti22,
Siavash Mirarab2 & Rob Knight1,4,13,23
Shotgun metagenomics and 16S rRNA gene amplicon (16S) studies are
widely used in microbiome research, but investigators using these dif-
ferent methods typically find their results hard to reconcile. This lack
of standardization across methods limits the utility of the microbiome
for reproducible biomarker discovery.
A key problem is that whole-genome resources and rRNA resources
depend on different taxonomies and phylogenies. For example, Web
of Life (WoL; ref. 1) and the Genome Taxonomy Database (GTDB; ref. 2)
provide whole-genome trees that cover only a small fraction of known
bacteria and archaea, while SILVA (ref. 3) and Greengenes (ref. 4) are
more comprehensive but are most often not linked to genome records.
Received: 16 December 2022
Accepted: 25 May 2023
Published online: xx xx xxxx
1Department of Pediatrics, University of California San Diego School of Medicine, La Jolla, CA, USA. 2Department of Electrical and Computer Engineering,
University of California San Diego, La Jolla, CA, USA. 3Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA,
USA. 4Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA. 5School of Life Sciences, Arizona State
University, Tempe, AZ, USA. 6Biodesign Center for Fundamental and Applied Microbiomics, Arizona State University, Tempe, AZ, USA. 7Biostatistics &
Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD,
USA. 8Halicioglu Data Science Institute, University of California San Diego, La Jolla, CA, USA. 9Australian Centre for Ecogenomics, School of Chemistry
and Molecular Biosciences, The University of Queensland, St Lucia, Queensland, Australia. 10Department of Obstetrics and Gynecology, Columbia
University, New York, NY, USA. 11Department of Chemistry and Bioscience, Aalborg University, Aalborg, Denmark. 12Department of Informatics, Second
Genome, Brisbane, CA, USA. 13Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, La Jolla, CA,
USA. 14Finnish Institute for Health and Welfare, Helsinki, Finland. 15Institute for Molecular Medicine Finland, FIMM-HiLIFE, Helsinki, Finland. 16Division
of Cardiology, Brigham and Women’s Hospital, Boston, MA, USA. 17Cedars-Sinai Medical Center, Los Angeles, CA, USA. 18Cambridge Baker Systems
Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia. 19Cambridge Baker Systems Genomics Initiative, Department of
Public Health and Primary Care, University of Cambridge, Cambridge, UK. 20Division of Medicine, Turku University Hospital and University of Turku, Turku,
Finland. 21Sapient Bioanalytics, LLC, San Diego, CA, USA. 22Department of Computing, University of Turku, Turku, Finland. 23Department of Bioengineering,
University of California San Diego, La Jolla, CA, USA. e-mail: robknight@eng.ucsd.edu
Content courtesy of Springer Nature, terms of use apply. Rights reserved
We reasoned that an iterative approach could yield a single massive
reference tree that unifies these different data layers (for example,
genome and 16S rRNA records), which we call Greengenes2. We began
with a whole-genome catalog of 15,953 bacterial and archaeal genomes
that were evenly sampled from NCBI, and we reconstructed an accurate
phylogenomic tree by summarizing evolutionary trajectories of
380 global marker genes using the new workflow uDance (ref. 5). This work,
namely WoL version 2 (WoL2), represents a substantial upgrade from
the previously released WoL1 (10,575 genomes; refs. 1,6). We then added
18,356 full-length 16S rRNA sequences from the Living Tree Project (LTP)
January 2022 release (ref. 7), 1,725,274 near-complete 16S rRNA genes from
Karst et al. (ref. 8) and the Earth Microbiome Project 500 (EMP500; ref. 9)
and all full-length 16S rRNA sequences from GTDB r207 to the genome-based
backbone with uDance v1.1.0, producing a genome-supported phylogeny
with 16S rRNA explicitly represented. Finally, we inserted 23,113,447
short V4 16S rRNA Deblur v1.1.0 (ref. 10) amplicon sequence variants
(ASVs) from Qiita (retrieved 14 December 2021; ref. 11) and mitochondria and
chloroplast 16S rRNA from SILVA v138 using deep-learning-enabled
phylogenetic placement (DEPP) v0.3 (ref. 12). This final step represents
ASVs from over 300,000 public and private samples in Qiita, including
the entirety of the EMP (ref. 13) and American Gut Project/Microsetta
(ref. 14) (Fig. 1a). Our use of uDance ensured that the genome-based
relationships are kept fixed, and relationships between full-length 16S rRNA
sequences are inferred. For short fragments, we kept genome and
full-length relationships fixed and inserted fragments independently
from each other. Following deduplication and quality control on fragment
placement, this yielded a tree covering 21,074,442 sequences
from 31 different EMP Ontology 3 (EMPO3) environments, of which
46.5% of species-level leaves were covered by a complete genome.
Taxonomic labels were decorated onto the phylogeny using tax2tree v1.1
(ref. 4). The input taxonomy for decoration used GTDB r207, combined
with the LTP January 2022 release. Taxonomy was harmonized prioritizing
GTDB, including preserving the polyphyletic labels of GTDB (see
also Methods). The taxonomy will be updated every 6 months using
the latest versions of GTDB and LTP.
[Figure 1 panels. a, tip colors: AGP, EMP, Both, Neither; outer bar phyla:
Acidobacteriota, Actinobacteriota, Bacteroidota, Bdellovibrionota_E,
Chlamydiota, Chloroflexota, Cyanobacteria, Desulfobacterota_I, Firmicutes_A,
Firmicutes_C, Firmicutes_D, Myxococcota_A, Nanoarchaeota, Omnitrophota,
Patescibacteria, Planctomycetota, Proteobacteria, Spirochaetota,
Verrucomicrobiota, Other. b, tip colors: SILVA 138, Not in SILVA 138.
c, axes: EMPO3 category and branch length (0 to 0.006); categories shown:
Plant corpus, Sediment (saline), Animal corpus, Animal surface, Animal
secretion, Animal (non-saline), Solid (non-saline), Soil (non-saline),
Water (saline), Plant rhizosphere, Surface (saline), Sterile water blank,
Sediment (non-saline), Animal distal gut, Plant surface, Aerosol
(non-saline), Water (non-saline), Animal proximal gut, Surface (non-saline).
d–f, PC1 versus PC2 with samples labeled 16S or WGS; axis labels in
extraction order: PC1 (30.73%), PC2 (5.81%); PC1 (48.94%), PC2 (11.34%);
PC1 (46.07%), PC2 (9.76%).]
Fig. 1 | Greengenes2 overview and harmonization of 16S rRNA ASVs with
shotgun metagenomic data. a, The Greengenes2 phylogeny rendered using
Empress (ref. 23), with ASV multifurcations collapsed; tip color indicates
representation in the American Gut Project (AGP), the EMP, both or neither,
with the top 20 represented phyla depicted in the outer bar. b, The same
collapsed phylogeny colored by the presence or absence of the best BLAST
(ref. 24) hit from SILVA 138. The bar depicts the same coloring as the tips.
c, EMP samples and the amount of novel branch length (normalized by the
total backbone branch length) added to the tree through ASV fragment
placement. Note that sample counts are not even across EMPO3 categories.
d, Bray–Curtis applied to paired 16S V4 rRNA ASVs and whole-genome shotgun
samples from the THDMI subset of The Microsetta Initiative; PC, principal
coordinate. e, Same data as d but computing Bray–Curtis on collapsed genus
data. f, Same data as d and e but using weighted UniFrac at the
ASV and genome identifier levels.
Greengenes2 is much larger than past resources in its phylogenetic
coverage, as compared to SILVA (Fig. 1b), Greengenes (Supplementary
Fig. 1a) and GTDB (Supplementary Fig. 1b). Moreover, because our
amplicon library is linked to environments labeled with EMPO cat-
egories, we can easily identify the environments that contain samples
that can fill out the tree. Because metagenome assembled genome
(MAG) assembly efforts can only cover abundant taxa, for each EMPO
category, we plotted the amount of new branch length added to the
tree by taxa whose minimum abundance is 1% in each sample (Fig. 1c).
The results show, on average, which environment types will best yield
new MAGs and which environments harbor individual samples that
will have a large impact when sequenced.
Past efforts to reconcile 16S and shotgun datasets have led to
non-overlapping distributions, and only techniques such as Procrustes
analysis can show relationships between the results (ref. 15). In
two large human stool cohorts (refs. 14,16) where both 16S and shotgun
data were generated on the same samples, we find that Bray–Curtis
(non-phylogenetic; ref. 17) ordination fails to reconcile at the feature
level (Fig. 1d) and is poor at the genus level (Fig. 1e and Supplementary
Fig. 1c). However, UniFrac (ref. 18), a phylogenetic method, used with our
Greengenes2 tree provides better concordance (Fig. 1f and Supplementary
Fig. 1d). To examine applicability of Greengenes2 to non-human
environments, we next computed both Bray–Curtis and weighted
UniFrac at the feature level on the 16S and shotgun data from the EMP
(ref. 9). As with the human data, we observe better concordance with the
use of the Greengenes2 phylogeny (Supplementary Fig. 2) despite limited
representation of whole genomes from non-human sources, as these
environments are not as well characterized in general.
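The contrast between the two metrics can be illustrated with a toy example. The sketch below is not the optimized UniFrac implementation cited above; it is a plain-Python illustration in which the tree, the tip names (`G1`, `asv1` and so on) and the counts are all hypothetical. It assumes a tree in which each genome tip has an ASV tip as a close sibling, so that the 16S and shotgun tables describe the same community with disjoint feature identifiers.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two {feature: abundance} dicts."""
    features = set(a) | set(b)
    num = sum(abs(a.get(f, 0.0) - b.get(f, 0.0)) for f in features)
    den = sum(a.get(f, 0.0) + b.get(f, 0.0) for f in features)
    return num / den


def weighted_unifrac(parent, length, a, b):
    """Unnormalized weighted UniFrac on a tree stored as child -> parent links.

    parent: {node: parent_node}; the root maps to None
    length: {node: length of the branch above that node}
    a, b:   {tip: count} for the two samples
    """
    def branch_mass(sample):
        # Fraction of the sample that descends from each branch
        total = sum(sample.values())
        mass = {}
        for tip, count in sample.items():
            node = tip
            while parent[node] is not None:  # walk up to the root
                mass[node] = mass.get(node, 0.0) + count / total
                node = parent[node]
        return mass

    mass_a, mass_b = branch_mass(a), branch_mass(b)
    return sum(length[n] * abs(mass_a.get(n, 0.0) - mass_b.get(n, 0.0))
               for n in set(mass_a) | set(mass_b))


# Hypothetical tree: each genome tip (G*) has a sibling ASV tip (asv*)
parent = {"root": None, "cladeX": "root", "cladeY": "root",
          "G1": "cladeX", "asv1": "cladeX", "G2": "cladeY", "asv2": "cladeY"}
length = {"cladeX": 1.0, "cladeY": 1.0,
          "G1": 0.01, "asv1": 0.01, "G2": 0.01, "asv2": 0.01}

sample_16s = {"asv1": 70, "asv2": 30}    # ASV identifiers only
sample_wgs = {"G1": 700, "G2": 300}      # genome identifiers only

print(bray_curtis(sample_16s, sample_wgs))                       # no shared features
print(weighted_unifrac(parent, length, sample_16s, sample_wgs))  # short tip branches only
```

Because the two tables share no feature identifiers, Bray–Curtis reports maximal dissimilarity, whereas the tree-based distance accumulates only the short terminal branches that separate each ASV from its sibling genome.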
[Figure 2 panels. a, SILVA 138 naive Bayes versus GG2 WGS; b, GG 13_8 naive
Bayes versus GG2 WGS; c, GG2 16S versus GG2 WGS (naive Bayes and
phylogenetic taxonomy). In a–c, the y axis is Pearson correlation (0 to 1.0)
and the x axis is taxonomic rank (class, order, family, genus and, where
available, species). d, Bray–Curtis effect sizes (Cohen’s d and f);
Pearson r² = 0.57; P = 1.35 × 10⁻¹⁶. e, Weighted UniFrac effect sizes
(Cohen’s d and f); Pearson r² = 0.86; P = 2.67 × 10⁻³⁷. In d and e, the
axes are effect size 16S and effect size WGS (0 to 0.25); labeled points:
population, red wine, white wine.]
Fig. 2 | Taxonomic and effect size consistency between 16S rRNA ASVs and
shotgun metagenomic data. a–c, Per-sample taxonomy comparisons between
16S and whole-genome shotgun profiles from THDMI. The solid bar depicts the
50th percentile, and the dashed lines are 25th and 75th percentiles. a, Assessment
of 16S taxonomy with SILVA 138 using the default q2-feature-classifier naive Bayes
model (note, SILVA does not annotate at the species level); GG2, Greengenes2.
b, Assessment of 16S taxonomy with Greengenes 13_8 (GG13_8) using the
default q2-feature-classifier naive Bayes model. c, Assessment of 16S taxonomy
performed by reading the lineages directly from the phylogeny or through naive
Bayes trained on the V4 regions of the Greengenes2 backbone. d,e, Effect size
calculations performed with Evident on paired 16S and whole-genome shotgun
samples from THDMI. Calculations were performed at maximal resolution using
ASVs for 16S and genome identifiers for shotgun samples. The data represented
here are human gut microbiome samples. The stars denote variables that are
drawn out specifically in the plot (for example, population) and were arbitrarily
selected as comparison points to help highlight differences between d and e.
Bray–Curtis distances (d) and weighted normalized UniFrac (e) are shown.
We also find that the per-sample shotgun and 16S taxonomy rela-
tive abundance profiles are concordant even to the species level. We
first computed taxonomy profiles for shotgun data using the Woltka
pipeline (ref. 19). Using a naive Bayes classifier from q2-feature-classifier
v2022.2 (ref. 20) to compare GTDB r207 taxonomy results at each
level down to the genus level against SILVA v138 (Fig. 2a) or down to
the species level against Greengenes v13_8 (Fig. 2b), no species-level
reconciliation was possible. By contrast, Greengenes2 provided excel-
lent concordance at the genus level (Pearson r = 0.85) and good con-
cordance at the species level (Pearson r = 0.65; Fig. 2c). Interestingly,
the tree is now sufficiently complete such that exact matching of 16S
ASVs followed by reading the taxonomy off the tree performs even
better than the naive Bayes classifier (naive Bayes, Pearson r = 0.54 at
the species level and r = 0.84 at the genus level).
Finally, a critical reason to assign taxonomy is downstream use of
biomarkers and indicator taxa. Microbiome science has been described
as having a reproducibility crisis (ref. 21), but much of this problem stems
from incompatible methods (ref. 22). We initially used The Human Diet
Microbiome Initiative (THDMI) dataset, which is a multipopulation expansion
of The Microsetta Initiative (ref. 14) that contains samples with paired 16S
and shotgun preparations, to test whether a harmonized resource would
provide concordant rankings for the variables that affect the human
microbiome similarly. Using Greengenes2, the concordance was good
with Bray–Curtis (Fig. 2d; Pearson r² = 0.57), better using UniFrac with
different phylogenies (SILVA 138 and Greengenes2; Supplementary
Fig. 1e; Pearson r² = 0.77) and excellent with UniFrac on the same
phylogeny (Fig. 2e; Pearson r² = 0.86). We confirmed these results with an
additional cohort (ref. 16; Supplementary Fig. 1f,g). Intriguingly, the ranked
effect sizes across different cohorts were concordant.
Taken together, these results show that use of a consistent, inte-
grated taxonomic resource dramatically improves the reproducibility
of microbiome studies using different data types and allows varia-
bles of large versus small effect to be reliably recovered in different
populations.
Online content
Any methods, additional references, Nature Portfolio reporting sum-
maries, source data, extended data, supplementary information,
acknowledgements, peer review information; details of author con-
tributions and competing interests; and statements of data and code
availability are available at https://doi.org/10.1038/s41587-023-01845-1.
References
1. Zhu, Q. et al. Phylogenomics of 10,575 genomes reveals
evolutionary proximity between domains Bacteria and Archaea.
Nat. Commun. 10, 5477 (2019).
2. Parks, D. H. et al. GTDB: an ongoing census of bacterial and
archaeal diversity through a phylogenetically consistent, rank
normalized and complete genome-based taxonomy. Nucleic
Acids Res. 50, D785–D794 (2022).
3. Quast, C. et al. The SILVA ribosomal RNA gene database project:
improved data processing and web-based tools. Nucleic Acids
Res. 41, D590–D596 (2013).
4. McDonald, D. et al. An improved Greengenes taxonomy with
explicit ranks for ecological and evolutionary analyses of Bacteria
and Archaea. ISME J. 6, 610–618 (2012).
5. Balaban, M. et al. Generation of accurate, expandable
phylogenomic trees with uDANCE. Nat. Biotechnol.
https://doi.org/10.1038/s41587-023-01868-8 (2023).
6. Hugenholtz, P., Chuvochina, M., Oren, A., Parks, D. H. & Soo, R.
M. Prokaryotic taxonomy and nomenclature in the age of big
sequence data. ISME J. 15, 1879–1892 (2021).
7. Ludwig, W. et al. Release LTP_12_2020, featuring a new ARB
alignment and improved 16S rRNA tree for prokaryotic type
strains. Syst. Appl. Microbiol. 44, 126218 (2021).
8. Karst, S. M. et al. High-accuracy long-read amplicon sequences
using unique molecular identifiers with Nanopore or PacBio
sequencing. Nat. Methods 18, 165–169 (2021).
9. Shaffer, J. P. et al. Standardized multi-omics of Earth’s
microbiomes reveals microbial and metabolite diversity. Nat.
Microbiol. 7, 2128–2150 (2022).
10. Amir, A. et al. Deblur rapidly resolves single-nucleotide
community sequence patterns. mSystems 2, e00191-16 (2017).
11. Gonzalez, A. et al. Qiita: rapid, web-enabled microbiome
meta-analysis. Nat. Methods 15, 796–798 (2018).
12. Jiang, Y., McDonald, D., Knight, R. & Mirarab, S. Scaling deep
phylogenetic embedding to ultra-large reference trees: a
tree-aware ensemble approach. Preprint at bioRxiv
https://doi.org/10.1101/2023.03.27.534201 (2023).
13. Thompson, L. R. et al. A communal catalogue reveals Earth’s
multiscale microbial diversity. Nature 551, 457–463 (2017).
14. McDonald, D. et al. American Gut: an open platform for citizen
science microbiome research. mSystems 3, e00031-18 (2018).
15. Human Microbiome Project Consortium. Structure, function
and diversity of the healthy human microbiome. Nature 486,
207–214 (2012).
16. Salosensaari, A. et al. Taxonomic signatures of cause-specific
mortality risk in human gut microbiome. Nat. Commun. 12, 2671
(2021).
17. Bray, J. R. & Curtis, J. T. An ordination of the upland forest commu-
nities of southern Wisconsin. Ecol. Monogr. 27, 325–349 (1957).
18. Siligoi, I., Armstrong, G., Gonzalez, A., McDonald, D. & Knight,
R. Optimizing UniFrac with OpenACC yields greater than one
thousand times speed increase. mSystems 7, e0002822 (2022).
19. Zhu, Q. et al. Phylogeny-aware analysis of metagenome
community ecology based on matched reference genomes while
bypassing taxonomy. mSystems 7, e0016722 (2022).
20. Bokulich, N. A. et al. Optimizing taxonomic classification
of marker-gene amplicon sequences with QIIME 2’s
q2-feature-classifier plugin. Microbiome 6, 90 (2018).
21. Schloss, P. D. Identifying and overcoming threats to
reproducibility, replicability, robustness, and generalizability in
microbiome research. mBio 9, e00525-18 (2018).
22. Sinha, R. et al. Assessment of variation in microbial community
amplicon sequencing by the Microbiome Quality Control (MBQC)
project consortium. Nat. Biotechnol. 35, 1077–1086 (2017).
23. Cantrell, K. et al. EMPress enables tree-guided, interactive,
and exploratory analyses of multi-omic data sets. mSystems 6,
e01216-20 (2021).
24. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic
local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Publisher’s note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons
Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format,
as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license, and indicate
if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons license, unless
indicated otherwise in a credit line to the material. If material is not
included in the article’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright
holder. To view a copy of this license, visit
http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2023, corrected publication 2023
Methods
Human research protocols
THDMI participant informed consent was obtained under University
of California, San Diego, institutional review board protocol 141853.
FINRISK participant informed consent was obtained under the Coordi-
nating Ethical Committee of the Helsinki and Uusimaa Hospital District
protocol reference number 558/E3/2001.
Phylogeny construction
WoL2 (ref. 1; a tree inferred using genome-wide data) was used as the
starting backbone. Full-length 16S sequences from the LTP (ref. 7),
full-length mitochondria and chloroplast from SILVA 138 (ref. 3),
full-length 16S from GTDB r207 (ref. 2), full-length 16S from Karst et al.
(ref. 8) and full-length 16S from the EMP500 (ref. 9; samples selected and
sequenced specifically for Greengenes2) were collected and deduplicated.
Sequences were then aligned using UPP (ref. 25), and gappy sequences with
fewer than
1,000 base pairs were removed. The resulting set of 321,210 unique
sequences was used with uDance v1.1.0 to update the WoL2 backbone.
Briefly, uDance updates an existing tree with new sequences and (unlike
placement methods) also infers the relationship of existing sequences.
uDance has two modes, one that allows updates to the backbone and
one that keeps the backbone fixed, where the former mode is intended
for use with whole genomes. In our analyses, we kept the backbone tree
(inferred using genomic data) fixed. To extend the genomic tree with
16S data, we identified 13,249 (of 15,953 total) genomes in the WoL2
backbone tree with at least one 16S copy and used them to train a DEPP
model with the weighted average method detailed later to handle multi-
ple copies. We then used DEPP to insert all 16S copies of all genomes into
the backbone and measured the distance between the genome position
and the 16S position. We removed copies that were placed much further
than others, as identified using a two-means approach with centroids
equal to at least 13 branches. We repeated this process in a second round.
For every remaining genome, we selected as its representative the copy
with the minimum placement error and computed the consensus with
ties. At the end, we were left with 12,344 unique 16S sequences across
all WoL2 genomes. For tree inference, uDance used IQ-TREE2 (ref. 26) in
fast tree search with model GTR+Γ after removing duplicate sequences.
Next, we collected 16S V4 ASVs from Qiita (ref. 11) using redbiom (ref. 27;
query performed 14 December 2021) from contexts
‘Deblur_2021.09-Illumina-16S-V4-90nt-dd6875’,
‘Deblur_2021.09-Illumina-16S-V4-100nt-50b3a2’,
‘Deblur_2021.09-Illumina-16S-V4-125nt-92f954’,
‘Deblur_2021.09-Illumina-16S-V4-150nt-ac8c0b’,
‘Deblur_2021.09-Illumina-16S-V4-200nt-0b8b48’ and
‘Deblur_2021.09-Illumina-16S-V4-250nt-8b2bff’ and
aligned them to the existing 16S alignment of sequences in WoL2 using
UPP, setting the maximum alignment subset size to 200 (to help with
scalability). The collected 16S V4 ASVs were aligned to the V4 region of
the existing ‘backbone’ alignments. A DEPP model was then trained
on the full-length 16S sequences from the backbone. DEPP constructs
a neural network model that embeds sequences in high-dimensional
spaces such that embedded points resemble the phylogeny in their
distances. Such a model then allows insertion of new sequences into a
tree using the distance-based phylogenetic insertion method APPLES-2
(ref. 28). The ASVs from redbiom were then inserted into the backbone
using the trained DEPP model. To enable analyses of large datasets, we
used a clustering approach with DEPP. We trained an ensemble of DEPP
models corresponding to different parts of the tree and used a classifier
to detect the correct subtree. During training, for species with multiple
16S copies, all the copies are mapped to the same leaf in the backbone tree.
To train the DEPP models with multiple sequences mapped to a leaf,
each site in each sequence is encoded as a probability vector of four
nucleotides across all the copies.
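The multi-copy encoding can be sketched as follows. This is an illustration of the idea only, not the DEPP code: the function name `encode_copies` is hypothetical, and the handling of gaps and ambiguous characters (they are simply skipped) is an assumption of this sketch that the text above does not specify.

```python
NUCLEOTIDES = "ACGT"


def encode_copies(copies):
    """Encode the aligned 16S copies of one species as per-site frequency vectors.

    copies: aligned sequences of equal length (strings).
    Returns one [P(A), P(C), P(G), P(T)] list per alignment column.
    """
    if len({len(c) for c in copies}) != 1:
        raise ValueError("copies must be aligned to the same length")
    encoded = []
    for column in zip(*copies):
        counts = [sum(1 for ch in column if ch.upper() == n) for n in NUCLEOTIDES]
        total = sum(counts)
        # Assumption: a column with no A/C/G/T (for example, all gaps)
        # becomes a zero vector
        encoded.append([c / total if total else 0.0 for c in counts])
    return encoded


# Three hypothetical copies from one genome, differing at the last site
vectors = encode_copies(["ACGT", "ACGA", "AC-T"])
for site, vector in enumerate(vectors):
    print(site, vector)
```

With a single copy, each vector reduces to a one-hot encoding, so single-copy and multi-copy species share one input representation.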
Integrating the GTDB and LTP taxonomies
GTDB and LTP are not directly compatible due to differences in their
curation. As a result, it is not always possible to map a species from
one resource to the other because parts of a species lineage are not
present, are described using different names or have an ambiguous
association due to polyphyletic taxa in GTDB (for example, Firmicutes_A,
Firmicutes_B and so on;
https://gtdb.ecogenomic.org/faq#why-do-some-family-and-higher-rank-names-end-with-an-alphabetic-suffix).
We integrated taxonomic data from LTP into GTDB as LTP includes spe-
cies that are not yet represented in GTDB. Additionally, GTDB is actively
curated, while LTP generally uses the NCBI taxonomy. To account for
these differences, we first mapped any species that had a perfect species
name association and revised its ancestral lineage to match GTDB. Next,
we generated lineage rewrite rules using the GTDB record metadata.
Specifically, we limited the metadata to records that are GTDB
representatives and NCBI-type material and defined a lineage renaming from
the recorded NCBI taxonomy to the GTDB taxonomy. These rewrite
rules were applied from most- to least-specific taxa, and through this
mechanism, we could revise much of the higher ranks of LTP. We then
identified incertae sedis records in LTP that we could not map, removed
their lineage strings and did not attempt to provide taxonomy for them,
instead opting to rely on downstream taxonomy decoration to resolve
their lineages. Next, any record that was ambiguous to map was split
into a secondary taxonomy for use in backfilling in the downstream
taxonomy decoration. Finally, we instrumented numerous consistency
checks in the taxonomy throughout the process to detect inconsistent
parents in the taxonomic hierarchy, to verify a consistent number of ranks
in each lineage and to ensure that the resulting taxonomy was a strict
hierarchy.
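The consistency checks described above might look like the following sketch. It is not the code used to build Greengenes2. It assumes lineages written as semicolon-delimited strings with the seven GTDB-style rank prefixes (`d__` to `s__`); the function name `check_taxonomy`, the record identifiers and the taxon names are hypothetical.

```python
RANKS = ["d__", "p__", "c__", "o__", "f__", "g__", "s__"]


def check_taxonomy(lineages):
    """Return a list of problems found in {record_id: lineage string}.

    Checks that every lineage has one name per rank with the expected prefix
    and that each named taxon has exactly one parent (a strict hierarchy).
    """
    problems = []
    parent_of = {}
    for record, lineage in lineages.items():
        names = [n.strip() for n in lineage.split(";")]
        if len(names) != len(RANKS):
            problems.append(f"{record}: expected {len(RANKS)} ranks, found {len(names)}")
            continue
        if any(not n.startswith(p) for n, p in zip(names, RANKS)):
            problems.append(f"{record}: rank prefixes missing or out of order")
            continue
        for parent, child in zip(names, names[1:]):
            if len(child) == 3:  # unnamed rank such as 's__': stop descending
                break
            seen = parent_of.setdefault(child, parent)
            if seen != parent:
                problems.append(f"{record}: {child} has parents {seen} and {parent}")
    return problems


# Hypothetical records: rec2 places c__ClassA under a second phylum,
# and rec3 is missing ranks
lineages = {
    "rec1": "d__Bacteria; p__PhylumA; c__ClassA; o__OrderA; f__FamilyA; g__GenusA; s__GenusA one",
    "rec2": "d__Bacteria; p__PhylumB; c__ClassA; o__OrderA; f__FamilyA; g__GenusA; s__GenusA two",
    "rec3": "d__Bacteria; p__PhylumA; c__ClassA",
}
problems = check_taxonomy(lineages)
for problem in problems:
    print(problem)
```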
Taxonomy decoration
The original tax2tree algorithm was not well suited for a large volume
of species-level records in the backbone, as the algorithm requires an
internal node to place a name. If two species are siblings, the tree would
lack a node to contain the species label for both taxa. To account for
this, we updated the algorithm to insert ‘placeholder’ nodes with zero
branch length as the parents of backbone records, which could accept
these species labels. We further updated tax2tree to operate directly on
.jplace data (ref. 29), preserving edge numbering of the original edges before
adding ‘placeholder’ nodes. To support LTP records that could not be
integrated into GTDB, we instrumented a secondary taxonomy mode
for tax2tree. Specifically, following the standard decoration, backfilling
and name promotion procedures, we determine on a per-record basis
for the secondary taxonomy what portion of the lineage is missing
and place the missing labels on the placeholder node. We then issue a
second round of name promotion using the existing tax2tree methods.
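The placeholder step can be sketched on a minimal tree structure. This is not the tax2tree implementation; the `Node` class and the function `add_placeholders` are hypothetical and only illustrate how a zero-length parent gives each backbone record an internal node that can carry a species label without changing any root-to-tip distance.

```python
class Node:
    """Minimal rooted tree node (illustrative only)."""

    def __init__(self, name=None, length=0.0, children=None):
        self.name = name
        self.length = length          # length of the branch above this node
        self.children = children or []
        self.parent = None
        for child in self.children:
            child.parent = self

    def tips(self):
        if not self.children:
            return [self]
        return [tip for child in self.children for tip in child.tips()]


def add_placeholders(root):
    """Insert a zero-length placeholder node as the parent of every tip."""
    for tip in root.tips():           # the tip list is built before the tree is edited
        old_parent = tip.parent
        if old_parent is None:        # single-node tree: nothing to insert
            continue
        holder = Node(length=0.0)
        index = old_parent.children.index(tip)
        old_parent.children[index] = holder
        holder.parent = old_parent
        holder.children = [tip]
        tip.parent = holder
    return root


# Two sibling species: without placeholders, only their shared parent
# is available to hold a label
root = Node("genus_node", children=[Node("sp1", 0.1), Node("sp2", 0.2)])
add_placeholders(root)
root.children[0].name = "s__Species one"   # each tip now has its own labelable node
root.children[1].name = "s__Species two"
print([(holder.name, holder.length, holder.children[0].name)
       for holder in root.children])
```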
The actual taxonomy decoration occurs on the backbone tree,
which contains only full-length 16S records and does not contain
ASVs. This is done as ASV placements are independent, do not modify
the backbone and would substantially increase the computational
resources required. After the backbone is decorated, fragment place-
ments from DEPP are resolved using a multifurcation strategy using
the balanced-parentheses library (ref. 30).
Phylogenetic collapse for visualization
We are unaware of phylogenetic visualization software that can display
a tree with over 20,000,000 tips. To produce the visualizations in Fig.
1, we reduced the dimension of the tree by collapsing fragment multi-
furcations to single nodes, dropping the tree to 522,849 tips.
MAG target environments
A feature table for the 27,015 16S rRNA V4 90-nucleotide EMP samples
was obtained from redbiom. The ASVs were filtered to the overlap of
ASVs present in Greengenes2. Any feature with <1% relative abundance
within a sample was removed. The feature table was then rarefied to
1,000 sequences per sample. The amount of novel branch length was
then computed per sample by summing the branch length of each ASV’s
placement edge. The per-sample branch length was then normalized
by the total tree branch length (excluding length contributed by ASVs).
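The per-sample calculation above can be sketched as follows. The function name, the input dictionaries and the example values are hypothetical; in particular, `asv_branch_length` stands in for whatever lookup provides the branch length of each ASV's placement edge, and the fixed random seed is only there to make the illustration repeatable.

```python
import random


def novel_branch_length(counts, asv_branch_length, backbone_length,
                        min_rel_abund=0.01, depth=1000, seed=0):
    """Normalized novel branch length for one sample, or None if too shallow.

    counts:            {asv: count} for the sample
    asv_branch_length: {asv: branch length of its placement edge} (ASVs in the tree)
    backbone_length:   total tree branch length excluding ASV contributions
    """
    # 1. Keep only ASVs present in the tree
    counts = {a: c for a, c in counts.items() if a in asv_branch_length}
    total = sum(counts.values())
    if total == 0:
        return None
    # 2. Remove features below 1% relative abundance within the sample
    counts = {a: c for a, c in counts.items() if c / total >= min_rel_abund}
    # 3. Rarefy (without replacement) to a fixed depth
    pool = [a for a, c in counts.items() for _ in range(c)]
    if len(pool) < depth:
        return None
    observed = set(random.Random(seed).sample(pool, depth))
    # 4. Sum placement-edge lengths and normalize by the backbone length
    return sum(asv_branch_length[a] for a in observed) / backbone_length


# Hypothetical sample: asvC falls below 1% and asvX is not in the tree
sample = {"asvA": 900, "asvB": 595, "asvC": 5, "asvX": 50}
edge_lengths = {"asvA": 0.02, "asvB": 0.03, "asvC": 0.5}
result = novel_branch_length(sample, edge_lengths, backbone_length=10.0)
print(result)
```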
Per-sample taxonomy correlations
All comparisons used THDMI (ref. 14) 16S and Woltka processed shotgun
data. These data were accessed from Qiita study 10317 and filtered to
the set of features that overlap with Greengenes2 using the QIIME 2
(ref. 31) q2-greengenes2 plugin. The 16S taxonomy was assessed using
either a traditional naive Bayes classifier with q2-feature-classifier
and default references from QIIME 2 2022.2 or by reading the lineage
directly from the phylogeny. To help improve correlations between
SILVA and Greengenes2 and between Greengenes and Greengenes2,
we stripped polyphyletic labelings from those data; we did not strip
polyphyletic labels from the phylogenetic taxonomy comparison or
the Greengenes2 16S versus Greengenes2 whole-genome shotgun
(WGS) naive Bayes comparison. Shotgun taxonomy was determined
by the specific observed genome records. Once the 16S taxonomy was
assigned, those tables and the WGS Woltka WoL2 table were collapsed
at the species, genus, family, order and class levels. We then computed
a minimum relative abundance per sample in the dataset from THDMI.
In each sample, we removed any feature, either 16S or WGS, below the
per-sample minimum (that is, max(min(16S), min(WGS))), forming a
common minimal basis for taxonomy comparison. Following filtering,
Pearson correlation was computed per sample using SciPy (ref. 32). These
correlations were aggregated per 16S taxonomy assignment method
and by each taxonomic rank. The 25th, 50th and 75th percentiles were
then plotted with Matplotlib (ref. 33).
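The common minimal basis and the per-sample correlation can be sketched for a single sample as follows. The analysis above used SciPy; this illustration uses only the standard library. The function names and the example profiles are hypothetical, and two details are assumptions of this sketch: taxa missing from one profile are treated as zero, and profiles are not renormalized after filtering.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def sample_correlation(profile_16s, profile_wgs):
    """Correlate two {taxon: relative abundance} profiles of one sample.

    Features below max(min(16S), min(WGS)) are removed from both profiles
    so that the comparison uses a common minimal basis.
    """
    floor = max(min(profile_16s.values()), min(profile_wgs.values()))
    kept_16s = {t: v for t, v in profile_16s.items() if v >= floor}
    kept_wgs = {t: v for t, v in profile_wgs.items() if v >= floor}
    taxa = sorted(set(kept_16s) | set(kept_wgs))
    return pearson([kept_16s.get(t, 0.0) for t in taxa],
                   [kept_wgs.get(t, 0.0) for t in taxa])


# Hypothetical genus-level profiles; shotgun data resolve rarer taxa
# than the 16S data do
profile_16s = {"g__A": 0.60, "g__B": 0.30, "g__C": 0.10}
profile_wgs = {"g__A": 0.55, "g__B": 0.35, "g__C": 0.099, "g__D": 0.001}
r = sample_correlation(profile_16s, profile_wgs)
print(round(r, 3))
```

Repeating this per sample and per rank, and then taking the 25th, 50th and 75th percentiles of the resulting coefficients, gives the kind of summary plotted in Fig. 2a–c.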
Principal coordinates
THDMI Deblur 16S and Woltka processed shotgun sequencing data,
against WoL2, were obtained from Qiita study 10317. Both feature tables
were filtered against Greengenes2 2022.10, removing any feature not
present in the tree. For the genus collapsed plot (Fig. 1e), both the 16S
and WGS data features were collapsed using the same taxonomy. For
all three figures, the 16S data were subsampled, with replacement,
to 10,000 sequences per sample. The WGS data were subsampled,
with replacement, to 1,000,000 sequences per sample. Bray–Curtis,
weighted UniFrac and principal coordinates analysis were computed
using q2-diversity 2022.2. The resulting coordinates were visualized
with q2-emperor (ref. 34).
The EMP ‘EMP500’ 16S and Woltka processed shotgun sequencing
data, against WoL2, were obtained from Qiita study 13114. Both feature
tables were filtered against Greengenes2 2022.10. The 16S data were
subsampled, with replacement, to 1,000 sequences per sample. The
WGS data were subsampled, with replacement, to 50,000 sequences
per sample. The sequencing depth for WGS data was selected based
on Supplementary Fig. 6 of Shaffer et al.9, which noted low levels of
read recruitment to publicly available whole genomes. Bray–Curtis,
weighted UniFrac and principal coordinates analysis were computed
using q2-diversity 2022.2. The resulting coordinates were visualized
with q2-emperor.
Effect size calculations
Similar to principal coordinates, data from THDMI were rarefied to
9,000 and 2,000,000 sequences per sample for 16S and WGS, respec-
tively. Bray–Curtis and weighted normalized UniFrac were computed
on both sets of data. The variables for THDMI were subset to those with
at least two category values having more than 50 samples. For UniFrac
with SILVA (Supplementary Fig. 1e), we performed fragment insertion
using q2-fragment-insertion35 into the standard QIIME 2 SILVA reference,
followed by rarefaction to 9,000 sequences per sample, and then
computed weighted normalized UniFrac.
For FINRISK, the data were rarefied to 1,000 and 500,000
sequences per sample for 16S and WGS, respectively. A different
depth was used to account for the overall lower amount of sequenc-
ing data for FINRISK. As with THDMI, the variables selected were
reduced to those with at least two category values having more than
50 samples.
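The variable selection rule used for both THDMI and FINRISK can be sketched with pandas. This is an illustrative reading of the rule, not the authors' code: the function name `eligible_variables`, the metadata layout (one row per sample, one column per variable) and the toy data are assumptions.

```python
import pandas as pd


def eligible_variables(metadata, min_samples=50):
    """Columns with at least two category values seen in > min_samples samples."""
    keep = []
    for column in metadata.columns:
        # Number of samples per category value, ignoring missing entries.
        sizes = metadata[column].value_counts(dropna=True)
        if (sizes > min_samples).sum() >= 2:
            keep.append(column)
    return keep


# Toy metadata for 120 samples (hypothetical variables and values).
metadata = pd.DataFrame({
    "sex": ["f"] * 60 + ["m"] * 60,                # two groups of 60: kept
    "diet": ["omni"] * 100 + ["vegan"] * 20,       # only one group > 50: dropped
    "site": ["a"] * 51 + ["b"] * 51 + ["c"] * 18,  # two groups of 51: kept
})
selected = eligible_variables(metadata)
```

The threshold is strict, so a category value observed in exactly 50 samples would not count toward the two required groups.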
Support for computing paired effect sizes is part of the QIIME 2
Greengenes2 plugin q2-greengenes2, which performs effect size cal-
culations using Evident36.
Reporting summary
Further information on research design is available in the Nature Port-
folio Reporting Summary linked to this article.
Data availability
The official location of the Greengenes2 releases is
http://ftp.microbio.me/greengenes_release/. The data are released under a BSD-3 clause
license. Data from THDMI are part of Qiita study 10317 and European
Bioinformatics Institute accession number PRJEB11419. The FINRISK
data, including the data presented in Supplementary Fig. 1c–g, are
protected; details on data access are available in the European Genome–
Phenome Archive under accession number EGAD00001007035. The
data presented in Supplementary Fig. 1a,b are not compatible with
Excel. The EMP data are part of Qiita study 13114 and European Bio-
informatics Institute accession number ERP125879. Source data are
provided with this paper.
Code availability
A QIIME 2 plugin that facilitates use of the resource is available from
ref. 37 (version 2023.3; https://doi.org/10.5281/zenodo.7758134).
Taxonomy construction, decoration and release processing are part of
ref. 38 (version 2023.3; https://doi.org/10.5281/zenodo.7758138).
uDance is available at GitHub39 (version v1.1.0;
https://doi.org/10.5281/zenodo.7758289). Phylogeny insertion using
DEPP is available at ref. 40 (version 0.3;
https://doi.org/10.5281/zenodo.7768798). The trained model can be
accessed via Zenodo at https://doi.org/10.5281/zenodo.7416684. Code
used for the figures in this manuscript is available in ref. 41. Finally,
an interactive website to explore the Greengenes2 data is available at
https://greengenes2.ucsd.edu.
References
25. Nguyen, N.-P. D., Mirarab, S., Kumar, K. & Warnow, T. Ultra-large
alignments using phylogeny-aware profiles. Genome Biol. 16,
124 (2015).
26. Minh, B. Q. et al. IQ-TREE 2: new models and efficient methods
for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37,
1530–1534 (2020).
27. McDonald, D. et al. redbiom: a rapid sample discovery and feature
characterization system. mSystems 4, e00215-19 (2019).
28. Balaban, M., Jiang, Y., Roush, D., Zhu, Q. & Mirarab, S. Fast and
accurate distance-based phylogenetic placement using divide
and conquer. Mol. Ecol. Resour. 22, 1213–1227 (2022).
29. Matsen, F. A., Hoffman, N. G., Gallagher, A. & Stamatakis, A.
A format for phylogenetic placements. PLoS ONE 7, e31009
(2012).
30. McDonald, D. Improved-octo-waddle. GitHub
https://github.com/biocore/improved-octo-waddle/ (2023).
31. Bolyen, E. et al. Reproducible, interactive, scalable and extensible
microbiome data science using QIIME 2. Nat. Biotechnol. 37,
852–857 (2019).
32. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific
computing in Python. Nat. Methods 17, 261–272 (2020).
33. Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci.
Eng. 9, 90–95 (2007).
34. Vázquez-Baeza, Y., Pirrung, M., Gonzalez, A. & Knight, R. EMPeror:
a tool for visualizing high-throughput microbial community data.
Gigascience 2, 16 (2013).
35. Janssen, S. et al. Phylogenetic placement of exact amplicon
sequences improves associations with clinical information.
mSystems 3, e00021-18 (2018).
36. Rahman, G. et al. Determination of effect sizes for power analysis
of microbiome studies using large microbiome datasets. Genes
https://doi.org/10.3390/genes14061239 (2023).
37. McDonald, D. q2-greengenes2. GitHub
https://github.com/biocore/q2-greengenes2/ (2023).
38. McDonald, D. greengenes2. GitHub
https://github.com/biocore/greengenes2 (2023).
39. Balaban, M. uDance. GitHub
https://github.com/balabanmetin/uDance (2023).
40. Jiang, Y. DEPP. GitHub https://github.com/yueyujiang/DEPP (2023).
41. McDonald, D. Greengenes2 analyses. GitHub
https://github.com/knightlab-analyses/greengenes2 (2023).
Acknowledgements
This work was supported, in part, by NSF XSEDE BIO210103 (Q.Z.), NSF
RAPID 20385.09 (R.K.), NIH 1R35GM14272 (S.M.), NIH U19AG063744
(R.K.), NIH U24DK131617 (R.K.), NIH DP1-AT010885 (R.K.) and Emerald
Foundation 3022 (R.K.). J.T.M. was funded by the intramural research
program of the Eunice Kennedy Shriver National Institute of Child
Health and Human Development. The dataset from THDMI was
generated through support from Danone Nutricia Research and the
Center for Microbiome Innovation. This work used Expanse at the San
Diego Supercomputing Center through allocation ASC150046 from
the Advanced Cyberinfrastructure Coordination Ecosystem: Services
& Support (ACCESS) program, which is supported by NSF grants
2138259, 2138286, 2138307, 2137603 and 2138296.
Author contributions
D.M. and R.K. conceived, initiated and coordinated the project
and performed analyses. D.M. and A.G. wrote infrastructure and
analysis code. Y.J., M.B., Q.Z. and S.M. coordinated phylogenetic
placements and reconstruction. K.C. wrote visualization code.
G.N. and J.T.M. performed analyses. S.M.K. and M.A. generated
16S rRNA operons. D.H.P., P.H. and T.D. provided guidance on the
genome taxonomy. S.J.S., A.B., A.S.H., P.J., S.C., M.I., T.N., M.J., V.S.
and L.L. provided data used for analysis. All authors reviewed and
edited the manuscript.
Competing interests
R.K. is a scientiic advisory board member, and consultant
for BiomeSense, Inc., has equity and receives income. The
terms of this arrangement have been reviewed and approved
by the University of California, San Diego in accordance with its
conlict of interest policies. The remaining authors declare no
competing interests.
Additional information
Supplementary information The online version
contains supplementary material available at
https://doi.org/10.1038/s41587-023-01845-1.
Correspondence and requests for materials should be addressed to
Rob Knight.
Peer review information Nature Biotechnology thanks Robin Rohwer,
C. Titus Brown and the other, anonymous, reviewer(s) for their
contribution to the peer review of this work.
Reprints and permissions information is available at
www.nature.com/reprints.