A comparison of single-coverage and multi-coverage metagenomic binning reveals extensive hidden contamination

Abstract and Figures

Metagenomic binning has revolutionized the study of uncultured microorganisms. Here we compare single- and multi-coverage binning on the same set of samples, and demonstrate that multi-coverage binning produces better results than single-coverage binning and identifies contaminant contigs and chimeric bins that other approaches miss. While resource expensive, multi-coverage binning is a superior approach and should always be performed over single-coverage binning.
This content is subject to copyright. Terms and conditions apply.
Nature Methods | Volume 20 | August 2023 | 1170–1173
Brief Communication
https://doi.org/10.1038/s41592-023-01934-8
Jennifer Mattock1 & Mick Watson2,3
Metagenomic binning, the resolution of metagenomic sequence data
into individual genomes, has been used to identify hundreds of thousands
of genomes from microbiome samples1–6. These studies are enabled
by software that groups together assembled contigs based on the
assumption that contigs with similar sequence content and coverage
profiles across multiple samples probably originate from the same
genome7,8. However, calculating coverage from multiple samples represents
a problem for large sample sizes, requiring an all-against-all comparison.
It has therefore become routine for single-coverage binning
to be performed for large datasets. Previous research has described
multi-coverage binning in the context of co-assembly, finding that at
least five samples are required for it to be worthwhile3; increasing the
number of samples when performing multi-coverage binning decreased
the contamination and increased the completeness of bins7,9. However,
co-assembly is suboptimal as it allows the reconstruction of only one
bin per species3.
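The cost of the all-against-all comparison can be illustrated with a short count. This is a minimal sketch that assumes each sample is assembled individually and that multi-coverage binning aligns the reads of every sample to every per-sample assembly; the function name `mapping_jobs` is hypothetical and is not taken from the paper.

```python
def mapping_jobs(n_samples: int) -> dict:
    """Count read-to-assembly alignments under each strategy (illustrative assumption)."""
    return {
        # Each sample's reads are aligned only to that sample's own assembly.
        "single_coverage": n_samples,
        # Every sample's reads are aligned to every sample's assembly.
        "multi_coverage": n_samples * n_samples,
    }


jobs = mapping_jobs(42)
print(jobs)
```

Under these assumptions the number of alignments grows quadratically with the number of samples, which is why the multi-coverage approach becomes resource expensive for large datasets.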
In this Brief Communication, we compare single- and multi-coverage
binning on the same dataset, to quantify the effect of the loss of
coverage information on the quantity and quality of bins produced.
We hypothesize that single-coverage binning will frequently bin
together contigs that are co-abundant only in a single sample (Fig. 1a),
that these errors represent invisible contamination and that they can
be detected by using multi-coverage data.
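The hypothesis can be illustrated with a toy example. The sketch below uses invented depth values and Pearson correlation as a stand-in similarity measure; real binners use their own distance models, so the contig names, numbers and helper functions here are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length coverage profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical mean read depth of three contigs across five samples.
coverage = {
    "contig_A": [20.0, 5.0, 40.0, 10.0, 30.0],
    "contig_B": [21.0, 5.5, 39.0, 11.0, 29.0],  # tracks contig_A in every sample
    "contig_C": [20.5, 35.0, 4.0, 28.0, 6.0],   # resembles contig_A only in sample 0
}


def single_sample_gap(a, b, sample=0):
    """Depth difference seen when only one sample's coverage is available."""
    return abs(coverage[a][sample] - coverage[b][sample])


gap_ab = single_sample_gap("contig_A", "contig_B")
gap_ac = single_sample_gap("contig_A", "contig_C")
r_ab = pearson(coverage["contig_A"], coverage["contig_B"])
r_ac = pearson(coverage["contig_A"], coverage["contig_C"])
print(gap_ab, gap_ac, round(r_ab, 3), round(r_ac, 3))
```

In sample 0 alone, contig_C looks at least as close to contig_A as contig_B does, so a single-coverage approach has no abundance signal with which to separate them. Across all five samples the profile of contig_C is negatively correlated with that of contig_A, which is the kind of signal multi-coverage data can expose.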
Forty-two rumen microbiome samples were assembled and
binned using two strategies, single-coverage and multi-coverage binning.
All other parameters remained the same. The completeness and
contamination results for all bins produced by both methods are shown
in Fig. 1b. Minimal difference is observed between the distribution
of completeness scores in the single- and multi-coverage bins; however,
the single-coverage bins have increased contamination: 22.5%
(1,273/5,658) of the single-coverage bins have a contamination score
of 5 or greater versus 3.5% (293/8,420) of the multi-coverage bins. This
suggests that more contigs classed as contaminant DNA are incorporated
using the single-coverage approach.
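The reported proportions follow directly from the bin counts. As a quick arithmetic check (the helper name `percent` is only illustrative):

```python
def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


single_contaminated = percent(1273, 5658)  # single-coverage bins with contamination >= 5
multi_contaminated = percent(293, 8420)    # multi-coverage bins with contamination >= 5
print(single_contaminated, multi_contaminated)
```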
The single-coverage approach produced a total of 5,658 bins across
the 42 samples, whereas the multi-coverage approach produced 8,420
(Fig. 1c). A filtered set of bins was produced using completeness and
contamination cutoffs that have previously been used in ruminants6,10–12
(completeness ≥80% and contamination ≤10%). Using these cutoffs,
the single-coverage approach produced 931 filtered bins, compared
to 1,660 produced by the multi-coverage approach, an increase of 78%.
This suggests that the multi-coverage approach results in more bins of
higher quality. The filtered bins were used for all downstream analysis.
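The filtering step and the reported increase can be sketched as follows. The `Bin` record, the example bins and the function name are hypothetical; only the two cutoffs and the filtered bin counts (931 and 1,660) come from the text.

```python
from dataclasses import dataclass


@dataclass
class Bin:
    name: str
    completeness: float   # percent
    contamination: float  # percent


def passes_filter(b: Bin, min_completeness: float = 80.0,
                  max_contamination: float = 10.0) -> bool:
    """Apply the cutoffs quoted above: completeness >= 80% and contamination <= 10%."""
    return b.completeness >= min_completeness and b.contamination <= max_contamination


# Invented example bins, including two that sit exactly on the cutoffs.
bins = [
    Bin("bin_1", 95.2, 1.3),
    Bin("bin_2", 80.0, 10.0),
    Bin("bin_3", 79.9, 0.5),
    Bin("bin_4", 99.0, 12.4),
]
kept = [b.name for b in bins if passes_filter(b)]

# Relative increase in filtered bins, using the counts reported in the text.
increase = round(100 * (1660 - 931) / 931)
print(kept, increase)
```

Both cutoffs are inclusive, so a bin at exactly 80% completeness and 10% contamination is retained.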
The taxonomies produced by either binning method were compared.
Variation was observed in the proportion of bins belonging to
each taxon at each rank. A greater proportion of the multi-coverage bins
were archaea (4.3%) than of the single-coverage bins (3.1%). In both
approaches the predominant phylum was Bacteroidota, with a slight
variation in the Firmicutes/Bacteroidota ratio (1.28 versus 1.05).
One phylum, Patescibacteria; two classes, Endomicrobia and
Saccharimonadia; three orders; nine families; 35 genera; and 96 species
were found exclusively in the multi-coverage bins. Just two genera and
11 species were found exclusively in the single-coverage bins. This
suggests that single-coverage
binning may overlook taxa that can be recovered using multi-coverage
binning, perhaps due to the increased coverage data available with
Received: 22 February 2022
Accepted: 28 May 2023
Published online: 29 June 2023
1The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh, UK. 2Centre for Digital Innovation,
DSM Biotechnology Center, Delft, The Netherlands. 3Scotland’s Rural College, Peter Wilson Building, King’s Buildings, Edinburgh, UK.
e-mail: mick.watson@dsm.com
... Following the assembly, we performed multi-coverage binning (i.e., when clustering contigs of a sample into bins, the read coverage of these contigs across all samples was also considered) [57] on the identified viral vOTUs from each sample. We used CONCOCT [38], MetaBAT2 [39], AVAMB [40], and vRhyme [41] with default parameters to generate bins ( Fig. 1). ...
... Thirdly, we identified viral contigs using a customized bioinformatics pipeline and clustered them into the nonredundant species-level viral contigs referred to as vOTUs (" Methods"). Fourthly, for the viral contigs generated by each assembler, we used CONCOCT [38], MetaBAT2 [39], AVAMB [40], and vRhyme [41] for multi-coverage binning [57] (" Methods"). Then, we conducted a systematic evaluation of the tools at the assembly level and the binning level. ...
Article
Background: Metagenome-assembled viral genomes have significantly advanced the discovery and characterization of the human gut virome. However, we lack a comparative assessment of assembly tools on the efficacy of viral genome identification, particularly across next-generation sequencing (NGS) and third-generation sequencing (TGS) data. Results: We evaluated the efficiency of NGS, TGS, and hybrid assemblers for viral genome discovery using 95 viral-like particle (VLP)-enriched fecal samples sequenced on both Illumina and PacBio platforms. MEGAHIT, metaFlye, and hybridSPAdes emerged as the optimal choices for NGS, TGS, and hybrid datasets, respectively. Notably, these assemblers recovered distinct viral genomes, demonstrating a remarkable degree of complementarity. By combining individual assembler results, we expanded the total number of nonredundant high-quality viral genomes by 4.83 ~ 21.7-fold compared to individual assemblers. Among them, viral genomes from NGS and TGS data have the least overlap, indicating the impact of data type on viral genome recovery. We also evaluated four binning methods, finding that CONCOCT incorporated more unrelated contigs into the same bins, while MetaBAT2, AVAMB, and vRhyme balanced inclusiveness and taxonomic consistency within bins. Conclusions: Our findings highlight the challenges in metagenome-driven viral discovery, underscoring tool limitations. We advocate for combined use of multiple assemblers and sequencing technologies when feasible and highlight the urgent need for specialized tools tailored to gut virome assembly. This study contributes essential insights for advancing viral genome research in the context of gut metagenomics.
... Following the assembly, we performed multi-coverage binning (i.e., when clustering contigs of a sample into bins, the read coverage of these contigs across all samples were also considered) [53] on the identified viral vOTUs from each sample. We used CONCOCT (v1.1.0) ...
... Fourthly, for the viral contigs generated by each assembler, we used CONCOCT (v1.1.0) [54], MetaBAT2 [55], AVAMB [56], and vRhyme [57] for multi-coverage binning (i.e., when clustering contigs of a sample into bins, the read coverage of these contigs across all samples were also considered) [53]. Then, we conducted a systematic evaluation of the tools at the assembly level and the binning level. ...
Preprint
Background Metagenome-assembled viral genomes have significantly advanced the discovery and characterization of the human gut virome. However, we lack a comparative assessment of assembly tools on the efficacy of viral genome identification, particularly across Next Generation Sequencing (NGS) and Third Generation Sequencing (TGS) data. Results We evaluated the efficiency of NGS, TGS and hybrid assemblers for viral genome discovery using 95 viral-like particle (VLP) enriched fecal samples sequenced on both Illumina and PacBio platforms. MEGAHIT, metaFlye and hybridSPAdes emerged as the optimal choices for NGS, TGS and hybrid datasets, respectively. Notably, these assemblers produced distinctive viral genomes, demonstrating a remarkable degree of complementarity. By combining individual assembler results, we expanded the total number of non-redundant high-quality viral genomes by 4.83 ~ 21.7 fold compared to individual assemblers. Among them, viral genomes from NGS and TGS data have the least overlap, indicating the impact of data type on viral genome recovery. We also evaluated four binning methods, finding that CONCOCT incorporated more unrelated contigs into the same bins, while MetaBAT2, AVAMB and vRhyme balanced inclusiveness and taxonomic consistency within bins. Conclusions Our findings highlight the challenges in metagenome-driven viral discovery, underscoring tool limitations. We advocate for combined use of multiple assemblers and sequencing technologies when feasible and highlight the urgent need for specialized tools tailored to gut virome assembly. This study contributes essential insights for advancing viral genome research in the context of gut metagenomics.
... Some studies [12][13][14] claim that the quality of these MAGs is comparable to genomes from microbial isolates. However, there is growing concern that contamination may severely affect the qualities of MAGs 15 . MAG contamination refers to the presence of contigs from different microbes in the same MAG, resulting in chimeric MAGs that compromise the reliability of downstream ecological and evolutionary analyses. ...
... Therefore, contig binning is necessary to group contigs with similar sequence characteristics and abundances to represent microbial genomes. A recent study 15 highlighted that MAG contamination is a challenge during contig binning in metagenome assemblies. Tools such as MAGpurify and MDMcleaner have been developed to address this issue by removing contaminated contigs from MAGs. ...
Article
Metagenome-assembled genomes (MAGs) offer valuable insights into the exploration of microbial dark matter using metagenomic sequencing data. However, there is growing concern that contamination in MAGs may substantially affect the results of downstream analysis. Current MAG decontamination tools primarily rely on marker genes and do not fully use the contextual information of genomic sequences. To overcome this limitation, we introduce Deepurify for MAG decontamination. Deepurify uses a multi-modal deep language model with contrastive learning to match microbial genomic sequences with their taxonomic lineages. It allocates contigs within a MAG to a MAG-separated tree and applies a tree traversal algorithm to partition MAGs into sub-MAGs, with the goal of maximizing the number of high- and medium-quality sub-MAGs. Here we show that Deepurify outperformed MDMcleaner and MAGpurify on simulated data, CAMI datasets and real-world datasets with varying complexities. Deepurify increased the number of high-quality MAGs by 20.0% in soil, 45.1% in ocean, 45.5% in plants, 33.8% in freshwater and 28.5% in human faecal metagenomic sequencing datasets.
... Recent studies have shown that there are significant differences in the intestinal microbiotas of local Chinese pig breeds and commercialized breeds 13 . Metagenomic binning approaches were applied in oral microbiome research and superior approach such as multi-coverage binning strategy was also applied to retrieve more comprehensive binning results 14,15 . In addition to updating data analysis pipelines, the microfluidics-based mini-metagenomics strategy could also be a powerful tool for dissecting microbial community structure in complex habitats 16 ...
Article
Compared with leaner breeds, local Chinese pig breeds have distinct intestinal microbiota, as determined by metagenomic techniques, and the interactions between oral microorganisms and their hosts are also gradually being clarified. However, the high host genome content means that few metagenome-based oral microbiomes have been reported. Here, we combined dilution-based metagenomic sequencing and binning approaches to extract the microbial genomes from the oral microbiomes of Tibetan and Duroc pigs. The host contamination rates were reduced to 13.64%, a quarter of the normal metagenomic level (65.25% on average). Medium–high-quality metagenome-assembled genomes (MAGs; n = 3,448) spanning nine phyla were retrieved and 70.79% were novel species. Of the nonredundant MAGs, only 13.37% were shared, revealing the strong disparities between Tibetan and Duroc pigs. The oral microbial diversity of the Duroc pig was greater than that of the Tibetan pig. We present the first large-scale dilute-based metagenomic data on the pig oral microbiome, which should facilitate further investigation of the functions of oral microorganisms in pigs.
... Though the co-assembly approach has some advantages, there are challenges in the implementation, such as the computational burden. 38 Because of the large datasets (three of the six datasets contain over 200 samples), we used the individually assembling BGCs analysis (MSSA-BGCs) method to relieve the computational burden. Another advantage of this approach is to minimize the mixing of data from closely related strains (from the same species) and, therefore, potentially result in more completely assembled genes. ...
Article
Gut microbiome plays a pivotal role in combating diseases and facilitating healthy aging, and natural products derived from biosynthetic gene clusters (BGCs) of the human microbiome exhibit significant biological activities. However, the natural products of the gut microbiome in long-lived populations remain poorly understood. Here, we integrated six cohorts of long-lived populations, encompassing a total of 1029 fecal metagenomic samples, and employed the metagenomic single sample assembled BGCs (MSSA-BGCs) analysis pipeline to investigate the natural products and their associated species. Our findings reveal that the BGC composition of the extremely long-lived group differed significantly from that of younger elderly and young individuals across five cohorts. Terpene and Type I PKS BGCs were enriched in the extremely long-lived, whereas cyclic-lactone-autoinducer BGCs were more prevalent in the young. Association analysis indicated that terpene BGCs were strongly associated with the abundance of Akkermansia muciniphila, which was also more abundant in the long-lived elderly across at least three cohorts. We assembled 18 A. muciniphila draft genomes using metagenomic data from the extremely long-lived group across six cohorts and discovered that they all harbor two classes of terpene BGCs, which aligns with the 97 complete genomes of A. muciniphila strains retrieved from the NCBI database. The core domains of these two BGC classes are squalene/phytoene synthases involved in the biosynthesis of tri- and tetraterpenes. Furthermore, the abundance of fecal A. muciniphila was significantly associated with eight types of triterpenoids. Targeted terpenoid metabolomic analysis revealed that two triterpenoids, Holstinone C and colubrinic acid, were enriched in the A. muciniphila culture solution compared to the medium, thereby confirming the production of triterpenoids by A. muciniphila. 
The natural products derived from the gut of long-lived populations provide intriguing indications of their potential beneficial roles in regulating health.
... Furthermore, MAGFlow/BIgMAG can be useful during the execution of studies aimed to detect the presence of MAG hidden contamination after the binning step. For instance, our tool could automatize part of the work during the analysis of the effects of multi-coverage metagenomic binning (Mattock & Watson, 2023), since some of the tools used in such study to measure the MAG quality and perform taxonomical annotation are also enclosed by MAGFlow/BIgMAG. Also, MAGFlow/BIgMAG provides a convenient support during exploratory analyses that involve establishing general differences across samples as we showed with the examples presented in this paper. ...
Article
Background Building Metagenome–Assembled Genomes (MAGs) from highly complex metagenomics datasets encompasses a series of steps covering from cleaning the sequences, assembling them to finally group them into bins. Along the process, multiple tools aimed to assess the quality and integrity of each MAG are implemented. Nonetheless, even when incorporated within end–to–end pipelines, the outputs of these pieces of software must be visualized and analyzed manually lacking integration in a complete framework. Methods We developed a Nextflow pipeline (MAGFlow) for estimating the quality of MAGs through a wide variety of approaches (BUSCO, CheckM2, GUNC and QUAST), as well as for annotating taxonomically the metagenomes using GTDB-Tk2. MAGFlow is coupled to a Python–Dash application (BIgMAG) that displays the concatenated outcomes from the tools included by MAGFlow, highlighting the most important metrics in a single interactive environment along with a comparison/clustering of the input data. Results By using MAGFlow/BIgMAG, the user will be able to benchmark the MAGs obtained through different workflows or establish the quality of the MAGs belonging to different samples following the divide and rule methodology. Conclusions MAGFlow/BIgMAG represents a unique tool that integrates state-of-the-art tools to study different quality metrics and extract visually as much information as possible from a wide range of genome features.
Preprint
Genomes are fundamental to understanding microbial ecology and evolution. The emergence of high-throughput, long-read DNA sequencing has enabled recovery of microbial genomes from environmental samples at scale. However, expanding the microbial genome catalogue of soils and sediments has been challenging due to the enormous complexity of these environments. Here, we performed deep, long-read Nanopore sequencing of 154 soil and sediment samples collected across Denmark and through an optimised bioinformatics pipeline, we recovered genomes of 15,314 novel microbial species, including 4,757 high-quality genomes. The recovered microbial genomes span 1,086 novel genera and provide the first high-quality reference genomes for 612 previously known genera, expanding the phylogenetic diversity of the prokaryotic tree of life by 8 %. The long-read assemblies also enabled the recovery of thousands of complete rRNA operons, biosynthetic gene clusters and CRISPR-Cas systems, all of which were underrepresented and highly fragmented in previous terrestrial genome catalogues. Furthermore, the incorporation of the recovered MAGs into public genome databases significantly improved species-level classification rates for soil and sediment metagenomic datasets, thereby enhancing terrestrial microbiome characterization. With this study, we demonstrate that long-read sequencing and optimised bioinformatics, allows cost-effective recovery of high-quality microbial genomes from highly complex ecosystems, which remain the largest untapped source of biodiversity for expanding genome databases and filling in the gaps of the tree of life.
Article
Determining the taxonomic composition (taxonomic profiling) is a fundamental task in studying environmental and host-associated microbial communities. However, genome-resolved microbial diversity on Earth remains undersampled, and accessing the genomic context of taxa detected during taxonomic profiling remains a challenging task. Here, we present the mOTUs online database (mOTUs-db), which is consistent with and interfaces with the mOTUs taxonomic profiling tool. It comprises 2.83 million metagenome-assembled genomes (MAGs) and 919 090 single-cell and isolate genomes from 124 295 species-level taxonomic units. In addition to being one of the largest prokaryotic genome resources to date, all MAGs in the mOTUs-db were reconstructed de novo in 117 902 individual samples by abundance correlation of scaffolds across multiple samples for improved quality metrics. The database complements the Genome Taxonomy Database, with over 50% of its species-level taxonomic groups being unique. It also offers interactive querying, enabling users to explore and download genomes at various taxonomic levels. The mOTUs-db is accessible at https://motus-db.org.
Preprint
A common procedure for studying the microbiome is binning the sequenced contigs into metagenome-assembled genomes. Currently, unsupervised and self-supervised deep learning based methods using co-abundance and sequence based motifs such as tetranucleotide frequencies are state-of-the-art for metagenome binning. Taxonomic labels derived from alignment based classification have not been widely used. Here, we propose TaxVAMB, a metagenome binning tool based on semi-supervised bi-modal variational autoencoders, combining tetranucleotide frequencies and contig co-abundances with contig annotations returned by any taxonomic classifier on any taxonomic rank. TaxVAMB outperforms all other binners on CAMI2 human microbiome datasets, returning on average 40% more near-complete assemblies than the next best binner. On real long-read datasets TaxVAMB recovers on average 13% more near-complete bins and 14% more species. When used in a single-sample setup, TaxVAMB on average returns 83% more high quality bins than VAMB. TaxVAMB bins incomplete genomes drastically better than any other tool, returning 255% more high quality bins of incomplete genomes than the next best binner. Our method has immediate research and industrial applications, as well as methodological novelty which can be translated to other biological problems with semi-supervised multimodal datasets.
Article
Genomes are critical units in microbiology, yet ascertaining quality in prokaryotic genome assemblies remains a formidable challenge. We present GUNC (the Genome UNClutterer), a tool that accurately detects and quantifies genome chimerism based on the lineage homogeneity of individual contigs using a genome’s full complement of genes. GUNC complements existing approaches by targeting previously underdetected types of contamination: we conservatively estimate that 5.7% of genomes in GenBank, 5.2% in RefSeq, and 15–30% of pre-filtered “high-quality” metagenome-assembled genomes in recent studies are undetected chimeras. GUNC provides a fast and robust tool to substantially improve prokaryotic genome quality.
Article
The Interactive Tree Of Life (https://itol.embl.de) is an online tool for the display, manipulation and annotation of phylogenetic and other trees. It is freely available and open to everyone. iTOL version 5 introduces a completely new tree display engine, together with numerous new features. For example, a new dataset type has been added (MEME motifs), while annotation options have been expanded for several existing ones. Node metadata display options have been extended and now also support non-numerical categorical values, as well as multiple values per node. Direct manual annotation is now available, providing a set of basic drawing and labeling tools, allowing users to draw shapes, labels and other features by hand directly onto the trees. Support for tree and dataset scales has been extended, providing fine control over line and label styles. Unrooted tree displays can now use the equal-daylight algorithm, proving a much greater display clarity. The user account system has been streamlined and expanded with new navigation options and currently handles >1 million trees from >70 000 individual users.
Article
The rumen microbiota comprises a community of microorganisms which specialise in the degradation of complex carbohydrates from plant-based feed. These microbes play a highly important role in ruminant nutrition and could also act as sources of industrially useful enzymes. In this study, we performed a metagenomic analysis of samples taken from the ruminal contents of cow (Bos taurus), sheep (Ovis aries), reindeer (Rangifer tarandus) and red deer (Cervus elaphus). We constructed 391 metagenome-assembled genomes originating from 16 microbial phyla. We compared our genomes to other publicly available microbial genomes and found that they contained 279 novel species. We also found significant differences between the microbiota of different ruminant species in terms of the abundance of microbial taxonomies, carbohydrate-active enzyme genes and KEGG orthologs. We present a dataset of rumen-derived genomes which in combination with other publicly-available rumen genomes can be used as a reference dataset in future metagenomic studies.
Article
Background: The Boran (Bos indicus), indigenous Zebu cattle breed from sub-Saharan Africa, is remarkably well adapted to harsh tropical environments. Due to financial constraints and low-quality forage, African livestock are rarely fed at 100% maintenance energy requirements (MER) and the effect of sub-optimal restricted feeding on the rumen microbiome of African Zebu cattle remains largely unexplored. We collected 24 rumen fluid samples from six Boran cattle fed at sub-optimal and optimal MER levels and characterised their rumen microbial composition by performing shotgun metagenomics and de novo assembly of metagenome-assembled genomes (MAGs). These MAGs were used as reference database to investigate the effect of diet restriction on the composition and functional potential of the rumen microbiome of African cattle. Results: We report 1200 newly discovered MAGs from the rumen of Boran cattle. A total of 850 were dereplicated, and their uniqueness confirmed with pairwise comparisons (based on Mash distances) between African MAGs and other publicly available genomes from the rumen. A genome-centric investigation into sub-optimal diets highlighted a statistically significant effect on rumen microbial abundance profiles and a previously unobserved relationship between whole microbiome shifts in functional potential and taxon-level associations in metabolic pathways. Conclusions: This study is the first to identify 1200 high-quality African rumen-specific MAGs and provides further insight into the rumen function in harsh environments with food scarcity. The genomic information from the rumen microbiome of an indigenous African cattle breed sheds light on the microbiome contribution to rumen functionality and constitutes a vital resource in addressing food security in developing countries.
Article
Comprehensive, high-quality reference genomes are required for functional characterization and taxonomic assignment of the human gut microbiota. We present the Unified Human Gastrointestinal Genome (UHGG) collection, comprising 204,938 nonredundant genomes from 4,644 gut prokaryotes. These genomes encode >170 million protein sequences, which we collated in the Unified Human Gastrointestinal Protein (UHGP) catalog. The UHGP more than doubles the number of gut proteins in comparison to those present in the Integrated Gene Catalog. More than 70% of the UHGG species lack cultured representatives, and 40% of the UHGP lack functional annotations. Intraspecies genomic variation analyses revealed a large reservoir of accessory genes and single-nucleotide variants, many of which are specific to individual human populations. The UHGG and UHGP collections will enable studies linking genotypes to phenotypes in the human gut microbiome.
Article
The GTDB Toolkit (GTDB-Tk) provides objective taxonomic assignments for bacterial and archaeal genomes based on the Genome Taxonomy Database (GTDB). GTDB-Tk is computationally efficient and able to classify thousands of draft genomes in parallel. Here we demonstrate the accuracy of the GTDB-Tk taxonomic assignments by evaluating its performance on a phylogenetically diverse set of 10,156 bacterial and archaeal metagenome-assembled genomes. Availability: GTDB-Tk is implemented in Python and licensed under the GNU General Public License v3.0. Source code and documentation are available at: https://github.com/ecogenomics/gtdbtk. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
We previously reported on MetaBAT, an automated metagenome binning software tool to reconstruct single genomes from microbial communities for subsequent analyses of uncultivated microbial species. MetaBAT has become one of the most popular binning tools largely due to its computational efficiency and ease of use, especially in binning experiments with a large number of samples and a large assembly. MetaBAT requires users to choose parameters to fine-tune its sensitivity and specificity. If those parameters are not chosen properly, binning accuracy can suffer, especially on assemblies of poor quality. Here, we developed MetaBAT 2 to overcome this problem. MetaBAT 2 uses a new adaptive binning algorithm to eliminate manual parameter tuning. We also performed extensive software engineering optimization to increase both computational and memory efficiency. Comparing MetaBAT 2 to alternative software tools on over 100 real world metagenome assemblies shows superior accuracy and computing speed. Binning a typical metagenome assembly takes only a few minutes on a single commodity workstation. We therefore recommend the community adopts MetaBAT 2 for their metagenome binning experiments. MetaBAT 2 is open source software and available at https://bitbucket.org/berkeleylab/metabat.
Article
Ruminants provide essential nutrition for billions of people worldwide. The rumen is a specialized stomach that is adapted to the breakdown of plant-derived complex polysaccharides. The genomes of the rumen microbiota encode thousands of enzymes adapted to digestion of the plant matter that dominates the ruminant diet. We assembled 4,941 rumen microbial metagenome-assembled genomes (MAGs) using approximately 6.5 terabases of short- and long-read sequence data from 283 ruminant cattle. We present a genome-resolved metagenomics workflow that enabled assembly of bacterial and archaeal genomes that were at least 80% complete. Of note, we obtained three single-contig, whole-chromosome assemblies of rumen bacteria, two of which represent previously unknown rumen species, assembled from long-read data. Using our rumen genome collection we predicted and annotated a large set of rumen proteins. Our set of rumen MAGs increases the rate of mapping of rumen metagenomic sequencing reads from 15% to 50-70%. These genomic and protein resources will enable a better understanding of the structure and functions of the rumen microbiota.
Article
Productivity of ruminant livestock depends on the rumen microbiota, which ferment indigestible plant polysaccharides into nutrients used for growth. Understanding the functions carried out by the rumen microbiota is important for reducing greenhouse gas production by ruminants and for developing biofuels from lignocellulose. We present 410 cultured bacteria and archaea, together with their reference genomes, representing every cultivated rumen-associated archaeal and bacterial family. We evaluate polysaccharide degradation, short-chain fatty acid production and methanogenesis pathways, and assign specific taxa to functions. A total of 336 organisms were present in available rumen metagenomic data sets, and 134 were present in human gut microbiome data sets. Comparison with the human microbiome revealed rumen-specific enrichment for genes encoding de novo synthesis of vitamin B12, ongoing evolution by gene loss and potential vertical inheritance of the rumen microbiome based on underrepresentation of markers of environmental stress. We estimate that our Hungate genome resource represents ∼75% of the genus-level bacterial and archaeal taxa present in the rumen.
Article
Largely due to challenges cultivating microbes under laboratory conditions, the genome sequence of many species in the human gut microbiome remains unknown. To address this problem, we reconstructed 60,664 prokaryotic draft genomes from 3,810 faecal metagenomes from geographically and phenotypically diverse human subjects. These genomes provide reference points for 2,058 previously unknown species-level operational taxonomic units (OTUs), representing a 50% increase in the phylogenetic diversity of sequenced gut bacteria. On average, new OTUs comprise 33% of richness and 28% of species abundance per individual and are enriched in humans from rural populations. A meta-analysis of clinical gut microbiome studies pinpointed numerous disease associations for new OTUs, which have the potential to improve predictive models. Finally, our analysis revealed that uncultured gut species have undergone genome reduction with loss of certain biosynthetic pathways, which may offer clues for improving cultivation strategies in the future.