Figure - available from: Genome Biology
The performance of metaMIC on real metagenomic datasets. a The number of bins of different completeness with low contamination (<5%) reconstructed from original and corrected assemblies of “Ethiopian” (left) and “Madagascar” (right) cohorts. b Comparison of F1 scores for reconstructed bins before and after correction of contigs from “Ethiopian” (top) and “Madagascar” (bottom) cohorts. c An example of a predicted misassembled contig “k141_847840” assembled from a combined rumen fluid and solid sample. The top plot shows the alignment of the Illumina short-read assembled contig “k141_847840” against PacBio long-read assembled contigs (“contig_982” and “contig_158”), where two regions of “k141_847840” (1201-6738 bp and 6920-8700 bp) were aligned to “contig_982” and “contig_158,” respectively. The middle plot shows an Integrative Genomics Viewer snapshot for contig “k141_847840.” The bottom plot shows the anomaly score (blue) and read breakpoint ratio (orange) across contig “k141_847840.”

Source publication
Article
Evaluating the quality of metagenomic assemblies is important for constructing reliable metagenome-assembled genomes and downstream analyses. Here, we present metaMIC (https://github.com/ZhaoXM-Lab/metaMIC), a machine learning-based tool for identifying and correcting misassemblies in metagenomic assemblies. Benchmarking results on both simulated a...

Citations

... These strategies model misassembly by calculating descriptive statistics of alignment signals derived from the alignment of short reads against assemblies. Alignment signals that are commonly used to model misassembly include read depth (the number of reads that map to a genome region), discordant read pairs (read pairs from the same fragment whose mapping deviates in distance or orientation), and clipped reads (reads that have partial alignments to the assembly) [5]. These signals are then combined into a sophisticated statistical or machine learning model to learn and detect misassembly. ...
... These signals are then combined into a sophisticated statistical or machine learning model to learn and detect misassembly. ALE [6], a statistical model, utilizes Bayesian probability to describe the likelihood of each position to measure the quality of assemblies; metaMIC [5], a machine learning-based tool, harnesses a range of descriptive statistics of alignment features from alignments to detect and correct misassemblies. Deep learning, with its capacity to autonomously learn complex representations from extensive labeled datasets without expert guidance, presents a compelling approach for broadly identifying misassemblies. ...
... For fair evaluation, we balanced the number of misassemblies and correct assemblies in the metaSPAdes-assembled dataset, as metaSPAdes produces fewer misassemblies. Only assemblies longer than 5000 bp are considered, as the majority of misassemblies that occur in assemblies are longer than 5000 bp [5]. ...
Preprint
Accurate metagenomic assemblies are essential for constructing reliable metagenome-assembled genomes (MAGs). However, the complexity of microbial genomes continues to pose challenges for accurate assembly. Current reference-free assembly evaluation tools primarily rely on handcrafted features and suffer from poor generalization across different metagenomic data. To address these limitations, we propose DeepMM, a novel deep learning-based visual model designed for the identification and correction of metagenomic misassemblies. DeepMM transforms alignments between assemblies and reads into a multi-channel image for misassembly feature learning and applies contrastive learning to bring different views of misassemblies closer. Furthermore, DeepMM offers a fine-tuning process to match different sequencer data. Our results show that DeepMM outperforms state-of-the-art methods in identifying misassemblies, achieving the highest AUPRC score in five CAMI datasets. DeepMM provides accurate correction of misassemblies, significantly improving downstream binning results, increasing the number of near-complete MAGs from 905 to 1006 in a large real metagenomic sequencing dataset derived from a diarrhea-predominant Irritable Bowel Syndrome (IBS-D) cohort.
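The alignment signals named in the citation excerpts above (read depth, discordant read pairs, and clipped reads) can be illustrated with a small sketch. This is not the feature extraction of metaMIC or DeepMM; the `Read` record, the `window_signals` function, and the 100 bp window are all hypothetical choices made for illustration, and reads are supplied as plain records rather than parsed from a BAM file.

```python
from dataclasses import dataclass


@dataclass
class Read:
    """Hypothetical, simplified alignment record (not a real library type)."""
    start: int         # 0-based leftmost aligned position on the contig
    end: int           # exclusive end of the aligned part (end > start)
    clipped: bool      # True if part of the read did not align
    proper_pair: bool  # False if mate distance or orientation is unexpected


def window_signals(reads, contig_len, window=100):
    """Summarize three alignment signals per fixed-size window of a contig."""
    n_windows = (contig_len + window - 1) // window
    out = [{"depth": 0.0, "clipped": 0, "discordant": 0} for _ in range(n_windows)]
    for r in reads:
        # Add the number of bases this read covers in every window it overlaps.
        for w in range(r.start // window, (r.end - 1) // window + 1):
            lo = max(r.start, w * window)
            hi = min(r.end, (w + 1) * window, contig_len)
            out[w]["depth"] += hi - lo
        # Count clipped and discordant reads in the window where the read starts.
        w0 = r.start // window
        out[w0]["clipped"] += int(r.clipped)
        out[w0]["discordant"] += int(not r.proper_pair)
    # Convert covered bases into mean depth per window.
    for w, s in enumerate(out):
        width = min((w + 1) * window, contig_len) - w * window
        s["depth"] /= width
    return out


reads = [Read(0, 100, False, True), Read(50, 150, True, False)]
signals = window_signals(reads, contig_len=200, window=100)
```

In this toy input the first window has a mean depth of 1.5 and one clipped, discordant read, while the second window has a mean depth of 0.5. A detector would look for windows where such signals deviate jointly from the rest of the contig.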
... Mis-assembly was then identified using metaMIC [48] with default parameters for the contig generated from all assemblers. Mis-assembled contigs were corrected by splitting into fragments at the mis-assembled positions reported by the metaMIC tool; the fragments were considered as contigs and also used for subsequent analysis. ...
Article
Background Metagenome-assembled viral genomes have significantly advanced the discovery and characterization of the human gut virome. However, we lack a comparative assessment of assembly tools on the efficacy of viral genome identification, particularly across next-generation sequencing (NGS) and third-generation sequencing (TGS) data. Results We evaluated the efficiency of NGS, TGS, and hybrid assemblers for viral genome discovery using 95 viral-like particle (VLP)-enriched fecal samples sequenced on both Illumina and PacBio platforms. MEGAHIT, metaFlye, and hybridSPAdes emerged as the optimal choices for NGS, TGS, and hybrid datasets, respectively. Notably, these assemblers recovered distinct viral genomes, demonstrating a remarkable degree of complementarity. By combining individual assembler results, we expanded the total number of nonredundant high-quality viral genomes by 4.83 ~ 21.7-fold compared to individual assemblers. Among them, viral genomes from NGS and TGS data have the least overlap, indicating the impact of data type on viral genome recovery. We also evaluated four binning methods, finding that CONCOCT incorporated more unrelated contigs into the same bins, while MetaBAT2, AVAMB, and vRhyme balanced inclusiveness and taxonomic consistency within bins. Conclusions Our findings highlight the challenges in metagenome-driven viral discovery, underscoring tool limitations. We advocate for combined use of multiple assemblers and sequencing technologies when feasible and highlight the urgent need for specialized tools tailored to gut virome assembly. This study contributes essential insights for advancing viral genome research in the context of gut metagenomics.
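The correction step described in the citation excerpt for this publication (splitting a misassembled contig into fragments at the reported positions and treating the fragments as contigs) can be sketched as follows. This is a minimal illustration, not metaMIC's own code or output format: the function name `split_contig`, the 0-based breakpoint convention, and the `_partN` naming scheme are assumptions.

```python
def split_contig(name, seq, breakpoints):
    """Split a contig sequence at the given breakpoints.

    breakpoints are assumed to be 0-based positions; a breakpoint b starts
    a new fragment at seq[b]. Positions outside the contig interior are ignored.
    """
    cuts = sorted({b for b in breakpoints if 0 < b < len(seq)})
    bounds = [0] + cuts + [len(seq)]
    fragments = []
    for i, (start, end) in enumerate(zip(bounds, bounds[1:]), start=1):
        # Hypothetical naming scheme: original name plus a running part number.
        fragments.append((f"{name}_part{i}", seq[start:end]))
    return fragments


fragments = split_contig("contig_1", "ACGTACGTAC", [4, 7])
for frag_name, frag_seq in fragments:
    print(f">{frag_name}\n{frag_seq}")
```

A contig with no valid breakpoints is returned as a single fragment, so the same function can be applied to every contig regardless of whether a misassembly was reported for it.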
... Mis-assembly was then identified using metaMIC [44] with default parameters for the contig generated from all assemblers. Mis-assembled contigs were corrected by splitting into fragments at the misassembled positions reported by the metaMIC tool; the fragments were considered as contigs and also used for subsequent analysis. ...
Preprint
Background Metagenome-assembled viral genomes have significantly advanced the discovery and characterization of the human gut virome. However, we lack a comparative assessment of assembly tools on the efficacy of viral genome identification, particularly across Next Generation Sequencing (NGS) and Third Generation Sequencing (TGS) data. Results We evaluated the efficiency of NGS, TGS and hybrid assemblers for viral genome discovery using 95 viral-like particle (VLP) enriched fecal samples sequenced on both Illumina and PacBio platforms. MEGAHIT, metaFlye and hybridSPAdes emerged as the optimal choices for NGS, TGS and hybrid datasets, respectively. Notably, these assemblers produced distinctive viral genomes, demonstrating a remarkable degree of complementarity. By combining individual assembler results, we expanded the total number of non-redundant high-quality viral genomes by 4.83 ~ 21.7 fold compared to individual assemblers. Among them, viral genomes from NGS and TGS data have the least overlap, indicating the impact of data type on viral genome recovery. We also evaluated four binning methods, finding that CONCOCT incorporated more unrelated contigs into the same bins, while MetaBAT2, AVAMB and vRhyme balanced inclusiveness and taxonomic consistency within bins. Conclusions Our findings highlight the challenges in metagenome-driven viral discovery, underscoring tool limitations. We advocate for combined use of multiple assemblers and sequencing technologies when feasible and highlight the urgent need for specialized tools tailored to gut virome assembly. This study contributes essential insights for advancing viral genome research in the context of gut metagenomics.
... The failed samples were then assembled by MEGAHIT [80] (version 1.2.9) with default parameters. The resulting contigs were further refined with metaMIC [81] (https://github.com/ZhaoXM-Lab/metaMIC). ...
Article
Background Our facial skin hosts millions of microorganisms, primarily bacteria, crucial for skin health by maintaining the physical barrier, modulating immune response, and metabolizing bioactive materials. Aging significantly influences the composition and function of the facial microbiome, impacting skin immunity, hydration, and inflammation, highlighting potential avenues for interventions targeting aging-related facial microbes amidst changes in skin physiological properties. Results We conducted a multi-center and deep sequencing survey to investigate the intricate interplay of aging, skin physio-optical conditions, and facial microbiome. Leveraging a newly-generated dataset of 2737 species-level metagenome-assembled genomes (MAGs), our integrative analysis highlighted aging as the primary driver, influencing both facial microbiome composition and key skin characteristics, including moisture, sebum production, gloss, pH, elasticity, and sensitivity. Further mediation analysis revealed that skin characteristics significantly impacted the microbiome, mostly as a mediator of aging. Utilizing this dataset, we uncovered two consistent cutotypes across sampling cities and identified aging-related microbial MAGs. Additionally, a Facial Aging Index (FAI) was formulated based on the microbiome, uncovering the cutotype-dependent effects of unhealthy lifestyles on skin aging. Finally, we distinguished aging related microbial pathways influenced by lifestyles with cutotype-dependent effect. Conclusions Together, our findings emphasize aging’s central role in facial microbiome dynamics, and support personalized skin microbiome interventions by targeting lifestyle, skin properties, and aging-related microbial factors.
... Although such prior knowledge about target species in samples is in theory impossible in diagnostic mNGS applications, these modeling efforts show the importance of proper reference data selection for a given classification algorithm on mNGS read classification and warn for cautiously selecting reference databases fit for purpose. The quality of the de novo contigs can be assessed and even corrected by new tools such as metaMIC (38). Inclusion of the de novo contigs in the regularly updated reference database would be a great strategy for improving mNGS as a generic diagnostic method in both veterinary and public health. ...
Article
Metagenomic shotgun sequencing (mNGS) can serve as a generic molecular diagnostic tool. An mNGS proficiency test (PT) was performed in six European veterinary and public health laboratories to detect porcine astroviruses in fecal material and the extracted RNA. While different mNGS workflows for the generation of mNGS data were used in the different laboratories, the bioinformatic analysis was standardized using a metagenomic read classifier as well as read mapping to selected astroviral reference genomes to assess the semiquantitative representation of astrovirus species mixtures. All participants successfully identified and classified most of the viral reads to the two dominant species. The normalized read counts obtained by aligning reads to astrovirus reference genomes by Bowtie2 were in line with Kraken read classification counts. Moreover, participants performed well in terms of repeatability when the fecal sample was tested in duplicate. However, the normalized read counts per detected astrovirus species differed substantially between participants, which was related to the different laboratory methods used for data generation. Further modeling of the mNGS data indicated the importance of selecting appropriate reference data for mNGS read classification. As virus- or sample-specific biases may apply, caution is needed when extrapolating this swine feces-based PT for the detection of other RNA viruses or using different sample types. The suitability of experimental design to a given pathogen/sample matrix combination, quality assurance, interpretation, and follow-up investigation remain critical factors for the diagnostic interpretation of mNGS results. IMPORTANCE Metagenomic shotgun sequencing (mNGS) is a generic molecular diagnostic method, involving laboratory preparation of samples, sequencing, bioinformatic analysis of millions of short sequences, and interpretation of the results. 
In this paper, we investigated the performance of mNGS on the detection of porcine astroviruses, a model for RNA viruses in a pig fecal material, among six European veterinary and public health laboratories. We showed that different methods for data generation affect mNGS performance among participants and that the selection of reference genomes is crucial for read classification. Follow-up investigation remains a critical factor for the diagnostic interpretation of mNGS results. The paper contributes to potential improvements of mNGS as a diagnostic tool in clinical settings.
... due to chimeric assemblies or the presence of TerL sequences in defective prophages. As chimeric contigs are rare [48], we consider it unlikely that this significantly impacted our results. ...
Article
Viruses are core components of the human microbiome, impacting health through interactions with gut bacteria and the immune system. Most human microbiome viruses are bacteriophages, which exclusively infect bacteria. Until recently, most gut virome studies focused on low taxonomic resolution (e.g., viral operational taxonomic units), hampering population-level analyses. We previously identified an expansive and widespread bacteriophage lineage in inhabitants of Amsterdam, the Netherlands. Here, we study their biodiversity and evolution in various human populations. Based on a phylogeny using sequences from six viral genome databases, we propose the Candidatus order Heliusvirales. We identify heliusviruses in 82% of 5441 individuals across 39 studies, and in nine metagenomes from humans that lived in Europe and North America between 1000 and 5000 years ago. We show that a large lineage started to diversify when Homo sapiens first appeared some 300,000 years ago. Ancient peoples and modern hunter-gatherers have distinct Ca. Heliusvirales populations with lower richness than modern urbanized people. Urbanized people suffering from type 1 and type 2 diabetes, as well as inflammatory bowel disease, have higher Ca. Heliusvirales richness than healthy controls. We thus conclude that these ancient core members of the human gut virome have thrived with increasingly westernized lifestyles.
... For candidate genes, it is important to be able to distinguish between the two cases. Given the complexity of genome assembly methods, there continues to be interest in the development of algorithms for detecting misassemblies through different sources of information such as read coverage (Zhu et al., 2015), improper read alignments (Hunt et al., 2013), discrepancies in k-mer distributions between a reference genome and long-reads (Dishuck et al., 2023), mate-pair information (Kelley & Salzberg, 2010) and notable breakpoints based on the alignment of the focal assembly, the constituent raw reads or the genome of a closely related species (Asalone et al., 2020; Bao et al., 2018; Lai et al., 2022; Zhang et al., 2023). ...
Article
The improvement and decreasing costs of third-generation sequencing technologies have widened the scope of biological questions researchers can address with de novo genome assemblies. With the increasing number of reference genomes, validating their integrity with minimal overhead is vital for establishing confident results in their applications. Here, we present Klumpy, a tool for detecting and visualizing both misassembled regions in a genome assembly and genetic elements (e.g. genes) of interest in a set of sequences. By leveraging the initial raw reads in combination with their respective genome assembly, we illustrate Klumpy's utility by investigating antifreeze glycoprotein (afgp) loci across two icefishes, by searching for a reported absent gene in the northern snakehead fish, and by scanning the reference genomes of a mudskipper and bumblebee for misassembled regions. In the two former cases, we were able to provide support for the noncanonical placement of an afgp locus in the icefishes and locate the missing snakehead gene. Furthermore, our genome scans were able to identify an unmappable locus in the mudskipper reference genome and identify a putative repetitive element shared among several species of bees.
... For candidate genes, it is important to be able to distinguish between the two cases. Given the complexity of genome assembly methods, there continues to be interest in the development of algorithms for detecting misassemblies through different sources of information such as read coverage (Zhu et al. 2015), improper read alignments (Hunt et al. 2013), discrepancies in k-mer distributions between a reference genome and long-reads (Dishuck et al. 2023), mate-pair information (Kelley and Salzberg 2010), and notable breakpoints based on the alignment of the focal assembly, the constituent raw reads, or the genome of a closely related species (Bao et al. 2018; Asalone et al. 2020; Lai et al. 2022; Zhang et al. 2023). ...
Preprint
The improvement and decreasing costs of third-generation sequencing technologies have widened the scope of biological questions researchers can address with de novo genome assemblies. With the increasing number of reference genomes, validating their integrity with minimal overhead is vital for establishing confident results in their applications. Here, we present Klumpy, a tool for detecting and visualizing both misassembled regions in a genome assembly and genetic elements (e.g., genes, promoters, or transposable elements) of interest in a set of sequences. By leveraging the initial raw reads in combination with their respective genome assembly, we illustrate Klumpy’s utility by investigating antifreeze glycoprotein (afgp) loci across two icefishes, by searching for a reported absent gene in the northern snakehead fish, and by scanning the reference genomes of a mudskipper and bumblebee for misassembled regions. In the two former cases, we were able to provide support for the noncanonical placement of an afgp locus in the icefishes and locate the missing snakehead gene. Furthermore, our genome scans were able to identify a cryptic locus in the mudskipper reference genome, and identify a putative repetitive element shared amongst several species of bees.
... Briefly, for the NGS data, we used IDBA-UD (v1. Mis-assembly was then identified using metaMIC [32] with default parameters for the contig generated from all assemblers. Mis-assembled contigs were corrected by splitting into fragments at the misassembled positions reported by the metaMIC tool; the fragments were considered as contigs and also used for subsequent analysis. ...
Preprint
Background Metagenome-assembled viral genomes have significantly advanced the discovery and characterization of the human gut virome. However, we lack a comparative assessment of assembly tools on the efficacy of viral genome identification, particularly across Next Generation Sequencing (NGS) and Third Generation Sequencing (TGS) data. Results We evaluated the efficiency of NGS, TGS and hybrid assemblers for viral genome discovery using 95 viral-like particle (VLP) enriched fecal samples sequenced on both Illumina and PacBio platforms. MEGAHIT, metaFlye and hybridSPAdes emerged as the optimal choices for NGS, TGS and hybrid datasets, respectively. Notably, these assemblers produced distinctive viral genomes, demonstrating a remarkable degree of complementarity. By combining individual assembler results, we expanded the total number of non-redundant high-quality viral genomes by 4.43 ~ 11.8 fold compared to individual assemblers. Among them, viral genomes from NGS and TGS data have the least overlap, indicating the impact of data type on viral genome recovery. We also evaluated two binning methods, finding that CONCOCT incorporated more unrelated contigs into the same bins, while MetaBAT2 balanced inclusiveness and taxonomic consistency within bins. Conclusions Our findings highlight the challenges in metagenome-driven viral discovery, underscoring tool limitations. We recommend the simultaneous use of multiple assemblers, and of both short- and long-read sequencing if resources permit, and highlight the pressing need for specialized tools tailored to gut virome assembly. This study contributes essential insights for advancing viral genome research in the context of gut metagenomics.
... The MAGs were assembled as follows: each individual's saliva and faecal samples were independently subjected to de novo metagenomic assembly using metaSPAdes (V.3.15.5) with default parameters [34], followed by assembly refinement by metaMIC (V.1.0) [35] and binning using metaWRAP (V.1.3.2) [36]. After undergoing refinement using the 'bin_refinement' module in metaWRAP with parameters (-c 50 -x 10), we obtained a total of 14 044 and 8496 genomic bins from the faecal and salivary samples, respectively. ...
Article
Objective We aim to compare the effects of proton pump inhibitors (PPIs) and histamine-2 receptor antagonists (H2RAs) on the gut microbiota through longitudinal analysis. Design Healthy volunteers were randomly assigned to receive either PPI (n=23) or H2RA (n=26) daily for seven consecutive days. We collected oral (saliva) and faecal samples before and after the intervention for metagenomic next-generation sequencing. We analysed intervention-induced alterations in the oral and gut microbiome including microbial abundance and growth rates, oral-to-gut transmissions, and compared differences between the PPI and H2RA groups. Results Both interventions disrupted the gut microbiota, with PPIs demonstrating more pronounced effects. PPI usage led to a significantly higher extent of oral-to-gut transmission and promoted the growth of specific oral microbes in the gut. This led to a significant increase in both the number and total abundance of oral species present in the gut, including the identification of known disease-associated species like Fusobacterium nucleatum and Streptococcus anginosus. Overall, gut microbiome-based machine learning classifiers could accurately distinguish PPI from non-PPI users, achieving an area under the receiver operating characteristic curve (AUROC) of 0.924, in contrast to an AUROC of 0.509 for H2RA versus non-H2RA users. Conclusion Our study provides evidence that PPIs have a greater impact on the gut microbiome and oral-to-gut transmission than H2RAs, shedding light on the mechanism underlying the higher risk of certain diseases associated with prolonged PPI use. Trial registration number ChiCTR2300072310.