Electronic supplementary material: The online version of this article (10.1186/s12864-017-4294-1) contains supplementary material, which is available to authorized users.
METHODOLOGY ARTICLE Open Access
Optimizing and evaluating the
reconstruction of Metagenome-assembled
microbial genomes
Bhavya Papudeshi¹,², J. Matthew Haggerty³, Michael Doane³, Megan M. Morris³, Kevin Walsh³,
Douglas T. Beattie⁵, Dnyanada Pande¹, Parisa Zaeri⁶, Genivaldo G. Z. Silva⁴, Fabiano Thompson⁷,
Robert A. Edwards⁸ and Elizabeth A. Dinsdale³*
Abstract
Background: Microbiome/host interactions describe characteristics that affect the host's health. Shotgun
metagenomics includes sequencing a random subset of the microbiome to analyze its taxonomic and metabolic
potential. Reconstruction of DNA fragments into genomes from metagenomes (called metagenome-assembled
genomes) assigns unknown fragments to taxa/function and facilitates discovery of novel organisms. Genome
reconstruction incorporates sequence assembly and sorting of assembled sequences into bins, characteristic of a
genome. However, the microbial community composition, including taxonomic and phylogenetic diversity may
influence genome reconstruction. We determine the optimal reconstruction method for four microbiome projects
that had variable sequencing platforms (IonTorrent and Illumina), diversity (high or low), and environment (coral
reefs and kelp forests), using a set of parameters to select for optimal assembly and binning tools.
Methods: We tested the effects of the assembly and binning processes on population genome reconstruction
using 105 marine metagenomes from 4 projects. Reconstructed genomes were obtained from each project using 3
assemblers (IDBA, MetaVelvet, and SPAdes) and 2 binning tools (GroopM and MetaBat). We assessed the efficiency
of assemblers using statistics including contig continuity and contig chimerism, and the effectiveness of
binning tools using genome completeness and taxonomic identification.
Results: We concluded that SPAdes assembled more contigs (143,718 ± 124 contigs) of longer length (N50 = 1632
± 108 bp), and incorporated the most sequences (sequences-assembled = 19.65%). The microbial richness and
evenness were maintained across the assembly, suggesting low contig chimeras. SPAdes assembly was responsive
to the biological and technological variations within the project, compared with other assemblers. Among binning
tools, we conclude that MetaBat produced bins with less variation in GC content (average standard deviation: 1.49),
low species richness (4.91 ± 0.66), and higher genome completeness (40.92 ± 1.75) across all projects. MetaBat
extracted 115 bins from the 4 projects of which 66 bins were identified as reconstructed metagenome-assembled
genomes with sequences belonging to a specific genus. We identified 13 novel genomes, some of which were
100% complete but showed low similarity to genomes within databases.
Conclusions: We present a set of biologically relevant parameters for evaluating and selecting optimal
assembly and binning tools. For the tools we tested, the SPAdes assembler and MetaBat binning tool reconstructed
quality metagenome-assembled genomes for the four projects. We also conclude that metagenomes from
microbial communities that have high coverage of phylogenetically distinct taxa and low taxonomic diversity
result in the highest quality metagenome-assembled genomes.
* Correspondence: elizabeth_dinsdale@hotmail.com
³Department of Biology, San Diego State University, 5500 Campanile Drive, San Diego 92115, California, USA
Full list of author information is available at the end of the article
© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Papudeshi et al. BMC Genomics (2017) 18:915
DOI 10.1186/s12864-017-4294-1
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Background
Microbiome studies describe the significance of the microbial community that is associated with the
host organism [1]. However, less than 1% of all microbial species can be cultured in vivo [2–4];
therefore, applications of culture-independent sequencing technology have revolutionized microbiome
analysis [5–11]. Shotgun metagenomics provides a rapid assessment of microbial communities by
sequencing a random subset of the genetic material from the environment [2, 6–10, 12]. Annotations
of metagenomic DNA fragments are used to infer taxonomic and functional patterns within microbial
communities across multiple environments, including oceans [7, 13], coral reefs [5, 9, 13–18],
algae [19], and sharks [6]. However, linking functional genes from metagenomes to their taxonomic
origin is a complex task, because the sequences belong to multiple genomes. In addition, many
sequences may not match the database and therefore remain unidentified; for example, in the viral
community collected from a marine oxygen minimum zone only 2% of sequences were identified [20].
Improved sequencing technology and coverage have enabled reconstruction of fragments into
metagenome-assembled genomes by the process of assembly and binning. However, genome
reconstruction is affected by sequencing technology and the biological characteristics of the
microbial community. Sequencers are currently restricted by an inverse relationship between
sequence length and the number of reads. Longer reads provide more accurate annotation, whereas
shorter reads produce greater coverage of the community. High coverage is preferred in diverse
communities to identify rare species [21]. Similarly, if the divergence within the species in the
metagenome is small, reconstruction of metagenome-assembled genomes will inherently become
difficult due to the inseparability of the microbial genomes [2, 22]. It is unresolved how the
sequencing characteristics of read length and depth interact with the biological variation of the
microbial community during the reconstruction of genomes on real metagenomic datasets.
The first step in the reconstruction of genomes is assembly, where short metagenomic reads are
joined based on sequence overlap to form longer sequences called contigs. Assemblers apply
different algorithms, which may influence reconstructed genome quality. Incorrect assembly leads to
ambiguous conclusions from the data and reduces the number of annotations [23]. Therefore, assembly
evaluation is an important step that includes both contig continuity and contig chimerism. The
program QUAST (Quality Assessment for Genome Assemblies) calculates contig continuity by describing
both contig length and number of contigs [24]. Contig chimerism is due to random sequence overlap;
a chimeric contig therefore contains sequences from divergent bacteria and can be removed by tools
that assess read coverage, like Bowtie [25]. While not often recognized, changes in species
richness and evenness from raw sequences compared with assembled contigs can also be used to assess
contig chimerism, as assemblers should maintain richness (number of taxa identified) while
increasing evenness (greatest with equal distribution of taxa) [26–28]. In addition, a substantial
reduction in diversity may indicate chimera formation. Therefore, an optimal assembly will provide
a high number of long contigs, a high proportion of reads assembled, conserved species richness,
and an increased species evenness.
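The richness and evenness criteria above can be illustrated with a short sketch. The Methods of this paper name Margalef richness and Pielou's evenness, computed there with the Primer tool; the Python below uses the standard textbook forms, d = (S - 1) / ln N and J' = H' / ln S, and is not the authors' code. The function names and the taxon counts are hypothetical and serve only as an illustration.

```python
import math

def margalef_richness(counts):
    """Margalef richness d = (S - 1) / ln(N); S = taxa observed, N = total hits."""
    present = [c for c in counts if c > 0]
    total = sum(present)
    if total <= 1:
        raise ValueError("need more than one observation")
    return (len(present) - 1) / math.log(total)

def pielou_evenness(counts):
    """Pielou evenness J' = H' / ln(S), where H' is the Shannon index."""
    present = [c for c in counts if c > 0]
    total = sum(present)
    if len(present) < 2:
        raise ValueError("evenness is undefined for fewer than two taxa")
    shannon = -sum((c / total) * math.log(c / total) for c in present)
    return shannon / math.log(len(present))

# Hypothetical taxon hit counts for unassembled reads and for assembled contigs.
reads_counts = [700, 200, 80, 20]
contig_counts = [40, 30, 20, 10]

for label, counts in (("reads", reads_counts), ("contigs", contig_counts)):
    print(label,
          round(margalef_richness(counts), 3),
          round(pielou_evenness(counts), 3))
```

In this toy case the same four taxa are present before and after assembly, while the contig counts are more evenly distributed, which is the pattern the text describes for an assembly with few chimeras.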
Binning reconstructs genomes of taxa from the individual contigs, allowing sequences with no
homology to the databases to be annotated and the taxonomic origin of functional genes to be
identified [29–31]. Binning includes grouping phylogenetically related contigs into a bin, which
represents a population genome containing the gene content of closely related species [32]. Binning
tools group similar sequences based on sequence composition, which is an unsupervised approach that
uses genomic signatures, such as GC content [33], tetranucleotide frequencies [34–36], and read
coverage per contig [2, 29, 30]. An ideal bin will represent one bacterial genome with minimal GC
variation, minimal species richness, and ~100% genome completeness. To increase the quality of
binning, tools are advancing from applications using one genome signature, such as GroopM (group
metagenomes) [30] and cross assembly [29], to applications using a combination of genome
signatures, such as MetaBat (Metagenome Binning with Abundance and Tetra-nucleotide frequencies)
[31]. The quality of the resulting bins is assessed by calculating the variation in GC content,
species richness, and predicted genome completeness using tools such as CheckM (check genome
completeness) [37]. Bins containing sequences from mainly single taxa are metagenome-assembled
genomes. Bins that contain sequences similar to multiple taxa, but include most of the bacterial
marker genes, may be novel population genomes. Identifying novel microbes is a crucial objective of
reconstructing genomes from metagenomes. The phylogeny and genomic content of the novel genomes are
investigated using tools such as CheckM [37], PhyloSift (phylogenetic analysis of genomes and
metagenomes) [38], and RAST (Rapid Annotations using Subsystems Technology) [39]. Further,
relatedness to species can also be identified using average nucleotide identity (ANI), which
mirrors the results from DNA-DNA hybridization experiments to show species relatedness [40]. In
DNA-DNA hybridization a 70% cut-off delineates species relatedness and is reflected in the ANI
calculations as the proportion of protein-coding regions that align between two genomes [41]; if
ANI is > 95%, it represents species
relatedness [40]. As metagenomic analysis of microbial communities becomes more popular, many new
genomic tools are being produced to analyze the DNA sequences (https://omictools.com). There are
benefits and drawbacks to the analysis conducted by each tool, and understanding how these analyses
affect the results is essential to microbiologists. Previous evaluations of assemblers and binning
tools have emphasized computational efficiency, including runtime and memory usage. Many of these
analyses were completed on synthetic microbial communities rather than actual metagenomic data
[22], using parameters such as the number of misassemblies, genome recall, and precision, which are
a challenge to calculate on real datasets [22, 24, 31]. Another analysis used only one assembler
and binning tool [42], without comparing the effects of the assembler on the dataset. Other studies
have spiked genomic reads into metagenomes to investigate the number of reads required to
reconstruct a draft metagenome-assembled genome [43]. In this paper, we investigate the effect of
assembly and binning by comparing 105 metagenomes that were 1) recovered from different marine
environments, 2) varied in diversity, and 3) sequenced on different sequencing platforms.
Biologically relevant parameters are used to analyze the data after the application of each tool.
We hypothesize that the biological characteristics will affect assembly and binning. First, the
assembly quality for the three assemblers, IDBA (Iterative De Bruijn graph Assembler), MetaVelvet
(METAgenomic-Velvet assembler), and SPAdes (St. Petersburg genome assembler), was assessed using a
set of assembly statistics, including contig continuity and contig chimerism. The optimal assembler
was applied to each project, followed by two composition-based binning tools, GroopM and MetaBat,
to reconstruct genomes. These bins were assessed for genome completeness and taxonomic
identification. Last, we explore the genomic content and phylogenetic relationships of a
metagenome-assembled genome. Our pipeline is shown in Fig. 1.
Methods
Metagenomes collection
To test the effects of the assembly and binning processes on population genome reconstruction, we
used 105 marine metagenomes from 4 projects. The projects were collected from coral atolls in
Abrolhos Bank, Brazil (coral) and Southern California kelp forests (kelp) (see Additional file 1:
Table S1). In two of the projects, the microbial community was experimentally manipulated before
sequencing to reduce the diversity of the microbes, and these projects are labeled as coral low
diversity (coral_IT_low) [9] and kelp low diversity (kelp_IL_low) [8]. The other two projects are
natural microbial communities collected from the marine water associated with the same environments
and are called coral high diversity (coral_IL_high) [14] and kelp high diversity (kelp_IT_high)
[10]. Coral_IT_low and kelp_IT_high metagenomes were sequenced on Ion Torrent PGM (IT), 200
sequencing kit (ThermoFisher Scientific), whereas coral_IL_high and kelp_IL_low were sequenced on
an Illumina MiSeq v3 reagent cartridge (IL), 600 cycle kit (Illumina Inc.). Many metagenomes are
publicly available on MG-RAST (MetaGenomics-Rapid Annotation using Subsystems Technology); thus the
pipeline started with obtaining the metagenomes from this database [44] (Table 1). The variation
between the different projects was used to identify the repeatability of the workflow on datasets
that vary in the environment sampled, the level of biological diversity, and the sequencing
platform used.
Fig. 1 Overview of the workflow developed with the tools applied at each step (in bold). a Metagenomic reads are assembled using three
assemblers: SPAdes, MetaVelvet, and IDBA. b Optimization of assembly tool using assembly statistics. c Assembled contigs from optimal assembly
were binned using MetaBat and GroopM. d Optimal binning tool selected through bin validation. Colors (black, dark grey, and light grey) depict
different microbial species, each line of a color representing the sequence belonging to a bacterial species
The first step in a metagenomic pipeline is to remove poor quality sequences by running each
metagenome through PRINSEQ (PReprocessing and INformation of SEQuence data) [45]. PRINSEQ was used
to remove sequencing tags, duplicates, and Ns within the metagenome. Forward and reverse reads from
the Illumina MiSeq platform were first paired using PEAR (Paired-End Read merger) [46]. All the
reads from a project were placed together in one file and cross-assembled (i.e., all metagenomes
from the one project were assembled together) using three De Bruijn graph assemblers: IDBA,
MetaVelvet, and SPAdes. Default kmer sizes were applied for each tool: IDBA (k_min: 25), MetaVelvet
(kmer: 31), and SPAdes (kmers: 21, 33, and 55).
Assembly evaluation
Each assembler (IDBA, MetaVelvet, and SPAdes) provides one output contig file for each project,
therefore providing 12 contig files in total. We calculated the assembly statistics for the 12
contig files using QUAST [24], including N50 length, L50 (the number of contigs of length N50 or
longer), the number of contigs assembled, the length of the largest contig, and the total length of
the assembly. Contig continuity was assessed using contig length (length of 1000 contigs from each
of the 12 contig files) and the total number of contigs per assembly. Contig chimerism was first
assessed by calculating the proportion of reads assembled (for 1000 contigs from each of the 12
contig files) using Bowtie [25]. FOCUS (Find Organisms by Composition USage), a taxa identification
tool that is alignment independent, was applied to the 12 contig files. The resulting information
was used to calculate the Margalef richness and Pielou's evenness of the 12 contig files using the
Primer statistics tool [47]. FOCUS was used explicitly for this step, as each contig is assigned to
bacterial species based on kmer ratios [48]. Contig chimeras will have variable kmer ratios and
will remain unidentified by FOCUS and be removed from further analysis. The second step for
assessing contig chimerism included a comparison of Margalef richness and Pielou's evenness of the
12 contig files against the metagenomic reads. The overall proportion of reads assembled into the
entire assembly for the 12 contig files was also calculated using Bowtie.
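The N50 and L50 statistics mentioned above can be sketched in a few lines. This is a minimal illustration of the commonly used definitions (N50 is the length of the contig at which the cumulative length of the longest contigs first reaches half of the total assembly length; L50 is the number of contigs needed to reach that point). It is not QUAST's implementation, and the function name and contig lengths are hypothetical.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    if not contig_lengths:
        raise ValueError("no contigs supplied")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, rank

# Hypothetical contig lengths (bp) for a toy assembly.
lengths = [100, 200, 300, 400, 500]
n50, l50 = n50_l50(lengths)
print(f"N50 = {n50} bp, L50 = {l50} contigs")
```

For these toy lengths the total is 1500 bp, so the two longest contigs (500 + 400 bp) are the first to cover half the assembly.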
The contigs from the optimal assembler for each project were selected and uploaded to the Contig
Clustering of Metagenomics (CCOM) tool [49] along with their read files in FASTA format to perform
GroopM [30] and MetaBat [31] clustering. The CCOM tool runs the BWA (Burrows-Wheeler Aligner)
aligner to map reads onto contigs, producing output in BAM format. GroopM and MetaBat both use the
contigs (.fasta) and mapped reads (.bam) as input to extract the resulting bins.
Bin validation
The CCOM tool extracted two sets of bins for each project, one from GroopM and one from MetaBat.
Evaluation of binning tools was performed using bin characteristics including variation in GC
content, species richness, and genome completeness. GC content was calculated using a self-written
Biopython [50] script. Taxonomic composition for each bin was predicted using FOCUS [48]. Margalef's
species richness was calculated using Primer [47] for the FOCUS taxonomy results. Genome
completeness was assessed using CheckM [37]. A bin was identified as a specific population genome
if the bin included sequences belonging to a single genus. Species or strain level resolution could
be used depending on the amount of coverage and diversity of the microbes. Potentially novel bins
were identified as those bins that had > 50% genome completeness but were not annotated by FOCUS.
These potentially novel bins were further analyzed using CheckM [37], PhyloSift [38], and RAST
[39], all of which predict the neighboring genomes using marker genes. The proteome content of a
novel population genome was investigated using PATRIC (Pathosystems Resource Integration Center)
[51], followed by calculating the average nucleotide identity of the protein-encoding genes by
applying the BLAST-based (ANIb) analysis and tetranucleotide correlation search (TCS) in the
JSpeciesWS tool [41].
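The per-bin GC variation used above can be sketched as follows. The authors' Biopython script is not reproduced in the paper, so this is a standard-library stand-in written under stated assumptions: GC content is computed per contig over unambiguous bases, and variation is summarized as the sample standard deviation across the contigs of one bin. All function names and the toy sequences are hypothetical.

```python
import io
import statistics

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def gc_percent(sequence):
    """GC content as a percentage of unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0

def bin_gc_summary(handle):
    """Mean and sample standard deviation of per-contig GC% within one bin."""
    values = [gc_percent(seq) for _, seq in read_fasta(handle)]
    spread = statistics.stdev(values) if len(values) > 1 else 0.0
    return statistics.mean(values), spread

# Hypothetical three-contig bin; the last contig spans two sequence lines.
toy_bin = io.StringIO(
    ">contig_1\nGGCCAATT\n>contig_2\nGGCCGGAT\n>contig_3\nGCGCATAT\nGCAT\n"
)
mean_gc, sd_gc = bin_gc_summary(toy_bin)
print(f"mean GC = {mean_gc:.2f}%, SD = {sd_gc:.2f}")
```

A bin whose contigs all derive from one genome would be expected to show a small standard deviation, which is how the text uses GC variation as a bin quality measure.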
Statistical analysis
Table 1 Background information on the projects used to evaluate the selection of assembly and binning tools
Project name    Source                      Number of metagenomes   Total number of reads   Sequencing technology   Environment
coral_IL_high   Abrolhos, Brazil. 2014      16                      20,711,400              Illumina MiSeq (IL)     Coral atolls (coral)
coral_IT_low    Abrolhos, Brazil. 2011      15                      18,323,050              IonTorrent PGM (IT)     Coral atolls (coral)
kelp_IL_low     San Diego, USA 2015         51                      6,493,217               Illumina MiSeq (IL)     Kelp forest (kelp)
kelp_IT_high    San Diego, USA 2012–2013    23                      9,769,952               IonTorrent PGM (IT)     Kelp forest (kelp)
The first statistical analysis was a one-way ANOVA (ANalysis Of VAriance) conducted on the
unassembled metagenomes from each project to identify differences in microbial diversity. Assembly
evaluation variables
included the number of contigs, richness, and evenness, and binning tool evaluation variables
included GC content, species richness, and genome completeness. These variables were tested for
normality using the Shapiro-Wilk test, and non-normal data were log transformed when appropriate.
Data containing many instances (> 5000), for example, contig length and percent of reads assembled,
were tested for normality using the Kolmogorov-Smirnov test, and non-normal data were log
transformed when appropriate. To test for differences in assemblers, a one-way ANOVA was conducted
on the following variables: the number of contigs, richness, and evenness. A one-way ANOVA was used
because there was only one data point for each variable per project, because the metagenomes were
cross assembled. To investigate whether the assemblers performed differently depending on the
projects, a 2-way ANOVA model was conducted on the factors project, assembler, and project by
assembler as the interaction term for the variables contig length and reads assembled. For the
2-way ANOVA, the data were subsampled to select the 1000 longest contigs in each project, because
running statistics on all 300,000 contigs is not feasible. Tukey HSD post hoc comparisons were
performed to identify the project that contributed to the differences. Similar statistics were
conducted on the binning evaluation variables for the two binning tools, MetaBat and GroopM. To
investigate whether the binning tools performed differently depending on the projects, a 2-way
ANOVA model was conducted on the factors project, binning tool, and project by binning tool as the
interaction term for the variables GC variation, richness, and genome completeness. Overall, the
statistical analysis was implemented using R scripts and visualized using Sigma Plot (Systat
Software, San Jose, CA).
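To make the one-way ANOVA design concrete, the sketch below computes the F statistic and its degrees of freedom in plain Python. The authors ran their statistics in R; this is only an illustration of the textbook calculation, and the function name and the contig counts are hypothetical. With four groups of three observations (for example, four projects each assembled by three assemblers), the degrees of freedom are 3 between groups and 8 within groups.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical contig counts (thousands): one value per assembler, grouped by project.
toy_counts = [
    [210, 180, 240],  # project A
    [150, 160, 140],  # project B
    [75, 70, 78],     # project C
    [72, 55, 85],     # project D
]
f_stat, df_b, df_w = one_way_anova(toy_counts)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A P value would additionally require the F distribution, which is available in R (`pf`) or in third-party Python libraries and is left out here to keep the sketch dependency-free.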
Results
Variation between projects
The metagenomes from four projects were downloaded from MG-RAST (Table 1). Samples were from two
environments, coral atolls and kelp forest, and were sequenced on two sequencing platforms,
Illumina and IonTorrent (Table 1). In each environment, a subset of samples was experimentally
manipulated before sequencing to reduce the diversity of the microbes. Diversity measures were
significantly different between the four projects (P < 0.05) (see Additional file 2: Figure S1). A
Tukey HSD post hoc test conducted on the four diversity parameters showed that the coral_IT_low
project was significantly lower in diversity than the remaining projects (P < 0.05) (see Additional
file 3: Table S2). However, the manipulation of the kelp_IL_low project did not result in a
significant decrease in taxonomic diversity.
Assembly evaluation
The 12 contig files (4 projects, 3 assemblers) were analyzed using QUAST, which identified that
SPAdes and IDBA provided high contig continuity compared to MetaVelvet, which assembled fewer
contigs with shorter lengths (see Additional file 4: Table S3).
Contig continuity was further assessed using contig length (length of 1000 contigs from each of the
12 contig files), the total number of contigs per assembly, and the proportion of reads assembled
(1000 contigs from each of the 12 contig files). Each project assembled a significantly different
number of contigs (F(3, 8) = 6.56, P = 0.01); a greater number of contigs was assembled for
Illumina (coral_IL_high = 209,144 ± 26,756, kelp_IL_low = 153,607 ± 11,954) compared to IonTorrent
(coral_IT_low = 73,772 ± 3450, kelp_IT_high = 70,759 ± 15,380) (Fig. 2a). The length of 1000
contigs from the 12 files showed a significant difference between the three assemblers
(F(2, 11994) = 133,077, P < 0.001), the four projects (F(3, 11994) = 35,061, P < 0.001), and an
interaction between the projects and assemblers (F(6, 11994) = 7551, P < 0.001). SPAdes provided
longer contigs for Illumina (coral_IL_high: 22,728 ± 5797 bp, kelp_IL_low: 14,957 ± 3660 bp)
compared to IonTorrent projects (coral_IT_low: 697 ± 299 bp, kelp_IT_high: 638 ± 51 bp) (Fig. 2b).
The IDBA assembler performed uniformly for the different projects, varying from a mean length of
3359 bp to 11,203 bp. A Tukey HSD post hoc test showed that all the project and assembler
combinations were significantly different (see Additional file 5: Table S4).
Contig chimerism was assessed using Bowtie analysis, which identifies the number of reads in the
assembly by mapping the reads to contigs. Significant differences were observed for reads assembled
(1000 contigs) between assemblers (F(2, 11988) = 29,139, P < 0.001), projects (F(3, 11988) = 4677,
P < 0.001), and the interaction term between assemblers and projects (F(6, 11988) = 8046,
P < 0.001) (see Additional file 6: Table S5). The differences were caused by the high diversity
samples (coral_IL_high, kelp_IT_high) having a lower proportion of reads assembled compared with
the low diversity samples (coral_IT_low, kelp_IL_low) (Fig. 2c). IDBA and SPAdes followed this
pattern, except for the IDBA coral_IT_low samples, which assembled a lower number of reads
(Fig. 2c). SPAdes was found to be selective for coral atoll projects (coral_IL_high, coral_IT_low),
providing contigs with a higher read coverage compared to kelp forest samples (kelp_IL_low,
kelp_IT_high) (Fig. 2c).
The richness and evenness of the assembled sequences were compared against their respective
unassembled reads and showed no significant difference in diversity after assembly (richness:
P = 0.92, evenness: P = 0.91), suggesting that microbial richness was maintained with minimal
chimera formation. Similarly, microbial evenness did not show a significant difference between
assemblers (Fig. 2) (see Additional file 7: Table S6).
Overall the assessment showed that the SPAdes assembly generated contigs of longer length (N50:
1632 bp) with a higher proportion of reads assembled into contigs (reads assembled (all contigs):
19.65 ± 1.41%) compared with IDBA (N50: 1024 ± 7.15 bp, reads assembled (all contigs): 16.83 ±
1.56%). However, the SPAdes assembler performed selectively for the different projects (Fig. 2),
suggesting that the underlying biology and sequencer affect assembly. The assembly provided by IDBA
was similar across all projects, suggesting it is not responsive to the underlying biology of the
microbial communities. MetaVelvet performed poorly in all aspects. In addition, the SPAdes assembly
showed no significant bias in richness and evenness compared to the reads, suggesting a lower
proportion of contig chimerism. Therefore, based on our data of contig continuity and contig
chimerism, we selected SPAdes as the optimal assembler.
Binning tools evaluation
SPAdes-assembled contigs for the four projects were binned using two different binning tools,
GroopM and MetaBat. The GroopM binning tool applies only one genome signature, contig coverage,
i.e., it groups contigs that have a similar proportion of reads that were combined from each
metagenome, and this process extracted a high number of bins (coral_IL_high: 71, coral_IT_low: 31,
kelp_IL_low: 117, and kelp_IT_high: 37 bins). MetaBat applies a combination of two genome
signatures, contig coverage and tetranucleotide frequency, and the more stringent parameters
extracted fewer bins (coral_IL_high: 57, coral_IT_low: 17, kelp_IL_low: 17, and kelp_IT_high: 24
bins).
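The tetranucleotide frequency signature mentioned above can be illustrated with a short sketch. It builds a 256-element frequency vector for a contig and compares two contigs with a Pearson correlation. This is only a simplified illustration of the idea of a composition signature: MetaBat's actual distance calculation, and its combination with coverage, are more involved and are not reproduced here. Reverse-complement merging is also omitted for brevity. All names and sequences are hypothetical.

```python
import math
from collections import Counter
from itertools import product

# All 256 possible tetramers over the unambiguous DNA alphabet.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]

def tetranucleotide_frequencies(sequence):
    """Relative frequency of each tetramer; windows with ambiguous bases are skipped."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    valid_total = sum(counts[t] for t in TETRAMERS)
    if valid_total == 0:
        raise ValueError("sequence has no unambiguous tetramers")
    return [counts[t] / valid_total for t in TETRAMERS]

def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical contigs: two GC-rich sequences with shared motifs and one AT-rich sequence.
contig_a = "ATGCGCGCATATGCGCGGCC" * 5
contig_b = "GCGCGGCCATGCGCGCATAT" * 5
contig_c = "ATATTTAAATTTATAAATAT" * 5

freq_a = tetranucleotide_frequencies(contig_a)
freq_b = tetranucleotide_frequencies(contig_b)
freq_c = tetranucleotide_frequencies(contig_c)
print("a vs b:", round(pearson(freq_a, freq_b), 3))
print("a vs c:", round(pearson(freq_a, freq_c), 3))
```

Contigs from the same genome are expected to have more similar tetranucleotide profiles than contigs from unrelated genomes, which is why composition can be combined with coverage when sorting contigs into bins.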
Fig. 2 Assembly evaluation of the IDBA, MetaVelvet, and SPAdes assemblers for cross-assembled contigs for the four projects
(coral_IL_high, coral_IT_low, kelp_IL_low, and kelp_IT_high) based on the parameters (a) number of contigs, (b) mean contig
length for 1000 contigs (bp), (c) mean reads assembled for 1000 contigs (%), (d) richness, and (e) evenness. The performance of
each assembler is shown for all five parameters for each project; the lines in graphs b and c represent the standard errors
Papudeshi et al. BMC Genomics (2017) 18:915 Page 6 of 13
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
The population genome bins obtained from GroopM and MetaBat were evaluated
for the following parameters: variation in GC content, genus richness, and
genome completeness (Fig. 3a). Two-way ANOVA was performed on variation in
GC content, genus richness, and genome completeness and identified differences
between binning tools (GC variation: $F_{1,368}$ = 4.43, P < 0.03; genus
richness: $F_{1,362}$ = 37.56, P < 0.001; genome completeness: $F_{1,367}$ = 24.78,
P < 0.001). A significant interaction between projects and binning tools was
detected for GC variation ($F_{3,368}$ = 19.18, P < 0.001), richness
($F_{3,362}$ = 4.96, P < 0.001), and genome completeness ($F_{3,367}$ = 3.88,
P < 0.001). The bins that MetaBat produced from the low-diversity coral reef
and kelp forest projects were each dominated by one or a few species, showing
that low-diversity samples separate into better population genomes. The bins
extracted by GroopM for the low-diversity kelp project were poorly separated,
with multiple taxa identified in each bin (Fig. 3b). MetaBat bins had greater
genome completeness than GroopM bins for all projects except coral_IT_low
(Fig. 3c). Overall, MetaBat produced bins with less variation in GC content,
lower species richness (4.91 ± 0.66), and higher genome completeness
(40.92 ± 1.75) than GroopM (species richness: 7.41 ± 0.66; genome completeness:
25.17 ± 1.80) (see Additional file 8: Table S7), irrespective of the project (Fig. 3).
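The GC-variation parameter used above can be illustrated with a minimal sketch. The text does not state exactly how the variation was computed, so the code below assumes it is the population standard deviation of per-contig GC percentages within a bin, unweighted by contig length. The function names and the toy bins are hypothetical.

```python
from statistics import mean, pstdev


def gc_percent(sequence):
    """Return the GC content of a DNA sequence as a percentage."""
    seq = sequence.upper()
    # Count only unambiguous bases so that N or other IUPAC codes do not dilute the value.
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def gc_variation(bin_contigs):
    """Return (mean GC %, standard deviation of GC %) across the contigs of one bin.

    Assumption: unweighted population standard deviation over contigs.
    """
    values = [gc_percent(contig) for contig in bin_contigs]
    return mean(values), pstdev(values)


# Hypothetical toy bins: one compositionally uniform, one compositionally mixed.
uniform_bin = ["ATGCATGC", "GGCCAATT", "ACGTACGT"]
mixed_bin = ["ATATATAT", "GCGCGCGC", "ATGCATGC"]

for name, contigs in (("uniform_bin", uniform_bin), ("mixed_bin", mixed_bin)):
    avg, sd = gc_variation(contigs)
    print(f"{name}: mean GC = {avg:.1f}%, SD = {sd:.2f}")
```

A bin whose contigs come from a single genome would be expected to show a small standard deviation, whereas a bin that mixes organisms with different base composition would show a larger one.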
Bin validation and metagenome-assembled genome
identification
An ideal reconstruction of a microbial genome would be one in which each bin
represents one metagenome-assembled genome that includes a high abundance of
contigs from closely related species. Therefore, the taxonomic composition of
the MetaBat bins was identified using FOCUS. Because these are genomes
reconstructed from metagenomic data, some of the contigs placed into a bin may
not have a taxonomic annotation, and these contigs will represent novel genomic
material from the environment. In addition, some of the contigs placed in the
same bin will have mixed taxonomic assignments, suggesting that these contigs
came from organisms that are phylogenetically similar to those in the database
and cannot be separated by this process. In some bins, most contigs will have a
similar taxonomic identification, with a few contigs from distinct taxa; these
could be DNA that has been horizontally transferred or contamination by contigs
that cannot be sorted by the binning process. Identifying novel organisms,
sister species, and horizontally transferred DNA is an important part of the
reconstruction process and will increase the description of microbial diversity.
Each project produced a different proportion of metagenome-assembled genomes
that were similar to a single genus: coral_IL_high showed 46.42%, coral_IT_low
88.23%, kelp_IL_low 64.70%, and kelp_IT_high 62.5% (Fig. 4). Genus-level
classification was applied to identify closely related species. Kelp_IL_low
bins 9 and 13 contained multiple genera (Ketogulonicigenium, Ruegeria, and
Roseobacter), suggesting that these bins contain sequences belonging to the
family Rhodobacteraceae and thus could represent closely related novel species.
Several bins contained a high abundance of sequences belonging to one microbial
genus (Alteromonas or Vibrio metagenome-assembled genomes); however, they also
included sequences belonging to other, distantly related taxa. A proportion of
bins from each project had high completeness, but their genus identification
was not apparent through FOCUS, suggesting that they could be novel genomes
(shown in black in Fig. 4). The proportion of potentially novel genomes varied
among projects; for example, coral_IT_low showed no potentially novel genomes,
whereas coral_IL_high had 51.78% potentially novel metagenome-assembled genomes.
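The bin labelling described in this section can be sketched as a simple rule. The 20% genus-abundance cut-off and the 50% completeness cut-off are taken from the Fig. 4 caption; the decision order, the label strings, and the example bins are assumptions made for illustration and are not the authors' implementation.

```python
def classify_bin(genus_abundance, completeness,
                 min_abundance=20.0, min_completeness=50.0):
    """Label a bin from its genus profile (percent per genus) and completeness (percent).

    Assumed rule: a genus is "dominant" if it exceeds min_abundance; a bin with no
    dominant genus but completeness above min_completeness is potentially novel.
    """
    dominant = sorted(g for g, pct in genus_abundance.items() if pct > min_abundance)
    if len(dominant) == 1:
        return "single genus: " + dominant[0]
    if len(dominant) > 1:
        return "multiple genera: " + ", ".join(dominant)
    if completeness > min_completeness:
        return "potentially novel"
    return "unresolved"


def single_genus_percentage(labels):
    """Percentage of bins labelled as belonging to a single genus."""
    if not labels:
        return 0.0
    hits = sum(1 for label in labels if label.startswith("single genus"))
    return 100.0 * hits / len(labels)


# Hypothetical bins: (genus abundance profile in %, genome completeness in %).
bins = {
    "bin_1": ({"Vibrio": 72.0, "Alteromonas": 8.0}, 85.0),
    "bin_2": ({"Ruegeria": 30.0, "Roseobacter": 25.0, "Ketogulonicigenium": 22.0}, 60.0),
    "bin_3": ({"Vibrio": 9.0, "Ruegeria": 7.0, "Polaribacter": 6.0}, 93.0),
    "bin_4": ({"Vibrio": 12.0}, 30.0),
}

labels = {name: classify_bin(profile, completeness)
          for name, (profile, completeness) in bins.items()}
for name, label in labels.items():
    print(name, "->", label)
print(f"single-genus bins: {single_genus_percentage(list(labels.values())):.1f}%")
```

In practice the genus profile would come from a taxonomic profiler such as FOCUS and the completeness from a marker-gene tool such as CheckM; here both are supplied by hand.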
Fig. 3 Evaluation of the binning tools MetaBat (white) and GroopM (grey) using three parameters: (a) variation in GC content,
shown as a box and whisker plot in which the mean value is represented by a bold line in the box and the second line represents
the median value for the data, (b) species richness, and (c) genome completeness (%). Error bars are one standard error
Investigating novel metagenome-assembled genomes
Overall, 13 bins (coral_IL_high: 7 bins, kelp_IL_low: 1 bin, and kelp_IT_high:
5 bins) had greater than 50% completeness with ambiguous genus identifications
(Table 2). These bins contain sequences with similar tetranucleotide
frequencies, similar contig coverage profiles, and high genome completeness
(presence of bacterial marker genes). The 13 potentially novel
metagenome-assembled genomes were analyzed using marker genes and alignment to
identify their closest phylogenetic neighbors with CheckM, PhyloSift, RAST, and
ANI (Table 2). Of the 13 bins, 8 were identified by two or more tools as the
same taxon: coral_IL_high bin 13 contains sequences belonging to the class
Alphaproteobacteria, coral_IL_high bin 14 is phylogenetically similar to the
genus Alteromonas, coral_IL_high bin 41 to a marine gammaproteobacterium,
coral_IL_high bin 54 to the SAR86 cluster, kelp_IL_low bin 5 to Oceanibulbus
indolifex, kelp_IT_high bin 8 to Limnobacter sp., kelp_IT_high bin 7 to the
order Flavobacteriales, and kelp_IT_high bin 20 to the family Rhodobacteraceae
(Table 2).
Distinguishing novel metagenome-assembled genomes
A single metagenome-assembled genome, coral_IL_high bin 13, was identified as
having 100% genome completeness, containing all 104 conserved bacterial marker
genes. It was phylogenetically affiliated with Parvibaculum lavamentivorans by
CheckM and RAST, and with Alpha proteobacterium IMCC14465 by PhyloSift. In GC
content, genome size, number of protein-encoding genes, and number of RNA
genes, the reconstructed genome (coral_IL_high bin 13) was more similar to
Parvibaculum lavamentivorans than to Alphaproteobacteria IMCC14465 (see
Additional file 9: Table S8). However, proteome comparison of the reconstructed
genome with Parvibaculum lavamentivorans and Alphaproteobacteria IMCC14465
showed 44.12% similarity to both reference organisms (Fig. 5a and b). The
average nucleotide identity (ANI) of the novel population genome was 63.50%
with Alphaproteobacteria IMCC14465 and 62.52% with Parvibaculum
lavamentivorans. The tetranucleotide frequencies of the novel
metagenome-assembled genome were further compared against a database and were
82.22% similar to those of Pelagibacter ubique. Proteome comparison against
Pelagibacter ubique showed 90.35% similarity (Fig. 5c), compared to the 44.12%
shown earlier (Fig. 5b).
Coral_IL_high bin 13 has twice the GC content, genome size, number of
protein-encoding genes, and number of RNA sequences of Pelagibacter ubique (see
Additional file 9: Table S8); we therefore suggest that it is a novel genome
within the Alphaproteobacteria. The identification of the novel genomes
provides support that the metagenome-assembled genomes contain environmentally
relevant genomic material that is not present in the cultured relatives from
the databases.
Fig. 4 Taxonomic identification of the MetaBat bins using FOCUS for the four projects: (a) coral_IL_high, (b) coral_IT_low,
(c) kelp_IL_low, and (d) kelp_IT_high. Population genomes belonging to 32 genera were identified with abundance > 20%, and
their relative abundance in a bin is plotted. A category "potentially novel population genomes" is shown in black to represent
bins that were identified to different taxa with low abundance. We predict that bins with high species richness and greater
than 50% genome completeness are potentially novel population genomes
Table 2 List of 13 novel bins identified from the four projects, with the closest neighbor and similarity index using CheckM,
PhyloSift, RAST, and JSpeciesWS
Bin               Contigs  Completeness (%)  GC (%)  Genome size (Mbp)  Gene count
coral_IL_high_13  769      100               53.7    3.96               4342
coral_IL_high_14  1215     99.14             44.3    8.85               7897
coral_IL_high_26  1554     87.93             34.6    6.72               7756
coral_IL_high_28  270      52.27             39.4    1.03               1175
coral_IL_high_41  825      79.67             56.7    2.99               3384
coral_IL_high_49  3536     50.62             51.1    12.97              12,794
coral_IL_high_54  297      93.1              37.8    2.51               2776
kelp_IL_low_5     1046     90.7              62.8    5.16               5903
kelp_IT_high_1    848      82.45             51.6    5.04               7180
kelp_IT_high_7    487      77.27             43.3    1.97               3260
kelp_IT_high_8    1224     54.55             52.1    2.98               5637
kelp_IT_high_16   1011     57.94             39.5    2.06               3981
kelp_IT_high_20   836      84.8              52      4.19               6227
Closest neighbors for each bin, listed in the order in which they appear across the CheckM, PhyloSift, RAST, and JSpeciesWS
(best hit that has >90%) columns; empty cells are not shown:
coral_IL_high_13: Parvibaculum lavamentivorans; Alphaproteobacteria strain IMCC14465; Parvibaculum lavamentivorans; Pelagibacter ubique
coral_IL_high_14: Alphaproteobacteria strain HIMB5; Alteromonas macleodii; Alteromonas mediterranea; Alteromonas naphthalenivorans
coral_IL_high_26: Verrucomicrobia; SAR86 cluster bacterium SAR86A; Ruegeria sp. R11, Roseobacter denitrificans OCh 114; Alteromonas mediterranea
coral_IL_high_28: Alteromonas taeanensis; Flavobacteria strain MS024 2A; Polaribacter sp. MED152; SAR116 cluster alpha proteobacterium HIMB100
coral_IL_high_41: Gammaproteobacteria strain HIMB55; marine gammaproteobacteria strain HTCC2080
coral_IL_high_49: Bacteria; Gammaproteobacteria strain IMCC3088
coral_IL_high_54: unresolved; SAR86 cluster strain SAR86E; SAR86 cluster bacterium SAR86E
kelp_IL_low_5: Oceanibulbus indolifex; Oceanibulbus indolifex; Oceanibulbus indolifex HEL-45
kelp_IT_high_1: Rubritalea marina; Verrucomicrobia strain SCGC AAA168 F10; Akkermansia muciniphila, Verrucomicrobium spinosum DSM 4136; Marinobacter salarius; Marinobacter algicola
kelp_IT_high_7: Owenweeksia hongkongensis; Flavobacteria strain MS024 2A; Kordia algicida OT-1
kelp_IT_high_8: Limnobacter; Limnobacter sp. MED105; Limnobacter sp. MED105; Marinobacter sps
kelp_IT_high_16: Flavobacteriaceae; SAR86 cluster strain SAR86C; Tenacibaculum sp. MED152
kelp_IT_high_20: Rhodobacteraceae; Rhodobacteraceae strain HTCC2150; Roseovarius nubinhibens
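The tetranucleotide comparison reported for coral_IL_high bin 13 can be illustrated with a simplified sketch: count all 256 tetranucleotides in each sequence and correlate the two frequency vectors. This does not attempt to reproduce the exact calculation used by JSpeciesWS; it assumes raw relative frequencies on a single strand and a plain Pearson correlation. The function names and the two short sequences are hypothetical.

```python
from itertools import product
from math import sqrt

# All 256 possible tetranucleotides, in a fixed order shared by every profile.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_frequencies(sequence):
    """Return the relative frequency of each tetranucleotide in a sequence."""
    seq = sequence.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skip windows containing N or other ambiguous bases
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in TETRAMERS]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical short sequences standing in for a bin and a reference genome.
bin_seq = "ATGCGATACGCTTGAGGCTAACGTTAGCATCGGATCCA"
ref_seq = "ATGCGTTACGCTAGAGGCTTACGATAGCATCGCATCCA"
r = pearson(tetra_frequencies(bin_seq), tetra_frequencies(ref_seq))
print(f"tetranucleotide correlation: {r:.3f}")
```

Real genomes are millions of bases long, so their tetranucleotide profiles are far more stable than those of the toy sequences used here.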
Discussion
We present a set of evaluation parameters to optimize the workflow for
reconstructing metagenome-assembled genomes from environmental microbial
communities. The assembly evaluation parameters are the number of contigs,
contig length, the proportion of reads assembled, genus richness, and evenness;
the binning evaluation parameters are GC content, species richness, and genome
completeness. The selection of four projects, containing 105 metagenomes,
accounts for the variation in biological and procedural biases that is common
in microbiome studies. By including these variables in the optimization, rather
than using mock communities or a few metagenomes [22, 26, 31, 43, 52], we
tested the tools under realistic conditions and identified biases. For our
datasets, the SPAdes assembler and the MetaBat binning tool provided optimal
results, and our evaluation techniques could be used to explore and evaluate
new assemblers and binning tools.
Assembly evaluation parameters
The metagenomic variations within the projects influenced the performance of
the assemblers. Contig length, the number of contigs, and the proportion of
reads assembled showed that MetaVelvet performed poorly, so it was not
considered further. Both IDBA and SPAdes are built on de Bruijn graphs. They
differ in that IDBA iteratively adjusts the k-mer size based on the input
[28, 52], whereas SPAdes sequentially assembles the metagenomes with k-mer
sizes between 21 and 127 [27]. We observed that SPAdes-assembled contigs were
longer for Illumina samples than for IonTorrent samples. We predict that,
because SPAdes further fragments the reads into k-mers of different sizes to
form contigs, the overlapping region between forward and reverse Illumina reads
facilitates the formation of longer contigs [27]. More reads were incorporated
into contigs for coral environment samples when using SPAdes and for kelp
forest samples when using IDBA, which could be due to bias in how the
algorithms handle the variability within the microbial communities. We included
two additional parameters, species richness and evenness, to account for
shortcuts in the assembly algorithms, such as a data reduction step that
discards low-abundance sequences, and for the formation of contig chimeras
[53]. A decrease in
species richness compared to the unassembled metagenomes would suggest contig
chimeras. However, all assemblies showed a slight increase in species richness
and conserved evenness, suggesting that minimal contig chimeras were
constructed by IDBA or SPAdes. IDBA performance was more uniform, suggesting
that the assembler treats all datasets the same and does not take advantage of
underlying structure in the metagenomes, such as longer reads. The IDBA
documentation is minimal [52], and this may affect the user's ability to use
the assembler to its full potential. In conclusion, the applied parameters
showed that SPAdes provides the best contig continuity and minimal contig
chimerism across four different microbial environments and displayed
flexibility with each of the biological and platform biases. Although conducted
on far less data, other studies have also found that SPAdes provides longer
contigs with more reads used in the assembly [26, 52].
Fig. 5 Proteome comparison of the reconstructed population genome (coral_IL_high bin 13) against the genome's closest neighbors
Parvibaculum lavamentivorans (a), Alpha proteobacterium IMCC14465 (b), and Pelagibacter ubique (c). The outer ring represents
the contig of the reference species, the middle ring represents the reference bacterial species, and the innermost ring
represents the potentially novel population genome, with the color scale representing the protein similarity
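The richness and evenness check described in this section can be sketched as a comparison of two taxonomic profiles, one from the unassembled reads and one from the contigs. The text does not state which evenness index was used, so the sketch assumes Pielou's evenness (Shannon index divided by the natural log of richness). The function names and the counts are hypothetical.

```python
from math import log


def richness(counts):
    """Number of taxa with at least one assigned sequence."""
    return sum(1 for c in counts.values() if c > 0)


def pielou_evenness(counts):
    """Pielou's evenness J' = H' / ln(S), where H' is the Shannon index."""
    present = [c for c in counts.values() if c > 0]
    s = len(present)
    if s < 2:
        return 0.0  # evenness is undefined for fewer than two taxa; 0.0 by convention here
    total = sum(present)
    shannon = -sum((c / total) * log(c / total) for c in present)
    return shannon / log(s)


def compare_profiles(reads, contigs):
    """Return (change in richness, change in evenness) from reads to contigs."""
    return (richness(contigs) - richness(reads),
            pielou_evenness(contigs) - pielou_evenness(reads))


# Hypothetical genus counts before and after assembly.
reads_profile = {"Vibrio": 400, "Alteromonas": 350, "Ruegeria": 250}
contig_profile = {"Vibrio": 40, "Alteromonas": 36, "Ruegeria": 24, "Polaribacter": 1}

d_rich, d_even = compare_profiles(reads_profile, contig_profile)
print(f"richness change: {d_rich:+d}, evenness change: {d_even:+.3f}")
if d_rich < 0:
    print("richness dropped after assembly: possible contig chimeras or lost taxa")
```

Following the reasoning in the text, a drop in richness after assembly would be the warning sign, whereas stable or slightly higher richness with conserved evenness suggests few chimeric contigs.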
Binning tool selection
MetaBat was selected as the optimal binning tool because its bins had minimal
GC variation, low species richness, and high genome completeness, and may
therefore each represent a single genome. MetaBat extracted fewer bins than
GroopM. MetaBat bins were further validated using taxonomic identification,
which showed that the workflow reconstructed 66 metagenome-assembled genomes.
These metagenome-assembled genomes include sequences of closely related
species; therefore, they were identified to the genus level. Each
metagenome-assembled genome also contained sequences belonging to distant
bacterial species, suggesting possible horizontal gene transfers or novel
sequences with no genome relative in the database. Metagenome-assembled genomes
of Arcobacter extracted from coral reefs were studied to identify unique genes
that were previously not associated with the genomes cultured from other
environments [9]. Identification of potentially novel genomes extracted from
metagenomes relies on the presence of marker genes [32, 54]. A novel population
bin (coral_IL_high bin 13) had all the bacterial genome markers used in CheckM
and was phylogenetically affiliated with the bacterial species Parvibaculum
lavamentivorans, with 44% proteome similarity using FOCUS. Further analysis
with ANI and JSpeciesWS (TCS) suggested 82.22% similarity to Pelagibacter
ubique. An ANI > 95% corresponds to over 70% DNA-DNA hybridization, the level
that indicates species relatedness; bin 13 falls below this species-level
threshold. The conflicting results of two k-mer-based tools suggest that the
genomes are novel and therefore do not closely match organisms in the
databases. In addition, several databases need to be used in the description of
metagenome-assembled genomes to overcome any database bias. The resulting
metagenome-assembled genomes enable linking taxa to function to understand the
role of the population in the microbial community, and we are currently
investigating the role of these genomes in the coral reef environment [14]. Our
pipeline meets the minimum standards for metagenome-assembled genomes [55]. In
the process, novel genomes, genes, and sequences were identified, which can now
be deposited in a database to improve future annotation [29, 32, 56].
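The species-level reasoning above can be written as a small check. The 95% ANI threshold and the two ANI values for coral_IL_high bin 13 are taken from the text; the function names are hypothetical, and the sketch only applies the threshold to values that have already been computed by an external ANI tool.

```python
# ANI above 95% is described in the text as corresponding to over 70% DNA-DNA hybridization.
SPECIES_ANI_THRESHOLD = 95.0


def same_species(ani_percent, threshold=SPECIES_ANI_THRESHOLD):
    """Return True if an ANI value exceeds the species-level threshold."""
    return ani_percent > threshold


def best_ani_hit(ani_by_reference):
    """Return (reference name, ANI) for the reference with the highest ANI."""
    name = max(ani_by_reference, key=ani_by_reference.get)
    return name, ani_by_reference[name]


# ANI values reported for coral_IL_high bin 13 in the Results.
bin13_ani = {
    "Alphaproteobacteria IMCC14465": 63.50,
    "Parvibaculum lavamentivorans": 62.52,
}

name, ani = best_ani_hit(bin13_ani)
verdict = "same species" if same_species(ani) else "below the species threshold"
print(f"best hit: {name} (ANI {ani:.2f}%), {verdict}")
```

Because even the best hit is far below 95%, the check supports the interpretation that bin 13 does not belong to either reference species.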
Conclusions
We present a set of assembly and binning evaluation parameters for selecting an
optimized workflow to reconstruct metagenome-assembled genomes (see Additional
file 10). The set of parameters provides biologically relevant information
regarding richness, evenness, and GC content to help infer the optimal tools
for a dataset. Using these parameters, we found that the optimized workflow for
the four metagenome projects was SPAdes assembly followed by MetaBat binning,
regardless of the metagenomic variations. However, the metagenomic variations
within each project did result in differing quality of the metagenome-assembled
genomes. Communities that have high coverage of phylogenetically distinct
organisms and low taxonomic diversity resulted in better-quality genome
reconstruction.
Additional files
Additional file 1: Table S1. Metagenomes used in this study. List of
metagenomes used in the analysis and the sequencing statistics. (DOCX 20 kb)
Additional file 2: Figure S1. Microbial diversity in the 4 microbiome
projects. Representation of microbial diversity using, (a) genus richness,
(b) genus evenness, (c) Shannon diversity, and (d) Simpson diversity of
the four projects, which are represented on the x axis. The box
represents 50% of the data ranges around the median. The outliers for
each case are represented as black dots. (DOCX 167 kb)
Additional file 3: Table S2. Post hoc Tukey HSD test results for
diversity analysis. Post hoc Tukey HSD test results for Shannon, Simpson,
Richness and Evenness for the four projects. (DOCX 14 kb)
Additional file 4: Table S3. Assembly statistics. QUAST results for the
12 contig files assembled using the three assemblers: IDBA, MetaVelvet,
and SPAdes. (DOCX 16 kb)
Additional file 5: Table S4. Post hoc Tukey HSD test results for contig
length. Post hoc Tukey test results comparing contig length of 1000
contigs across assemblers and projects. (DOCX 18 kb)
Additional file 6: Table S5. Post hoc Tukey HSD test results for mean
reads assembled. Post hoc Tukey test results for the mean reads assembled
(%) of 1000 contigs across assemblers and projects. (DOCX 18 kb)
Additional file 7: Table S6. Assembly evaluation parameters. List of all
the assembly evaluation parameters. (DOCX 16 kb)
Additional file 8: Table S7. Binning tool evaluation parameters. List of
the parameters for the GroopM and MetaBat extracted bins. (DOCX 51 kb)
Additional file 9: Table S8. Comparison of metagenome-assembled
genomes. Comparison of the genome parameters of the novel
metagenome-assembled genome (coral_IL_high bin 13) against the three
closest genomes from the database. (DOCX 13 kb)
Additional file 10: Optimized workflow. Guide to the optimized workflow
for reconstructing metagenome-assembled genomes. A description of the
programs used in this study at each step and the calculation of the
evaluation parameters is provided as a step-by-step workflow. (DOCX 148 kb)
Abbreviations
ANI: Average Nucleotide Identity; ANOVA: ANalysis Of VAriance;
BWA: Burrows-Wheeler Aligner; CCOM: Contig Clustering of Metagenomics;
CheckM: Check genome completeness; coral_IL_high: coral high diversity;
coral_IT_low: coral low diversity; FOCUS: Find Organisms by Composition
USage; GroopM: Group Metagenomes; IDBA: Iterative De Bruijn graph
Assembler; IL: Illumina MiSeq; IT: Ion Torrent PGM; kelp_IL_low: kelp low
diversity; kelp_IT_high: kelp high diversity; MetaBat: Metagenome Binning
with Abundance and Tetra-nucleotide frequencies;
MetaVelvet: (METAgenomic-Velvet assembler); MG-RAST: MetaGenomics-
Rapid Annotation using Subsystems Technology; PATRIC: Pathosystems
Resource Integration Center; PEAR: Paired-End Read merger;
PhyloSift: Phylogenetic analysis of genomes and metagenomes;
PRINSEQ: PReprocessing and INformation of SEQuence data; QUAST: Quality
Assessment for Genome Assemblies; RAST: Rapid Annotations using
Subsystems Technology; SPAdes: (St. Petersburg genome assembler);
TCS: Tetranucleotide Correlation Search
Acknowledgments
We acknowledge funding from CNPq, FAPERJ, and CAPES and permits provided
by Brazilian federal government license, USA National Fisheries and Wildlife
Permit for sample collection. We would also like to thank National Center for
Genome Analysis Support (NSF Awards DBI-1458641 and ABI-1062432).
Funding
All the work was conducted at San Diego State University. We thank the funding
support from National Science Foundation Grants NSF Division of Undergraduate
Education #1323809, NSF Division of Molecular and Cellular Science #1330800,
and NSF Division of Computer and Network Systems CNS-1305112.
Availability of data and materials
The metagenomes analyzed in this study are available on MG-RAST
repository, and their MG-RAST IDs are in Additional file 1: Table S1.
Authors' contributions
Conceived and designed the experiments: BP and ED. Data collection and
sequencing experiments: JMH, MD, MM, FT, and KW. Performed
metagenomic analysis: BP, DTB, DP, and GGS. Statistical analysis: BP and PZ.
Critical revision of the manuscript: RE and ED. All authors read and approved
the final manuscript.
Ethics approval and consent to participate
No animal ethics approval was required. This research was conducted under
the Brazilian federal government license (SISBIO no. 101122). We received
this license to access protected areas from Parque Nacional Marinho de
Abrolhos/IBAMA (Instituto Brasileiro do Meio Ambiente e dos Recursos
Naturais Renováveis). The macroalgae were collected under the USA
National Fisheries and Wildlife Permit # SC - 13075.
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
Author details
1 Bioinformatics and Medical Informatics, San Diego State University, San Diego, California, USA.
2 National Center for Genome Analysis Support, Indiana University, Bloomington, Indiana, USA.
3 Department of Biology, San Diego State University, 5500 Campanile Drive, San Diego 92115, California, USA.
4 Computational Science Research Center, San Diego State University, San Diego, California, USA.
5 Department of Biology, University of New South Wales, Sydney, New South Wales, Australia.
6 Department of Mathematics and Statistics, San Diego State University, San Diego, California, USA.
7 Institute of Biology, Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro, Brazil.
8 Department of Computer Science, San Diego State University, 5500 Campanile Drive, San Diego, California, USA.
Received: 8 June 2017 Accepted: 13 November 2017
References
1. Lederberg J, McCray AT. 'Ome Sweet 'Omics - a genealogical treasury of words. Sci. 2001;
17(7):88.
2. Albertsen M, Hugenholtz P, Skarshewski A, Nielsen KL, Tyson GW, Nielsen PH.
Genome sequences of rare, uncultured bacteria obtained by differential
coverage binning of multiple metagenomes. Nat Biotechnol. 2013;31(6):533-8.
3. Hugenholtz P. Exploring prokaryotic diversity in the genomic era. Genome
Biol. 2002;3(2):reviews0003.0001-8.
4. Locey KJ, Lennon JT. Scaling laws predict global microbial diversity. Proc
Natl Acad Sci. 2016;113(21):5970-5.
5. Dinsdale EA, Pantos O, Smriga S, Edwards RA, Angly F, Wegley L, Hatay M,
Hall D, Brown E, Haynes M, et al. Microbial ecology of four coral atolls in the
northern Line Islands. PLoS One. 2008b;3(2):e1584.
6. Doane MP, Haggerty JM, Kacev D, Papudeshi B, Dinsdale EA. The skin
microbiome of the common thresher shark (Alopias vulpinus) has low
taxonomic and gene function beta-diversity. Environ Microbiol Rep. 2017;
9(4):357-73.
7. Haggerty JM, Dinsdale EA. Distinct biogeographical patterns of marine
bacterial taxonomy and functional genes. Glob Ecol Biogeogr. 2016;26(2):
177-90.
8. Haggerty JM, Papudeshi B, Vega A, Morris M, Doane M, Norman H,
Dinsdale E. Taxonomic selection and metabolic
strategies during bacterial succession of decomposing giant kelp,
Macrocystis pyrifera. In review.
9. Haggerty JM, Papudeshi B, Walsh K, Turner MB, Francini-Filho R,
Silveira CB, Harkins TT, Edwards RA, Thompson FL, Dinsdale EA.
Hunt for the super-heterotroph:
investigating the gene content of rarer coral reef bacterial genera. In review.
10. Morris MM, Haggerty JM, Papudeshi BN, Vega AA, Edwards MS, Dinsdale EA.
Altered microbial abundance and community composition affect recruitment
and development in gametophytes of giant kelp, Macrocystis pyrifera.
Front Microbiol. 2016. https://doi.org/10.3389/fmicb.2016.01800.
11. Peterson J, Garges S, Giovanni M, McInnes P, Wang L, Schloss JA, Bonazzi V,
McEwen JE, Wetterstrand KA, Deal C, et al. The NIH human microbiome
project. Genome Res. 2009;19(12):2317-23.
12. Dinsdale EA, Edwards RA, Bailey BA, Tuba I, Akhter S, McNair K, Schmieder R,
Apkarian N, Creek M, Guan E, et al. Multivariate analysis of functional
metagenomes. Front Genet. 2013;4:41.
13. Coutinho FH, Meirelles PM, Moreira APB, Paranhos RP, Dutilh BE, Thompson
FL. Niche distribution and influence of environmental parameters in marine
microbial communities: a systematic review. PeerJ. 2015;3:e1008.
14. Walsh K, Haggerty JM, Doane M, Hansen J, Morris M, Moreira AP, de Oliveira L,
Leomil L, Garcia G, Thompson FL, Dinsdale EA. Aura-biomes are present in the
water layer above coral reef benthic macro-organisms. PeerJ. 2017;5:e3666.
15. Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM, Furlan M,
Desnues C, Haynes M, Li L, et al. Functional metagenomic profiling of nine
biomes. Nature. 2008a;452(7187):629-32.
16. Kelly LW, Williams GJ, Barott KL, Carlson CA, Dinsdale EA, Edwards RA, Haas
AF, Haynes M, Lim YW, McDole T, et al. Local genomic adaptation of coral
reef-associated microbiomes to gradients of natural variability and
anthropogenic stressors. Proc Natl Acad Sci. 2014;111(28):10227-32.
17. Jensen S, Bourne DG, Hovland M, Murrell JC. High diversity of
microplankton surrounds deep-water coral reef in the Norwegian Sea. FEMS
Microbiol Ecol. 2012;82(1):75-89.
18. Bruce T, Meirelles PM, Garcia G, Paranhos R, Rezende CE, de Moura RL, Filho
R-F, Coni EOC, Vasconcelos AT, Amado Filho G, et al. Abrolhos Bank reef
health evaluated by means of water quality, microbial diversity, benthic
cover, and fish biomass data. PLoS One. 2012;7(6):e36687.
19. Fernandes N, Steinberg P, Rusch D, Kjelleberg S, Thomas T. Community
structure and functional gene profile of bacteria on healthy and diseased
thalli of the red seaweed Delisea pulchra. PLoS One. 2012;7(12):e50854.
20. Cassman N, Prieto-Davó A, Walsh K, Silva GG, Angly F, Akhter S, Barott K,
Busch J, McDole T, Haggerty JM. Oxygen minimum zones harbor novel viral
communities with low diversity. Environ Microbiol. 2012;14(11):3043-65.
21. Huggett JF, Laver T, Tamisak S, Nixon G, O'Sullivan DM, Elaswarapu R,
Studholme DJ, Foy CA. Considerations for the development and application
of control materials to improve metagenomic microbial community
profiling. Accred Qual Assur. 2012;18(2):77-83.
22. Sczyrba A, Hofmann P, Belmann P, Koslicki D, Janssen S, Droege J, Gregor I,
Majda S, Fiedler J, Dahms E, et al. Critical Assessment of Metagenome
Interpretation - a benchmark of computational metagenomics software.
bioRxiv 099127; https://doi.org/10.1101/099127.
23. Prakash T, Taylor TD. Functional assignment of metagenomic data:
challenges and applications. Brief Bioinform. 2012;13(6):711-27.
24. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool
for genome assemblies. Bioinformatics. 2013;29(8):1072-5.
25. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient
alignment of short DNA sequences to the human genome. Genome Biol.
2009;10(3):1-10.
26. Garcia-Lopez R, Vazquez-Castellanos JF, Moya A. Fragmentation and
coverage variation in viral metagenome assemblies, and their effect in
diversity calculations. Front Bioeng Biotechnol. 2015;3:141.
27. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS.
SPAdes: a new genome assembly algorithm and its applications to single-
cell sequencing. J Comput Biol. 2012;19:455-77.
28. Peng Y, Leung HC, Yiu SM, Chin FY. IDBA-UD: a de novo assembler for
single-cell and metagenomic sequencing data with highly uneven depth.
Bioinformatics. 2012;28(11):1420-8.
29. Dutilh BE, Schmieder R, Nulton J, Felts B, Salamon P, Edwards RA, Mokili JL.
Reference-independent comparative metagenomics using cross-assembly:
crAss. Bioinformatics. 2012;28(24):3225-31.
30. Imelfort M, Parks D, Woodcroft BJ, Dennis P, Hugenholtz P, Tyson GW.
GroopM: an automated tool for the recovery of population genomes from
related metagenomes. PeerJ. 2014;2:e603.
31. Kang DD, Froula J, Egan R, Wang Z. MetaBAT, an efficient tool for accurately
reconstructing single genomes from complex microbial communities. PeerJ.
2015;3:e1165.
32. Sangwan N, Xia F, Gilbert JA. Recovering complete and draft population
genomes from metagenome datasets. Microbiome. 2016;4:8.
33. Karlin S, Mrazek J, Campbell AM. Compositional biases of bacterial genomes
and evolutionary implications. J Bacteriol. 1997;179(12):3899-913.
34. Cleary B, Brito IL, Huang K, Gevers D, Shea T, Young S, Alm EJ. Detection of
low-abundance bacterial strains in metagenomic datasets by eigengenome
partitioning. Nat Biotechnol. 2015;33(10):1053-60.
35. Pride DT, Meinersmann RJ, Wassenaar TM, Blaser MJ. Evolutionary
implications of microbial genome tetranucleotide frequency biases.
Genome Res. 2003;13(2):145-58.
36. Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner FO. TETRA: a web-
service and a stand-alone program for the analysis and comparison of
tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics.
2004;5(1):163.
37. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM:
assessing the quality of microbial genomes recovered from isolates, single
cells, and metagenomes. Genome Res. 2015;25(7):1043-55.
38. Darling AE, Jospin G, Lowe E, Matsen FA, Bik HM, Eisen JA. PhyloSift:
phylogenetic analysis of genomes and metagenomes. PeerJ. 2014;2:e243.
39. Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K,
Gerdes S, Glass EM, Kubal M. The RAST server: rapid annotations using
subsystems technology. BMC Genomics. 2008;9:75.
40. Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, Tiedje
JM. DNA-DNA hybridization values and their relationship to whole-genome
sequence similarities. Int J Syst Evol Microbiol. 2007;57(Pt 1):81-91.
41. Richter M, Rosselló-Móra R, Oliver Glöckner F, Peplies J. JSpeciesWS: a web
server for prokaryotic species circumscription based on pairwise genome
comparison. Bioinformatics. 2016;32(6):929-31.
42. Nielsen HB, Almeida M, Juncker AS, Rasmussen S, Li J, Sunagawa S, Plichta
DR, Gautier L, Pedersen AG, Le Chatelier E, et al. Identification and assembly
of genomes and genetic elements in complex metagenomic samples
without using reference genomes. Nat Biotech. 2014;32(8):822-8.
43. Gupta A, Kumar S, Prasoodanan VPK, Harish K, Sharma AK, Sharma VK.
Reconstruction of bacterial and viral genomes from multiple Metagenomes.
Front Microbiol. 2016;7:469.
44. Meyer F, Paarmann D, D'Souza M, Olson R, Glass E, Kubal M, Paczian T,
Rodriguez A, Stevens R, Wilke A, et al. The metagenomics RAST server - a
public resource for the automatic phylogenetic and functional analysis of
metagenomes. BMC Bioinformatics. 2008;9(1):1-8.
45. Schmieder R, Edwards R. Quality control and preprocessing of
metagenomic datasets. Bioinformatics. 2011;27(6):863-4.
46. Zhang J, Kobert K, Flouri T, Stamatakis A. PEAR: a fast and accurate Illumina
paired-end reAd mergeR. Bioinformatics. 2013;30(5):614-20.
47. Clarke K, Gorley RN. PRIMER v7: User Manual/Tutorial. Plymouth: PRIMER-E;
2015. 296 pp.
48. Silva GGZ, Cuevas DA, Dutilh BE, Edwards RA. FOCUS: an alignment-free
model to identify organisms in metagenomes using non-negative least
squares. PeerJ. 2014;2:e425.
49. Raheema JY: Contig clustering of Metagenomics (CCOM): a tool that
generates population genomes (bins) to analyze and capture uncultured
genomes. Thesis. San Diego: Montezuma Publishing: San Diego State
University; 2016.
50. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I,
Hamelryck T, Kauff F, Wilczynski B, et al. Biopython: freely available python
tools for computational molecular biology and bioinformatics.
Bioinformatics. 2009;25(11):1422-3.
51. Wattam AR, Davis JJ, Assaf R, Boisvert S, Brettin T, Bun C, Conrad N, Dietrich
EM, Disz T, Gabbard JL, et al. Improvements to PATRIC, the all-bacterial
bioinformatics database and analysis resource center. Nucleic Acids Res.
2017;45(D1):D535-42.
52. Vollmers J, Wiegand S, Kaster A-K. Comparing and evaluating metagenome
assembly tools from a microbiologist's perspective - not only size matters!
PLoS One. 2017;12(1):e0169662.
53. Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT. Scaling
metagenome sequence assembly with probabilistic de Bruijn graphs. Proc
Natl Acad Sci. 2012;109(33):13272-7.
54. Yuan C, Lei J, Cole J, Sun Y. Reconstructing 16S rRNA genes in
metagenomic data. Bioinformatics. 2015;31(12):i35-43.
55. Bowers RM, Kyrpides NC, Stepanauskas R, Harmon-Smith M, Doud D, Reddy
TBK, Schulz F, Jarett J, Rivers AR, Eloe-Fadrosh EA, et al. Minimum
information about a single amplified genome (MISAG) and a metagenome-
assembled genome (MIMAG) of bacteria and archaea. Nat Biotechnol. 2017;
35(8):725-31.
56. Dupont CL, Rusch DB, Yooseph S, Lombardo M-J, Alexander Richter R, Valas
R, Novotny M, Yee-Greenbaum J, Selengut JD, Haft DH, et al. Genomic
insights to SAR86, an abundant and uncultivated marine bacterial lineage.
ISME J. 2012;6(6):1186-99.
... The purpose was to ensure more confidence and conduct a deeper analysis of the functional genes found in the Domas Crater environment. In bioinformatics, there is the concept of MAGs, which is a strategy for reconstructing genomes from metagenomic datasets (Setubal 2021) through assembly and binning (Papudeshi et al. 2017). This study used bioinformatics tools, specifically MEGAHIT, for the assembly and a combination of Metabat2, Maxbin2, Concoct, and DAStools for binning. ...
Article
Full-text available
Santoso HB, Suwanto A, Pratama R. 2024. Exploring the microbial diversity and functional potential of Domas Crater of Mount Tangkuban Perahu, Indonesia, through Shotgun Metagenomics. Biodiversitas 25: 4613-4626. Domas Crater, an extreme environment in Indonesia, is known for high temperatures and acidic conditions, providing a unique habitat for specialized microbial communities. These extreme conditions increase the possibility of discovering thermophilic enzymes with valuable biotechnological applications. Therefore, this study aimed to explore the microbial diversity in Domas Crater using shotgun metagenomics to analyze both previously reported microbes and novel microorganisms comprehensively. Shotgun metagenomics is particularly advantageous in identifying microbial species that cannot be cultured using conventional methods, enabling the exploration of microorganisms with considerable potential. The application of next-generation sequencing technologies and bioinformatics tools allowed the successful reconstruction of eight high-quality Metagenome-Assembled Genomes (MAGs), a testament to the technical proficiency of the study. The genomes were further characterized based on the functional genes, including the enzymes in carbohydrate metabolism or Carbohydrate-active enzymes (CAZyme), biosynthetic gene clusters for secondary metabolite (BGCs), and genes associated with micronutrient metabolism. The results showed that the microbial community was dominated by Hydrogenobaculum and Sulfurisphaera, both known for adaptation to extreme environments. Moreover, the first Hydrogenobaculum and Thermocladium were recorded in Indonesia as the novel discoveries of the study. These findings highlight the significance of Domas Crater as a reservoir for novel microbial species, particularly in terms of thermophilic microorganisms with unique enzymatic properties.
... Species abundance profiles. The literature describes abundance and species phylogenies as the main parameters controlling MAG recovery [41]. In this study, we define community First, we proceeded with species selection and sequence retrieval from the National Center for Biotechnology Information (NCBI). ...
Article
Full-text available
We hypothesize that sample species abundance, sequencing depth, and taxonomic relatedness influence the recovery of metagenome-assembled genomes (MAGs). To test this hypothesis, we assessed MAG recovery in three in silico microbial communities composed of 42 species with the same richness but different sample species abundance, sequencing depth, and taxonomic distribution profiles using three different pipelines for MAG recovery. The pipeline developed by Parks and colleagues (8K) generated the highest number of MAGs and the lowest number of true positives per community profile. The pipeline by Karst and colleagues (DT) showed the most accurate results (~ 92%), outperforming the 8K and Multi-Metagenome pipeline (MM) developed by Albertsen and collaborators. Sequencing depth influenced the accurate recovery of genomes when using the 8K and MM, even with contrasting patterns: the MM pipeline recovered more MAGs found in the original communities when employing sequencing depths up to 60 million reads, while the 8K recovered more true positives in communities sequenced above 60 million reads. DT showed the best species recovery from the same genus, even though close-related species have a low recovery rate in all pipelines. Our results highlight that more bins do not translate to the actual community composition and that sequencing depth plays a role in MAG recovery and increased community resolution. Even low MAG recovery error rates can significantly impact biological inferences. Our data indicates that the scientific community should curate their findings from MAG recovery, especially when asserting novel species or metabolic traits.
... Such organisms often encode and produce thermo-and alkali-stable proteins, which have evolved to sustain functionality under extreme conditions. 29,30 More than 99% of extremophilic species, however, cannot be cultured using standard microbiology protocols 31,32 and, thus, metagenomic analyses are typically employed for such purposes, either via experimental functional screening or through bioinformatic approaches. 33−37 Bioinformatic approaches are exceptionally high-throughput and enable the screening of billions of potential enzymeencoding genes present in extensive, metagenomic data sets. ...
Article
Full-text available
Taking immediate action to combat the urgent threat of CO₂-driven global warming is crucial for ensuring a habitable planet. Decarbonizing the industrial sector requires implementing sustainable carbon-capture technologies, such as biomimetic hot potassium carbonate capture (BioHPC). BioHPC is superior to traditional amine-based strategies due to its eco-friendly nature. This innovative technology relies on robust carbonic anhydrases (CAs), enzymes that accelerate CO₂ hydration and endure harsh industrial conditions like high temperature and alkalinity. Thus, the discovery of highly stable CAs is crucial for the BioHPC technology advancement. Through high-throughput bioinformatics analysis, we identified a highly thermo- and alkali-stable CA, termed CA-KR1, originating from a metagenomic sample collected at a hot spring in Kirishima, Japan. CA-KR1 demonstrates remarkable stability at high temperatures and pH, with a half-life of 24 h at 80 °C and retains activity and solubility even after 30 d in a 20% (w/v) K₂CO₃/pH 11.5 solution─a standard medium for HPC. In pressurized batch reactions, CA-KR1 enhanced CO₂ absorption by >90% at 90 °C, 20% K₂CO₃, and 7 bar. To our knowledge, CA-KR1 constitutes the most resilient CA biocatalyst for efficient CO₂ capture under HPC-relevant conditions, reported to date. CA-KR1 integration into industrial settings holds great promise in promoting efficient BioHPC, a potentially game-changing development for enhancing carbon-capture capacity toward industrial decarbonization.
... Hence, this review aims to deliver a comprehensive overview of the different types of metagenomic binning methods (Fig. 1), discuss their challenges, study new trends and highlight the areas that require improvement. It must be noted that this review does not focus on benchmarking binning methods, as detailed benchmarking studies on various datasets have already been published [22][23][24][25][26]. This review is a comprehensive starting point for beginners entering the field of computational metagenomics and will pave the way for improvement in the research field. ...
Article
Full-text available
Metagenomics involves the study of genetic material obtained directly from communities of microorganisms living in natural environments. The field of metagenomics has provided valuable insights into the structure, diversity and ecology of microbial communities. Once an environmental sample is sequenced and processed, metagenomic binning clusters the sequences into bins representing different taxonomic groups such as species, genera, or higher levels. Several computational tools have been developed to automate the process of metagenomic binning. These tools have enabled the recovery of novel draft genomes of microorganisms allowing us to study their behaviors and functions within microbial communities. This review classifies and analyzes different approaches of metagenomic binning and different refinement, visualization, and evaluation techniques used by these methods. Furthermore, the review highlights the current challenges and areas of improvement present within the field of research.
... DNA reads were assembled by SPAdes (v3.13.0) with "-k 21,33,55,77--careful". "--sc" was additionally enabled for Multiple Displacement Amplification (MDA) data to deal with non-uniform coverage and to remove potential chimeras (Bankevich et al. 2012;Nurk et al. 2013;Papudeshi et al. 2017;Prjibelski et al. 2020;Xu and Zhao 2018). GapCloser (v1.1.2, ...
Article
Full-text available
Genomes are incredibly dynamic within diverse eukaryotes and programmed genome rearrangements (PGR) play important roles in generating genomic diversity. However, genomes and chromosomes in metazoans are usually large in size which prevents our understanding of the origin and evolution of PGR. To expand our knowledge of genomic diversity and the evolutionary origin of complex genome rearrangements, we focus on ciliated protists (ciliates). Ciliates are single-celled eukaryotes with highly fragmented somatic chromosomes and massively scrambled germline genomes. PGR in ciliates occurs extensively by removing massive amounts of repetitive and selfish DNA elements found in the silent germline genome during development of the somatic genome. We report the partial germline genomes of two spirotrich ciliate species, namely Strombidium cf. sulcatum and Halteria grandinella, along with the most compact and highly fragmented somatic genome for S. cf. sulcatum. We provide the first insights into the genome rearrangements of these two species and compare these features with those of other ciliates. Our analyses reveal: (1) DNA sequence loss through evolution and during PGR in S. cf. sulcatum has combined to produce the most compact and efficient nanochromosomes observed to date; (2) the compact, transcriptome-like somatic genome in both species results from extensive removal of a relatively large number of shorter germline-specific DNA sequences; (3) long chromosome breakage site motifs are duplicated and retained in the somatic genome, revealing a complex model of chromosome fragmentation in spirotrichs; (4) gene scrambling and alternative processing are found throughout the core spirotrichs, offering unique opportunities to increase genetic diversity and regulation in this group. Supplementary Information The online version contains supplementary material available at 10.1007/s42995-023-00213-x.
... "Real-world" data is commonly used in benchmarking and optimization studies for genome assemblers and binners, as seen for example in a study by Papudeshi et al. [9] comparing three assemblers and two binners. This has the advantage of reflecting the complexity of the data processed with such programs, but means the underlying composition of the community used for testing remains unknown [10,11]. ...
Article
Full-text available
Background The possibility of recovering metagenome-assembled genomes (MAGs) from sequence reads allows for further insights into microbial communities and their members, possibly even analyzing such sequences with tools designed for single-isolate genomes. As result quality depends on sequence quality, performance of tools for single-isolate genomes on MAGs should be tested beforehand. Bioinformatics can be leveraged to quickly create varied synthetic test sets with known composition for this purpose. Results We present MAGICIAN, a flexible, user-friendly pipeline for the simulation of MAGs. MAGICIAN combines a synthetic metagenome simulator with a metagenomic assembly and binning pipeline to simulate MAGs based on user-supplied input genomes, allowing users to test performance of tools on MAGs while having a ground truth to compare results to. Using MAGICIAN, we found that even very slight (1%) changes in depth of coverage can drastically affect whether a genome can be recovered. We also demonstrate the use of simulated MAGs by evaluating the suitability of such genomes obtained with MAGICIAN’s current default pipeline for analysis with the antimicrobial resistance gene identification tool ResFinder. Conclusions Using MAGICIAN, it is possible to simulate MAGs which, while generally high in quality, reflect issues encountered with real-world data, thus providing realistic best-case data. Evaluating the results of ResFinder analysis of these genomes revealed a risk for plausible-looking false positives, which underlines the need for pipeline validation so that researchers are aware of the potential issues when interpreting real-world data. Furthermore, the effects of fluctuations in depth of coverage on genome recovery in our simulated “random sequencing” warrant further investigation and indicate random subsampling of reads may affect discovery of more genomes.
Chapter
Abstract In the 21st century, conventional techniques for elucidating specific gene sequences and molecular components for bacterial identification, characterization, and classification are being replaced. Genomics and proteomics have introduced novel tools and techniques to fulfill various clinical and research goals. Most molecular approaches for identifying bacteria are based on DNA amplification or sequencing. These techniques range from straightforward DNA amplification-based procedures to more intricate ones based on mass spectrometry, targeted gene and whole-genome sequencing, and restriction fragment analysis. In addition, methods based on distinctive protein signatures, including matrix-assisted laser desorption/ionization time-of-flight mass spectrometry and related variants, are being investigated. Herein, we briefly discuss traditional diagnostic laboratory methods for identifying pathogenic bacteria and explain the use of genotypic and proteomic technologies while providing an overview of the combined approaches of both methods. Moreover, various applications of nanotechnology (i.e., immune-based nanosensors, aptasensors, etc.), lab-on-a-chip techniques, machine learning technology, microelectromechanical systems, and Raman spectroscopy techniques in bacterial identification and diagnosis are summarized.
Preprint
Full-text available
Elasmobranch epidermal microbiomes are species-specific, yet microbial assembly and retainment drivers are mainly unknown. The contribution of host-derived factors in recruiting an associated microbiome is essential for understanding host-microbe interactions. Here, we focus on the physical aspect of the host skin in structuring microbial communities. Each species of elasmobranch exhibits unique denticle morphology, and we investigate whether microbial communities and functional pathways are correlated with the morphological features or follow the phylogeny of the three species. We extracted and sequenced the DNA from the epidermal microbial communities of three captive shark species: Horn (Heterodontus francisci), Leopard (Triakis semifasciata), and Swell shark (Cephaloscyllium ventriosum) and use electron microscopy to measure the dermal denticle features of each species. Our results outline species-specific microbial communities, as microbiome compositions vary at the phyla level; C. ventriosum hosted a higher relative abundance of Pseudomonadota and Bacillota, while H. francisci were associated with a higher prevalence of Euryarchaeota and Aquificae, and Bacteroidota and Crenarchaeota were ubiquitous with T. semifasciata. Functional pathways performed by each species' respective microbiome were species-specific metabolic. Microbial genes associated with aminosugars and electron-accepting reactions were correlated with the distance between dermal denticles, whereas desiccation stress genes were only present when the dermal denticle overlapped. Microbial genes associated with Pyrimidines, chemotaxis and virulence followed the shark phylogeny. Microbial genera display associations that resemble host evolutionary lineage, while others had linear relationships with interdenticle distance. Therefore, denticle morphology was a selective influence for some microbes and functions in the microbiome contributing to the phylosymbiosis.
Chapter
Full-text available
Next-generation sequencing (NGS) is a new technique for determining DNA/RNA sequences for the entire genome of interest at a fraction of the cost as compared to traditional sequencing methods. Nowadays, major plant and animal species are being sequenced using NGS platforms. These sequencing platforms produce massive amounts of biological data in various file formats, which may then be analyzed using a variety of computational tools. NGS can reveal a wealth of information on genetic variants, transcriptome dynamics, transcription factors, epigenetic changes, and more. The number of NGS applications is continually growing, necessitating more effective and innovative data storage, analysis, and visualization methods. In this chapter, we have covered the NGS data analysis pipeline, which included pre-processing of NGS data, genome and transcriptome assembly approaches like de novo assembly and reference-based assembly, gene prediction, and annotation along with identification of genetic marker as well as computational tools and software used for these analyses. NGS technologies are currently used for whole genome sequencing, investigation of genome diversity, metagenomics, epigenetics, discovery of non-coding RNAs and protein-binding sites, and gene-expression profiling by RNA sequencing.
Preprint
Full-text available
In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on datasets of unprecedented complexity and realism. Benchmark metagenomes were generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids, including genomes with varying degrees of relatedness to each other and to publicly available ones and representing common experimental setups. Across all datasets, assembly and genome binning programs performed well for species represented by individual genomes, while performance was substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below the family level. Parameter settings substantially impacted performances, underscoring the importance of program reproducibility. While highlighting current challenges in computational metagenomics, the CAMI results provide a roadmap for software selection to answer specific research questions.
Article
Full-text available
Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.
Article
Full-text available
As coral reef habitats decline worldwide, some reefs are transitioning from coral-to algal-dominated benthos with the exact cause for this shift remaining elusive. Increases in the abundance of microbes in the water column has been correlated with an increase in coral disease and reduction in coral cover. Here we investigated how multiple reef organisms influence microbial communities in the surrounding water column. Our study consisted of a field assessment of microbial communities above replicate patches dominated by a single macro-organism. Metagenomes were constructed from 20 L of water above distinct macro-organisms, including (1) the coral Mussismilia braziliensis, (2) fleshy macroalgae (Stypopodium, Dictota and Canistrocarpus), (3) turf algae, and (4) the zoanthid Palythoa caribaeorum and were compared to the water microbes collected 3 m above the reef. Microbial genera and functional potential were annotated using MG-RAST and showed that the dominant benthic macro-organisms influence the taxa and functions of microbes in the water column surrounding them, developing a specific ''aura-biome''. The coral aura-biome reflected the open water column, and was associated with Synechococcus and functions suggesting oligotrophic growth, while the fleshy macroalgae aura-biome was associated with Ruegeria, Pseudomonas, and microbial functions suggesting low oxygen conditions. The turf algae aura-biome was associated with Vibrio, Flavobacterium, and functions suggesting pathogenic activity, while zoanthids were associated with Alteromonas and functions suggesting a stressful environment. Because each benthic organism has a distinct aura-biome, a change in benthic cover will change the microbial community of the water, which may lead to either the stimulation or suppression of the recruitment of benthic organisms.
Article
Full-text available
We present two standards developed by the Genomic Standards Consortium (GSC) for reporting bacterial and archaeal genome sequences. Both are extensions of the Minimum Information about Any (x) Sequence (MIxS). The standards are the Minimum Information about a Single Amplified Genome (MISAG) and the Minimum Information about a Metagenome-Assembled Genome (MIMAG), including, but not limited to, assembly quality, and estimates of genome completeness and contamination. These standards can be used in combination with other GSC checklists, including the Minimum Information about a Genome Sequence (MIGS), Minimum Information about a Metagenomic Sequence (MIMS), and Minimum Information about a Marker Gene Sequence (MIMARKS). Community-wide adoption of MISAG and MIMAG will facilitate more robust comparative genomic analyses of bacterial and archaeal diversity.
Article
Full-text available
The health of sharks is linked with emergent properties of its microbiome. Most marine organisms have mucus overlying the skin, but shark have dermal denticles that protrude above the mucus. We characterized the microbiome from the skin of the common thresher shark (Alopias vulpinus) to investigate the structure and composition of the skin microbiome. An average of 618,812 reads per metagenomic library contained open reading frames (80.9% ± S.D. 0.44%), and 7.6 to 12.8% matched known protein sequences. Genera distinguishing the A. vulpinus microbiome from the water column included, Pseudoalteromonas (12.8% ± 4.7 of sequences), Erythrobacter (5. 3% ± 0.5), Limnobacter (4.1% ± 1.4), and Idiomarina (4.2% ± 1.2) and gene pathways included, cobalt, zinc, and cadmium resistance (2.2% ± 0.1); iron acquisition (1.2% ± 0.1); ton/tol transport (1.3% ± 0.08); and n-Phenylalkanoic acid degradation (0.9% ± 0.08). Taxonomic β-diversity of the shark (77.6) was higher than the water column (70.6) and a reference host microbiome (algae: 71.5), and functional β-diversity of the shark (87.4) was similar to water (82.9) and algae (87.5). We conclude the A. vulpinus skin microbiome is influenced by filtering processes, that include biochemical and biophysical components of the shark skin and result in a highly structured microbiome, confirmed by high β-diversity. This article is protected by copyright. All rights reserved.
Article
Full-text available
With the constant improvement in cost-efficiency and quality of Next Generation Sequencing technologies, shotgun-sequencing approaches -such as metagenomics- have nowadays become the methods of choice for studying and classifying microorganisms from various habitats. The production of data has dramatically increased over the past years and processing and analysis steps are becoming more and more of a bottleneck. Limiting factors are partly the availability of computational resources, but mainly the bioinformatics expertise in establishing and applying appropriate processing and analysis pipelines. Fortunately, a large diversity of specialized software tools is nowadays available. Nevertheless, choosing the most appropriate methods for answering specific biological questions can be rather challenging, especially for non-bioinformaticians. In order to provide a comprehensive overview and guide for the microbiological scientific community, we assessed the most common and freely available metagenome assembly tools with respect to their output statistics, their sensitivity for low abundant community members and variability in resulting community profiles as well as their ease-of-use. In contrast to the highly anticipated "Critical Assessment of Metagenomic Interpretation" (CAMI) challenge, which uses general mock community-based assembler comparison we here tested assemblers on real Illumina metagenome sequencing data from natural communities of varying complexity sampled from forest soil and algal biofilms. Our observations clearly demonstrate that different assembly tools can prove optimal, depending on the sample type, available computational resources and, most importantly, the specific research goal. 
In addition, we present detailed descriptions of the underlying principles and pitfalls of publically available assembly tools from a microbiologist’s perspective, and provide guidance regarding the user-friendliness, sensitivity and reliability of the resulting phylogenetic profiles.
Article
Marine microbes mediate key ecological processes in kelp forest ecosystems and interact with macroalgae. Pelagic and biofilm-associated microbes interact with macroalgal propagules at multiple stages of recruitment, yet these interactions have not been described for Macrocystis pyrifera. Here we investigate the influence of microbes from coastal environments on recruitment of giant kelp, M. pyrifera. Through repeated laboratory experiments, we tested the effects of altered pelagic microbial abundance on the settlement and development of the microscopic propagules of M. pyrifera during recruitment. M. pyrifera zoospores were reared in laboratory microcosms exposed to environmental microbial communities from seawater throughout the haploid stages of the kelp recruitment cycle, from zoospore release, through zoospore settlement, to gametophyte germination and development. We altered the microbial abundance states differentially in three independent experiments with repeated trials, where microbes were (a) present or absent in seawater, (b) altered in community composition, and (c) altered in abundance. Within the third experiment, we also tested the effect of nearshore versus offshore microbial communities on the macroalgal propagules. Distinct pelagic microbial communities were collected from two southern California temperate environments reflecting contrasting intensity of human influence, the nearshore Point Loma kelp forest and the offshore Santa Catalina Island kelp forest. The Point Loma kelp forest is a highly impacted coastal region adjacent to the populous San Diego Bay, whereas the kelp forest at Catalina Island is a low-impact region of the Channel Islands, 40 km off the southern California coast, and is adjacent to a marine protected area. 
Kelp gametophytes reared with nearshore Point Loma microbes showed lower survival and growth, and deteriorated morphology, compared to gametophytes reared with the offshore Catalina Island microbial community, and these effects were magnified under high microbial abundances. Reducing the abundance of Point Loma microbes restored M. pyrifera propagule success. Yet an intermediate microbial abundance was optimal for kelp propagules reared with Catalina Island microbes, suggesting that microbes also have a beneficial influence on kelp. Our study shows that pelagic microbes from nearshore and offshore environments differentially influence kelp propagule success, which has significant implications for kelp recruitment and kelp forest ecosystem health.
Article
Aim While paradigms of macroecology are challenged by the high rates of reproduction, dispersal and horizontal gene exchange of bacterial communities, environmental DNA sequencing makes community profiles accessible. We test fundamental hypotheses of macroecological theories, showing that both taxonomic and functional classifications have distinct biogeographical variation across distance and environments depending on trophic composition. Location Studies spanning the global oceans. Methods Taxonomic and functional profiles were obtained from metagenomes and were compared across oceanographic regions and tested for patterns of co‐occurrence. The influences of sampling method (filter size), environmental variables and geographical distribution were compared with distance‐based linear models to test predictors of taxonomic and functional composition. Macroecological drivers were compared with bacterial community structure to test four biogeographical hypotheses: (1) no biogeographical patterns, (2) community structure reflects environmental dissimilarity, (3) community structure reflects distance, (4) community structure reflects environment and distance. Results Bacterial families were clustered into four trophic groups – phototrophic, oligotrophic, eutrophic and copiotrophic – by changes in abundance across oceanographic regions and co‐occurrence with core functions. Changes in community composition were best modelled by longitude for free‐living communities and dissolved oxygen for mixed communities of free‐living and particle‐associated bacteria. Both microhabitat and community assignment had an impact on biogeographical patterns, with taxonomic compositions following our hypotheses 2 and 4 and functional gene compositions following hypotheses 3 and 4. Main conclusions We described four trophic groups adding to the current dichotomy of the classification of marine bacteria as oligotrophic or copiotrophic. 
Taxonomic composition of mixed communities reflected environmental differences but not geographical distance, whereas functional gene composition in free‐living communities was independent of environmental dissimilarity and reflected geographical distance. Patterns of biogeography in bacterial communities differed depending on the description of taxa or function. Therefore, we developed a new paradigm for bacterial ecology which shows that some aspects of bacterial evolution depend on trophic complexity, history and current environmental conditions.
Article
The Pathosystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center (https://www.patricbrc.org). Recent changes to PATRIC include a redesign of the web interface and some new services that provide users with a platform that takes them from raw reads to an integrated analysis experience. The redesigned interface allows researchers direct access to tools and data, and the emphasis has changed to user-created genome-groups, with detailed summaries and views of the data that researchers have selected. Perhaps the biggest change has been the enhanced capability for researchers to analyze their private data and compare it to the available public data. Researchers can assemble their raw sequence reads and annotate the contigs using RASTtk. PATRIC also provides services for RNA-Seq, variation, model reconstruction and differential expression analysis, all delivered through an updated private workspace. Private data can be compared by 'virtual integration' to any of PATRIC's public data. The number of genomes available for comparison in PATRIC has expanded to over 80 000, with a special emphasis on genomes with antimicrobial resistance data. PATRIC uses this data to improve both subsystem annotation and k-mer classification, and tags new genomes as having signatures that indicate susceptibility or resistance to specific antibiotics.