Electronic supplementary material: The online version of this article (10.1186/s12864-017-4294-1) contains supplementary material, which is available to authorized users.
METHODOLOGY ARTICLE Open Access
Optimizing and evaluating the reconstruction of Metagenome-assembled microbial genomes
Bhavya Papudeshi (1,2), J. Matthew Haggerty (3), Michael Doane (3), Megan M. Morris (3), Kevin Walsh (3), Douglas T. Beattie (5), Dnyanada Pande (1), Parisa Zaeri (6), Genivaldo G. Z. Silva (4), Fabiano Thompson (7), Robert A. Edwards (8) and Elizabeth A. Dinsdale (3*)
Abstract
Background: Microbiome/host interactions describe characteristics that affect the host's health. Shotgun
metagenomics includes sequencing a random subset of the microbiome to analyze its taxonomic and metabolic
potential. Reconstruction of DNA fragments into genomes from metagenomes (called metagenome-assembled
genomes) assigns unknown fragments to taxa/function and facilitates discovery of novel organisms. Genome
reconstruction incorporates sequence assembly and sorting of assembled sequences into bins, characteristic of a
genome. However, the microbial community composition, including taxonomic and phylogenetic diversity, may
influence genome reconstruction. We determine the optimal reconstruction method for four microbiome projects
that had variable sequencing platforms (IonTorrent and Illumina), diversity (high or low), and environment (coral
reefs and kelp forests), using a set of parameters to select for optimal assembly and binning tools.
Methods: We tested the effects of the assembly and binning processes on population genome reconstruction
using 105 marine metagenomes from 4 projects. Reconstructed genomes were obtained from each project using 3
assemblers (IDBA, MetaVelvet, and SPAdes) and 2 binning tools (GroopM and MetaBat). We assessed the efficiency
of assemblers using statistics including contig continuity and contig chimerism, and the effectiveness of
binning tools using genome completeness and taxonomic identification.
Results: SPAdes assembled more contigs (143,718 ± 124 contigs) of longer length (N50 = 1632
± 108 bp) and incorporated the most sequences (sequences assembled = 19.65%). The microbial richness and
evenness were maintained across the assembly, suggesting low contig chimeras. SPAdes assembly was responsive
to the biological and technological variations within the project, compared with other assemblers. Among binning
tools, we conclude that MetaBat produced bins with less variation in GC content (average standard deviation: 1.49),
low species richness (4.91 ± 0.66), and higher genome completeness (40.92 ± 1.75) across all projects. MetaBat
extracted 115 bins from the 4 projects of which 66 bins were identified as reconstructed metagenome-assembled
genomes with sequences belonging to a specific genus. We identified 13 novel genomes, some of which were
100% complete, but show low similarity to genomes within databases.
Conclusions: We present a set of biologically relevant parameters for selecting optimal
assembly and binning tools. For the tools we tested, the SPAdes assembler and the MetaBat binning tool reconstructed
quality metagenome-assembled genomes for the four projects. We also conclude that metagenomes from
microbial communities that have high coverage of phylogenetically distinct taxa and low taxonomic diversity result in the
highest-quality metagenome-assembled genomes.
* Correspondence: elizabeth_dinsdale@hotmail.com
3 Department of Biology, San Diego State University, 5500 Campanile Drive, San Diego 92115, California, USA
Full list of author information is available at the end of the article
© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Papudeshi et al. BMC Genomics (2017) 18:915
DOI 10.1186/s12864-017-4294-1
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Background
Microbiome studies describe the significance of the microbial community associated with a host organism [1]. However, less than 1% of all microbial species can be cultured in the laboratory [2–4]; therefore, the application of culture-independent sequencing technology has revolutionized microbiome analysis [5–11]. Shotgun metagenomics provides a rapid assessment of microbial communities by sequencing a random subset of the genetic material from the environment [2, 6–10, 12]. Annotation of metagenomic DNA fragments is used to infer taxonomic and functional patterns within microbial communities across multiple environments, including oceans [7, 13], coral reefs [5, 9, 13–18], algae [19], and sharks [6]. However, linking functional genes from metagenomes to their taxonomic origin is a complex task, because the sequences belong to multiple genomes. In addition, many sequences may not match the database and therefore remain unidentified; for example, in the viral community collected from a marine oxygen minimum zone, only 2% of sequences were identified [20]. Improved sequencing technology and coverage have enabled reconstruction of fragments into metagenome-assembled genomes by the process of assembly and binning. However, genome reconstruction is affected by sequencing technology and the biological characteristics of the microbial community. Sequencers are currently restricted by an inverse relationship between sequence length and the number of reads. Longer reads provide more accurate annotation, whereas shorter reads produce greater coverage of the community. High coverage is preferred in diverse communities to identify rare species [21]. Similarly, if the divergence between the species in the metagenome is small, reconstruction of metagenome-assembled genomes becomes inherently difficult because the microbial genomes cannot be separated [2, 22]. It is unresolved how the sequencing characteristics of read length and depth interact with the biological variation of the microbial community during the reconstruction of genomes from real metagenomic datasets.
The first step in the reconstruction of genomes is assembly, where short metagenomic reads are joined based on sequence overlap to form longer sequences called contigs. Assemblers apply different algorithms, which may influence reconstructed genome quality. Incorrect assembly leads to ambiguous conclusions and reduces the number of annotations [23]. Therefore, assembly evaluation is an important step that includes both contig continuity and contig chimerism. The program QUAST (Quality Assessment for Genome Assemblies) describes contig continuity by reporting both contig length and the number of contigs [24]. Contig chimerism arises from random sequence overlap, which produces a contig containing sequences from divergent bacteria; chimeric contigs can be identified and removed using tools that assess read coverage, such as Bowtie [25]. While not often recognized, changes in species richness and evenness between raw sequences and assembled contigs can also be used to assess contig chimerism, because assemblers should maintain richness (number of taxa identified) while increasing evenness (greatest with an equal distribution of taxa) [26–28]. In addition, a substantial reduction in diversity may indicate chimera formation. Therefore, an optimal assembly will provide a high number of long contigs, a high proportion of reads assembled, conserved species richness, and increased species evenness.
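The richness and evenness comparison between reads and contigs can be illustrated with a minimal sketch. It assumes the Margalef richness and Pielou's evenness indices named in the Methods, computed here from raw taxon counts; the function names and the example counts are hypothetical and are not taken from the study, which used the Primer software for this step.

```python
import math


def margalef_richness(counts):
    """Margalef's richness d = (S - 1) / ln(N).

    S is the number of taxa with a non-zero count and N is the total count.
    """
    present = [c for c in counts if c > 0]
    total = sum(present)
    if total <= 1:
        raise ValueError("Margalef's index needs a total count greater than 1")
    return (len(present) - 1) / math.log(total)


def pielou_evenness(counts):
    """Pielou's evenness J' = H' / ln(S), where H' is the Shannon index."""
    present = [c for c in counts if c > 0]
    if len(present) < 2:
        raise ValueError("Pielou's evenness needs at least two taxa")
    total = sum(present)
    shannon = -sum((c / total) * math.log(c / total) for c in present)
    return shannon / math.log(len(present))


# Hypothetical taxon counts before and after assembly (illustration only).
read_counts = [700, 200, 100]
contig_counts = [40, 35, 25]

for label, counts in (("reads", read_counts), ("contigs", contig_counts)):
    print(label,
          round(margalef_richness(counts), 3),
          round(pielou_evenness(counts), 3))
```

In this toy case the same three taxa are present in both sets, while the contig counts are more evenly distributed and therefore give the higher evenness value.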
Binning reconstructs genomes of taxa from the individual contigs, allowing sequences with no homology to the databases to be annotated and the taxonomic origin of functional genes to be identified [29–31]. Binning groups phylogenetically related contigs into a bin, which represents a population genome containing the gene content of closely related species [32]. Binning tools group similar sequences based on sequence composition, an unsupervised approach that uses genomic signatures such as GC content [33], tetranucleotide frequencies [34–36], and read coverage per contig [2, 29, 30]. An ideal bin will represent one bacterial genome with minimal GC variation, minimal species richness, and ~100% genome completeness. To increase the quality of binning, tools are advancing from applications using one genome signature, such as GroopM (group metagenomes) [30] and cross assembly [29], to applications using a combination of genome signatures, such as MetaBat (Metagenome Binning with Abundance and Tetra-nucleotide frequencies) [31]. The quality of the resulting bins is assessed by calculating the variation in GC content, species richness, and predicted genome completeness using tools such as CheckM (check genome completeness) [37]. Bins containing sequences mainly from a single taxon are metagenome-assembled genomes. Bins that contain sequences similar to multiple taxa but include most of the bacterial marker genes may be novel population genomes. Identifying novel microbes is a crucial objective of reconstructing genomes from metagenomes. The phylogeny and genomic content of the novel genomes are investigated using tools such as CheckM [37], PhyloSift (phylogenetic analysis of genomes and metagenomes) [38], and RAST (Rapid Annotation using Subsystem Technology) [39]. Further, relatedness to known species can be identified using average nucleotide identity (ANI), which mirrors the results of DNA-DNA hybridization experiments in showing species relatedness [40]. In DNA-DNA hybridization, a 70% cut-off delineates species relatedness and is reflected in the ANI calculations as the proportion of protein-coding regions that align between two genomes [41]; an ANI > 95% represents species
relatedness [40]. As metagenomic analysis of microbial communities becomes more popular, many new genomic tools are being produced to analyze the DNA sequences (https://omictools.com). Each tool has benefits and drawbacks, and understanding how these analyses affect the results is essential to microbiologists. Previous evaluations of assemblers and binning tools have emphasized computational efficiency, including runtime and memory usage. Many of these analyses were completed on synthetic microbial communities rather than actual metagenomic data [22], using parameters such as the number of misassemblies, genome recall, and precision, which are a challenge to calculate on real datasets [22, 24, 31]. Another analysis used only one assembler and binning tool [42], without comparing the effects of the assembler on the dataset. Other studies have spiked genomic reads into metagenomes to investigate the number of reads required to reconstruct a draft metagenome-assembled genome [43]. In this paper, we investigate the effect of assembly and binning by comparing 105 metagenomes that were 1) recovered from different marine environments, 2) varied in diversity, and 3) sequenced on different sequencing platforms. Biologically relevant parameters are used to analyze the data after the application of each tool. We hypothesize that the biological characteristics will affect assembly and binning. First, the assembly quality for the three assemblers, IDBA (Iterative De Bruijn graph Assembler), MetaVelvet (METAgenomic-Velvet assembler), and SPAdes (St. Petersburg genome assembler), was assessed using a set of assembly statistics, including contig continuity and contig chimerism. The optimal assembler was applied to each project, followed by two composition-based binning tools, GroopM and MetaBat, to reconstruct genomes. These bins were assessed for genome completeness and taxonomic identification. Last, we explore the genomic content and phylogenetic relationships of a metagenome-assembled genome. Our pipeline is shown in Fig. 1.
Methods
Metagenomes collection
To test the effects of the assembly and binning processes
on population genome reconstruction, we used 105 mar-
ine metagenomes from 4 projects. The projects were
collected from coral atolls in Abrolhos Bank, Brazil
(coral) and Southern California kelp forests (kelp) (see
Additional file 1: Table S1). In two of the projects, the
microbial community was experimentally manipulated
before sequencing to reduce the diversity of the
microbes, and these projects are labeled as coral low
diversity (coral_IT_low) [9] and kelp low diversity
(kelp_IL_low) [8]. The other two projects are natural
microbial communities collected from the marine water associated with the same environments and are called coral high diversity (coral_IL_high) [14] and kelp high diversity (kelp_IT_high) [10]. The coral_IT_low and kelp_IT_high metagenomes were sequenced on an Ion Torrent PGM (IT) with the 200 sequencing kit (ThermoFisher Scientific), whereas the coral_IL_high and kelp_IL_low metagenomes were sequenced on an Illumina MiSeq (IL) with the v3 reagent cartridge, 600-cycle kit (Illumina Inc.). Many metagenomes are publicly available on MG-RAST (Metagenomics Rapid Annotation using Subsystem Technology); thus, the pipeline started with obtaining the metagenomes from this database [44] (Table 1). The variation between the projects was used to assess the repeatability of the workflow on datasets that vary in source environment, level of biological diversity, and sequencing platform.

Fig. 1 Overview of the workflow developed, with the tools applied at each step (in bold). a Metagenomic reads are assembled using three assemblers: SPAdes, MetaVelvet, and IDBA. b Optimization of the assembly tool using assembly statistics. c Assembled contigs from the optimal assembly were binned using MetaBat and GroopM. d Optimal binning tool selected through bin validation. Colors (black, dark grey, and light grey) depict different microbial species, with each line of a color representing a sequence belonging to a bacterial species

The first step in a metagenomic pipeline is to remove poor-quality sequences; each metagenome was therefore run through PRINSEQ (PReprocessing and INformation of SEQuence data) [45] to remove sequencing tags, duplicates, and Ns. Forward and reverse reads from the Illumina MiSeq platform were first paired using PEAR (Paired-End Read merger) [46]. All the reads from a project were placed together in one file and cross-assembled (i.e., all metagenomes from one project were assembled together) using three De Bruijn graph assemblers: IDBA, MetaVelvet, and SPAdes. Default kmer sizes were applied for each tool: IDBA (k_min: 25), MetaVelvet (kmer: 31), and SPAdes (kmers: 21, 33, and 55).
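To illustrate what the kmer size controls in a De Bruijn graph assembler, the following is a minimal textbook sketch, not the implementation used by IDBA, MetaVelvet, or SPAdes. Each read is cut into overlapping kmers, and every kmer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The function names and the two toy reads are hypothetical; error correction, reverse complements, and graph simplification are deliberately left out.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k):
    """Build a De Bruijn graph: nodes are (k-1)-mers, edges are kmers.

    Repeated kmers are kept as repeated edges, so edge multiplicity
    reflects how often a kmer was observed across the reads.
    """
    graph = defaultdict(list)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two hypothetical overlapping reads (illustration only).
reads = ["ATGGCGT", "GGCGTGC"]
graph = de_bruijn_graph(reads, k=4)

for node in sorted(graph):
    print(node, "->", graph[node])
```

A larger k makes edges more specific, which helps separate similar genomes, but it requires longer exact overlaps between reads; this trade-off is one reason assemblers such as SPAdes iterate over several kmer sizes.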
Assembly evaluation
Each assembler (IDBA, MetaVelvet, and SPAdes) provides one output contig file for each project, giving 12 contig files in total. We calculated the assembly statistics for the 12 contig files using QUAST [24], including N50 length, L50 (the number of contigs as long as or longer than N50), the number of contigs assembled, the length of the largest contig, and the total length of the assembly. Contig continuity was assessed using contig length (length of 1000 contigs from each of the 12 contig files) and the total number of contigs per assembly. Contig chimerism was first assessed by calculating the proportion of reads assembled (for 1000 contigs from each of the 12 contig files) using Bowtie [25]. FOCUS (Find Organisms by Composition USage), a taxa identification tool that is alignment independent, was applied to the 12 contig files. The resulting information was used to calculate the Margalef richness and Pielou's evenness of the 12 contig files using the Primer statistics tool [47]. FOCUS was used explicitly for this step, as each contig is assigned to bacterial species based on kmer ratios [48]. Contig chimeras will have variable kmer ratios, will remain unidentified by FOCUS, and are removed from further analysis. The second step for assessing contig chimerism was a comparison of the Margalef richness and Pielou's evenness of the 12 contig files against those of the metagenomic reads. The overall proportion of reads assembled into the entire assembly for the 12 contig files was also calculated using Bowtie.
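QUAST reports N50 and L50 directly, so the sketch below is only meant to show how the two statistics are derived from a list of contig lengths. The function name and the example lengths are hypothetical.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    Contigs are sorted from longest to shortest and summed until the
    running total reaches half of the assembly length. N50 is the length
    of the contig that crosses that point; L50 is how many contigs were
    needed to get there.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count


# Hypothetical contig lengths in bp (illustration only).
example_lengths = [100, 200, 300, 400, 500]
n50, l50 = n50_l50(example_lengths)
print("N50 =", n50, "bp; L50 =", l50)
```

For the example lengths the assembly totals 1500 bp, and the two longest contigs (500 bp and 400 bp) already cover more than half of it, so N50 is 400 bp and L50 is 2.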
The contigs from the optimal assembler for each project were selected and uploaded to the Contig Clustering of Metagenomics (CCOM) tool [49], along with their read files in FASTA format, to perform GroopM [30] and MetaBat [31] clustering. The CCOM tool runs the BWA (Burrows-Wheeler Aligner) to map reads onto contigs and outputs the alignments in BAM format. GroopM and MetaBat both use the contigs (.fasta) and mapped reads (.bam) as input to extract the resulting bins.
Bin validation
The CCOM tool extracted two sets of bins for each project, one from GroopM and one from MetaBat. Evaluation of binning tools was performed using bin characteristics including variation in GC content, species richness, and genome completeness. GC content was calculated using a custom Biopython [50] script. Taxonomic composition for each bin was predicted using FOCUS [48]. Margalef's species richness was calculated using Primer [47] from the FOCUS taxonomy results. Genome completeness was assessed using CheckM [37]. A bin was identified as a specific population genome if it included sequences belonging to a single genus. Species- or strain-level resolution could be used depending on the amount of coverage and diversity of the microbes. Potentially novel bins were identified as those with > 50% genome completeness that were not annotated by FOCUS. These potentially novel bins were further analyzed using CheckM [37], PhyloSift [38], and RAST [39], all of which predict the neighboring genomes using marker genes. The proteome content of a novel population genome was investigated using PATRIC (Pathosystems Resource Integration Center) [51], followed by calculation of the average nucleotide identity of the protein-encoding genes using the BLAST-based analysis (ANIb) and the tetranucleotide correlation search (TCS) in the JSpeciesWS tool [41].
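The GC variation measure used for bin validation can be sketched as follows. This is not the authors' Biopython script; it is a standard-library illustration that assumes GC variation is summarized as the standard deviation of per-contig GC percentages within a bin. The function names and the toy sequences are hypothetical.

```python
import statistics


def gc_content(sequence):
    """Return the GC percentage of a sequence, ignoring non-ACGT symbols."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def gc_variation(contigs):
    """Return (mean, sample standard deviation) of GC% across a bin.

    A bin needs at least two contigs for the standard deviation.
    """
    values = [gc_content(seq) for seq in contigs]
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical bin with three very short contigs (illustration only).
toy_bin = ["ATGC", "GGCC", "ATAT"]
mean_gc, sd_gc = gc_variation(toy_bin)
print("mean GC% =", mean_gc, "; SD =", sd_gc)
```

The sample standard deviation (`statistics.stdev`) is used here as an assumption; a population standard deviation (`statistics.pstdev`) would give slightly smaller values for small bins.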
Statistical analysis
The first statistical analysis was a one-way ANOVA (ANalysis Of VAriance) conducted on the unassembled metagenomes from each project to identify differences in microbial diversity. Assembly evaluation variables included the number of contigs, richness, and evenness, and binning tool evaluation variables included GC content, species richness, and genome completeness. These variables were tested for normality using the Shapiro-Wilk test, and non-normal data were log transformed when appropriate. Data containing many instances (> 5000), for example, contig length and percent of reads assembled, were tested for normality using the Kolmogorov-Smirnov test, and non-normal data were log transformed when appropriate. To test for differences between assemblers, a one-way ANOVA was conducted on the following variables: the number of contigs, richness, and evenness. A one-way ANOVA was used because there was only one data point for each variable per project, as the metagenomes were cross-assembled. To investigate whether the assemblers performed differently depending on the project, a two-way ANOVA model was conducted with the factors project, assembler, and the project-by-assembler interaction term for the variables contig length and reads assembled. For the two-way ANOVA, the data were subsampled to select the 1000 longest contigs in each project, because running statistics on all 300,000 contigs is not feasible. Tukey HSD post hoc comparisons were performed to identify the project that contributed to the differences. Similar statistics were conducted on the binning evaluation variables for the two binning tools, MetaBat and GroopM: to investigate whether the binning tools performed differently depending on the project, a two-way ANOVA model was conducted with the factors project, binning tool, and the project-by-binning-tool interaction term for the variables GC variation, richness, and genome completeness. Overall, the statistical analysis was implemented using R scripts and visualized using SigmaPlot (Systat Software, San Jose, CA).

Table 1 Background information on the projects used to evaluate the selection of assembly and binning tools
Project name    Source                      Number of metagenomes   Total number of reads   Sequencing technology   Environment
coral_IL_high   Abrolhos, Brazil, 2014      16                      20,711,400              Illumina MiSeq (IL)     Coral atolls (coral)
coral_IT_low    Abrolhos, Brazil, 2011      15                      18,323,050              IonTorrent PGM (IT)     Coral atolls (coral)
kelp_IL_low     San Diego, USA, 2015        51                      6,493,217               Illumina MiSeq (IL)     Kelp forest (kelp)
kelp_IT_high    San Diego, USA, 2012–2013   23                      9,769,952               IonTorrent PGM (IT)     Kelp forest (kelp)
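The one-way ANOVA used in the statistical analysis can be sketched in plain Python to show where the F statistic and its degrees of freedom come from. The study itself used R scripts; the function name and the numbers below are hypothetical and only mirror the design of one value per project and assembler (4 projects, 3 assemblers, 12 values).

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    k = len(groups)
    n_total = sum(len(group) for group in groups)
    grand_mean = sum(sum(group) for group in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    # Variation of the individual values around their own group mean.
    ss_within = sum(
        (value - sum(group) / len(group)) ** 2
        for group in groups
        for value in group
    )

    df_between = k - 1
    df_within = n_total - k
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical values: one row per project, one value per assembler.
values_by_project = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
    [4.0, 5.0, 6.0],
]
f_stat, df_b, df_w = one_way_anova(values_by_project)
print("F(%d, %d) = %.2f" % (df_b, df_w, f_stat))
```

Grouping 12 values by four projects gives 3 and 8 degrees of freedom, which is the form of the F statistic reported in the Results for the number of contigs per project. The P value would additionally require the F distribution, which is available in R or in SciPy.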
Results
Variation between projects
The metagenomes from four projects were downloaded
from MG-RAST (Table 1). Samples were from two envi-
ronments; coral atolls and kelp forest, sequenced on two
sequencing platforms; Illumina and IonTorrent (Table 1).
In each environment, a subset of samples was experi-
mentally manipulated before sequencing to reduce the
diversity of the microbes. Diversity measures were significantly different between the four projects (P < 0.05) (see Additional file 2: Figure S1). Tukey HSD post hoc tests conducted on the four diversity parameters showed that the coral_IT_low project was significantly lower in diversity than the remaining projects (P < 0.05) (see Additional file 3: Table S2). However, the manipulation
of the kelp_IL_low project did not result in a significant
decrease in taxonomic diversity.
Assembly evaluation
The 12 contig files (4 projects, 3 assemblers) were analyzed using QUAST, which identified that SPAdes and IDBA provided high contig continuity compared with MetaVelvet, which assembled fewer contigs with shorter lengths (see Additional file 4: Table S3).
Contig continuity was further assessed using contig length (length of 1000 contigs from each of the 12 contig files), the total number of contigs per assembly, and the proportion of reads assembled (1000 contigs from each of the 12 contig files). Each project assembled a significantly different number of contigs (F(3, 8) = 6.56, P = 0.01); a greater number of contigs was assembled for the Illumina projects (coral_IL_high = 209,144 ± 26,756, kelp_IL_low = 153,607 ± 11,954) than for the IonTorrent projects (coral_IT_low = 73,772 ± 3450, kelp_IT_high = 70,759 ± 15,380) (Fig. 2a). The length of 1000 contigs from the 12 files showed a significant difference between the three assemblers (F(2, 11994) = 133,077, P < 0.001) and the four projects (F(3, 11994) = 35,061, P < 0.001), and an interaction between the projects and assemblers (F(6, 11994) = 7551, P < 0.001). SPAdes provided longer contigs for the Illumina projects (coral_IL_high: 22,728 ± 5797 bp, kelp_IL_low: 14,957 ± 3660 bp) than for the IonTorrent projects (coral_IT_low: 697 ± 299 bp, kelp_IT_high: 638 ± 51 bp) (Fig. 2b). The IDBA assembler performed uniformly across the projects, with mean lengths varying from 3359 bp to 11,203 bp. A Tukey HSD post hoc test showed that all the project and assembler combinations were significant (see Additional file 5: Table S4).
Contig chimerism was assessed using Bowtie, which identifies the number of reads in the assembly by mapping the reads to contigs. Significant differences were observed for reads assembled (1000 contigs) between assemblers (F(2, 11988) = 29,139, P < 0.001) and projects (F(3, 11988) = 4677, P < 0.001), and for the interaction term between assemblers and projects (F(6, 11988) = 8046, P < 0.001) (see Additional file 6: Table S5). The differences were caused by the high-diversity samples (coral_IL_high, kelp_IT_high) having a lower proportion of reads assembled than the low-diversity samples (coral_IT_low, kelp_IL_low) (Fig. 2c). IDBA and SPAdes followed this pattern, except for the IDBA coral_IT_low assembly, which incorporated a lower number of reads (Fig. 2c). SPAdes was found to be selective for the coral atoll projects (coral_IL_high, coral_IT_low), providing contigs with a higher read coverage than for the kelp forest samples (kelp_IL_low, kelp_IT_high) (Fig. 2c).
The richness and evenness of the assembled sequences were compared against those of their respective unassembled reads and showed no significant difference in diversity after assembly (richness: P = 0.92; evenness: P = 0.91), suggesting that microbial richness was maintained with minimal chimera formation. Similarly, microbial evenness did not show a significant difference between assemblers (Fig. 2) (see Additional file 7: Table S6).
Overall, the assessment showed that the SPAdes assembly generated contigs of longer length (N50: 1632 bp) with a higher proportion of reads assembled into contigs (reads assembled (all contigs): 19.65 ± 1.41%) compared with IDBA (N50: 1024 ± 7.15 bp, reads assembled (all contigs): 16.83 ± 1.56%). However, the SPAdes assembler performed selectively for the different projects (Fig. 2), suggesting that the underlying biology and sequencer affect assembly. The assembly provided by IDBA was similar across all projects, suggesting it is not responsive to the underlying biology of the microbial communities. MetaVelvet performed poorly in all aspects. In addition, the SPAdes assembly showed no significant bias in richness and evenness compared to the reads, suggesting a lower proportion of contig chimerism. Therefore, based on our data on contig continuity and contig chimerism, we selected SPAdes as the optimal assembler.
Binning tools evaluation
SPAdes-assembled contigs for the four projects were binned using two different binning tools, GroopM and MetaBat. The GroopM binning tool applies only one genome signature, contig coverage (i.e., it groups contigs that have a similar proportion of reads combined from each metagenome), and this process extracted a high number of bins (coral_IL_high: 71, coral_IT_low: 31, kelp_IL_low: 117, and kelp_IT_high: 37 bins). MetaBat applies a combination of two genome signatures, contig coverage and tetranucleotide frequency, and the more stringent parameters extracted fewer bins (coral_IL_high: 57, coral_IT_low: 17, kelp_IL_low: 17, and kelp_IT_high: 24 bins).
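The tetranucleotide frequency signature can be illustrated with a short sketch. It is not MetaBat's algorithm, which relies on a more elaborate distance model combined with coverage; it only shows how a contig is turned into a 256-value composition profile and how two profiles can be compared with a plain Euclidean distance, chosen here purely for illustration. The function names and toy sequences are hypothetical, and reverse complements are not merged.

```python
import math
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides in a fixed order.
TETRAMERS = ["".join(bases) for bases in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(sequence):
    """Return the relative frequency of each tetranucleotide in a sequence.

    Windows containing symbols other than A, C, G, or T are ignored.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[tetramer] for tetramer in TETRAMERS)
    if total == 0:
        raise ValueError("sequence has no valid tetranucleotides")
    return [counts[tetramer] / total for tetramer in TETRAMERS]


def euclidean_distance(profile_a, profile_b):
    """Euclidean distance between two frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


# Hypothetical toy contigs (illustration only).
contig_a = "ATGCATGCATGCATGC"
contig_b = "ATGCATGCATGC"
contig_c = "GGGGCCCCGGGGCCCC"

profile_a = tetranucleotide_frequencies(contig_a)
profile_b = tetranucleotide_frequencies(contig_b)
profile_c = tetranucleotide_frequencies(contig_c)

print("a vs b:", round(euclidean_distance(profile_a, profile_b), 3))
print("a vs c:", round(euclidean_distance(profile_a, profile_c), 3))
```

Contigs a and b share the same repeating motif and end up close together, whereas contig c has a different composition and lies farther away; combining such a composition signal with coverage is what makes the two-signature approach more stringent than coverage alone.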
Fig. 2 Assembly evaluation of the IDBA, MetaVelvet, and SPAdes assemblers for cross-assembled contigs from the four projects (coral_IL_high,
coral_IT_low, kelp_IL_low, and kelp_IT_high), based on the parameters (a) number of contigs, (b) mean contig length for 1000 contigs (bp), (c) mean
reads assembled for 1000 contigs (%), (d) richness, and (e) evenness. The performance of each assembler is shown for all five
parameters in each project; the lines in graphs b and c represent the standard errors
Papudeshi et al. BMC Genomics (2017) 18:915 Page 6 of 13

The population genome bins obtained from GroopM and MetaBat were evaluated
for the following parameters: variation in GC content, genus richness, and
genome completeness (Fig. 3a). Two-way ANOVA was performed on variation in GC
content, genus richness, and genome completeness and identified differences
between binning tools (GC variation: $F_{1,368} = 4.43$, $P < 0.03$; genus
richness: $F_{1,362} = 37.56$, $P < 0.001$; genome completeness:
$F_{1,367} = 24.78$, $P < 0.001$). Significant interaction between the
projects and binning tools was detected for GC variation
($F_{3,368} = 19.18$, $P < 0.001$), richness ($F_{3,362} = 4.96$,
$P < 0.001$), and genome completeness ($F_{3,367} = 3.88$, $P < 0.001$).
The bins that MetaBat produced from the low diversity coral reef and kelp
forest projects were each dominated by one or a few species, showing that low
diversity samples separate into better population genomes. The bins extracted
by GroopM for the low diversity kelp project were poorly separated, with
multiple taxa identified in each bin (Fig. 3b). MetaBat bins had greater
genome completeness than GroopM bins for all the projects except
coral_IT_low (Fig. 3c). Overall, MetaBat produced bins with less variation in
GC content, lower species richness (4.91 ± 0.66), and higher genome
completeness (40.92 ± 1.75) than GroopM (species richness: 7.41 ± 0.66,
genome completeness: 25.17 ± 1.80) (see Additional file 8: Table S7),
irrespective of the project (Fig. 3).
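
To make the first binning parameter concrete, the following minimal sketch computes the GC content of each contig in a bin and the spread of those values. It assumes that "variation in GC content" can be summarized as the sample standard deviation of per-contig GC percentages; the function names and the toy sequences are hypothetical illustrations and are not taken from the workflow used in this study.

```python
from statistics import stdev


def gc_percent(seq: str) -> float:
    """Return the GC content (%) of a sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def gc_variation(contigs) -> float:
    """Sample standard deviation of per-contig GC% within one bin.

    Assumption: this is one plausible reading of "variation in GC content";
    a bin with fewer than two contigs is reported as 0.0.
    """
    values = [gc_percent(c) for c in contigs]
    return stdev(values) if len(values) > 1 else 0.0


# Hypothetical toy bins: contigs with similar GC vs. contigs with mixed GC.
homogeneous_bin = ["GGCCGGCA", "GGCCGCCA", "GCGCGGCA"]
mixed_bin = ["GGCCGGCC", "ATATATAT", "ATGCATGC"]

if __name__ == "__main__":
    print("homogeneous:", round(gc_variation(homogeneous_bin), 2))
    print("mixed:", round(gc_variation(mixed_bin), 2))
```

Under this reading, a bin whose contigs come from a single genome would be expected to show a small standard deviation, whereas a bin mixing distantly related taxa would tend to show a larger one.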
Bin validation and metagenome-assembled genome
identification
An ideal reconstruction of a microbial genome would be one where each bin
represents one metagenome-assembled genome that includes a high abundance of
contigs of closely related species. Therefore, the taxonomic composition of
the MetaBat bins was identified using FOCUS. Because these are reconstructed
genomes from metagenomic data, some of the contigs placed into a bin may not
have a taxonomic annotation, and these contigs will represent novel genomic
material from the environment. In addition, some of the contigs placed in
similar bins will have mixed taxonomic assignments, suggesting that these
contigs have come from organisms phylogenetically similar to those in the
database, which cannot be separated by this process. In some bins, most
contigs will have a similar taxonomic identification, with a few contigs from
distinct taxa; these could be DNA that has been horizontally transferred or
contamination by contigs that cannot be sorted by the binning process.
Identifying novel organisms, sister species, and horizontally transferred DNA
is an important part of the reconstruction process and will increase the
description of microbial diversity.
Each project produced a different proportion of metagenome-assembled genomes
that were similar to a single genus: coral_IL_high showed 46.42%,
coral_IT_low showed 88.23%, kelp_IL_low showed 64.70%, and kelp_IT_high
showed 62.5% (Fig. 4). Genus level classification was applied to identify
closely related species. Kelp_IL_low bin 9 and bin 13 contained multiple
genera (Ketogulonicigenium, Ruegeria, and Roseobacter), suggesting that these
bins contain sequences belonging to family Rhodobacteraceae and thus could
represent closely related novel species. Several bins contained a high
abundance of sequences belonging to one microbial genus (Alteromonas or
Vibrio metagenome-assembled genomes); however, they also included sequences
belonging to other distantly related taxa. A proportion of bins from each
project had high completeness, but the genus identification was not apparent
through FOCUS, suggesting they could be potential novel genomes (shown in
black in Fig. 4). The proportion of potentially novel genomes varied between
projects; for example, coral_IT_low showed no potentially novel genomes,
whereas 51.78% of the coral_IL_high metagenome-assembled genomes were
potentially novel.
Investigating novel metagenome-assembled genomes
Overall, 13 bins (coral_IL_high: 7 bins, kelp_IL_low: 1 bin, and
kelp_IT_high: 5 bins) had more than 50% completeness with ambiguous genus
identifications (Table 2). These bins contain sequences with similar
tetranucleotide frequencies, similar contig coverage profiles, and high
genome completeness (presence of bacterial marker genes). The 13 potentially
novel metagenome-assembled genomes were analyzed using marker genes and
alignment to identify their closest phylogenetic neighbors using CheckM,
PhyloSift, RAST, and ANI (Table 2). Of the 13 bins, 8 were identified by two
or more tools as the same microbial taxon: coral_IL_high bin 13 contains
sequences belonging to class Alphaproteobacteria; coral_IL_high bin 14 is
phylogenetically similar to the genus Alteromonas; coral_IL_high bin 41 to a
marine gammaproteobacterium; coral_IL_high bin 54 to the SAR86 cluster;
kelp_IL_low bin 5 to Oceanibulbus indolifex; kelp_IT_high bin 8 to
Limnobacter sp.; kelp_IT_high bin 7 to order Flavobacteriales; and
kelp_IT_high bin 20 to family Rhodobacteraceae (Table 2).

Fig. 3 Evaluation of the binning tools MetaBat (white) and GroopM (grey) using three parameters: (a) variation in GC content, shown as a box and
whisker plot where the mean value is represented by a bold line in the box and the second line represents the median value for the data, (b) species
richness, and (c) genome completeness (%). Error bars are one standard error
Distinguishing novel metagenome-assembled genomes
A single metagenome-assembled genome, coral_IL_high bin 13, had 100% genome
completeness, containing all 104 conserved bacterial marker genes. The
metagenome-assembled genome was phylogenetically affiliated with
Parvibaculum lavamentivorans by CheckM and RAST, and with Alphaproteobacteria
IMCC14465 by PhyloSift. In terms of GC content, genome size, the number of
protein-encoding genes, and the number of RNA genes, the reconstructed genome
(coral_IL_high bin 13) was more similar to Parvibaculum lavamentivorans than
to Alphaproteobacteria IMCC14465 (see Additional file 9: Table S8). However,
the proteome of the reconstructed genome showed only 44.12% similarity to
both Parvibaculum lavamentivorans and Alphaproteobacteria IMCC14465 (Fig. 5a
and b). The average nucleotide identity (ANI) of the novel population genome
was 63.50% with Alphaproteobacteria IMCC14465 and 62.52% with Parvibaculum
lavamentivorans. The tetranucleotide frequencies of the novel
metagenome-assembled genome were further compared against a database and
found to be 82.22% similar to Pelagibacter ubique. Proteome comparison
against Pelagibacter ubique showed 90.35% similarity (Fig. 5c), compared with
the 44.12% shown earlier (Fig. 5b). Coral_IL_high bin 13 has twice the GC
content, genome size, number of protein-encoding genes, and number of RNA
sequences of Pelagibacter ubique (see Additional file 9: Table S8), so we
suggest it is a novel genome within the Alphaproteobacteria. The
identification of the novel genomes provides support that the
metagenome-assembled genomes contain environmentally relevant genomic
material that is not in the cultured relatives from the databases.

Fig. 4 Taxonomic identification of the MetaBat bins using FOCUS for the four projects: (a) coral_IL_high, (b) coral_IT_low, (c) kelp_IL_low, and (d)
kelp_IT_high. Population genomes belonging to 32 genera were identified with abundance > 20%, and their relative abundance in a
bin is plotted. We also include a category, potentially novel population genomes, in black to represent bins that were identified to different taxa
with low abundance. We predict that bins that have high species richness and greater than 50% genome completeness are potentially novel
population genomes

Table 2 List of 13 novel bins identified from the four projects, and the closest neighbor with similarity index using CheckM, PhyloSift, RAST, and JSpeciesWS
Project | Number of contigs | Completeness | GC (%) | Genome size (Mbp) | Gene count | CheckM | PhyloSift | RAST | JSpeciesWS (best hit that has >90%)
coral_IL_high_13 769 100 53.7 3.96 4342 Parvibaculum lavamentivorans Alphaproteobacteria strain IMCC14465 Parvibaculum lavamentivorans Pelagibacter ubique
coral_IL_high_14 1215 99.14 44.3 8.85 7897 Alphaproteobacteria strain HIMB5 Alteromonas macleodii Alteromonas mediterranea Alteromonas naphthalenivorans
coral_IL_high_26 1554 87.93 34.6 6.72 7756 Verrucomicrobia SAR86 cluster bacterium SAR86A Ruegeria sp. R11, Roseobacter denitrificans OCh 114 Alteromonas mediterranea
coral_IL_high_28 270 52.27 39.4 1.03 1175 Alteromonas taeanensis Flavobacteria strain MS024 2A Polaribacter sp. MED152 SAR116 cluster alpha proteobacterium HIMB100
coral_IL_high_41 825 79.67 56.7 2.99 3384 Gammaproteobacteria strain HIMB55 marine gammaproteobacteria strain HTCC2080
coral_IL_high_49 3536 50.62 51.1 12.97 12,794 Bacteria Gammaproteobacteria strain IMCC3088
coral_IL_high_54 297 93.1 37.8 2.51 2776 unresolved SAR86 cluster strain SAR86E SAR86 cluster bacterium SAR86E
kelp_IL_low_5 1046 90.7 62.8 5.16 5903 Oceanibulbus indolifex Oceanibulbus indolifex Oceanibulbus indolifex HEL-45
kelp_IT_high_1 848 82.45 51.6 5.04 7180 Rubritalea marina Verrucomicrobia strain SCGC AAA168 F10 Akkermansia muciniphila, Verrucomicrobium spinosum DSM 4136 Marinobacter salarius Marinobacter algicola
kelp_IT_high_7 487 77.27 43.3 1.97 3260 Owenweeksia hongkongensis Flavobacteria strain MS024 2A Kordia algicida OT-1
kelp_IT_high_8 1224 54.55 52.1 2.98 5637 Limnobacter Limnobacter sp. MED105 Limnobacter sp. MED105 Marinobacter sps
kelp_IT_high_16 1011 57.94 39.5 2.06 3981 Flavobacteriaceae SAR86 cluster strain SAR86C Tenacibaculum sp. MED152
kelp_IT_high_20 836 84.8 52 4.19 6227 Rhodobacteraceae Rhodobacteraceae strain HTCC2150 Roseovarius nubinhibens
Discussion
We present a set of evaluation parameters to optimize the workflow for
reconstructing metagenome-assembled genomes from environmental microbial
communities. The assembly evaluation parameters are the number of contigs,
contig length, the proportion of reads assembled, genus richness, and
evenness; the binning evaluation parameters are GC content, species richness,
and genome completeness. The selection of the four projects, containing 105
metagenomes, accounts for the variation in biological and procedural biases
that is common in every microbiome study. By including these variables in the
optimization, rather than using mock communities or a few metagenomes
[22, 26, 31, 43, 52], we tested the tools under realistic conditions and
identified biases. For our datasets, the SPAdes assembler and the MetaBat
binning tool provided optimal results, and our evaluation techniques could be
used to explore and evaluate new assemblers and binning tools.
Assembly evaluation parameters
The metagenomic variations within the projects influenced the performance of
the assemblers. In selecting an optimal assembler, contig length, the number
of contigs, and the proportion of reads assembled showed that MetaVelvet
performed poorly, and it was not considered further. Both IDBA and SPAdes are
based on de Bruijn graphs. They differ in that IDBA iteratively improves the
k-mer size based on the input [28, 52], whereas SPAdes sequentially assembles
the metagenomes with k-mer sizes between 21 and 127 [27]. We observed that
SPAdes-assembled contigs were longer for Illumina samples than for IonTorrent
samples. We predict that, because the SPAdes assembler further fragments the
reads into different k-mer sizes to form contigs, the overlapping region
between forward and reverse reads from Illumina facilitates the formation of
longer contigs [27]. More reads were incorporated into contigs for coral
environment samples when using SPAdes and for kelp forest samples when using
IDBA, which could be due to bias in how the algorithms handle the variability
within the microbial communities. We included two additional parameters,
species richness and evenness, to account for shortcuts applied in the
assembly algorithms, which include a data reduction step that discards
low-abundance sequences, and for the formation of contig chimeras [53]. A
decrease in species richness compared to the unassembled metagenomes would
suggest contig chimeras. However, all assemblies showed a slight increase in
species richness and conserved evenness, suggesting that minimal contig
chimeras were constructed by IDBA or SPAdes. IDBA assembler performance was
more uniform, suggesting that the assembler treats all datasets the same and
does not take advantage of underlying structure in the metagenomes, such as
longer reads. The IDBA documentation is minimal [52], and this may affect the
user's ability to use the assembler to its full potential. In conclusion, the
applied parameters showed that SPAdes assembly provides the best contig
continuity and minimal contig chimerism across four different microbial
environments and displayed flexibility with each of the biological and
platform biases. While conducted on far less data, other studies have also
found SPAdes to provide longer contigs with more reads used in the assembly
[26, 52].

Fig. 5 Proteome comparison of the reconstructed population genome (coral_IL_high bin 13) against the genome's closest neighbors
Parvibaculum lavamentivorans (a), Alpha proteobacterium IMCC14465 (b), and Pelagibacter ubique (c). The outer ring represents the contig of the
reference species. The middle ring represents the reference bacterial species, and the innermost ring represents the potentially novel population
genome, with the color scale representing the protein similarity
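
The richness and evenness check described in the assembly evaluation above can be illustrated with a short sketch. It assumes that richness is the number of taxa with non-zero abundance and that evenness is Pielou's index (Shannon diversity divided by the natural log of richness); the exact calculations used in this study are described in Additional file 10 and may differ. The taxon counts below are hypothetical and serve only to show the comparison between unassembled reads and assembled contigs.

```python
import math


def richness(counts: dict) -> int:
    """Number of taxa with a non-zero count."""
    return sum(1 for n in counts.values() if n > 0)


def shannon(counts: dict) -> float:
    """Shannon diversity H' (natural log) of a taxon -> count mapping."""
    total = sum(n for n in counts.values() if n > 0)
    if total == 0:
        return 0.0
    h = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            h -= p * math.log(p)
    return h


def pielou_evenness(counts: dict) -> float:
    """Pielou's evenness J = H' / ln(S); defined here as 0.0 when S < 2."""
    s = richness(counts)
    if s < 2:
        return 0.0
    return shannon(counts) / math.log(s)


# Hypothetical genus-level counts before and after assembly.
unassembled = {"Alteromonas": 400, "Vibrio": 350, "Ruegeria": 250}
assembled = {"Alteromonas": 41, "Vibrio": 34, "Ruegeria": 25}

if __name__ == "__main__":
    for label, profile in (("reads", unassembled), ("contigs", assembled)):
        print(label, richness(profile), round(pielou_evenness(profile), 3))
```

In this framing, a drop in richness from reads to contigs would hint at taxa being lost or merged during assembly, while similar richness and evenness values are consistent with few chimeric contigs.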
Binning tool selection
MetaBat was selected as the optimal binning tool because its bins had minimal
GC variation, low species richness, and high genome completeness, and so may
each represent a single genome. MetaBat extracted fewer bins than GroopM.
MetaBat bins were further validated using taxonomic identification to show
that the workflow reconstructed 66 metagenome-assembled genomes. These
metagenome-assembled genomes include sequences of closely related species;
therefore, they were identified to the genus level. Each
metagenome-assembled genome contained sequences belonging to distant
bacterial species, suggesting possible horizontal gene transfers or novel
sequences with no genome relative in the database. Metagenome-assembled
genomes of Arcobacter extracted from coral reefs were studied to identify
unique genes that were previously not associated with the genomes cultured
from other environments [9]. Identification of potentially novel genomes
extracted from metagenomes relies on the presence of marker genes [32, 54]. A
novel population bin (coral_IL_high bin 13) had all the bacterial genome
markers used in CheckM and was phylogenetically affiliated with the bacterial
species Parvibaculum lavamentivorans, with 44% proteome similarity using
FOCUS. Further analysis with ANI and JSpeciesWS (TCS) suggested 82.22%
similarity to Pelagibacter ubique. An ANI above 95% corresponds to more than
70% DNA-DNA hybridization, which indicates species relatedness, suggesting
that bin 13 falls below species-level classification. The conflicting results
of the two k-mer based tools suggest that the genomes are novel and therefore
do not closely match organisms in the databases. In addition, several
databases need to be used in the description of metagenome-assembled genomes
to overcome any database bias. The resulting metagenome-assembled genomes
enable linking taxa to function to understand the role of the population in
the microbial community, and we are currently investigating the role of these
genomes in the coral reef environment [14]. Our pipeline meets the minimum
standards for metagenome-assembled genomes [55]. In the process, novel
genomes, genes, and sequences were identified, which can now be deposited in
a database to improve future annotation [29, 32, 56].
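
The k-mer based comparison discussed above can be sketched as a correlation between tetranucleotide frequency profiles. This is only an illustration under stated assumptions: it counts overlapping 4-mers on a single strand and uses a plain Pearson correlation of raw frequencies, which is not the implementation used by JSpeciesWS (TCS) or by any other tool in this study. The function names and the toy sequences are hypothetical.

```python
from itertools import product
from math import sqrt

# All 256 possible tetranucleotides, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freqs(seq: str) -> list:
    """Relative frequency of each overlapping 4-mer (single strand only)."""
    seq = seq.upper()
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skips windows containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in KMERS]


def pearson(x: list, y: list) -> float:
    """Pearson correlation; returns 0.0 if either vector has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def tetra_correlation(seq_a: str, seq_b: str) -> float:
    """Correlation between the tetranucleotide profiles of two sequences."""
    return pearson(tetra_freqs(seq_a), tetra_freqs(seq_b))


if __name__ == "__main__":
    bin_seq = "ACGTTGCAAGCTAGCTAGGATCCAACGTTGCA"   # hypothetical bin sequence
    ref_seq = "ACGTTGCAAGCTAGCTAGGTTCCAACGTAGCA"   # hypothetical reference
    print(round(tetra_correlation(bin_seq, ref_seq), 3))
```

Because a composition-based score of this kind and an alignment-based score such as ANI measure different things, they can disagree for genomes that lack close relatives in the reference database, as observed for bin 13.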
Conclusions
We present a set of assembly and binning evaluation parameters for selecting
an optimized workflow to reconstruct metagenome-assembled genomes (see
Additional file 10). The set of parameters provides biologically relevant
information regarding richness, evenness, and GC content to help infer the
optimal tools for the dataset. Using these parameters, we found the optimized
workflow for the four metagenome projects to be SPAdes assembly followed by
MetaBat binning, regardless of the metagenomic variations. However, the
metagenomic variations within each project did result in differences in the
quality of the metagenome-assembled genomes. Communities that have high
coverage of phylogenetically distinct organisms and low taxonomic diversity
resulted in better quality genome reconstruction.
Additional files
Additional file 1: Table S1. Metagenomes used in this study. List of
metagenomes used in the analysis and the sequencing statistics. (DOCX 20 kb)
Additional file 2: Figure S1. Microbial diversity in the 4 microbiome
projects. Representation of microbial diversity using, (a) genus richness,
(b) genus evenness, (c) Shannon diversity, and (d) Simpson diversity of
the four projects, which are represented on the x axis. The box
represents 50% of the data ranges around the median. The outliers for
each case are represented as black dots. (DOCX 167 kb)
Additional file 3: Table S2. Post hoc Tukey HSD test results for
diversity analysis. Post hoc Tukey HSD test results for Shannon, Simpson,
Richness and Evenness for the four projects. (DOCX 14 kb)
Additional file 4: Table S3. Assembly statistics. QUAST results for the
12 Contigs files assembled using the three assemblers; IDBA, MetaVelvet,
SPAdes. (DOCX 16 kb)
Additional file 5: Table S4. Post hoc Tukey HSD test results for contig
length. Post hoc Tukey test results comparing contig length of 1000
contigs across assemblers and projects. (DOCX 18 kb)
Additional file 6: Table S5. Post hoc Tukey HSD test results for mean
reads assembled. Post hoc Tukey test results for the mean reads assembled
(%) of 1000 contigs across assemblers and projects. (DOCX 18 kb)
Additional file 7: Table S6. Assembly evaluation parameters. List of all
the assembly evaluation parameters. (DOCX 16 kb)
Additional file 8: Table S7. Binning tool evaluation parameters. List of
the parameters for the GroopM and MetaBat extracted bins. (DOCX 51 kb)
Additional file 9: Table S8. Comparison on metagenome-assembled
genomes. Comparison of the genome parameters of novel
metagenome-assembled genome (coral_IL_high Bin 13) against the three
closest genomes from the database. (DOCX 13 kb)
Additional file 10: Optimized workflow. Guide to optimized workflow
to reconstruct metagenome-assembled genomes. Description of the
programs used in this study at each step and the evaluation parameters
calculation is provided as step by step workflow. (DOCX 148 kb)
Abbreviations
ANI: Average Nucleotide Identity; ANOVA: ANalysis Of VAriance;
BWA: Burrows-Wheeler Aligner; CCOM: Contig Clustering of Metagenomics;
CheckM: Check genome completeness; coral_IL_high: coral high diversity;
coral_IT_low: coral low diversity; FOCUS: Find Organisms by Composition
USage; GroopM: Group Metagenomes; IDBA: Iterative De Bruijn graph
Assembler; IL: Illumina MiSeq; IT: Ion Torrent PGM; kelp_IL_low: kelp low
diversity; kelp_IT_high: kelp high diversity; MetaBat: Metagenome Binning
with Abundance and Tetra-nucleotide frequencies;
MetaVelvet: (METAgenomic-Velvet assembler); MG-RAST: MetaGenomics-
Rapid Annotation using Subsystems Technology; PATRIC: Pathosystems
Resource Integration Center; PEAR: Paired-End Read merger;
PhyloSift: Phylogenetic analysis of genomes and metagenomes;
PRINSEQ: PReprocessing and INformation of SEQuence data; QUAST: Quality
Assessment for Genome Assemblies; RAST: Rapid Annotations using
Subsystems Technology; SPAdes: (St. Petersburg genome assembler);
TCS: Tetranucleotide Correlation Search
Acknowledgments
We acknowledge funding from CNPq, FAPERJ, and CAPES and permits provided
by Brazilian federal government license, USA National Fisheries and Wildlife
Permit for sample collection. We would also like to thank National Center for
Genome Analysis Support (NSF Awards DBI-1458641 and ABI-1062432).
Funding
All the work was conducted at San Diego State University. We thank the funding
support from National Science Foundation Grants NSF Division of Undergraduate
Education #1323809, NSF Division of Molecular and Cellular Science #1330800,
and NSF Division of Computer and Network Systems CNS-1305112.
Availability of data and materials
The metagenomes analyzed in this study are available on MG-RAST
repository, and their MG-RAST IDs are in Additional file 1: Table S1.
Authors' contributions
Conceived and designed the experiments: BP and ED. Data collection and
sequencing experiments: JMH, MD, MM, FT, and KW. Performed
metagenomic analysis: BP, DTB, DP, and GGS. Statistical analysis: BP and PZ.
Critical revision of the manuscript: RE and ED. All authors read and approved
the final manuscript.
Ethics approval and consent to participate
No animal ethics approval was required. This research was conducted under
the Brazilian federal government license (SISBIO no. 101122). We received
this license to access protected areas from Parque Nacional Marinho de
Abrolhos/IBAMA (Instituto Brasileiro do Meio Ambiente e dos Recursos
Naturais Renováveis). The macroalgae were collected under the USA
National Fisheries and Wildlife Permit # SC - 13075.
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
Author details
1 Bioinformatics and Medical Informatics, San Diego State University, San Diego, California, USA.
2 National Center for Genome Analysis Support, Indiana University, Bloomington, Indiana, USA.
3 Department of Biology, San Diego State University, 5500 Campanile Drive, San Diego 92115, California, USA.
4 Computational Science Research Center, San Diego State University, San Diego, California, USA.
5 Department of Biology, University of New South Wales, Sydney, New South Wales, Australia.
6 Department of Mathematics and Statistics, San Diego State University, San Diego, California, USA.
7 Institute of Biology, Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro, Brazil.
8 Department of Computer Science, San Diego State University, 5500 Campanile Drive, San Diego, California, USA.
Received: 8 June 2017 Accepted: 13 November 2017
References
1. JLaAT MC. Ome Sweet'Omics-a genealogical Treasury of words. Sci. 2001;
17(7):88.
2. Albertsen M, Hugenholtz P, Skarshewski A, Nielsen KL, Tyson GW, Nielsen PH.
Genome sequences of rare, uncultured bacteria obtained by differential
coverage binning of multiple metagenomes. Nat Biotechnol. 2013;31(6):533–8.
3. Hugenholtz P. Exploring prokaryotic diversity in the genomic era. Genome
Biol. 2002;3(2):reviews0003.0001–8.
4. Locey KJ, Lennon JT. Scaling laws predict global microbial diversity. Proc
Natl Acad Sci. 2016;113(21):5970–5.
5. Dinsdale EA, Pantos O, Smriga S, Edwards RA, Angly F, Wegley L, Hatay M,
Hall D, Brown E, Haynes M, et al. Microbial ecology of four coral atolls in the
northern Line Islands. PLoS One. 2008b;3(2):e1584.
6. Doane MP, Haggerty JM, Kacev D, Papudeshi B, Dinsdale EA. The skin
microbiome of the common thresher shark (Alopias Vulpinus) has low
taxonomic and gene function beta-diversity. Environ Microbiol Rep. 2017;
9(4):357–73.
7. Haggerty JM, Dinsdale EA. Distinct biogeographical patterns of marine
bacterial taxonomy and functional genes. Glob Ecol Biogeogr. 2016;26(2):
177–90.
8. Haggerty JM, Bhavya Papudeshi, Alejandro Vega, Megan Morris, Michael
Doane, Holly Norman, Dinsdale E: Taxonomic selection and metabolic
strategies during bacterial succession of decomposing giant kelp,
Macrocystis pyrifera. In review.
9. Haggerty JM, Bhavya Papudeshi, Kevin Walsh, Marc B. Turner, Ronaldo
Francini-Filho, Cynthia B. Silveira, Timothy T. Harkins, Robert A. Edwards,
Fabiano L. Thompson, Dinsdale EA: Hunt for the super-heterotroph:
investigating the gene content of rarer coral reef bacterial genera. In review.
10. Morris MM, Haggerty JM, Papudeshi BN, Vega AA, Edwards MS, Dinsdale EA.
Altered microbial abundance and community composition affect recruitment
and development in gametophytes of giant kelp, Macrocystis pyrifera.
Front Microbiol. 2016. https://doi.org/10.3389/fmicb.2016.01800.
11. Peterson J, Garges S, Giovanni M, McInnes P, Wang L, Schloss JA, Bonazzi V,
McEwen JE, Wetterstrand KA, Deal C, et al. The NIH human microbiome
project. Genome Res. 2009;19(12):2317–23.
12. Dinsdale EA, Edwards RA, Bailey BA, Tuba I, Akhter S, McNair K, Schmieder R,
Apkarian N, Creek M, Guan E, et al. Multivariate analysis of functional
metagenomes. Front Genet. 2013;4:41.
13. Coutinho FH, Meirelles PM, Moreira APB, Paranhos RP, Dutilh BE, Thompson
FL. Niche distribution and influence of environmental parameters in marine
microbial communities: a systematic review. PeerJ. 2015;3:e1008.
14. Walsh K, Haggerty JM, Doane M, Hansen J, Morris M, Moreira AP, de Oliveira L,
Leomil L, Garcia G, Thompson FL, Dinsdale EA. Aura-biomes are present in the
water layer above coral reef benthic macro-organisms. PeerJ. 2017;5:e3666.
15. Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM, Furlan M,
Desnues C, Haynes M, Li L, et al. Functional metagenomic profiling of nine
biomes. Nature. 2008a;452(7187):629–32.
16. Kelly LW, Williams GJ, Barott KL, Carlson CA, Dinsdale EA, Edwards RA, Haas
AF, Haynes M, Lim YW, McDole T, et al. Local genomic adaptation of coral
reef-associated microbiomes to gradients of natural variability and
anthropogenic stressors. Proc Natl Acad Sci. 2014;111(28):10227–32.
17. Jensen S, Bourne DG, Hovland M, Murrell JC. High diversity of
microplankton surrounds deep-water coral reef in the Norwegian Sea. FEMS
Microbiol Ecol. 2012;82(1):75–89.
18. Bruce T, Meirelles PM, Garcia G, Paranhos R, Rezende CE, de Moura RL, Filho
R-F, Coni EOC, Vasconcelos AT, Amado Filho G, et al. Abrolhos Bank reef
health evaluated by means of water quality, microbial diversity, benthic
cover, and fish biomass data. PLoS One. 2012;7(6):e36687.
19. Fernandes N, Steinberg P, Rusch D, Kjelleberg S, Thomas T. Community
structure and functional gene profile of bacteria on healthy and diseased
thalli of the red seaweed Delisea pulchra. PLoS One. 2012;7(12):e50854.
20. Cassman N, Prieto-Davó A, Walsh K, Silva GG, Angly F, Akhter S, Barott K,
Busch J, McDole T, Haggerty JM. Oxygen minimum zones harbor novel viral
communities with low diversity. Environ Microbiol. 2012;14(11):3043–65.
21. Huggett JF, Laver T, Tamisak S, Nixon G, O'Sullivan DM, Elaswarapu R,
Studholme DJ, Foy CA. Considerations for the development and application
of control materials to improve metagenomic microbial community
profiling. Accred Qual Assur. 2012;18(2):77–83.
22. Sczyrba A, Hofmann P, Belmann P, Koslicki D, Janssen S, Droege J, Gregor I,
Majda S, Fiedler J, Dahms E et al: Critical Assessment of Metagenome
Interpretation – a benchmark of computational metagenomics software.
bioRxiv 099127; https://doi.org/10.1101/099127.
23. Prakash T, Taylor TD. Functional assignment of metagenomic data:
challenges and applications. Brief Bioinform. 2012;13(6):711–27.
24. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool
for genome assemblies. Bioinformatics. 2013;29(8):1072–5.
25. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient
alignment of short DNA sequences to the human genome. Genome Biol.
2009;10(3):1–10.
26. Garcia-Lopez R, Vazquez-Castellanos JF, Moya A. Fragmentation and
coverage variation in viral metagenome assemblies, and their effect in
diversity calculations. Front Bioeng Biotechnol. 2015;3:141.
27. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS.
SPAdes: a new genome assembly algorithm and its applications to single-
cell sequencing. J Comput Biol. 2012;19:455–77.
28. Peng Y, Leung HC, Yiu SM, Chin FY. IDBA-UD: a de novo assembler for
single-cell and metagenomic sequencing data with highly uneven depth.
Bioinformatics. 2012;28(11):1420–8.
29. Dutilh BE, Schmieder R, Nulton J, Felts B, Salamon P, Edwards RA, Mokili JL.
Reference-independent comparative metagenomics using cross-assembly:
crAss. Bioinformatics. 2012;28(24):3225–31.
30. Imelfort M, Parks D, Woodcroft BJ, Dennis P, Hugenholtz P, Tyson GW.
GroopM: an automated tool for the recovery of population genomes from
related metagenomes. PeerJ. 2014;2:e603.
31. Kang DD, Froula J, Egan R, Wang Z. MetaBAT, an efficient tool for accurately
reconstructing single genomes from complex microbial communities. PeerJ.
2015;3:e1165.
32. Sangwan N, Xia F, Gilbert JA. Recovering complete and draft population
genomes from metagenome datasets. Microbiome. 2016;4:8.
33. Karlin S, Mrazek J, Campbell AM. Compositional biases of bacterial genomes
and evolutionary implications. J Bacteriol. 1997;179(12):3899–913.
34. Cleary B, Brito IL, Huang K, Gevers D, Shea T, Young S, Alm EJ. Detection of
low-abundance bacterial strains in metagenomic datasets by eigengenome
partitioning. Nat Biotechnol. 2015;33(10):1053–60.
35. Pride DT, Meinersmann RJ, Wassenaar TM, Blaser MJ. Evolutionary
implications of microbial genome tetranucleotide frequency biases.
Genome Res. 2003;13(2):145–58.
36. Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner FO. TETRA: a web-
service and a stand-alone program for the analysis and comparison of
tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics.
2004;5(1):163.
37. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM:
assessing the quality of microbial genomes recovered from isolates, single
cells, and metagenomes. Genome Res. 2015;25(7):1043–55.
38. Darling AE, Jospin G, Lowe E, Matsen FA, Bik HM, Eisen JA. PhyloSift:
phylogenetic analysis of genomes and metagenomes. PeerJ. 2014;2:e243.
39. Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K,
Gerdes S, Glass EM, Kubal M. The RAST server: rapid annotations using
subsystems technology. BMC Genomics. 2008;9:75.
40. Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, Tiedje
JM. DNA-DNA hybridization values and their relationship to whole-genome
sequence similarities. Int J Syst Evol Microbiol. 2007;57(Pt 1):81–91.
41. Richter M, Rosselló-Móra R, Oliver Glöckner F, Peplies J. JSpeciesWS: a web
server for prokaryotic species circumscription based on pairwise genome
comparison. Bioinformatics. 2016;32(6):929–31.
42. Nielsen HB, Almeida M, Juncker AS, Rasmussen S, Li J, Sunagawa S, Plichta
DR, Gautier L, Pedersen AG, Le Chatelier E, et al. Identification and assembly
of genomes and genetic elements in complex metagenomic samples
without using reference genomes. Nat Biotech. 2014;32(8):822–8.
43. Gupta A, Kumar S, Prasoodanan VPK, Harish K, Sharma AK, Sharma VK.
Reconstruction of bacterial and viral genomes from multiple Metagenomes.
Front Microbiol. 2016;7:469.
44. Meyer F, Paarmann D, D'Souza M, Olson R, Glass E, Kubal M, Paczian T,
Rodriguez A, Stevens R, Wilke A, et al. The metagenomics RAST server – a
public resource for the automatic phylogenetic and functional analysis of
metagenomes. BMC Bioinformatics. 2008;9(1):1–8.
45. Schmieder R, Edwards R. Quality control and preprocessing of
metagenomic datasets. Bioinformatics. 2011;27(6):863–4.
46. Zhang J, Kobert K, Flouri T, Stamatakis A. PEAR: a fast and accurate Illumina
paired-end reAd mergeR. Bioinformatics. 2013;30(5):614–20.
47. Clarke K, Gorley, RN: PRIMER v7: User Manual/Tutorial. PRIMER-E. 2015:
Plymouth, 296pp.
48. Silva GGZ, Cuevas DA, Dutilh BE, Edwards RA. FOCUS: an alignment-free
model to identify organisms in metagenomes using non-negative least
squares. PeerJ. 2014;2:e425.
49. Raheema JY: Contig clustering of Metagenomics (CCOM): a tool that
generates population genomes (bins) to analyze and capture uncultured
genomes. Thesis. San Diego: Montezuma Publishing: San Diego State
University; 2016.
50. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I,
Hamelryck T, Kauff F, Wilczynski B, et al. Biopython: freely available python
tools for computational molecular biology and bioinformatics.
Bioinformatics. 2009;25(11):1422–3.
51. Wattam AR, Davis JJ, Assaf R, Boisvert S, Brettin T, Bun C, Conrad N, Dietrich
EM, Disz T, Gabbard JL, et al. Improvements to PATRIC, the all-bacterial
bioinformatics database and analysis resource Center. Nucleic Acids Res.
2017;45(D1):D535–42.
52. Vollmers J, Wiegand S, Kaster A-K. Comparing and evaluating metagenome
assembly tools from a microbiologist's perspective – not only size matters!
PLoS One. 2017;12(1):e0169662.
53. Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT. Scaling
metagenome sequence assembly with probabilistic de Bruijn graphs. Proc
Natl Acad Sci. 2012;109(33):13272–7.
54. Yuan C, Lei J, Cole J, Sun Y. Reconstructing 16S rRNA genes in
metagenomic data. Bioinformatics. 2015;31(12):i35–43.
55. Bowers RM, Kyrpides NC, Stepanauskas R, Harmon-Smith M, Doud D, Reddy
TBK, Schulz F, Jarett J, Rivers AR, Eloe-Fadrosh EA, et al. Minimum
information about a single amplified genome (MISAG) and a metagenome-
assembled genome (MIMAG) of bacteria and archaea. Nat Biotechnol. 2017;
35(8):725–31.
56. Dupont CL, Rusch DB, Yooseph S, Lombardo M-J, Alexander Richter R, Valas
R, Novotny M, Yee-Greenbaum J, Selengut JD, Haft DH, et al. Genomic
insights to SAR86, an abundant and uncultivated marine bacterial lineage.
ISME J. 2012;6(6):1186–99.
Papudeshi et al. BMC Genomics (2017) 18:915 Page 13 of 13
... Now, we perform metagenomic sequencing using massively parallel sequencing or deep sequencing. This involves sequencing millions of small fragments of DNA, and then recreating the genome by connecting the fragment sequences using bioinformatics analyses [7]. ...
Article
The microbiome is an essential part of most ecosystems. It was originally studied mostly through culturing but relatively few microbes can be cultured, so much of the microbiome was left unexplored. The emergence of metagenomic sequencing techniques changed that and allowed the study of microbiomes from all sorts of habitats. Metagenomic sequencing also allowed for a more thorough exploration of prophages, viruses that integrate into bacterial genomes, and how they benefit their hosts. One issue with using open-access metagenomic data is that sequences added to databases often have little to no metadata to work with, so finding enough sequences can be difficult. Many metagenomes have been manually curated but this is a time-consuming process and relies heavily on the uploader to be accurate and thorough when filling in metadata fields and the curators to be working with the same ontologies. Using algorithms to automatically sort metagenomes based on either the taxonomic profile or the functional profile may be a viable solution to the issues with manually curated metagenomes, but it requires that the algorithm is trained on carefully curated datasets and using the most informative profile possible in order to minimize errors.
... Although bin006 was predicted as Pseudomonas with high quality (100% completeness and 1.41% contamination), it was not accurately classified in a detailed analysis to delineate species using taxonomical indices including average nucleotide identity (ANI) value and 16S rRNA gene similarity. Thus, we had excluded bin006 from analysis to reduce misunderstanding by potential biases during MAG reconstruction [26]. As a result, three bins (bin001, bin002, and bin003) were retrieved from all the KMC samples and were predicted as Komagataeibacter (bin001 and 003) and Family Acetobacteraceae (bin002) by the SCG set of Kaiju. ...
Article
Kombucha mutualistic community (KMC) is composed of acetic acid bacteria and yeasts, producing fermented tea with health benefits. As part of the BIOlogy and Mars EXperiment (BIOMEX) project, the effect of Mars-like conditions on the KMC was analyzed. Here, we analyzed metagenome-assembled genomes (MAGs) of Komagataeibacter, which is a predominant genus in KMC, to understand their roles in the KMC after exposure to Mars-like conditions (outside the International Space Station) based on functional genetic elements. We constructed three MAGs: K. hansenii, K. rhaeticus, and K. oboediens. Our results showed that (i) the K. oboediens MAG is functionally more complex than K. hansenii, (ii) K. hansenii is a keystone in KMCs with specific functional features to tolerate extreme stress, and (iii) genes related to the PPDK, betaine biosynthesis, polyamines biosynthesis, sulfate-sulfur assimilation pathway as well as type II toxin-antitoxin (TA) system, quorum sensing (QS) system, and cellulose production could play important roles in the resilience of KMC after exposure to Mars-like stress. Our findings show the potential mechanisms through which Komagataeibacter tolerates the extraterrestrial stress and will help to understand the minimal microbial composition of KMC for space travelers.
... The quality filtered, trimmed and assembled reads obtained from the previous step were used for assembling genomes from the metagenomes. Binning of the contigs was carried out with three different binning algorithms: Metabat2 [38], Maxbin2 [39] and Concoct [40]. The result of each binning procedure was further improved with Metawrap [41]. ...
Article
From the metagenome of a carbamazepine-amended selective enrichment culture, the genome of a new-to-science bacterial species affiliating with the genus Nocardioides was reconstructed. From the same enrichment an aerobic actinobacterium, strain CBZ_1T, sharing 99.4% whole-genome sequence similarity with the reconstructed Nocardioides sp. bin genome was isolated. On the basis of 16S rRNA gene sequence similarity the novel isolate affiliated to the genus Nocardioides, with the closest relatives Nocardioides kongjuensis DSM19082T (98.4%), Nocardioides daeguensis JCM17460T (98.4%) and Nocardioides nitrophenolicus DSM15529T (98.2%). Using a polyphasic approach it was confirmed that the isolate CBZ_1T represents a new phyletic lineage within the genus Nocardioides. According to metagenomic, metatranscriptomic studies and metabolic analyses strain CBZ_1T was abundant in both carbamazepine and ibuprofen enrichments, and harbors biodegradative genes involved in the biodegradation of pharmaceutical compounds. Biodegradation studies supported that the new species was capable of ibuprofen biodegradation. After 7 weeks of incubation, in mineral salts solution supplemented with glucose (3 g l⁻¹) as co-substrate, 70% of ibuprofen was eliminated by strain CBZ_1T at an initial conc. of 1.5 mg l⁻¹. The phylogenetic, phenotypic and chemotaxonomic data supported the classification of strain CBZ_1T to the genus Nocardioides, for which the name Nocardioides carbamazepini sp. nov. (CBZ_1T = NCAIM B.0.2663 = LMG 32395) is proposed. To the best of our knowledge, this is the first study that reports simultaneous genome reconstruction of a new-to-science bacterial species using metagenome binning and at the same time the isolation of the same novel bacterial species.
... Unlike conventional RNA-seq experiments, in which the reads obtained are mapped against reference genomes, metatranscriptomic analyses usually follow the strategy called de novo assembly, a method similar to the reconstruction of genomes from metagenomes, assuming a priori that a large part of the organisms present have no sequenced genomes available (Papudeshi et al. 2017). A crucial aspect of (meta)transcriptomic analyses is the functional annotation of the transcripts, that is, the assignment of a biological function according to the identity of the protein each transcript encodes; for this, different databases of proteins classified by biological function are used. ...
Thesis
Enriched microbial consortia are promising tools for obtaining biofuels and value-added products from lignocellulosic biomass. Due to the complexity of these biological systems, their in-depth study requires the use of large-scale analysis tools. One such approach is metatranscriptomics, which is based on the analysis of all genes expressed by a microbial community under certain conditions. Nixtamalized maize pericarp is a lignocellulosic residue generated in large amounts by the food industry. This residue, since it is rich in cellulose and hemicellulose, is a potential source of fermentable sugars. The endogenous microbiota of nixtamalized maize pericarp, through an enrichment-stabilization process, acquires the ability to efficiently degrade that residue. The resulting consortium, which has been called PM-06, is composed of bacteria from the phyla Actinobacteria and Firmicutes, among which the most abundant genera are Aneurinibacillus (29%), Bacillus (28%), Paenibacillus (26%) and Microbacterium (13%). The degradation of the nixtamalized maize pericarp by the action of the PM-06 consortium occurs in sequential stages. These stages are characterized by structural and physicochemical modifications of the substrate accompanied by a cyclic behavior in the dynamics of microbial populations. On the other hand, the transcriptional profile of the consortium shows a temporal pattern that is characterized by a high expression of genes belonging to glycosyl hydrolases and carbohydrate esterases both at the beginning and at the end of the process. Likewise, through the taxonomic affiliation of these genes, the genera Bacillus and Paenibacillus were identified as the degraders of the complex polymers of the insoluble residue while Microbacterium and Aneurinibacillus would be carrying out the hydrolysis of solubilized oligosaccharides.
The findings of this work allowed us to generate a model of the degradation of nixtamalized maize pericarp with a potential biotechnological application.
... Unfortunately, the intragenomic variance of oligonucleotide frequencies is quite high when looking at fragments with a length below 10,000, which are quite common when dealing with assemblies of complex metagenomes (Forouzan et al., 2018; Kang et al., 2019; Papudeshi et al., 2017). Better performing approaches were needed and these typically involve Markov models or variants thereof. ...
Article
The rise of metagenomics offers a leap forward for understanding the genetic diversity of microorganisms in many different complex environments by providing a platform that can identify potentially unlimited numbers of known and novel microorganisms. As such, it is impossible to imagine new major initiatives without metagenomics. Nevertheless, it represents a relatively new discipline with various levels of complexity and demands on bioinformatics. The underlying principles and methods used in metagenomics are often seen as common knowledge and often not detailed or fragmented. Therefore, we reviewed these to guide microbiologists in taking the first steps into metagenomics. We specifically focus on a workflow aimed at reconstructing individual genomes, that is, metagenome‐assembled genomes, integrating DNA sequencing, assembly, binning, identification and annotation.
... The quality-controlled reads were assembled into contigs using SPAdes v 3.11.1 [95] with the "meta" option and k-mer sizes of 21, 31, 41, 51, 71. The assembly quality was checked using the "metaquast" option of QUAST v 3.1 (Quality Assessment for Genome Assemblies) based on weighted median contig size (N50) [50] and percent of reads mapped to the contigs [76,101]. Only the reads mapped to prokaryotic contigs were examined in this study (see the "Taxonomic annotation" and "Functional annotation" sections below). ...
Article
Background Termites primarily feed on lignocellulose or soil in association with specific gut microbes. The functioning of the termite gut microbiota is partly understood in a handful of wood-feeding pest species but remains largely unknown in other taxa. We intend to fill this gap and provide a global understanding of the functional evolution of termite gut microbiota. Results We sequenced the gut metagenomes of 145 samples representative of the termite diversity. We show that the prokaryotic fraction of the gut microbiota of all termites possesses similar genes for carbohydrate and nitrogen metabolisms, in proportions varying with termite phylogenetic position and diet. The presence of a conserved set of gut prokaryotic genes implies that essential nutritional functions were present in the ancestor of modern termites. Furthermore, the abundance of these genes largely correlated with the host phylogeny. Finally, we found that the adaptation to a diet of soil by some termite lineages was accompanied by a change in the stoichiometry of genes involved in important nutritional functions rather than by the acquisition of new genes and pathways. Conclusions Our results reveal that the composition and function of termite gut prokaryotic communities have been remarkably conserved since termites first appeared ~ 150 million years ago. Therefore, the “world’s smallest bioreactor” has been operating as a multipartite symbiosis composed of termites, archaea, bacteria, and cellulolytic flagellates since its inception.
... The latter two factors might be the reason why we did not retrieve any MAGs from Apicomplexa such as dinoflagellates. The intraphylum diversity most likely plays a role, too [99]; populations with low diversity and high coverage have been observed to improve the quality of MAGs recovered by Metabat [100]. Viridiplantae show low diversity, and especially members from the Prasinophytes have small genomes and are abundant in the surface ocean [101], which might explain why we retrieved several MAGs from different classes. ...
Article
Background Phytoplankton communities significantly contribute to global biogeochemical cycles of elements and underpin marine food webs. Although their uncultured genomic diversity has been estimated by planetary-scale metagenome sequencing and subsequent reconstruction of metagenome-assembled genomes (MAGs), this approach has yet to be applied for complex phytoplankton microbiomes from polar and non-polar oceans consisting of microbial eukaryotes and their associated prokaryotes. Results Here, we have assembled MAGs from chlorophyll a maximum layers in the surface of the Arctic and Atlantic Oceans enriched for species associations (microbiomes) with a focus on pico- and nanophytoplankton and their associated heterotrophic prokaryotes. From 679 Gbp and estimated 50 million genes in total, we recovered 143 MAGs of medium to high quality. Although there was a strict demarcation between Arctic and Atlantic MAGs, adjacent sampling stations in each ocean had 51–88% MAGs in common with most species associations between Prasinophytes and Proteobacteria. Phylogenetic placement revealed eukaryotic MAGs to be more diverse in the Arctic whereas prokaryotic MAGs were more diverse in the Atlantic Ocean. Approximately 70% of protein families were shared between Arctic and Atlantic MAGs for both prokaryotes and eukaryotes. However, eukaryotic MAGs had more protein families unique to the Arctic whereas prokaryotic MAGs had more families unique to the Atlantic. Conclusion Our study provides a genomic context to complex phytoplankton microbiomes to reveal that their community structure was likely driven by significant differences in environmental conditions between the polar Arctic and warm surface waters of the tropical and subtropical Atlantic Ocean.
... In addition to analyses of the whole microbial community and its genetic potential, the recovery of MAGs provides significant insight into microbial evolution and metabolism [55]. MAGs are usually retrieved from the more abundant reads of the more abundant species in the community [56]; however, the outcome is also affected by many other factors, including the genetic diversity of the microbial community, in addition to the metagenome sequencing depth and coverage [57]. In the current study, 68 MAGs were recovered from 17.4 GB of raw data and 18 of those MAGs were of good or high quality. ...
Article
The anthropogenic release of oil hydrocarbons into the cold marine environment is an increasing concern due to the elevated usage of sea routes and the exploration of new oil drilling sites in Arctic areas. The aim of this study was to evaluate prokaryotic community structures and the genetic potential of hydrocarbon degradation in the metagenomes of seawater, sea ice, and crude oil encapsulating the sea ice of the Norwegian fjord, Ofotfjorden. Although the results indicated substantial differences between the structure of prokaryotic communities in seawater and sea ice, the crude oil encapsulating sea ice (SIO) showed increased abundances of many genera containing hydrocarbon-degrading organisms, including Bermanella, Colwellia, and Glaciecola. Although the metagenome of seawater was rich in a variety of hydrocarbon degradation-related functional genes (HDGs) associated with the metabolism of n-alkanes, and mono- and polyaromatic hydrocarbons, most of the normalized gene counts were highest in the clean sea ice metagenome, whereas in SIO, these counts were the lowest. The long-chain alkane degradation gene almA was detected in all the studied metagenomes and its counts exceeded ladA and alkB counts in both sea ice metagenomes. In addition, almA was related to the most diverse group of prokaryotic genera. Almost all 18 good- and high-quality metagenome-assembled genomes (MAGs) had diverse HDG profiles. The MAGs recovered from the SIO metagenome belonged to the abundant taxa, such as Glaciecola, Bermanella, and Rhodobacteraceae, in this environment. The genera associated with HDGs were often previously known as hydrocarbon-degrading genera. However, a substantial number of new associations, either between already known hydrocarbon-degrading genera and new HDGs or between genera not known to contain hydrocarbon degraders and multiple HDGs, were found.
The super-imposition of the results of comparing HDG associations with taxonomy, the HDG profiles of MAGs, and the full genomes of organisms in the KEGG database suggest that the found relationships need further investigation and verification.
... The remarkable advances of DNA sequencing technology (Mardis, 2017) have led to the accumulation of a vast amount of DNA sequence information, including that for uncultured microbes, i.e., metagenome-assembled genomes (Papudeshi et al., 2017) or single-cell genomes (Xu and Zhao, 2018). Currently, genomic DNA sequences of more than 300,000 microbes are registered in public databases, such as the National Center for Biotechnology Information (NCBI) database (as of June 2021). ...
Article
Substrate-induced gene expression (SIGEX) is a high-throughput promoter-trap method. It is a function-based metagenomic screening tool that relies on transcriptional activation of a reporter gene, green fluorescence protein (gfp), by a metagenomic DNA library upon induction with a substrate. However, its use is limited because of the relatively small size of metagenomic DNA libraries and incompatibility with screening metagenomes from anaerobic environments. In this study, these limitations of SIGEX were addressed by fine-tuning the metagenome DNA library construction protocol and by using Evoglow, a green fluorescent protein that forms a chromophore even under anaerobic conditions. Two metagenomic libraries were constructed for subseafloor sediments offshore Shimokita Peninsula (Pacific Ocean) and offshore Joetsu (Japan Sea). The library construction protocol was improved by (a) eliminating short DNA fragments, (b) applying topoisomerase-based high-efficiency ligation, (c) optimizing insert DNA concentration, and (d) column-based DNA enrichment. This led to a successful construction of metagenome DNA libraries of approximately 6 Gbp for both samples. SIGEX screening using five aromatic compounds (benzoate, 3-chlorobenzoate, 3-hydroxybenzoate, phenol, and 2,4-dichlorophenol) under aerobic and anaerobic conditions revealed significant differences in the inducible clone ratios under these conditions. 3-Chlorobenzoate and 2,4-dichlorophenol led to a higher induction ratio than that for the other non-chlorinated aromatic compounds under both aerobic and anaerobic conditions. After the further screening of induced clones, a clone induced by 3-chlorobenzoate only under anaerobic conditions was isolated and characterized. The clone harbors a DNA insert that encodes putative open reading frames of unknown function. Previous aerobic SIGEX attempts succeeded in the isolation of gene fragments from anaerobes.
This study demonstrated that some gene fragments require a strict in vivo reducing environment to function and may be potentially missed when screened by aerobic induction. The newly developed anaerobic SIGEX scheme will facilitate functional exploration of metagenomes from the anaerobic biosphere.
... The quality-controlled reads were assembled into contigs using SPAdes v 3.11.1 (Nurk et al., 2017) with the "meta" option and k-mer sizes of 21, 31, 41, 51, 71. The assembly quality was checked using the "metaquast" option of QUAST v 3.1 (Quality Assessment for Genome Assemblies) based on weighted median contig size (N50) (Gurevich et al., 2013) and percent of reads mapped to the contigs (Langmead et al., 2012; Papudeshi et al., 2017). Only the reads mapped to prokaryotic contigs were examined in this study (see the taxonomic annotation and functional annotation sections below). ...
Preprint
Termites primarily feed on lignocellulose or soil in association with specific gut microbes. The functioning of the termite gut microbiota is partly understood in a handful of wood-feeding pest species, but remains largely unknown in other taxa. We intend to fill this gap and provide a global understanding of the functional evolution of termite gut microbiota. We sequenced the gut metagenomes of 145 samples representative of the termite diversity. We show that the prokaryotic fraction of the gut microbiota of all termites possesses similar genes for carbohydrate and nitrogen metabolisms, in proportions varying with termite phylogenetic position and diet. The presence of a conserved set of gut prokaryotic genes implies that key nutritional functions were present in the ancestor of modern termites. Furthermore, the abundance of these genes largely correlated with the host phylogeny. Finally, we found that the adaptation to a diet of soil by some termite lineages was accompanied by a change in the stoichiometry of genes involved in important nutritional functions rather than by the acquisition of new genes and pathways. Our results reveal that the composition and function of termite gut prokaryotic communities have been remarkably conserved since termites first appeared ~150 million years ago. Therefore, the world's smallest bioreactor has been operating as a multipartite symbiosis composed of termites, archaea, bacteria, and cellulolytic flagellates since its inception.
Preprint
In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on datasets of unprecedented complexity and realism. Benchmark metagenomes were generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids, including genomes with varying degrees of relatedness to each other and to publicly available ones and representing common experimental setups. Across all datasets, assembly and genome binning programs performed well for species represented by individual genomes, while performance was substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below the family level. Parameter settings substantially impacted performances, underscoring the importance of program reproducibility. While highlighting current challenges in computational metagenomics, the CAMI results provide a roadmap for software selection to answer specific research questions.
Article
Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.
Article
As coral reef habitats decline worldwide, some reefs are transitioning from coral- to algal-dominated benthos with the exact cause for this shift remaining elusive. Increases in the abundance of microbes in the water column have been correlated with an increase in coral disease and reduction in coral cover. Here we investigated how multiple reef organisms influence microbial communities in the surrounding water column. Our study consisted of a field assessment of microbial communities above replicate patches dominated by a single macro-organism. Metagenomes were constructed from 20 L of water above distinct macro-organisms, including (1) the coral Mussismilia braziliensis, (2) fleshy macroalgae (Stypopodium, Dictyota and Canistrocarpus), (3) turf algae, and (4) the zoanthid Palythoa caribaeorum and were compared to the water microbes collected 3 m above the reef. Microbial genera and functional potential were annotated using MG-RAST and showed that the dominant benthic macro-organisms influence the taxa and functions of microbes in the water column surrounding them, developing a specific "aura-biome". The coral aura-biome reflected the open water column, and was associated with Synechococcus and functions suggesting oligotrophic growth, while the fleshy macroalgae aura-biome was associated with Ruegeria, Pseudomonas, and microbial functions suggesting low oxygen conditions. The turf algae aura-biome was associated with Vibrio, Flavobacterium, and functions suggesting pathogenic activity, while zoanthids were associated with Alteromonas and functions suggesting a stressful environment. Because each benthic organism has a distinct aura-biome, a change in benthic cover will change the microbial community of the water, which may lead to either the stimulation or suppression of the recruitment of benthic organisms.
We present two standards developed by the Genomic Standards Consortium (GSC) for reporting bacterial and archaeal genome sequences. Both are extensions of the Minimum Information about Any (x) Sequence (MIxS). The standards are the Minimum Information about a Single Amplified Genome (MISAG) and the Minimum Information about a Metagenome-Assembled Genome (MIMAG), including, but not limited to, assembly quality, and estimates of genome completeness and contamination. These standards can be used in combination with other GSC checklists, including the Minimum Information about a Genome Sequence (MIGS), Minimum Information about a Metagenomic Sequence (MIMS), and Minimum Information about a Marker Gene Sequence (MIMARKS). Community-wide adoption of MISAG and MIMAG will facilitate more robust comparative genomic analyses of bacterial and archaeal diversity.
The health of sharks is linked with emergent properties of their microbiome. Most marine organisms have mucus overlying the skin, but sharks have dermal denticles that protrude above the mucus. We characterized the microbiome from the skin of the common thresher shark (Alopias vulpinus) to investigate the structure and composition of the skin microbiome. An average of 618,812 reads per metagenomic library contained open reading frames (80.9% ± S.D. 0.44%), and 7.6 to 12.8% matched known protein sequences. Genera distinguishing the A. vulpinus microbiome from the water column included Pseudoalteromonas (12.8% ± 4.7 of sequences), Erythrobacter (5.3% ± 0.5), Limnobacter (4.1% ± 1.4), and Idiomarina (4.2% ± 1.2), and gene pathways included cobalt, zinc, and cadmium resistance (2.2% ± 0.1); iron acquisition (1.2% ± 0.1); ton/tol transport (1.3% ± 0.08); and n-Phenylalkanoic acid degradation (0.9% ± 0.08). Taxonomic β-diversity of the shark (77.6) was higher than that of the water column (70.6) and a reference host microbiome (algae: 71.5), and functional β-diversity of the shark (87.4) was similar to that of water (82.9) and algae (87.5). We conclude that the A. vulpinus skin microbiome is influenced by filtering processes that include biochemical and biophysical components of the shark skin and that result in a highly structured microbiome, confirmed by high β-diversity.
With the constant improvement in cost-efficiency and quality of Next Generation Sequencing technologies, shotgun-sequencing approaches such as metagenomics have become the methods of choice for studying and classifying microorganisms from various habitats. The production of data has dramatically increased over the past years, and processing and analysis steps are increasingly becoming a bottleneck. Limiting factors are partly the availability of computational resources, but mainly the bioinformatics expertise in establishing and applying appropriate processing and analysis pipelines. Fortunately, a large diversity of specialized software tools is now available. Nevertheless, choosing the most appropriate methods for answering specific biological questions can be rather challenging, especially for non-bioinformaticians. In order to provide a comprehensive overview and guide for the microbiological scientific community, we assessed the most common and freely available metagenome assembly tools with respect to their output statistics, their sensitivity for low abundant community members and variability in resulting community profiles as well as their ease-of-use. In contrast to the highly anticipated "Critical Assessment of Metagenomic Interpretation" (CAMI) challenge, which uses general mock community-based assembler comparison, we here tested assemblers on real Illumina metagenome sequencing data from natural communities of varying complexity sampled from forest soil and algal biofilms. Our observations clearly demonstrate that different assembly tools can prove optimal, depending on the sample type, available computational resources and, most importantly, the specific research goal. 
In addition, we present detailed descriptions of the underlying principles and pitfalls of publicly available assembly tools from a microbiologist’s perspective, and provide guidance regarding the user-friendliness, sensitivity and reliability of the resulting phylogenetic profiles.
Marine microbes mediate key ecological processes in kelp forest ecosystems and interact with macroalgae. Pelagic and biofilm-associated microbes interact with macroalgal propagules at multiple stages of recruitment, yet these interactions have not been described for Macrocystis pyrifera. Here we investigate the influence of microbes from coastal environments on recruitment of giant kelp, M. pyrifera. Through repeated laboratory experiments, we tested the effects of altered pelagic microbial abundance on the settlement and development of the microscopic propagules of M. pyrifera during recruitment. M. pyrifera zoospores were reared in laboratory microcosms exposed to environmental microbial communities from seawater during the complete haploid stages of the kelp recruitment cycle, from zoospore release, through zoospore settlement, to gametophyte germination and development. We altered the microbial abundance states differentially in three independent experiments with repeated trials, where microbes were (a) present or absent in seawater, (b) altered in community composition, and (c) altered in abundance. Within the third experiment, we also tested the effect of nearshore versus offshore microbial communities on the macroalgal propagules. Distinct pelagic microbial communities were collected from two southern California temperate environments reflecting contrasting intensity of human influence, the nearshore Point Loma kelp forest and the offshore Santa Catalina Island kelp forest. The Point Loma kelp forest is a highly impacted coastal region adjacent to the populous San Diego Bay, whereas the kelp forest at Catalina Island is a low-impact region of the Channel Islands, 40 km off the southern California coast, and is adjacent to a marine protected area. 
Kelp gametophytes reared with nearshore Point Loma microbes showed lower survival, reduced growth, and deteriorated morphology compared to gametophytes reared with the offshore Catalina Island microbial community, and these effects were magnified under high microbial abundances. Reducing the abundance of Point Loma microbes restored M. pyrifera propagule success. Yet an intermediate microbial abundance was optimal for kelp propagules reared with Catalina Island microbes, suggesting that microbes also have a beneficial influence on kelp. Our study shows that pelagic microbes from nearshore and offshore environments differentially influence kelp propagule success, which has significant implications for kelp recruitment and kelp forest ecosystem health.
While paradigms of macroecology are challenged by the high rates of reproduction, dispersal and horizontal gene exchange of bacterial communities, environmental DNA sequencing makes community profiles accessible. We test fundamental hypotheses of macroecological theories, showing that both taxonomic and functional classifications have distinct biogeographical variation across distance and environments depending on trophic composition. The studies examined span the global oceans. Taxonomic and functional profiles were obtained from metagenomes and were compared across oceanographic regions and tested for patterns of co-occurrence. The influences of sampling method (filter size), environmental variables and geographical distribution were compared with distance-based linear models to test predictors of taxonomic and functional composition. Macroecological drivers were compared with bacterial community structure to test four biogeographical hypotheses: (1) no biogeographical patterns, (2) community structure reflects environmental dissimilarity, (3) community structure reflects distance, (4) community structure reflects environment and distance. Bacterial families were clustered into four trophic groups (phototrophic, oligotrophic, eutrophic and copiotrophic) by changes in abundance across oceanographic regions and co-occurrence with core functions. Changes in community composition were best modelled by longitude for free-living communities and dissolved oxygen for mixed communities of free-living and particle-associated bacteria. Both microhabitat and community assignment had an impact on biogeographical patterns, with taxonomic compositions following our hypotheses 2 and 4 and functional gene compositions following hypotheses 3 and 4. We described four trophic groups, adding to the current dichotomy of the classification of marine bacteria as oligotrophic or copiotrophic. 
Taxonomic composition of mixed communities reflected environmental differences but not geographical distance, whereas functional gene composition in free-living communities was independent of environmental dissimilarity and reflected geographical distance. Patterns of biogeography in bacterial communities differed depending on the description of taxa or function. Therefore, we developed a new paradigm for bacterial ecology which shows that some aspects of bacterial evolution depend on trophic complexity, history and current environmental conditions.
The Pathosystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center (https://www.patricbrc.org). Recent changes to PATRIC include a redesign of the web interface and some new services that provide users with a platform that takes them from raw reads to an integrated analysis experience. The redesigned interface allows researchers direct access to tools and data, and the emphasis has changed to user-created genome-groups, with detailed summaries and views of the data that researchers have selected. Perhaps the biggest change has been the enhanced capability for researchers to analyze their private data and compare it to the available public data. Researchers can assemble their raw sequence reads and annotate the contigs using RASTtk. PATRIC also provides services for RNA-Seq, variation, model reconstruction and differential expression analysis, all delivered through an updated private workspace. Private data can be compared by 'virtual integration' to any of PATRIC's public data. The number of genomes available for comparison in PATRIC has expanded to over 80 000, with a special emphasis on genomes with antimicrobial resistance data. PATRIC uses this data to improve both subsystem annotation and k-mer classification, and tags new genomes as having signatures that indicate susceptibility or resistance to specific antibiotics.