
PARTIE: A Partition Engine to Separate Metagenomics and Amplicon Projects in the Sequence Read Archive

Abstract

Motivation: The Sequence Read Archive (SRA) contains raw data from many different types of sequence projects. As of 2017, the SRA contained approximately ten petabases of DNA sequence (10¹⁶ bp). Annotations of the data are provided by the submitter, and mining the data in the SRA is complicated by both the amount of data and the detail within those annotations. Here, we introduce PARTIE, a partition engine optimized to differentiate sequence read data into metagenomic (random) and amplicon (targeted) sequence data sets.
Results: PARTIE subsamples reads from the sequencing file and calculates four different statistics: k-mer frequency, 16S abundance, prokaryotic- and viral-read abundance. These metrics are used to create a RandomForest decision tree to classify the sequencing data, and PARTIE provides mechanisms for both supervised and unsupervised classification. We demonstrate the accuracy of PARTIE for classifying SRA data, discuss the probable error rates in the SRA annotations, and introduce a resource assessing SRA data.
Availability: PARTIE and reclassified metagenome SRA entries are available from https://github.com/linsalrob/partie
Contact: redwards@mail.sdsu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Bioinformatics, YYYY, 0–0
Bioinformatics APPLICATIONS NOTE
Sequence Analysis
Pedro J. Torres¹, Robert A. Edwards¹,²,³ and Katelyn McNair²,*
¹Department of Biology, ²Computational Science Research Center, ³Department of Computer Science, San Diego State University, San Diego, CA 920182
*Katelyn McNair deprecate@gmail.com
Received on XXXXX; revised on XXXXX; accepted on XXXXX
1 Introduction
The combination of high-throughput sequencing technologies and advanced bioinformatics techniques is rapidly accelerating genomic and metagenomic analysis (Meyer et al., 2008; Aziz et al., 2008) and leading to the explosive growth of sequence data (Kodama et al., 2012; Cochrane et al., 2013). The NIH Sequence Read Archive (SRA) was started in 2009 and is the primary archive of high-throughput sequence data (National Center for Biotechnology Information, 2009). Sequence data were deposited into the SRA at more than 10 Tbp per day in 2016 (data from https://www.ncbi.nlm.nih.gov/sra/docs/sragrowth/).
Sequence data deposited in the SRA are necessarily dependent on the submitters for accurate classification. The SRA curators strive to capture appropriate metadata on the deposited sequences; however, annotations are not uniform or standardized, leading to a variety of ways to describe samples deposited in the databases. DNA sequencing has revolutionized microbial ecology (Silva et al., 2017); however, two orthogonal approaches are commonly used to explore the microbial universe: amplicon sequencing, where a part of a single gene (usually the 16S rRNA gene) is amplified and sequenced (Human Microbiome Project Consortium, 2012), and shotgun (random) metagenomics (Handelsman, 2004), where all the DNA is extracted and sequenced (Edwards, 2006; DeLong et al., 2006). The former provides a rapid, portable, and cheap method to identify the organisms in a sample, while the latter provides details about those organisms and the functions that they are performing (Dinsdale et al., 2008). Unfortunately, these two techniques, which provide different data sets and require different analyses, are often included under the “metagenomics” umbrella in the SRA.
We created the partition engine PARTIE to curate metagenomics data from the SRA into amplicon (targeted) and shotgun metagenomic (random) data sets. PARTIE analyzes four aspects of the sequence file: the unique k-mer frequency, the abundance of 16S rRNA sequences, and the prokaryotic- and viral-read abundance. We demonstrate the accuracy of PARTIE for classifying SRA data, discuss the probable error rates in the SRA annotations, and introduce a resource assessing SRA data.
© The Author(s) 2017. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Associate Editor: Prof. Alfonso Valencia
2 Methods
Three sequence databases were created: a 16S rRNA database (9,254 genes), a phage database (2,662 genomes), and a prokaryotic genome database (1,650 genomes). The 16S and prokaryotic databases were downloaded from the GenBank FTP site. The phage genomes were downloaded from the PHANTOME website.
The sra-toolkit’s fastq-dump program is used to extract the first 10,000 reads from the SRA file and to output the reads in FASTA format. These reads are aligned against the three databases described above using Bowtie2, and the percentage of reads that hit each database is calculated (Langmead and Salzberg, 2012). The percentage of “unique k-mers” is also calculated for each metagenome by using the program Jellyfish to find all k-mers (default, k=15) in the metagenome read subset, and counting those k-mers that appear 10 or fewer times (Marçais and Kingsford, 2011). This criterion relies on the observation that samples containing amplicon sequences have a high number of similar k-mers, resulting in a decrease in unique k-mer abundance. Conversely, samples containing shotgun metagenomic sequences have more random sequences, and thus a wider distribution of unique k-mers.
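As a rough illustration of the unique k-mer statistic, the sketch below counts k-mers in pure Python rather than with Jellyfish. It is not the PARTIE implementation. It assumes that the percentage is taken over distinct k-mers, that reverse complements are not merged, and that k-mers containing N are skipped; the text does not specify these details. The function name `unique_kmer_percent` and the toy reads are hypothetical.

```python
import random
from collections import Counter


def unique_kmer_percent(reads, k=15, max_count=10):
    """Percentage of distinct k-mers seen at most `max_count` times.

    Assumption: the denominator is the number of distinct k-mers.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # assumption: ignore ambiguous bases
                counts[kmer] += 1
    if not counts:
        return 0.0
    rare = sum(1 for c in counts.values() if c <= max_count)
    return 100.0 * rare / len(counts)


# Toy data (hypothetical): one read repeated 50 times mimics an amplicon
# library; 200 independent random reads mimic a shotgun library.
rng = random.Random(0)


def random_read(length=100):
    return "".join(rng.choice("ACGT") for _ in range(length))


amplicon_like = [random_read()] * 50
shotgun_like = [random_read() for _ in range(200)]

print("amplicon-like:", unique_kmer_percent(amplicon_like))
print("shotgun-like:", unique_kmer_percent(shotgun_like))
```

In this toy case the repeated read scores 0% because every k-mer occurs at least 50 times, while the random reads score close to 100%, which mirrors the separation between amplicon and shotgun samples described above.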
The four frequency traits (16S, phage, prokaryotic, unique k-mer) are calculated for each of the downloaded SRA metagenomes, along with the response type (Amplicon, Other, WGS). Initially, an unsupervised RandomForest using the R library (Breiman, 2001) was used to classify the data; we then pruned questionable data sets from the training data to generate a refined classification engine.
3 Discussion
PARTIE was first used to calculate the parameters for 211,787 SRA data sets in which the sequencing strategy was annotated by the submitter as either Amplicon (160,247 samples), WGS (44,651 samples) or a combined data set that was classified as “Other” (6,889 samples). The “Other” category is a combination of different sequencing library construction approaches where there are too few of any individual data sets to build a robust classifier for them (Supplementary Table 1).
The partition engine workflow begins by identifying all the potential metagenomes from the Sequence Read Archive. The SRA SQLite dumps from SRAdb (Zhu et al., 2013) are used to identify all potential metagenome sequences. We currently identify samples where the library source is “METAGENOMIC”, the study type is “METAGENOMICS”, or where the sample's scientific name can be expanded from microbiome or metagenome. We focus on correctly classifying the whole genome shotgun (WGS) sequencing data sets, and so we filter those to remove any in which the annotators identify the library strategy as AMPLICON or PCR. The relative contribution of each of the approaches is shown in Supplementary Fig. 1. Those metagenomes are downloaded using the sra-toolkit’s prefetch capability and the Aspera ascp client (National Center for Biotechnology Information, 2009).
The initial classification of these samples (Fig. 1) by the random forest resulted in a 5.4% out-of-bag error, with the most important predictor variables being the percent unique k-mer sequences and the percent 16S rRNA (Supplementary Fig. 2). The random forest also identified instrument type and read length as minor predictors of metagenome type. However, sequencing is unevenly distributed across machines: currently many more amplicon data sets are generated on the Illumina MiSeq and many more WGS data sets on the Illumina HiSeq 2000 (data not shown). Because this variable does not depend on the sequencing per se and is likely to change over time, it was excluded from the analysis.
Fig. 1. Scatter plot of percent 16S rRNA vs percent unique k-mer. The sequence source annotation was obtained directly from the Sequence Read Archive (SRA) database. Eighteen different sequence source annotations were lumped into the ‘Other’ category.
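The metagenome selection step described above can be sketched as a single SQL filter. The example runs against a small in-memory SQLite table, not the real SRAdb dump. The table name `sra`, the column names, the substring match on the scientific name, and the run accessions are assumptions made for illustration and may not match the SRAdb schema or the query PARTIE actually uses.

```python
import sqlite3

# Hypothetical schema: a single `sra` table with these five columns.
CANDIDATE_QUERY = """
SELECT run_accession
FROM sra
WHERE (upper(library_source) = 'METAGENOMIC'
       OR upper(study_type) = 'METAGENOMICS'
       OR lower(scientific_name) LIKE '%microbiome%'
       OR lower(scientific_name) LIKE '%metagenome%')
  AND upper(coalesce(library_strategy, '')) NOT IN ('AMPLICON', 'PCR')
ORDER BY run_accession
"""


def candidate_wgs_runs(conn):
    """Return runs that look like metagenomes and are not annotated as
    AMPLICON or PCR."""
    return [row[0] for row in conn.execute(CANDIDATE_QUERY)]


# Toy rows (made-up accessions) covering each branch of the filter.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sra (run_accession TEXT, library_source TEXT, "
    "study_type TEXT, library_strategy TEXT, scientific_name TEXT)"
)
conn.executemany(
    "INSERT INTO sra VALUES (?, ?, ?, ?, ?)",
    [
        ("RUN0001", "METAGENOMIC", "Other", "WGS", "marine metagenome"),
        ("RUN0002", "METAGENOMIC", "Metagenomics", "AMPLICON", "gut metagenome"),
        ("RUN0003", "GENOMIC", "Whole Genome Sequencing", "WGS", "Escherichia coli"),
        ("RUN0004", "GENOMIC", "Other", None, "human skin microbiome"),
        ("RUN0005", "GENOMIC", "Metagenomics", "PCR", "soil"),
        ("RUN0006", "GENOMIC", "Metagenomics", "WGS", "soil"),
    ],
)

print(candidate_wgs_runs(conn))
```

The `coalesce` call keeps runs whose library strategy is missing, since a NULL compared with `NOT IN` would otherwise drop them silently; whether such runs should be kept is a design choice the text does not settle.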
It was apparent from the data that the classification could be improved through manual curation. Since the fraction of unique k-mers was the most important predictor, a threshold value was calculated to reclassify each metagenome solely on unique k-mer abundance. When the k-mer frequency data were plotted as a histogram, a distinct bimodal distribution was apparent (Supplementary Fig. 3). The centroids of the two peaks were identified using k-means clustering (Hartigan, 1975), giving a midpoint value of 47%, which was rounded to 50% for stringency and simplicity. Using this revised calculation, several questionable data sets were omitted from the training data. The amplicon set was decreased by 3,502 data sets to 156,745; the WGS set was decreased by 7,032 data sets to 37,619; and the “Other” set was reduced by 7. This robust training set was used to build an automatic classification and partition engine that had a 2.45% error rate (Supplementary Table 2).
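The threshold derivation can be illustrated with a one-dimensional, two-cluster k-means written in plain Python. This is a sketch under stated assumptions, not the authors' code: the values are invented toy percentages, the centroids are initialized at the minimum and maximum, and the rule that a sample at or above the threshold counts as WGS (including the tie-breaking at exactly 50%) is our reading of the text. The names `two_means_1d` and `label_by_unique_kmers` are hypothetical.

```python
def two_means_1d(values, max_iter=100):
    """Two-cluster k-means on one dimension.

    Centroids start at the minimum and maximum (an assumption); each pass
    assigns values to the nearer centroid and recomputes the means.
    """
    lo, hi = min(values), max(values)
    for _ in range(max_iter):
        split = (lo + hi) / 2
        left = [v for v in values if v <= split]
        right = [v for v in values if v > split]
        if not left or not right:
            break
        new_lo = sum(left) / len(left)
        new_hi = sum(right) / len(right)
        if (new_lo, new_hi) == (lo, hi):
            break  # converged
        lo, hi = new_lo, new_hi
    return lo, hi


def label_by_unique_kmers(percent_unique, threshold=50.0):
    """Assumed rule: low unique k-mer percentage suggests amplicon data."""
    return "WGS" if percent_unique >= threshold else "Amplicon"


# Toy bimodal data (hypothetical percentages of unique k-mers).
toy_percentages = [5, 8, 10, 12, 15, 80, 85, 90, 95]
centroids = two_means_1d(toy_percentages)
midpoint = sum(centroids) / 2
print(centroids, midpoint)
print(label_by_unique_kmers(12), label_by_unique_kmers(88))
```

A data set whose submitter annotation disagrees with `label_by_unique_kmers` would be a candidate for removal from the training data, which is how we understand the pruning step described above.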
The PARTIE analysis package is being used to routinely reclassify data sets from the SRA. Over 270,000 data sets had been reclassified as of March 1st 2017, and an up-to-date list is available at https://github.com/linsalrob/partie/. The number of data sets of each type that were reclassified is shown in the matrix in Supplementary Table 3. One fifth of the random sequencing data sets have been reclassified as amplicon projects. We also recommend examining the four calculated parameters, as there are cases in which both WGS and amplicon sequencing are used (e.g., Run ID ERR162903), and no automatic partition approach will correctly classify such a library.
Conflict of Interest: none declared.
References
Aziz,R.K. et al. (2008) The RAST Server: rapid annotations using subsystems technology. BMC Genomics, 9, 75.
Breiman,L. (2001) Random forests. Mach. Learn., 45, 5–32.
Cochrane,G. et al. (2013) Facing growth in the European Nucleotide Archive. Nucleic Acids Res., 41, D30–35.
DeLong,E.F. et al. (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science, 311, 496–503.
Dinsdale,E.A. et al. (2008) Functional metagenomic profiling of nine biomes. Nature, 452, 629–632.
Dinsdale,E.A. et al. (2013) Multivariate analysis of functional metagenomes. Front. Genet., 4, 41.
Edwards,R. (2006) Random Community Genomics. Whitepaper.
Handelsman,J. (2004) Metagenomics: application of genomics to uncultured microorganisms. Microbiol. Mol. Biol. Rev., 68, 669–685.
Hartigan,J.A. (1975) Clustering algorithms. Wiley.
Human Microbiome Project Consortium (2012) A framework for human microbiome research. Nature, 486, 215–221.
Kodama,Y. et al. (2012) The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res., 40, D54–6.
Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357–359.
Marçais,G. and Kingsford,C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27, 764–770.
Meyer,F. et al. (2008) The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics, 9, 386.
National Center for Biotechnology Information (2009) SRA Handbook. Bethesda, MD: National Center for Biotechnology Information.
Silva,G.G.Z. et al. (2017) Ecological Implications of Metagenomics Data Analysis. In review.
Zhu,Y. et al. (2013) SRAdb: query and use public next-generation sequencing data from within R. BMC Bioinformatics, 14, 19.
... The datasets used to train such an algorithm would also need to be carefully curated to avoid contaminated or low-quality sequences, or sequences collected using very different methods. For example, in the SRA database, both amplicon sequencing and whole-genome shotgun sequencing can be categorised as metagenomic data [41]. This can be an issue as they provide different information that could cause issues for an algorithm trying to compare them. ...
... This can be an issue as they provide different information that could cause issues for an algorithm trying to compare them. An algorithm can successfully differentiate between the two types of data; however, as the partition engine, PARTIE, has sorted many sequences from the SRA into amplicon and whole genome sequencing datasets [41]. Any attempts to automatically curate metagenomic data from the SRA database will need to take the broad definition of metagenomic data into account when collecting the training data. ...
Article
Full-text available
The microbiome is an essential part of most ecosystems. It was originally studied mostly through culturing but relatively few microbes can be cultured, so much of the microbiome was left unexplored. The emergence of metagenomic sequencing techniques changed that and allowed the study of microbiomes from all sorts of habitats. Metagenomic sequencing also allowed for a more thorough exploration of prophages, viruses that integrate into bacterial genomes, and how they benefit their hosts. One issue with using open-access metagenomic data is that sequences added to databases often have little to no metadata to work with, so finding enough sequences can be difficult. Many metagenomes have been manually curated but this is a time-consuming process and relies heavily on the uploader to be accurate and thorough when filling in metadata fields and the curators to be working with the same ontologies. Using algorithms to automatically sort metagenomes based on either the taxonomic profile or the functional profile may be a viable solution to the issues with manually curated metagenomes, but it requires that the algorithm is trained on carefully curated datasets and using the most informative profile possible in order to minimize errors.
... A similar study by Parks et al. [12], using more than 1500 public metagenomes, enlarged the phylogenetic diversity of bacterial and archaeal genome trees by over 30%. However, mining metagenomes from public repositories can be daunting due to unavailable, misleading, or incomplete metadata [13]. This fact signi cantly contributes to the underutilization of publicly available metagenomes by the scienti c community. ...
... Metadata from SRA was retrieved using the following: (i) List of sample identi ers (SRA run IDs) labeled as whole-genome sequencing (WGS) or amplicon sequencing was downloaded from PARTIE (13) (https://github.com/linsalrob/partie). PARTIE is a Machine Learning model based on supervised and unsupervised classi cation, and it was optimized to differentiate sequence read data into WGS and amplicon sequence data sets; (ii) Sample identi ers labeled as WGS were extracted from the list; (iii) Metadata of WGS samples was retrieved using SRAdb R package. ...
... Metadata available in repositories, such as the SRA, are not standardized, creating an overly complex environment for sample reanalysis. As a result, the data is underutilized (15), which is not in the public interest. The difficulty associated with properly accessing metagenome data has led to initiatives such as the Genomic Standards Consortium (16), and the BioProject and BioSample project (17), which defined the minimum necessary information about a metagenomic sample (18). ...
... Data retrieval and non-whole genome sequencing removal. For the SRA repository, the first step to retrieve the metadata and remove non-WGS data was to use the PARTIE tool (15). This tool classifies the samples, using a predictive model induced by machine learning, as either WGS or amplicon sequencing, and provides the sample identification runs (i.e. ...
... Metadata available in repositories, such as the SRA, are not standardized, creating an overly complex environment for sample reanalysis. As a result, the data is underutilized (15), which is not in the public interest. The difficulty associated with properly accessing metagenome data has led to initiatives such as the Genomic Standards Consortium (16), and the BioProject and BioSample project (17), which defined the minimum necessary information about a metagenomic sample (18). ...
... Data retrieval and non-whole genome sequencing removal. For the SRA repository, the first step to retrieve the metadata and remove non-WGS data was to use the PARTIE tool (15). This tool classifies the samples, using a predictive model induced by machine learning, as either WGS or amplicon sequencing, and provides the sample identification runs (i.e. ...
Article
Metagenomics became a standard strategy to comprehend the functional potential of microbial communities, including the human microbiome. Currently, the number of metagenomes in public repositories is increasing exponentially. The Sequence Read Archive (SRA) and the MG-RAST are the two main repositories for metagenomic data. These databases allow scientists to reanalyze samples and explore new hypotheses. However, mining samples from them can be a limiting factor, since the metadata available in these repositories is often misannotated, misleading, and decentralized, creating an overly complex environment for sample reanalysis. The main goal of the HumanMetagenomeDB is to simplify the identification and use of public human metagenomes of interest. HumanMetagenomeDB version 1.0 contains metadata of 69 822 metagenomes. We standardized 203 attributes, based on standardized ontologies, describing host characteristics (e.g. sex, age and body mass index), diagnosis information (e.g. cancer, Crohn's disease and Parkinson), location (e.g. country, longitude and latitude), sampling site (e.g. gut, lung and skin) and sequencing attributes (e.g. sequencing platform, average length and sequence quality). Further, HumanMetagenomeDB version 1.0 metagenomes encompass 58 countries, 9 main sample sites (i.e. body parts), 58 diagnoses and multiple ages, ranging from just born to 91 years old. The HumanMetagenomeDB is publicly available at https://webapp.ufz.de/hmgdb/.
... 2020-11-30) was used to address niche association. Further, SEARCH-SRA (Stewart et al., 2015;Torres et al., 2017;Towns et al., 2014) online portal was used to interrogate the SRA database (246,329 records) by aligning metagenomic datasets to our MAGs. Only records that mapped 10 or more reads from the SRA collection were further inspected. ...
Article
Full-text available
Loss of basic utilities, such as drinking water and electricity distribution, were sustained for months in the aftermath of Hurricane Maria's (HM) landfall in Puerto Rico (PR) in September 2017. The goal of this study was to assess if there was deterioration in biological quality of drinking water due to these disruptions. This study characterized the microbial composition of drinking water following HM across nine drinking water systems (DWSs) in PR and utilized an extended temporal sampling campaign to determine if changes in the drinking water microbiome were indicative of HM associated disturbance followed by recovery. In addition to monitoring water chemistry, the samples were subjected to culture independent targeted and non-targeted microbial analysis including quantitative PCR (qPCR) and genome-resolved metagenomics. The qPCR results showed that residual disinfectant was the major driver of bacterial concentrations in tap water with marked decrease in concentrations from early to late sampling timepoints. While Mycobacterium avium and Pseudomonas aeruginosa were not detected in any sampling locations and timepoints, genetic material from Leptospira and Legionella pneumophila were transiently detected in a few sampling locations. The majority of metagenome assembled genomes (MAGs) recovered from these samples were not associated with pathogens and were consistent with bacterial community members routinely detected in DWSs. Further, whole metagenome-level comparisons between drinking water samples collected in this study with samples from other full-scale DWS indicated no significant deviation from expected community membership of the drinking water microbiome. Overall, our results suggest that disruptions due to HM did not result in significant and sustained deterioration of biological quality of drinking water at our study sites.
... (Schmieder and Edwards, 2011). PARTIE (Torres et al., 2017) was used to remove amplicon sequence files mistakenly annotated as WGS. Sequences longer than 600 bases (the maximum read length of Illumina MiSeq and HiSeq) were removed, because they were likely artifacts. ...
Article
Full-text available
The recent prevalence of high-throughput sequencing has been producing numerous prokaryotic community structure datasets. While the trait-based approach is useful to interpret those datasets from ecological perspectives, available trait information is biased towards culturable prokaryotes, especially those of clinical and public health relevance, and thus may not represent the breadth of microbiota found across many of Earth environments. To facilitate habitat-based analysis free of such bias, here we report a ready-to-use prokaryotic habitat database, ProkAtlas. ProkAtlas comprehensively links 16S rRNA gene sequences to prokaryotic habitats, using public shotgun metagenome datasets. We also developed a computational pipeline for habitat-based analysis of given prokaryotic community structures. After confirmation of the method effectiveness using 16S rRNA gene sequence datasets from individual genomes and the Earth Microbiome Project, we showed its validness and effectiveness in drawing ecological insights by applying it to six empirical prokaryotic community datasets from soil, aquatic, and human gut samples.
Article
Traditionally, the generation and use of biodiversity data and their associated specimen objects have been primarily the purview of individuals and small research groups. While deposition of data and specimens in herbaria and other repositories has long been the norm, throughout most of their history, these resources have been accessible only to a small community of specialists. Through recent concerted efforts, primarily at the level of national and international governmental agencies over the last two decades, the pace of biodiversity data accumulation has accelerated, and a wider array of biodiversity scientists has gained access to this massive accumulation of resources, applying them to an ever‐widening compass of research pursuits. We review how these new resources and increasing access to them are affecting the landscape of biodiversity research in plants today, focusing on new applications across evolution, ecology, and other fields that have been enabled specifically by the availability of these data and the global scope that was previously beyond the reach of individual investigators. We give an overview of recent advances organized along three lines: broad‐scale analyses of distributional data and spatial information, phylogenetic research circumscribing large clades with comprehensive taxon sampling, and data sets derived from improved accessibility of biodiversity literature. We also review synergies between large data resources and more traditional data collection paradigms, describe shortfalls and how to overcome them, and reflect on the future of plant biodiversity analyses in light of increasing linkages between data types and scientists in our field.
Article
Full-text available
Phages are generally described as species specific or even strain specific, implying an inherent limitation for some to be maintained and spread in diverse bacterial communities. Moreover, phage isolation and host range determination rarely consider the phage ecological context, likely biasing our notion on phage specificity. Here we isolated and characterized a novel group of six promiscuous phages, named Atoyac, existing in rivers and sewage by using a diverse collection of over 600 bacteria retrieved from the same environments as potential hosts. These podophages isolated from different regions in Mexico display a remarkably broad host range, infecting bacteria from six genera: Aeromonas, Pseudomonas, Yersinia, Hafnia, Escherichia, and Serratia Atoyac phage genomes are ∼42 kb long and highly similar to each other, but not to those currently available in genome and metagenome public databases. Detailed comparison of the phages' efficiency of plating (EOP) revealed variation among bacterial genera, implying a cost associated with infection of distant hosts, and between phages, despite their sequence similarity. We show, through experimental evolution in single or alternate hosts of different genera, that efficiency of plaque production is highly dynamic and tends toward optimization in hosts rendering low plaque formation. However, adaptation to distinct hosts differed between similar phages; whereas one phage optimized its EOP in all tested hosts, the other reduced plaque production in one host, suggesting that propagation in multiple bacteria may be key to maintain promiscuity in some viruses. Our study expands our knowledge of the virosphere and uncovers bacterium-phage interactions overlooked in natural systems.IMPORTANCE In natural environments, phages coexist and interact with a broad variety of bacteria, posing a conundrum for narrow-host-range phage maintenance in diverse communities. 
This context is rarely considered in the study of host-phage interactions, typically focused on narrow-host-range viruses and their infectivity in target bacteria isolated from sources distinct to where the phages were retrieved from. By studying phage-host interactions in bacteria and viruses isolated from river microbial communities, we show that novel phages with promiscuous host range encompassing multiple bacterial genera can be found in the environment. Assessment of hundreds of interactions in diverse hosts revealed that similar phages exhibit different infection efficiency and adaptation patterns. Understanding host range is fundamental in our knowledge of bacterium-phage interactions and their impact on microbial communities. The dynamic nature of phage promiscuity revealed in our study has implications in different aspects of phage research such as horizontal gene transfer or phage therapy.
Preprint
Full-text available
Phages are generally described as species- or even strain-specific viruses, implying an inherent limitation for some to be maintained and spread in diverse bacterial communities. Moreover, phage isolation and host range determination rarely consider the phage ecological context, likely biasing our notion on phage specificity. Here we identified and characterized a novel group of promiscuous phages existing in rivers by using diverse bacteria isolated from the same samples, and then used this biological system to investigate infection dynamics in distantly related hosts. We assembled a diverse collection of over 600 native bacterial strains and used them to isolate six podophages, named Atoyac, from different geographic origin and capable of infecting six genera in the Gammaproteobacteria. Atoyac phage genomes are highly similar to each other but not to those currently available in the genome and metagenome public databases. Detailed comparison of the phage’s infectivity in diverse hosts and trough hundreds of interactions revealed variation in plating efficiency amongst bacterial genera, implying a cost associated with infection of distant hosts, and between phages, despite their sequence similarity. We show, through experimental evolution in single or alternate hosts of different genera, that plaque production efficiency is highly dynamic and tends towards optimization in hosts rendering low plaque formation. Complex adaptation outcomes observed in the evolution experiments differed between highly similar phages and suggest that propagation in multiple hosts may be key to maintain promiscuity in some viruses. Our study expands our knowledge of the virosphere and uncovers bacteria-phage interactions overlooked in natural systems. Importance In natural environments, phages co-exist and interact with a broad variety of bacteria, posing a conundrum for narrow-host-range phages maintenance in diverse communities. 
This context is rarely considered in the study of host-phage interactions, typically focused on narrow-host-range viruses and their infectivity in target bacteria isolated from sources distinct to where the phages were retrieved from. By studying phage-host interactions in bacteria and viruses isolated from river microbial communities, we show that novel phages with promiscuous host range encompassing multiple bacterial genera can be found in the environment. Assessment of hundreds of interactions in diverse hosts revealed that similar phages exhibit different infection efficiency and adaptation patterns. Understanding host range is fundamental in our knowledge of bacteria-phage interactions and their impact in microbial communities. The dynamic nature of phage promiscuity revealed in our study has implications in different aspects of phage research such as horizontal gene transfer or phage therapy.
Article
Our emerging view of the gut microbiome largely focuses on bacteria, while less is known about other microbial components, such as bacteriophages (phages). Though phages are abundant in the gut, very few phages have been isolated from this ecosystem. Here, we report the genomes of 27 phages from the United States and Bangladesh that infect the prevalent human gut bacterium Bacteroides thetaiotaomicron. These phages are mostly distinct from previously sequenced phages with the exception of two, which are crAss-like phages. We compare these isolates to existing human gut metagenomes, revealing similarities to previously inferred phages and additional unexplored phage diversity. Finally, we use host tropisms of these phages to identify alleles of phage structural genes associated with infectivity. This work provides a detailed view of the gut’s “viral dark matter” and a framework for future efforts to further integrate isolation- and sequencing-focused efforts to understand gut-resident phages.
Article
Full-text available
Metagenomics is a primary tool for the description of microbial and viral communities. The sheer magnitude of the data generated in each metagenome makes identifying key differences in the function and taxonomy between communities difficult to elucidate. Here we discuss the application of seven different data mining and statistical analyses by comparing and contrasting the metabolic functions of 212 microbial metagenomes within and between 10 environments. Not all approaches are appropriate for all questions, and researchers should decide which approach addresses their questions. This work demonstrated the use of each approach: for example, random forests provided a robust and enlightening description of both the clustering of metagenomes and the metabolic processes that were important in separating microbial communities from different environments. All analyses identified that the presence of phage genes within the microbial community was a predictor of whether the microbial community was host-associated or free-living. Several analyses identified the subtle differences that occur with environments, such as those seen in different regions of the marine environment.
Background: The Sequence Read Archive (SRA) is the largest public repository of sequencing data from the next generation of sequencing platforms including Illumina (Genome Analyzer, HiSeq, MiSeq, etc.), Roche 454 GS System, Applied Biosystems SOLiD System, Helicos Heliscope, PacBio RS, and others. Results: SRAdb is an attempt to make queries of the metadata associated with SRA submission, study, sample, experiment and run more robust and precise, and make access to sequencing data in the SRA easier. We have parsed all the SRA metadata into a SQLite database that is routinely updated and can be easily distributed. The SRAdb R/Bioconductor package then utilizes this SQLite database for querying and accessing metadata. Full text search functionality makes querying metadata very flexible and powerful. Fastq files associated with query results can be downloaded easily for local analysis. The package also includes an interface from R to a popular genome browser, the Integrated Genomics Viewer. Conclusions: The SRAdb Bioconductor package provides a convenient and integrated framework to query and access SRA metadata quickly and powerfully from within R.
The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena/) collects, maintains and presents comprehensive nucleic acid sequence and related information as part of the permanent public scientific record. Here, we provide brief updates on ENA content developments and major service enhancements in 2012 and describe in more detail two important areas of development and policy that are driven by ongoing growth in sequencing technologies. First, we describe the ENA data warehouse, a resource for which we provide a programmatic entry point to integrated content across the breadth of ENA. Second, we detail our plans for the deployment of CRAM data compression technology in ENA.
A variety of microbial communities and their genes (the microbiome) exist throughout the human body, with fundamental roles in human health and disease. The National Institutes of Health (NIH)-funded Human Microbiome Project Consortium has established a population-scale framework to develop metagenomic protocols, resulting in a broad range of quality-controlled resources and data including standardized methods for creating, processing and interpreting distinct types of high-throughput metagenomic data available to the scientific community. Here we present resources from a population of 242 healthy adults sampled at 15 or 18 body sites up to three times, which have generated 5,177 microbial taxonomic profiles from 16S ribosomal RNA genes and over 3.5 terabases of metagenomic sequence so far. In parallel, approximately 800 reference strains isolated from the human body have been sequenced. Collectively, these data represent the largest resource describing the abundance and variety of the human microbiome, while providing a framework for current and future studies.
New generation sequencing platforms are producing data with significantly higher throughput and lower cost. A portion of this capacity is devoted to individual and community scientific projects. As these projects reach publication, raw sequencing datasets are submitted into the primary next-generation sequence data archive, the Sequence Read Archive (SRA). Archiving experimental data is the key to the progress of reproducible science. The SRA was established as a public repository for next-generation sequence data as a part of the International Nucleotide Sequence Database Collaboration (INSDC). INSDC is composed of the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ). The SRA is accessible at www.ncbi.nlm.nih.gov/sra from NCBI, at www.ebi.ac.uk/ena from EBI and at trace.ddbj.nig.ac.jp from DDBJ. In this article, we present the content and structure of the SRA and report on updated metadata structures, submission file formats and supported sequencing platforms. We also briefly outline our various responses to the challenge of explosive data growth.
Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm. We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution. The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.
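The k-mer counting task described above (tallying every substring of length k) can be illustrated with a minimal sketch. This is not the Jellyfish algorithm: it is a single-threaded Python dictionary count, assuming the sequence fits in memory, and the function name `count_kmers` is hypothetical.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer (substring of length k) in a sequence.

    Illustrative only: a plain in-memory hash table, not the multithreaded
    lock-free table that Jellyfish uses.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers;
    # if the sequence is shorter than k, the range is empty.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = count_kmers("ACGTACGT", 3)
```

Here `counts` maps each 3-mer of the toy sequence to its number of occurrences; a frequency profile of this kind is the sort of statistic a k-mer based classifier would consume.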
Microbial activities shape the biogeochemistry of the planet and macroorganism health. Determining the metabolic processes performed by microbes is important both for understanding and for manipulating ecosystems (for example, disruption of key processes that lead to disease, conservation of environmental services, and so on). Describing microbial function is hampered by the inability to culture most microbes and by high levels of genomic plasticity. Metagenomic approaches analyse microbial communities to determine the metabolic processes that are important for growth and survival in any given environment. Here we conduct a metagenomic comparison of almost 15 million sequences from 45 distinct microbiomes and, for the first time, 42 distinct viromes and show that there are strongly discriminatory metabolic profiles across environments. Most of the functional diversity was maintained in all of the communities, but the relative occurrence of metabolisms varied, and the differences between metagenomes predicted the biogeochemical conditions of each environment. The magnitude of the microbial metabolic capabilities encoded by the viromes was extensive, suggesting that they serve as a repository for storing and sharing genes among their microbial hosts and influence global evolutionary and metabolic processes.
Random community genomes (metagenomes) are now commonly used to study microbes in different environments. Over the past few years, the major challenge associated with metagenomics shifted from generating to analyzing sequences. High-throughput, low-cost next-generation sequencing has provided access to metagenomics to a wide range of researchers. A high-throughput pipeline has been constructed to provide high-performance computing to all researchers interested in using metagenomics. The pipeline produces automated functional assignments of sequences in the metagenome by comparing both protein and nucleotide databases. Phylogenetic and functional summaries of the metagenomes are generated, and tools for comparative metagenomics are incorporated into the standard views. User access is controlled to ensure data privacy, but the collaborative environment underpinning the service provides a framework for sharing datasets between multiple users. In the metagenomics RAST, all users retain full control of their data, and everything is available for download in a variety of formats. The open-source metagenomics RAST service provides a new paradigm for the annotation and analysis of metagenomes. With built-in support for multiple data sources and a back end that houses abstract data types, the metagenomics RAST is stable, extensible, and freely available to all researchers. This service has removed one of the primary bottlenecks in metagenome sequence analysis - the availability of high-performance computing for annotating the data. http://metagenomics.nmpdr.org.
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ∗∗∗, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
Metagenomics (also referred to as environmental and community genomics) is the genomic analysis of microorganisms by direct extraction and cloning of DNA from an assemblage of microorganisms. The development of metagenomics stemmed from the ineluctable evidence that as-yet-uncultured microorganisms represent the vast majority of organisms in most environments on earth. This evidence was derived from analyses of 16S rRNA gene sequences amplified directly from the environment, an approach that avoided the bias imposed by culturing and led to the discovery of vast new lineages of microbial life. Although the portrait of the microbial world was revolutionized by analysis of 16S rRNA genes, such studies yielded only a phylogenetic description of community membership, providing little insight into the genetics, physiology, and biochemistry of the members. Metagenomics provides a second tier of technical innovation that facilitates study of the physiology and ecology of environmental microorganisms. Novel genes and gene products discovered through metagenomics include the first bacteriorhodopsin of bacterial origin; novel small molecules with antimicrobial activity; and new members of families of known proteins, such as an Na(+)(Li(+))/H(+) antiporter, RecA, DNA polymerase, and antibiotic resistance determinants. Reassembly of multiple genomes has provided insight into energy and nutrient cycling within the community, genome structure, gene function, population genetics and microheterogeneity, and lateral gene transfer among members of an uncultured community. The application of metagenomic sequence information will facilitate the design of better culturing strategies to link genomic analysis with pure culture studies.