
ProViDE: A software tool for accurate estimation of viral diversity in metagenomic samples


open access www.bioinformation.net Software
Volume 6(2)
ISSN 0973-2063 (online) 0973-8894 (print)
Bioinformation 6(2): 91-94 (2011) © 2011 Biomedical Informatics
Tarini Shankar Ghosh, Monzoorul Haque Mohammed, Dinakar Komanduri, Sharmila
Shekhar Mande*
Bio-Sciences Division, Innovation Labs, Tata Consultancy Services, 1 Software Units Layout, Hyderabad 500 081, Andhra Pradesh, India; Sharmila Shekhar
Mande - Email: sharmila@atc.tcs.com; *Corresponding author
Received March 04, 2011; Accepted March 07, 2011; Published March 26, 2011
Selected publications from Asia Pacific Bioinformatics Network (APBioNet) Ninth International Conference on Bioinformatics (InCoB 2010), Tokyo, Japan,
26-28 September 2010.
Abstract:
Given the absence of universal marker genes in the viral kingdom, researchers typically use BLAST (with stringent E-values) for taxonomic classification of viral
metagenomic sequences. Since the majority of metagenomic sequences originate from hitherto unknown viral groups, using stringent E-values results in most
sequences remaining unclassified. Conversely, using less stringent E-values results in a high number of incorrect taxonomic assignments. The SOrt-ITEMS
algorithm provides an approach to address these issues. Based on alignment parameters, SOrt-ITEMS follows an elaborate work-flow for assigning reads
originating from hitherto unknown archaeal/bacterial genomes. In SOrt-ITEMS, alignment parameter thresholds were generated by observing patterns of sequence
divergence within and across various taxonomic groups belonging to the bacterial and archaeal kingdoms. However, many taxonomic groups within the viral kingdom
lack a typical Linnaean-like taxonomic hierarchy. In this paper, we present ProViDE (Program for Viral Diversity Estimation), an algorithm that uses a customized
set of alignment parameter thresholds, specifically suited for viral metagenomic sequences. These thresholds capture the pattern of sequence divergence and the
non-uniform taxonomic hierarchy observed within/across various taxonomic groups of the viral kingdom. Validation results indicate that the percentage of 'correct'
assignments by ProViDE is around 1.7 to 3 times higher than that by the widely used similarity-based method MEGAN. The misclassification rate of ProViDE is
around 3 to 19% (as compared to 5 to 42% by MEGAN), indicating significantly better assignment accuracy. The ProViDE software and a supplementary file
(containing supplementary figures and tables referred to in this article) are available for download from http://metagenomics.atc.tcs.com/binning/ProViDE/
Background:
A number of metagenomic studies have been initiated in the past 3-4 years to
explore, characterize and compare the taxonomic diversity of viruses present in
various environments [1, 2]. Besides cataloguing viral diversity, these studies
have identified several hitherto unknown groups of viruses that play a critical
role in transferring genes involved in a variety of metabolic functions [1, 3].
Given the absence of universal marker genes (such as 16S rRNA in bacteria /
archaea) in the viral kingdom, researchers typically use similarity-based
approaches like BLAST (with stringent E-values) for taxonomic classification
of viral metagenomic sequences. However, since a majority of sequences in
typical metagenomes originate from hitherto unknown viral groups, the use of
such stringent thresholds will result in a large fraction of sequences remaining
unclassified. Furthermore, using less stringent E-values (observed for BLAST
hits with poor alignment quality) will result in a high number of incorrect
taxonomic assignments. The recently published SOrt-ITEMS algorithm
provides an approach to address the above issues [4]. Based on alignment
parameters, an elaborate work-flow is followed by SOrt-ITEMS for assigning
reads originating from genomes of hitherto unknown archaeal/bacterial
organisms. Alignment parameter thresholds used by SOrt-ITEMS are generated
by observing the pattern of sequence divergence within and across various
taxonomic groups belonging to the bacterial and archaeal kingdoms. However, the
majority of taxonomic groups within the viral kingdom are characterized by the
absence of a typical Linnaean-like taxonomic hierarchy (phylum, class, order,
family, genus and species). This motivated us to develop ProViDE (Program
for Viral Diversity Estimation), a novel algorithm that uses a customized set of
alignment parameter thresholds/ranges, specifically suited for the accurate
taxonomic labelling of viral metagenomic sequences. These thresholds take
into account the pattern of sequence divergence and the non-uniform
taxonomic hierarchy observed within/across various taxonomic groups of the
viral kingdom.
Methodology:
Determination of alignment parameter thresholds:
Using MetaSim [5], simulated data sets were generated from 50 diverse viral
genomes (Supplementary Table 1). Subsequently the alignment parameter
thresholds were determined (Supplementary Figures 1-4, Supplementary
Tables 2-5) using a methodology similar to that adopted in SOrt-ITEMS [4].
Based on these, flow charts (Figure 1) were devised (for various query lengths)
in order to identify an appropriate taxonomic level of assignment for a given
query sequence.
Figure 1: Flow-charts showing the steps followed to arrive at an appropriate taxonomic level, to which the assignment of each read is to be restricted, for
(A) Sanger (~800 bp), (B) 454-Titanium (~400 bp), (C) 454-Standard (~250 bp) and (D) 454-GS20 (~100 bp) reads. I: Identity; P: Positives. Hit taxon denotes the
taxon/organism corresponding to the hit sequence.
Steps followed for taxonomic classification of viral metagenomic
sequences:
Supplementary Figure 5 depicts the steps followed by the ProViDE
algorithm. The output of a BLASTx search against the nr database is taken as
input for ProViDE. For each hit, ProViDE first parses the values of various
alignment parameters. For each read, based on its length, an appropriate
taxonomic level of assignment (TL) is subsequently identified (Figure 1). The
taxonomic assignment of the read is done using the orthology approach as used
in SOrt-ITEMS [4]. The final taxonomic assignment of the read is thus
restricted to a taxonomic level that lies at or above the TL.
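The level-selection step described above can be illustrated with a minimal Python sketch. It is not the ProViDE implementation: the threshold values, the read-length cut-off, the rank list and all function names are hypothetical placeholders (the actual thresholds are those of Figure 1 and Supplementary Tables 2-5), and the orthology step is omitted.

```python
from dataclasses import dataclass

# Ranks ordered from least to most specific. Viral lineages often skip ranks,
# so a lineage is stored as a rank -> taxon mapping rather than a fixed list.
RANKS = ["superkingdom", "order", "family", "genus", "species"]


@dataclass
class Hit:
    identity: float   # percent identities in the alignment
    positives: float  # percent positives in the alignment
    bit_score: float
    lineage: dict     # rank -> taxon name of the hit taxon


# HYPOTHETICAL thresholds, for illustration only.
# Each entry: (min identity %, min positives %, most specific rank allowed).
ILLUSTRATIVE_THRESHOLDS = {
    "sanger": [(80, 90, "species"), (60, 75, "genus"), (40, 55, "family")],
    "454-100": [(90, 95, "species"), (70, 85, "genus"), (50, 65, "family")],
}


def read_class(read_length):
    """Choose a threshold set from the read length (cut-off is an assumption)."""
    return "sanger" if read_length >= 600 else "454-100"


def taxonomic_level(hit, read_length):
    """Return the most specific rank (TL) that the alignment quality supports."""
    for min_id, min_pos, rank in ILLUSTRATIVE_THRESHOLDS[read_class(read_length)]:
        if hit.identity >= min_id and hit.positives >= min_pos:
            return rank
    return "superkingdom"


def restrict_to_level(hit, tl):
    """Return the hit's taxon at TL, or at the nearest defined rank above it."""
    for rank in reversed(RANKS[: RANKS.index(tl) + 1]):
        if rank in hit.lineage:
            return rank, hit.lineage[rank]
    return "root", "root"
```

In this sketch a hit whose lineage lacks the selected rank (a common situation in viral taxonomy) falls back to the nearest defined rank above it, so the assignment never lands below the TL.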
Data-sets and Database variants used for evaluating binning accuracy and
specificity:
A total of 140,000 sequences were generated from 35 viral genomes (Supplementary
Table 6). These genomes were different from the ones taken for obtaining
alignment parameter thresholds. Based on their length, these sequences were
divided into four test data-sets, namely Sanger, 454-400, 454-250 and 454-100.
To evaluate the performance of ProViDE with respect to sequences originating
from unknown viral genomes, sequences in each data-set were queried (using
BLASTx) against 2 variants of the nr database, namely, (a) nr database
excluding sequences belonging to the query genome ('MINUS SPECIES') and
(b) nr database excluding all sequences which fall under the immediate higher
level taxonomic group to which the query species belongs ('MINUS ONE
LEVEL UP'). The BLASTx outputs obtained were given as input to ProViDE.
The results of ProViDE were also compared with corresponding results
generated with a similarity-based binning method, MEGAN [6]. Both
programs were run using a min-support value of 2 and a bit score threshold
value of 35.
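The two database variants can be pictured as simple filters over a reference collection. The following Python sketch assumes each database entry carries a root-to-species lineage list; the entry layout, taxon names and function names are illustrative assumptions and do not describe how the nr variants were actually built.

```python
def minus_species(db, query_lineage):
    """'MINUS SPECIES': drop entries belonging to the query species itself."""
    species = query_lineage[-1]
    return [entry for entry in db if entry["lineage"][-1] != species]


def minus_one_level_up(db, query_lineage):
    """'MINUS ONE LEVEL UP': drop every entry under the query's parent taxon.

    The parent is simply the second-to-last lineage element, whatever its
    rank, which accommodates the non-uniform viral hierarchy.
    """
    parent = query_lineage[-2]
    return [entry for entry in db if parent not in entry["lineage"]]


# Toy reference collection with hypothetical taxa.
db = [
    {"id": "p1", "lineage": ["Viruses", "FamilyA", "GenusA", "Species a1"]},
    {"id": "p2", "lineage": ["Viruses", "FamilyA", "GenusA", "Species a2"]},
    {"id": "p3", "lineage": ["Viruses", "FamilyA", "GenusB", "Species b1"]},
    {"id": "p4", "lineage": ["Viruses", "FamilyC", "Species c1"]},
]
query = ["Viruses", "FamilyA", "GenusA", "Species a1"]

ids_minus_species = [e["id"] for e in minus_species(db, query)]
ids_minus_one_up = [e["id"] for e in minus_one_level_up(db, query)]
```

Here the first variant still retains a sibling species of the query, whereas the second removes the whole parent group, simulating a read from a more distantly related unknown virus.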
Categorization of taxonomic assignments:
The assignment of a read to a taxon that lies in the path between the root and
the taxon corresponding to the source organism of the read was categorized as
'Correct'. To quantify the specificity, these 'Correct' assignments were
sub-grouped into two categories. All correct assignments at the level of root,
cellular organisms or super-kingdom (Viruses) were considered 'non-specific'.
Assignments below the level of super-kingdom were considered 'specific'.
The assignment of a read to a taxon that does not lie in
the path between the root and the taxon corresponding to the source organism
of the read was categorized as 'Wrong'. Reads whose hits had a bit-score
less than 35 and/or an alignment length of less than 25 were categorized as
'Unassigned'. All reads with no BLAST hits were categorized as 'No hits'.
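The categorization rules above translate directly into a small decision function. The Python sketch below is one possible reading: it assumes the bit-score and alignment-length cut-offs are applied to the read's best hit, and the function name, argument names and category labels are illustrative.

```python
# Taxa treated as 'non-specific' when they lie on the path to the source.
NON_SPECIFIC = {"root", "cellular organisms", "Viruses"}


def categorize(assigned_taxon, source_lineage, best_bit_score=None, best_align_len=None):
    """Place one read into a bin category.

    source_lineage is the root-to-species path of the read's true source.
    best_bit_score is None when the read produced no BLAST hit at all.
    """
    if best_bit_score is None:
        return "No hits"
    if best_bit_score < 35 or best_align_len < 25:
        return "Unassigned"
    if assigned_taxon in source_lineage:
        if assigned_taxon in NON_SPECIFIC:
            return "Correct (non-specific)"
        return "Correct (specific)"
    return "Wrong"


source = ["root", "Viruses", "FamilyA", "GenusA", "Species a1"]  # hypothetical taxa
```

For example, `categorize("GenusA", source, 80, 60)` falls on the path below the super-kingdom and is therefore a specific correct assignment, while `categorize("FamilyB", source, 80, 60)` is off the path and counts as wrong.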
Discussion:
Table 1 shows evaluation results with respect to the total number of correct
assignments, wrong assignments, and the number of sequences categorized as
unassigned. As expected, the percentage of total correct assignments is seen to
increase with increasing read length. However, it is observed that (for all four
test data-sets), the percentage of 'correct' assignments by ProViDE is around
1.7 to 3 times higher than that by MEGAN. Since for both methods, most (if
not all) correct assignments are at specific levels, the relative specificity
obtained with ProViDE is around 1.7 to 3 times higher than that with MEGAN.
Furthermore, the percentage of sequences misclassified by ProViDE is in the
range of 3-19% (as compared to 5 - 42% by MEGAN) indicating significantly
better assignment accuracy. A similar number of sequences are categorized as
'unassigned' by both programs, indicating that the relatively high levels of
accuracy obtained using ProViDE are not at the cost of a decreased number of
assignments.
One of the important aspects of metagenomic sequence analysis is
to assign metagenomic sequences to correct taxonomic groups. Given that
metagenomic sequence data sets typically contain millions of sequences,
the majority of which originate from new/hitherto unknown organisms, accurate
and specific taxonomic assignment of metagenomic sequences still remains a
major computational challenge.
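The fold-improvement quoted in this section can be checked directly against the 'TOTAL CORRECT ASSIGNMENTS' rows of Table 1. The short Python calculation below uses only values from that table; the variable names are arbitrary.

```python
# (ProViDE %, MEGAN %) total correct assignments, taken from Table 1.
total_correct = {
    ("454-100", "MINUS SPECIES"): (25.4, 14.7),
    ("454-100", "MINUS ONE LEVEL UP"): (5.0, 2.4),
    ("454-250", "MINUS SPECIES"): (44.2, 24.7),
    ("454-250", "MINUS ONE LEVEL UP"): (18.8, 6.7),
    ("454-400", "MINUS SPECIES"): (52.5, 27.9),
    ("454-400", "MINUS ONE LEVEL UP"): (27.2, 8.8),
    ("Sanger", "MINUS SPECIES"): (60.2, 34.3),
    ("Sanger", "MINUS ONE LEVEL UP"): (35.3, 14.7),
}

# Ratio of ProViDE to MEGAN correct assignments for each scenario.
ratios = {key: provide / megan for key, (provide, megan) in total_correct.items()}

for (data_set, variant), ratio in ratios.items():
    print(f"{data_set:8s} {variant:20s} {ratio:.2f}x")

lowest, highest = min(ratios.values()), max(ratios.values())
```

The smallest ratio comes from the 454-100 'MINUS SPECIES' scenario and the largest from the 454-400 'MINUS ONE LEVEL UP' scenario, which brackets the range of roughly 1.7 to 3 stated in the text.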
In the current study, we have presented an algorithm (ProViDE) that is
specifically customized for taxonomic analysis of viral metagenome data sets.
The majority of reads in viral metagenomic data-sets originate from hitherto
unknown viral groups, the sequences of which are absent in existing reference
databases. Consequently, a majority of these sequences generate poor quality
alignments with sequences in reference databases. Assignment of these
sequences directly to the taxon corresponding to the best hit (irrespective of
alignment quality) is expected to generate a large number of incorrect
assignments. Validation results generated in the present study also
indicate that the popular binning algorithm MEGAN, which is based
on the least common ancestor (LCA) approach, has a high
misclassification rate (as high as about 42% for the Sanger data-sets; Table 1). This
high misclassification rate is expected since MEGAN uses a single
alignment parameter (bit-score) for judging alignment quality (prior to
assignment). Consequently, MEGAN misclassifies a large number of
reads, especially those having poor quality alignments (with identities as low as
20%). Furthermore, as demonstrated by earlier studies [4], the
LCA approach used by MEGAN is generally associated with poor
binning specificity (especially in metagenomic scenarios wherein the majority of
reads originate from unknown organisms).
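To make the LCA idea concrete, the following Python sketch assigns a read to the deepest taxon shared by all hits that pass a bit-score cut-off. It is a simplified illustration, not MEGAN's implementation: other MEGAN parameters (such as min-support) are left out, lineages are plain root-to-leaf lists, and the taxon names are hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages."""
    lca = "root"
    for taxa in zip(*lineages):          # walk the lineages level by level
        if all(taxon == taxa[0] for taxon in taxa):
            lca = taxa[0]
        else:
            break
    return lca


def lca_assign(hits, min_bit_score=35):
    """hits: list of (bit_score, lineage). Bit-score is the only quality filter."""
    kept = [lineage for bit_score, lineage in hits if bit_score >= min_bit_score]
    if not kept:
        return None                      # read stays unassigned
    return lowest_common_ancestor(kept)


hits = [
    (52.0, ["root", "Viruses", "FamilyA", "GenusA", "Species a1"]),
    (48.0, ["root", "Viruses", "FamilyA", "GenusB", "Species b1"]),
    (30.0, ["root", "Viruses", "FamilyC", "Species c1"]),  # below the cut-off
]
```

With these hits, `lca_assign(hits)` returns the shared family. If only one hit passes the cut-off, the same function returns that hit's species regardless of how low its identity is, which illustrates why relying on bit-score alone can produce over-specific, incorrect assignments.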
In contrast, multiple alignment parameters like bit-score, identities, positives
(thresholds of which were specially identified for viral metagenomic
sequences) are used by ProViDE for ascertaining the quality of the alignment.
This ensures that assignment of reads at specific levels is done only for those
reads that generate high quality alignments with database sequences. As the
alignment quality decreases, ProViDE assigns these reads at progressively
higher taxonomic levels. Validation results have indicated that employing this
approach helps in significantly reducing the number of incorrectly assigned
sequences. Validation results also indicate that ProViDE correctly assigns a
greater number of sequences at specific levels (as compared to MEGAN). This
indicates the overall utility of the ProViDE algorithm for accurate and specific
taxonomic assignment of viral metagenomic sequences. A comparative
evaluation of binning time indicates that the ProViDE algorithm takes
approximately an hour to process the BLASTx output obtained for a data-set
having 100,000 reads. This is marginally higher than the time taken by
MEGAN for analysing the same number of reads (see the time comparison plot in
Supplementary Figure 6).
Conclusion:
Performance evaluation with data-sets and database variants simulating typical
metagenomic scenarios indicates that ProViDE has high
specificity and accuracy. To the best of our knowledge, ProViDE is the first
similarity-based binning algorithm that provides an accurate and specific
taxonomic label to most of the assigned reads in viral metagenomic data sets.
References:
[1] Williamson SJ et al. PLoS ONE. 2008 3(1): e1456 [PMID: 18213365]
[2] Lindell D et al. Nature 2005 438: 86 [PMID: 16222247]
[3] Willner D et al. PLoS ONE. 2009 4(10): e7370 [PMID: 19816605]
[4] Monzoorul Haque M et al. Bioinformatics 2009 25: 1722 [PMID: 19439565]
[5] Richter DC et al. PLoS ONE. 2008 3(10): e3373 [PMID: 18841204]
[6] Huson DH et al. Genome Res. 2007 17: 377 [PMID: 17255551]
Edited by TW Tan
Citation: Ghosh et al. Bioinformation 6(2): 91-94 (2011)
License statement: This is an open-access article, which permits unrestricted use, distribution, and reproduction in any medium, for non-commercial purposes,
provided the original author and source are credited.
Supplementary material:
Table 1: Comparison of the percentage of reads assigned under various bin categories by ProViDE and MEGAN for the (A) 454-100 data sets (B) 454-250 data
sets (C) 454-400 data sets, and (D) Sanger data sets. In this table the terms 'MINUS SPECIES', and 'MINUS ONE LEVEL UP' refer to the database variants used.
A detailed description of the database variants is given in the Methodology section of the manuscript. Note that the subtotals may vary by a value of 0.1, since the
individual values were rounded off to single decimals.
(A) 454_100
                              MINUS SPECIES         MINUS ONE LEVEL UP
ASSIGNMENT CATEGORIES         ProViDE    MEGAN      ProViDE    MEGAN
NON SPECIFIC LEVELS           0          1.2        0          0
SPECIFIC LEVELS               25.4       13.5       5          2.4
TOTAL CORRECT ASSIGNMENTS     25.4       14.7       5          2.4
WRONG                         5.2        12.3       2.7        5.2
UNASSIGNED + NO HITS          69.4       73.1       92.4       92.4

(B) 454_250
                              MINUS SPECIES         MINUS ONE LEVEL UP
ASSIGNMENT CATEGORIES         ProViDE    MEGAN      ProViDE    MEGAN
NON SPECIFIC LEVELS           0          1.7        0          0
SPECIFIC LEVELS               44.2       23.0       18.8       6.7
TOTAL CORRECT ASSIGNMENTS     44.2       24.7       18.8       6.7
WRONG                         5.2        24.7       3.5        15.4
UNASSIGNED + NO HITS          50.6       50.7       77.7       77.8

(C) 454_400
                              MINUS SPECIES         MINUS ONE LEVEL UP
ASSIGNMENT CATEGORIES         ProViDE    MEGAN      ProViDE    MEGAN
NON SPECIFIC LEVELS           0          1.7        0          0.1
SPECIFIC LEVELS               52.5       26.2       27.2       8.7
TOTAL CORRECT ASSIGNMENTS     52.5       27.9       27.2       8.8
WRONG                         4.7        29.4       3.4        21.8
UNASSIGNED + NO HITS          42.7       42.8       69.4       69.4

(D) SANGER
                              MINUS SPECIES         MINUS ONE LEVEL UP
ASSIGNMENT CATEGORIES         ProViDE    MEGAN      ProViDE    MEGAN
NON SPECIFIC LEVELS           0          2.3        0          0.5
SPECIFIC LEVELS               60.2       32.0       35.3       14.2
TOTAL CORRECT ASSIGNMENTS     60.2       34.3       35.3       14.7
WRONG                         14.2       41.1       19.7       41.9
UNASSIGNED + NO HITS          25.7       24.7       45.0       43.4

Supplementary resource (1)

... According to the discriminate criteria, they can be roughly categorized as alignment-based, gene-based, k-mer-based and deep learning-based methods. Alignment-based methods aim at matching the similarities between query sequences and known virus reference genomes, such as ProViDE (Tarini et al., 2011), Metavir (Roux et al., 2011), DIAMOND (Buchfink et al., 2015), MetaPhlAn (Truong et al., 2015), Centrifuge (Kim et al., 2016) and Genome Detective (Vilsker et al., 2019). These methods require extensive computation times because of the mechanism and the requirement of large reference databases (Bonnie et al., 2016). ...
... The area under an ROC curve (AUROC) is utilized to evaluate the prediction performance, where a higher AUROC value indicates a better performance. When dealing with a highly imbalanced dataset, precision-recall (PR) curves give a more informative picture of the performance than ROC curves (Tarini et al., 2011). The area under a PR curve (AUPRC) is also utilized to evaluate the prediction performance on an imbalanced dataset. ...
Article
Motivation Viruses, the most abundant biological entities on earth, are important components of microbial communities, and as major human pathogens, they are responsible for human mortality and morbidity. The identification of viral sequences from metagenomes is critical for viral analysis. As massive quantities of short sequences are generated by next-generation sequencing (NGS), most methods utilize discrete and sparse one-hot vectors to encode nucleotide sequences, which are usually ineffective in viral identification. Results In this paper, Virtifier, a deep learning-based viral identifier for sequences from metagenomic data, is proposed. It includes a meaningful nucleotide sequence encoding method named Seq2Vec and a variant viral sequence predictor with an attention-based Long Short-Term Memory (LSTM) network. By utilizing a fully trained embedding matrix to encode codons, Seq2Vec can efficiently extract the relationships among those codons in a nucleotide sequence. Combined with an attention layer, the LSTM neural network can further analyze the codon relationships and sift the parts that contribute to the final features. Experimental results of three datasets have shown that Virtifier can accurately identify short viral sequences (< 500 bp) from metagenomes, surpassing three widely used methods, VirFinder, DeepVirFinder and PPR-Meta. Meanwhile, a comparable performance was achieved by Virtifier at longer lengths (> 5,000bp). Availability A Python implementation of Virtifier and the Python code developed for this study have been provided on Github https://github.com/crazyinter/Seq2Vec. Supplementary information Supplementary data are available at Bioinformatics online.
... Viruses display high genetic diversity both within and among viral species, as well as within and among infected hosts. The composition of mixed samples can be assessed by metagenomics approaches [26], such as sequence read annotation by taxonomic classification using existing reference genomes and databases [27,28]. Such data will be generated within VIROINF. ...
Article
Full-text available
Many recent studies highlight the fundamental importance of viruses. Besides their important role as human and animal pathogens, their beneficial, commensal or harmful functions are poorly understood. By developing and applying tailored bioinformatical tools in important virological models, the Marie Skłodowska-Curie Initiative International Training Network VIROINF will provide a better understanding of viruses and the interaction with their hosts. This will open the door to validate methods of improving viral growth, morphogenesis and development, as well as to control strategies against unwanted microorganisms. The key feature of VIROINF is its interdisciplinary nature, which brings together virologists and bioinformaticians to achieve common goals.
... This workflow includes in general four steps: preprocessing, annotation, assembly and, finally, the estimation of genotypes, abundances, community, structure and diversity. During the annotation, several databases are specifically used for viruses, such as ProVide [44], MGTAXA [http://mgtaxa.jcvi.org], MetaVir [45], VIROME [Bhavsar et al. in preparation] and VMGAP [46]. ...
Chapter
Viruses are a diverse biological group capable of infecting several hosts such as bacteria, plants, and animals, including humans. Viral infections constitute a threat to the human population as they may cause high mortality rates, decrease food production, and generate large economical losses. Viruses co-evolve with their hosts and this constant evolution must be clarified to better predict possible viral outbreaks, and to develop improved diagnostic methods and therapeutical approaches. In this review, we summarize several viral databases that store key information retrieved from a variety of omics approaches. Furthermore, we explore the use of such databases to predict Virus-Host interactions through artificial intelligence algorithms, focusing on the latest methodologies to characterize biological networks.
Chapter
1) Brief Introduction about Metagenomics 2) History of the metagenomic approach 3) Approach, strategies, and tools used in the metagenomic analysis 4) Application of the metagenomic approach
Article
Full-text available
Cybercrime is a kind of crime that happens in “cyberspace”, that is crime that happens in the world of computer and the Internet. Although many people have a limited knowledge of “cybercrime”, this kind of crime has the serious potential for severe impact on our lives and society, because our society is becoming an information society, full of information exchange happening in “cyberspace”. Elderly is that vulnerable group who has been deprived from any information regarding latest technologies and innovation especially in the area of computer world and has lack of knowledge about internet and become the victim of different types of cybercrime. The main objective was to assess the types of cyber crime faced by the elderly. The research design was cross-sectional in nature. 60 respondents each residing in their homes and old age homes respectively were selected from different areas of Bhubaneswar city. Total sample size was 120. The purposive random sampling technique was used to collect the data. Finding of the study revealed that majority of respondents reported that they were not affected by cyber pornography, phishing, money laundering, password sniffer, credit card fraud and even web jacking either residing in own homes or old age homes.
Article
Full-text available
The interaction between the human microbiome and immune system has an effect on several human metabolic functions and impacts our well-being. Additionally, the interaction between humans and microbes can also play a key role in determining the wellness or disease status of the human body. Dysbiosis is related to a plethora of diseases, including skin, inflammatory, metabolic, and neurological disorders. A better understanding of the host-microbe interaction is essential for determining the diagnosis and appropriate treatment of these ailments. The significance of the microbiome on host health has led to the emergence of new therapeutic approaches focused on the prescribed manipulation of the host microbiome, either by removing harmful taxa or reinstating missing beneficial taxa and the functional roles they perform. Culturing large numbers of microbial taxa in the laboratory is problematic at best, if not impossible. Consequently, this makes it very difficult to comprehensively catalog the individual members comprising a specific microbiome, as well as understanding how microbial communities function and influence host-pathogen interactions. Recent advances in sequencing technologies and computational tools have allowed an increasing number of metagenomic studies to be performed. These studies have provided key insights into the human microbiome and a host of other microbial communities in other environments. In the present review, the role of the microbiome as a therapeutic agent and its significance in human health and disease is discussed. Advances in high-throughput sequencing technologies for surveying host-microbe interactions are also discussed. Additionally, the correlation between the composition of the microbiome and infectious diseases as described in previously reported studies is covered as well. Lastly, recent advances in state-of-the-art bioinformatics software, workflows, and applications for analysing metagenomic data are summarized.
Article
Full-text available
Metagenomics poses opportunities for clinical and public health virology applications by offering a way to assess complete taxonomic composition of a clinical sample in an unbiased way. However, the techniques required are complicated and analysis standards have yet to develop. This, together with the wealth of different tools and workflows that have been proposed, poses a barrier for new users. We evaluated 49 published computational classification workflows for virus metagenomics in a literature review. To this end, we described the methods of existing workflows by breaking them up into five general steps and assessed their ease-of-use and validation experiments. Performance scores of previous benchmarks were summarized and correlations between methods and performance were investigated. We indicate the potential suitability of the different workflows for (1) time-constrained diagnostics, (2) surveillance and outbreak source tracing, (3) detection of remote homologies (discovery), and (4) biodiversity studies. We provide two decision trees for virologists to help select a workflow for medical or biodiversity studies, as well as directions for future developments in clinical viral metagenomics.
Article
Full-text available
The human respiratory tract is constantly exposed to a wide variety of viruses, microbes and inorganic particulates from environmental air, water and food. Physical characteristics of inhaled particles and airway mucosal immunity determine which viruses and microbes will persist in the airways. Here we present the first metagenomic study of DNA viral communities in the airways of diseased and non-diseased individuals. We obtained sequences from sputum DNA viral communities in 5 individuals with cystic fibrosis (CF) and 5 individuals without the disease. Overall, diversity of viruses in the airways was low, with an average richness of 175 distinct viral genotypes. The majority of viral diversity was uncharacterized. CF phage communities were highly similar to each other, whereas Non-CF individuals had more distinct phage communities, which may reflect organisms in inhaled air. CF eukaryotic viral communities were dominated by a few viruses, including human herpesviruses and retroviruses. Functional metagenomics showed that all Non-CF viromes were similar, and that CF viromes were enriched in aromatic amino acid metabolism. The CF metagenomes occupied two different metabolic states, probably reflecting different disease states. There was one outlying CF virome which was characterized by an over-representation of Guanosine-5'-triphosphate,3'-diphosphate pyrophosphatase, an enzyme involved in the bacterial stringent response. Unique environments like the CF airway can drive functional adaptations, leading to shifts in metabolic profiles. These results have important clinical implications for CF, indicating that therapeutic measures may be more effective if used to change the respiratory environment, as opposed to shifting the taxonomic composition of resident microbiota.
Article
Full-text available
One of the first steps in metagenomic analysis is the assignment of reads/contigs obtained from various sequencing technologies to their correct taxonomic bins. Similarity-based binning methods assign a read to a taxon/clade, based on the pattern of significant BLAST hits generated against sequence databases. Existing methods, which use bit-score as the sole parameter to ascertain the significance of BLAST hits, have limited specificity and accuracy of binning. A new binning algorithm, called SOrt-ITEMS is introduced, which addresses these limitations. The method uses alignment parameters besides the bit score to first identify an appropriate taxonomic level where the read can be assigned. An orthology-based approach is subsequently used by the method for the final assignment. The performance of SOrt-ITEMS has been validated with reads simulating sequences from 454 and Sanger sequencing technologies. In addition, the taxonomic composition of the Sargasso Sea data set has been analyzed using SOrt-ITEMS. SOrt-ITEMS shows improved specificity and accuracy of assignments especially in simulated scenarios, wherein sequences corresponding to the source organism of the reads are absent in the reference database. SOrt-ITEMS software is available for download from: http://metagenomics.atc.tcs.com/binning/SOrt-ITEMS. No license is needed for academic and nonprofit use.
Article
Full-text available
The new research field of metagenomics is providing exciting insights into various, previously unclassified ecological systems. Next-generation sequencing technologies are producing a rapid increase of environmental data in public databases. There is great need for specialized software solutions and statistical methods for dealing with complex metagenome data sets. To facilitate the development and improvement of metagenomic tools and the planning of metagenomic projects, we introduce a sequencing simulator called MetaSim. Our software can be used to generate collections of synthetic reads that reflect the diverse taxonomical composition of typical metagenome data sets. Based on a database of given genomes, the program allows the user to design a metagenome by specifying the number of genomes present at different levels of the NCBI taxonomy, and then to collect reads from the metagenome using a simulation of a number of different sequencing technologies. A population sampler optionally produces evolved sequences based on source genomes and a given evolutionary tree. MetaSim allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software.
Cyanobacteria, and the viruses (phages) that infect them, are significant contributors to the oceanic 'gene pool'. This pool is dynamic, and the transfer of genetic material between hosts and their phages probably influences the genetic and functional diversity of both. For example, photosynthesis genes of cyanobacterial origin have been found in phages that infect Prochlorococcus and Synechococcus, the numerically dominant phototrophs in ocean ecosystems. These genes include psbA, which encodes the photosystem II core reaction centre protein D1, and high-light-inducible (hli) genes. Here we show that phage psbA and hli genes are expressed during infection of Prochlorococcus and are co-transcribed with essential phage capsid genes, and that the amount of phage D1 protein increases steadily over the infective period. We also show that the expression of host photosynthesis genes declines over the course of infection and that replication of the phage genome is a function of photosynthesis. We thus propose that the phage genes are functional in photosynthesis and that they may be increasing phage fitness by supplementing the host production of these proteins.
Viruses are the most abundant biological entities on our planet. Interactions between viruses and their hosts impact several important biological processes in the world's oceans such as horizontal gene transfer, microbial diversity and biogeochemical cycling. Interrogation of microbial metagenomic sequence data collected as part of the Sorcerer II Global Ocean Expedition (GOS) revealed a high abundance of viral sequences, representing approximately 3% of the total predicted proteins. Cluster analyses of the viral sequences revealed hundreds to thousands of viral genes encoding various metabolic and cellular functions. Quantitative analyses of viral genes of host origin performed on the viral fraction of aquatic samples confirmed the viral nature of these sequences and suggested that significant portions of aquatic viral communities behave as reservoirs of such genetic material. Distributional and phylogenetic analyses of these host-derived viral sequences also suggested that viral acquisition of environmentally relevant genes of host origin is a more abundant and widespread phenomenon than previously appreciated. The predominant viral sequences identified within microbial fractions originated from tailed bacteriophages and exhibited varying global distributions according to viral family. Recruitment of GOS viral sequence fragments against 27 complete aquatic viral genomes revealed that only one reference bacteriophage genome was highly abundant and was closely related, but not identical, to the cyanomyovirus P-SSM4. The co-distribution across all sampling sites of P-SSM4-like sequences with the dominant ecotype of its host, Prochlorococcus, supports the classification of the viral sequences as P-SSM4-like and suggests that this virus may influence the abundance, distribution and diversity of one of the most dominant components of picophytoplankton in oligotrophic oceans.
In summary, the abundance and broad geographical distribution of viral sequences within microbial fractions, the prevalence of genes among viral sequences that encode microbial physiological function and their distinct phylogenetic distribution lend strong support to the notion that viral-mediated gene acquisition is a common and ongoing mechanism for generating microbial diversity in the marine environment.
Metagenomics is the study of the genomic content of a sample of organisms obtained from a common habitat using targeted or random sequencing. Goals include understanding the extent and role of microbial diversity. The taxonomical content of such a sample is usually estimated by comparison against sequence databases of known sequences. Most published studies use the analysis of paired-end reads, complete sequences of environmental fosmid and BAC clones, or environmental assemblies. Emerging sequencing-by-synthesis technologies with very high throughput are paving the way to low-cost random "shotgun" approaches. This paper introduces MEGAN, a new computer program that allows laptop analysis of large metagenomic data sets. In a preprocessing step, the set of DNA sequences is compared against databases of known sequences using BLAST or another comparison tool. MEGAN is then used to compute and explore the taxonomical content of the data set, employing the NCBI taxonomy to summarize and order the results. A simple lowest common ancestor algorithm assigns reads to taxa such that the taxonomical level of the assigned taxon reflects the level of conservation of the sequence. The software allows large data sets to be dissected without the need for assembly or the targeting of specific phylogenetic markers. It provides graphical and statistical output for comparing different data sets. The approach is applied to several data sets, including the Sargasso Sea data set, a recently published metagenomic data set sampled from a mammoth bone, and several complete microbial genomes. Also, simulations that evaluate the performance of the approach for different read lengths are presented.
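The lowest common ancestor step mentioned above can be sketched as follows. This is a minimal illustration, not MEGAN's code: the taxonomy is a small hypothetical child-to-parent table rather than the NCBI taxonomy, the function names are invented for this example, and the filtering of BLAST hits that precedes the assignment is omitted.

```python
# Toy taxonomy as child -> parent links (hypothetical subset, for illustration only).
PARENT = {
    "Viruses": "root",
    "Caudovirales": "Viruses",
    "Myoviridae": "Caudovirales",
    "Podoviridae": "Caudovirales",
    "P-SSM4": "Myoviridae",
    "P-SSM2": "Myoviridae",
    "P-SSP7": "Podoviridae",
}

def lineage(taxon):
    """Return the path from the root down to `taxon`."""
    path = [taxon]
    while path[-1] != "root":
        path.append(PARENT[path[-1]])
    return path[::-1]

def lowest_common_ancestor(taxa):
    """Return the deepest node shared by the lineages of all taxa hit by a read."""
    lineages = [lineage(t) for t in taxa]
    lca = "root"
    # Walk the lineages level by level until they diverge.
    for nodes in zip(*lineages):
        if len(set(nodes)) != 1:
            break
        lca = nodes[0]
    return lca
```

A read whose hits all fall in one genome is assigned to that genome, while a read with hits spread across families is pushed up to the node that contains them all. This is how the assigned level comes to reflect the level of conservation of the sequence.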