ProViDE: A software tool for accurate estimation of viral diversity in metagenomic samples

March 2011
Bioinformation 6(2):91-4

DOI:10.6026/97320630006091

Authors:

Tarini Shankar Ghosh

Indraprastha Institute of Information Technology Delhi (IIIT-Delhi)

Mohammed Monzoorul Haque

Tata Consultancy Services Limited

Sharmila S Mande

Tata Consultancy Services Limited

open access www.bioinformation.net Software

Volume 6(2)

ISSN 0973-2063 (online) 0973-8894 (print)



Tarini Shankar Ghosh, Monzoorul Haque Mohammed, Dinakar Komanduri, Sharmila Shekhar Mande*

Bio-Sciences Division, Innovation Labs, Tata Consultancy Services, 1 Software Units Layout, Hyderabad 500 081, Andhra Pradesh, India

Sharmila Shekhar Mande - Email: sharmila@atc.tcs.com; *Corresponding author

Received March 04, 2011; Accepted March 07, 2011; Published March 26, 2011

Selected publications from Asia Pacific Bioinformatics Network (APBioNet) Ninth International Conference on Bioinformatics (InCoB 2010), Tokyo, Japan, 26-28 September 2010.

Abstract:

Given the absence of universal marker genes in the viral kingdom, researchers typically use BLAST (with stringent E-values) for taxonomic classification of viral metagenomic sequences. Since the majority of metagenomic sequences originate from hitherto unknown viral groups, using stringent E-values results in most sequences remaining unclassified. Conversely, using less stringent E-values results in a high number of incorrect taxonomic assignments. The SOrt-ITEMS algorithm provides an approach to address these issues. Based on alignment parameters, SOrt-ITEMS follows an elaborate workflow for assigning reads originating from hitherto unknown archaeal/bacterial genomes. In SOrt-ITEMS, alignment parameter thresholds were generated by observing patterns of sequence divergence within and across various taxonomic groups belonging to the bacterial and archaeal kingdoms. However, many taxonomic groups within the viral kingdom lack a typical Linnaean-like taxonomic hierarchy. In this paper, we present ProViDE (Program for Viral Diversity Estimation), an algorithm that uses a customized set of alignment parameter thresholds specifically suited for viral metagenomic sequences. These thresholds capture the pattern of sequence divergence and the non-uniform taxonomic hierarchy observed within and across various taxonomic groups of the viral kingdom. Validation results indicate that the percentage of 'correct' assignments by ProViDE is around 1.7 to 3 times higher than that by the widely used similarity-based method MEGAN. The misclassification rate of ProViDE is around 3 to 19% (as compared to 5 to 42% for MEGAN), indicating significantly better assignment accuracy. The ProViDE software and a supplementary file (containing the supplementary figures and tables referred to in this article) are available for download from http://metagenomics.atc.tcs.com/binning/ProViDE/

Background:

A number of metagenomic studies have been initiated in the past 3-4 years to explore, characterize and compare the taxonomic diversity of viruses present in various environments [1, 2]. Besides cataloguing viral diversity, these studies have identified several hitherto unknown groups of viruses that play a critical role in transferring genes involved in a variety of metabolic functions [1, 3]. Given the absence of universal marker genes (such as 16S rRNA in bacteria/archaea) in the viral kingdom, researchers typically use similarity-based approaches like BLAST (with stringent E-values) for taxonomic classification of viral metagenomic sequences. However, since a majority of sequences in typical metagenomes originate from hitherto unknown viral groups, the use of such stringent thresholds results in a large fraction of sequences remaining unclassified. Conversely, using less stringent E-values (as observed for BLAST hits with poor alignment quality) results in a high number of incorrect taxonomic assignments. The recently published SOrt-ITEMS algorithm provides an approach to address these issues [4]. Based on alignment parameters, SOrt-ITEMS follows an elaborate workflow for assigning reads originating from genomes of hitherto unknown archaeal/bacterial organisms. The alignment parameter thresholds used by SOrt-ITEMS were generated by observing the pattern of sequence divergence within and across various taxonomic groups belonging to the bacterial and archaeal kingdoms. However, the majority of taxonomic groups within the viral kingdom are characterized by the absence of a typical Linnaean-like taxonomic hierarchy (phylum, class, order, family, genus and species). This motivated us to develop ProViDE (Program for Viral Diversity Estimation), a novel algorithm that uses a customized set of alignment parameter thresholds/ranges specifically suited for the accurate taxonomic labelling of viral metagenomic sequences. These thresholds take into account the pattern of sequence divergence and the non-uniform taxonomic hierarchy observed within and across various taxonomic groups of the viral kingdom.

Methodology:

Determination of alignment parameter thresholds:

Using MetaSim [5], simulated data sets were generated from 50 diverse viral genomes (Supplementary Table 1). Subsequently, the alignment parameter thresholds were determined (Supplementary Figures 1-4, Supplementary Tables 2-5) using a methodology similar to that adopted in SOrt-ITEMS [4]. Based on these thresholds, flow charts (Figure 1) were devised (for various query lengths) in order to identify an appropriate taxonomic level of assignment for a given query sequence.


Figure 1: Flow charts showing the steps followed to arrive at the appropriate taxonomic level to which the assignment of each read is to be restricted, for (A) Sanger (~800 bp), (B) 454-Titanium (~400 bp), (C) 454-Standard (~250 bp) and (D) 454-GS20 (~100 bp) reads. I: Identity; P: Positives. Hit taxon denotes the taxon/organism corresponding to the hit sequence.


Steps followed for taxonomic classification of viral metagenomic sequences:

Supplementary Figure 5 depicts the steps followed by the ProViDE algorithm. The output of a BLASTx search against the nr database is taken as input by ProViDE. For each hit, ProViDE first parses the values of the various alignment parameters. For each read, based on its length, an appropriate taxonomic level of assignment (TL) is subsequently identified (Figure 1). The taxonomic assignment of the read is then carried out using the orthology approach used in SOrt-ITEMS [4]. The final taxonomic assignment of the read is thus restricted to a taxonomic level that lies at or above the TL.
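The level-restriction idea described above can be illustrated with a minimal Python sketch. It is not the ProViDE implementation: the rank list, the `Hit` fields, the function names and, in particular, the identity/positives cut-offs in `PLACEHOLDER_THRESHOLDS` are hypothetical stand-ins for the read-length-specific values given in Figure 1, and the orthology step is omitted. Taxon names are invented.

```python
from dataclasses import dataclass

# Ranks from most general to most specific. Viral lineages often skip ranks,
# so a lineage is stored as a rank -> taxon mapping with possible gaps.
RANKS = ["superkingdom", "order", "family", "genus", "species"]

# Hypothetical cut-offs: rank -> (minimum identity %, minimum positives %).
# These are placeholders, NOT the thresholds from Figure 1.
PLACEHOLDER_THRESHOLDS = {
    "species": (90.0, 95.0),
    "genus": (70.0, 80.0),
    "family": (50.0, 60.0),
    "order": (40.0, 50.0),
}


@dataclass
class Hit:
    """Alignment parameters parsed from one BLASTx hit (illustrative fields)."""
    identity_pct: float
    positives_pct: float
    lineage: dict  # rank -> taxon name of the hit organism


def taxonomic_level(hit, thresholds=PLACEHOLDER_THRESHOLDS):
    """Return the most specific rank whose cut-offs the hit satisfies."""
    for rank in reversed(RANKS[1:]):  # species first, then higher ranks
        min_identity, min_positives = thresholds[rank]
        if hit.identity_pct >= min_identity and hit.positives_pct >= min_positives:
            return rank
    return RANKS[0]  # poorest alignments are held at the super-kingdom level


def restrict_to_level(lineage, level):
    """Return (rank, taxon) at `level`, or at the nearest higher rank present."""
    for rank in reversed(RANKS[: RANKS.index(level) + 1]):
        if rank in lineage:
            return rank, lineage[rank]
    return "root", "root"


hit = Hit(
    identity_pct=62.0,
    positives_pct=75.0,
    lineage={"superkingdom": "Viruses", "family": "Family X",
             "genus": "Genus Y", "species": "Virus Z"},
)
level = taxonomic_level(hit)
assigned = restrict_to_level(hit.lineage, level)
```

Storing the lineage as a mapping rather than a fixed-length list lets the sketch fall back to the nearest higher rank when a viral lineage lacks the rank selected as TL.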

Data-sets and database variants used for evaluating binning accuracy and specificity:

A total of 140,000 sequences were generated from 35 viral genomes (Supplementary Table 6). These genomes were different from the ones used for obtaining the alignment parameter thresholds. Based on their length, these sequences were divided into four test data-sets, namely Sanger, 454-400, 454-250 and 454-100. To evaluate the performance of ProViDE with respect to sequences originating from unknown viral genomes, sequences in each data-set were queried (using BLASTx) against two variants of the nr database, namely (a) the nr database excluding sequences belonging to the query genome ('MINUS SPECIES') and (b) the nr database excluding all sequences that fall under the immediate higher-level taxonomic group to which the query species belongs ('MINUS ONE LEVEL UP'). The BLASTx outputs obtained were given as input to ProViDE. The results of ProViDE were also compared with the corresponding results generated by a similarity-based binning method, MEGAN [6]. Both programs were run with a min-support value of 2 and a bit-score threshold of 35.
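The two database variants can be pictured as filters over the reference entries. The sketch below is only an illustration under assumed data structures: each reference entry is a dictionary with a `lineage` list running from the super-kingdom down to the species, the function names are hypothetical, and the taxon names are invented. It does not reproduce how the nr variants were actually built.

```python
def minus_species(reference, query_lineage):
    """'MINUS SPECIES': drop entries belonging to the query species itself."""
    query_species = query_lineage[-1]
    return [entry for entry in reference if entry["lineage"][-1] != query_species]


def minus_one_level_up(reference, query_lineage):
    """'MINUS ONE LEVEL UP': drop every entry that falls under the taxon
    immediately above the query species."""
    parent_taxon = query_lineage[-2]
    return [entry for entry in reference if parent_taxon not in entry["lineage"]]


# Toy reference set with invented taxa.
reference = [
    {"id": "seq1", "lineage": ["Viruses", "Family X", "Genus Y", "Virus A"]},
    {"id": "seq2", "lineage": ["Viruses", "Family X", "Genus Y", "Virus B"]},
    {"id": "seq3", "lineage": ["Viruses", "Family X", "Genus W", "Virus C"]},
]
query_lineage = ["Viruses", "Family X", "Genus Y", "Virus A"]

variant_a = minus_species(reference, query_lineage)
variant_b = minus_one_level_up(reference, query_lineage)
```

In this toy case the first variant still contains a sibling species from the same genus, whereas the second leaves only sequences from a different genus, which mimics a read from a more distantly related unknown virus.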

Categorization of taxonomic assignments:

The assignment of a read to a taxon that lies on the path between the root and the taxon corresponding to the source organism of the read was categorized as 'correct'. To quantify specificity, these 'correct' assignments were sub-grouped into two categories. All correct assignments at the level of root, cellular organisms or super-kingdom (Viruses) were considered 'non-specific'. Assignments below the level of super-kingdom were considered 'specific'. The assignment of a read to a taxon that does not lie on the path between the root and the taxon corresponding to the source organism of the read was categorized as 'wrong'. Reads whose hits had a bit-score of less than 35 and/or an alignment length of less than 25 were categorized as 'unassigned'. All reads with no BLAST hits were categorized as 'no hits'.
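The categorization rules above translate into a short decision function. The following is a sketch under stated assumptions rather than the evaluation script used in the study: the function name and the dictionary keys are hypothetical, a read is treated as unassigned when every one of its hits falls below the bit-score or alignment-length cut-off (one possible reading of the rule), and the lineage is a list running from the root to the source organism.

```python
# Taxa treated as non-specific levels in the text.
NON_SPECIFIC_TAXA = {"root", "cellular organisms", "Viruses"}

MIN_BIT_SCORE = 35.0   # cut-off stated in the text
MIN_ALIGN_LENGTH = 25  # cut-off stated in the text


def categorize(assigned_taxon, source_lineage, hits):
    """Place one read into an evaluation bin.

    assigned_taxon -- taxon the binning method assigned the read to
    source_lineage -- taxa from the root down to the read's source organism
    hits           -- dicts with 'bit_score' and 'align_len' keys (assumed layout)
    """
    if not hits:
        return "no hits"
    # Assumption: the read is unassigned only if no hit passes both cut-offs.
    if all(h["bit_score"] < MIN_BIT_SCORE or h["align_len"] < MIN_ALIGN_LENGTH
           for h in hits):
        return "unassigned"
    if assigned_taxon in source_lineage:
        if assigned_taxon in NON_SPECIFIC_TAXA:
            return "correct (non-specific)"
        return "correct (specific)"
    return "wrong"


# Invented lineage for illustration.
source = ["root", "Viruses", "Family X", "Genus Y", "Virus A"]
good_hit = [{"bit_score": 80.0, "align_len": 60}]
weak_hit = [{"bit_score": 30.0, "align_len": 60}]
```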

Discussion:

Table 1 shows the evaluation results in terms of the total number of correct assignments, wrong assignments, and the number of sequences categorized as unassigned. As expected, the percentage of total correct assignments increases with increasing read length. For all four test data-sets, however, the percentage of 'correct' assignments by ProViDE is around 1.7 to 3 times higher than that by MEGAN. Since, for both methods, most (if not all) correct assignments are at specific levels, the relative specificity obtained with ProViDE is also around 1.7 to 3 times higher than that with MEGAN. Furthermore, the percentage of sequences misclassified by ProViDE is in the range of 3-19% (as compared to 5-42% for MEGAN), indicating significantly better assignment accuracy. A similar number of sequences are categorized as 'unassigned' by both programs, indicating that the relatively high accuracy obtained using ProViDE does not come at the cost of a decreased number of assignments. One of the important aspects of metagenomic sequence analysis is the assignment of metagenomic sequences to the correct taxonomic groups. Given that metagenomic sequence data sets typically contain millions of sequences, the majority of which originate from new/hitherto unknown organisms, accurate and specific taxonomic assignment of metagenomic sequences remains a major computational challenge.
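The "1.7 to 3 times" figure can be checked directly from the 'TOTAL CORRECT ASSIGNMENTS' rows of Table 1. The short calculation below uses the percentages exactly as printed there, with the four blocks of the table taken in the order given in its caption; the variable names are illustrative.

```python
# (ProViDE %, MEGAN %) of total correct assignments, copied from Table 1.
total_correct = {
    ("454-100", "MINUS SPECIES"): (25.4, 14.7),
    ("454-100", "MINUS ONE LEVEL UP"): (5.0, 2.4),
    ("454-250", "MINUS SPECIES"): (44.2, 24.7),
    ("454-250", "MINUS ONE LEVEL UP"): (18.8, 6.7),
    ("454-400", "MINUS SPECIES"): (52.5, 27.9),
    ("454-400", "MINUS ONE LEVEL UP"): (27.2, 8.8),
    ("Sanger", "MINUS SPECIES"): (60.2, 34.3),
    ("Sanger", "MINUS ONE LEVEL UP"): (35.3, 14.7),
}

# Ratio of ProViDE to MEGAN correct assignments for each data-set/database pair.
ratios = {key: provide / megan for key, (provide, megan) in total_correct.items()}

lowest = min(ratios.values())
highest = max(ratios.values())

for (dataset, variant), ratio in ratios.items():
    print(f"{dataset:8s} {variant:20s} {ratio:.2f}x")
```

The smallest ratio comes from the 454-100 'MINUS SPECIES' set and the largest from the 454-400 'MINUS ONE LEVEL UP' set, so the advantage is more pronounced when close relatives of the query are absent from the database.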

In the current study, we have presented an algorithm (ProViDE) that is specifically customized for the taxonomic analysis of viral metagenome data sets. The majority of reads in viral metagenomic data-sets originate from hitherto unknown viral groups, the sequences of which are absent from existing reference databases. Consequently, a majority of these sequences generate poor-quality alignments with sequences in reference databases. Assigning these sequences directly to the taxon corresponding to the best hit (irrespective of alignment quality) is expected to generate a large number of incorrect assignments. Validation results generated in the present study also indicate that the popular binning algorithm MEGAN, which is based on the least common ancestor approach, has a very high misclassification rate (reaching roughly 41-42% for the Sanger data sets; Table 1). This high misclassification rate is expected, since MEGAN uses a single alignment parameter (bit-score) for judging alignment quality prior to assignment. Consequently, MEGAN ends up misclassifying a large number of reads, especially those having poor-quality alignments (with identities as low as 20%). Furthermore, as demonstrated by earlier studies [4], the least common ancestor (LCA) approach used by MEGAN is generally associated with poor binning specificity (especially in metagenomic scenarios wherein the majority of reads originate from unknown organisms).
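To make the least common ancestor idea concrete, the sketch below filters hits by bit-score alone and assigns the read to the deepest taxon shared by all remaining hits. It is a simplified illustration, not MEGAN's implementation: it ignores MEGAN's other parameters (such as the min-support setting mentioned in the Methodology), and the function names, data layout and taxon names are hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages.

    Each lineage is a list of taxa running from the root to the hit organism.
    """
    lca = "root"
    for taxa_at_depth in zip(*lineages):  # walk all lineages in parallel
        if len(set(taxa_at_depth)) != 1:
            break  # lineages diverge at this depth
        lca = taxa_at_depth[0]
    return lca


def lca_assignment(hits, min_bit_score=35.0):
    """Assign a read to the LCA of all hits that pass the bit-score cut-off."""
    lineages = [h["lineage"] for h in hits if h["bit_score"] >= min_bit_score]
    if not lineages:
        return None  # nothing passes the filter, so the read stays unassigned
    return lowest_common_ancestor(lineages)


# Two weak hits to invented taxa from different families.
hits = [
    {"bit_score": 41.0, "lineage": ["root", "Viruses", "Family X", "Genus Y", "Virus A"]},
    {"bit_score": 38.0, "lineage": ["root", "Viruses", "Family Q", "Genus R", "Virus B"]},
]
assignment = lca_assignment(hits)
```

In this toy case the two hits agree only down to the super-kingdom, so the read is placed at 'Viruses', a non-specific level. If only one of the two hits had passed the bit-score filter, the read would instead have been assigned straight to that hit's species regardless of identity, which illustrates how a bit-score-only filter can lead to misclassification.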

In contrast, ProViDE uses multiple alignment parameters, namely bit-score, identities and positives (the thresholds of which were specially identified for viral metagenomic sequences), for ascertaining the quality of an alignment. This ensures that reads are assigned at specific levels only if they generate high-quality alignments with database sequences. As the alignment quality decreases, ProViDE assigns reads at progressively higher taxonomic levels. Validation results indicate that this approach significantly reduces the number of incorrectly assigned sequences. Validation results also indicate that ProViDE correctly assigns a greater number of sequences at specific levels (as compared to MEGAN). This demonstrates the overall utility of the ProViDE algorithm for accurate and specific taxonomic assignment of viral metagenomic sequences. A comparative evaluation of binning time indicates that ProViDE takes approximately an hour to process the BLASTx output obtained for a data-set of 100,000 reads. This is marginally higher than the time taken by MEGAN for analysing the same number of reads. Supplementary Figure 6 gives a plot of this time comparison.

Conclusion:

Performance evaluation with data-sets and database variants simulating typical metagenomic scenarios indicates that ProViDE has high specificity and accuracy. To the best of our knowledge, ProViDE is the first similarity-based binning algorithm that provides an accurate and specific taxonomic label to most of the reads constituting viral metagenomic data sets.

References:

[1] Williamson SJ et al. PLoS ONE. 2008 3(1): e1456 [PMID: 18213365]

[2] Lindell D et al. Nature. 2005 438: 86 [PMID: 16222247]

[3] Willner D et al. PLoS ONE. 2009 4(10): e7370 [PMID: 19816605]

[4] Monzoorul Haque M et al. Bioinformatics. 2009 25: 1722 [PMID: 19439565]

[5] Richter DC et al. PLoS ONE. 2008 3(10): e3373 [PMID: 18841204]

[6] Huson DH et al. Genome Res. 2007 17: 377 [PMID: 17255551]

Edited by TW Tan

Citation: Ghosh et al. Bioinformation 6(2): 91-94 (2011)

License statement: This is an open-access article, which permits unrestricted use, distribution, and reproduction in any medium, for non-commercial purposes, provided the original author and source are credited.


Supplementary material:

Table 1: Comparison of the percentage of reads assigned under various bin categories by ProViDE and MEGAN for the (A) 454-100, (B) 454-250, (C) 454-400 and (D) Sanger data sets. In this table, the terms 'MINUS SPECIES' and 'MINUS ONE LEVEL UP' refer to the database variants used. A detailed description of the database variants is given in the Methodology section of the manuscript. Note that the subtotals may vary by a value of 0.1, since the individual values were rounded off to single decimals.

(A) 454_100

ASSIGNMENT CATEGORIES        MINUS SPECIES        MINUS ONE LEVEL UP
                             ProViDE   MEGAN      ProViDE   MEGAN
NON SPECIFIC LEVELS          0         1.2        0         0
SPECIFIC LEVELS              25.4      13.5       5         2.4
TOTAL CORRECT ASSIGNMENTS    25.4      14.7       5         2.4
WRONG                        5.2       12.3       2.7       5.2
UNASSIGNED + NO HITS         69.4      73.1       92.4      92.4

(B) 454_250

ASSIGNMENT CATEGORIES        MINUS SPECIES        MINUS ONE LEVEL UP
                             ProViDE   MEGAN      ProViDE   MEGAN
NON SPECIFIC LEVELS          0         1.7        0         0
SPECIFIC LEVELS              44.2      23.0       18.8      6.7
TOTAL CORRECT ASSIGNMENTS    44.2      24.7       18.8      6.7
WRONG                        5.2       24.7       3.5       15.4
UNASSIGNED + NO HITS         50.6      50.7       77.7      77.8

(C) 454_400

ASSIGNMENT CATEGORIES        MINUS SPECIES        MINUS ONE LEVEL UP
                             ProViDE   MEGAN      ProViDE   MEGAN
NON SPECIFIC LEVELS          0         1.7        0         0.1
SPECIFIC LEVELS              52.5      26.2       27.2      8.7
TOTAL CORRECT ASSIGNMENTS    52.5      27.9       27.2      8.8
WRONG                        4.7       29.4       3.4       21.8
UNASSIGNED + NO HITS         42.7      42.8       69.4      69.4

(D) SANGER

ASSIGNMENT CATEGORIES        MINUS SPECIES        MINUS ONE LEVEL UP
                             ProViDE   MEGAN      ProViDE   MEGAN
NON SPECIFIC LEVELS          0         2.3        0         0.5
SPECIFIC LEVELS              60.2      32.0       35.3      14.2
TOTAL CORRECT ASSIGNMENTS    60.2      34.3       35.3      14.7
WRONG                        14.2      41.1       19.7      41.9
UNASSIGNED + NO HITS         25.7      24.7       45.0      43.4

Data 1

Data

March 2011

Tarini Shankar Ghosh · Mohammed Monzoorul Haque · Dinakar Komanduri · Sharmila S Mande

Virtifier: A deep learning-based identifier for viral sequences from metagenomes

Article

Dec 2021

Motivation Viruses, the most abundant biological entities on earth, are important components of microbial communities, and as major human pathogens, they are responsible for human mortality and morbidity. The identification of viral sequences from metagenomes is critical for viral analysis. As massive quantities of short sequences are generated by next-generation sequencing (NGS), most methods utilize discrete and sparse one-hot vectors to encode nucleotide sequences, which are usually ineffective in viral identification. Results In this paper, Virtifier, a deep learning-based viral identifier for sequences from metagenomic data, is proposed. It includes a meaningful nucleotide sequence encoding method named Seq2Vec and a variant viral sequence predictor with an attention-based Long Short-Term Memory (LSTM) network. By utilizing a fully trained embedding matrix to encode codons, Seq2Vec can efficiently extract the relationships among those codons in a nucleotide sequence. Combined with an attention layer, the LSTM neural network can further analyze the codon relationships and sift the parts that contribute to the final features. Experimental results of three datasets have shown that Virtifier can accurately identify short viral sequences (< 500 bp) from metagenomes, surpassing three widely used methods, VirFinder, DeepVirFinder and PPR-Meta. Meanwhile, a comparable performance was achieved by Virtifier at longer lengths (> 5,000bp). Availability A Python implementation of Virtifier and the Python code developed for this study have been provided on Github https://github.com/crazyinter/Seq2Vec. Supplementary information Supplementary data are available at Bioinformatics online.

ITN-VIROINF: Understanding (Harmful) Virus-Host Interactions by Linking Virology and Bioinformatics

Article

Full-text available

Apr 2021

Many recent studies highlight the fundamental importance of viruses. Besides their important role as human and animal pathogens, their beneficial, commensal or harmful functions are poorly understood. By developing and applying tailored bioinformatical tools in important virological models, the Marie Skłodowska-Curie Initiative International Training Network VIROINF will provide a better understanding of viruses and the interaction with their hosts. This will open the door to validate methods of improving viral growth, morphogenesis and development, as well as to control strategies against unwanted microorganisms. The key feature of VIROINF is its interdisciplinary nature, which brings together virologists and bioinformaticians to achieve common goals.

Metagenomics and Diagnosis of Zoonotic Diseases

Chapter

Full-text available

Mar 2018

Discovery of Virus-Host interactions using bioinformatic tools

Chapter

Mar 2022
METHOD CELL BIOL

Viruses are a diverse biological group capable of infecting several hosts such as bacteria, plants, and animals, including humans. Viral infections constitute a threat to the human population as they may cause high mortality rates, decrease food production, and generate large economical losses. Viruses co-evolve with their hosts and this constant evolution must be clarified to better predict possible viral outbreaks, and to develop improved diagnostic methods and therapeutical approaches. In this review, we summarize several viral databases that store key information retrieved from a variety of omics approaches. Furthermore, we explore the use of such databases to predict Virus-Host interactions through artificial intelligence algorithms, focusing on the latest methodologies to characterize biological networks.

Plant proteomics

Chapter

Jan 2019

Metagenomics a modern approach to reveal the secrets of unculturable microbes

Chapter

Jan 2019

Metagenomics a modern approach to reveal the secrets of unculturable microbes

Chapter

May 2019

1) Brief Introduction about Metagenomics 2) History of the metagenomic approach 3) Approach, strategies, and tools used in the metagenomic analysis 4) Application of the metagenomic approach

Cyber Crime Assessments

Article

Full-text available

Dec 2018

Yerra Shankar Rao

Cybercrime is a kind of crime that happens in “cyberspace”, that is crime that happens in the world of computer and the Internet. Although many people have a limited knowledge of “cybercrime”, this kind of crime has the serious potential for severe impact on our lives and society, because our society is becoming an information society, full of information exchange happening in “cyberspace”. Elderly is that vulnerable group who has been deprived from any information regarding latest technologies and innovation especially in the area of computer world and has lack of knowledge about internet and become the victim of different types of cybercrime. The main objective was to assess the types of cyber crime faced by the elderly. The research design was cross-sectional in nature. 60 respondents each residing in their homes and old age homes respectively were selected from different areas of Bhubaneswar city. Total sample size was 120. The purposive random sampling technique was used to collect the data. Finding of the study revealed that majority of respondents reported that they were not affected by cyber pornography, phishing, money laundering, password sniffer, credit card fraud and even web jacking either residing in own homes or old age homes.

Exploring the Human Microbiome: The Potential Future Role of Next-Generation Sequencing in Disease Diagnosis and Treatment

Article

Full-text available

Nov 2018

The interaction between the human microbiome and immune system has an effect on several human metabolic functions and impacts our well-being. Additionally, the interaction between humans and microbes can also play a key role in determining the wellness or disease status of the human body. Dysbiosis is related to a plethora of diseases, including skin, inflammatory, metabolic, and neurological disorders. A better understanding of the host-microbe interaction is essential for determining the diagnosis and appropriate treatment of these ailments. The significance of the microbiome on host health has led to the emergence of new therapeutic approaches focused on the prescribed manipulation of the host microbiome, either by removing harmful taxa or reinstating missing beneficial taxa and the functional roles they perform. Culturing large numbers of microbial taxa in the laboratory is problematic at best, if not impossible. Consequently, this makes it very difficult to comprehensively catalog the individual members comprising a specific microbiome, as well as understanding how microbial communities function and influence host-pathogen interactions. Recent advances in sequencing technologies and computational tools have allowed an increasing number of metagenomic studies to be performed. These studies have provided key insights into the human microbiome and a host of other microbial communities in other environments. In the present review, the role of the microbiome as a therapeutic agent and its significance in human health and disease is discussed. Advances in high-throughput sequencing technologies for surveying host-microbe interactions are also discussed. Additionally, the correlation between the composition of the microbiome and infectious diseases as described in previously reported studies is covered as well. Lastly, recent advances in state-of-the-art bioinformatics software, workflows, and applications for analysing metagenomic data are summarized.

Overview of Virus Metagenomic Classification Methods and Their Biological Applications

Article

Full-text available

Apr 2018

Metagenomics poses opportunities for clinical and public health virology applications by offering a way to assess complete taxonomic composition of a clinical sample in an unbiased way. However, the techniques required are complicated and analysis standards have yet to develop. This, together with the wealth of different tools and workflows that have been proposed, poses a barrier for new users. We evaluated 49 published computational classification workflows for virus metagenomics in a literature review. To this end, we described the methods of existing workflows by breaking them up into five general steps and assessed their ease-of-use and validation experiments. Performance scores of previous benchmarks were summarized and correlations between methods and performance were investigated. We indicate the potential suitability of the different workflows for (1) time-constrained diagnostics, (2) surveillance and outbreak source tracing, (3) detection of remote homologies (discovery), and (4) biodiversity studies. We provide two decision trees for virologists to help select a workflow for medical or biodiversity studies, as well as directions for future developments in clinical viral metagenomics.

Metagenomic Analysis of Respiratory Tract DNA Viral Communities in Cystic Fibrosis and Non-Cystic Fibrosis Individuals

Article

Full-text available

Oct 2009
PLOS ONE

The human respiratory tract is constantly exposed to a wide variety of viruses, microbes and inorganic particulates from environmental air, water and food. Physical characteristics of inhaled particles and airway mucosal immunity determine which viruses and microbes will persist in the airways. Here we present the first metagenomic study of DNA viral communities in the airways of diseased and non-diseased individuals. We obtained sequences from sputum DNA viral communities in 5 individuals with cystic fibrosis (CF) and 5 individuals without the disease. Overall, diversity of viruses in the airways was low, with an average richness of 175 distinct viral genotypes. The majority of viral diversity was uncharacterized. CF phage communities were highly similar to each other, whereas Non-CF individuals had more distinct phage communities, which may reflect organisms in inhaled air. CF eukaryotic viral communities were dominated by a few viruses, including human herpesviruses and retroviruses. Functional metagenomics showed that all Non-CF viromes were similar, and that CF viromes were enriched in aromatic amino acid metabolism. The CF metagenomes occupied two different metabolic states, probably reflecting different disease states. There was one outlying CF virome which was characterized by an over-representation of Guanosine-5'-triphosphate,3'-diphosphate pyrophosphatase, an enzyme involved in the bacterial stringent response. Unique environments like the CF airway can drive functional adaptations, leading to shifts in metabolic profiles. These results have important clinical implications for CF, indicating that therapeutic measures may be more effective if used to change the respiratory environment, as opposed to shifting the taxonomic composition of resident microbiota.

Haque MM, Ghosh TS, Komanduri D, Mande SS.. SOrt-ITEMS: Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics 25: 1722-1730

Article

Full-text available

Jun 2009
BIOINFORMATICS

One of the first steps in metagenomic analysis is the assignment of reads/contigs obtained from various sequencing technologies to their correct taxonomic bins. Similarity-based binning methods assign a read to a taxon/clade, based on the pattern of significant BLAST hits generated against sequence databases. Existing methods, which use bit-score as the sole parameter to ascertain the significance of BLAST hits, have limited specificity and accuracy of binning. A new binning algorithm, called SOrt-ITEMS is introduced, which addresses these limitations. The method uses alignment parameters besides the bit score to first identify an appropriate taxonomic level where the read can be assigned. An orthology-based approach is subsequently used by the method for the final assignment. The performance of SOrt-ITEMS has been validated with reads simulating sequences from 454 and Sanger sequencing technologies. In addition, the taxonomic composition of the Sargasso Sea data set has been analyzed using SOrt-ITEMS. SOrt-ITEMS shows improved specificity and accuracy of assignments especially in simulated scenarios, wherein sequences corresponding to the source organism of the reads are absent in the reference database. SOrt-ITEMS software is available for download from: http://metagenomics.atc.tcs.com/binning/SOrt-ITEMS. No license is needed for academic and nonprofit use.

MetaSim—A Sequencing Simulator for Genomics and Metagenomics

Article

Full-text available

Feb 2008
PLOS ONE

The new research field of metagenomics is providing exciting insights into various, previously unclassified ecological systems. Next-generation sequencing technologies are producing a rapid increase of environmental data in public databases. There is great need for specialized software solutions and statistical methods for dealing with complex metagenome data sets. To facilitate the development and improvement of metagenomic tools and the planning of metagenomic projects, we introduce a sequencing simulator called MetaSim. Our software can be used to generate collections of synthetic reads that reflect the diverse taxonomical composition of typical metagenome data sets. Based on a database of given genomes, the program allows the user to design a metagenome by specifying the number of genomes present at different levels of the NCBI taxonomy, and then to collect reads from the metagenome using a simulation of a number of different sequencing technologies. A population sampler optionally produces evolved sequences based on source genomes and a given evolutionary tree. MetaSim allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software.

Photosynthesis genes in marine viruses yield proteins during host infection

Article

Dec 2005
NATURE

Cyanobacteria, and the viruses (phages) that infect them, are significant contributors to the oceanic 'gene pool'. This pool is dynamic, and the transfer of genetic material between hosts and their phages probably influences the genetic and functional diversity of both. For example, photosynthesis genes of cyanobacterial origin have been found in phages that infect Prochlorococcus and Synechococcus, the numerically dominant phototrophs in ocean ecosystems. These genes include psbA, which encodes the photosystem II core reaction centre protein D1, and high-light-inducible (hli) genes. Here we show that phage psbA and hli genes are expressed during infection of Prochlorococcus and are co-transcribed with essential phage capsid genes, and that the amount of phage D1 protein increases steadily over the infective period. We also show that the expression of host photosynthesis genes declines over the course of infection and that replication of the phage genome is a function of photosynthesis. We thus propose that the phage genes are functional in photosynthesis and that they may be increasing phage fitness by supplementing the host production of these proteins.

The Sorcerer II Global Ocean Sampling Expedition: Metagenomic Characterization of Viruses within Aquatic Microbial Samples

Article

Feb 2008
PLOS ONE

Viruses are the most abundant biological entities on our planet. Interactions between viruses and their hosts impact several important biological processes in the world's oceans such as horizontal gene transfer, microbial diversity and biogeochemical cycling. Interrogation of microbial metagenomic sequence data collected as part of the Sorcerer II Global Ocean Expedition (GOS) revealed a high abundance of viral sequences, representing approximately 3% of the total predicted proteins. Cluster analyses of the viral sequences revealed hundreds to thousands of viral genes encoding various metabolic and cellular functions. Quantitative analyses of viral genes of host origin performed on the viral fraction of aquatic samples confirmed the viral nature of these sequences and suggested that significant portions of aquatic viral communities behave as reservoirs of such genetic material. Distributional and phylogenetic analyses of these host-derived viral sequences also suggested that viral acquisition of environmentally relevant genes of host origin is a more abundant and widespread phenomenon than previously appreciated. The predominant viral sequences identified within microbial fractions originated from tailed bacteriophages and exhibited varying global distributions according to viral family. Recruitment of GOS viral sequence fragments against 27 complete aquatic viral genomes revealed that only one reference bacteriophage genome was highly abundant and was closely related, but not identical, to the cyanomyovirus P-SSM4. The co-distribution across all sampling sites of P-SSM4-like sequences with the dominant ecotype of its host, Prochlorococcus, supports the classification of the viral sequences as P-SSM4-like and suggests that this virus may influence the abundance, distribution and diversity of one of the most dominant components of picophytoplankton in oligotrophic oceans.
In summary, the abundance and broad geographical distribution of viral sequences within microbial fractions, the prevalence of genes among viral sequences that encode microbial physiological function and their distinct phylogenetic distribution lend strong support to the notion that viral-mediated gene acquisition is a common and ongoing mechanism for generating microbial diversity in the marine environment.

MEGAN analysis of metagenomic data

Article

Apr 2007
GENOME RES

Metagenomics is the study of the genomic content of a sample of organisms obtained from a common habitat using targeted or random sequencing. Goals include understanding the extent and role of microbial diversity. The taxonomical content of such a sample is usually estimated by comparison against sequence databases of known sequences. Most published studies use the analysis of paired-end reads, complete sequences of environmental fosmid and BAC clones, or environmental assemblies. Emerging sequencing-by-synthesis technologies with very high throughput are paving the way to low-cost random "shotgun" approaches. This paper introduces MEGAN, a new computer program that allows laptop analysis of large metagenomic data sets. In a preprocessing step, the set of DNA sequences is compared against databases of known sequences using BLAST or another comparison tool. MEGAN is then used to compute and explore the taxonomical content of the data set, employing the NCBI taxonomy to summarize and order the results. A simple lowest common ancestor algorithm assigns reads to taxa such that the taxonomical level of the assigned taxon reflects the level of conservation of the sequence. The software allows large data sets to be dissected without the need for assembly or the targeting of specific phylogenetic markers. It provides graphical and statistical output for comparing different data sets. The approach is applied to several data sets, including the Sargasso Sea data set, a recently published metagenomic data set sampled from a mammoth bone, and several complete microbial genomes. Also, simulations that evaluate the performance of the approach for different read lengths are presented.
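The lowest common ancestor idea mentioned in this abstract can be illustrated with a minimal sketch. This is not MEGAN's implementation: the toy taxonomy, the taxon names and the function names below are hypothetical, and the real program additionally filters BLAST hits by score before placing a read. The sketch only shows how a read whose hits fall in several taxa ends up at the deepest node shared by all of them.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy (child -> parent); the root has no parent.
# These names are placeholders, not real NCBI taxonomy entries.
TOY_PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "order_X": "root",
    "family_A": "order_X",
    "family_B": "order_X",
    "virus_A1": "family_A",
    "virus_A2": "family_A",
    "virus_B1": "family_B",
}


def lineage(taxon: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the path from the root down to the given taxon."""
    path: List[str] = []
    node: Optional[str] = taxon
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def lowest_common_ancestor(taxa: List[str],
                           parent: Dict[str, Optional[str]]) -> str:
    """Return the deepest taxon shared by the lineages of all given taxa."""
    if not taxa:
        raise ValueError("at least one taxon is required")
    paths = [lineage(t, parent) for t in taxa]
    lca = paths[0][0]  # every lineage starts at the root
    # Walk the lineages in parallel and stop at the first disagreement.
    for level in zip(*paths):
        if all(node == level[0] for node in level):
            lca = level[0]
        else:
            break
    return lca
```

With this toy tree, a read whose hits are all within one family stays at that family, while a read with hits in two families is pushed up to their shared order. This mirrors the abstract's point that the level of the assigned taxon reflects how conserved the sequence is.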
