Article

Accurate phylogenetic classification of variable-length DNA fragments.

Bioinformatics and Pattern Discovery Group, IBM Thomas J Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, New York 10598, USA.
Nature Methods (Impact Factor: 23.57). 01/2007; 4(1):63-72. DOI: 10.1038/nmeth976
Source: PubMed

ABSTRACT Metagenome studies have retrieved vast amounts of sequence data from a variety of environments leading to new discoveries and insights into the uncultured microbial world. Except for very simple communities, the encountered diversity has made fragment assembly and the subsequent analysis a challenging problem. A taxonomic characterization of metagenomic fragments is required for a deeper understanding of shotgun-sequenced microbial communities, but success has mostly been limited to sequences containing phylogenetic marker genes. Here we present PhyloPythia, a composition-based classifier that combines higher-level generic clades from a set of 340 completed genomes with sample-derived population models. Extensive analyses on synthetic and real metagenome data sets showed that PhyloPythia allows the accurate classification of most sequence fragments across all considered taxonomic ranks, even for unknown organisms. The method requires no more than 100 kb of training sequence for the creation of accurate models of sample-specific populations and can assign fragments >or=1 kb with high specificity.

0 Bookmarks
 · 
161 Views
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The main limitations in the analysis of viral metagenomes are perhaps the high genetic variability and the lack of information in extant databases. To address these issues, several bioinformatic tools have been specifically designed or adapted for metagenomics by improving read assembly and creating more sensitive methods for homology detection. This study compares the performance of different available assemblers and taxonomic annotation software using simulated viral-metagenomic data. We simulated two 454 viral metagenomes using genomes from NCBI's RefSeq database based on the list of actual viruses found in previously published metagenomes. Three different assembly strategies, spanning six assemblers, were tested for performance: overlap-layout-consensus algorithms Newbler, Celera and Minimo; de Bruijn graphs algorithms Velvet and MetaVelvet; and read probabilistic model Genovo. The performance of the assemblies was measured by the length of resulting contigs (using N50), the percentage of reads assembled and the overall accuracy when comparing against corresponding reference genomes. Additionally, the number of chimeras per contig and the lowest common ancestor were estimated in order to assess the effect of assembling on taxonomic and functional annotation. The functional classification of the reads was evaluated by counting the reads that correctly matched the functional data previously reported for the original genomes and calculating the number of over-represented functional categories in chimeric contigs. The sensitivity and specificity of tBLASTx, PhymmBL and the k-mer frequencies were measured by accurate predictions when comparing simulated reads against the NCBI Virus genomes RefSeq database. Assembling improves functional annotation by increasing accurate assignations and decreasing ambiguous hits between viruses and bacteria. However, the success is limited by the chimeric contigs occurring at all taxonomic levels. The assembler and its parameters should be selected based on the focus of each study. Minimo's non-chimeric contigs and Genovo's long contigs excelled in taxonomy assignation and functional annotation, respectively.tBLASTx stood out as the best approach for taxonomic annotation for virus identification. PhymmBL proved useful in datasets in which no related sequences are present as it uses genomic features that may help identify distant taxa. The k-frequencies underperformed in all viral datasets.
    BMC Genomics 01/2014; 15(1):37. · 4.40 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The evolutionary classification of influenza genes into lineages is a first step in understanding their molecular epidemiology and can inform the subsequent implementation of control measures. We introduce a novel approach called Lineage Assignment By Extended Learning (LABEL) to rapidly determine cladistic information for any number of genes without the need for time-consuming sequence alignment, phylogenetic tree construction, or manual annotation. Instead, LABEL relies on hidden Markov model profiles and support vector machine training to hierarchically classify gene sequences by their similarity to pre-defined lineages. We assessed LABEL by analyzing the annotated hemagglutinin genes of highly pathogenic (H5N1) and low pathogenicity (H9N2) avian influenza A viruses. Using the WHO/FAO/OIE H5N1 evolution working group nomenclature, the LABEL pipeline quickly and accurately identified the H5 lineages of uncharacterized sequences. Moreover, we developed an updated clade nomenclature for the H9 hemagglutinin gene and show a similarly fast and reliable phylogenetic assessment with LABEL. While this study was focused on hemagglutinin sequences, LABEL could be applied to the analysis of any gene and shows great potential to guide molecular epidemiology activities, accelerate database annotation, and provide a data sorting tool for other large-scale bioinformatic studies.
    PLoS ONE 01/2014; 9(1):e86921. · 3.73 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: Bacteriophages have key roles in microbial communities, to a large extent shaping the taxonomic and functional composition of the microbiome, but data on the connections between phage diversity and the composition of communities are scarce. Using taxon-specific marker genes, we identified and monitored 20 viral taxa in 252 human gut metagenomic samples, mostly at the level of genera. On average, five phage taxa were identified in each sample, with up to three of these being highly abundant. The abundances of most phage taxa vary by up to four orders of magnitude between the samples, and several taxa that are highly abundant in some samples are absent in others. Significant correlations exist between the abundances of some phage taxa and human host metadata: for example, 'Group 936 lactococcal phages' are more prevalent and abundant in Danish samples than in samples from Spain or the United States of America. Quantification of phages that exist as integrated prophages revealed that the abundance profiles of prophages are highly individual-specific and remain unique to an individual over a 1-year time period, and prediction of prophage lysis across the samples identified hundreds of prophages that are apparently active in the gut and vary across the samples, in terms of presence and lytic state. Finally, a prophage-host network of the human gut was established and includes numerous novel host-phage associations.
    The ISME Journal 03/2014; · 8.95 Impact Factor

Full-text (2 Sources)

View
45 Downloads
Available from
May 27, 2014