Accurate Phylogenetic Classification of Variable-length DNA fragments

Bioinformatics and Pattern Discovery Group, IBM Thomas J Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, New York 10598, USA.
Nature Methods (Impact Factor: 32.07). 01/2007; 4(1):63-72. DOI: 10.1038/nmeth976
Source: PubMed


Metagenome studies have retrieved vast amounts of sequence data from a variety of environments, leading to new discoveries and insights into the uncultured microbial world. Except for very simple communities, the encountered diversity has made fragment assembly and the subsequent analysis a challenging problem. A taxonomic characterization of metagenomic fragments is required for a deeper understanding of shotgun-sequenced microbial communities, but success has mostly been limited to sequences containing phylogenetic marker genes. Here we present PhyloPythia, a composition-based classifier that combines higher-level generic clades from a set of 340 completed genomes with sample-derived population models. Extensive analyses on synthetic and real metagenome data sets showed that PhyloPythia allows the accurate classification of most sequence fragments across all considered taxonomic ranks, even for unknown organisms. The method requires no more than 100 kb of training sequence for the creation of accurate models of sample-specific populations and can assign fragments ≥1 kb with high specificity.
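As a rough illustration of what "composition-based" means here, the sketch below turns a DNA fragment into a normalized k-mer frequency profile and assigns it to the reference sequence with the closest profile. This is not PhyloPythia's method: the abstract does not describe the classifier or its features in detail, so the nearest-profile rule, the Euclidean distance, the default k = 4, and the function names `kmer_profile`, `euclidean`, and `classify` are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    k-mers containing other symbols (for example N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(fragment, references, k=4):
    """Return the label whose reference profile is closest to the fragment.

    `references` maps a label to a reference sequence. This nearest-profile
    rule is a hypothetical stand-in, not the classifier from the paper.
    """
    query = kmer_profile(fragment, k)
    return min(
        references,
        key=lambda label: euclidean(query, kmer_profile(references[label], k)),
    )
```

A call such as `classify(fragment, {"clade_a": seq_a, "clade_b": seq_b})` returns one of the labels. Short fragments yield sparse, noisy profiles at larger k, which may be one reason composition-based tools tend to set a minimum fragment length.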



Available from: Isidore Rigoutsos
    • "We used TagCleaner [31] to trim tags from the 5′ end of each sequence in the cellular metagenome. Assembly of both the viral and cellular metagenomes was conducted in Geneious [32] using the “Medium Sensitivity” method, with a word length of 14, a maximum gap size of 2, maximum gaps per read of 15, and maximum mismatches of 2. To classify sequences using di, tri, and tetranucleotide analysis, we created a boutique database of bacterial and archaeal virus sequences (Table S2) as a training set to accompany the existing cellular dataset in PhylopythiaS [33], which was used to identify archaea, bacteria, archaeal viruses and bacterial viruses. Metagenomes were assembled in Geneious prior to classification with PhylopythiaS; only contigs over 1000bp in length were used. "
    ABSTRACT: The deep-sea hydrothermal vent habitat hosts a diverse community of archaea and bacteria that withstand extreme fluctuations in environmental conditions. Abundant viruses in these systems, a high proportion of which are lysogenic, must also withstand these environmental extremes. Here, we explore the evolutionary strategies of both microorganisms and viruses in hydrothermal systems through comparative analysis of a cellular and viral metagenome, collected by size fractionation of high temperature fluids from a diffuse flow hydrothermal vent. We detected a high enrichment of mobile elements and proviruses in the cellular fraction relative to microorganisms in other environments. We observed a relatively high abundance of genes related to energy metabolism as well as cofactors and vitamins in the viral fraction compared to the cellular fraction, which suggest encoding of auxiliary metabolic genes on viral genomes. Moreover, the observation of stronger purifying selection in the viral versus cellular gene pool suggests viral strategies that promote prolonged host integration. Our results demonstrate that there is great potential for hydrothermal vent viruses to integrate into hosts, facilitate horizontal gene transfer, and express or transfer genes that manipulate the hosts' functional capabilities.
    Full-text · Article · Oct 2014 · PLoS ONE
    • "This proliferation of metagenomic sequence data has resulted in the development of novel analytical approaches. Feature selection approaches exploit features from genomic patterns or composition [5,6], preserved sequence segments [7-9] or predetermined clade markers [10,11]. Assembly-based methods [12-15] have recently gained in popularity due to their increased sensitivity for strain identification. "
    ABSTRACT: Background: Recent innovations in sequencing technologies have provided researchers with the ability to rapidly characterize the microbial content of an environmental or clinical sample with unprecedented resolution. These approaches are producing a wealth of information that is providing novel insights into the microbial ecology of the environment and human health. However, these sequencing-based approaches produce large and complex datasets that require efficient and sensitive computational analysis workflows. Many recent tools for analyzing metagenomic-sequencing data have emerged, however, these approaches often suffer from issues of specificity, efficiency, and typically do not include a complete metagenomic analysis framework. Results: We present PathoScope 2.0, a complete bioinformatics framework for rapidly and accurately quantifying the proportions of reads from individual microbial strains present in metagenomic sequencing data from environmental or clinical samples. The pipeline performs all necessary computational analysis steps; including reference genome library extraction and indexing, read quality control and alignment, strain identification, and summarization and annotation of results. We rigorously evaluated PathoScope 2.0 using simulated data and data from the 2011 outbreak of Shiga-toxigenic Escherichia coli O104:H4. Conclusions: The results show that PathoScope 2.0 is a complete, highly sensitive, and efficient approach for metagenomic analysis that outperforms alternative approaches in scope, speed, and accuracy. The PathoScope 2.0 pipeline software is freely available for download at:
    Full-text · Article · Sep 2014
    • "Composition-based approaches rely on comparing signatures of short motifs from a query fragment to a reference genome -- for instance, a particular GC content, gene and protein family content, or k-mer frequency and distribution [71]. Composition based approaches include Phylopythia [118], PhylopythiaS [119], Phymm [120], the Naive Bayes Classifier [121], Sequedex [122], the Livermore Metagenomic Analysis Toolkit (LMAT) [97], GENIUS [96] and Kraken [99]. Alignment-based approaches compare reads to a set of labeled reference genomes using a basic local alignment search tool (BLAST)-based approach. "
    ABSTRACT: High throughput sequencing (HTS) generates large amounts of high quality sequence data for microbial genomics. The value of HTS for microbial forensics is the speed at which evidence can be collected and the power to characterize microbial-related evidence to solve biocrimes and bioterrorist events. As HTS technologies continue to improve, they provide increasingly powerful sets of tools to support the entire field of microbial forensics. Accurate, credible results allow analysis and interpretation, significantly influencing the course and/or focus of an investigation, and can impact the response of the government to an attack having individual, political, economic or military consequences. Interpretation of the results of microbial forensic analyses relies on understanding the performance and limitations of HTS methods, including analytical processes, assays and data interpretation. The utility of HTS must be defined carefully within established operating conditions and tolerances. Validation is essential in the development and implementation of microbial forensics methods used for formulating investigative leads attribution. HTS strategies vary, requiring guiding principles for HTS system validation. Three initial aspects of HTS, irrespective of chemistry, instrumentation or software are: 1) sample preparation, 2) sequencing, and 3) data analysis. Criteria that should be considered for HTS validation for microbial forensics are presented here. Validation should be defined in terms of specific application and the criteria described here comprise a foundation for investigators to establish, validate and implement HTS as a tool in microbial forensics, enhancing public safety and national security.
    Full-text · Article · Jul 2014 · Investigative Genetics