Accurate phylogenetic classification of variable-length DNA fragments.

Bioinformatics and Pattern Discovery Group, IBM Thomas J Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, New York 10598, USA.
Nature Methods (Impact Factor: 23.57). 01/2007; 4(1):63-72. DOI: 10.1038/nmeth976
Source: PubMed

ABSTRACT Metagenome studies have retrieved vast amounts of sequence data from a variety of environments, leading to new discoveries and insights into the uncultured microbial world. Except for very simple communities, the encountered diversity has made fragment assembly and the subsequent analysis a challenging problem. A taxonomic characterization of metagenomic fragments is required for a deeper understanding of shotgun-sequenced microbial communities, but success has mostly been limited to sequences containing phylogenetic marker genes. Here we present PhyloPythia, a composition-based classifier that combines higher-level generic clades from a set of 340 completed genomes with sample-derived population models. Extensive analyses on synthetic and real metagenome data sets showed that PhyloPythia allows the accurate classification of most sequence fragments across all considered taxonomic ranks, even for unknown organisms. The method requires no more than 100 kb of training sequence for the creation of accurate models of sample-specific populations and can assign fragments ≥1 kb with high specificity.
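As a rough illustration of what "composition-based" can mean for variable-length fragments, the sketch below turns a DNA fragment into a normalized k-mer frequency profile, so that fragments of different lengths become comparable feature vectors. The abstract does not specify PhyloPythia's actual features or model; the function name `kmer_profile` and the default `k=4` are assumptions made only for this example.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies of a DNA sequence as a dict.

    Hypothetical helper for illustration only; not PhyloPythia's feature set.
    """
    seq = seq.upper()
    # All 4**k possible k-mers over the unambiguous DNA alphabet.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


# Usage: fragments of different lengths map to vectors of the same size.
short_fragment = kmer_profile("ACGTACGT", k=2)
long_fragment = kmer_profile("ACGTACGTTTGACCAGTACGATCG", k=2)
print(len(short_fragment), len(long_fragment))
```

Because the frequencies are normalized by the number of valid windows, the profile describes composition rather than length; a classifier trained on such vectors is one plausible reading of the approach, not a description of the published method.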

  • ABSTRACT: A strategy to understand the microbial components of the human genetic and metabolic landscape and how they contribute to normal physiology and predisposition to disease.
    05/2011: pages 307-312; ISBN: 9781118010518
  • ABSTRACT: Analysis of DNA sequences isolated directly from the environment, known as metagenomics, produces a large quantity of genome fragments that need to be classified into specific taxa. Most composition-based classification methods use all features instead of a subset of features that may maximize classifier accuracy. We show that feature selection methods can boost performance of taxonomic classifiers. This work proposes three different filter-based feature selection methods that stem from information theory: (1) a technique that combines Kullback-Leibler, Mutual Information, and distance information, (2) a text mining technique, TF-IDF, and (3) minimum-redundancy maximum-relevance (mRMR). The feature selection methods are compared by how well they improve support vector machine classification of genomic reads. Overall, the 6-mer mRMR method performs well, especially at the phylum level. If the total number of features is very large, feature selection becomes difficult because a small subset of features that captures a majority of the data variance is less likely to exist. Therefore, we conclude that there is a trade-off between feature set size and feature selection method to optimize classification performance. For larger feature set sizes, TF-IDF works better at finer resolutions, while mRMR performs best of all methods for N=6 at all taxonomic levels.
    Computational biology and chemistry 06/2011; 35(3):199-209. · 1.37 Impact Factor
  • ABSTRACT: A major goal in microbial ecology is to link specific microbial populations to environmental processes (e.g. biogeochemical transformations). The cultivation and characterization of isolates using genetic, biochemical and physiological tests provided direct links between organisms and their activities, but did not provide an understanding of the process networks in situ. Cultivation-independent molecular techniques have extended capabilities in this regard, and yet, for two decades, the focus has been on monitoring microbial community diversity and population dynamics by means of rRNA gene abundances or rRNA molecules. However, these approaches are not always well suited for establishing metabolic activity or microbial roles in ecosystem function. The current approaches, microbial community metagenomic and metatranscriptomic techniques, have been developed as other ways to study microbial assemblages, giving rise to exponentially increasing collections of information from numerous environments. This review considers some advantages and limitations of nucleic acid-based 'omic' approaches and discusses the potential for the integration of multiple molecular or computational techniques for a more effective assessment of links between specific microbial populations and ecosystem processes in situ. Establishing such connections will enhance the predictive power regarding ecosystem response to parameters or perturbations, and will bring us closer to integrating microbial data into ecosystem- and global-scale process measurements and models.
    FEMS Microbiology Ecology 01/2011; 75(1):2-16. · 3.56 Impact Factor
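
The filter-based feature selection mentioned in the second citing abstract above (the Computational Biology and Chemistry entry) can be illustrated with a much simpler stand-in: ranking discrete features by their mutual information with the class label. This is not the cited paper's combined Kullback-Leibler technique, TF-IDF, or mRMR; the function names and the presence/absence encoding of k-mers are assumptions made only for this sketch.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def rank_features(rows, labels):
    """Return feature indices sorted by mutual information with the labels.

    rows: list of equal-length lists of discrete values, for example
    1/0 flags for whether a given k-mer occurs in a read (an assumption).
    """
    n_features = len(rows[0])
    scores = [
        mutual_information([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


# Usage: feature 0 separates the two taxa, feature 1 carries no information.
rows = [[1, 0], [1, 1], [0, 0], [0, 1]]
labels = ["A", "A", "B", "B"]
print(rank_features(rows, labels))
```

A plain ranking like this scores each feature independently; methods such as mRMR additionally penalize redundancy among the selected features, which this sketch does not attempt.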
