Accurate phylogenetic classification of variable-length DNA fragments

Bioinformatics and Pattern Discovery Group, IBM Thomas J Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, New York 10598, USA.
Nature Methods (Impact Factor: 25.95). 01/2007; 4(1):63-72. DOI: 10.1038/nmeth976
Source: PubMed

ABSTRACT Metagenome studies have retrieved vast amounts of sequence data from a variety of environments leading to new discoveries and insights into the uncultured microbial world. Except for very simple communities, the encountered diversity has made fragment assembly and the subsequent analysis a challenging problem. A taxonomic characterization of metagenomic fragments is required for a deeper understanding of shotgun-sequenced microbial communities, but success has mostly been limited to sequences containing phylogenetic marker genes. Here we present PhyloPythia, a composition-based classifier that combines higher-level generic clades from a set of 340 completed genomes with sample-derived population models. Extensive analyses on synthetic and real metagenome data sets showed that PhyloPythia allows the accurate classification of most sequence fragments across all considered taxonomic ranks, even for unknown organisms. The method requires no more than 100 kb of training sequence for the creation of accurate models of sample-specific populations and can assign fragments ≥1 kb with high specificity.
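
As a rough illustration of what "composition-based" can mean in practice, the sketch below turns a DNA fragment into a normalized k-mer frequency vector that a downstream classifier could consume. The abstract does not specify the features or the classifier, so the helper name `kmer_profile`, the choice of k = 4, and the handling of ambiguous bases are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies for a DNA fragment.

    Hypothetical helper for illustration only; not taken from PhyloPythia.
    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    valid = set(ALPHABET)
    # Count every overlapping window of length k that is unambiguous.
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    )
    total = sum(counts.values())
    # Fixed ordering over all 4**k possible k-mers, so that profiles
    # from different fragments are directly comparable.
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    if total == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: counts[kmer] / total for kmer in kmers}


if __name__ == "__main__":
    profile = kmer_profile("ACGTACGTNACG", k=2)
    # Print only the k-mers that actually occur in the fragment.
    for kmer, freq in profile.items():
        if freq > 0:
            print(kmer, round(freq, 3))
```

Because the profile is normalized by the number of counted windows, fragments of different lengths map to vectors on the same scale, which is one plausible way to handle variable-length input.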

Available from: Isidore Rigoutsos, Jun 18, 2015
  • Source
    ABSTRACT: Bacteriophages have key roles in microbial communities, to a large extent shaping the taxonomic and functional composition of the microbiome, but data on the connections between phage diversity and the composition of communities are scarce. Using taxon-specific marker genes, we identified and monitored 20 viral taxa in 252 human gut metagenomic samples, mostly at the level of genera. On average, five phage taxa were identified in each sample, with up to three of these being highly abundant. The abundances of most phage taxa vary by up to four orders of magnitude between the samples, and several taxa that are highly abundant in some samples are absent in others. Significant correlations exist between the abundances of some phage taxa and human host metadata: for example, 'Group 936 lactococcal phages' are more prevalent and abundant in Danish samples than in samples from Spain or the United States of America. Quantification of phages that exist as integrated prophages revealed that the abundance profiles of prophages are highly individual-specific and remain unique to an individual over a 1-year time period, and prediction of prophage lysis across the samples identified hundreds of prophages that are apparently active in the gut and vary across the samples, in terms of presence and lytic state. Finally, a prophage-host network of the human gut was established and includes numerous novel host-phage associations.
    The ISME Journal 03/2014; DOI:10.1038/ismej.2014.30 · 9.27 Impact Factor
  • Source
    ABSTRACT: The soil line is an important concept that describes the linear relationship between reflectance of bare soils in the near infrared (NIR) and red (R) spectral bands. Bare soil line parameters (slope and intercept) are used in calculating several vegetation indices. Previous studies have proposed both manual and empirical procedures in estimating the bare soil parameters. Manual procedures introduce some amount of subjectivity in identifying the soil line. Empirical methods often suffer because of variations caused by soil type, moisture, and organic matter contents. The existence of non-bare soil pixels also affects these procedures. In this study, we proposed an automated supervised learning algorithm using relevance vector machine (RVM) for extracting the soil line from Landsat images. The ten-fold cross validation (10-fold CV) indicated 92% accuracy for distinguishing bare soil and other non-bare soil pixels from an image. The area under the receiver operating characteristic (ROC) curve reached a value of 0.98 indicating a significant predicting power of the proposed procedure. Additionally, this procedure was evaluated using data from ten bare soil fields in the Texas High Plains region in 2008 and 2009. Statistical analysis indicated no significant difference between the observed and estimated bare soil line parameters. The proposed RVM-based procedure successfully incorporated machine learning algorithms into agricultural remote sensing, and eliminated the dependency on empiricism and minimized subjectivity.
    Remote Sensing Letters 01/2014; 5(2). DOI:10.1080/2150704X.2014.890759 · 1.43 Impact Factor
  • Source
    ABSTRACT: Like all organisms on the planet, environmental microbes are subject to the forces of molecular evolution. Metagenomic sequencing provides a means to access the DNA sequence of uncultured microbes. By combining DNA sequencing of microbial communities with evolutionary modeling and phylogenetic analysis we might obtain new insights into microbiology and also provide a basis for practical tools such as forensic pathogen detection. In this work we present an approach to leverage phylogenetic analysis of metagenomic sequence data to conduct several types of analysis. First, we present a method to conduct phylogeny-driven Bayesian hypothesis tests for the presence of an organism in a sample. Second, we present a means to compare community structure across a collection of many samples and develop direct associations between the abundance of certain organisms and sample metadata. Third, we apply new tools to analyze the phylogenetic diversity of microbial communities and again demonstrate how this can be associated to sample metadata. These analyses are implemented in an open source software pipeline called PhyloSift. As a pipeline, PhyloSift incorporates several other programs including LAST, HMMER, and pplacer to automate phylogenetic analysis of protein coding and RNA sequences in metagenomic datasets generated by modern sequencing platforms (e.g., Illumina, 454).
    01/2014; 2:e243. DOI:10.7717/peerj.243