TWARIT: an extremely rapid and efficient approach for phylogenetic classification of metagenomic sequences.
ABSTRACT Phylogenetic assignment of individual sequence reads to their respective taxa, referred to as 'taxonomic binning', constitutes a key step of metagenomic analysis. Existing binning methods have limitations with respect to either time or the accuracy/specificity of binning. Given these limitations, a method that can bin vast amounts of metagenomic sequence data in a rapid, efficient and computationally inexpensive manner could profoundly influence metagenomic analysis in settings where computational resources are scarce. We introduce TWARIT, a hybrid binning algorithm that combines short-read alignment with composition-based signature sorting to achieve rapid binning rates without compromising binning accuracy and specificity. TWARIT is validated with simulated and real-world metagenomes, and the results demonstrate significantly lower overall binning times than those of existing methods. Furthermore, the binning accuracy and specificity of TWARIT are observed to be comparable or superior to those of existing methods. A web server implementing the TWARIT algorithm is available at http://metagenomics.atc.tcs.com/Twarit/
- Source available from: Tungadri Bose
ABSTRACT: Paired-end sequencing protocols, offered by next generation sequencing (NGS) platforms like Illumina, generate a pair of reads for every DNA fragment in a sample. Although this protocol has been used in several metagenomics studies, most taxonomic binning approaches classify the two reads of a pair independently. The present work explores some simple but effective strategies for using the pairing information of Illumina short reads to improve the accuracy of taxonomic binning of metagenomic datasets. The proposed strategies can be used in conjunction with all genres of existing binning methods. Validation results suggest that employing these "Binpairs" strategies can provide significant improvements in the binning outcome. The quality of the taxonomic assignments thus obtained is often comparable to that otherwise achievable only with the relatively longer reads obtained using other NGS platforms (such as Roche). An implementation of the proposed strategies is freely available for academic users at https://metagenomics.atc.tcs.com/binning/binpairs.
PLoS ONE 12/2014; 9(12):e114814. · 3.53 Impact Factor
ABSTRACT: A key challenge in analyzing metagenomics data pertains to the assembly of sequenced DNA fragments (i.e. reads) originating from various microbes in a given environmental sample. Several existing methodologies can assemble reads originating from a single genome. However, these methodologies cannot be applied for efficient assembly of metagenomic sequence datasets. In this study, we present MetaCAA - a clustering-aided methodology which helps in improving the quality of metagenomic sequence assembly. MetaCAA initially groups the sequences constituting a given metagenome into smaller clusters. Subsequently, the sequences in each cluster are independently assembled using CAP3, an existing single genome assembly program. The contigs formed in each of the clusters, along with the unassembled reads, are then subjected to another round of assembly to generate the final set of contigs. Validation using simulated and real-world metagenomic datasets indicates that MetaCAA aids in improving the overall quality of assembly. A software implementation of MetaCAA is available at https://metagenomics.atc.tcs.com/MetaCAA
Genomics 02/2014; · 2.79 Impact Factor
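The cluster-then-assemble flow described in the MetaCAA abstract can be sketched roughly as follows. This is only an illustration of the two-round control flow, not the MetaCAA implementation: the names `two_round_assembly`, `cluster_of` and `assemble` are hypothetical, the clustering criterion is not given in the abstract and is left to the caller, and `assemble` is a placeholder for whatever wraps a single-genome assembler such as CAP3.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

# Hypothetical assembler interface: takes sequences and returns
# (contigs, sequences that were left unassembled).
Assembler = Callable[[List[str]], Tuple[List[str], List[str]]]


def two_round_assembly(
    reads: Iterable[str],
    cluster_of: Callable[[str], Hashable],
    assemble: Assembler,
) -> List[str]:
    """Assemble reads cluster by cluster, then reassemble the pooled output."""
    # Step 1: group the reads into smaller clusters (criterion is an assumption
    # supplied by the caller; the abstract does not specify it).
    clusters: Dict[Hashable, List[str]] = defaultdict(list)
    for read in reads:
        clusters[cluster_of(read)].append(read)

    # Step 2: assemble each cluster independently and pool the contigs
    # together with the reads that remained unassembled.
    pooled: List[str] = []
    for members in clusters.values():
        contigs, leftovers = assemble(members)
        pooled.extend(contigs)
        pooled.extend(leftovers)

    # Step 3: a second round of assembly over the pooled sequences
    # yields the final set of contigs.
    final_contigs, _ = assemble(pooled)
    return final_contigs
```

Passing the assembler in as a callable keeps the sketch independent of any particular tool; in practice that callable would write the sequences to a file, run the external assembler, and parse its contig and singleton outputs.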
ABSTRACT: Sequencing is accepted as the "gold" standard for genetic analysis and continues to be used as a validation and reference tool. The idea of using sequence analysis directly for sample characterization has been met with skepticism. Here, however, the utility of using sequencing directly to identify multiple genomes present in samples is presented and reviewed. All samples and "pure" isolates are populations of genomes. Population-Sequencing is the use of probabilistic matching tools in combination with large volumes of sequence information to identify the genomes present: DNA is analyzed across entire genomes to determine genome assignments and to calculate confidence scores for major and minor genome content. Accurate genome identification from mixtures, without culture purification steps, can achieve phylogenetic classification by direct analysis of millions of DNA fragments. Genome sequencing data from mixtures can function as biomarkers that are used to interrogate the genetic content of samples; to establish a sample profile, inclusive of major and minor genome components; to drill down to identify rare SNP and mutation events; to compare the relatedness of genetic content between samples, profile-to-profile; and to provide a probabilistic or statistical confidence score for sample characterization and attribution. The application of Population-Sequencing will facilitate sample characterization and genome identification strategies.