Conference Paper

Evolutionary placement of short sequence reads on multi-core architectures.

DOI: 10.1109/AICCSA.2010.5586973
Conference: The 8th ACS/IEEE International Conference on Computer Systems and Applications, AICCSA 2010, Hammamet, Tunisia, May 16-19, 2010
Source: DBLP

ABSTRACT The application of high-performance computing methods in bioinformatics is becoming increasingly important because of the masses of data generated by novel short-read DNA sequencers. One important application of such short reads is the analysis of microbial communities, where the anonymous short reads need to be identified by sequence comparison to a set of reference sequences. This identification is required to analyze the microbial composition and biological diversity of the sample. We briefly introduce a new algorithm for evolutionary (phylogenetic) placement of short reads under the Maximum Likelihood criterion and implement it in RAxML. While this algorithm is significantly more accurate than plain pair-wise sequence comparison, it can become highly compute-intensive when 100,000 or more reads, a typical number, need to be placed into an existing phylogenetic tree. Therefore, we deploy multi-grain parallelism to improve the parallel efficiency of this algorithm on 16-core and 32-core architectures. Via this multi-grain approach, we achieve parallel execution time improvements of 25% and super-linear speedups on 16 cores, as well as near-linear speedups and improvements exceeding 50% on 32 cores, on two large real-world microbial datasets. Evolutionary placement of 100,000 reads into a tree with more than 4,000 taxa now requires less than 2 hours of execution time on 32 cores.
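The speedup, efficiency, and improvement figures quoted in the abstract follow the usual definitions, which the short sketch below spells out. The function names are hypothetical and the timings are made-up illustrative values, not measurements from the paper.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S = T_1 / T_p of a parallel run over the serial run."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial: float, t_parallel: float, cores: int) -> float:
    """Parallel efficiency E = S / p; E > 1 means super-linear speedup."""
    return speedup(t_serial, t_parallel) / cores


def relative_improvement(t_old: float, t_new: float) -> float:
    """Fraction of execution time saved by the new version over the old one."""
    return (t_old - t_new) / t_old


if __name__ == "__main__":
    # Illustrative numbers only (assumed, not taken from the paper).
    t1, t16 = 34.0, 2.0  # hours: serial run vs. run on 16 cores
    print("speedup:", speedup(t1, t16))
    print("efficiency:", parallel_efficiency(t1, t16, 16))
    print("super-linear:", speedup(t1, t16) > 16)
    print("improvement:", relative_improvement(4.0, 3.0))
```

Under these definitions, a speedup is super-linear when it exceeds the core count, and an "improvement of 25%" means the new version needs three quarters of the old execution time on the same number of cores.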

  • ABSTRACT: We present an evolutionary placement algorithm (EPA) and a Web server for the rapid assignment of sequence fragments (short reads) to edges of a given phylogenetic tree under the maximum-likelihood model. The accuracy of the algorithm is evaluated on several real-world data sets and compared with placement by pair-wise sequence comparison, using edit distances and BLAST. We introduce a slow and accurate as well as a fast and less accurate placement algorithm. For the slow algorithm, we develop additional heuristic techniques that yield almost the same run times as the fast version with only a small loss of accuracy. When those additional heuristics are employed, the run time of the more accurate algorithm is comparable with that of a simple BLAST search for data sets with a high number of short query sequences. Moreover, the accuracy of the EPA is significantly higher, in particular when the sample of taxa in the reference topology is sparse or inadequate. Our algorithm, which has been integrated into RAxML, therefore provides an equally fast but more accurate alternative to BLAST for tree-based inference of the evolutionary origin and composition of short sequence reads. We are also actively developing a Web server that offers a freely available service for computing read placements on trees using the EPA.
    Systematic Biology 03/2011; 60(3):291-302. · 12.17 Impact Factor
  • ABSTRACT: Plant beneficial microorganisms, such as arbuscular mycorrhiza fungi (AMF), increasingly attract scientific and agronomic attention due to their capacity to increase nutrient accessibility for plants and to reduce inorganic fertilizer requirements. AMF are thought to form symbioses with most land plants, obtaining carbon from the autotrophic host whilst enhancing uptake of poorly available nutrients. The species of AMF are mainly identified by spore morphology, which is time-consuming, requires expertise and is rarely applicable to AMF identification in roots. Molecular tools such as analysis of standardized DNA fragment sequences may allow the recognition of species through a ‘DNA barcode’, which may partly overcome this problem. The focus of this study was to evaluate different regions of widely used rDNA repeats for their use as DNA barcodes for AMF, including the small subunit rRNA gene (SSU), the internal transcribed spacer (ITS) and the large subunit rRNA gene (LSU). Closely related species in the genus Ambispora, members of which have dimorphic spores, could not be separated by analysis of the SSU region, but could be separated by analysis of the ITS region. Consequently, the SSU was not used for subsequent analysis, but a DNA fragment covering a small part of the SSU, the entire ITS region and about 800 bp of the LSU (SSUmCf-LSUmBr fragment) was analysed, providing phylogenetic resolution to species. New AMF specific primers for these potential barcoding regions were developed and can be applied, without amplification of non-target organisms, for AMF species determination, including identification from field and root samples. Analyses based on the application of the SSUmCf-LSUmBr fragment showed that the widely used AMF model organism Glomus sp. DAOM197198 (formerly called Glomus intraradices) is not conspecific with Gl. intraradices. The SSUmCf-LSUmBr fragment clearly provides a much higher species resolution capacity when compared with the formerly preferred ITS and LSU regions. Further study of several groups of AMF species using different regions of the SSUmCf-LSUmBr fragment revealed that only the complete SSUmCf-LSUmBr fragment allowed separation of all analysed species. Based on these results, an extended DNA barcode covering the ITS region and parts of the LSU region is suggested as a DNA barcode for AMF. The complete SSUmCf-LSUmBr fragment sequences can serve as a database backbone for also using smaller rDNA fragments as barcodes. Although the smallest fragment (approximately 400 bp) analysed in this study was not able to discriminate among AMF species completely, such short regions covering the ITS2 or LSU D2 regions, respectively, would most likely be suitable for community analyses with 454 GS-FLX Titanium sequencing, provided that the analysis is based on the longer DNA sequences.
  • ABSTRACT: Selective sweep detection localizes targets of recent and strong positive selection by analyzing single nucleotide polymorphisms (SNPs) in intra-species multiple sequence alignments. Substantial advances in wet-lab sequencing technologies currently allow for generating unprecedented amounts of molecular data. The increasing number of sequences and number of SNPs in such large multiple sequence alignments cause prohibitively long execution times for population genetics data analyses that rely on selective sweep theory. To alleviate this problem, we have recently implemented fine- and coarse-grain parallel versions of our open-source tool OmegaPlus for selective sweep detection that is based on the ω statistic. A performance issue with the coarse-grain parallelization is that individual coarse-grain tasks exhibit significant run-time differences, and hence cause load imbalance. Here, we introduce a significantly improved multi-grain parallelization scheme that outperforms both the fine-grain and the coarse-grain versions of OmegaPlus with respect to parallel efficiency. The multi-grain approach exploits both coarse-grain and fine-grain operations by using available threads/cores that have completed their coarse-grain tasks to accelerate the slowest task by means of fine-grain parallelism. A performance assessment on real-world and simulated datasets showed that the multi-grain version is up to 39% and 64.4% faster than the coarse-grain and the fine-grain versions, respectively, when the same number of threads is used.
    Algorithms and Architectures for Parallel Processing. 01/2012;
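The load-balancing idea in the last abstract (threads that finish their coarse-grain task help the slowest remaining task) can be illustrated with a small timing simulation. This is a minimal sketch under strong assumptions, not the OmegaPlus or RAxML implementation: one thread per task at the start, fine-grain work that scales perfectly with the number of helping threads, no synchronization overhead, and hypothetical function names.

```python
from typing import List

EPS = 1e-12  # tolerance for treating remaining work as zero


def coarse_grain_makespan(work: List[float]) -> float:
    """One thread per task, no helping: the slowest task sets the run time."""
    return max(work)


def multi_grain_makespan(work: List[float]) -> float:
    """Idealized multi-grain schedule: freed threads join the slowest task.

    Assumes a task with k threads progresses k times faster (perfect
    fine-grain scaling), which real codes only approximate.
    """
    remaining = list(work)
    threads = [1] * len(work)  # each task starts with one thread
    elapsed = 0.0
    while True:
        active = [i for i, r in enumerate(remaining) if r > EPS]
        if not active:
            return elapsed
        # Advance time until the next active task completes.
        dt = min(remaining[i] / threads[i] for i in active)
        elapsed += dt
        for i in active:
            remaining[i] -= dt * threads[i]
        finished = [i for i in active if remaining[i] <= EPS]
        still_running = [i for i in active if remaining[i] > EPS]
        if not still_running:
            return elapsed
        # Hand each freed thread group to the task that would finish last.
        for i in finished:
            freed, threads[i] = threads[i], 0
            slowest = max(still_running, key=lambda j: remaining[j] / threads[j])
            threads[slowest] += freed


if __name__ == "__main__":
    work = [1.0, 1.0, 1.0, 5.0]  # made-up task costs with one straggler
    print("coarse-grain:", coarse_grain_makespan(work))
    print("multi-grain :", multi_grain_makespan(work))
```

With one straggler among otherwise equal tasks, the coarse-grain run time is set by the straggler alone, while the idealized multi-grain schedule approaches the total work divided by the thread count. Real gains would be smaller because fine-grain parallelism carries synchronization costs.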
