Conference Paper

Evolutionary placement of short sequence reads on multi-core architectures

DOI: 10.1109/AICCSA.2010.5586973
Conference: The 8th ACS/IEEE International Conference on Computer Systems and Applications, AICCSA 2010, Hammamet, Tunisia, May 16-19, 2010
Source: DBLP


The application of high performance computing methods in bioinformatics is becoming increasingly important because of the masses of data generated by novel short-read DNA sequencers. One important application of such short reads is the analysis of microbial communities, where the anonymous short reads need to be identified by sequence comparison to a set of reference sequences. This identification is required to analyze the microbial composition and biological diversity of the sample. We briefly introduce a new algorithm for evolutionary (phylogenetic) placement of short reads under the Maximum Likelihood criterion and implement it in RAxML. While this algorithm is significantly more accurate than plain pair-wise sequence comparison, it can become highly compute-intensive when 100,000 or more reads, a typical number, need to be placed into an existing phylogenetic tree. Therefore, we deploy multi-grain parallelism to improve the parallel efficiency of this algorithm on 16-core and 32-core architectures. Via this multi-grain approach, we achieve parallel execution time improvements of 25% and super-linear speedups on 16 cores, as well as near-linear speedups and improvements exceeding 50% on 32 cores, on two large real-world microbial datasets. Evolutionary placement of 100,000 reads into a tree with more than 4,000 taxa now requires less than 2 hours of execution time on 32 cores.
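The speedup and efficiency figures quoted in the abstract follow the usual definitions, which can be sketched in Python as below. This is an illustrative sketch only: the function names and the timings in the usage lines are hypothetical and are not taken from the paper, and reading "improvement" as the relative reduction in execution time is an assumption.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S(p) = T_1 / T_p, with T_1 the one-core run time."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, cores: int) -> float:
    """Parallel efficiency E(p) = S(p) / p; 1.0 corresponds to linear speedup."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return speedup(t_serial, t_parallel) / cores


def is_superlinear(t_serial: float, t_parallel: float, cores: int) -> bool:
    """A speedup is super-linear when it exceeds the number of cores."""
    return speedup(t_serial, t_parallel) > cores


def relative_improvement(t_before: float, t_after: float) -> float:
    """Fraction of execution time saved (assumed meaning of 'improvement')."""
    if t_before <= 0:
        raise ValueError("baseline run time must be positive")
    return (t_before - t_after) / t_before


# Hypothetical timings in minutes, chosen only to illustrate the definitions.
t1, t16 = 160.0, 8.0
print(speedup(t1, t16))               # speedup on 16 cores
print(efficiency(t1, t16, 16))        # efficiency above 1.0 means super-linear
print(is_superlinear(t1, t16, 16))
print(relative_improvement(10.0, 7.5))
```

An efficiency above 1.0 is what "super-linear" means in this sense; on real hardware it is commonly attributed to cache effects, although the abstract itself does not state the cause.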


Available from: Alexandros Stamatakis, Feb 16, 2014
  • Source
    • "The EPA also is relatively straight-forward to parallelize by applying a multi-grain parallelization technique (Stamatakis et al., 2010). On a multi-core system with 32 cores and 64GB of main memory, we were able to place 100,627 QS in parallel into a RT with 4,874 taxa within 1.5 hours. "
    ABSTRACT: We present an evolutionary placement algorithm (EPA) and a Web server for the rapid assignment of sequence fragments (short reads) to edges of a given phylogenetic tree under the maximum-likelihood model. The accuracy of the algorithm is evaluated on several real-world data sets and compared with placement by pair-wise sequence comparison, using edit distances and BLAST. We introduce a slow and accurate as well as a fast and less accurate placement algorithm. For the slow algorithm, we develop additional heuristic techniques that yield almost the same run times as the fast version with only a small loss of accuracy. When those additional heuristics are employed, the run time of the more accurate algorithm is comparable with that of a simple BLAST search for data sets with a high number of short query sequences. Moreover, the accuracy of the EPA is significantly higher, in particular when the sample of taxa in the reference topology is sparse or inadequate. Our algorithm, which has been integrated into RAxML, therefore provides an equally fast but more accurate alternative to BLAST for tree-based inference of the evolutionary origin and composition of short sequence reads. We are also actively developing a Web server that offers a freely available service for computing read placements on trees using the EPA.
    Systematic Biology 03/2011; 60(3):291-302. DOI:10.1093/sysbio/syr010 · 14.39 Impact Factor
  • Source
    ABSTRACT: Currently, no official DNA barcode region is defined for the Fungi. The COX1 gene DNA barcode is difficult to apply. The internal transcribed spacer (ITS) region has been suggested as a primary barcode candidate, but for arbuscular mycorrhizal fungi (AMF; Glomeromycota) the region is exceptionally variable and does not resolve closely related species. DNA barcoding analyses were performed with datasets from several phylogenetic lineages of the Glomeromycota. We tested a c. 1500 bp fragment spanning small subunit (SSU), ITS region, and large subunit (LSU) nuclear ribosomal DNA for species resolving power. Subfragments covering the complete ITS region, c. 800 bp of the LSU rDNA, and three c. 400 bp fragments spanning the ITS2, the LSU-D1 or LSU-D2 domains were also analysed. Barcode gap analyses did not resolve all species, but neighbour joining analyses, using Kimura two-parameter (K2P) distances, resolved all species when based on the 1500 bp fragment. The shorter fragments failed to separate closely related species. We recommend the complete 1500 bp fragment as a basis for AMF DNA barcoding. This will also allow future identification of AMF at species level based on 400 or 1000 bp amplicons in deep sequencing approaches.
    New Phytologist 04/2010; 187(2):461-74. DOI:10.1111/j.1469-8137.2010.03262.x · 7.67 Impact Factor
  • Source
    ABSTRACT: Likelihood-based phylogenetic inference is generally considered to be the most reliable classification method for unknown sequences. However, traditional likelihood-based phylogenetic methods cannot be applied to large volumes of short reads from next-generation sequencing due to computational complexity issues and lack of phylogenetic signal. "Phylogenetic placement," where a reference tree is fixed and the unknown query sequences are placed onto the tree via a reference alignment, is a way to bring the inferential power offered by likelihood-based approaches to large data sets. This paper introduces pplacer, a software package for phylogenetic placement and subsequent visualization. The algorithm can place twenty thousand short reads on a reference tree of one thousand taxa per hour per processor, has essentially linear time and memory complexity in the number of reference taxa, and is easy to run in parallel. Pplacer features calculation of the posterior probability of a placement on an edge, which is a statistically rigorous way of quantifying uncertainty on an edge-by-edge basis. It also can inform the user of the positional uncertainty for query sequences by calculating expected distance between placement locations, which is crucial in the estimation of uncertainty with a well-sampled reference tree. The software provides visualizations using branch thickness and color to represent number of placements and their uncertainty. A simulation study using reads generated from 631 COG alignments shows a high level of accuracy for phylogenetic placement over a wide range of alignment diversity, and the power of edge uncertainty estimates to measure placement confidence. Pplacer enables efficient phylogenetic placement and subsequent visualization, making likelihood-based phylogenetics methodology practical for large collections of reads; it is freely available as source code, binaries, and a web service.
    BMC Bioinformatics 10/2010; 11(1):538. DOI:10.1186/1471-2105-11-538 · 2.58 Impact Factor
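The edge-wise uncertainty measures described in the pplacer abstract can be illustrated with a small Python sketch. It assumes that per-edge log scores for one query sequence are already available, normalizes them into weights that sum to one, and computes the weighted average pairwise distance between candidate placements as a stand-in for the "expected distance between placement locations". All names and numbers are hypothetical; this is not pplacer's code or API, and pplacer's actual posterior probability calculation is not reproduced here.

```python
import math
from typing import List, Sequence


def normalize_log_weights(log_scores: Sequence[float]) -> List[float]:
    """Convert per-edge log scores into weights summing to 1.

    Subtracting the maximum before exponentiating (log-sum-exp trick)
    avoids underflow for very negative log-likelihoods.
    """
    if not log_scores:
        raise ValueError("need at least one candidate edge")
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]


def expected_distance(weights: Sequence[float],
                      dist: Sequence[Sequence[float]]) -> float:
    """Weighted average distance between candidate placements:
    sum over all pairs (i, j) of w_i * w_j * d(i, j)."""
    n = len(weights)
    return sum(weights[i] * weights[j] * dist[i][j]
               for i in range(n) for j in range(n))


# Hypothetical query with two equally good candidate edges that lie
# 2.0 branch-length units apart on the reference tree.
log_scores = [-10.0, -10.0]
tree_distance = [[0.0, 2.0],
                 [2.0, 0.0]]

weights = normalize_log_weights(log_scores)
print(weights)
print(expected_distance(weights, tree_distance))
```

A query with a single dominant placement yields a value near zero, while a query split between distant edges yields a large value, which is the kind of positional uncertainty the abstract refers to.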