SEPP: SATe-enabled phylogenetic placement

Department of Computer Science University of Texas at Austin Austin, TX 78712, USA.
Pacific Symposium on Biocomputing 05/2012; DOI: 10.1142/9789814366496_0024
Source: PubMed

ABSTRACT We address the problem of Phylogenetic Placement, in which the objective is to insert short molecular sequences (called query sequences) into an existing phylogenetic tree and alignment on full-length sequences for the same gene. Phylogenetic placement has the potential to provide information beyond pure “species identification” (i.e., the association of metagenomic reads to existing species), because it can also give information about the evolutionary relationships between these query sequences and known species. Approaches for phylogenetic placement have been developed that operate in two steps: first, an alignment is estimated for each query sequence to the alignment of the full-length sequences, and then that alignment is used to find the optimal location in the phylogenetic tree for the query sequence. Recent methods of this type include HMMALIGN+EPA, HMMALIGN+pplacer, and PaPaRa+EPA. We report on a study evaluating phylogenetic placement methods on biological and simulated data. This study shows that these methods have extremely good accuracy and computational tractability under conditions where the input contains a highly accurate alignment and tree for the full-length sequences, and the set of full-length sequences is sufficiently small and not too evolutionarily diverse; however, we also show that under other conditions accuracy declines and the computational requirements for memory and time exceed acceptable limits. We present SEPP, a general

    • "This approach is particularly useful in a case where there are no close relatives to the query sequence in databases, due primarily to the low DB coverage, which is the case with microbial eukaryotes. Tools using phylogenetic methods in taxonomic assignment include the evolutionary placement algorithm (EPA) [49], pplacer [50], and SEPP [51], and these algorithms were recently incorporated into QIIME [28] and AMPHORA2 [52]. "
    ABSTRACT: Metagenomics has become one of the indispensable tools in microbial ecology for the last few decades, and a new revolution in metagenomic studies is now about to begin, with the help of recent advances of sequencing techniques. The massive data production and substantial cost reduction in next-generation sequencing have led to the rapid growth of metagenomic research both quantitatively and qualitatively. It is evident that metagenomics will be a standard tool for studying the diversity and function of microbes in the near future, as fingerprinting methods did previously. As the speed of data accumulation is accelerating, bioinformatic tools and associated databases for handling those datasets have become more urgent and necessary. To facilitate the bioinformatics analysis of metagenomic data, we review some recent tools and databases that are used widely in this field and give insights into the current challenges and future of metagenomics from a bioinformatics perspective.
    09/2013; 11(3):102-113. DOI:10.5808/GI.2013.11.3.102
    • "To test their performance, we used the same datasets as those used in the PAGAN article (Löytynoja et al., 2012). These are two independently simulated datasets, prepared by Löytynoja et al. (2012) and Mirarab et al. (2012). We included two representative methods, PaPaRa version 2.0 and PAGAN version 0.38, in the comparison because they can be used in situations similar to ours."
    ABSTRACT: Two methods to add unaligned sequences into an existing multiple sequence alignment have been implemented as the ‘--add’ and ‘--addfragments’ options in the MAFFT package. The former option is a basic one and applicable only to full-length sequences, whereas the latter option is applicable even when the unaligned sequences are short and fragmentary. These methods internally infer the phylogenetic relationship among the sequences in the existing alignment and the phylogenetic positions of unaligned sequences. Benchmarks based on two independent simulations consistently suggest that the ‘--addfragments’ option outperforms recent methods, PaPaRa and PAGAN, in accuracy for difficult problems and that these three methods appropriately handle easy problems. Availability: Supplementary data are available at Bioinformatics online.
    Bioinformatics 09/2012; 28(23). DOI:10.1093/bioinformatics/bts578 · 4.98 Impact Factor
    • "We did not use other datasets from the original study because the hybrid CPU-GPU algorithm targets larger datasets. The second dataset (16S.B.ALL) is from a recent study comparing PaPaRa to a newly developed algorithm [19]. The unoptimized, proof-of-concept implementation of PaPaRa performs better than competing alignment approaches on the latter real-world dataset at the cost of substantially higher runtimes. "
    ABSTRACT: Background: Aligning short DNA reads to a reference sequence alignment is a prerequisite for detecting their biological origin and analyzing them in a phylogenetic context. With the PaPaRa tool we introduced a dedicated dynamic programming algorithm for simultaneously aligning short reads to reference alignments and corresponding evolutionary reference trees. The algorithm aligns short reads to phylogenetic profiles that correspond to the branches of such a reference tree. The algorithm needs to perform an immense number of pairwise alignments. Therefore, we explore vector intrinsics and GPUs to accelerate the PaPaRa alignment kernel. Results: We optimized and parallelized PaPaRa on CPUs and GPUs. Via SSE 4.1 SIMD (Single Instruction, Multiple Data) intrinsics for x86 SIMD architectures and multi-threading, we obtained a 9-fold acceleration on a single core as well as linear speedups with respect to the number of cores. The peak CPU performance amounts to 18.1 GCUPS (Giga Cell Updates per Second) using all four physical cores on an Intel i7 2600 CPU running at 3.4 GHz. The average CPU performance (averaged over all test runs) is 12.33 GCUPS. We also used OpenCL to execute PaPaRa on a GPU SIMT (Single Instruction, Multiple Threads) architecture. A NVIDIA GeForce 560 GPU delivered peak and average performance of 22.1 and 18.4 GCUPS respectively. Finally, we combined the SIMD and SIMT implementations into a hybrid CPU-GPU system that achieved an accumulated peak performance of 33.8 GCUPS. Conclusions: This accelerated version of PaPaRa (available at provides a significant performance improvement that allows for analyzing larger datasets in less time. We observe that state-of-the-art SIMD and SIMT architectures deliver comparable performance for this dynamic programming kernel when the “competing programmer approach” is deployed. Finally, we show that overall performance can be substantially increased by designing a hybrid CPU-GPU system with appropriate load distribution mechanisms.
    BMC Bioinformatics 08/2012; 13(1):196. DOI:10.1186/1471-2105-13-196 · 2.58 Impact Factor
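
The GCUPS figures quoted in the abstract above count dynamic programming matrix cells filled per second, in billions. As a rough sketch of how such a number is derived, the snippet below computes GCUPS for a batch of pairwise alignments and compares the reported hybrid peak with the sum of the reported CPU and GPU peaks. The function name `gcups` and the workload sizes are hypothetical and chosen only for illustration; only the three peak values (18.1, 22.1 and 33.8 GCUPS) come from the abstract.

```python
def gcups(query_len, profile_len, num_alignments, seconds):
    """Giga cell updates per second for a batch of pairwise DP alignments.

    Assumes every alignment fills a full query_len x profile_len matrix.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    cells = query_len * profile_len * num_alignments  # total DP cells filled
    return cells / seconds / 1e9


# Hypothetical workload: the sizes and runtime are illustrative, not from the paper.
rate = gcups(query_len=100, profile_len=1500, num_alignments=2_000_000, seconds=30.0)
print(f"{rate:.1f} GCUPS")

# Peak values reported in the abstract (GCUPS).
cpu_peak, gpu_peak, hybrid_peak = 18.1, 22.1, 33.8

# Fraction of the summed CPU and GPU peaks that the hybrid system reached.
hybrid_fraction = hybrid_peak / (cpu_peak + gpu_peak)
print(f"hybrid reaches {hybrid_fraction:.0%} of the summed peaks")
```

On these reported peaks the hybrid system reaches roughly 84% of the simple sum of the CPU and GPU figures, which is consistent with the abstract's remark that load distribution between the two devices matters.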