Fast and accurate read alignment for resequencing.

Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA, Bina Technologies Inc., Redwood City and Department of Statistics, Stanford University, Stanford, CA 94305, USA.
Bioinformatics (Impact Factor: 4.62). 07/2012; 28(18):2366-73. DOI: 10.1093/bioinformatics/bts450
Source: PubMed

ABSTRACT Next-generation sequence analysis has become an important task in both laboratory and clinical settings. A key stage in the majority of sequence analysis workflows, such as resequencing, is the alignment of genomic reads to a reference genome. The accurate alignment of reads with large indels is a computationally challenging task for researchers.
We introduce SeqAlto as a new algorithm for read alignment. For reads longer than or equal to 100 bp, SeqAlto is up to 10× faster than existing algorithms, while retaining high accuracy and the ability to align reads with large (up to 50 bp) indels. This improvement in efficiency is particularly important in the analysis of future sequencing data where the number of reads approaches many billions. Furthermore, SeqAlto uses less than 8 GB of memory to align against the human genome. SeqAlto is benchmarked against several existing tools with both real and simulated data.
Linux and Mac OS X binaries free for academic use are available at

  • ABSTRACT: Cytosine DNA methylation is an epigenetic mark implicated in several biological processes. Bisulfite treatment of DNA is acknowledged as the gold standard technique to study methylation. This technique introduces changes in the genomic DNA by converting cytosines to uracils while 5-methylcytosines remain nonreactive. During PCR amplification, 5-methylcytosines are amplified as cytosines, whereas uracils and thymines are amplified as thymines. To detect the methylation levels, bisulfite-treated reads must be aligned against a reference genome. Mapping these reads to a reference genome represents a significant computational challenge, mainly due to the increased search space and the loss of information introduced by the treatment. To deal with this computational challenge we devised GPU-BSM, a tool based on modern Graphics Processing Units. Graphics Processing Units are hardware accelerators that are increasingly being used successfully to accelerate general-purpose scientific applications. GPU-BSM is a tool able to map bisulfite-treated reads from whole genome bisulfite sequencing and reduced representation bisulfite sequencing, and to estimate methylation levels, with the goal of detecting methylation. Due to the massive parallelization obtained by exploiting graphics cards, GPU-BSM aligns bisulfite-treated reads faster than other cutting-edge solutions, while outperforming most of them in terms of unique mapped reads.
    PLoS ONE 05/2014; 9(5):e97277. · 3.53 Impact Factor
  • ABSTRACT: Fast and robust algorithms and aligners have been developed to help researchers analyze genomic data, whose size has increased dramatically in the last decade due to technological advancements in DNA sequencing. Not only the size but also the characteristics of the data have changed. One current concern is that the length of the reads is increasing. Although existing algorithms can still be used to process this new data, its size and changing structure call for new and more efficient approaches. In this work, we address the problem of accurate sequence alignment on GPUs and propose a new tool, Masher, which processes long (and short) reads efficiently and accurately. The algorithm employs a novel indexing technique that produces an index for the 3,137 Mbp hg19 with a memory footprint small enough to be stored in a restricted-memory device such as a GPU. The results show that Masher is faster than state-of-the-art tools and obtains good accuracy and sensitivity on sequencing data with various characteristics.
    Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics; 09/2013
  • ABSTRACT: Motivation: Although there are many different algorithms and software tools for aligning sequencing reads, fast gapped sequence search is far from solved. Strong interest in fast alignment is best reflected in the $10^6 prize for the Innocentive competition on aligning a collection of reads to a given database of reference genomes. In addition, de novo assembly of next-generation sequencing long reads requires fast overlap-layout-consensus algorithms, which depend on fast and accurate alignment. Contribution: We introduce ARYANA, a fast gapped read aligner built on the BWA indexing infrastructure with a completely new alignment engine that makes it significantly faster than three other aligners (Bowtie2, BWA and SeqAlto) with comparable generality and accuracy. Instead of time-consuming backtracking procedures for handling mismatches, ARYANA uses the seed-and-extend algorithmic framework and gains significantly improved efficiency by integrating novel algorithmic techniques, including dynamic seed selection, bidirectional seed extension, reset-free hash tables, and gap-filling dynamic programming. As the read length increases, ARYANA's superiority in terms of speed and alignment rate becomes more evident. This is in line with the trend toward longer reads as sequencing technologies evolve. The algorithmic platform of ARYANA makes it easy to develop mission-specific aligners for other applications using the ARYANA engine. Availability: ARYANA with complete source code can be obtained from
    BMC Bioinformatics 09/2014; 15(Suppl 9):S12. · 2.67 Impact Factor
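
The bisulfite conversion described in the GPU-BSM abstract above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not GPU-BSM's implementation: it models only the forward strand, assumes an ungapped alignment at a known offset, and the function names `bisulfite_convert` and `methylation_calls` are hypothetical names chosen for this example.

```python
def bisulfite_convert(seq, methylated=frozenset()):
    """Simulate bisulfite treatment followed by PCR on the forward strand.

    Unmethylated C is read out as T; a C whose 0-based index is in
    `methylated` stays C. Other bases are unchanged.
    """
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq.upper())
    )


def methylation_calls(reference, read, offset=0):
    """Call methylation at each reference C covered by an aligned read.

    Assumes an ungapped alignment starting at `offset` (an illustrative
    simplification). Returns {reference position: True if the read shows C
    (methylated), False if it shows T (unmethylated)}.
    """
    calls = {}
    for i, base in enumerate(read):
        pos = offset + i
        if reference[pos] == "C":
            if base == "C":
                calls[pos] = True
            elif base == "T":
                calls[pos] = False
    return calls


reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated={1})
calls = methylation_calls(reference, read)
```

Because a T in the read may match either a T or a C in the reference, a mapper has to search a larger space with less information per base, which is the difficulty the abstract refers to.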

