
BFAST: An Alignment Tool for Large Scale Genome Resequencing

Department of Computer Science, University of California Los Angeles, Los Angeles, CA, USA.
PLoS ONE (Impact Factor: 3.53). 11/2009; 4(11):e7767. DOI: 10.1371/journal.pone.0007767
Source: PubMed

ABSTRACT The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, results in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25-100 base range, in the presence of errors and true biological variation.
We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of the current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired-end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole-genome indexes to rapidly map reads to candidate alignment locations, with an arbitrary number of independent indexes allowed in order to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels.
We compare BFAST to a selection of large-scale alignment tools (BLAT, MAQ, SHRiMP, and SOAP) in terms of both speed and accuracy, using simulated and real-world datasets. We show that BFAST can achieve substantially greater alignment sensitivity in the presence of errors and true variants, especially insertions and deletions, and can minimize false mappings, while maintaining adequate speed compared to other current methods. We show that BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at http://bfast.sourceforge.net.
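The index-then-align idea described in the abstract can be illustrated with a minimal Python sketch. It is not BFAST's implementation: the function names, the two spaced-seed masks, the toy reference, the padding, and the scoring values (match 2, mismatch -1, linear gap -2) are all assumptions chosen for illustration. Each mask builds one independent hash index over the reference, seed hits from any index vote for candidate read positions, and a plain Smith-Waterman score with gaps picks the best candidate.

```python
from collections import defaultdict

def build_index(reference, mask):
    """Index every reference window under one spaced-seed mask ('1' = base used in the key)."""
    keep = [i for i, c in enumerate(mask) if c == "1"]
    table = defaultdict(list)
    for pos in range(len(reference) - len(mask) + 1):
        table["".join(reference[pos + i] for i in keep)].append(pos)
    return len(mask), keep, table

def candidate_locations(read, indexes):
    """Union of read start positions implied by seed hits in any of the indexes."""
    candidates = set()
    for width, keep, table in indexes:
        for offset in range(len(read) - width + 1):
            key = "".join(read[offset + i] for i in keep)
            for pos in table.get(key, ()):
                if pos - offset >= 0:
                    candidates.add(pos - offset)
    return candidates

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a against b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
        best = max(best, max(cur))
        prev = cur
    return best

def align_read(read, reference, indexes, pad=3):
    """Score each candidate window; return (score, location), or None if no seed hits."""
    best = None
    for loc in sorted(candidate_locations(read, indexes)):
        # Pad the window so small indels near the read ends still fit.
        window = reference[max(0, loc - pad): loc + len(read) + pad]
        score = smith_waterman(read, window)
        if best is None or score > best[0]:
            best = (score, loc)
    return best

# Toy data (hypothetical): a 60-base reference and a read with one substitution.
reference = "GATTACAGGCTTAACCGTAGCTAGGATCCAATGCGTTAGCATCGGACTTACGGATCAGTC"
indexes = [build_index(reference, m) for m in ("1110111", "11011011")]
true_read = reference[10:30]
read = true_read[:10] + "A" + true_read[11:]  # reference base at this position is C
print(align_read(read, reference, indexes))
```

Using more than one mask is what gives robustness in this sketch: a mismatch that falls on a sampled position of one mask may fall on a skipped position of another, so the true location can still be recovered as a candidate before the gapped alignment step.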

Available from: Stanley F Nelson, Jul 30, 2014
    • "The location with the highest score is chosen as the alignment location for a short read. For example, BLAST [4] uses a hash table of all fixed length k-mers in the reference to find seeds, and a banded version of the Smith-Waterman algorithm to compute high scoring gapped alignments. RMAP [15] uses a hash table of non-overlapping k-mers of length m/(k + 1) in the reads to find seeds. "
    ABSTRACT: The next generation genome sequencing problem with short (long) reads is an emerging field in numerous scientific and big data research domains. However, data sizes and ease of access for scientific researchers are growing and most current methodologies rely on one acceleration approach and so cannot meet the requirements imposed by explosive data scales and complexities. In this paper, we propose a novel FPGA-based acceleration solution with a MapReduce framework on multiple hardware accelerators. The combination of hardware acceleration and MapReduce execution flow could greatly accelerate the task of aligning short length reads to a known reference genome. To evaluate the performance and other metrics, we conducted a theoretical speedup analysis on a MapReduce programming platform, which demonstrates that our proposed architecture has the potential to improve the speedup for large scale genome sequencing applications. Also, as a practical study, we have built a hardware prototype on a real Xilinx FPGA chip. Significant metrics on speedup, sensitivity, mapping quality, error rate, and hardware cost are evaluated, respectively. Experimental results demonstrate that the proposed platform could efficiently accelerate the next generation sequencing problem with satisfactory accuracy and acceptable hardware cost.
    IEEE/ACM Transactions on Computational Biology and Bioinformatics 01/2015; 12(1):166-178. DOI:10.1109/TCBB.2014.2351800 · 1.54 Impact Factor
    • "According to Fig.2(b), Suffix array interval [5, 5] refers to the position 3 (that is, SA[5, 5] = 3) in the reference genome X. "
    ABSTRACT: The alignment of millions of short DNA fragments to a large genome is a very important aspect of modern computational biology. However, software-based DNA sequence alignment takes many hours to complete. This paper proposes an FPGA-based hardware accelerator to reduce the alignment time. We apply a data encoding scheme that reduces the data size by 96%, and propose a pipelined hardware decoder to decode the data. We also design customized data paths to efficiently use the limited bandwidth of the DDR3 memories. The proposed accelerator can align a few hundred million short DNA fragments in an hour by using 80 processing elements in parallel. The proposed accelerator has the same mapping quality as the software-based methods.
    IEEE Transactions on Parallel and Distributed Systems 01/2015; DOI:10.1109/TPDS.2015.2444376 · 2.17 Impact Factor
    • "To detect similarities in bio-sequences, in the so-called hit and extend strategy framework, spaced seeds are now a frequently used technique to define the hit (Keich et al., 2004). Several tools have been proposed that use spaced seeds (Chen et al., 2009, David et al., 2011, Harris, 2007, Homer et al., 2009, Ilie et al., 2013, Kiełbasa et al., 2011, Li et al., 2004, Lin et al., 2008, Zhou et al., 2010), or to design spaced seeds (Buhler et al., 2005, Do Duc et al., 2012, Ilie et al., 2011, Kucherov et al., 2006, Marschall et al., 2012, Nuel, 2011). "
    ABSTRACT: Spaced seeds have been recently shown to not only detect more alignments, but also to give a more accurate measure of phylogenetic distances, and to provide a lower misclassification rate when used with Support Vector Machines (SVMs). We confirm by independent experiments these two results, and propose in this article to use a coverage criterion to measure the seed efficiency in both cases in order to design better seed patterns. We show first how this coverage criterion can be directly measured by a full automaton-based approach. We then illustrate how this criterion performs when compared with two other criteria frequently used, namely the single-hit and multiple-hit criteria, through correlation coefficients with the correct classification/the true distance. At the end, for alignment-free distances, we propose an extension by adopting the coverage criterion, show how it performs, and indicate how it can be efficiently computed.
    Journal of computational biology: a journal of computational molecular cell biology 11/2014; 21(12). DOI:10.1089/cmb.2014.0173 · 1.67 Impact Factor