Conference Paper

Optimization Strategies for Smith-Waterman Algorithm on FPGA Platform

... Different techniques can be used to optimize sequence alignment on FPGAs. Dynamic programming [4] is one such solution that breaks down the complex problem into smaller sub-problems. The initialization process involves converting the input sequences into elements that can be processed, and uses Ethernet for communication. ...
Chapter
Sequence alignment is a problem in bioinformatics that involves arranging sequences of proteins, RNA or DNA so that similar regions between two or more sequences may be determined. The Smith-Waterman algorithm is a key algorithm for aligning sequences. This paper uses the OpenMP application programming interface along with Single-Instruction Multiple-Data (SIMD) instructions. Advanced Vector Extensions 2 (AVX2) is used to implement the SIMD paradigm. The implementation utilizes both fine-level and coarse-level parallelism to improve resource utilization without requiring support from multiple nodes in a distributed memory system. The algorithm shows a multifold decrease in execution time in comparison to a sequentially executed implementation.
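For reference, the scoring recurrence that such OpenMP/AVX2 implementations parallelize can be sketched sequentially as follows. This is a minimal illustration, not the paper's code: the function name `sw_score` and the scoring values (match, mismatch, and a linear gap penalty) are assumptions chosen for the example.

```python
def sw_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best Smith-Waterman local alignment score of a and b.

    Scoring values are illustrative assumptions; a linear gap penalty is used.
    Only two rows of the score matrix are kept in memory.
    """
    prev = [0] * (len(b) + 1)  # row i-1 of the score matrix
    best = 0
    for ca in a:
        cur = [0] * (len(b) + 1)  # row i; column 0 stays 0
        for j, cb in enumerate(b, start=1):
            s = match if ca == cb else mismatch
            cur[j] = max(
                0,                # start a new local alignment
                prev[j - 1] + s,  # extend along the diagonal
                prev[j] + gap,    # gap in b
                cur[j - 1] + gap, # gap in a
            )
            best = max(best, cur[j])
        prev = cur
    return best


print(sw_score("ACGT", "ACT"))
```

Each cell depends only on its left, upper, and upper-left neighbours, which is the property that SIMD and multi-threaded variants exploit.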
Article
An optimized software and hardware digital implementation of two widely used DNA sequence alignment algorithms based on lookup tables (LUTs) is illustrated in this study. These algorithms are the best means for identifying similar regions between sequences. The proposed implementation relies on the complete parallelization of these foundational algorithms under certain limitations to overcome most of the problems of dynamic programming and hardware implementation. The proposed method takes O(N/4) calculation steps, where N is the length of each sequence with a minimum value of four (i.e., N = 4, 8, 12, …). A performance comparison between the state of the art and our proposed algorithm is conducted for software and hardware implementations. Combinational circuits are used for the FPGA-based hardware implementation of the DNA sequence alignment algorithms. Performance and device resource usage are evaluated for different hardware designs. A customized convolutional neural network model is used to implement global alignment and achieves 98.3% accuracy.
Article
Although dynamic programming (DP) is an optimization approach for solving complex problems quickly, its running time still grows polynomially with the size of the input. In this contribution, we improve the computation time of dynamic programming based algorithms by proposing a novel technique called "SDP: Segmented Dynamic Programming". SDP finds the best way of splitting the compared sequences into segments and then applies the dynamic programming algorithm to each segment individually, which reduces the computation time dramatically. SDP may be applied to any dynamic programming based algorithm to improve its computation time. As case studies, we apply the SDP technique to two different dynamic programming based algorithms: "Needleman-Wunsch (NW)", the widely used algorithm for optimal sequence alignment, and the LCS algorithm, which finds the "Longest Common Subsequence" between two input strings. The results show that applying the SDP technique in conjunction with the DP based algorithms improves the computation time by up to 80% in comparison to the DP algorithms alone, with small or negligible degradation in the quality of the results. This degradation is controllable through the number of split segments, which is an input parameter. We also compare our results with the well-known heuristic FASTA sequence alignment algorithm "GGSEARCH" and show that our results are much closer to the optimal results than those of "GGSEARCH". The results hold independently of the sequence lengths and their level of similarity. To show the functionality of our technique in hardware and to verify the results, we implement it on the Xilinx Zynq-7000 FPGA.
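The general idea of running DP per segment can be illustrated on the LCS case study. The sketch below is only a rough approximation under a loud assumption: it cuts both strings into `k` equal-length segments, whereas the abstract states that SDP searches for the best split, and that selection step is not described here. The names `lcs_length`, `cut_points`, and `segmented_lcs_length` are hypothetical.

```python
def lcs_length(a, b):
    """Classic DP for the length of the longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def cut_points(length, k):
    """Boundaries of k near-equal segments (assumed split rule, not SDP's)."""
    return [length * t // k for t in range(k + 1)]


def segmented_lcs_length(a, b, k):
    """Sum of per-segment LCS lengths; a lower bound on the true LCS length."""
    pa, pb = cut_points(len(a), k), cut_points(len(b), k)
    return sum(
        lcs_length(a[pa[t]:pa[t + 1]], b[pb[t]:pb[t + 1]])
        for t in range(k)
    )


full = lcs_length("ABCBDAB", "BDCABA")
approx = segmented_lcs_length("ABCBDAB", "BDCABA", 2)
print(full, approx)
```

With equal splits, the number of DP cells drops from |a|·|b| to roughly |a|·|b|/k, while the result can only be equal to or smaller than the exact LCS length, which mirrors the speed versus accuracy trade-off the abstract describes.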
Article
Dynamic Programming (DP) is used to solve combinatorial optimization problems and constitutes one of the 13 High Performance Computing (HPC) patterns. DP suffers from irregular, data-dependent memory accesses that deteriorate performance. The 0/1 Knapsack problem belongs to the simplest class of DP algorithms, called Serial Monadic, and has been treated in software with cache-efficient algorithms as well as parallel threads, OpenMP or MPI. In this paper we propose a shared-memory, parametrizable architecture to compute the DP matrix for the 0/1 Knapsack. Our system has a parallel runtime of Θ(mC/q) for a knapsack of capacity C with m items and q operators. Using additional off-chip space and DMA transfers, it can solve knapsacks of any size. The architecture is implemented on the ZYNQ-7020 System on Chip (SoC), which contains a dual-core ARM plus Artix FPGA fabric. Under this architecture we make use of 64-bit High Performance ports for off-chip transfers and asymmetric 32-bit write/64-bit read BRAMs to minimize data exchange times. We also exploit computation synchronization to minimize BRAM address propagation and reduce routing congestion. We present results for a base system with 70 Processing Elements (PEs) capable of solving problems with a maximum item weight ωmax=1024. For more complex instances we configure the architecture with 58 PEs and ωmax=6144, where a single BRAM is shared among 13 computing units. We thus solve problems with 6× larger weights than previous works, attain a 16× speed-up versus optimized software on an Intel Xeon E5, and get the highest efficiency per core versus other architectures. We achieve between 2.4× and 3.3× acceleration versus previous FPGA solutions.
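For readers unfamiliar with the underlying recurrence, a minimal sequential sketch of the 0/1 Knapsack DP is given below. It runs in Θ(mC) time for m items and capacity C; the hardware design in the abstract divides this work among q operators. The function name `knapsack_01` and the sample data are illustrative assumptions.

```python
def knapsack_01(weights, values, capacity):
    """Return the maximum total value that fits in the given capacity.

    best[c] holds the best value achievable with capacity c using the
    items processed so far. Iterating c downward ensures each item is
    used at most once.
    """
    best = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]


print(knapsack_01([1, 3, 4, 5], [1, 4, 5, 7], 7))
```

In the equivalent two-row formulation, every capacity entry for one item depends only on the previous item's row, so the entries of a row can be computed independently of one another.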
Article
The emergence of bioinformatics has led to many new discoveries in living organisms. These discoveries would not have been possible without the developments made in sequence alignment techniques. Many sequence alignment algorithms were developed to make the alignment process fast and accurate. However, the more precise algorithms take longer than their less precise counterparts. Researchers came up with innovative approaches to overcome this time constraint. Their aim was to speed up the computation through more efficient implementations of the algorithms on state-of-the-art hardware platforms. The Smith-Waterman (SW) algorithm, being the most accurate in the alignment process, has been implemented on various high-performance computing platforms for the same purpose. However, the intrinsic structure of the algorithm has received little attention. In this paper, we present a novel structure of the SW algorithm that takes fewer cycles than its existing form at the cost of a minimal amount of extra hardware resources. The newly proposed architecture achieves up to a 25% performance gain.
Article
The most pervasive compute operation carried out in almost all bioinformatics applications is pairwise sequence homology detection (or sequence alignment). Due to exponentially growing sequence databases, computing this operation at a large-scale is becoming expensive. An effective approach to speed up this operation is to integrate a very high number of processing elements in a single chip so that the massive scales of fine-grain parallelism inherent in several bioinformatics applications can be exploited efficiently. Network-on-chip (NoC) is a very efficient method to achieve such large-scale integration. In this work, we propose to bridge the gap between data generation and processing in bioinformatics applications by designing NoC architectures for the sequence alignment operation. Specifically, we 1) propose optimized NoC architectures for different sequence alignment algorithms that were originally designed for distributed memory parallel computers and 2) provide a thorough comparative evaluation of their respective performance and energy dissipation. While accelerators using other hardware architectures such as FPGA, general purpose graphics processing unit (GPU), and the cell broadband engine (CBE) have been previously designed for sequence alignment, the NoC paradigm enables integration of a much larger number of processing elements on a single chip and also offers a higher degree of flexibility in placing them along the die to suit the underlying algorithm. The results show that our NoC-based implementations can provide above 10^2-10^3-fold speedup over other hardware accelerators and above 10^4-fold speedup over traditional CPU architectures. This is significant because it will drastically reduce the time required to perform the millions of alignment operations that are typical in large-scale bioinformatics projects. To the best of our knowledge, this work embodies the first attempt to accelerate a bioinformatics application using NoC.
Article
We develop novel single-GPU parallelizations of the Smith-Waterman algorithm for pairwise sequence alignment. Our algorithms, which are suitable for the alignment of a single pair of very long sequences, can be used to determine the alignment score as well as the actual alignment. Experimental results demonstrate an order of magnitude reduction in run time relative to competing GPU algorithms.
Article
This paper explores the pros and cons of reconfigurable computing in the form of FPGAs for high performance efficient computing. In particular, the paper presents the results of a comparative study between three different acceleration technologies, namely, Field Programmable Gate Arrays (FPGAs), Graphics Processor Units (GPUs), and IBM’s Cell Broadband Engine (Cell BE), in the design and implementation of the widely-used Smith-Waterman pairwise sequence alignment algorithm, with general purpose processors as a base reference implementation. Comparison criteria include speed, energy consumption, and purchase and development costs. The study shows that FPGAs largely outperform all other implementation platforms on performance per watt criterion and perform better than all other platforms on performance per dollar criterion, although by a much smaller margin. Cell BE and GPU come second and third, respectively, on both performance per watt and performance per dollar criteria. In general, in order to outperform other technologies on performance per dollar criterion (using currently available hardware and development tools), FPGAs need to achieve at least two orders of magnitude speed-up compared to general-purpose processors and one order of magnitude speed-up compared to domain-specific technologies such as GPUs.
Article
Genomics research requires analysis of large sequence databases and is heavily dependent upon bioinformatics computation. Several complementary computing strategies are available to perform the analysis in reasonable time. The most common hardware solutions are CPU clusters and dedicated bioinformatics hardware accelerators. Here, we present an overview of how these systems can be used in practice to accelerate common bioinformatics sequence analysis algorithms and techniques.
Conference Paper
This paper analyses two methods of organizing parallelism for the Smith-Waterman algorithm, and shows how they perform relative to peak performance when the amount of parallelism varies. A novel systolic design is introduced, with a processing element optimized for computing the affine gap cost function. Our FPGA design is significantly more energy-efficient than GPU designs. For example, our design for the XC5VLX330T FPGA achieves around 16 GCUPS/W, while CPUs and GPUs have a power efficiency of lower than 0.5 GCUPS/W.
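The affine gap cost function mentioned above charges more for opening a gap than for extending one. A minimal software sketch of the corresponding three-matrix recurrence (commonly attributed to Gotoh) follows; it is not the paper's processing element, and the function name `sw_affine_score` and all penalty values are assumptions for illustration. Here a gap of length k costs `gap_open + (k - 1) * gap_extend`.

```python
NEG_INF = float("-inf")


def sw_affine_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score with affine gap penalties.

    H: best score ending at (i, j); E: best score ending with a gap in a;
    F: best score ending with a gap in b. Penalty values are assumptions.
    """
    n = len(b)
    h_prev = [0] * (n + 1)
    f_prev = [NEG_INF] * (n + 1)
    best = 0
    for ca in a:
        h_cur = [0] * (n + 1)
        f_cur = [NEG_INF] * (n + 1)
        e = NEG_INF  # E value for the current row, carried left to right
        for j in range(1, n + 1):
            # Open a new gap from H, or extend the existing one.
            e = max(h_cur[j - 1] + gap_open, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            s = match if ca == b[j - 1] else mismatch
            h_cur[j] = max(0, h_prev[j - 1] + s, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


print(sw_affine_score("AAAATTTT", "AAAACCTTTT"))
```

In the sample call, skipping the two inserted characters as a single gap costs 4 (3 to open, 1 to extend) rather than 6 under a linear penalty of 3 per position, so the full eight matches remain the best alignment.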
Conference Paper
Sequence alignment and its many variants are a fundamental tool in computational biology. There is considerable recent interest in using the Cell Broadband Engine, a heterogeneous multi-core chip that provides high performance, for biological applications. However, work so far has been limited to computing optimal alignment scores using quadratic space under the basic global/local alignment algorithm. In this paper, we present a comprehensive study of developing sequence alignment algorithms on the Cell, exploiting its thread-level and data-level parallelism features. First, we develop a Cell implementation that computes optimal alignments and adopts Hirschberg's linear space technique. The former is essential, as merely computing optimal alignment scores is not useful, while the latter is needed to permit alignments of longer sequences. We then present Cell implementations of two advanced alignment techniques: spliced alignments and syntenic alignments. In a spliced alignment, consecutive non-overlapping portions of a sequence align with ordered, non-overlapping, but non-consecutive portions of another sequence. Spliced alignments are useful in aligning mRNA sequences with corresponding genomic sequences to uncover gene structure. Syntenic alignments are used to discover conserved exons and other sequences between long genomic sequences from different organisms. We present experimental results for these three types of alignments on the Cell BE and report speedups of about 4 on six SPUs on the PlayStation 3, when compared to the respective best serial algorithms on the Cell BE and the Pentium 4 processor.
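Hirschberg's technique relies on the fact that the last row of the global alignment score matrix can be computed while keeping only two rows in memory. The sketch below shows just that linear-space scoring pass, not the full divide-and-conquer recovery of the alignment and not the Cell implementation; the function name `nw_score_linear_space` and the scoring values are assumptions.

```python
def nw_score_linear_space(a, b, match=2, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment score using O(len(b)) memory.

    Only the previous and current rows are stored. Scoring values are
    illustrative assumptions with a linear gap penalty.
    """
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: all gaps
    for i, ca in enumerate(a, start=1):
        cur = [i * gap] + [0] * len(b)           # column 0: all gaps
        for j, cb in enumerate(b, start=1):
            s = match if ca == cb else mismatch
            cur[j] = max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[-1]


print(nw_score_linear_space("ACGT", "ACT"))
```

The full technique runs this pass forward on one half of a sequence and backward on the other half to find a split point, then recurses on the two sub-problems.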
Conference Paper
The Smith-Waterman (SW) algorithm is the only optimal local sequence alignment algorithm. There are many SW implementations on FPGAs, which show speedups of up to 100x compared to a general-purpose processor (GPP). In this paper, we propose a design of the SW traceback that is performed in parallel with the matrix fill stage and that gives the optimal alignment after a single scan through the whole database. In addition, we propose the hardware design for the RVEP SW FPGA implementation, which demonstrates that this solution can be realized with off-the-shelf FPGA boards.
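To make the two stages concrete, the following sketch performs the conventional sequential version: fill the whole matrix first, then trace back from the highest-scoring cell. It does not reproduce the parallel traceback design proposed in the paper; the function name `sw_align` and the scoring values are assumptions.

```python
def sw_align(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Matrix fill stage.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback stage: walk back until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(sw_align("ACGT", "ACT"))
```

This version stores the full matrix, so its memory use grows with the product of the sequence lengths, which is one reason hardware designs treat traceback separately from the fill stage.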
Article
Biological sequence alignment is an essential tool used in molecular biology and biomedical applications. The growing volume of genetic data and the complexity of sequence alignment present a challenge in obtaining alignment results in a timely manner. Known methods to accelerate alignment on reconfigurable hardware only address sequence comparison, limit the sequence length, or exhibit memory and I/O bottlenecks. A space-efficient, global sequence alignment algorithm and architecture is presented that accelerates the forward scan and traceback in hardware without memory and I/O limitations. With 256 processing elements in FPGA technology, a performance gain over 300 times that of a desktop computer is demonstrated on sequence lengths of 16000. For greater performance, the architecture is scalable to more processing elements.
Conference Paper
Sequence alignment is one of the most important activities in bioinformatics. With the ever increasing volume of data in bioinformatics databases, the time for comparing a query sequence with the available databases is always increasing. Many algorithms have been proposed to perform and accelerate sequence alignment activities. This paper introduces a taxonomy of the various sequence alignment algorithms found in the literature, with particular emphasis on the Smith-Waterman (S-W) algorithm. The paper also provides a classification of the available hardware acceleration methods used to speed up the S-W algorithm.
Conference Paper
Presents the Smith and Waterman algorithm-specific ASIC design (SWASAD) project. This is a hardware solution that implements the S and W algorithm. The SWASAD is an improved implementation of the biological information signal processor (BISP) design. The SWASAD chip, fabricated on a 0.5 μm process, achieves 3200 million matrix cells per second (MCPS) per chip, with a layout size of 7.1 mm by 7.1 mm. This is a large improvement over existing designs and improves data throughput by using a smaller data width.
Article
Eight synergistic processor units enable the Cell Broadband Engine's breakthrough performance. The SPU architecture implements a novel, pervasively data-parallel architecture combining scalar and SIMD processing on a wide data path. A large number of SPUs per chip provide high thread-level parallelism. The streamlined architecture provides an efficient multithreaded execution environment for both scalar and SIMD threads and represents a reaffirmation of the RISC principles of combining leading edge architecture and compiler optimizations. These design decisions have enabled the Cell BE to deliver unprecedented supercomputer-class compute power for consumer applications
Article
Several emerging application domains in scientific computing demand high computation throughputs to achieve terascale or higher performance. Dedicated centers hosting scientific computing tools on a few high-end servers could rely on hardware accelerator co-processors that contain multiple lightweight custom cores interconnected through an on-chip network. With increasing workloads, these many-core platforms need to deliver high overall computation throughput while also being energy-efficient. Conventional multicore architectures can achieve a limited computational throughput due to the inherent multi-hop nature of the on-chip network infrastructure. By inserting long-range links that act as shortcuts in a regular network-on-chip (NoC) architecture, both the achievable bandwidth and energy efficiency of a multicore platform can be significantly enhanced. In this paper, we first propose a NoC-driven use-case model for throughput-oriented scientific applications, and subsequently use the model to study the effect of using long-range links in conjunction with different resource allocation strategies on reducing the overall on-chip communication and enhancing computational throughput. NoCs with both wired and on-chip wireless links are explored in the study. We also evaluate our NoC-based platforms with respect to energy-efficiency and power consumption. We analyze how throughput and power consumption are correlated with the statistical properties of the application traffic. In addition, we compare and analyze chip-level thermal profiles for these alternatives. Our experiments using kernels from a popular phylogenetic inference application suite show that we can deliver computation throughput over 10^11 operations per second, consuming ∼0.5 nJ per operation, while ensuring that on-chip temperature variation is within 26 °C.
Article
This paper presents the design and implementation of the most parameterisable field-programmable gate array (FPGA)-based skeleton for pairwise biological sequence alignment reported in the literature. The skeleton is parameterised in terms of the sequence symbol type, i.e., DNA, RNA, or protein sequences, the sequence lengths, the match score, i.e., the score attributed to a symbol match, mismatch or gap, and the matching task, i.e., the algorithm used to match sequences, which includes global alignment, local alignment, and overlapped matching. Instances of the skeleton implement the Smith-Waterman and the Needleman-Wunsch algorithms. The skeleton has the advantage of being captured in the Handel-C language, which makes it FPGA platform-independent. Hence, the same code could be ported across a variety of FPGA families. It implements the sequence alignment algorithm at hand using a pipeline of basic processing elements, which are tailored to the algorithm parameters. This paper presents a number of optimizations built into the skeleton and applied at compile-time depending on the user-supplied parameters. These result in high performance FPGA implementations tailored to the algorithm at hand. For instance, actual hardware implementations of the Smith-Waterman algorithm for protein sequence alignment achieve speedups of two orders of magnitude compared to equivalent standard desktop software implementations.
Conference Paper
The Smith-Waterman algorithm is a classic dynamic programming algorithm for solving the problem of biological sequence alignment. However, with the rapid growth in the number of DNA and protein sequences, the original sequential algorithm is very time consuming because the same computing task is repeated over large-scale data. Today's GPU (graphics processing unit) consists of hundreds of processors, so it has more computational capacity than current multicore CPUs. As the programmability of GPUs has continuously improved, using them for general-purpose computing is becoming very popular. In order to accelerate sequence alignment, previous researchers used the parallelism along the anti-diagonals of the similarity matrix to parallelize the Smith-Waterman algorithm on the GPU. In this paper, we design a new parallel algorithm that exploits the parallelism along the columns of the similarity matrix to parallelize the Smith-Waterman algorithm on a heterogeneous system based on CPU and GPU. The experimental results show that our new parallel algorithm is more efficient than the previous ones; it takes full advantage of the features of both the CPU and GPU and obtains approximately 37 times speedup compared with the sequential algorithm named OSEARCH implemented on an Intel dual-core E2140 processor.
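The anti-diagonal parallelism attributed to earlier work can be illustrated by reordering the matrix fill. In the sketch below, all cells with the same index sum i + j form one wavefront; each depends only on cells from the two preceding wavefronts, so the inner loop could in principle run in parallel. The code itself is sequential Python, and the function name `sw_score_wavefront` and the scoring values are assumptions. It does not show the column-based scheme proposed in the paper.

```python
def sw_score_wavefront(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman score computed one anti-diagonal at a time."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    # d = i + j identifies an anti-diagonal; valid cells have i, j >= 1.
    for d in range(2, m + n + 1):
        # Cells on this anti-diagonal are mutually independent.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


print(sw_score_wavefront("ACGT", "ACT"))
```

The number of cells per wavefront grows from 1 up to min(m, n) and then shrinks again, so the available parallelism varies over the course of the computation.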
Conference Paper
An innovative reconfigurable supercomputing platform, the XD1000, is being developed by XtremeData to exploit the rapid progress of FPGA technology and the high performance of the HyperTransport interconnect. In this paper, we present implementations of the Smith-Waterman algorithm for both DNA and protein sequences on the platform. The main features include: (1) we bring forward a multistage PE (processing element) design which significantly reduces the FPGA resource usage and hence allows more parallelism to be exploited; (2) our design features a pipelined control mechanism with uneven stage latencies, a key to minimizing the overall PE pipeline cycle time; (3) we also present a compressed substitution matrix storage structure, resulting in a substantial decrease in on-chip SRAM usage. Finally, we implement a 384-PE systolic array running at 66.7 MHz, which can achieve 25.6 GCUPS peak performance. Compared with the 2.2 GHz AMD Opteron host processor, the FPGA coprocessor results in speedups of 185 and 250, respectively.
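The quoted peak figure is consistent with a simple back-of-the-envelope calculation, sketched below under the assumption that each PE updates one matrix cell per clock cycle (the abstract does not state this explicitly). The helper name `peak_gcups` is hypothetical; GCUPS stands for giga cell updates per second.

```python
def peak_gcups(num_pes, clock_hz, cells_per_pe_per_cycle=1):
    """Peak throughput in giga cell updates per second (GCUPS).

    Assumes every PE completes `cells_per_pe_per_cycle` cell updates
    on every clock cycle, which is an idealized upper bound.
    """
    return num_pes * clock_hz * cells_per_pe_per_cycle / 1e9


# 384 PEs at 66.7 MHz, one cell per PE per cycle (assumed).
print(round(peak_gcups(384, 66.7e6), 1))
```

Sustained throughput on real workloads is typically lower than this peak, for example when sequences do not fill the array or when data transfer dominates.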
Article
Sequence alignment is a common and often repeated task in molecular biology. Typical alignment operations consist of finding similarities between a pair of sequences (pairwise sequence alignment) or a family of sequences (multiple sequence alignment). The need for speeding up this treatment comes from the rapid growth rate of biological sequence databases: every year their size increases by a factor of 1.5 to 2. In this paper, we present a new approach to high-performance biological sequence alignment based on commodity PC graphics hardware. Using modern graphics processing units (GPUs) for high-performance computing is facilitated by their enhanced programmability and motivated by their attractive price/performance ratio and incredible growth in speed. To derive an efficient mapping onto this type of architecture, we have reformulated dynamic-programming-based alignment algorithms as streaming algorithms in terms of computer graphics primitives. Our experimental results show that the GPU-based approach allows speedups of more than one order of magnitude with respect to optimized CPU implementations.
L. Hasan, Z. Al-Ars, S. Vassiliadis. Hardware acceleration of sequence alignment algorithms: an overview. Design and Technology of Integrated Systems in Nanoscale Era, pp. 92-97, 2-5 Sept. 2007.
R. Luthy and C. Hoover. Hardware and software systems for accelerating common bioinformatics sequence analysis algorithms. Biosilico, 2(1), 2004.
Altera White Paper. Implementation of the Smith-Waterman Algorithm on a Reconfigurable Supercomputing Platform, September 2007.
A. Nawazn, M. Nadeem, H. van Someren, K. Bertels. A parallel FPGA design of Smith-Waterman traceback. International Conference on Field-Programmable Technology (FPT), 8-10 Dec. 2010.