This paper describes a multi-threaded parallel design and implementation of the Smith-Waterman (SW) algorithm on graphic processing units (GPUs) with NVIDIA corporation's Compute Unified Device Architecture (CUDA). Central to this is a divide and conquer approach which divides the computation of a whole pairwise sequence alignment matrix into multiple sub-matrices (or parallelograms) each running efficiently on the available hardware resources of the GPU in hand, with temporary intermediate data stored in global memory. Moreover, we use thread warps and padding techniques in order to decrease the cost of thread synchronization, as well as loop unrolling in order to reduce the cost of conditional branches. While intermediate data is stored in global memory for large queries, the most inner loop in our implementation will only access shared memory and registers. As a result of these optimizations, our implementation of the SW algorithm achieves a throughput ranging between 9.09 GCUPS (Giga Cell Update per Second) and 12.71 GCUPS on a single-GPU version, and a throughput between 29.46 GCUPS and 43.05 GCUPS on a quad-GPU platform. Compared with the best GPU implementation of the SW algorithm reported to date, our implementation achieves up to 46 % improvement in speed. The source code of our implementation is available in the public domain for Bioinformaticians to benefit from its performance.
"In either case, aligning sequences in a realistic time becomes an issue due to the exponential growth of database sequences. Advancement in computing technologies has seen the use of parallel architectures such as FPGAs in , , , , and GPUs in , , ,  to accelerate the time consuming DP-based algorithm. In hardware, the algorithm is usually accelerated using a linear systolic array. "
[Show abstract][Hide abstract] ABSTRACT: This paper presents a novel substitution matrix loader architecture for pairwise sequence alignment. The search for sequence homology using DP-based alignment matrix computation is an important tool in molecular biology. It can be implemented either by optimal or sub-optimal approaches. Both of these methods require frequent and rapid access to the amino acids probability scores for PE (Processing Element) configuration especially in a folded systolic array. Typical FPGA implementations configure look-up tables in the pipeline PEs either by using a serial configuration chain with different look-up tables or by run time reconfiguration of the same look-up table. In the former case, configuration time increases proportionally to the number of look-up tables, while the latter case suffers from the limited reconfiguration bandwidth. Therefore, in this paper, we propose a highly efficient parallel loader to optimize both time and space complexities of protein sequence alignment in folded systolic arrays, using only two configuration elements (CEs). In addition, the proposed loader enables PEs to be updated with substitution matrix scores concurrently, with the worst case configuration time of 2 × the depth of the PE's look-up table (in clock cycles). This allows for further optimization of the most time consuming alignment matrix computation through efficient scheduling of alignment matrix computation and PE configuration. Implementation results show that the proposed architecture achieves k.NPE speed-up in configuration time (where k is the folding factor and NPE is the number of PEs) compared to classical approaches, at virtually no area overhead.
Microelectronics (ICM), 2012 24th International Conference on; 01/2012
[Show abstract][Hide abstract] ABSTRACT: Large Scale DNA sequence alignment and Kernel method in molecular biology play critical roles in bioinformatics. Both of which are successfully implemented on the brook+ platform with AMD's GPUs. Aiming at the characters of graphical stream processors, we propose internal and external approach cooperatively to promote the performance of the two algorithms. The experiments show that the performance of 2D graph-based DNA sequence alignment based on GPUs is more than 25x faster than that of a single CPU-based one, and the parallel kernel matrix computing runs 7x faster on the same platform due to the write-back bottleneck of graph memory which simultaneously reveals the limitation of GPUs computing.
[Show abstract][Hide abstract] ABSTRACT: This paper analyses two methods of organizing parallelism for the Smith-Waterman algorithm, and show how they perform relative
to peak performance when the amount of parallelism varies. A novel systolic design is introduced, with a processing element
optimized for computing the affine gap cost function. Our FPGA design is significantly more energy-efficient than GPU designs.
For example, our design for the XC5VLX330T FPGA achieves around 16 GCUPS/W, while CPUs and GPUs have a power efficiency of
lower than 0.5 GCUPS/W.
Reconfigurable Computing: Architectures, Tools and Applications - 7th International Symposium, ARC 2011, Belfast, UK, March 23-25, 2011. Proceedings; 01/2011
Data provided are for informational purposes only. Although carefully collected, accuracy cannot be guaranteed. The impact factor represents a rough estimation of the journal's impact factor and does not reflect the actual current impact factor. Publisher conditions are provided by RoMEO. Differing provisions from the publisher's actual policy or licence agreement may be applicable.