Conference Paper

Highly efficient mapping of the Smith-Waterman algorithm on CUDA-compatible GPUs.

Dept. of Comput. & Inf. Sci., Nagasaki Univ., Nagasaki, Japan
DOI: 10.1109/ASAP.2010.5540796 In proceeding of: 21st IEEE International Conference on Application-specific Systems Architectures and Processors, ASAP 2010, Rennes, France, 7-9 July 2010
Source: DBLP

ABSTRACT This paper describes a multi-threaded parallel design and implementation of the Smith-Waterman (SW) algorithm on graphic processing units (GPUs) with NVIDIA corporation's Compute Unified Device Architecture (CUDA). Central to this is a divide and conquer approach which divides the computation of a whole pairwise sequence alignment matrix into multiple sub-matrices (or parallelograms) each running efficiently on the available hardware resources of the GPU in hand, with temporary intermediate data stored in global memory. Moreover, we use thread warps and padding techniques in order to decrease the cost of thread synchronization, as well as loop unrolling in order to reduce the cost of conditional branches. While intermediate data is stored in global memory for large queries, the most inner loop in our implementation will only access shared memory and registers. As a result of these optimizations, our implementation of the SW algorithm achieves a throughput ranging between 9.09 GCUPS (Giga Cell Update per Second) and 12.71 GCUPS on a single-GPU version, and a throughput between 29.46 GCUPS and 43.05 GCUPS on a quad-GPU platform. Compared with the best GPU implementation of the SW algorithm reported to date, our implementation achieves up to 46 % improvement in speed. The source code of our implementation is available in the public domain for Bioinformaticians to benefit from its performance.

  • [Show abstract] [Hide abstract]
    ABSTRACT: Based on GPU computing, a fast computing using multi-GPU is proposed for the alignment of vast amounts of T-cell receptor (TCR) nucleotide sequences. Using CUDA-enabled Fermi GPU and CUDA toolkit 4.0 provided by NVIDIA, we design a faster and more effective sequence alignment process based on CPU and multi-GPU: CPU is responsible for logic control, GPU responsible for parallel computing. Inter-task parallel strategy is applied in the part of parallel computing, which not only bring high parallelism, but also make the alignment process not confined to a specific parallel alignment algorithm. Under the same hardware condition, the alignment computing of mouse TCR nucleotide sequences were carried out by multi-GPU computing, single-GPU computing and only-CPU computing respectively. The results show that multi-GPU computing has the best performance considering alignment efficiency and the cost.
    Biomedical Engineering and Sciences (IECBES), 2012 IEEE EMBS Conference on; 01/2012
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: This paper presents a novel substitution matrix loader architecture for pairwise sequence alignment. The search for sequence homology using DP-based alignment matrix computation is an important tool in molecular biology. It can be implemented either by optimal or sub-optimal approaches. Both of these methods require frequent and rapid access to the amino acids probability scores for PE (Processing Element) configuration especially in a folded systolic array. Typical FPGA implementations configure look-up tables in the pipeline PEs either by using a serial configuration chain with different look-up tables or by run time reconfiguration of the same look-up table. In the former case, configuration time increases proportionally to the number of look-up tables, while the latter case suffers from the limited reconfiguration bandwidth. Therefore, in this paper, we propose a highly efficient parallel loader to optimize both time and space complexities of protein sequence alignment in folded systolic arrays, using only two configuration elements (CEs). In addition, the proposed loader enables PEs to be updated with substitution matrix scores concurrently, with the worst case configuration time of 2 × the depth of the PE's look-up table (in clock cycles). This allows for further optimization of the most time consuming alignment matrix computation through efficient scheduling of alignment matrix computation and PE configuration. Implementation results show that the proposed architecture achieves k.NPE speed-up in configuration time (where k is the folding factor and NPE is the number of PEs) compared to classical approaches, at virtually no area overhead.
    Microelectronics (ICM), 2012 24th International Conference on; 01/2012
  • [Show abstract] [Hide abstract]
    ABSTRACT: Our previous study focused on accelerating an important category of DP problems, called nonserial polyadic dynamic programming (NPDP), on a graphics processing unit (GPU). In NPDP applications, the degree of parallelism varies significantly in different stages of computation, making it difficult to fully utilize the compute power of hundreds of pro-cessing cores in a GPU. To address this challenge, we proposed a methodology that can adaptively adjust the thread-level parallelism in mapping a NPDP problem onto the GPU, thus providing sufficient and steady degrees of parallelism across different compute stages. This work aims at further improving the performance of NPDP problems. Sub problems and data are tiled to make it possible to fit small data regions into shared memory and reuse the buffered data for each tile of sub problems, thus reducing the amount of global memory access. However, we found invoking the same kernel many times, due to data consistency enforcement across different stages, makes it impossible to reuse the tiled data in shared memory after the kernel is invoked again. Fortunately, the inter-block synchronization technique allows us to invoke the kernel exactly one time with the restriction that the maximum number of blocks is equal to the total number of streaming multiprocessors. In addition to data reuse, invoking the kernel only one time also enables us to prefetch data to shared memory across inter-block synchronization point, which improves the performance more than data reuse. We realize our approach in a real-world NPDP application â" the optimal matrix parenthesization problem. Experimental results demonstrate invoking a kernel only one time cannot guarantee performance improvement unless we also reuse and prefetch data across barrier synchronization points.
    Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on; 01/2012