Conference Paper

A design case study: CPU vs. GPGPU vs. FPGA

DOI: 10.1109/MEMCOD.2009.5185380 Conference: 7th IEEE/ACM International Conference on Formal Methods and Models for Co-Design (MEMOCODE '09), 2009
Source: IEEE Xplore

ABSTRACT This paper describes our winning submission for the Absolute Performance category of the MEMOCODE 2009 Design Contest. We show that our GPGPU-based design achieves performance within a factor of four of theoretical maximum performance for the implemented algorithm. This result was reached after a short design-cycle of 2 man-days, which indicates that the NVIDIA CUDA platform allows for rapid development and optimization of applications that make substantial use of all available GPGPU computing resources. We also analyze the maximum theoretical performance of alternative computing systems that could have been used to implement the algorithm.

  • ABSTRACT: A profile Hidden Markov Model (HMM) is well suited to representing profiles of multiple sequence alignments and is becoming the main method for multiple sequence alignment in bioinformatics. Scoring sequences against profile HMMs is compute-intensive, especially when there are many models with many states each. A parallel algorithm for Graphics Processing Units (GPUs) is presented that scores multiple sequences against profile HMMs quickly on a commodity GPU; it eliminates delete states to greatly reduce the compute load. Access to the profile HMM parameters is accelerated by placing them in the appropriate levels of the GPU memory hierarchy. The algorithm was tested on an NVIDIA 9800 GTX+; experimental results show that it scores multiple sequences against profile HMMs 8~50 times faster than a serial implementation on a Pentium E5200 CPU. (A per-thread GPU scoring sketch in this spirit appears after this list.)
    01/2010;
  • ABSTRACT: A parallel algorithm on graphics processing units (GPUs) is presented that quickly evaluates an observation sequence against hidden Markov models. The evaluation is compute-intensive when there are thousands of Markov models or more, each with many states. The proposed algorithm greatly reduces the computation time, and the hardware it requires is less expensive than other accelerators and is already widespread in PCs. In a supervised recognition task, the hidden Markov models must be sorted by their evaluation probability in descending order so that a suitable model can be selected from the list quickly; a sorting network was therefore implemented on the GPU to sort the models in parallel by their evaluation probability. The algorithm was tested on an NVIDIA 9800 GTX+; experimental results show that it evaluates the probability of an observation sequence against hidden Markov models 10~100 times faster than a serial implementation on a Pentium E5200 CPU. (See the per-thread forward-algorithm sketch after this list.)
    01/2010;
  • ABSTRACT: Bayes' theorem is the most widely used instrument for stochastic inference in nonlinear dynamic systems. The algorithmic implementations of recursive Bayesian estimation for arbitrary systems are particle filters (PFs). They are sampling-based sequential Monte Carlo methods that generate a set of samples to compute an approximation of the Bayesian posterior probability density function. The PF therefore carries a high computational burden, since it converges to the true posterior only as the number of particles Np → ∞. To address this computational problem, a highly parallelized C++ library for implementing Bayes filters (BFs), called the Parallel Bayesian Toolbox (PBT), was developed and released as open-source software for the first time [1]. It features a high-level language interface for numerical calculations and very efficient use of the available central processing units (CPUs) and graphics processing units (GPUs). This significantly increases computational throughput without the need for special hardware such as application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs). (A minimal bootstrap particle-filter sketch appears after this list.)
    2013 International Conference on Computing, Management and Telecommunications (ComManTel); 01/2013
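
Both HMM abstracts above rely on the same parallelization pattern: each GPU thread runs a dynamic-programming recursion that scores the shared observation data against one model, so thousands of models are processed concurrently. The CUDA sketch below illustrates that pattern with the scaled forward algorithm for plain discrete-observation HMMs; it is a hedged illustration rather than code from either paper, and the sizes (N_STATES, N_SYMBOLS, SEQ_LEN, N_MODELS) and the uniform dummy models in main() are assumptions chosen for clarity.

    #include <cstdio>
    #include <cstdlib>
    #include <vector>
    #include <cuda_runtime.h>

    #define N_STATES  8     /* states per model (assumed, for clarity) */
    #define N_SYMBOLS 4     /* size of the discrete observation alphabet */
    #define SEQ_LEN   64    /* observation sequence length */
    #define N_MODELS  1024  /* number of HMMs evaluated in parallel */

    /* One thread evaluates one model with the scaled forward algorithm. */
    __global__ void forward_eval(const float* A,    /* [N_MODELS][N_STATES][N_STATES] transitions */
                                 const float* B,    /* [N_MODELS][N_STATES][N_SYMBOLS] emissions  */
                                 const float* Pi,   /* [N_MODELS][N_STATES] initial distribution  */
                                 const int*   obs,  /* [SEQ_LEN] shared observation sequence      */
                                 float*       logp) /* [N_MODELS] output: log P(obs | model)      */
    {
        int m = blockIdx.x * blockDim.x + threadIdx.x;
        if (m >= N_MODELS) return;
        const float* a  = A  + m * N_STATES * N_STATES;
        const float* b  = B  + m * N_STATES * N_SYMBOLS;
        const float* p0 = Pi + m * N_STATES;

        float alpha[N_STATES], next[N_STATES], loglik = 0.0f, norm = 0.0f;

        /* Initialization: alpha_1(i) = pi_i * b_i(o_1), rescaled to avoid underflow. */
        for (int i = 0; i < N_STATES; ++i) {
            alpha[i] = p0[i] * b[i * N_SYMBOLS + obs[0]];
            norm += alpha[i];
        }
        loglik += logf(norm);
        for (int i = 0; i < N_STATES; ++i) alpha[i] /= norm;

        /* Induction: alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(o_t). */
        for (int t = 1; t < SEQ_LEN; ++t) {
            norm = 0.0f;
            for (int j = 0; j < N_STATES; ++j) {
                float s = 0.0f;
                for (int i = 0; i < N_STATES; ++i) s += alpha[i] * a[i * N_STATES + j];
                next[j] = s * b[j * N_SYMBOLS + obs[t]];
                norm += next[j];
            }
            loglik += logf(norm);
            for (int j = 0; j < N_STATES; ++j) alpha[j] = next[j] / norm;
        }
        logp[m] = loglik;  /* sum of log scaling factors equals log P(obs | model m) */
    }

    int main() {
        /* Uniform dummy models and a random observation sequence; real code would load trained HMMs. */
        std::vector<float> A(N_MODELS * N_STATES * N_STATES, 1.0f / N_STATES);
        std::vector<float> B(N_MODELS * N_STATES * N_SYMBOLS, 1.0f / N_SYMBOLS);
        std::vector<float> Pi(N_MODELS * N_STATES, 1.0f / N_STATES);
        std::vector<int> obs(SEQ_LEN);
        for (auto& o : obs) o = rand() % N_SYMBOLS;

        float *dA, *dB, *dPi, *dLogp; int *dObs;
        cudaMalloc(&dA, A.size() * sizeof(float));    cudaMalloc(&dB, B.size() * sizeof(float));
        cudaMalloc(&dPi, Pi.size() * sizeof(float));  cudaMalloc(&dLogp, N_MODELS * sizeof(float));
        cudaMalloc(&dObs, SEQ_LEN * sizeof(int));
        cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dPi, Pi.data(), Pi.size() * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dObs, obs.data(), SEQ_LEN * sizeof(int), cudaMemcpyHostToDevice);

        forward_eval<<<(N_MODELS + 127) / 128, 128>>>(dA, dB, dPi, dObs, dLogp);

        std::vector<float> logp(N_MODELS);
        cudaMemcpy(logp.data(), dLogp, N_MODELS * sizeof(float), cudaMemcpyDeviceToHost);
        printf("log P(obs | model 0) = %f\n", logp[0]);  /* models could now be ranked by logp */
        return 0;
    }

The per-model log-likelihoods returned by this kernel are exactly what the second abstract's sorting network would then order in parallel; in this sketch that ranking step is left to the host.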
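
The last abstract describes the generic structure that a parallel particle-filter library exploits: propagate a set of particles through the system model, weight them by the measurement likelihood, and resample. The sketch below is a minimal, hypothetical bootstrap particle filter for a standard one-dimensional benchmark model; it is not the PBT API and not code from the paper. Propagation and weighting run as one CUDA thread per particle, while weight normalization and systematic resampling stay on the host; all model parameters, noise levels, and the single measurement are assumptions for illustration.

    #include <cstdio>
    #include <vector>
    #include <random>
    #include <cuda_runtime.h>
    #include <curand_kernel.h>

    #define N_PARTICLES 4096
    #define PROC_STD    0.5f   /* process-noise std-dev (assumed) */
    #define MEAS_STD    0.3f   /* measurement-noise std-dev (assumed) */

    /* One thread per particle: propagate x' = 0.5x + 25x/(1+x^2) + noise,
       then weight by the Gaussian likelihood of the measurement z given 0.05*x'^2. */
    __global__ void propagate_and_weight(float* x, float* w, float z, unsigned long long seed)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N_PARTICLES) return;
        curandState rng;
        curand_init(seed, i, 0, &rng);
        float xi = 0.5f * x[i] + 25.0f * x[i] / (1.0f + x[i] * x[i]) + PROC_STD * curand_normal(&rng);
        float err = z - 0.05f * xi * xi;
        x[i] = xi;
        w[i] = expf(-0.5f * err * err / (MEAS_STD * MEAS_STD));  /* unnormalized importance weight */
    }

    int main() {
        std::vector<float> x(N_PARTICLES, 0.0f), w(N_PARTICLES);
        float *dx, *dw;
        cudaMalloc(&dx, N_PARTICLES * sizeof(float));
        cudaMalloc(&dw, N_PARTICLES * sizeof(float));
        cudaMemcpy(dx, x.data(), N_PARTICLES * sizeof(float), cudaMemcpyHostToDevice);

        float z = 1.0f;  /* a single made-up measurement; a real filter would loop over a stream */
        propagate_and_weight<<<(N_PARTICLES + 127) / 128, 128>>>(dx, dw, z, 1234ULL);
        cudaMemcpy(x.data(), dx, N_PARTICLES * sizeof(float), cudaMemcpyDeviceToHost);
        cudaMemcpy(w.data(), dw, N_PARTICLES * sizeof(float), cudaMemcpyDeviceToHost);

        /* Normalize the weights and form the posterior-mean state estimate. */
        double wsum = 0.0, mean = 0.0;
        for (float wi : w) wsum += wi;
        for (int i = 0; i < N_PARTICLES; ++i) mean += (w[i] / wsum) * x[i];

        /* Systematic resampling on the host: N equally spaced points in [0,1) select particles
           in proportion to their weights, keeping the particle count fixed at N_PARTICLES. */
        std::mt19937 gen(42);
        std::uniform_real_distribution<double> u(0.0, 1.0 / N_PARTICLES);
        std::vector<float> xr(N_PARTICLES);
        double u0 = u(gen), cum = w[0] / wsum;
        for (int i = 0, j = 0; i < N_PARTICLES; ++i) {
            double target = u0 + (double)i / N_PARTICLES;
            while (cum < target && j + 1 < N_PARTICLES) cum += w[++j] / wsum;
            xr[i] = x[j];
        }
        x.swap(xr);  /* the resampled set approximates the posterior with equal weights */
        printf("posterior mean after one update: %f\n", mean);
        return 0;
    }

Because each particle is independent during propagation and weighting, the per-particle kernel is the part that scales with GPU core count; resampling is kept on the host here only to keep the sketch short.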