Guiming Wu

State Key Laboratory of Mathematical Engineering and Advanced Computing, Wu-ch’i, Anhui Province, China

Publications (16) · 4.38 Total Impact

  • Song Guo · Yong Dou · Yuanwu Lei · Guiming Wu
    ABSTRACT: This paper presents a high-performance sparse matrix-vector multiplication (SpMV) accelerator on the field-programmable gate array (FPGA). By exploiting a hardware-friendly storage scheme, named Variable-Bit-Width Coordinate Block Quasi Compressed Sparse Row, redundant computation and memory accesses are greatly reduced through nested block compression and variable-bit-width column-index encoding. Based on the proposed compression scheme, a deeply pipelined SpMV accelerator is implemented on a Xilinx Virtex XC7VX485T FPGA platform, which can handle sparse matrices of arbitrary size and sparsity pattern. Experimental results show that the proposed design achieves higher performance for most of the tested matrices and improves memory bandwidth utilization by up to 13x, compared with previous work on the Convey platforms (HC-1 and HC-2ex) and the Nvidia Tesla S1070 GPU platform.
    Article · May 2015 · IEICE Electronics Express
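    For orientation, a minimal C sketch of the baseline CSR SpMV kernel whose index metadata formats like the paper's Variable-Bit-Width Coordinate Block Quasi-CSR are designed to compress; the struct layout is illustrative, not the paper's format.

        #include <stddef.h>

        /* Baseline CSR sparse matrix-vector multiply, y = A*x.  Blocked,
         * variable-bit-width formats shrink row_ptr/col_idx, the metadata
         * that dominates memory traffic in kernels of this shape. */
        typedef struct {
            size_t n_rows;
            const size_t *row_ptr;   /* length n_rows + 1 */
            const int    *col_idx;   /* length nnz */
            const double *val;       /* length nnz */
        } csr_t;

        void spmv_csr(const csr_t *A, const double *x, double *y)
        {
            for (size_t i = 0; i < A->n_rows; ++i) {
                double acc = 0.0;
                for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k)
                    acc += A->val[k] * x[A->col_idx[k]];
                y[i] = acc;
            }
        }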
  • Xinkai Yan · Guiming Wu · Dong Wu · Fang Zheng · Xianghui Xie
    ABSTRACT: Modular multiplication is one of the most important operations in public-key cryptographic algorithms. To design a high-performance modular multiplier, we present a novel hybrid Montgomery modular multiplier over GF(p) on FPGAs, which employs the Karatsuba and Knuth multiplication algorithms at different levels to implement large-integer multiplication. A 9-stage pipelined full-word multiplier is proposed for 256-bit multiplication with 4-level recursion. The performance of our modular multiplier is improved by optimizing the pipeline and reducing the carry-chain latency of the modular adder. On average, our modular multiplier can perform one 256-bit modular multiplication in 3 cycles. We can integrate 13 modular multipliers on a Xilinx Virtex-6 V6VSX475T FPGA. The experimental results show that a throughput of 856.9 million modular multiplications per second can be achieved and that the hybrid Montgomery modular multiplier performs outstandingly in situations that require continuous multiplications.
    Conference Paper · Dec 2013
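    The arithmetic at the core of such a design, scaled down to one machine word: Montgomery reduction computes t * R^(-1) mod n without division. A hedged C sketch (the paper works at 256 bits with Karatsuba/Knuth multipliers; this version uses the GCC/Clang unsigned __int128 extension and only illustrates the reduction step):

        #include <stdint.h>

        /* Montgomery reduction with R = 2^32: returns t * R^(-1) mod n.
         * n must be odd; n_prime = -n^(-1) mod 2^32 is precomputed. */
        static uint32_t mont_redc(uint64_t t, uint32_t n, uint32_t n_prime)
        {
            uint32_t m = (uint32_t)t * n_prime;  /* m = (t mod R) * n' mod R */
            uint64_t u = (uint64_t)(((unsigned __int128)m * n + t) >> 32);
            return (u >= n) ? (uint32_t)(u - n) : (uint32_t)u;
        }

        /* Product of a and b held in Montgomery form (a = xR mod n). */
        static uint32_t mont_mul(uint32_t a, uint32_t b,
                                 uint32_t n, uint32_t n_prime)
        {
            return mont_redc((uint64_t)a * b, n, n_prime);
        }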
  • Guiming Wu · Xianghui Xie · Yong Dou · Miao Wang
    ABSTRACT: The conjugate gradient (CG) solver is an important algorithm for solving symmetric positive definite systems. However, existing CG architectures on field-programmable gate arrays (FPGAs) either need aggressive zero padding or can only be applied to small matrices and particular matrix sparsity patterns. This brief proposes a high-performance architecture for the CG solver on FPGAs that can handle sparse linear systems of arbitrary size and sparsity pattern, and that does not need aggressive zero padding. Our CG architecture mainly consists of a high-throughput sparse matrix-vector multiplication design comprising a multi-output adder tree, a reduction circuit, and a sum sequencer. Our experimental results demonstrate that our CG architecture can achieve speedups of 4.62x-9.24x on a Virtex5-330 FPGA, relative to a software implementation.
    Article · Nov 2013 · IEEE Transactions on Circuits and Systems II: Express Briefs
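    For reference, a plain C sketch of the unpreconditioned CG iteration such a design accelerates; the matvec callback is where a hardware SpMV engine would plug in, and the signature is illustrative only.

        #include <math.h>
        #include <stddef.h>
        #include <string.h>

        typedef void (*matvec_fn)(void *ctx, const double *x, double *y);

        /* Solves A x = b for SPD A; returns the iteration count, or -1 if
         * not converged.  r, p, Ap are caller-provided length-n scratch. */
        int cg_solve(matvec_fn matvec, void *ctx, size_t n,
                     const double *b, double *x, double tol, int max_iter,
                     double *r, double *p, double *Ap)
        {
            matvec(ctx, x, Ap);
            for (size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
            memcpy(p, r, n * sizeof *p);
            double rr = 0.0;
            for (size_t i = 0; i < n; ++i) rr += r[i] * r[i];

            for (int it = 0; it < max_iter; ++it) {
                if (sqrt(rr) < tol) return it;            /* converged */
                matvec(ctx, p, Ap);                       /* the SpMV step */
                double pAp = 0.0;
                for (size_t i = 0; i < n; ++i) pAp += p[i] * Ap[i];
                double alpha = rr / pAp, rr_new = 0.0;
                for (size_t i = 0; i < n; ++i) {
                    x[i] += alpha * p[i];
                    r[i] -= alpha * Ap[i];
                    rr_new += r[i] * r[i];
                }
                for (size_t i = 0; i < n; ++i)
                    p[i] = r[i] + (rr_new / rr) * p[i];
                rr = rr_new;
            }
            return -1;
        }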
  • ABSTRACT: LU decomposition for dense matrices is an important linear algebra kernel that is widely used in both scientific and engineering applications. To efficiently perform large matrix LU decomposition on FPGAs with limited local memory, a block LU decomposition algorithm on FPGAs applicable to arbitrary matrix size is proposed. Our algorithm applies a series of transformations, including loop blocking and space-time mapping, to sequential non-blocking LU decomposition. We also introduce a high-performance and memory-efficient hardware architecture, which mainly consists of a linear array of processing elements (PEs), to implement our block LU decomposition algorithm. Our design can achieve optimum performance under various hardware resource constraints. Furthermore, our algorithm and design can be easily extended to multi-FPGA platforms by using a block-cyclic data distribution and an inter-FPGA communication scheme. A total of 36 PEs can be integrated into a Xilinx Virtex-5 XC5VLX330 FPGA on our self-designed PCI-Express card, reaching a sustained performance of 8.50 GFLOPS at 133 MHz for a matrix size of 16,384, which outperforms several general-purpose processors. For a Xilinx Virtex-6 XC6VLX760, a newer FPGA, we predict that a total of 180 PEs can be integrated, reaching 70.66 GFLOPS at 200 MHz. Compared to previous work, our design can integrate twice the number of PEs into the same FPGA and has significantly higher performance.
    Article · Apr 2012 · IEEE Transactions on Computers
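    The algorithmic shape in question, as a textbook C sketch: right-looking block LU without pivoting, with the three phases (panel factorization, triangular solve, trailing update) that a linear PE array can stream. The loop structure below is generic, not the paper's space-time mapping.

        #include <stddef.h>

        #define IDX(A, n, i, j) ((A)[(size_t)(i) * (n) + (j)])

        void lu_block(double *A, int n, int B)   /* in place, no pivoting */
        {
            for (int k = 0; k < n; k += B) {
                int kb = (k + B < n) ? B : n - k;
                /* 1. Factor the tall panel A[k:n, k:k+kb]. */
                for (int p = k; p < k + kb; ++p)
                    for (int i = p + 1; i < n; ++i) {
                        IDX(A, n, i, p) /= IDX(A, n, p, p);
                        for (int j = p + 1; j < k + kb; ++j)
                            IDX(A, n, i, j) -= IDX(A, n, i, p) * IDX(A, n, p, j);
                    }
                /* 2. Row panel: U12 = L11^(-1) * A12 (forward substitution). */
                for (int p = k; p < k + kb; ++p)
                    for (int i = p + 1; i < k + kb; ++i)
                        for (int j = k + kb; j < n; ++j)
                            IDX(A, n, i, j) -= IDX(A, n, i, p) * IDX(A, n, p, j);
                /* 3. Trailing update: A22 -= L21 * U12 (a matrix multiply). */
                for (int i = k + kb; i < n; ++i)
                    for (int p = k; p < k + kb; ++p)
                        for (int j = k + kb; j < n; ++j)
                            IDX(A, n, i, j) -= IDX(A, n, i, p) * IDX(A, n, p, j);
            }
        }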
  • Guiming Wu · Xianghui Xie · Yong Dou · Junqing Sun · Dong Wu · Yuan Li
    ABSTRACT: Sparse LU decomposition is the core computation in the direct method for solving sparse systems of linear equations. Little work has been done on parallelizing it on FPGAs. In this paper, we study parallelization strategies for sparse LU decomposition on FPGAs. We first analyze how to parallelize the right-looking algorithm and find that it is not suitable for FPGAs. We then analyze the left-looking algorithm and find it a better candidate than the right-looking version. Our design, derived from the left-looking algorithm, is based on a simple yet efficient parallel computational model for FPGAs and mainly consists of multiple parallel processing elements (PEs). A total of 14 PEs can be integrated into a Xilinx Virtex-5 XC5VLX330. Unlike related designs, which are applied to sparse matrices from particular application domains, our hardware design can be applied to any symmetric positive definite or diagonally dominant matrix.
    Conference Paper · Jan 2012
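    For intuition about the two variants: the dense analogue of the left-looking dependence structure, in C. Each column is first updated by all columns to its left, then scaled; there is no pivoting, which is safe for the SPD and diagonally dominant matrices the design targets.

        /* Dense left-looking LU, column by column, in place (row-major). */
        void lu_left_looking(double *A, int n)
        {
            for (int j = 0; j < n; ++j) {
                /* "Look left": apply updates from columns 0..j-1. */
                for (int p = 0; p < j; ++p)
                    for (int i = p + 1; i < n; ++i)
                        A[i * n + j] -= A[i * n + p] * A[p * n + j];
                /* Scale the subdiagonal to form L(:,j). */
                for (int i = j + 1; i < n; ++i)
                    A[i * n + j] /= A[j * n + j];
            }
        }

    All writes in one outer step touch only column j; that confinement of writes is one property that makes the left-looking order friendlier to streaming hardware than the right-looking one.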
  • Guiming Wu · Yong Dou · Miao Wang
    ABSTRACT: In this paper, we present an automatic synthesis framework that maps loop nests to processor arrays with local memories on FPGAs. An affine transformation approach is first proposed to address the space-time mapping problem. A data-driven architecture model, extracted from the transformed loop nests, is then introduced to enable automatic generation of processor arrays. Techniques for memory allocation, communication generation, and control generation are presented; synthesizable RTL code can easily be generated from the architecture model they build. A preliminary synthesis tool is implemented on top of PLUTO, an automatic polyhedral source-to-source transformation and parallelization framework.
    Conference Paper · Jan 2011
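    A toy instance of the space-time mapping involved, in C: for the dependence pattern (i,j) <- (i-1,j), (i,j-1), the affine schedule t = i + j serializes time steps while p = i indexes processors, so all iterations at a given t are independent. This is a generic skewing example, not output of the described tool.

        #define N 64

        void original(double A[N][N])
        {
            for (int i = 1; i < N; ++i)
                for (int j = 1; j < N; ++j)
                    A[i][j] = A[i - 1][j] + A[i][j - 1];
        }

        void transformed(double A[N][N])
        {
            for (int t = 2; t <= 2 * (N - 1); ++t) {      /* time: sequential  */
                int lo = (t - (N - 1) > 1) ? t - (N - 1) : 1;
                int hi = (t - 1 < N - 1) ? t - 1 : N - 1;
                for (int p = lo; p <= hi; ++p) {          /* space: one per PE */
                    int i = p, j = t - p;
                    A[i][j] = A[i - 1][j] + A[i][j - 1];
                }
            }
        }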
  • Yong Dou · Yuanwu Lei · Guiming Wu · Song Guo · Jie Zhou · Li Shen
    ABSTRACT: In this paper we explore the capability and flexibility of FPGA solutions for accelerating scientific computing applications that require very high precision arithmetic, based on 128-bit or even 256-bit floating-point number representations. The paper addresses the accuracy of LU decomposition on large-scale matrices. In future ExaScale computing environments, accumulated errors are expected to grow to a level that leaves only 11 significant bits in the mantissa, caused by the required large number of accumulation operations, which is on the order of O(n^3). Using exact long fixed-point numbers instead of the usual floating-point numbers in the accumulation process leads to exact accumulation results with at most one bit of error, originating from the rounding in the final normalization step. We have developed two types of High Precision Multiplication and Accumulation (HP-MAC) units, for Double-Double (128-bit) and Quad-Double (256-bit) floating-point respectively, and implemented them on FPGA devices. We propose a two-level RAM bank scheme to store and add long fixed-point numbers with minimized critical path lengths, and introduce a partial summation scheme that enhances the pipeline throughput of MAC operations by dividing the summation into 4 partial operations processed in 4 banks. To prove the concept, we prototyped six 128-bit HP-MAC units on a Xilinx Virtex-5 XC5VLX330 FPGA chip and performed LU decomposition. The experimental results show an accuracy improvement of 10 to 24 bits compared to a software approach with similar-precision arithmetic. Moreover, our FPGA-based LU decomposition implementation, running at 133 MHz, achieves 29x-56x better performance and much lower power consumption than a software-based library running on an Intel Core2 Quad Q8200 CPU at 2.33 GHz.
    Conference Paper · Jan 2010
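    For context, the error-free transforms that underlie the double-double (128-bit) operand format, in C; the paper's contribution is to replace the floating-point accumulation itself with a long fixed-point register, which these helpers do not show. Requires a libm with a correct fma.

        #include <math.h>

        typedef struct { double hi, lo; } dd_t;

        /* Knuth's TwoSum: a + b = s + e exactly, with no branches. */
        static dd_t two_sum(double a, double b)
        {
            double s = a + b;
            double bb = s - a;
            double e = (a - (s - bb)) + (b - bb);
            return (dd_t){ s, e };
        }

        /* TwoProd via fused multiply-add: a * b = p + e exactly. */
        static dd_t two_prod(double a, double b)
        {
            double p = a * b;
            double e = fma(a, b, -p);   /* exact low-order residual */
            return (dd_t){ p, e };
        }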
  • Guiming Wu · Yong Dou · Miao Wang
    ABSTRACT: We present a high-performance and memory-efficient hardware implementation of matrix multiplication for dense matrices of any size on FPGA devices. By applying a series of transformations and optimizations to the original serial algorithm, we obtain an I/O- and memory-optimized block algorithm for matrix multiplication on FPGAs. A linear array of processing elements (PEs) is proposed to implement this block algorithm. We show a significant reduction in hardware resource consumption compared to related work, while increasing the clock frequency. Moreover, the memory requirement is reduced from O(S^2) to O(S), where S is the block size, so more PEs can be integrated into the same FPGA device.
    Conference Paper · Jan 2010
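    A C sketch of the blocked multiplication pattern: with S x S tiling the working set per block step is bounded by the tile size, which is what lets a fixed amount of on-chip memory serve matrices of any size. The tile size below is an arbitrary placeholder.

        #define S 32   /* block size; on an FPGA this is sized to block RAM */

        /* C += A * B for row-major n x n matrices, tiled in S x S blocks. */
        void matmul_tiled(const double *A, const double *B, double *C, int n)
        {
            for (int ii = 0; ii < n; ii += S)
                for (int kk = 0; kk < n; kk += S)
                    for (int jj = 0; jj < n; jj += S)
                        for (int i = ii; i < ii + S && i < n; ++i)
                            for (int k = kk; k < kk + S && k < n; ++k) {
                                double a = A[i * n + k];
                                for (int j = jj; j < jj + S && j < n; ++j)
                                    C[i * n + j] += a * B[k * n + j];
                            }
        }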
  • Guiming Wu · Yong Dou · Gregory D. Peterson
    ABSTRACT: To efficiently perform large matrix LU decomposition on FPGAs with limited local memory, the original algorithm needs to be blocked. In this paper, we propose a block LU decomposition algorithm for FPGAs that is applicable to matrices of arbitrary size. We introduce a high-performance hardware design, which mainly consists of a linear array of processing elements (PEs), to implement our block LU decomposition algorithm. A total of 36 PEs can be integrated into a Xilinx Virtex-5 XC5VLX330 FPGA on our self-designed PCI-Express card, reaching a sustained performance of 8.50 GFLOPS at 133 MHz, which outperforms previous work.
    Conference Paper · Jan 2010
  • Yong Dou · Guiming Wu · Jinhui Xu · Xingming Zhou
    ABSTRACT: Reconfigurable computing tries to strike a balance between the high efficiency of custom computing and the flexibility of general-purpose computing. This paper presents the implementation techniques in LEAP, a coarse-grained reconfigurable array, and proposes a speculative execution mechanism for dynamic loop scheduling with the goal of one iteration per cycle, together with implementation techniques that decouple synchronization between the token generator and the collector. It also introduces techniques for exploiting both intra- and inter-iteration data dependences, with the help of two instructions for special data reuse in loop-carried dependences. The experimental results show that the number of memory accesses is, on average, 3% of that of a RISC processor simulator with no memory optimization. In a practical image matching application, the LEAP architecture achieves a speedup of about 34x in execution cycles compared with general-purpose processors.
    Article · Apr 2009 · Science in China Series F: Information Sciences
  • Guiming Wu · Miao Wang · Yong Dou · Fei Xia
    ABSTRACT: This paper presents our experience with exploiting fine-grained pipeline parallelism for wavefront computations on a multicore platform. Wavefront computations are widely used in application areas such as scientific computing and dynamic programming. To exploit fine-grained parallelism on multicore platforms, programmers must consider synchronization, scheduling strategy, and data locality. This paper shows the impact of fine-grained synchronization methods, scheduling strategies, and data tile sizes on performance. We propose a low-cost, lock-free, and lightweight synchronization method that can fully exploit pipeline parallelism. Our evaluation shows that RNAfold, an application for RNA secondary structure prediction, achieves a speedup of up to 3.88 on four cores under our framework.
    Conference Paper · Jan 2009
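    A minimal C11 sketch in the spirit of the paper's lock-free synchronization: one worker per row, a per-row progress counter, and acquire/release atomics instead of locks. The DP kernel and all names are stand-ins, not RNAfold's recurrence.

        #include <stdatomic.h>

        #define ROWS 512
        #define COLS 1024
        #define TILE 64    /* data tile size; COLS must be a multiple of it */

        static double cell[ROWS][COLS];
        static _Atomic int progress[ROWS];  /* last column each row finished */

        /* Run one of these per thread, for i = 0..ROWS-1.  Row i may enter
         * a tile only after row i-1 has published the same columns. */
        static void compute_row(int i)
        {
            for (int j0 = 0; j0 < COLS; j0 += TILE) {
                if (i > 0)
                    while (atomic_load_explicit(&progress[i - 1],
                                                memory_order_acquire) < j0 + TILE)
                        ;                   /* spin: lock-free wait */
                for (int j = j0; j < j0 + TILE; ++j) {
                    double up   = (i > 0) ? cell[i - 1][j] : 0.0;
                    double left = (j > 0) ? cell[i][j - 1] : 0.0;
                    cell[i][j] = up + left + 1.0;   /* stand-in DP kernel */
                }
                atomic_store_explicit(&progress[i], j0 + TILE,
                                      memory_order_release);
            }
        }

    Launching one thread per row (or a row-cyclic pool) yields the pipeline: rows trail one another by at least one tile, and the tile size trades synchronization frequency against pipeline fill time, the same knobs the paper studies.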
  • Guiming Wu · Yong Dou · Yuanwu Lei · Jie Zhou · Miao Wang · Jingfei Jiang
    ABSTRACT: Previous work has projected that the peak performance of FPGAs can exceed that of general-purpose processors. However, no work has actually compared the performance of FPGAs and CPUs using standard benchmarks such as LINPACK. We propose and implement an FPGA-based hardware design of the LINPACK benchmark, whose key step is LU decomposition with pivoting. We introduce a fine-grained pipelined LU decomposition algorithm that enables optimum performance by exploiting fine-grained pipeline parallelism. A scalable linear array of processing elements (PEs), the core component of our hardware design, is proposed to implement this algorithm. To the best of our knowledge, this is the first reported FPGA-based pipelined implementation of LU decomposition with pivoting. A total of 19 PEs can be integrated into an Altera Stratix II EP2S130F1020C5 on our self-designed development board. Experimental results show that a speedup of up to 6.14 can be achieved relative to a Pentium 4 processor on the LINPACK benchmark.
    Conference Paper · Jan 2009
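    The sequential kernel being pipelined, for reference: in-place LU with partial pivoting in C (pivot search, row swap, rank-1 update per column). The paper's contribution, overlapping these stages in a fine-grained pipeline, is not visible in a sequential sketch.

        #include <math.h>

        void lu_pivot(double *A, int *piv, int n)   /* row-major n x n */
        {
            for (int k = 0; k < n; ++k) {
                int p = k;                          /* partial pivot search */
                for (int i = k + 1; i < n; ++i)
                    if (fabs(A[i * n + k]) > fabs(A[p * n + k])) p = i;
                piv[k] = p;
                if (p != k)                         /* swap rows k and p */
                    for (int j = 0; j < n; ++j) {
                        double t = A[k * n + j];
                        A[k * n + j] = A[p * n + j];
                        A[p * n + j] = t;
                    }
                for (int i = k + 1; i < n; ++i) {   /* eliminate below pivot */
                    A[i * n + k] /= A[k * n + k];
                    for (int j = k + 1; j < n; ++j)
                        A[i * n + j] -= A[i * n + k] * A[k * n + j];
                }
            }
        }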
  • Guiming Wu · Jinhui Xu · Yong Dou · Miao Wang
    ABSTRACT: Loop tiling is an effective loop transformation technique that tiles the iteration space of loop nests to improve data locality. Appropriate data layout and transfer strategies are also needed to support loop tiling. This paper describes an approach to enhance data reuse and reduce off-chip memory accesses after loop tiling. Data tiles produced by loop tiling may have overlapping elements, which leads to larger data transfer costs; this also presents the challenge of exploiting data reuse between tiles. Using our approach, we are able to eliminate these unnecessary data transfers and improve performance compared to traditional pure loop tiling.
    Conference Paper · Sep 2008
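    A one-dimensional illustration of the inter-tile reuse at issue, in C: adjacent tiles of a 3-point stencil overlap by two input elements, and carrying that halo in a local buffer avoids re-fetching it. copy_in() is a hypothetical stand-in for an explicit off-chip-to-on-chip transfer, and n is assumed to be a multiple of the tile size.

        #define T 256   /* tile size */

        extern void copy_in(double *dst, const double *src, int cnt);

        void stencil_tiled(const double *in, double *out, int n)
        {
            double buf[T + 2];              /* one tile plus two-element halo */
            copy_in(buf, in, T + 2 < n ? T + 2 : n);
            for (int t = 0; t * T < n; ++t) {
                for (int i = 1; i <= T; ++i) {
                    int j = t * T + i;      /* global output index */
                    if (j < n - 1)
                        out[j] = (buf[i - 1] + buf[i] + buf[i + 1]) / 3.0;
                }
                if ((t + 1) * T >= n) break;
                buf[0] = buf[T];            /* reuse the overlap instead of */
                buf[1] = buf[T + 1];        /* transferring it again        */
                int base = (t + 1) * T + 2;
                int cnt = (base + T <= n) ? T : n - base;
                if (cnt > 0) copy_in(buf + 2, in + base, cnt);
            }
        }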
  • Miao Wang · Guiming Wu · Zhiying Wang
    ABSTRACT: Application-specific instruction processors (ASIPs) have the potential to meet the high-performance demands of multimedia applications, such as image processing, audio and video encoding, speech processing, and digital signal processing. To achieve lower cost and better energy efficiency in high-performance embedded systems built from ASIPs, subword parallelism optimization is an important means of accelerating multimedia applications. One major problem, however, is how to exploit subword parallelism on ASIPs with limited resources. This paper shows that loop transformations such as loop unrolling and variable expansion can be used to create opportunities for subword parallelism, and presents a novel approach to recognizing and extracting subword parallelism based on a Cost Subgraph (CSG). The approach is evaluated on the Transport Triggered Architecture (TTA), a customizable processor architecture that is particularly suitable for tailoring hardware resources to the requirements of the application. In our experiments, 63.58% of loops, and 85.64% of the instructions in those loops, can exploit subword parallelism. The results indicate that significant subword parallelism can be attained using our method.
    Conference Paper · Aug 2007
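    The flavor of subword parallelism at issue, sketched as portable C (SWAR, "SIMD within a register"): four packed 8-bit additions performed by one 32-bit operation, with masking to keep carries from crossing lane boundaries. An ASIP would expose this as a custom subword instruction rather than as bit tricks.

        #include <stdint.h>

        /* Four lane-wise 8-bit adds in one 32-bit word. */
        static uint32_t paddb(uint32_t a, uint32_t b)
        {
            /* Add the low 7 bits of each lane; bit 7 is held back so no
             * carry can cross into the neighboring lane. */
            uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
            /* Restore each lane's top bit: a7 ^ b7 ^ carry_in. */
            return low ^ ((a ^ b) & 0x80808080u);
        }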
  • Yong Dou · Jinhui Xu · Guiming Wu
    ABSTRACT: Two critical issues arise when applying loop pipelining to coarse-grained reconfigurable arrays. One is dynamic loop scheduling to achieve high pipeline throughput. The other is memory optimization to eliminate redundant memory accesses or to overlap memory access with computation. In this paper, we present the implementation techniques in LEAP, a coarse-grained reconfigurable array. We propose a speculative execution mechanism for dynamic loop scheduling with the goal of one iteration per cycle, and present implementation techniques that support decoupling between the token generator and the collector. We introduce techniques for exploiting both intra-iteration and inter-iteration data dependences, and design two instructions for special data reuse in the presence of loop-carried dependences. The experimental results show that memory accesses are reduced by up to 72.8x compared with an approach without memory optimization on a RISC processor simulator.
    Conference Paper · Jan 2007
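    A scalar C illustration of the loop-carried data reuse that such special instructions capture in hardware: the value produced in iteration i-1 is forwarded in a register instead of being re-loaded from memory each iteration.

        /* Prefix sum, naive: s[i-1] is stored and then re-loaded. */
        void prefix_sum_naive(const double *a, double *s, int n)
        {
            s[0] = a[0];
            for (int i = 1; i < n; ++i)
                s[i] = s[i - 1] + a[i];
        }

        /* With the loop-carried value kept in a register: one load and one
         * store per element, the reuse a datapath can wire up directly. */
        void prefix_sum_reused(const double *a, double *s, int n)
        {
            double carry = s[0] = a[0];
            for (int i = 1; i < n; ++i) {
                carry += a[i];
                s[i] = carry;
            }
        }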
  • Jinhui Xu · Guiming Wu · Yong Dou · Yazhuo Dong
    ABSTRACT: This paper introduces LEAP (Loop Engine on Array Processor), a novel coarse-grained reconfigurable architecture that accelerates applications through a Loop Self-Pipelining (LSP) technique. LSP provides an effective execution mode for application pipelining: by mapping and distributing the expression statements of a high-level programming language onto the processing element array, LEAP can step through loop iterations automatically. The LEAP architecture has no centralized control, no centralized multi-port registers, and no centralized data memory. It can exploit loop-level, instruction-level, and task-level parallelism, making it a suitable choice for stream-based application domains such as multimedia, DSP, and graphics.
    Conference Paper · Jan 2006

Publication Stats

50 Citations
4.38 Total Impact Points

Institutions

  • 2012-2015
    • State Key Laboratory of Mathematical Engineering and Advanced Computing
      Wu-ch’i, Anhui Province, China
  • 2006-2012
    • National University of Defense Technology
      • National Key Laboratory of Parallel and Distributed Processing
      Changsha, Hunan, China