Guiming Wu

National University of Defense Technology, Changsha, Hunan, China

Publications (14) · 3.36 Total impact

  • ABSTRACT: The conjugate gradient (CG) solver is an important algorithm for solving symmetric positive definite systems. However, existing CG architectures on field-programmable gate arrays (FPGAs) either need aggressive zero padding or can only be applied to small matrices with particular sparsity patterns. This brief proposes a high-performance architecture for the CG solver on FPGAs that can handle sparse linear systems of arbitrary size and sparsity pattern, and that does not need aggressive zero padding. Our CG architecture mainly consists of a high-throughput sparse matrix-vector multiplication design comprising a multi-output adder tree, a reduction circuit, and a sum sequencer. Our experimental results demonstrate that our CG architecture achieves speedups of 4.62X-9.24X on a Virtex5-330 FPGA relative to a software implementation.
    IEEE Transactions on Circuits and Systems II: Express Briefs 01/2013; 60(11):791-795. · 1.33 Impact Factor
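For reference, the iteration such an architecture implements is the textbook CG loop. The sketch below is ours, not the brief's code (function name and interface are illustrative); the matrix-vector product `A @ p` inside the loop is the kernel that the multi-output adder tree, reduction circuit, and sum sequencer accelerate:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Textbook CG for a symmetric positive definite system Ax = b.

    The dominant per-iteration cost is the matrix-vector product
    A @ p -- the sparse kernel an FPGA design would accelerate.
    """
    x = np.zeros_like(b)
    r = b - A @ x              # residual
    p = r.copy()               # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```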
  • ABSTRACT: LU decomposition for dense matrices is an important linear algebra kernel widely used in both scientific and engineering applications. To efficiently perform large-matrix LU decomposition on FPGAs with limited local memory, we propose a block LU decomposition algorithm on FPGAs that is applicable to matrices of arbitrary size. Our algorithm applies a series of transformations, including loop blocking and space-time mapping, to sequential non-blocking LU decomposition. We also introduce a high-performance, memory-efficient hardware architecture, consisting mainly of a linear array of processing elements (PEs), to implement our block LU decomposition algorithm. Our design achieves optimal performance under various hardware resource constraints. Furthermore, our algorithm and design can be easily extended to multi-FPGA platforms via a block-cyclic data distribution and inter-FPGA communication scheme. A total of 36 PEs can be integrated into a Xilinx Virtex-5 XC5VLX330 FPGA on our self-designed PCI-Express card, reaching a sustained performance of 8.50 GFLOPS at 133 MHz for a matrix size of 16,384, which outperforms several general-purpose processors. For a Xilinx Virtex-6 XC6VLX760, a newer FPGA, we predict that a total of 180 PEs can be integrated, reaching 70.66 GFLOPS at 200 MHz. Compared to previous work, our design integrates twice the number of PEs into the same FPGA and delivers significantly higher performance.
    IEEE Transactions on Computers 04/2012; · 1.38 Impact Factor
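The blocking the abstract describes can be illustrated with a generic textbook block LU (no pivoting). This sketch is ours, not the paper's design; it only shows why an S-by-S blocking keeps the active working set small enough for limited local memory:

```python
import numpy as np

def block_lu(A, S):
    """In-place blocked LU factorization (no pivoting), A = L*U.

    S is the block size; only block-sized panels are touched at a
    time, which is what makes such a scheme fit in FPGA local
    memory. Generic textbook blocking, shown for illustration.
    """
    n = A.shape[0]
    for k in range(0, n, S):
        e = min(k + S, n)
        # Factor the diagonal block with unblocked LU.
        for j in range(k, e):
            A[j+1:e, j] /= A[j, j]
            A[j+1:e, j+1:e] -= np.outer(A[j+1:e, j], A[j, j+1:e])
        if e < n:
            L = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            U = np.triu(A[k:e, k:e])
            # Triangular solves for the right and bottom panels.
            A[k:e, e:] = np.linalg.solve(L, A[k:e, e:])
            A[e:, k:e] = np.linalg.solve(U.T, A[e:, k:e].T).T
            # Trailing submatrix (Schur complement) update.
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A
```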
  • ABSTRACT: Sparse LU decomposition is the core computation in direct methods for solving sparse systems of linear equations, yet little work has been done on parallelizing it on FPGAs. In this paper, we study parallelization strategies for sparse LU decomposition on FPGAs. We first analyze how to parallelize the right-looking algorithm and find that it is not suitable for FPGAs. We then analyze the left-looking algorithm and find it a better candidate. Our design, derived from the left-looking algorithm, is based on a simple yet efficient parallel computational model for FPGAs and consists mainly of multiple parallel processing elements (PEs). A total of 14 PEs can be integrated into a Xilinx Virtex-5 XC5VLX330. Unlike related work, whose designs are restricted to sparse matrices from particular application domains, our hardware design can be applied to any symmetric positive definite or diagonally dominant matrix.
    Field-Programmable Technology (FPT), 2012 International Conference on; 01/2012
  • Guiming Wu, Yong Dou, Miao Wang
    ABSTRACT: In this paper, we present an automatic synthesis framework that maps loop nests to processor arrays with local memories on FPGAs. An affine transformation approach is first proposed to address the space-time mapping problem. A data-driven architecture model is then introduced to enable automatic generation of processor arrays by extracting this model from the transformed loop nests. Techniques for memory allocation, communication generation, and control generation are presented, from which synthesizable RTL code can be easily generated. A preliminary synthesis tool is implemented on top of PLUTO, an automatic polyhedral source-to-source transformation and parallelization framework.
    Field-Programmable Technology (FPT), 2010 International Conference on; 01/2011
  • ABSTRACT: To efficiently perform large-matrix LU decomposition on FPGAs with limited local memory, the original algorithm needs to be blocked. In this paper, we propose a block LU decomposition algorithm for FPGAs that is applicable to matrices of arbitrary size. We introduce a high-performance hardware design, consisting mainly of a linear array of processing elements (PEs), to implement our block LU decomposition algorithm. A total of 36 PEs can be integrated into a Xilinx Virtex-5 XC5VLX330 FPGA on our self-designed PCI-Express card, reaching a sustained performance of 8.50 GFLOPS at 133 MHz, which outperforms previous work.
    18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2010, Charlotte, North Carolina, USA, 2-4 May 2010; 01/2010
  • ABSTRACT: In this paper we explore the capability and flexibility of FPGA solutions for accelerating scientific computing applications that require very high precision arithmetic based on 128-bit or even 256-bit floating-point number representations. This paper addresses accuracy when performing LU decomposition on large-scale matrices. In future ExaScale computing environments, accuracy errors are expected to grow to a level that leaves only 11 significant bits in the mantissa. This is caused by the required large number of accumulation operations, which is on the order of O(n^3). Using exact long fixed-point numbers instead of the usual floating-point numbers in the accumulation process leads to exact accumulation results with only one bit of error, originating from the rounding in the last normalization step. We have developed two types of High Precision Multiplication and Accumulation (HP-MAC) units, for Double-Double (128-bit) and Quad-Double (256-bit) floating-point respectively, and implemented them on FPGA devices. We propose a two-level RAM bank scheme to store and add long fixed-point numbers with minimized critical data path lengths. We also introduce a partial-summation scheme that enhances the pipeline throughput of MAC operations by dividing the summation into 4 partial operations processed in 4 banks. As a proof of concept, we prototyped six 128-bit HP-MAC units on a Xilinx Virtex-5 XC5VLX330 FPGA chip and performed LU decomposition. The experimental results show an accuracy improvement of 10 to 24 bits compared to a software approach with similar precision arithmetic. Moreover, our FPGA-based LU decomposition implementation, running at 133 MHz, achieves 29X-56X better performance and much lower power consumption than a software-based library running on an Intel Core2 Quad Q8200 CPU at 2.33 GHz.
    Proceedings of the 24th International Conference on Supercomputing, 2010, Tsukuba, Ibaraki, Japan, June 2-4, 2010; 01/2010
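The double-double format mentioned above is built from error-free transformations that fit in a few lines. The sketch below is a generic software illustration of that number representation, not the paper's hardware (the HP-MAC units actually accumulate in long fixed point, a different mechanism):

```python
def two_sum(a, b):
    """Knuth's error-free transformation: a + b = s + e exactly,
    where s = fl(a + b) and e is the rounding error. Pairs (s, e)
    are the building block of double-double (128-bit) arithmetic."""
    s = a + b
    v = s - a
    e = (a - (s - v)) + (b - v)
    return s, e

def dd_add(x, y):
    """Add two double-double numbers (hi, lo) -> (hi, lo).
    A simple sketch of the idea, without full renormalization."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)
```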
  • Guiming Wu, Yong Dou, Miao Wang
    ABSTRACT: We present a high-performance, memory-efficient hardware implementation of matrix multiplication for dense matrices of any size on FPGA devices. By applying a series of transformations and optimizations to the original serial algorithm, we obtain an I/O- and memory-optimized block algorithm for matrix multiplication on FPGAs. A linear array of processing elements (PEs) is proposed to implement this block algorithm. We show a significant reduction in hardware resource consumption compared to related work while increasing the clock frequency. Moreover, the memory requirement is reduced from O(S^2) to O(S), where S is the block size, so more PEs can be integrated into the same FPGA device.
    Proceedings of the International Conference on Field-Programmable Technology, FPT 2010, 8-10 December 2010, Tsinghua University, Beijing, China; 01/2010
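The block algorithm's starting point can be illustrated with a generic tiled multiplication. This sketch is ours and shows only the S-by-S blocking that bounds the working set; it does not reproduce the paper's I/O schedule or its O(S)-per-PE storage optimization:

```python
import numpy as np

def blocked_matmul(A, B, S):
    """Block matrix multiplication C = A @ B with S x S tiles.

    Only one tile of A, one of B, and one of C are active per
    innermost step, which is what lets a hardware design keep a
    bounded amount of data in local memory. Illustration only.
    """
    n, m = A.shape
    m2, p = B.shape
    assert m == m2
    C = np.zeros((n, p))
    for i in range(0, n, S):
        for j in range(0, p, S):
            for k in range(0, m, S):
                # Accumulate one tile product into the C tile.
                C[i:i+S, j:j+S] += A[i:i+S, k:k+S] @ B[k:k+S, j:j+S]
    return C
```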
  • ABSTRACT: Reconfigurable computing seeks to balance the high efficiency of custom computing with the flexibility of general-purpose computing. This paper presents the implementation techniques in LEAP, a coarse-grained reconfigurable array, and proposes a speculative execution mechanism for dynamic loop scheduling with the goal of one iteration per cycle, together with implementation techniques that support decoupled synchronization between the token generator and the collector. The paper also introduces techniques for exploiting both intra- and inter-iteration data dependences, with the help of two instructions for special data reuse in loop-carried dependences. The experimental results show that the number of memory accesses drops to, on average, 3% of that of a RISC processor simulator with no memory optimization. In a practical image-matching application, the LEAP architecture achieves a speedup of about 34 times in execution cycles compared with general-purpose processors.
    Science in China Series F Information Sciences 01/2009; 52:575-587. · 0.66 Impact Factor
  • ABSTRACT: Previous work has projected that the peak performance of FPGAs can exceed that of general-purpose processors. However, no work has actually compared the performance of FPGAs and CPUs using standard benchmarks such as the LINPACK benchmark. We propose and implement an FPGA-based hardware design of the LINPACK benchmark, whose key step is LU decomposition with pivoting. We introduce a fine-grained pipelined LU decomposition algorithm that achieves optimal performance by exploiting fine-grained pipeline parallelism. A scalable linear array of processing elements (PEs), the core component of our hardware design, is proposed to implement this algorithm. To the best of our knowledge, this is the first reported FPGA-based pipelined implementation of LU decomposition with pivoting. A total of 19 PEs can be integrated into an Altera Stratix II EP2S130F1020C5 on our self-designed development board. Experimental results show that a speedup of up to 6.14 can be achieved relative to a Pentium 4 processor on the LINPACK benchmark.
    FCCM 2009, 17th IEEE Symposium on Field Programmable Custom Computing Machines, Napa, California, USA, 5-7 April 2009, Proceedings; 01/2009
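The key step, LU decomposition with partial pivoting, can be sketched in its textbook serial form (this is ours for reference, not the paper's pipelined design); the pivot search and row swap at each step are exactly what make pipelining it nontrivial:

```python
import numpy as np

def lu_pivot(A):
    """In-place LU with partial pivoting: A[piv] = L @ U.

    Returns the row permutation as an index array. The argmax
    pivot search and the row swap serialize each column step,
    which is the dependence a pipelined design must work around.
    """
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(n):
        p = k + np.argmax(np.abs(A[k:, k]))   # pivot row
        if p != k:
            A[[k, p]] = A[[p, k]]
            piv[[k, p]] = piv[[p, k]]
        A[k+1:, k] /= A[k, k]                 # column of L
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return piv
```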
  • Guiming Wu, Miao Wang, Yong Dou, Fei Xia
    ABSTRACT: This paper presents our experience exploiting fine-grained pipeline parallelism for wavefront computations on a multicore platform. Wavefront computations are widely used in many application areas, such as scientific computing and dynamic programming. To exploit fine-grained parallelism on multicore platforms, programmers must consider synchronization, scheduling strategies, and data locality. This paper shows the impact of fine-grained synchronization methods, scheduling strategies, and data tile sizes on performance. We propose a low-cost, lock-free, lightweight synchronization method that fully exploits pipeline parallelism. Our evaluation shows that RNAfold, an application for RNA secondary structure prediction, achieves a best speedup of 3.88 on four cores under our framework.
    ICPPW 2009, International Conference on Parallel Processing Workshops, Vienna, Austria, 22-25 September 2009; 01/2009
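The dependence pattern of a wavefront computation can be sketched as follows (a generic serial version, ours for illustration): every cell on one anti-diagonal is independent of the others, which is the parallelism a pipelined multicore schedule exploits:

```python
def wavefront(n, m, f):
    """Fill an n x m table where cell (i, j) depends on its top and
    left neighbors; f(top, left) computes a cell from those two
    values. Cells on the same anti-diagonal d = i + j have no
    mutual dependences, so each sweep below could run in parallel.
    """
    T = [[0] * m for _ in range(n)]
    for d in range(n + m - 1):                       # anti-diagonals
        for i in range(max(0, d - m + 1), min(n, d + 1)):
            j = d - i
            top = T[i-1][j] if i > 0 else 0
            left = T[i][j-1] if j > 0 else 0
            T[i][j] = f(top, left)
    return T
```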
  • Guiming Wu, Jinhui Xu, Yong Dou, Miao Wang
    ABSTRACT: Loop tiling is an effective loop transformation technique that tiles the iteration space of loop nests to improve data locality; appropriate data layout and transfer strategies are also important to support it. This paper describes an approach to enhance data reuse and reduce off-chip memory accesses after loop tiling. Data tiles produced by loop tiling may have overlapping elements, which leads to larger data transfer costs, but also presents an opportunity to exploit data reuse between tiles. Using our approach, we are able to eliminate these unnecessary data transfers and improve performance compared to traditional pure loop tiling.
    Computer Systems Architecture Conference, 2008. ACSAC 2008. 13th Asia-Pacific; 09/2008
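The tile overlap the abstract refers to shows up in even a simple 1-D stencil (a generic example of ours, not the paper's benchmark): each tile needs halo elements that also belong to its neighbor, so transferring tiles independently fetches the overlap twice:

```python
def tiled_stencil(x, S):
    """3-point stencil y[i] = x[i-1] + x[i] + x[i+1], computed tile
    by tile. Each S-sized tile is fetched together with one halo
    element on each side, so adjacent tile transfers overlap; a
    reuse scheme would keep the shared elements on-chip instead of
    re-fetching them. This sketch shows only the overlap.
    """
    n = len(x)
    y = [0] * n
    for t in range(0, n, S):
        lo, hi = max(0, t - 1), min(n, t + S + 1)
        tile = x[lo:hi]                      # tile plus halo
        for i in range(t, min(t + S, n)):
            left = tile[i - 1 - lo] if i > 0 else 0
            mid = tile[i - lo]
            right = tile[i + 1 - lo] if i < n - 1 else 0
            y[i] = left + mid + right
    return y
```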
  • Miao Wang, Guiming Wu, Zhiying Wang
    ABSTRACT: Application-Specific Instruction Processors (ASIPs) have the potential to meet the high-performance demands of multimedia applications such as image processing, audio and video encoding, speech processing, and digital signal processing. To achieve lower cost and better energy efficiency in high-performance embedded systems built from ASIPs, subword parallelism optimization is an important way to accelerate multimedia applications; one major problem, however, is how to exploit subword parallelism on ASIPs with limited resources. This paper shows that loop transformations such as loop unrolling and variable expansion can create opportunities for subword parallelism, and presents a novel approach to recognize and extract subword parallelism based on a Cost Subgraph (CSG). The approach is evaluated on the Transport Triggered Architecture (TTA), a customizable processor architecture particularly suitable for tailoring hardware resources to the requirements of the application. In our experiments, 63.58% of loops, and 85.64% of the instructions in those loops, can exploit subword parallelism. The results indicate that significant subword parallelism can be attained using our method.
    Parallel and Distributed Processing and Applications, 5th International Symposium, ISPA 2007, Niagara Falls, Canada, August 29-31, 2007, Proceedings; 01/2007
  • Yong Dou, Jinhui Xu, Guiming Wu
    ABSTRACT: Two critical points arise in applying loop pipelining to coarse-grained reconfigurable arrays: dynamic loop scheduling to achieve high pipeline throughput, and memory optimization to eliminate redundant memory accesses or overlap memory access with computation. In this paper, we present the implementation techniques in LEAP, a coarse-grained reconfigurable array. We propose a speculative execution mechanism for dynamic loop scheduling with the goal of one iteration per cycle, and present implementation techniques to support decoupling between the token generator and the collector. We introduce techniques for exploiting both intra-iteration and inter-iteration data dependences, and design two instructions for special data reuse in the case of loop-carried dependences. The experimental results show a 72.8x reduction in memory accesses compared with approaches without memory optimization on a RISC processor simulator.
    Reconfigurable Computing: Architectures, Tools and Applications, Third International Workshop, ARC 2007, Mangaratiba, Brazil, March 27-29, 2007.; 01/2007
  • ABSTRACT: This paper introduces LEAP (Loop Engine on Array Processor), a novel coarse-grained reconfigurable architecture that accelerates applications through a Loop Self-Pipelining (LSP) technique. LSP provides an effective execution mode for application pipelining: by mapping and distributing the expression statements of high-level programming languages onto the processing-element array, LEAP can step through loop iterations automatically. The LEAP architecture has no centralized control, no centralized multi-port register file, and no centralized data memory. LEAP can exploit loop-level, instruction-level, and task-level parallelism, and it is a suitable choice for stream-based application domains such as multimedia, DSP, and graphics.
    Advances in Computer Systems Architecture, 11th Asia-Pacific Conference, ACSAC 2006, Shanghai, China, September 6-8, 2006, Proceedings; 01/2006