Conference Paper

FPGA Accelerated Parallel Sparse Matrix Factorization for Circuit Simulations.

DOI: 10.1007/978-3-642-19475-7_33 Conference: Reconfigurable Computing: Architectures, Tools and Applications - 7th International Symposium, ARC 2011, Belfast, UK, March 23-25, 2011. Proceedings
Source: DBLP

ABSTRACT Sparse matrix factorization is a critical step for the circuit simulation problem, since it is time consuming and computed
repeatedly in the flow of circuit simulation. To accelerate the factorization of sparse matrices, a parallel CPU+FPGA based
architecture is proposed in this paper. While the pre-processing of the matrix is implemented on CPU, the parallelism of numeric
factorization is explored by processing several columns of the sparse matrix simultaneously on a set of processing elements
(PE) in FPGA. To cater for the requirements of circuit simulation, we also modified the Gilbert/Peierls (G/P) algorithm and
considered the scalability of our architecture. Experimental results on circuit matrices from the University of Florida Sparse
Matrix Collection show that our architecture achieves speedup of 0.5x-5.36x compared with the CPU KLU results.

2 Followers
 · 
162 Views
  • [Show abstract] [Hide abstract]
    ABSTRACT: Sparse LU decomposition is the core computation in the direct method that solves sparse systems of linear equations. Only little work has been conducted on parallelizing it on FPGAs. In this paper, we study parallelization strategies for sparse LU decomposition on FPGAs. We first analyze how to parallelize the right-looking algorithm and find that this algorithm is not suitable for FPGAs. Then the left-looking algorithm is analyzed and considered as better candidate than the right-looking version. Our design derived from the left-looking algorithm is based on a simple yet efficient parallel computational model for FPGAs. Our design mainly consists of multiple parallel processing elements (PEs). A total of 14 PEs can be integrated into a Xilinx Virtex-5 XC5VLX330. Unlike related work, where their designs are applied to sparse matrices from particular application domains, our hardware design can be applied to any symmetric positive definite or diagonally dominant matrices.
    Field-Programmable Technology (FPT), 2012 International Conference on; 01/2012
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Considering the increasing complexity of integrated circuit (IC) designs at Nano-Tera scale, multi-core CPUs and many-core GPUs have provided ideal hardware platforms for emerging parallel algorithm developments in electronic design automation (EDA). However, it has become extremely challenging to leverage parallel hardware platforms at extreme scale beyond 22nm and 60GHz where the EDA algorithms, such as circuit simulation, show strong data dependencies. This paper presents data dependency elimination in circuit simulation algorithms such as parasitic extraction, transient simulation and periodic-steady-state (PSS) simulation, which paves the way towards unleashing the underlying power of parallel hardware platforms.
    IEEE Design and Test of Computers 02/2013; 30(1):26-35. DOI:10.1109/MDT.2012.2226201 · 1.62 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The sparse matrix solver has become a bottleneck in simulation program with integrated circuit emphasis (SPICE)-like circuit simulators. It is difficult to parallelize the solver because of the high data dependency during the numeric LU factorization and the irregular structure of circuit matrices. This paper proposes an adaptive sparse matrix solver called NICSLU, which uses a multithreaded parallel LU factorization algorithm on shared-memory computers with multicore/multisocket central processing units to accelerate circuit simulation. The solver can be used in all the SPICE-like circuit simulators. A simple method is proposed to predict whether a matrix is suitable for parallel factorization, such that each matrix can achieve optimal performance. The experimental results on 35 matrices reveal that NICSLU achieves speedups of 2.08× ~ 8.57×(on the geometric mean), compared with KLU, with 1-12 threads, for the matrices which are suitable for the parallel algorithm. NICSLU can be downloaded from http://nicslu.weebly.com.
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 02/2013; 32(2):161-274. DOI:10.1109/TCAD.2012.2217964 · 1.20 Impact Factor