Scalable FPGA-array for high-performance and power-efficient computation based on difference schemes
ABSTRACT
For numerical computations that require a relatively high ratio of data access to arithmetic operations, the scalability of memory bandwidth is key to improving performance. In this paper, we propose a scalable FPGA-array for building custom computing machines that perform high-performance and power-efficient scientific simulations based on difference schemes. On the FPGA-array, we construct a systolic computational-memory array (SCMA) by homogeneously partitioning it among multiple tightly-coupled FPGAs. A large SCMA implemented with many FPGAs achieves high-performance computation, with memory bandwidth and arithmetic performance that both scale with the array size. For feasibility demonstration and quantitative evaluation, we design and implement an SCMA of 192 processing elements over two ALTERA StratixII FPGAs. Running at 106 MHz, the implemented SCMA achieves a sustained performance of 32.8 to 36.5 GFlops in single precision for three benchmark computations, against a peak performance of 40.7 GFlops. Compared with a 3.4 GHz Pentium4 processor, the SCMAs consume 70% to 87% of the power and require only 3% to 7% of the energy for the same computations. Based on a requirement model for inter-FPGA bandwidth, we show that SCMAs are completely scalable for currently available high-end to low-end FPGAs, and the SCMA implemented with two FPGAs achieves twice the performance of the single-FPGA SCMA.
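The peak and sustained figures above can be related with a short back-of-the-envelope calculation. The sketch below assumes, as an inference rather than a statement from the abstract, that each processing element completes 2 single-precision floating-point operations per clock cycle (for example, one multiply and one add); under that assumption, 192 PEs at 106 MHz give the quoted 40.7 GFlops peak. The helper names `peak_gflops` and `efficiency` are hypothetical and only for illustration.

```python
# Assumption (not stated in the abstract): each PE completes 2 single-precision
# floating-point operations per clock cycle, e.g. one multiply and one add.
FLOPS_PER_PE_PER_CYCLE = 2


def peak_gflops(num_pes: int, clock_mhz: float,
                flops_per_cycle: int = FLOPS_PER_PE_PER_CYCLE) -> float:
    """Peak throughput in GFlops = PEs x clock (Hz) x flops per PE per cycle."""
    return num_pes * clock_mhz * 1e6 * flops_per_cycle / 1e9


def efficiency(sustained_gflops: float, peak: float) -> float:
    """Fraction of peak throughput actually sustained."""
    return sustained_gflops / peak


peak = peak_gflops(192, 106.0)
print(f"peak: {peak:.1f} GFlops")
for sustained in (32.8, 36.5):
    print(f"{sustained} GFlops -> {efficiency(sustained, peak):.1%} of peak")
```

Under this assumption, the three benchmarks sustain roughly 81% to 90% of peak, which is consistent with an architecture whose memory bandwidth grows with the number of PEs rather than being shared through a single external memory interface.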