Improving FPGA Performance for Carry-Save Arithmetic

Processor Archit. Lab., Ecole Polytech. Fed. de Lausanne, Lausanne, Switzerland
IEEE Transactions on Very Large Scale Integration (VLSI) Systems (Impact Factor: 1.14). 05/2010; DOI: 10.1109/TVLSI.2009.2014380
Source: IEEE Xplore

ABSTRACT The selective use of carry-save arithmetic, where appropriate, can accelerate a variety of arithmetic-dominated circuits. Carry-save arithmetic occurs naturally in a variety of DSP applications, and further opportunities to exploit it can be exposed through systematic data flow transformations that can be applied by a hardware compiler. Field-programmable gate arrays (FPGAs), however, are not particularly well suited to carry-save arithmetic. To address this concern, we introduce the ??field programmable counter array?? (FPCA), an accelerator for carry-save arithmetic intended for integration into an FPGA as an alternative to DSP blocks. In addition to multiplication and multiply accumulation, the FPCA can accelerate more general carry-save operations, such as multi-input addition (e.g., add k > 2 integers) and multipliers that have been fused with other adders. Our experiments show that the FPCA accelerates a wider variety of applications than DSP blocks and improves performance, area utilization, and energy consumption compared with soft FPGA logic.


Available from: Paolo Ienne, Apr 19, 2015
  • [Show abstract] [Hide abstract]
    ABSTRACT: Addition is the key arithmetic operation in most digital circuits and processors. Therefore, their performance and other parameters, such as area and power consumption, are highly dependent on the adders' features. In this paper, we present multispeculation as a way of increasing adders' performance with a low area penalty. In our proposed design, dividing an adder into several fragments and predicting the carry-in of each fragment enables computing every addition in two very short cycles at the most, with 99% or higher probability. Furthermore, based on multispeculation principles, we propose a new strategy for implementing addition chains and hiding most of the penalty cycles due to mispredictions, while keeping at the same time the resource sharing capabilities that are sought in high-level synthesis. Our results show that it is possible to build linear and logarithmic adders more than 4.7× and 1.7× faster than the nonspeculative case, respectively. Moreover, this is achieved with a low area penalty (38% for linear adders) or even an area reduction (-8% for logarithmic adders). Finally, applying multispeculation principles to signal processing benchmarks that use addition chains will result in 25% execution time reduction, with an additional 3% decrease in datapath area with respect to implementations with logarithmic fast adders.
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 12/2012; 31(12):1817-1830. DOI:10.1109/TCAD.2012.2208966 · 1.20 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: Although redundant addition is widely used to design parallel multioperand adders for ASIC implementations, the use of redundant adders on Field Programmable Gate Arrays (FPGAs) has generally been avoided. The main reasons are the efficient implementation of carry propagate adders (CPAs) on these devices (due to their specialized carry-chain resources) as well as the area overhead of the redundant adders when they are implemented on FPGAs. This paper presents different approaches to the efficient implementation of generic carry-save compressor trees on FPGAs. They present a fast critical path, independent of bit width, with practically no area overhead compared to CPA trees. Along with the classic carry-save compressor tree, we present a novel linear array structure, which efficiently uses the fast carry-chain resources. This approach is defined in a parameterizable HDL code based on CPAs, which makes it compatible with any FPGA family or vendor. A detailed study is provided for a wide range of bit widths and large number of operands. Compared to binary and ternary CPA trees, speedups of up to 2.29 and 2.14 are achieved for 16-bit width and up to 3.81 and 3.11 for 64-bit width.
    IEEE Transactions on Computers 10/2013; 62(10):2013-2025. DOI:10.1109/TC.2012.168 · 1.47 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: In this paper, we describe an energy-efficient Vedic multiplier structure using Energy Efficient Adiabatic Logic (EEAL). The power consumption of the proposed multiplier is significantly low because the energy transferred to the load capacitance is mostly recovered. The proposed 8×8 CMOS and adiabatic multiplier structure have been designed in a TSMC 0.18 μm CMOS process technology and verified by Cadence Design Suite. Both simulation and measurement results verify the functionality of such logic, making it suitable for implementing energy-aware and performance-efficient very-large scale integration (VLSI) circuitry.
    Automation, Computing, Communication, Control and Compressed Sensing (iMac4s), 2013 International Multi-Conference on; 01/2013