Conference Paper

A faster distributed arithmetic architecture for FPGAs

DOI: 10.1145/503048.503054 Conference: the 2002 ACM/SIGDA tenth international symposium
Source: DBLP


Distributed Arithmetic (DA) is an important technique to implement digital signal processing (DSP) functions in FPGAs. However, traditional lookup table (LUT)-based DA architectures contain one or more carry-propagation chains in the critical path, which limits the maximum speed at which the entire design can run. In this paper, we describe a novel technique that can reduce or eliminate the carry-propagate chain from the critical path in LUT-based DA architectures on FPGAs. In the proposed scheme, the individual bits of a word do not have to be processed as a unit. Instead, the current iteration can start as soon as the least significant bit (LSB) of the previous iteration is available, without waiting for the entire word from the previous iteration to be fully computed. This technique has great potential in speeding up DSP applications based on DA. Designs are described for serial and parallel DALUT and accumulator structures in which an n-bit carry chain, where n is the word length, is broken into smaller r-bit chains, 1 ≤ r ≤ n. A cost-performance analysis of the designs is presented. The analysis shows that the designs proposed in this paper have a lower cost-performance ratio (indicating better performance) than traditional DA designs. We also show that the 8-bit (r = 8) designs offer a good compromise between cost and performance. The implementation is on a Xilinx chip XC4028XL-3-BG256 using Xilinx Foundation tools v 3.1i. The results show that the proposed designs can achieve speedup by a factor of at least 1.5 over traditional DA designs in some cases.
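
As a rough illustration of the two ideas in the abstract, the Python sketch below models (a) a conventional bit-serial, LUT-based DA dot product and (b) an n-bit addition carried out in r-bit segments, with the carry handed from one segment to the next. It is a behavioural sketch only, not the authors' architecture: the function names (`build_da_lut`, `da_dot`, `segmented_add`) are hypothetical, the inputs are assumed to be two's-complement integers, r is assumed to divide n, and no timing or FPGA resource behaviour is modelled.

```python
def build_da_lut(coeffs):
    """Precompute the sum of every subset of the fixed coefficients.

    Entry `addr` holds the sum of coeffs[j] for every bit j set in addr,
    so K coefficients need a table of 2**K words.
    """
    k = len(coeffs)
    return [sum(c for j, c in enumerate(coeffs) if (addr >> j) & 1)
            for addr in range(2 ** k)]


def da_dot(coeffs, xs, n_bits):
    """Bit-serial DA dot product of coeffs and xs.

    Each x is assumed to fit in n_bits-wide two's complement.
    One LUT lookup and one accumulation are performed per bit position.
    """
    lut = build_da_lut(coeffs)
    acc = 0
    for b in range(n_bits):
        # Gather bit b of every input word to form the LUT address.
        addr = 0
        for j, x in enumerate(xs):
            addr |= ((x >> b) & 1) << j
        term = lut[addr] << b
        # The sign bit of a two's-complement word has negative weight.
        acc += -term if b == n_bits - 1 else term
    return acc


def segmented_add(a, b, n_bits, r):
    """Add two unsigned n_bits-wide values using r-bit segments.

    Segments are processed from least to most significant; only an
    r-bit addition happens per step, and the carry-out of one segment
    is the carry-in of the next. Returns (sum mod 2**n_bits, carry_out).
    """
    assert n_bits % r == 0, "sketch assumes r divides n_bits"
    mask = (1 << r) - 1
    total, carry = 0, 0
    for lo in range(0, n_bits, r):
        s = ((a >> lo) & mask) + ((b >> lo) & mask) + carry
        total |= (s & mask) << lo
        carry = s >> r
    return total, carry


if __name__ == "__main__":
    coeffs = [3, -5, 7, 2]          # hypothetical filter coefficients
    xs = [12, -7, 100, -128]        # hypothetical 8-bit samples
    print(da_dot(coeffs, xs, 8))
    print(segmented_add(200, 100, 8, 4))
```

In this model, r = n corresponds to one full-width carry chain and r = 1 to no multi-bit chain at all; the speed benefit described in the abstract comes from overlapping segments of successive iterations in hardware, which a sequential software model does not capture.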



Available from: Weijia Shang, May 19, 2015
  • Source
    • "Many other techniques have been proposed: Canonical Sign Digit (CSD) [15], the Dempster Method [16], Mirror Symmetric Filter Pairs [17], two-stage parallelism [18], and Redundant Binary Schemes [19] to name just a few. Methods specifically aimed at FPGA based FIR filter implementations include the fully pipelined and full-parallel transposed form [20], Add-and-Shift method with advanced calculation [21], and hardware efficient distributed arithmetic for higher orders [12] [13]. In [18], a new design technique based on a linear phase prototype filter that exploits coefficient symmetry was shown to offer better performance at a hardware cost similar to that of linear phase filters. "
    ABSTRACT: In this paper efficient digital filter design techniques categorized as sigma-delta modulation based short word length (SWL) and multibit (or contemporary) techniques are reviewed in terms of hardware complexity, area, performance and power tradeoffs, synthesis issues, and algorithm versatility. More recent, general purpose DSP applications including classical LMS algorithms reported using sigma-delta modulation encoding are reviewed thoroughly. A small number of basic arithmetic circuits designed using sigma-delta modulation encoding and synthesized by using FPGAs are also described. Finally, recent FPGA based area-performance-power analysis of single-bit ternary FIR filtering is discussed and compared to its corresponding multi-bit system. This work shows that in most cases single-bit ternary FIR-like filters are able to outperform their equivalent multi-bit filters in terms of area, power, and performance.
    Full-text · Article · Nov 2012
  • Source
    • "In this section, we will have a close look on the arithmetic used for calculating the forward and the inverse DCT. For these transformations on 8x8 pixel blocks distributed arithmetic is used [15] [16] [17] [18]. This leads to a bit serial computation where only 16 word look-up tables (ROM) and accumulator but not multipliers need to be utilized. "
    ABSTRACT: In this paper, we present a comparison between two methods, the modified Loeffler algorithm (11 MUL and 29 ADD) and Distributed Arithmetic, to implement the DCT/IDCT algorithm for MPEG or H.26x video compression using the VHDL description language. The implementation has been achieved on an Altera Stratix EP1S10 FPGA, which provides dedicated DSP blocks for common signal processing functions. A new solution based on these DSP blocks is used to implement the multipliers for the modified Loeffler algorithm in order to optimize speed and area.
    Full-text · Article · Jan 2006
  • Source
    ABSTRACT: In today’s proactive computing age, sensor networks monitor the environment, collect data, and execute tasks that affect our lives. The main ingredient to this process is a tiny sensor node that demands a long operating lifetime. Because of the sluggish growth of battery energy density, several research groups have developed technologies to power these sensors with scavenged energy from vibration and light. This highly variable supply mandates precision-on-demand processing. Distributed Arithmetic (DA), a bit-serial algorithm for dot product computation, possesses this capability to trade output quality for power consumption as demonstrated with the use of full-custom circuits. This thesis evaluates the energy scalability of a DA-based low-pass filter on a modern field programmable gate array (FPGA) and a standard-