Parallel merged multiplier-accumulator coprocessor optimized for digital filters.

Nanoelectronics Center of Excellence, University of Tehran, Tehran, Iran
Computers & Electrical Engineering (Impact Factor: 0.99). 01/2010; 36:864-873. DOI: 10.1016/j.compeleceng.2008.04.005
Source: DBLP

ABSTRACT In an attempt to improve the speed of VLSI signal processing systems, a new architecture for a high-speed multiply–accumulate (MAC) unit optimized for digital filters is proposed. This unit is designed as a coprocessor for the LEON2 RISC processor [LEON2 Processor; 2005 [Online]]. The coprocessing block comprises four parallel MAC units, two dual-port coefficient register files, a three-port general register file, and a control unit. To exploit the four parallel units, several SIMD-format instructions have been added to the LEON2 instruction set. Each MAC unit has two 16-bit inputs, a 32-bit output register, and a programmable round-saturate block. The MAC unit uses a new architecture that embeds the accumulate module within the partial-product summation tree of the multiplier with minimum overhead. A central control unit controls the inputs of the four MACs and the loading of the output registers. Experimental results demonstrate digital filter implementations running at up to 33 million input samples per second in a 0.18 μm technology.
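As a rough behavioural illustration of the datapath described in the abstract, the sketch below models four MAC lanes, each taking 16-bit operands and accumulating into a 32-bit register, computing four FIR outputs per pass. It is a minimal sketch under stated assumptions, not the authors' design: the function names, the one-output-per-lane scheduling, the Q15 rounding mode, and the choice to saturate on every accumulation are all hypothetical, and the register files and SIMD instruction encoding are not modelled.

```python
# Behavioural sketch only; names and rounding/saturation policy are assumptions.
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1


def saturate32(value):
    """Clamp a result to the range of a signed 32-bit output register."""
    return max(INT32_MIN, min(INT32_MAX, value))


def round_saturate16(acc, shift=15):
    """Round a 32-bit accumulator to 16 bits (assumed Q15 format), then clamp."""
    rounded = (acc + (1 << (shift - 1))) >> shift  # round half up
    return max(INT16_MIN, min(INT16_MAX, rounded))


def fir_four_lanes(samples, coeffs):
    """Compute y[n] = sum_k coeffs[k] * samples[n - k], four outputs per pass.

    Each of the four lanes owns one output sample and one accumulator;
    every coefficient is applied to all active lanes in the same step.
    """
    outputs = []
    for base in range(0, len(samples), 4):
        lanes = min(4, len(samples) - base)   # last pass may be partial
        acc = [0] * lanes                     # one 32-bit accumulator per lane
        for k, coeff in enumerate(coeffs):
            for lane in range(lanes):
                idx = base + lane - k
                if idx >= 0:                  # samples before x[0] count as zero
                    acc[lane] = saturate32(acc[lane] + coeff * samples[idx])
        outputs.extend(acc)
    return outputs
```

For example, `fir_four_lanes([1, 2, 3, 4, 5], [1, 1])` sums each sample with its predecessor, and `round_saturate16` can then be applied to each accumulator to produce a 16-bit result.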

  • ABSTRACT: A programmable digital signal processor (DSP) for real-time image processing is presented that combines the concepts of single-instruction multiple-data (SIMD) and very long instruction word with a high utilization of parallel resources on instruction and data level. The SIMD approach has been extended with autonomous instruction selection capabilities (ASIMD), which make it possible to control four parallel datapaths with low area overhead. The memory concept is adapted to image-processing requirements and follows two basic rules: shared data have to be accessed regularly in the shape of a matrix and are stored in the matrix memory; data that are accessed irregularly are stored in the private cache memories. The matrix memory allows parallel, conflict-free access from all datapaths in a single clock cycle. The DSP achieves 1.3-GOPS performance at 66 MHz. A first prototype in 0.5-μm CMOS technology has been fabricated.
    IEEE Journal of Solid-State Circuits 08/2000; · 3.06 Impact Factor
  • ABSTRACT: This work presents a 64-bit fixed-point vector multiply-accumulator (MAC) architecture capable of supporting multiple precisions. The vector MAC can perform one 64×64, two 32×32, four 16×16, or eight 8×8-bit signed/unsigned multiplies using essentially the same hardware as a scalar 64-bit MAC and with only a small increase in delay. The scalar MAC architecture is "vectorized" by inserting mode-dependent multiplexing into the partial product generation and by inserting mode-dependent kills in the carry chain of the reduction tree and the final carry-propagate adder. This is an example of "shared segmentation", in which the existing scalar structure is segmented and then shared between vector modes. The vector MAC is area efficient and can be fully pipelined, which makes it suitable for high-performance processors and, possibly, dynamically reconfigurable processors. The "shared segmentation" method is compared to an alternative method, referred to as the "shared subtree" method, by implementing vector MAC designs using two different technologies and three different vector widths.
    IEEE Transactions on Computers 04/2005; 54(3):284-293. · 1.38 Impact Factor
  • ABSTRACT: This paper presents a dependence graph (DG) to visualize and describe merged multiply-accumulate (MAC) hardware that is based on the modified Booth algorithm (MBA). The carry-save technique is used in the Booth encoder, the Booth multiplier, and the accumulator sections to ensure the fastest possible implementation. The DG applies to any MAC data word size and allows designing multiplier structures that are regular and have minimal delay, sign-bit extensions, and datapath width. Using the DG, a fast pipelined implementation is proposed, in which an accurate delay model for deep submicron CMOS technology is used. The delay model describes multi-level gate delays, taking into account input ramp and output loading. Based on the delay model, the proposed pipelined parallel MAC design is three times faster than other parallel MAC schemes that are based on the MBA. The speedup results from merging the accumulate and multiply operations and from the wide use of carry-save techniques.
    IEEE Transactions on Circuits and Systems II Analog and Digital Signal Processing 10/2000;
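
The merged MAC idea mentioned in the abstracts above (feeding the accumulator into the partial-product summation instead of adding it after the multiply) can be illustrated with a small bit-level model. This is a minimal sketch under stated assumptions, not any of the cited architectures: it uses textbook radix-4 Booth recoding, reduces rows with a sequential chain of 3:2 carry-save compressors rather than a real tree, wraps modulo 2^32 instead of saturating, and all function names are hypothetical.

```python
# Bit-level sketch of a merged multiply-accumulate; names are hypothetical.

def booth_radix4_digits(multiplier, bits=16):
    """Recode a signed `bits`-wide multiplier into radix-4 Booth digits in -2..2."""
    pattern = multiplier & ((1 << bits) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0                                  # implicit bit to the right of bit 0
    for i in range(0, bits, 2):
        b0 = (pattern >> i) & 1
        b1 = (pattern >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def carry_save_add(a, b, c, mask):
    """3:2 compressor: a + b + c == sum_word + carry_word (mod 2**width)."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word & mask, carry_word & mask


def merged_mac(acc, x, y, bits=16, width=32):
    """Return acc + x * y (wrapped to `width` bits) with one final carry-propagate add."""
    mask = (1 << width) - 1
    # The accumulator enters the reduction as just another row.
    sum_word, carry_word = acc & mask, 0
    for i, digit in enumerate(booth_radix4_digits(y, bits)):
        partial = ((digit * x) << (2 * i)) & mask   # Booth-selected partial product
        sum_word, carry_word = carry_save_add(sum_word, carry_word, partial, mask)
    total = (sum_word + carry_word) & mask          # single carry-propagate addition
    return total - (1 << width) if total >> (width - 1) else total
```

The point of the sketch is that only one carry-propagate addition is performed per MAC operation; the accumulate step costs one extra row in the carry-save reduction rather than a second full adder pass.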