Parallel merged multiplier-accumulator coprocessor optimized for digital filters

Nanoelectronics Center of Excellence, University of Tehran, Tehran, Iran
Computers & Electrical Engineering (Impact Factor: 0.82). 09/2010; 36(5):864-873. DOI: 10.1016/j.compeleceng.2008.04.005
Source: DBLP

ABSTRACT In an attempt to improve the speed of VLSI signal processing systems, a new architecture for a high-speed multiply–accumulate (MAC) unit optimized for digital filters is proposed. This unit is designed as a coprocessor for the LEON2 RISC processor [LEON2 Processor; 2005 [Online]. ]. In this work, four parallel MAC units with two dual-port coefficient register-files, a three-port general register-file and a control unit are included in the coprocessing block. With the existence of four parallel units, several SIMD format instructions have been added to LEON2 instruction set. Each MAC unit has two 16-bit inputs, 32-bit output register and a programmable round-saturate block. The MAC unit uses a new architecture which embeds the accumulate module within the partial products summation tree of the multiplier with minimum overhead. A central control unit controls inputs of the four MACs and loading of the output registers. Our experimental results demonstrate a high performance in implementation of digital filters at elevated speeds of up to 33 millions of input samples per second in a 0.18μm technology.

10 Reads
  • [Show abstract] [Hide abstract]
    ABSTRACT: This paper presents a dependence graph (DG) to visualize and describe a merged multiply-accumulate (MAC) hardware that is based on the modified Booth algorithm (MBA). The carry-save technique is used in the Booth encoder, the Booth multiplier, and the accumulator sections to ensure the fastest possible implementation. The DG applies to any MAC data word size and allows designing multiplier structures that are regular and have minimal delay, sign-bit extensions, and datapath width. Using the DG, a fast pipelined implementation is proposed, in which an accurate delay model for deep submicron CMOS technology is used. The delay model describes multi-level gate delays, taking into account input ramp and output loading. Based on the delay model, the proposed pipelined parallel MAC design is three times faster than other parallel MAC schemes that are based on the MBA. The speedup resulted from merging the accumulate and the multiply operations and the wide use of carry-save techniques
    IEEE Transactions on Circuits and Systems II Analog and Digital Signal Processing 10/2000; 47(9-47):902 - 908. DOI:10.1109/82.868458
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: It is suggested that the economics of present large-scale scientific computers could benefit from a greater investment in hardware to mechanize multiplication and division than is now common. As a move in this direction, a design is developed for a multiplier which generates the product of two numbers using purely combinational logic, i.e., in one gating step. Using straightforward diode-transistor logic, it appears presently possible to obtain products in under 1, ¿sec, and quotients in 3 ¿sec. A rapid square-root process is also outlined. Approximate component counts are given for the proposed design, and it is found that the cost of the unit would be about 10 per cent of the cost of a modern large-scale computer.
    IEEE Transactions on Electronic Computers 03/1964; EC-13(1-EC-13):14 - 17. DOI:10.1109/PGEC.1964.263830
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: An MIMD multiprocessor digital signal-processing (DSP) chip containing four 64-b processing elements (PE's) interconnected by a 128-b pipelined split transaction bus (STBus) is presented. Each PE contains a 32-b RISC core with DSP enhancements and a 64-b single-instruction, multiple-data vector coprocessor with four 16-b MAC/s and a vector reduction unit. PEs are connected to the STBus through reconfigurable dual-ported snooping L1 cache memories that support shared memory multiprocessing using a modified-MESI data coherency protocol. High-bandwidth data transfers between system memory and on-chip caches are managed in a pipelined memory controller that supports multiple outstanding transactions. An embedded RTOS dynamically schedules multiple tasks onto the PEs. Process synchronization is achieved using cached semaphores. The 200-mm<sup>2</sup>, 0.25-μm CMOS chip operates at 100 MHz and dissipates 4 W from a 3.3-V supply
    IEEE Journal of Solid-State Circuits 04/2000; 35(3-35):412 - 424. DOI:10.1109/4.826824 · 3.01 Impact Factor
Show more