A Fully Pipelined Multiplierless Architecture for 2D Convolution with Quadrant Symmetric Kernels.
ABSTRACT This paper proposes a fully pipelined, multiplierless digital architecture for computing 2D convolution that exploits the quadrant symmetry of the kernels. Pixels in the four quadrants of the kernel region around an image pixel are considered simultaneously when computing the partial results of the convolution sum. The architecture performs its computations in the log domain using low-complexity log2 and inverse-log2 approximation modules. An effective data-handling strategy, developed in conjunction with the logarithmic modules, eliminates the need for multipliers in the architecture. The proposed architecture can process 181.3 frames of 1024 × 1024 pixels per second, or 190.1 million outputs per second, with 22 × 22 kernels on a Xilinx Virtex XC2V2000-4ff896 FPGA at a maximum clock frequency of 190.1 MHz. The throughput of the new design is 3.17 times higher than that of previous implementations (Zhang et al., 2005). Evaluation in Xilinx's Core Generator showed that the proposed design reduces hardware resource usage by 60% compared with a design using pipelined multipliers.
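The two ideas in the abstract, folding the four quadrants onto one stored coefficient and replacing each multiplication with an addition in the log domain, can be illustrated with a short software sketch. This is not the paper's hardware design. The abstract does not specify its log2 and inverse-log2 modules, so Mitchell's piecewise-linear approximation is used here as a stand-in, and the function names, the 8-bit fraction width, and the choice to sum the four pixels before the log step are all assumptions made for illustration.

```python
def mitchell_log2(x, frac_bits=8):
    """Fixed-point approximation of log2(x) for integer x >= 1 (Mitchell's method)."""
    k = x.bit_length() - 1                 # leading-one position = integer part
    mantissa = x - (1 << k)                # bits below the leading one
    frac = (mantissa << frac_bits) >> k    # linear approximation of the fraction
    return (k << frac_bits) | frac


def mitchell_antilog2(y, frac_bits=8):
    """Approximate inverse: 2**(k + f) is taken as 2**k * (1 + f)."""
    k = y >> frac_bits
    frac = y & ((1 << frac_bits) - 1)
    return (((1 << frac_bits) + frac) << k) >> frac_bits


def log_multiply(a, b, frac_bits=8):
    """Approximate a * b for non-negative integers with one addition in the log domain."""
    if a == 0 or b == 0:
        return 0
    total = mitchell_log2(a, frac_bits) + mitchell_log2(b, frac_bits)
    return mitchell_antilog2(total, frac_bits)


def exact_multiply(a, b):
    return a * b


def quadrant_conv_at(window, quadrant, multiply=exact_multiply):
    """One output of an N x N convolution (N even) with a quadrant-symmetric kernel.

    `window` is the N x N pixel neighbourhood; `quadrant` holds only the
    top-left (N/2) x (N/2) coefficients (hypothetical storage layout).
    """
    n = len(window)
    half = n // 2
    acc = 0
    for i in range(half):
        for j in range(half):
            # The four pixels that share the coefficient quadrant[i][j].
            folded = (window[i][j] + window[i][n - 1 - j]
                      + window[n - 1 - i][j] + window[n - 1 - i][n - 1 - j])
            acc += multiply(quadrant[i][j], folded)
    return acc


def full_kernel(quadrant):
    """Expand the stored quadrant into the full symmetric kernel (reference only)."""
    top = [row + row[::-1] for row in quadrant]
    return top + top[::-1]
```

With `multiply=exact_multiply` the folded sum equals the direct N × N convolution sum while using a quarter of the coefficient products. Passing `multiply=log_multiply` swaps each product for an approximate one, which in this Mitchell-based sketch never exceeds the exact product.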
- Source available from: P.K. Meher
ABSTRACT: Distributed arithmetic (DA)-based computation is popular for its potential for efficient memory-based implementation of finite impulse response (FIR) filters, where the filter outputs are computed as the inner product of input-sample vectors and the filter-coefficient vector. In this paper, however, we show that the look-up-table (LUT)-multiplier-based approach, where the memory elements store all the possible values of products of the filter coefficients, could be an area-efficient alternative to the DA-based design of FIR filters with the same throughput of implementation. By operand and inner-product decompositions, respectively, we have designed the conventional LUT-multiplier-based and DA-based structures for FIR filters of equivalent throughput, where the LUT-multiplier-based design involves nearly the same memory and the same number of adders, and fewer input registers, at the cost of slightly higher adder widths than the other. Moreover, we present two new approaches to LUT-based multiplication, which could be used to reduce the memory size to half that of conventional LUT-based multiplication. In addition, we present a modified transposed-form FIR filter, where a single segmented memory core with only one pair of decoders is used to minimize the combinational area. The proposed LUT-based FIR filter is found to involve nearly half the memory space and $(1/N)$ times the complexity of decoders and input registers, at the cost of a marginal increase in the width of the adders, and an additional $\sim(4N \times W)$ AND-OR-INVERT gates and $\sim(2N \times W)$ NOR gates. We have synthesized the DA-based design and the LUT-multiplier-based design of 16-tap FIR filters with Synopsys Design Compiler using the TSMC 90 nm library, and find that the proposed LUT-multiplier-based design involves nearly 15% less area than the DA-based design for the same throughput and lower latency of implementation.
Circuits and Systems I: Regular Papers, IEEE Transactions on, 04/2010 · 2.24 Impact Factor
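The LUT-multiplier idea with operand decomposition can be sketched in software as follows. This is a minimal illustration only: it uses a direct-form filter rather than the modified transposed form described in the abstract, it does not implement the paper's memory-halving schemes, and the function names, the 8-bit sample width, and the 4-bit slice width are assumptions.

```python
def build_lut(coeff, bits=4):
    """Store coeff * x for every unsigned `bits`-wide operand x."""
    return [coeff * x for x in range(1 << bits)]


def lut_multiply(lut, x, width=8, bits=4):
    """Multiply an unsigned `width`-bit sample by the LUT's coefficient.

    Operand decomposition: the sample is split into `bits`-wide slices,
    each slice addresses the same small LUT, and the partial products
    are combined by shift-and-add.
    """
    mask = (1 << bits) - 1
    acc = 0
    for shift in range(0, width, bits):
        acc += lut[(x >> shift) & mask] << shift
    return acc


def fir_lut(samples, coeffs, width=8, bits=4):
    """Direct-form FIR filter whose products all come from table lookups."""
    luts = [build_lut(c, bits) for c in coeffs]
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, lut in enumerate(luts):
            if n - k >= 0:
                acc += lut_multiply(lut, samples[n - k], width, bits)
        out.append(acc)
    return out
```

Splitting an 8-bit sample into two 4-bit slices means each coefficient needs a 16-entry table instead of a 256-entry one, at the cost of one extra lookup and one shift-add per product.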