Often, when performing fixed-point multiplication, it is sufficient to return a faithfully rounded result, i.e., the machine-representable number either immediately above or below the arbitrary-precision result, if the latter is not exactly representable. Compared to correctly rounded multipliers, i.e., those returning the nearest machine-representable number, faithfully rounded multipliers use considerably less silicon area, typically by implementing a truncation scheme within the partial product array. A number of such heuristically inspired schemes exist in the literature; however, their use in industrial practice is hampered by the absence of verification, and exhaustive simulation is typically infeasible, e.g., a 32-bit multiplier requires $2^{64}$ simulations. We present three truncated multiplier schemes which subsume the majority of existing schemes and derive closed-form necessary and sufficient conditions for faithful rounding. For two of the schemes we provide closed-form expressions for the bit vectors giving rise to the worst-case error and the probability of encountering these inputs during Monte-Carlo simulation. From these expressions, we show how HDL code can be created that performs correct-by-construction faithfully rounded multiplication. We also present a method for truncating an arbitrary array while maintaining faithful rounding, creating two novel truncated multiplier schemes in the process.
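The faithfulness criterion can be checked exhaustively for small widths, which clarifies both the definition and why $2^{64}$ simulations are infeasible at 32 bits. Below is a minimal Python sketch, not the paper's schemes: a column-truncated array multiplier with a constant correction (all names and constants are illustrative assumptions), plus a brute-force faithfulness test.

```python
def trunc_mult(a, b, n, k, c=0):
    # n-by-n unsigned multiply: delete the k least-significant
    # partial-product columns, add a constant correction c in their
    # place, and return the n most-significant bits of the result.
    s = c
    for i in range(n):
        for j in range(n):
            if i + j >= k:
                s += (((a >> i) & (b >> j)) & 1) << (i + j)
    return s >> n

def is_faithful(n, k, c):
    # Faithful rounding: every result lies strictly within one output
    # ulp of the exact product a*b / 2^n, so it is always the
    # representable value immediately above or below (or equal).
    for a in range(1 << n):
        for b in range(1 << n):
            if abs(trunc_mult(a, b, n, k, c) - a * b / (1 << n)) >= 1.0:
                return False
    return True
```

For a 4-bit multiplier with two truncated columns, plain truncation already violates faithfulness (e.g., at a = b = 7), while a suitable constant restores it; this is the kind of fact the paper's closed-form conditions predict without simulation.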


... Hence, in order to reduce the complexity of a multiplier, either the width m or the height n must be diminished. Lowering m leads to truncated multipliers [17]-[19], which is not the purpose of this work. Adopting the other approach, the height of the PPM is usually reduced by applying a Booth recoding [14], [15], [20] in radix $R = 2^r$, $r > 0$, which maintains the accuracy of the multiplier. ...

While partial carry-save adders are easily designed by splitting them into several fragments working in parallel, the design of partial carry-save multipliers is more challenging. Prior approaches have proposed several solutions based on the radix-4 Booth recoding. This technique makes it possible to diminish the height of a multiplier by half, this being the most widespread option when designing multipliers, as only easy multiples are required. Larger radices provide further reductions at the expense of the appearance of hard multiples. Such is the case of radix-8 Booth multipliers, whose critical path is located at the generation of the 3X multiple. In order to mitigate this delay, in our prior works, we proposed to first decouple the 3X computation and introduce it in the dataflow graph, leveraging the available slack. Considering this, we then present a partial carry-save radix-8 Booth multiplier that receives three inputs in this format, namely, the multiplicand, the multiplier, and the 3X multiple. Moreover, the rest of the datapath is adapted to work in partial carry-save. In comparison with conventional radix-4 and radix-8 Booth-based datapaths, the proposal is able to diminish the execution time and energy consumption while benefiting from the area reduction provided by the selection of radix 8. Furthermore, it outperforms prior state-of-the-art partial carry-save multipliers based on radix 4.
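Radix-8 Booth recoding and the 3X hard multiple can be illustrated behaviorally in a few lines; this is a hedged sketch of the standard recoding, not the paper's partial carry-save datapath, and the function names are illustrative.

```python
def booth8_digits(y, n):
    # Radix-8 Booth recoding of the two's-complement n-bit value y
    # (n assumed to be a multiple of 3). Each window of three new bits
    # plus one overlap bit maps to a digit in [-4, 4], so the only
    # "hard" multiple of the multiplicand ever needed is 3X.
    bits = [(y >> i) & 1 for i in range(n)]
    digits, prev = [], 0
    for i in range(0, n, 3):
        digits.append(-4 * bits[i + 2] + 2 * bits[i + 1] + bits[i] + prev)
        prev = bits[i + 2]
    return digits

def booth8_mult(x, y, n):
    # Multiply x by the signed value of the n-bit multiplier y.
    # 3*x is computed once up front; in hardware this is the
    # carry-propagate addition x + 2x whose delay the paper decouples.
    multiples = {0: 0, 1: x, 2: 2 * x, 3: 3 * x, 4: 4 * x}
    acc = 0
    for i, d in enumerate(booth8_digits(y, n)):
        m = multiples[abs(d)]
        acc += (m if d >= 0 else -m) << (3 * i)
    return acc
```

Note that a 6-bit multiplier yields only two radix-8 digits (versus three for radix 4), which is the height reduction the abstract refers to.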

... Although this approach improves the accuracy of the final product considerably, the estimating circuit increases the complexity of the multiplier. Other similar estimating multipliers with variable error compensation can be found in [11][12][13][14][15][16][17][18][19][20][21][22][23][24] . In addition, nonparallel truncated multipliers such as hybrid and pipelined are presented in [25][26][27] . ...

Estimating arithmetic deals with trading accuracy for speed, silicon area, and/or power consumption. Truncated parallel multipliers, which reduce power and area approximately by half, are very important units in estimating arithmetic. An n-bit unsigned truncated sequential multiplier with new approaches that compensate for the truncation error is proposed in this article. These compensating approaches improve the result accuracy using the ( )-th or n-th columns of the partial product matrix dynamically. By introducing a small circuit into the original sequential multiplier, these approaches compensate for the error resulting from removing the carry bits of the least significant parts of the partial product matrix. The maximum relative error of the new truncated multiplier is approximately 2.03%, so it differs only slightly from its precise counterpart in terms of accuracy. A timing evaluation is conducted for the critical path of the proposed multiplier, applying a pre-layout logical synthesis. The evaluation reveals that, depending on the operand length, the proposed multiplier is approximately 2.5% to 26.6% faster than the precise multiplier.

... The first works in circuit approximation [1] were the result of manual design, that is, approximate adders [65], [70], multipliers [10], [15], [22], [40], or dividers [16], [42] were created to design single inexact implementations of arithmetic units. Other works [20], [21] present algorithms that automatically explore the energy-quality trade-off but again limit the analysis to adders and multipliers only. ([1] BACS benchmark set: https://github.com/scale-lab/BACS) ...

Approximate computing is an emerging paradigm that, by relaxing the requirement for full accuracy, offers benefits in terms of design area and power consumption. This paradigm is particularly attractive in applications where the underlying computation has inherent resilience to small errors. Such applications are abundant in many domains, including machine learning, computer vision, and signal processing. In circuit design, a major challenge is the capability to synthesize the approximate circuits automatically without manually relying on the expertise of designers. In this work, we review methods devised to synthesize approximate circuits, given their exact functionality and an approximability threshold. We summarize strategies for evaluating the error that circuit simplification can induce on the output, which guides synthesis techniques in choosing the circuit transformations that lead to the largest benefit for a given amount of induced error. We then review circuit simplification methods that operate at the gate or Boolean level, including those that leverage classical Boolean synthesis techniques to realize the approximations. We also summarize strategies that take high-level descriptions, such as C or behavioral Verilog, and synthesize approximate circuits from these descriptions.

... Flexible error-based multiplier and multiply-add designs have been developed for area-reduced, error-resilient applications [10]. Self-tuned error compensation is an efficient method to reduce the truncation error [11][12][13]. Area-reduced multiplication using approximation for image-processing applications is developed in [14,15]. ...

Compensating the error using additional circuitry is mandatory in a low-error fixed-width multiplier. Instead of compensating the error, reconfiguring an n-bit fixed-width multiplier into an n/2-bit error-free full-width multiplier using decomposed multiplication is proposed in this paper. The decomposed block multiplication, using an area-efficient New Bit Pair Recoding (NBPR) algorithm in fixed-width mode, shows relatively less truncation error than existing truncated multipliers. A reconfigurable 16×16 NBPR multiplier in three different modes (8×8, 16×8, 16×16) with a fixed 16-bit product is verified on the TSMC 65 nm CMOS standard cell library. The experimental results show that the NBPR multiplier consumes less area than standard Booth multipliers. Evaluating the proposed multiplier in imaging shows improved PSNR with minimal error compared to other fixed-width multipliers.
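The decomposition idea can be sketched generically; the NBPR recoding itself is not reproduced here. As a hedged illustration, an n-bit product splits into four n/2-bit block products, a fixed-width mode keeps only the top n bits, and operands that fit in n/2 bits use a single block and are error-free at n output bits. Function and mode names are illustrative.

```python
def decomposed_mult(a, b, n, mode='full'):
    # Split an n-bit x n-bit product into four (n/2)-bit block
    # products. 'full' returns the exact 2n-bit product; 'fixed'
    # keeps only the n most-significant bits (fixed-width output).
    h = n // 2
    mask = (1 << h) - 1
    al, ah = a & mask, a >> h
    bl, bh = b & mask, b >> h
    p = (ah * bh << n) + ((ah * bl + al * bh) << h) + al * bl
    return p if mode == 'full' else p >> n
```

When both operands fit in n/2 bits, only the al*bl block is nonzero, so an n-bit output captures the product exactly — the error-free reconfigured mode the abstract refers to.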

... Here, it is expected that the generic multiplier can be faster under tighter timing constraints at the cost of additional resources. While other methods exist to create efficient 8-bit multipliers [42], we compare against Xilinx IP cores, which are already heavily optimized. ...

Low-precision arithmetic operations to accelerate deep-learning applications on field-programmable gate arrays (FPGAs) have been studied extensively, because they offer the potential to save silicon area or increase throughput. However, these benefits come at the cost of a decrease in accuracy. In this article, we demonstrate that reconfigurable constant coefficient multipliers (RCCMs) offer a better alternative for saving the silicon area than utilizing low-precision arithmetic. RCCMs multiply input values by a restricted choice of coefficients using only adders, subtractors, bit shifts, and multiplexers (MUXes), meaning that they can be heavily optimized for FPGAs. We propose a family of RCCMs tailored to FPGA logic elements to ensure their efficient utilization. To minimize information loss from quantization, we then develop novel training techniques that map the possible coefficient representations of the RCCMs to neural network weight parameter distributions. This enables the usage of the RCCMs in hardware, while maintaining high accuracy. We demonstrate the benefits of these techniques using AlexNet, ResNet-18, and ResNet-50 networks. The resulting implementations achieve up to 50% resource savings over traditional 8-bit quantized networks, translating to significant speedups and power savings. Our RCCM with the lowest resource requirements exceeds 6-bit fixed point accuracy, while all other implementations with RCCMs achieve at least similar accuracy to an 8-bit uniformly quantized design, while achieving significant resource savings.
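As a toy illustration of the RCCM idea: the paper's coefficient sets are carefully tailored to FPGA logic elements, but the simplest version restricts coefficients to the form $\pm 2^a \pm 2^b$, so the whole datapath is two shifts, one adder/subtractor, and a mux. The function below is an illustrative sketch under that assumption, not the paper's architecture.

```python
def rccm_like(x, a, b, sub=False):
    # Toy reconfigurable constant-coefficient multiplier: the
    # coefficient is restricted to 2^a + 2^b or 2^a - 2^b, so
    # multiplication needs no general-purpose multiplier at all.
    hi, lo = x << a, x << b
    return hi - lo if sub else hi + lo
```

Reconfiguring (a, b, sub) selects the coefficient at run time, e.g. (4, 1, False) multiplies by 18 and (4, 1, True) by 14; training then steers the network weights toward whatever coefficient set the hardware can represent.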

... Hence, in order to reduce the complexity of a multiplier, either the width m or the height n must be diminished. Narrowing m leads to truncated multipliers [10]-[12], which is not the purpose of this work. On the other hand, the height of the PPM is usually reduced by applying a Booth recoding [1], [8], [9] in radix $R = 2^{\beta}$, $\beta > 0$, which maintains the accuracy of the multiplier. ...

Online arithmetic has been widely studied for ASIC implementation. Online components were originally designed to perform computations in digit serial with most significant digit (MSD) first, resulting in the ability to chain arithmetic operators together for low latency. More recently, research has shown that digit parallel online operators can fail more gracefully when operating beyond the deterministic clocking region in comparison to operators with conventional arithmetic. Unfortunately, the utilization of online arithmetic operators in the past has required a large area overhead for FPGA implementation. In this paper, we propose novel approaches to implement the key primitives of online arithmetic, adders and multipliers, efficiently on modern Xilinx FPGAs with 6-input LUTs and carry resources. We demonstrate experimentally that in comparison to a direct RTL synthesis, the proposed architectures achieve slice savings of over 67% and 69%, and speed-ups of over 1.2× and 1.5× for adders and multipliers, respectively. As a result, the area overheads of using online adders and multipliers in place of traditional arithmetic primitives are reduced from 8.41× and 8.11× to 1.88× and 1.84×, respectively. Finally, because an online multiplier generates MSDs first, we also demonstrate the method to create an online multiplier with a reduced precision output that is smaller than a traditional multiplier producing the same result. We show that this can lead to silicon area savings of up to 56%.

An energy-efficient fast array multiplier is proposed and designed. The multiplier operates in a left-to-right mode enabling a full overlap between reduction of partial products in carry-save form and the final addition producing the product. The design is based on the left-to-right carry-free (LRCF) multiplier. It differs from the LRCF multiplier in a much smaller on-the-fly conversion circuit of $O(n)$ size and the use of radix-4 full adders in the conversion. The new converter produces the most-significant half of the product during the reduction process. It eliminates the most-significant part of the final adder. The least-significant half of the product is obtained with a carry-ripple adder during the reduction. Thus conversion of the carry-save form of accumulated partial products to the conventional product does not add any delay to the total time of the multiplier. Several right-to-left, left-to-right multipliers and tree multipliers are designed for 16, 24, 32, and 56 bits, and radices 2 and 4, synthesized in 90 nm technology and compared, demonstrating the advantages and disadvantages of the proposed design with respect to area, delay, power, and energy. We considered both truncated and full-precision multipliers. The proposed multiplier has lower delay, area, power, and energy than other considered types of array multipliers. Its advantages grow with the increase in precision. As expected, it is slower than a tree multiplier but it has smaller area, power, and energy.

Piecewise polynomial interpolation is a well-established technique for hardware function evaluation. The paper describes a novel technique to minimize polynomial coefficient wordlength with the aim of obtaining either exact or faithful rounding at a reduced hardware cost. The standard approaches employed in the literature subdivide the design of piecewise-polynomial interpolators into three steps (coefficient calculation, coefficient quantization, and arithmetic hardware optimization) and conservatively estimate the overall approximation error as the sum of the error components arising in each step. The proposed technique, using Integer Linear Programming (ILP), optimizes the polynomial coefficients taking all error components into account simultaneously. This gives two advantages. Firstly, we can obtain exactly rounded approximations; secondly, for faithfully rounded interpolators, we avoid any overdesign due to pessimistic assumptions on error components, optimizing in this way the resulting hardware. The proposed ILP-based algorithm requires acceptable CPU time (from a few seconds to tens of minutes) and is suited for approximations of up to 24 input bits. The results compare favorably with previously published data. We present synthesis results in 28 nm and 90 nm CMOS technologies, to further assess the effectiveness of the proposed approach.
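The benefit of optimizing quantized coefficients jointly, rather than rounding each one independently, can be demonstrated with a brute-force stand-in for the ILP. The sketch below, with illustrative names and assumptions, searches a small neighborhood of the naively rounded coefficients of one linear segment and keeps the pair with the smallest sampled maximum error; the naive rounding is in the search set, so the joint result can only match or improve it.

```python
def best_quantized_line(f, x0, x1, frac_bits, samples=200):
    # Jointly search quantized (c0, c1) pairs, in units of
    # 2^-frac_bits, for the smallest maximum error on [x0, x1].
    scale = 1 << frac_bits
    xs = [x0 + (x1 - x0) * i / samples for i in range(samples + 1)]
    # Seed from the naive per-coefficient rounding, then search
    # a small neighborhood around it.
    c0_0 = round(f(x0) * scale)
    c1_0 = round((f(x1) - f(x0)) / (x1 - x0) * scale)
    best = None
    for dc0 in range(-4, 5):
        for dc1 in range(-4, 5):
            c0, c1 = c0_0 + dc0, c1_0 + dc1
            err = max(abs((c0 + c1 * (x - x0)) / scale - f(x)) for x in xs)
            if best is None or err < best[0]:
                best = (err, c0, c1)
    return best
```

The real ILP formulation explores the full coefficient space and guarantees the error bound over all inputs rather than a sample grid; this sketch only conveys why the joint search avoids summing per-step worst cases.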

As a crucial part of a digital signal processor, the multiplier plays a major role in performing calculations within the processor. A binary multiplier, capable of multiplying two binary numbers, is an important building block of digital electronic circuits. Most multiplication techniques compute the partial products separately and then sum them to produce the final output, which is why numerous methods and technologies have been designed to simplify this process while minimizing delay and error. A comparative analysis of CPL, Gate-Diffusion Input (GDI), and an improved Shannon adder technique is made to determine which consumes the least power when used as a full adder, and GDI is found to be the optimum technique. CPL and GDI are used to implement an array multiplier and a modified Baugh-Wooley multiplier; the new improved Shannon adder technique is then applied to the same multipliers, resulting in better performance parameters. A comparative analysis of power consumption and propagation delay, carried out with the Tanner EDA tool, shows that the improved Shannon adder implemented in the modified Baugh-Wooley multiplier performs better than its peers.

Advancing applications such as audio processing, image processing, and software-defined radio place heavy demands on digital signal processing hardware, where the arithmetic units performing addition and multiplication occupy a large share of the area in very-large-scale implementations. The methodology proposed here redesigns the multipliers and adders of a finite impulse response (FIR) filter, introducing new unsigned and signed truncated multipliers together with new SCG-HSCG adders; different III-V semiconductor materials can also be used to obtain high-speed filtering operation. Truncated multipliers reduce the partial-product array but still require many adders, so the design replaces these with SCG-HSCG adders, decreasing the logic gate count of the arithmetic, addition, and multiplication operations and improving efficiency for both signed and unsigned designs across digital signal processing applications. The proposed FIR filter is designed in VHDL and implemented on a Xilinx FPGA (S6LX9), and its efficiency with the truncated multiplier and SCG-HSCG adder is compared with the existing carry-select adder (CSLA) design in terms of area, delay, and power. The novel FIR filter design reduces power dissipation while increasing hardware device performance; furthermore, such designs take up less space, use less energy, and last longer.

Though there are various types of multipliers at present, and while partial carry-save adders are effectively planned by making several sections work in parallel, the structure of partial carry-save multipliers remains more challenging. Earlier methodologies proposed solutions using Booth recoding with radix 4. This approach makes it possible to reduce the height of the partial-product array considerably and is the most widespread choice in multiplier design, as only simple multiples are required. Larger radix values reduce the height further at the cost of hard multiples; to mitigate the resulting delay, former studies proposed first decoupling the 3X computation and scheduling it using the available slack. In ordinary binary multiplication, AND operations form the partial products of the multiplier and multiplicand, whereas a Booth multiplier encodes the multiples: for an n-bit by n-bit multiplication, n/2 partial products are obtained for radix 4 and n/3 for radix 8. Considering this, an encoding method is implemented so that the number of partial products is reduced. In the proposed methodology, a modified carry-skip adder is used for high-speed operation.

Fault tolerant techniques can extend the power savings achievable by dynamic voltage scaling by trading accuracy and/or timing performance against power. Such energy improvements have a strong dependency on the delay distribution of the circuit and the statistical characteristics of the input signal. Independently, programmable truncated multipliers also achieve power benefits at the expense of degradation of the output signal-to-noise ratio. In this brief, a combination of programmable truncated multiplication is used within a fault tolerant digital signal processing (DSP) structure in which the supply voltage is reduced beyond the critical timing level. Timing modulation properties of truncated multiplication are analyzed and demonstrated to improve the performance of fault tolerant designs, reducing error correction burdens, and extending the system operating voltage range. Combining both power strategies results in lower energy consumption levels, which improve the energy savings beyond that expected when applying a combination of both techniques with the original DSP.

This paper presents an error compensation method for truncated multiplication. From two n-bit operands, the operator produces an n-bit product with small error compared to the 2n-bit exact product. The method is based on a logical computation followed by a simplification process. The filtering parameter used in the simplification process helps to control the trade-off between hardware cost and accuracy. The proposed truncated multiplication scheme has been synthesized on an FPGA platform. It gives a better accuracy-over-area ratio than previous well-known schemes such as the constant-correcting and variable-correcting truncation schemes (CCT and VCT).

An area-efficient parallel sign-magnitude multiplier that receives two N-bit numbers and produces an N-bit product, referred to as a truncated multiplier, is described. The quantization of the product to N bits is achieved by omitting about half the adder cells needed to add the partial products; to keep the quantization error to a minimum, probabilistic biases are obtained and then fed to the inputs of the retained adder cells. The truncated multiplier requires approximately 50% of the area of a standard parallel multiplier. The paper then shows that this design strategy can also be applied to the design of two's-complement multipliers. The paper concludes with the application of the truncated multiplier to the implementation of a digital filter, and it is shown that the signal-to-noise ratio of the digital filter using a truncated multiplier is better than that using a standard multiplier.

Truncated-matrix multipliers and squarers offer significant reductions in area, power, and delay, at the expense of increased computational error. These trade-offs make them an attractive choice for many signal processing systems. However, extensive bit-accurate simulation is often necessary to explore the design space effectively and choose the best parameters when using them in systems. This paper presents an algorithm for fast, bit-accurate simulation of truncated-matrix multipliers and squarers in software. The algorithm is applicable to most correction methods published to date, is simple to implement, and facilitates research into system-level use of truncated-matrix units.

In the design of digital signal processing systems, where single-precision results are required, the power dissipation and area of parallel multipliers can be significantly reduced by truncating the less significant columns and compensating to produce an approximate rounded product. This paper provides a new method for truncated multiplication which yields smaller errors than previous methods with only slightly more complexity, by means of a specialized counter. The error pattern is exhaustively analyzed over all possible inputs to evaluate the proposed correction method. Error and hardware comparisons of the previous methods and the proposed correction method are presented. The proposed method is applied to both unsigned and two's complement multipliers.

In this paper, a statistical error compensation (SEC) method for fixed-width Booth multipliers is proposed. Based on statistical simulation of the truncated part, adaptive compensation biases derived from the truncation factors are constructed for compensation circuits of different bit-widths. For an 8×8 fixed-width Booth multiplier, the proposed method achieves higher accuracy compared with previous works at the same area cost. Furthermore, the proposed SEC Booth multiplier is implemented in a two-dimensional (2-D) discrete cosine transform (DCT). Compared to applications using the traditional Booth multiplier, the proposed 2-D DCT core can reduce area cost by 22% with almost 2 dB peak signal-to-noise ratio (PSNR) penalty. Therefore, the proposed multiplier achieves high-accuracy designs at low hardware cost.

A truncated multiplier is a multiplier with two n-bit operands that produces an n-bit result. Truncated multipliers discard some of the partial products of a complete multiplier to trade off accuracy against hardware cost. Compared with a conventional multiplier, a truncated multiplier introduces an error on the output whose magnitude depends on the input bits. The maximum value of the error is hard to compute, since it is not possible to test every possible input, and non-exhaustive simulations are very unlikely to find the actual maximum absolute error. It is therefore extremely useful to develop methods that provide the maximum error for a truncated multiplier. This paper presents a closed-form analytical calculation, for every bit width, of the maximum error for a previously proposed family of truncated multipliers. The considered family of truncated multipliers is particularly important since it is proved to be the design that gives the lowest mean square error for a given number of discarded partial products. With the contribution of this paper, the considered family of truncated multipliers becomes the only architecture that can be designed, for every bit width, using an analytical approach that provides a priori knowledge of the maximum error.
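For small widths, the maximum error can still be found by brute force, which also shows why a closed form matters: the search space grows as $2^{2n}$. Below is a hedged Python sketch over a plain column-truncated array (not necessarily the paper's family; names are illustrative).

```python
def truncated_sum(a, b, n, k):
    # Sum of the partial-product bits kept after discarding the k
    # least-significant columns of the n x n array.
    s = 0
    for i in range(n):
        for j in range(n):
            if i + j >= k:
                s += (((a >> i) & (b >> j)) & 1) << (i + j)
    return s

def max_abs_error(n, k):
    # Brute-force search over all 2^(2n) operand pairs; feasible only
    # for tiny n, which is precisely why an analytical bound is useful.
    return max(abs(a * b - truncated_sum(a, b, n, k))
               for a in range(1 << n) for b in range(1 << n))
```

For plain column truncation the worst case occurs when every discarded bit is one, giving the closed form $\sum_{c=0}^{k-1} (c+1)\,2^c$ for $k \le n$; the brute-force search confirms this for small n.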

This paper focuses on fixed-width multipliers with linear compensation function by investigating in detail the effect of coefficients quantization. New fixed-width multiplier topologies, with different accuracy versus hardware complexity trade-off, are obtained by varying the quantization scheme. Two topologies are in particular selected as the most effective ones. The first one is based on a uniform coefficient quantization, while the second topology uses a nonuniform quantization scheme. The novel fixed-width multiplier topologies exhibit better accuracy with respect to previous solutions, close to the theoretical lower bound.

Many multimedia and DSP applications require fixed-width multipliers, in which input data and output results have the same bit width. In this paper we investigate fixed-width multipliers where one of the input operands is a constant, encoded using canonic signed digit (CSD) representation. This is a very important case in many practical applications such as the calculation of the Fast Fourier Transform. In the paper we derive in closed form the expression of the compensation function giving the minimum mean square error for the CSD fixed-width multiplier. On the basis of this analytical result, we propose a hardware-efficient implementation of the multiplier. Fixed-width CSD multipliers implemented with the approach presented in this paper are accurate and can be implemented by using a simple partial-product reduction tree followed by a fast adder, without requiring additional look-up tables. The proposed approach is general and is well suited for implementation in circuit synthesizers. Implementation results in 90 nm technology are presented, to demonstrate the effectiveness of the proposed technique.
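The CSD recoding on which such constant multipliers rely can be sketched as follows; `to_csd` and `csd_mult` are illustrative helper names, not the paper's implementation.

```python
def to_csd(k):
    # Canonic signed digit recoding of a positive integer: digits in
    # {-1, 0, 1}, least-significant first, with no two adjacent
    # nonzero digits, so the number of add/subtract operations in a
    # constant multiplier is minimized.
    digits = []
    while k:
        if k & 1:
            d = 2 - (k & 3)   # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

def csd_mult(x, coeff):
    # Constant multiplication: one shift plus one add or subtract
    # per nonzero CSD digit.
    return sum(d * (x << i) for i, d in enumerate(to_csd(coeff)) if d)
```

For example, 7 recodes as $8 - 1$ (digits [-1, 0, 0, 1]), needing one subtraction instead of the two additions of the plain binary form.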

Truncated multipliers compute the n most-significant bits of the n × n-bit product. This paper focuses on variable-correction truncated multipliers, where some partial products are discarded to reduce complexity and a suitable compensation function is added to partly compensate the introduced error. The optimal compensation function, which minimizes the mean square error, is obtained in this paper in closed form for the first time. A suboptimal compensation function, better suited for hardware implementation, is introduced. Efficient multiplier implementations based on the suboptimal function are discussed. The proposed truncated multipliers are extensively compared with previously proposed circuits. Experimental results for a 0.18 μm technology are also presented.

This paper presents a technique for designing linear and quadratic interpolators for function approximation using truncated multipliers and squarers. Initial coefficient values are found using a Chebyshev series approximation, and then adjusted through exhaustive simulation to minimize the maximum absolute error of the interpolator output. This technique is suitable for any function and any precision up to 24 bits (IEEE single precision). Designs for linear and quadratic interpolators that implement the reciprocal function, f(x) = 1/x, are presented and analyzed as an example. We show that a 24-bit truncated reciprocal quadratic interpolator with a design specification of ±1 ulp error requires 24.1% fewer partial products to implement than a comparable standard interpolator with the same error specification.
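As a simplified stand-in for the interpolators described above, the sketch below builds a piecewise-linear approximation of f(x) = 1/x on [1, 2) with endpoint-fitted (not Chebyshev-optimized) coefficients and measures its maximum error on a sample grid; all names are illustrative, and the paper's coefficient adjustment would do strictly better for the same table size.

```python
def linear_table(segments):
    # Endpoint-fitted (c0, slope, x0) per segment for f(x) = 1/x
    # on [1, 2); segments are uniform, indexed by the leading
    # fraction bits of x.
    table = []
    for s in range(segments):
        x0 = 1 + s / segments
        x1 = 1 + (s + 1) / segments
        slope = (1 / x1 - 1 / x0) / (x1 - x0)
        table.append((1 / x0, slope, x0))
    return table

def interp_recip(x, table, segments):
    # Segment select by the integer part of (x - 1) * segments,
    # then one multiply-add: c0 + slope * (x - x0).
    c0, slope, x0 = table[min(int((x - 1) * segments), segments - 1)]
    return c0 + slope * (x - x0)

def max_err(segments, samples=20000):
    table = linear_table(segments)
    return max(abs(interp_recip(1 + i / samples, table, segments)
                   - 1 / (1 + i / samples))
               for i in range(samples))
```

Quadrupling the segment count shrinks the linear-interpolation error by roughly a factor of 16 (it scales with the segment width squared), which is the accuracy/table-size trade-off such designs navigate.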

This paper presents a method for compensating the truncation error of fixed-width Booth multipliers, which keep the input and output bit-widths the same. The truncated part that produces the carry-out bits is replaced with a carry-estimation equation. In order to reduce the truncation error, multipliers of different input widths use different carry-estimation equations. Simulation results show that our self-compensation method can lead to an 85% reduction in truncation error compared with direct-truncated multipliers, as well as a 40% reduction in multiplier area compared with traditional Booth multipliers. In contrast with a 128-point fast Fourier transform (FFT) using traditional Booth multipliers, our approach has 10% area reduction with only 1 dB SQNR loss.

A truncated binary squarer is a squarer with an n-bit input that produces an n-bit output. The proposed design minimizes the mean square error of the squarer and results in a very simple and fast circuit implementation. Compared against state-of-the-art circuits, the squarer provides a reduction of the mean square error ranging from 20% to 5%. At the same time, the proposed squarer reduces the power dissipation, reduces the silicon area occupation, and increases the maximum working frequency. Implementation results are provided for a 0.18 μm technology.
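The reason dedicated squarers are cheaper than generic multipliers, and thus attractive targets for truncation, is the folded partial-product array: since $x_i x_j = x_j x_i$, symmetric terms merge at double weight. The sketch below shows this folding (illustrative names; the mean-square-optimal truncation of the paper is not reproduced).

```python
def folded_square(x, n):
    # Square an n-bit value via the folded array: diagonal terms
    # x_i*x_i sit at weight 2^(2i), and each symmetric pair x_i*x_j
    # (i < j) becomes a single bit at weight 2^(i+j+1), roughly
    # halving the array height versus a generic n x n multiplier.
    s = 0
    for i in range(n):
        xi = (x >> i) & 1
        s += xi << (2 * i)
        for j in range(i + 1, n):
            s += (xi & ((x >> j) & 1)) << (i + j + 1)
    return s
```

A truncated squarer then discards the least-significant columns of this already-halved array and compensates, which is where the mean-square-error optimization of the abstract applies.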

An implementation of a radix-4 approximate squaring circuit is described, employing a new operand dual-recoding technique. Approximate squaring circuits have numerous applications, including computer graphics, digital radio modules, and the implementation of division and function approximation in ALU circuits. The theory of operation of the circuit is described, including the radix-4 operand dual recoding. Our recoding yields nonnegative partial squares and other features which simplify the design of the approximate squaring circuit. Results of the implementation in terms of delay, power, and area in both 130 nm and 90 nm technologies are presented and analyzed. The results show the circuit is power, area, and performance efficient, yielding reduction factors of three or more when compared to a truncated-multiplication approach using state-of-the-art logic synthesis tools. The radix-4 squaring circuit is also shown to be more efficient than a state-of-the-art radix-2 binary squaring circuit.

Truncated multiplication provides an efficient method for reducing the power dissipation and area of rounded parallel multipliers in digital signal processing systems. With this technique, the products of parallel multipliers are rounded to a shorter word size and the least-significant columns of the multiplication matrix are not used. This technique provides significant savings in terms of power dissipation for unsigned multiplication. Although previous implementations involved unsigned and signed array and tree multipliers, this technique can be equally applied to multiplication using Booth-encoding. This paper presents the design and implementation of parallel and truncated multipliers that use Booth-encoding and compressors for signed multiplication. Initial estimates indicate that truncated parallel multipliers dissipate less power than standard parallel multipliers for operand sizes of 16 bits.

A faithfully rounded truncated multiplier design is presented where the maximum absolute error is guaranteed to be no more than 1 unit of least position. The proposed method jointly considers the deletion, reduction, truncation, and rounding of partial product bits in order to minimize the number of full adders and half adders during tree reduction. Experimental results demonstrate the efficiency of the proposed faithfully truncated multiplier with area saving rates of more than 30%. In addition, the truncated multiplier design also has smaller delay due to the smaller bit width in the final carry-propagate adder. Index Terms—Computer arithmetic, faithful rounding, fixed-width multiplier, tree reduction, truncated multiplier.

In this paper, a single compensation formula of adaptive conditional-probability estimator (ACPE) applied to fixed-width Booth multipliers is proposed. Based on conditional-probability theory, the ACPE can be easily applied to long Booth multipliers (such as 32-bit or larger) to achieve higher accuracy. To trade off accuracy against area cost, the ACPE provides varying column information w to adjust the accuracy with respect to system requirements. The 16-bit ACPE Booth multiplier with w = 3 reduces silicon area by 28.9% with only 0.39 dB signal-to-noise ratio (SNR) loss when compared with the post-truncated (P-T) Booth multiplier. Furthermore, the ACPE Booth multipliers are applied to a two-dimensional (2-D) discrete cosine transform (DCT) to evaluate the system performance. Implemented in a TSMC 0.18 μm CMOS process, the DCT core with ACPE (w = 3) can save 14.3% area cost with only a 0.48 dB peak-signal-to-noise-ratio (PSNR) penalty compared to the P-T method.
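
This and several neighbouring abstracts assume radix-4 (modified) Booth recoding of one operand, which halves the partial-product count. A minimal sketch of the standard digit mapping d_i = b_{2i-1} + b_{2i} - 2·b_{2i+1} (function name and interface are ours, not from any cited design):

```python
def booth_radix4_digits(x, n):
    """Radix-4 (modified) Booth recoding of an n-bit two's-complement
    integer x (given as a Python int; arithmetic right shift handles
    negatives) into n/2 digits, each in {-2, -1, 0, 1, 2}, such that
    x == sum(d * 4**i for i, d in enumerate(digits)).  Assumes n even."""
    digits = []
    prev = 0  # implicit zero bit below the LSB (b_{-1} = 0)
    for i in range(0, n, 2):
        b0 = (x >> i) & 1        # bit 2i
        b1 = (x >> (i + 1)) & 1  # bit 2i+1 (the sign bit on the last step)
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits
```

Each digit selects a partial product from {0, ±A, ±2A}, all cheap to form by shift and negate, which is why Booth-encoded arrays are the usual starting point for the fixed-width compensation schemes described here.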

In this brief, a probabilistic estimation bias (PEB) circuit for a fixed-width two's-complement Booth multiplier is proposed. The proposed PEB circuit is derived from theoretical computation, instead of exhaustive simulations and heuristic compensation strategies that tend to introduce curve-fitting errors and exponentially growing simulation time. Consequently, the proposed PEB circuit provides a smaller area and a lower truncation error compared with existing works. Implemented in an 8 × 8 two-dimensional (2-D) discrete cosine transform (DCT) core, the DCT core using the proposed PEB Booth multiplier improves the peak signal-to-noise ratio by 17 dB with only a 2% area penalty compared with the direct-truncated method.

The maximum error has a serious effect on the performance of fixed-width multipliers that receive W-bit inputs and produce W-bit products. In this paper, we analyze the error bound of the fixed-width modified Booth multiplier. We then present a method that can be used to reduce the maximum error. Simulations show that the performance of the proposed fixed-width multiplier is very close to that of the multiplier with a rounding scheme.

The paper presents a design method for a fixed-width squarer that receives a W-bit input and produces a W-bit squared product. To compensate efficiently for the quantization error, Booth encoder outputs (not multiplier coefficients) are used for the generation of the error compensation bias. The truncated bits are divided into two groups depending upon their effects on the quantization error, and different error compensation methods are applied to each group. Simulations show that the performance of the proposed method is very close to that of the rounding operation and much better than that of the truncation operation.

Truncated multiplication can be used to significantly reduce the power dissipation for applications that do not require correctly-rounded results. This paper presents an efficient method for truncated multiplication called hybrid-correction truncation that utilizes the advantages of two previous methods to obtain lower average and maximum absolute error. Comparisons are presented contrasting power, area, and delay for all three methods compared to standard parallel multipliers. Estimates indicate that hybrid truncated multipliers dissipate slightly less power and consume slightly less area than previous methods for truncated multiplication. In addition, utilization of the hybrid truncation method can provide a method for altering the implementation within certain limits to meet a given precision.

About half the hardware for floating point multipliers is needed only to guarantee correctly rounded results. For multimedia, graphics, and DSP systems, a significant reduction in area, delay, and power can be achieved by producing results that are not correctly rounded. This paper presents an efficient method for designing variable-correction truncated floating point multipliers that produce results with a maximum error of less than one unit in the last place. With this method, several of the less significant columns of the significand multiplier are eliminated and the rounding logic for floating point multiplication is simplified.

The variable correction truncated multiplier is introduced. This is a method for minimizing the error of a truncated multiplier. The error is reduced by using information from the partial product bits of the column adjacent to the truncated LSB. This results in a complexity savings while introducing minimum distortion to the result.
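
One common reading of the variable-correction scheme: the partial-product bits in the most significant truncated column are promoted one position and added into the retained LSB column, standing in for that column's own weight plus the expected carries from the columns below it. A hedged sketch for unsigned operands (the promote-by-one-column detail is our assumption about the variant described, and the names are illustrative):

```python
def variable_correction_mult(a, b, n, k):
    """Truncated multiply of unsigned n-bit a, b keeping columns >= k.
    Variable correction: each partial-product bit in the most
    significant truncated column, k - 1, is promoted one position and
    added into the retained LSB column k."""
    kept = 0
    correction = 0
    for i in range(n):
        for j in range(n):
            bit = ((a >> i) & 1) & ((b >> j) & 1)
            if i + j >= k:
                kept += bit << (i + j)       # retained part of the array
            elif i + j == k - 1:
                correction += bit            # data-dependent correction
    return kept + (correction << k)
```

Because the correction tracks the operands, it follows the actual dropped mass more closely than a fixed constant, at the cost of a few extra AND gates and adder inputs in column k.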

Multiplication is frequently required in digital signal processing. Parallel multipliers provide a high-speed method for multiplication but require a large area for VLSI implementations. In most signal processing applications, a rounded product is desired to avoid growth in word size. Thus an important design goal is to reduce the area requirements of rounded-output multipliers. The authors present a technique for parallel multiplication which computes the product of two numbers by summing only the most significant columns of the multiplication matrix, along with a correction constant. A method for selecting the value of the correction constant which minimizes the average and mean-square error is introduced. Equations are given for estimating the average, mean-square, and maximum error of the rounded product. With this technique, the hardware requirements of the multiplier can be reduced by 25 to 35%, while limiting the maximum error of the rounded product to less than one unit in the last place.
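
The correction constant described above can be taken as the expected value of the discarded bits: for uniform random operands each partial-product bit a_i AND b_j is 1 with probability 1/4, and truncated column c holds c + 1 bits. A sketch under that assumption (this is one plausible choice of constant, not necessarily the paper's optimised one; it assumes k <= n):

```python
def constant_correction_mult(a, b, n, k):
    """Truncated multiply of unsigned n-bit a, b keeping columns >= k
    (assumes k <= n), plus a constant correction equal to the expected
    value of the discarded partial-product bits: column c < k holds
    c + 1 bits, each 1 with probability 1/4 for uniform operands."""
    kept = 0
    for i in range(n):
        for j in range(n):
            if i + j >= k:
                kept += (((a >> i) & 1) & ((b >> j) & 1)) << (i + j)
    # E[dropped sum] = (1/4) * sum over truncated columns of (bits * weight)
    expected = sum((c + 1) * (1 << c) for c in range(k)) // 4
    return kept + expected
```

Unlike the variable-correction scheme, the constant is wired in once and costs essentially no extra logic, but it cannot adapt to the operands, so its worst-case error is larger.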

In this paper, a new error-compensation network for fixed-width multipliers is proposed. The error-compensation block is composed of two summation trees which are optimally chosen in order to minimize either the mean-square error or the maximum absolute error. The new technique substantially improves error performances with respect to previously proposed approaches. Simulation results show that new fixed-width multipliers exhibit significant improvements both in propagation delay and in power dissipation with respect to previous solutions.

In this paper, we propose a low-error fixed-width redundant multiplier design. The design is based on the statistical analysis of the error compensation value of the truncated partial products in binary signed-digit representation with modified Booth encoding. The overall truncation error is significantly reduced compared with other previous approaches. Furthermore, the derived relationship between the compensation value and the truncated digits is so simple that the area cost of the corresponding compensation circuit is almost negligible. The fixed-width multiplier design is also applied to the discrete cosine transform/inverse discrete cosine transform (DCT/IDCT) computation in JPEG image compression.

This paper presents an error compensation method for fixed-width canonic signed digit (CSD) multipliers that receive a W-bit input and produce a W-bit product. To efficiently compensate for the quantization error, the truncated bits are divided into two groups (a major group and a minor group) depending upon their effects on the quantization error. The desired error compensation bias is first expressed in terms of the truncated bits in the major group; the effects of the remaining truncated bits in the minor group are then handled by a probabilistic estimation. An efficient sign extension reduction method for fixed-width CSD multipliers is also proposed. Simulations show that a 25% reduction in truncation error and a 13% reduction in hardware complexity can be achieved by the proposed error compensation and sign extension reduction methods, respectively.

R. Zimmermann, "Coding Guidelines for Datapath Synthesis," https://www.synopsys.com/dw/doc.php/wp/coding_guidelines.pdf, July 2005.