## About

**Publications:** 64

**Reads:** 6,465

**Citations:** 964


Additional affiliations

March 1998 - present

## Publications


ExaScale systems will be a key driver for simulations that are essential for the advance of science and economic growth. We aim to present a new microprocessor concept for floating-point computations, suitable as a basic building block of ExaScale systems and beyond. The proposed microprocessor architecture has a frontend for programming interfa...

Hardware signatures based on Bloom filters are used to support and accelerate membership query in a set of items. They use
modest hardware at the cost of false positives, but never produce false negatives. Signatures were traditionally used in different
distributed and network applications, but in recent years their use has been extended to other f...
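As a rough software illustration of the property stated in this abstract (false positives are possible, false negatives are not), the following is a minimal sketch of a Bloom-filter signature. The class name, the bit-vector size, and the use of salted SHA-256 as the hash family are assumptions made for brevity; hardware signatures use much cheaper hash functions, and nothing here is taken from the paper itself.

```python
import hashlib


class BloomSignature:
    """Toy software model of a Bloom-filter signature (hypothetical names)."""

    def __init__(self, num_bits=64, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # the bit vector, stored as a Python integer

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        # Assumption: real hardware would use simple bit-selection or XOR hashes.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def insert(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def may_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


sig = BloomSignature()
inserted = [0x1000, 0x1040, 0x2080]
for addr in inserted:
    sig.insert(addr)

# Every inserted address is reported as present: no false negatives.
assert all(sig.may_contain(addr) for addr in inserted)
```

A query for an address that was never inserted may still return `True` if its bit positions happen to collide with those of inserted items; that is the false-positive cost the abstract refers to.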

We present the algorithm and architecture of a BCD parallel multiplier that exploits some properties of two different redundant BCD codes to speedup its computation: the redundant BCD excess-3 code (XS-3), and the overloaded BCD representation (ODDS). In addition, new techniques are developed to reduce significantly the latency and area of previous...

ExaScale systems will be a key driver for simulations that are essential for the advance of science and economic growth. Current technology trends indicate that there might be a big energy wall by the end of the decade. Different reports call for strong changes at all levels for ExaScale computer systems. This academic position paper addresses this pro...

The four articles in this special section focus on the topic of computer arithmetic and its applications.

With the advent of chip multiprocessors, new techniques have been developed to make parallel programming easier and more reliable. New parallel programming paradigms and new methods of making the execution of programs more efficient and more reliable have been developed. Usually, these improvements require hardware support to avoid a system slowdown....

Two's complement multipliers are important for a wide range of applications. In this paper, we present a technique to reduce by one row the maximum height of the partial product array generated by a radix-4 Modified Booth Encoded multiplier, without any increase in the delay of the partial product generation stage. This reduction may allow for a fa...
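For readers unfamiliar with the recoding mentioned in this abstract, here is a minimal sketch of standard radix-4 modified Booth recoding of a two's complement multiplier. It covers only the textbook recoding, which yields one digit in {-2, -1, 0, 1, 2} per pair of multiplier bits; it does not model the partial product array or the row-reduction technique the paper proposes. The function names are hypothetical.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's complement multiplier y (n even) into
    n/2 radix-4 digits in {-2, -1, 0, 1, 2}, least significant first."""
    assert n % 2 == 0
    assert -(1 << (n - 1)) <= y < (1 << (n - 1))
    bits = [(y >> i) & 1 for i in range(n)]  # two's complement bits of y
    digits = []
    prev = 0  # implicit bit to the right of the least significant bit
    for i in range(0, n, 2):
        # d = y[i] + y[i-1] - 2*y[i+1]
        digits.append(bits[i] + prev - 2 * bits[i + 1])
        prev = bits[i + 1]
    return digits


def booth_multiply(x, y, n):
    """Multiply by summing one partial product x * d * 4**i per digit."""
    return sum(x * d * 4**i for i, d in enumerate(booth_radix4_digits(y, n)))


# An 8-bit multiplier needs only four partial products.
assert len(booth_radix4_digits(-77, 8)) == 4
assert booth_multiply(13, -77, 8) == 13 * -77
```

Each digit selects one of 0, ±x, or ±2x, all of which can be generated by shifting and negating, which is why the recoding is attractive for partial product generation.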

In this work we propose a new decimal redundant CORDIC algorithm to manage transcendental functions, using floating-point representation. The algorithms determine the direction of the elementary rotation using sign estimations. Unlike binary redundant CORDIC, repetition of iterations is not required to ensure convergence since novel decimal codes...

We present a novel method for hardware design of combined binary/decimal multi-operand adders. More specifically, we apply this method to architectures based on binary CSA (carry-save adder) trees, which are of interest for VLSI implementation of high performance multipliers and other low latency arithmetic units. A remarkable feature of the propos...

The new generation of high-performance decimal floating-point units (DFUs) is demanding efficient implementations of parallel decimal multipliers. In this paper, we describe the architectures of two parallel decimal multipliers. The parallel generation of partial products is performed using signed-digit radix-10 or radix-5 recodings of the multipli...

In this paper we propose a simple Cache Filtering Mechanism (CFM-TM) for TM systems that are decoupled from caches with the aim of reducing the use of the transactional memory baseline system. We propose to use CFM-TM with LogTM-SE as the baseline system, because it uses signatures for conflict detection (a resource that might be used for other purpo...

A recent work proposed to simplify fat-trees with adaptive routing by means of a load-balancing deterministic routing algorithm. The resultant network has performance figures comparable to the more complex adaptive routing fat-trees when packets need to be delivered in order. In a second work by the same authors published in IEEE CAL, they propose...

Adders are critical for microprocessor design. Current designs use variations of parallel prefix schemes. A method introduced by Ling [7] may improve this kind of adder. However, as recent research publications demonstrate, the use of the Ling scheme in prefix adders is not a mature and clear concept. In this work we show how to easily extend any...

The unfolded and pipelined CORDIC is a high-performance hardware element that produces a wide variety of one- and two-argument functions with high throughput. Reductions in delay, power, and area (cost) are of significant interest for this module because of its high demand for resources. The linear approximation to rotation has been proposed to...

In this paper we present the algorithm and architecture of a radix-10 floating-point divider based on an SRT non-restoring digit-by-digit algorithm. The algorithm uses conventional techniques developed to speed up radix-$2^k$ division, such as signed-digit (SD) redundant quotient and digit selection by constant comparison using a carry-save est...

This paper introduces two novel architectures for parallel decimal multipliers. Our multipliers are based on a new algorithm for decimal carry-save multioperand addition that uses a novel BCD-4221 recoding for decimal digits. It significantly improves the area and latency of the partial product reduction tree with respect to previous proposals. We...

We present a high-radix CORDIC rotation algorithm, which results in a reduction of the number of iterations. Carry-save representation is used and the selection function is performed by rounding, except for i = 0 where a small table is necessary. The scale factor is not constant, but is efficiently computed in logarithmic form and compensated by a hi...

In this paper, we propose a class of division algorithms with the aim of reducing the delay of the selection of the quotient digit by introducing more concurrency and flexibility in its computation. From the proposed class of algorithms, we select one that moves part of the selection function out of the critical path, with a corresponding reduction...

The reciprocal and square-root reciprocal operations are important in several applications. For these operations, we present algorithms that combine a digit-by-digit module and one iteration of a quadratic-convergence approximation. The latter is implemented by a digit-recurrence, which uses the digits produced by the digit-by-digit part. In this w...

The pipelined CORDIC with linear approximation to rotation has been proposed to achieve reductions in delay, power and area; however, the schemes for rotation (multiplication) and vectoring (division) complicate implementation in a single unit. In this work, we improve the linear approximation scheme, leading to a unified implementation for rotatio...

Graphics processors require strong arithmetic support to perform computational kernels over data streams. Because of the current implementation using the basic arithmetic operations, the algorithms are given in algebraic terms. However, since the operations are really of a geometric nature, it seems to us that more flexibility in the implementation...

In this work, we present a reciprocal square root algorithm by digit recurrence and selection by a staircase function and the radix-4 implementation. As in similar algorithms for division and square root, the results are obtained correctly rounded in a straightforward manner (in contrast to existing methods to compute the reciprocal square root). A...

In this work we present an implementation of the exponential function in double precision, in a unit that supports IEEE floating-point arithmetic. As in existing proposals, the implementation is based on the use of a floating-point multiplier and additional hardware. We decompose the computation into three subexponentials. The first and third subexpon...

Since a large portion of the critical path in an implementation of radix-4 division corresponds to the delay of the quotient-digit selection module, it is of interest to reduce this delay. The proposal of this paper extends the approach presented recently of prestoring the selection constants corresponding to the actual value of the divisor and to...

We present hardware primitives for 3D rotation and vector normalization for high-throughput 3D graphics and animation. The primitives are based on the 2D and 3D CORDIC algorithms, in contrast to more conventional mac-based engines. Also considered are conversions among rotation representations and rotation composition based on the same primitives.

In this work we present a reciprocal square-root algorithm by digit recurrence and selection by a staircase function, and the radix-4 implementation. As in similar algorithms for division and square-root, the results are obtained correctly rounded in a straightforward manner (in contrast to existing methods to compute the reciprocal square-root). Alth...

We present a reciprocal square-root algorithm by digit recurrence and selection by a staircase function, and the radix-4 implementation. As in similar algorithms for division and square-root, the results are obtained correctly rounded in a straightforward manner (in contrast to existing methods to compute the reciprocal square-root). Although apparent...

The CORDIC algorithm has many applications, e.g. digital signal processing, and its high latency is an important problem to overcome. A 32-bit precision implementation employing very-high radix circular vectoring CORDIC in a word-serial architecture is presented. It is restricted to angle calculation, and has been implemented using a VHDL-...

A very-high radix algorithm and implementation for circular CORDIC in vectoring mode is presented. As for division, to simplify the selection function, the operands are pre-scaled. However, in the CORDIC algorithm the coordinate x varies during the execution so several scalings might be needed; we show that two scalings are sufficient. Moreover,...

In this work we present a CORDIC rotator, using carry-save arithmetic, based on the prediction of all the coefficients into which the rotation angle is decomposed. The prediction algorithm is based on the use of radix-2 microrotations with multiple shifts in the first iterations and the use of a redundant radix-2 and radix-4 representation for...

In this paper we present a high-radix CORDIC rotation algorithm, which results in a reduction of the number of iterations. Carry-save representation is used, leading to a fast iteration time. The selection function is performed by rounding, except for i = 0 where a small table is necessary. The scale factor is not constant, but is efficiently comp...

CORDIC-based algorithms to compute $\cos^{-1}(t)$, $\sin^{-1}(t)$ and $\sqrt{1 - t^2}$ are proposed. The implementation requires a standard CORDIC module plus a module to compute the direction of rotation, this being the same hardware required for the extended CORDIC vectoring, recently proposed by the authors. Although these functions can be obtaine...

In this work we present the VLSI implementation of an application specific processor that performs the angle calculation and rotation operation. This operation is important in matrix algebra and its hardware implementation is of interest for many real time applications. The computation of the angle and the rotation are performed by means of the red...

A very-high radix algorithm and implementation for circular CORDIC
is presented. We first present in depth the algorithm for the vectoring
mode in which the selection of the digits is performed by rounding of
the control variable. To assure convergence with this kind of selection,
the operands are prescaled. However, in the CORDIC algorithm, the
co...

A very-high radix algorithm and implementation for CORDIC rotation in circular and hyperbolic coordinates is presented. The selection function consists of rounding the residual. It is shown that this assures convergence from the second iteration on. For the first iteration, the selection is done by table, using a lower radix than for the remainin...

CORDIC-based algorithms to compute $\cos^{-1}(t)$, $\sin^{-1}(t)$ and $\sqrt{1 - t^2}$ are proposed. The implementation requires a standard CORDIC module plus a module to compute the direction of rotation, this being the same hardware required for the extended CORDIC vectoring, recently proposed by the authors [T. Lang and E. Antelo...

Many applications require the evaluation of rotations at high speeds. However, there is a trade-off between the chip area and the latency. In this paper we develop a digit on-line pipelined array architecture based on the radix-4 CORDIC algorithm in rotation mode. The radix-4 CORDIC algorithm halves the number of microrotations with respect to the...

In this work we extend the radix-4 CORDIC algorithm to the vectoring mode (the radix-4 CORDIC algorithm was proposed recently by the authors for the rotation mode). The extension to the vectoring mode is not straightforward, since the digit selection function is more complex in the vectoring case than in the rotation case; as in the rotation mode,...

The computation of additional functions in the CORDIC module
increases its flexibility. We consider here the extension of the
vectoring mode (angle calculation) so that the vector is rotated until
one of the coordinates (for instance y) attains a target value t (in
contrast to the value 0, as in standard vectoring). The main problem in
the algorith...

This paper presents a new design for two operand normalization. The two operand normalization operation involves the normalization of at least one of two operands by left shifting both by the same amount. Our design performs the computation of the shift by making an OR of the bits of both operands in a tree network, encoding the position of the fir...
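The operation described in this abstract can be modelled in a few lines of software. The sketch below assumes unsigned, fixed-width operands (an assumption; the paper may also handle signed formats) and replaces the OR tree and position encoder with an integer OR and `bit_length()`. The function name is hypothetical.

```python
def normalize_pair(a, b, width):
    """Left-shift two unsigned width-bit operands by the same amount so that
    at least one of them ends up with its most significant bit set."""
    mask = (1 << width) - 1
    assert 0 <= a <= mask and 0 <= b <= mask
    combined = a | b  # its leading one is the leading one of the larger operand
    if combined == 0:
        return a, b, 0  # both operands are zero: nothing to normalize
    shift = width - combined.bit_length()  # leading zeros of the OR
    return (a << shift) & mask, (b << shift) & mask, shift


# 0b00010110 and 0b00000011 share three leading zeros in their OR.
assert normalize_pair(0b00010110, 0b00000011, 8) == (0b10110000, 0b00011000, 3)
```

Because the shift amount is derived from the OR of both operands, neither operand can overflow, and the larger one is always fully normalized.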

A very-high radix digit-recurrence algorithm for the operation
√(x/d) is developed, with residual scaling and digit selection by
rounding. This is an extension of the division and square-root
algorithms presented previously, and for which a combined unit was shown
to provide a fast execution of these operations. The architecture of a
combined unit...

A very-high radix digit-recurrence algorithm for the operation $\sqrt{x/d}$ is developed, with residual scaling and digit selection by rounding. This is an extension of the division and square-root algorithms presented previously, and for which a combined unit was show...

In this paper, we consider the errors appearing in angle
computations with the CORDIC algorithm (circular and hyperbolic
coordinate systems) using fixed-point arithmetic. We include errors
arising not only from the finite number of iterations and the finite
width of the data path, but also from the finite number of bits of the
input. We show that t...

Traditionally, CORDIC algorithms have employed radix-2 in the
first n/2 microrotations (n is the precision in bits) in order to
preserve a constant scale factor. The authors present a full radix-4
CORDIC algorithm in rotation mode and circular coordinates and its
corresponding selection function, and propose an efficient technique for
the compensat...

In this work we present a new CORDIC algorithm for the vectoring mode, based on the use of radix-4, preserving a complexity in the microrotations that is similar to that of the conventional radix-2 CORDIC. The use of this radix, together with the inclusion in the CORDIC algorithm of the zero skipping technique, reduces by more than half the number...

A very-high radix digit-recurrence algorithm for the operation $\sqrt{x/d}$ is developed, with residual scaling and digit selection by rounding. This is an extension of the division and square-root algorithms presented previously, and for which a combined unit was shown to provide a fast execution of these operations. The architecture of a combined uni...

CORDIC-based algorithms to compute $\cos^{-1}(t)$, $\sin^{-1}(t)$ and $\sqrt{1 - t^2}$ are proposed. The implementation requires a standard CORDIC module plus a module to compute the direction of rotation, this being the same hardware required for the extended CORDIC vectoring, recently proposed by the authors. Although these functions...

The computation of additional functions in the CORDIC module increases its flexibility. We consider here the extension of the vectoring mode (angle calculation) so that the vector is rotated until one of the coordinates (for instance y) attains a target value t (in contrast to the value 0, as in standard vectoring). The main problem in the algorith...

We present a unified mixed radix CORDIC algorithm with carry-save
arithmetic with a constant scale factor. The pipelined architecture of
the processor is determined by a unique sequence of microrotations for
the two modes of operation (rotation and vectoring) in circular and
hyperbolic coordinates. The combination of radix-2 and radix-4
microrotati...

In this paper we present a new CORDIC algorithm for the vectoring mode, based on the use of radix-4 preserving a complexity in the microrotations that is similar to that of the conventional radix-2 CORDIC. The use of this radix, together with the inclusion in the CORDIC algorithm of the zero skipping technique, reduces by more than half the number...

The compensation of scale factor imposes significant computation overhead on the CORDIC algorithm. In this paper we will propose two algorithms and architectures in order to perform the compensation of the scale factor in parallel with the computation of the CORDIC iterations. This way it is not necessary to carry out the final multiplication or ad...
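To make the overhead discussed in this abstract concrete, the following is a minimal floating-point sketch of conventional radix-2 CORDIC in rotation mode, where the scale factor is removed by a final multiplication after all iterations. This is the baseline arrangement, not either of the parallel compensation schemes the paper proposes; the function name and iteration count are assumptions, and a hardware design would use fixed-point shifts and a precomputed constant instead of `math` calls.

```python
import math


def cordic_rotate(x, y, angle, iterations=32):
    """Rotate (x, y) by angle (radians, roughly |angle| <= 1.74) using
    conventional radix-2 CORDIC, then compensate the scale factor."""
    z = angle
    gain = 1.0  # accumulated scale factor K
    for i in range(iterations):
        sigma = 1.0 if z >= 0 else -1.0  # direction of this microrotation
        # Both updates use the old x and y (tuple assignment).
        x, y = x - sigma * y * 2.0**-i, y + sigma * x * 2.0**-i
        z -= sigma * math.atan(2.0**-i)
        gain *= math.sqrt(1.0 + 4.0**-i)  # each microrotation stretches the vector
    # Final compensation step: one extra multiplication per coordinate.
    return x / gain, y / gain


cx, cy = cordic_rotate(1.0, 0.0, math.pi / 4)
assert abs(cx - math.cos(math.pi / 4)) < 1e-8
assert abs(cy - math.sin(math.pi / 4)) < 1e-8
```

Since every microrotation is performed regardless of the angle, `gain` is the same constant for any input; the cost lies in the extra multiplication at the end, which is the step the abstract seeks to overlap with the iterations.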

Many applications require the evaluation of rotations at high speeds. However, there is a trade-off between the chip area and the latency. In this paper we develop a digit on-line pipelined array architecture based on the radix-4 CORDIC algorithm in rotation mode. The radix-4 CORDIC algorithm halves the number of microrotations with respect to the tradi...

We present a CORDIC rotator, using carry-save arithmetic, based on
the prediction of all the coefficients into which the rotation angle is
decomposed. The prediction algorithm is based on the use of radix-2
microrotations with multiple shifts in the first iterations and the use
of a redundant radix-2 and radix-4 representation for the coefficients...

We present the design and implementation of the Sobel operator in
an application specific integrated circuit. Systolic processor arrays
were employed for an efficient exploitation of the advantages of VLSI
technology. The architecture obtained is highly regular and simple. The
performance of the architecture is improved by means of the use of carry...
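For reference, the computation that such a circuit performs can be sketched in software as follows. The sketch applies the two standard 3x3 Sobel kernels to each interior pixel and combines them as |Gx| + |Gy|, a common hardware-friendly approximation of the gradient magnitude; that combination, the border handling, and the function name are assumptions, not details taken from the paper.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel(image):
    """Return |Gx| + |Gy| for each interior pixel of a 2D list of integers.
    Border pixels are left at zero."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = gy = 0
            for dr in range(3):
                for dc in range(3):
                    pixel = image[r + dr - 1][c + dc - 1]
                    gx += SOBEL_X[dr][dc] * pixel
                    gy += SOBEL_Y[dr][dc] * pixel
            out[r][c] = abs(gx) + abs(gy)
    return out


# A vertical step edge produces a strong horizontal gradient.
edge = [[0, 0, 10, 10] for _ in range(4)]
assert sobel(edge)[1][1] == 40
```

All kernel coefficients are 0, ±1, or ±2, so each output needs only additions, subtractions, and single-bit shifts, which is part of what makes the operator well suited to regular array architectures.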

In this work we develop a generalization of the CORDIC algorithm for any radix in three coordinate systems: linear, circular and hyperbolic. We carry out a comparative study of different radices in terms of the number of additions, since the complexity in additions determines the total hardware associated with the implementation of...

Floating-point implementations of the logarithm function require computing a fixed-point approximation with high accuracy when the result is close to zero. Thus, iterative methods with linear convergence for the logarithm introduce a significant latency penalty when the input argument X ≈ 1. Some solutions use a second order polynomial approximati...