## About

80 publications

18,993 reads

817 citations

## Publications

In this work, we present and evaluate a hardware architecture for the LOCO-ANS (Low Complexity Lossless Compression with Asymmetric Numeral Systems) lossless and near-lossless image compressor, which is based on the JPEG-LS standard. The design is implemented in two FPGA generations, evaluating its performance for different codec configurations. The te...

Near-lossless compression is a generalization of lossless compression in which the codec user can set the maximum absolute difference (the error tolerance) between the value of an original pixel and that of the decoded one. This enables higher compression ratios while still allowing control over the bounds of the quantization errors in the space domain. This featu...
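The error-tolerance guarantee described above can be illustrated with a minimal sketch. It assumes the uniform residual quantizer commonly associated with JPEG-LS near-lossless mode (step size `2 * near + 1`); the function names are hypothetical and this is not the LOCO-ANS implementation itself.

```python
def quantize_residual(residual: int, near: int) -> int:
    """Quantize a prediction residual with step 2*near + 1 (assumed JPEG-LS-style rule)."""
    step = 2 * near + 1
    if residual >= 0:
        return (residual + near) // step
    return -((near - residual) // step)


def dequantize_residual(q: int, near: int) -> int:
    """Map a quantized residual back to its reconstruction value."""
    return q * (2 * near + 1)


def max_abs_error(near: int, lo: int = -255, hi: int = 255) -> int:
    """Largest reconstruction error over all residuals in [lo, hi]."""
    return max(
        abs(r - dequantize_residual(quantize_residual(r, near), near))
        for r in range(lo, hi + 1)
    )
```

With `near = 0` the quantizer is the identity, so the codec is lossless; for any larger `near`, the reconstruction error never exceeds `near`.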

The realization that the network is becoming an important bottleneck in computing clusters and in the cloud has led in recent years to increased scrutiny of how networking functionality is deployed. From TCP Offload Engines (TOEs) to Software Defined Networking (SDN), including Smart NICs and In-Network Data Processing, a wide range of approach...

This paper introduces the first flow exporter architecture based on a stand-alone FPGA, designed for the aggregation and subsequent export of TCP session records for 100 GbE links, assuring peak performance without packet sampling even at the maximum packet rate. Our research shows that FPGA fabric offers adequate flexibility and performance...

In this demo, we present an end-to-end video transmission system, using the low-end ZynqBerry board. In the programmable logic we have developed an efficient hardware implementation of a video encoder optimized for ultra low-latency, using the Logarithmic Hop Encoding (LHE) algorithm, which works in the time domain, meaning that no domain transform...

This work presents an FPGA-based architecture designed for the aggregation and subsequent export of TCP session records on links of up to 40 Gbit/s without packet sampling, even at the maximum packet rate. In this way, flow exporters based on special-purpose hardware are relieved of task...

In this paper, we present an efficient hardware implementation of a video encoder optimized for ultra low-latency, using the Logarithmic Hop Encoding algorithm. This design provides the following features: (i) a maximum marginal output latency of 23 clock cycles, (ii) small area requirements, (iii) proven rate up to 95 million pixels per second...

Communication networks these days face a relentless increase in traffic load. Multi-gigabit-per-second links are becoming widespread, and network devices are under continuous stress, so testing whether they guarantee the specified throughput or delay is a must. Software-based solutions, such as packet-train traffic injection, were adequate for lowe...

The Kalman filter plays an essential role in an integrated navigation system. From an embedded-system design point of view, the UD filter is a convenient, numerically-stable version of the Kalman filter. In this paper, a UD filter coprocessor with single-precision floating-point format that runs in FPGA is presented. A comprehensive hardware/softwa...

In this paper we present TNT10G (multi-Terabyte trace Network Tester), an FPGA-based tool for replaying and capturing massive Ethernet traces at 10 Gb/s. The tool is capable of reproducing and storing terabytes of network traffic at line rate, even if small packets are being used. Moreover, since the design works at low level (XGMII), accuracy is b...

The rise of network speeds to tens of gigabits per second poses a challenge to develop packet processing applications that can cope with such bit rates. Therefore, the need for a suitable open source system that can be used as a prototype platform to test new network functionality while ensuring line-rate processing, accurate timestamping, and redu...

The Kalman filter is an effective tool for fusing signals from multiple sources. UD filtering is a well-known, numerically stable formulation of the Kalman filter, due to G.J. Bierman and C. Thornton. The most popular version of this filter is designed to run on a traditional, sequential microprocessor. In this paper a new algorithm f...

In this paper we present an FPGA-based architecture to export flows in 10 Gbps networks, implemented on the NetFPGA-10G platform. Flow-based monitoring is a powerful methodology to analyze and detect network issues, such as congested links or DDoS attacks. Our design provides the following advantages: (i) The architecture allows processing 10 Gbps...

In this paper we present an FPGA implementation of a Monte-Carlo method for pricing Asian options using Impulse C and floating-point arithmetic. In an Altera Stratix-V FPGA, a 149x speedup factor was obtained against an OpenMP-based solution in a 4-core Intel Core i7 processor. This speedup is comparable to that reported in the literature using a c...

This paper details the design of a new high-speed point multiplier for elliptic curve cryptography using either field-programmable gate array or application-specific integrated circuit technology. Different levels of digit-serial computation were applied to the data path of Galois field (GF) multiplication and division to explore the resulting per...

HPRC (High-Performance Reconfigurable Computing) systems include multicore processors and reconfigurable devices acting as custom coprocessors. Due to economic constraints, the number of reconfigurable devices is usually smaller than the number of processor cores, which prevents a 1:1 mapping between cores and coprocessors....

Matrix multiplication is fundamental in areas such as signal processing and robotics. In systems that require mobility, this algebraic operation must run on a battery-powered embedded system, where energy consumption imposes a strong design constraint. Matrix multiplication has O(N^3) complexity, ...

New methodologies and tools are employed to significantly improve the performance of complex applications such as aeronautical CFD simulations, while reducing the energy required to perform those computations. The study uses an in-house Navier-Stokes solver, written in C++ with single-precision floating-point arithmetic, which implements a vertex-centered finite-vo...

Modular exponentiation with large modulus and exponent, which is usually accomplished by repeated modular multiplications, has been widely used in public key cryptosystems. Typically, Montgomery's modular-multiplication algorithm is used since no trial division is necessary, and carry-save addition (CSA) is employed to reduce the critical p...
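To make the "no trial division" property concrete, here is a minimal software sketch of the textbook radix-2 (bit-serial) Montgomery product. It is an illustration only: it does not model the carry-save datapath or the optimizations of the paper, and the function name and parameters are hypothetical.

```python
def montgomery_product(a: int, b: int, m: int, n: int) -> int:
    """Return a * b * 2^(-n) mod m, for odd m and 0 <= a, b < m < 2^n.

    Each iteration adds one partial product, makes the accumulator even by
    conditionally adding m, and halves it; no division by m is ever needed.
    """
    if m % 2 == 0:
        raise ValueError("Montgomery reduction requires an odd modulus")
    acc = 0
    for i in range(n):
        if (a >> i) & 1:      # add b when bit i of a is set
            acc += b
        if acc & 1:           # adding m does not change the value mod m
            acc += m
        acc >>= 1             # exact division by 2
    if acc >= m:              # acc < 2m here, so one subtraction suffices
        acc -= m
    return acc
```

Multiplying the result by `2**n` modulo `m` recovers `a * b mod m`, which is how the sketch can be checked without computing a modular inverse.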

This paper addresses the problem of accelerating Computational Fluid Dynamics (CFD) applications, utilized by aeronautical engineers to create more efficient and aerodynamic designs. CFD applications require intensive floating point calculations, so they are usually executed on High-Performance Computing (HPC) systems. Here, we study the HW impleme...

This paper describes the FPGA implementation of a Decimal Floating Point (DFP) adder/subtractor. The design performs addition and subtraction on 64-bit operands that use the IEEE 754-2008 decimal encoding of DFP numbers and is based on a fully pipelined circuit. The design presents a novel hardware for pre-signal generation stage and an enhanced ve...

This paper presents experiences in applying modern functional verification to a configurable decimal floating point Adder / Subtractor core targeted to programmable logic. Despite its huge input space, a number of hard-to-verify corner cases are identified. Two different verification frameworks were applied in order to develop testbenches: OVM and...

Like every VLSI (Very Large Scale Integration) semiconductor, FPGAs (Field Programmable Gate Arrays) exhibit differences in their structure due to variations in their manufacturing process. Although these variations have been shown to influence the delay of a digital design, nothing has been said about variations in consumption...

The design of complex circuits such as SoCs presents two great challenges to designers. One is speeding up system functionality modeling, and the second is implementing the system in an architecture that meets performance and power consumption ...

Modular exponentiation with large modulus and exponent has been widely used in public key cryptosystems. Montgomery's modular multiplication algorithm is normally used since no trial division is necessary and the critical path is reduced by using carry-save addition (CSA). In this paper, the Montgomery multiplication is greatly optimized and archit...

The work reported in this paper is devoted to the FPGA implementation of decimal dividers. Two types of dividers are described. The first one implements a decimal non-restoring like algorithm and uses ripple-carry operators. For medium size operators it gives a good compromise between cost and latency. The second one implements an SRT-like algorith...

This paper first presents a study of the classical BCD adders from which a carry-chain type adder is redesigned to fit within Xilinx FPGA platforms. Some new concepts are presented to compute the P and G functions for carry-chain optimization purposes. Several alternative designs are presented. Then, attention is given to FPGA implementations...

This paper presents a number of approaches to implementing decimal multiplication algorithms on Xilinx FPGAs. A variety of algorithms for basic one-by-one-digit multiplication are proposed and FPGA implementations are presented. Later on, N-by-one-digit and N-by-M-digit multiplications are studied. Time and area results for sequential and combinationa...

This paper describes the design and implementation of a hardware module to calculate the decimal floating-point (DFP) multiplication compliant with the current IEEE-754-2008 standard. The design proposed is made up of independent stages: IEEE-754 coder / decoder, decimal multiplier and rounding. The decimal multiplication is based on a previously d...

This paper presents FPGA implementations of add/subtract algorithms for 10's complement BCD numbers. Carry-chain type circuits have been designed on 6-input LUT Xilinx Virtex-5 FPGA technologies. Some new concepts are reviewed to compute the P and G functions for carry-chain optimization purposes. Designs are presented with the corresponding time...

This paper presents a novel class of division algorithm that reduces computation delay by introducing more concurrency. The algorithm is suitable for fixed-point operands and divides in a radix r = 2^k, producing k bits at each iteration. The proposed digit recurrence algorithm has two different architectures, a first one...

This paper presents a study of the classical BCD adders from which a carry-chain type adder is redesigned to fit within the Xilinx FPGAs. Some new concepts are presented to compute the P and G functions for carry-chain optimization purposes. Several alternative designs are then presented with the corresponding time performances and area consumption...

In this paper we present a radix r = 2^k divider for fixed-point operands. The divider divides in a radix r = 2^k, producing k bits at each iteration. The proposed digit recurrence algorithm has two different architectures, a first one for general hardware implementation, and a second one optimized for configurable logic (FPGAs). Results show...

In this paper we present a radix r = 2^k divider for fixed-point operands. The divider divides in a radix r = 2^k, producing k bits at each iteration. The proposed digit recurrence algorithm has two different architectures, a first one for general hardware implementation, and a second one optimized for configurable logic (FPGAs). Results show a sp...

This work analyzes the energy and power consumption of a system on a chip (SoC) composed of the Microblaze soft processor and external SDRAM memory, for several data and instruction cache sizes, running different test programs. The system was synthesized on a Virtex-II Pro FPGA. The power consumption of...

Power consumption is one of the major design trade-offs in today's electronics. This paper explores the utility of some end-user low-power design (LPD) methods based on architectural and implementation modifications, for FPGA-based systems. The contribution of spurious transitions to the overall consumption is evidenced and main strategies for its re...

This paper describes algorithms and circuits for executing the point-multiplication operation in the particular case of the K-163 NIST-recommended curve. The circuits have been described in VHDL and implemented within the low cost Spartan-3 FPGA devices. Three point-multiplication algorithms are considered: the basic algorithm, the Montgomery algor...

This paper shows that, under certain conditions, digital arithmetic circuits do not satisfy the commutative property in terms of power consumption. That is, the power consumed by the operation A × B is different from that consumed by B × A. As a consequence, it is possible to obtain a power saving simply by permuting the circuit inputs, wherever any of the fo...

Compton cameras acting as electronic collimators improve the characteristics of nuclear medicine imaging, at an additional computational cost for the readout circuitry and digital control system. This paper presents the use of reconfigurable-logic-based hardware-software co-design to deal with such tasks as the ones related to the electronic co...

Pulse width modulation (PWM) is a very common technique used in different applications, from motor control and switching power converters (power supplies) to audio amplifiers and illumination systems. In some of those applications, the pulse frequency has increased so much in recent years that the resolution obtained with classical (counter) te...

Several algorithms for computing x mod m are presented, among others the reduction mod B^k - a, the pre-computation of B^(i·k) mod m, a generalized version of the Barrett algorithm and a modified version of the same Barrett algorithm. The four mentioned algorithms, as well as the classical integer non-restoring division algorithm, have been synthesize...

This paper summarizes the utility of some low-power design (LPD) methods based on architectural and implementation modifications, for FPGA-based systems. Power consumption is becoming one of the major design trade-offs in today's electronics. In this work, the contribution of spurious transitions to the overall consumption is evidenced and main strateg...

This chapter is devoted to arithmetic functions and operations other than the four basic ones. Number representation systems conversion procedures are first analyzed; they play a prominent role in arithmetic processes since a variety of algorithms are designed for a wide-ranging number of systems and/or bases (radices). Further on, this chapter rev...

Addition is used as a primitive operation for computing most arithmetic functions, so that it deserves particular attention. The classical pencil and paper algorithm implies the sequential computation of a set of carries, each of them depending on the preceding one. As a consequence the execution time of any program, or circuit, based on the classi...

This chapter is devoted to the hardware platforms available to implement the algorithms described in the preceding chapters. In the first section, some generalities in electronic system design are presented. The hardware platforms are then classified as instruction-set processor, ASIC based and reconfigurable hardware. A special emphasis is given t...

This chapter presents some topics in mathematics; it is intended to make this book self-contained. For further details the reader is referred to textbooks on Algebra, Mathematical Analysis, Number Theory, Finite Fields and Cryptography.

Chapter 6 shows that division is somewhat more intricate than the other three basic arithmetic operations. In the earliest computer applications, division was most often implemented as an assembly language program using the other arithmetic operations as primitives. Such was the case for multiplication too in elementary pioneer processors. The prog...

Finite field operations are used as computation primitives for executing numerous cryptographic algorithms, especially those related to the use of public keys (asymmetric cryptography). Classical examples are ciphering, deciphering, authentication and digital signature protocols based on RSA-type or elliptic curve algorithms. Other classical appli...

Basically, multiplication is a very simple operation as it most often reduces to multi-operand addition. Base-B is generally assumed, while Base-2 is extensively treated whenever the specificity of the binary system results in prominent features or allows significant algorithmic simplifications. Most multiplication algorithms share a common feature...

Arithmetic deals with operations on numbers: addition, subtraction, and so on. Thus, number representation is a fundamental topic in arithmetic. The choice of a number representation system has repercussions on the complexity of the algorithms executing the arithmetic operations, and thus on the costs and performances of the circuits which implemen...

According to speed/cost requirements, the technology at hand and a number of other circumstantial criteria such as expandability, user-configurable features, copy protection or power consumption, a great quantity of theoretical and practical multiplier implementations has been proposed in the literature. This chapter presents classic multipliers in...

This chapter is devoted to implementations of arithmetic functions reviewed in chapter 7. As in the preceding chapters, several alternatives will be proposed for the algorithms previously described. As mentioned before, the ever-increasing availability of fast low-cost memory blocks (ROM, RAM) motivates the development of affordable logical circuit...

This chapter deals with the synthesis of circuits implementing the main finite field operations: addition, subtraction, product, exponentiation and inversion. The reason why these operations should be implemented in hardware, instead of just being programmed for some target microprocessor, is the reduction of the computation time. This is particula...

This chapter is a summary of digital system architecture. The general problem dealt with is the synthesis of a digital circuit implementing some given algorithm, in such a way that a set of conditions related with the costs and the delays be satisfied. The costs to be taken into account could be the number of cells in the case of an application spe...

There are many data processing applications which use a large range of values and need a relatively high precision. In such cases, instead of encoding the information in the form of integers or fixed-point numbers, an alternative solution is a floating-point representation (chapter 3). In the first section of this chapter, a method is proposed for...

Algorithms for performing divisions over Z_p and GF(p^m) are described, the corresponding digital circuits are synthesized and conclusions about their computation times are drawn. The results of their implementation within field-programmable devices are given in the case of the most efficient ones.

Unpublished doctoral thesis. Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. January 2005. Bibliography at the end of each chapter.

This paper surveys different implementations of dividers on FPGA technology. Special attention is paid to ATP (area-time-power) trade-offs between restoring, non-restoring, and SRT divider algorithms for different operand widths, remainder representations, and radices. Main results show that SRT radix-2 presents the best ATP figure. In combinatio...

This paper describes different implementations of dividers on FPGA. Many division algorithms have been adapted for FPGA technology; nevertheless the peculiar characteristics of reconfigurable hardware devices deserve special attention to ensure efficient implementations. This paper presents comparative analyses of implementations targeting Virtex...

In this paper, the realization of low-power finite state machines (FSMs) on FPGAs using decomposition techniques is addressed. The original FSM is divided into two submachines using a probabilistic criterion. Only one submachine is active at a time, while the other is disabled to save power. Different deactivation alternatives and state encodin...

In this paper, the problem of state encoding of FPGA-based synchronous finite state machines (FSMs) for low-power is addressed. Four codification schemes have been studied: First, the usual binary encoding and the One-Hot approach suggested by the FPGA vendor; then, a code that minimizes the output logic; finally, the so-called Two-Hot code strateg...

In this paper, an activity estimation tool for FPGA-based combinational circuits is presented. The current version is able to estimate average activity for individual nodes. The tool is statistical-based, allowing the user to specify the tolerated error at a given confidence level. The tunable properties of the implemented technique have been caref...

In this work it is shown that, in terms of power consumption, digital arithmetic circuits do not always satisfy the commutative property. Therefore, it is possible to obtain an additional reduction in consumption simply by exchanging the circuit inputs. The phenomenon is reinforced when some of the following conditions are met: the amount of d...

High-speed digital designs exhibit a moderate logic depth, gate count, and wiring capacitance. These three characteristics are also essential conditions for a low-power operation. Therefore, blocks with lower area or higher bandwidth can be good candidates to have a moderated power figure. This fact opens a way to overcome the lack of low-power EDA...

The generator includes several tools that translate the initial problem specification into a specific circuit implementation. From the rule-based specification the generator produces a first computing scheme. By applying various transformations (lattice and arithmetic operation minimization, optimal register assignment, ...) the system pro...

Fuzzy controllers can be implemented with standard hardware, dedicated microcontrollers or application-specific circuits. The basic fuzzy controller architecture can use one or more arithmetic and logical units (ALUs). ALUs execute the following operations: fuzzification, defuzzification, lattice and arithmetic functions. The generator includes severa...

Three modular multiplication algorithms are described and compared: the so-called Multiply and Reduce, the Shift and Add, and finally, the Montgomery product. An estimation of the cost of their combinational implementation using the Xilinx FPGA family is calculated. Practical results in terms of area, delay, and power for both combinational and complet...

The literature shows that circuits with the maximum operating frequency consume the least energy. In this paper this proposition is verified experimentally for FPGAs, opening a possibility for power estimation and optimization, taking into account the lack of EDA tools for low-power design for programmable devices.

## Projects

Projects (5)

The project proposes AGILE monitoring of services as a paradigm for rapidly prototyping the monitoring needed to give visibility into a service. To design the monitoring components, it covers everything from capture and aggregation to root-cause diagnosis and interaction with the user, as well as issues related to network data analysis for business intelligence.

This project aims to create the first on-line multiplayer virtualized and real game. Its first strategic line is to improve the real-time coding capabilities of the LHE (Logarithmic Hop Encoding) encoder, which will enable a better quality of experience by reducing the delay between game and player and thus making the controls feel more responsive. Its second strategic line is the measurement of network quality, in order to adapt the service to the state of the communication network and to the accessing device.