R. Hermida

Complutense University of Madrid, Madrid, Spain

Publications (51) · 5.43 total impact

  • ABSTRACT: Variable-latency adders are attracting strong interest as a way of increasing performance at low cost. However, most of the literature focuses on achieving a good area-delay trade-off. In this paper we consider multispeculation as an alternative for designing adders with low energy consumption that still offer better performance than their non-speculative counterparts. Instead of introducing more logic to accelerate the computation, the adder is split into several fragments which operate in parallel and whose carry-in signals are provided by predictor units. On the one hand, the critical path of the module is shortened; on the other hand, the frequent useless glitches produced in the carry-propagation structure are diminished. Hence, this translates into an overall energy reduction. Several experiments have been performed with linear and logarithmic adders, and results show energy savings of up to 90% and 70%, respectively, while achieving an additional decrease in execution time. Furthermore, when utilized in whole datapaths with current control techniques, it is possible to reduce execution time by 24.5% (34% in the best case) and energy by 32% (48% in the best case) on average.
    Computer Design (ICCD), 2013 IEEE 31st International Conference on; 01/2013
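The split-and-predict mechanics described above can be sketched in a few lines of Python. The addition is fragmented, each fragment guesses its carry-in (here a deliberately naive always-zero guess, standing in for the paper's predictor units), and the recombined result is checked against the true sum. All names and parameters are illustrative; the real designs are hardware, and their predictors are far more accurate than this one.

```python
import random

def speculative_add(a, b, width=64, frag=16):
    """Split a width-bit addition into frag-bit fragments that add in
    parallel. Each fragment guesses its carry-in as 0 -- a deliberately
    naive stand-in for the paper's predictor units. Returns (total, hit),
    where hit is True when every guess matched the real carry, i.e. one
    short cycle would have sufficed."""
    mask = (1 << frag) - 1
    total, carry, hit = 0, 0, True
    for i in range(0, width, frag):
        fa, fb = (a >> i) & mask, (b >> i) & mask
        if i > 0 and carry != 0:      # the zero guess for this fragment missed
            hit = False
        s = fa + fb + carry           # correct partial sum (repair value)
        total |= (s & mask) << i
        carry = s >> frag
    return total, hit

random.seed(1)
hits = sum(speculative_add(random.getrandbits(64),
                           random.getrandbits(64))[1] for _ in range(10000))
print(f"zero-predictor all-hit rate: {hits / 10000:.2%}")
```

With random operands, each fragment boundary carries with probability about one half, so the all-hit rate of a zero guess is low; the point of real predictor units is precisely to push this rate close to 1.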
  • ABSTRACT: Addition is the key arithmetic operation in most digital circuits and processors; therefore, their performance and other parameters, such as area and power consumption, are highly dependent on the adders' features. In this paper, we present multispeculation as a way of increasing adders' performance with a low area penalty. In our proposed design, dividing an adder into several fragments and predicting the carry-in of each fragment enables computing every addition in at most two very short cycles, with 99% or higher probability. Furthermore, based on multispeculation principles, we propose a new strategy for implementing addition chains that hides most of the penalty cycles due to mispredictions while keeping the resource-sharing capabilities sought in high-level synthesis. Our results show that it is possible to build linear and logarithmic adders more than 4.7× and 1.7× faster than the non-speculative case, respectively. Moreover, this is achieved with a low area penalty (38% for linear adders) or even an area reduction (-8% for logarithmic adders). Finally, applying multispeculation principles to signal-processing benchmarks that use addition chains results in a 25% execution-time reduction, with an additional 3% decrease in datapath area with respect to implementations with logarithmic fast adders.
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 01/2012; 31(12):1817-1830. · 1.09 Impact Factor
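The 99% figure can be made plausible with a small Monte-Carlo sketch in Python. Here the carry into each fragment is guessed from the top few bits of the fragment below it, so a misprediction requires a long run of propagate bits. The fragment width, lookahead width, and the predictor itself are assumptions for illustration, not the paper's exact design.

```python
import random

def carry_hit_rate(trials=20000, width=64, frag=16, lookahead=8, seed=3):
    """Monte-Carlo estimate of how often a simple carry predictor gets every
    fragment boundary right. The carry into each fragment is guessed from
    the top `lookahead` bits of the fragment below it; the guess fails only
    when those bits are all-propagate and the ignored lower carry matters,
    so the failure probability shrinks roughly as 2**-lookahead."""
    rng = random.Random(seed)
    mask = (1 << frag) - 1
    sh = frag - lookahead
    hits = 0
    for _ in range(trials):
        a, b = rng.getrandbits(width), rng.getrandbits(width)
        carry, ok = 0, True
        for i in range(0, width - frag, frag):  # fragment boundaries
            fa, fb = (a >> i) & mask, (b >> i) & mask
            guess = ((fa >> sh) + (fb >> sh)) >> lookahead  # top bits only
            real = (fa + fb + carry) >> frag                # true carry-out
            ok &= (guess == real)
            carry = real
        hits += ok
    return hits / trials

print(f"all-boundaries hit rate: {carry_hit_rate():.3f}")
```

For these parameters a per-boundary miss needs the top eight bit-pairs to be all-propagate, so the all-boundaries hit rate lands in the same high-90s ballpark as the abstract's figure.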
  • ABSTRACT: Heterogeneous datapaths maximize the utilization of functional units (FUs) by customizing their widths individually through fragmentation of wide operands. In comparison, slices of the large functional units in a homogeneous datapath can spend many cycles not performing useful work. Various fragmentation techniques have demonstrated benefits in minimizing total functional-unit area. Upon a closer look at fragmentation techniques, we observe that the area savings achieved by heterogeneous datapaths can be traded off for power optimization. Our approach is to introduce functional-unit choices with power/area trade-offs for different fragmentation and allocation decisions, reducing power consumption while satisfying the area constraint imposed on the heterogeneous datapath. As the low-power FUs in the literature incur an area penalty, a methodology must be developed to introduce them into the HLS flow while complying with the area constraint. We propose allocation and module-selection algorithms that pursue a trade-off between area and power consumption for fragmented datapaths under a total area constraint. Results show that it is possible to reduce power by 37% on average (49% in the best case). Moreover, latency and cycle time remain equal or nearly equal to the baseline case, which leads to an energy reduction as well.
    Design, Automation & Test in Europe Conference & Exhibition (DATE), 2011; 01/2011
  • ABSTRACT: Speculative functional units (SFUs) enable a new execution paradigm for high-level synthesis (HLS). SFUs are arithmetic functional units that operate using a predictor for the carry signal, which reduces the critical path delay. The performance of these units is determined by the success of the carry-value prediction, i.e. the hit rate of the predictor. Hence, SFUs reduce the critical path at low cost, but they cannot be used in HLS with current techniques: hardware support is needed to recover from mispredictions of the carry signals. In this paper, we present techniques for designing a datapath controller for the seamless deployment of SFUs in HLS. We have developed two techniques for this goal. The first approach stops the execution of the entire datapath on each misprediction and resumes it once the correct value of the carry is known. The second approach decouples the functional unit suffering the misprediction from the rest of the datapath; hence, it allows the remaining SFUs to carry on execution and be at different scheduling states at different times. Experiments show that it is possible to reduce execution time by as much as 38%, and by 33% on average.
    Design, Automation & Test in Europe Conference & Exhibition (DATE), 2010; 01/2010
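The gap between the two recovery policies can be illustrated with a toy cycle-count model in Python. The unit count, miss sets, and one-cycle repair penalty are assumptions for illustration, not the paper's datapaths: stalling the whole datapath pays for every misprediction anywhere, while decoupling pays only for the worst single unit's mispredictions.

```python
import random

def schedule_cycles(misses_per_unit, steps, policy):
    """Toy cycle-count model for the two recovery policies (assumptions:
    each misprediction costs exactly one repair cycle, and units have no
    mutual data dependences). 'stall' freezes the whole datapath on any
    miss, so every penalty anywhere adds up; 'decoupled' lets the other
    units run ahead, so only the single worst unit's penalties extend
    the schedule."""
    if policy == "stall":
        extra = sum(len(m) for m in misses_per_unit)
    elif policy == "decoupled":
        extra = max((len(m) for m in misses_per_unit), default=0)
    else:
        raise ValueError(policy)
    return steps + extra

random.seed(7)
units, steps, miss_rate = 4, 100, 0.05
misses = [{s for s in range(steps) if random.random() < miss_rate}
          for _ in range(units)]
print("stall-all :", schedule_cycles(misses, steps, "stall"))
print("decoupled :", schedule_cycles(misses, steps, "decoupled"))
```

In this idealized model the decoupled policy can never be worse than stalling, and its advantage grows with the number of independent units.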
  • ABSTRACT: Conventional high-level synthesis algorithms usually employ multi-cycle operators to reduce the cycle length and thus improve circuit performance. These operators need several cycles to execute one operation, but the entire functional unit is not in use during any single cycle. Additionally, the execution of operations over wider multi-cycle operators is unfeasible if their results must be available in fewer cycles than the functional-unit delay. This forces the addition of new functional resources to the datapath even if multi-cycle operators are idle when the execution of the operation begins. In this paper a new design technique to overcome the restricted reusability of multi-cycle operators is presented. It reduces the area of these functional units by allowing their internal reuse while executing one operation. It also expands the possibilities of hardware sharing, as it allows the partial use of multi-cycle operators to calculate narrower operations faster than the functional-unit delay. This technique is applied as an optimization phase at the end of the high-level synthesis process and can optimize the circuits synthesized by any high-level synthesis tool.
    Design, Automation & Test in Europe Conference & Exhibition, 2007. DATE '07; 01/2007
  • ABSTRACT: This paper addresses the exploitation of multi-context reconfigurable architectures to handle a variety of interactive multimedia services. One of the main features of these applications is their changing behavior depending on the runtime scenario, which turns configuration management into a key issue. In this work, a configuration scheduler for these applications is proposed. We describe the target applications at task (kernel) granularity using data-flow graphs in which some kernels are executed conditionally depending on runtime conditions. After testing a condition that decides the next kernel to execute, its corresponding configurations and input data must be loaded into the on-chip memory before its execution starts, producing a computation stall. Our configuration scheduler minimizes these computation stalls and reduces the application's latency by loading configurations before they are needed. Experimental results show that interactive and synthetic applications meet their real-time constraints.
    Field Programmable Logic and Applications, 2006. FPL '06. International Conference on; 09/2006
  • ABSTRACT: Conventional scheduling algorithms try to balance the number of operations of each type executed per cycle. However, in most cases a uniform distribution is not reachable, and thus some hardware (HW) waste appears. This situation becomes worse when heterogeneous specifications (those formed by operations with different data formats and widths) are synthesized. Our proposal is an innovative bit-level algorithm able to minimize this HW waste. In order to obtain uniform distributions of the computational cost of operations among cycles, it successively transforms specification operations into sets of smaller ones, which are then scheduled independently. As a consequence, some specification operations may be executed during a set of non-consecutive cycles and over several functional units. In combination with allocation algorithms able to guarantee the bit-level reuse of HW resources, our approach produces circuits with substantially smaller area than conventional implementations. Due to the fragmentation of operations, in the proposed implementations the type, number, and width of HW resources are, in general, independent of the type, number, and width of both specification operations and variables. Additionally, the clock-cycle length is also reduced in most circuits.
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 02/2006; · 1.09 Impact Factor
  • ABSTRACT: In this paper, we introduce a new hardware/software approach to reduce the energy of the shared register file in upcoming embedded architectures with several VLIW processors. The paper includes a set of architectural extensions and special loop-unrolling techniques for the compilers of MPSoC platforms. This combined hardware/software support reduces the energy consumed in the register file of MPSoC architectures by up to 60% without introducing performance penalties.
    Innovative Architecture for Future Generation High-Performance Processors and Systems, 2005; 02/2005
  • ABSTRACT: Early scheduling algorithms usually adjusted the clock-cycle duration to the execution time of the slowest operation. This resulted in large slack times wasted in the cycles executing faster operations. To reduce these wasted times, multi-cycle and chaining techniques have been employed. While these techniques have produced successful designs, their effectiveness is often limited by the area increment that may derive from chaining and the extra latency that may derive from multi-cycling. In this paper we present an optimization method that solves the time-constrained scheduling problem by transforming behavioural specifications into new ones whose subsequent synthesis substantially improves circuit performance. Our proposal breaks up some of the specification operations, allowing their execution during several, possibly non-consecutive, cycles, as well as the calculation of several data-dependent operation fragments in the same cycle. To do so, it takes into account the circuit latency and the execution time of every specification operation. Experimental results show that circuits obtained from the optimized specification are on average 60% faster than those synthesized from the original specification, with only slight increments in circuit area.
    Design, Automation and Test in Europe, 2005. Proceedings; 01/2005
  • ABSTRACT: This paper presents a new technique to improve the efficiency of data scheduling for multi-context reconfigurable architectures targeting multimedia and DSP applications. The main goal of this technique is to diminish application energy consumption. Two levels of on-chip data storage are assumed in the reconfigurable architecture. The data scheduler attempts to exploit this storage optimally by deciding in which on-chip memory the data have to be stored in order to reduce energy consumption. We also show that suitable data scheduling can decrease the energy required to implement the dynamic reconfiguration of the system.
    12/2004: pages 145-155;
  • ABSTRACT: Reconfigurable architectures have become increasingly important in recent years. We present an approach to the problem of executing interactive 3D-graphics applications on these architectures. Hierarchical trees are usually employed to reduce the amount of data processed, thereby diminishing execution time. We have developed a mapping scheme that parallelizes tree execution on a SIMD reconfigurable architecture, considerably reducing the time penalty caused by having to execute different tree nodes in SIMD fashion. We have also developed a technique that achieves efficient hierarchical-tree execution by taking decisions at run time, and that exploits data coherence to reduce execution time further. Experimental results show high performance and efficient resource utilization on the tested applications.
    Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004. International Conference on; 10/2004
  • M.C. Molina, J.M. Mendias, R. Hermida
    ABSTRACT: Conventional synthesis algorithms produce schedules balanced in the number of operations executed per cycle, and allocate operations to functional units of their same type and width. In most implementations some hardware waste appears, because some functional units are not used in all clock cycles. This waste is even greater when multiple-precision specifications, i.e. those formed by operations of different widths, are synthesised, because some bits of the results produced must be discarded in some cycles. The allocation algorithm proposed minimises this waste by increasing the reuse of hardware resources. Prior to the allocation, it extracts the common operative kernel of specification operations and successively breaks down operations into sets of smaller ones such that functional-unit reuse is possible in other parts of the schedule. These transformations produce new operations whose types and widths may differ from the original ones; consequently, some specification operations are finally executed over a set of functional units linked by some glue logic. Experimental results show that the implementations proposed by our algorithm need considerably smaller area than those proposed by conventional allocation algorithms. Also, due to the operation transformations, the type, number, and width of the hardware resources in the datapaths produced may differ from the type, number, and width of the specification operations and variables. Additionally, an analytical method to estimate the area potentially saved by our algorithm in comparison to conventional ones is developed.
    IEE Proceedings - Computers and Digital Techniques 10/2003;
  • M.C. Molina, J.M. Mendias, R. Hermida
    ABSTRACT: Conventional synthesis algorithms perform the allocation of heterogeneous specifications (those formed by operations of different types and widths) by binding operations to functional units of their same type and width. Thus, some hardware waste appears in most of the implementations obtained. This paper proposes an allocation algorithm able to minimize this hardware waste by fragmenting operations into their common operative kernel, which may then be executed over the same functional units. Hence, fragmented operations are executed over sets of several linked hardware resources.
    03/2003;
  • M.C. Molina, J.M. Mendias, R. Hermida
    ABSTRACT: Conventional synthesis algorithms perform the allocation of heterogeneous specifications (those formed by operations of different types and widths) by binding operations to functional units of their same type and width. Thus, some hardware waste appears in most of the implementations obtained. This paper proposes an allocation algorithm able to minimize this hardware waste by fragmenting operations into their common operative kernel, which may then be executed over the same functional units. Hence, fragmented operations are executed over sets of several linked hardware resources. The implementations proposed by our algorithm need considerably smaller area than those proposed by conventional allocation algorithms, and, due to operation fragmentation, the type, number, and width of the hardware resources in the datapaths produced are independent of the type, number, and width of the specification operations and variables.
    Design, Automation and Test in Europe Conference and Exhibition, 2003; 02/2003
  • ABSTRACT: Not available.
    IEEE Circuits and Systems Magazine 01/2003; 2(2):55-55. · 1.67 Impact Factor
  • ABSTRACT: Multi-FPGA systems (MFSs) are used for a great variety of applications, for instance dynamically reconfigurable hardware applications, digital-circuit emulation, and numerical computation, and a great variety of boards is available for MFS implementation. In this paper a methodology for MFS design is presented. The techniques used are evolutionary programs, and they solve all of the design tasks (partitioning, placement, and routing). First, a hybrid compact genetic algorithm solves the partitioning problem; then, genetic programming is used to obtain a solution for the other two tasks.
    12/2002: pages 207-207;
  • M.C. Molina, J.M. Mendias, R. Hermida
    ABSTRACT: This paper presents a heuristic method to perform the high-level synthesis of multiple-precision specifications. The scheduling is based on balancing the number of bits calculated per cycle, and the allocation on the bit-level reuse of hardware resources. The implementations obtained are multiple-precision datapaths independent of the number and widths of the specification operations. As a result, impressive area savings are achieved in comparison with the implementations of conventional algorithms.
    06/2002;
  • ABSTRACT: The property called mutual exclusiveness, responsible for the degree of conditional reuse achievable after a high-level synthesis (HLS) process, is intrinsic to a system's behavior, but it is sometimes only partially reflected in the actual description written by a designer. Our algorithm transforms the input description to exploit the maximum conditional reuse of the behavior, independently of the description style, allowing HLS tools to obtain circuits with less area.
    03/2002;
  • ABSTRACT: Genetic algorithms (GAs) are stochastic optimization heuristics in which the search of the solution space is carried out by imitating the population genetics of Darwin's theory of evolution. The compact genetic algorithm (cGA) does not manage a population of solutions but only mimics its existence. The combination of genetic and local-search heuristics has been shown to solve some optimization problems more efficiently than a single GA or cGA. The multi-FPGA system design flow has three major tasks: partitioning, placement, and routing. In this paper we present a new hybrid algorithm that exploits a cGA to generate high-quality partitioning and placement solutions and then improves them by means of a local-search heuristic.
    Digital System Design, 2002. Proceedings. Euromicro Symposium on; 02/2002
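The cGA at the heart of this hybrid can be sketched compactly in Python. OneMax as the toy fitness is our choice for illustration, not the paper's partitioning objective: a probability vector replaces the population, and each tournament between two sampled individuals nudges that vector toward the winner.

```python
import random

def cga(fitness, n_bits, pop_size=50, steps=2000, seed=0):
    """Compact genetic algorithm: no population is stored, only a vector p
    where p[i] is the probability that bit i is 1. Each step samples two
    individuals, lets them compete on fitness, and nudges p toward the
    winner by 1/pop_size, mimicking a steady-state GA of that size."""
    rng = random.Random(seed)
    p = [0.5] * n_bits
    sample = lambda: [1 if rng.random() < pi else 0 for pi in p]
    for _ in range(steps):
        a, b = sample(), sample()
        win, lose = (a, b) if fitness(a) >= fitness(b) else (b, a)
        for i in range(n_bits):
            if win[i] != lose[i]:
                delta = 1 / pop_size if win[i] else -1 / pop_size
                p[i] = min(1.0, max(0.0, p[i] + delta))
    return [1 if pi >= 0.5 else 0 for pi in p]

# OneMax: maximize the number of 1 bits (a toy stand-in for a
# partitioning/placement cost function)
best = cga(sum, 32)
print(sum(best), "of 32 bits set")
```

A hybrid in the spirit of the paper would hand each promising cGA sample to a local-search routine before the tournament; only the probability vector is carried between steps, which is what makes the cGA memory-cheap.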
  • ABSTRACT: In this paper, the placement problem on FPGAs is addressed using thermodynamic combinatorial optimization (TCO), a new combinatorial optimization method based on both thermodynamics and information theory. In TCO, two kinds of processes are considered: microstate and macrostate transformations. Applying Shannon's definition of entropy to reversible microstate transformations, a probability of acceptance based on Fermi-Dirac statistics is derived. On the other hand, applying thermodynamic laws to reversible macrostate transformations, an efficient annealing schedule is provided. TCO has been compared with simulated annealing (SA) on a set of benchmark circuits for the FPGA placement problem, achieving large time reductions with respect to SA while providing interesting adaptive properties.
    Design, Automation and Test in Europe Conference and Exhibition, 2002. Proceedings; 02/2002
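A minimal Python sketch of the Fermi-Dirac acceptance rule follows; the geometric cooling schedule and the toy one-dimensional cost are our assumptions (the paper derives its own entropy-based schedule), and the contrast with the Metropolis rule p = min(1, exp(-ΔE/T)) of classical SA is the point of interest.

```python
import math
import random

def fermi_dirac_accept(delta_e, temp):
    """Acceptance probability from Fermi-Dirac statistics, as in TCO:
    p = 1 / (1 + exp(dE/T)). Unlike the Metropolis rule of classical SA,
    even an improving move is accepted with probability below 1, and a
    zero-cost move is accepted with probability exactly 1/2."""
    x = delta_e / temp
    if x > 700:        # math.exp would overflow; acceptance is ~0
        return 0.0
    if x < -700:
        return 1.0
    return 1.0 / (1.0 + math.exp(x))

def anneal(cost, neighbor, x0, t0=10.0, alpha=0.95, iters=3000, seed=0):
    """Annealing loop with the Fermi-Dirac rule. The geometric cooling
    (multiply t by alpha every 10 moves) is our simplification, not the
    paper's entropy-based schedule."""
    rng = random.Random(seed)
    x, t = x0, t0
    for k in range(iters):
        y = neighbor(x, rng)
        if rng.random() < fermi_dirac_accept(cost(y) - cost(x), t):
            x = y
        if k % 10 == 9:
            t *= alpha
    return x

# Toy 1-D "placement": slide an integer toward the minimum of (x - 7)^2
best = anneal(lambda x: (x - 7) ** 2,
              lambda x, r: x + r.choice([-1, 1]), x0=100)
print("best:", best)
```

Swapping `fermi_dirac_accept` for the Metropolis rule in the same loop recovers plain SA, which is what makes the two methods directly comparable on placement benchmarks.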