José Duato

Universitat Politècnica de València, Valencia, Spain

Publications (547) · 182.45 Total impact

  • No preview · Article · Feb 2016 · The Journal of Supercomputing
  • ABSTRACT: The torus is one of the preferred topologies for the interconnection network of high-performance clusters and supercomputers. Cost and scalability are among the properties that make the torus suitable for systems with a large number of nodes. The 3D torus is the most widespread variant due to its excellent nearest-neighbor communication. However, some recent supercomputers have been built using torus networks with five or six dimensions. To build an n-dimensional torus, 2n ports per node are needed, which can be provided either by a single card or by several cards per node. In the latter case, there are multiple ways of assigning the dimension and direction of the card ports. In previous work we defined and characterized the 3D Twin (3DT) torus, which uses two four-port cards per node. In this paper we extend that previous work to define the n-dimensional Twin (nDT) torus topology, and we formally obtain the optimal port configuration when two (n+1)-port cards are used per node. Moreover, we explain how the deadlock problem can appear and propose a simple solution. Finally, we include evaluation results showing the performance increase when an nDT torus is used instead of a torus with fewer dimensions and the same computational resources.
    No preview · Article · Oct 2015 · IEEE Transactions on Computers
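The 2n-ports-per-node arithmetic behind the Twin torus construction can be sketched in a few lines of Python (a toy illustration of the standard n-dimensional torus definition, not the paper's code; function names are invented):

```python
def torus_neighbors(coord, dims):
    """Return the 2n neighbor coordinates of a node in an n-dimensional torus.

    coord: tuple of n coordinates; dims: tuple of n dimension sizes.
    Each dimension contributes two links (+1 and -1, with wraparound),
    which is why a node in an n-D torus needs 2n network ports.
    """
    neighbors = []
    for d in range(len(dims)):
        for step in (1, -1):
            nb = list(coord)
            nb[d] = (nb[d] + step) % dims[d]
            neighbors.append(tuple(nb))
    return neighbors

# A node in a 3D 4x4x4 torus has 6 neighbors; two four-port cards
# cover these 6 links plus the 2-port inter-card link (6 + 2 = 8 ports).
print(len(torus_neighbors((0, 0, 0), (4, 4, 4))))  # 6
```

With two (n+1)-port cards per node, the 2(n+1) available ports split exactly into the 2n neighbor links plus two ports interconnecting the cards.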
  • ABSTRACT: Motivation: DNA methylation analysis suffers from very long processing times, since the advent of Next-Generation Sequencers (NGS) has shifted the bottleneck of genomic studies from the sequencers that obtain the DNA samples to the software that analyzes these samples. Existing methylation analysis software scales efficiently with neither the size of the dataset nor the length of the reads to be analyzed. Since sequencers are expected to provide longer and longer reads in the near future, efficient and scalable methylation software should be developed. Results: We present a new software tool, called HPG-Methyl, which efficiently maps bisulfite sequencing reads onto DNA and analyzes DNA methylation. The strategy used by this software consists of leveraging the speed of the Burrows-Wheeler Transform to map a large number of DNA fragments (reads) rapidly, together with the accuracy of the Smith-Waterman algorithm, which is employed exclusively for the most ambiguous and shortest reads. Experimental results on platforms with Intel multicore processors show that HPG-Methyl significantly outperforms state-of-the-art software such as Bismark, BS-Seeker or BSMAP in both execution time and sensitivity, particularly for long bisulfite reads. Availability: Software in the form of C libraries and functions, together with instructions to compile and execute it, available by sftp (password "anonymous").
    Full-text · Article · Oct 2015 · Bioinformatics
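The Smith-Waterman algorithm mentioned above is a classic dynamic-programming local aligner; a minimal score-only sketch (generic textbook version, not HPG-Methyl's optimized C implementation) looks like this:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score via dynamic programming.

    Returns the best local alignment score between sequences a and b.
    Cells are clamped at 0, so the alignment may start and end anywhere.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# Two identical 4-base reads align with score 4 * match = 8.
print(smith_waterman("ACGT", "ACGT"))  # 8
```

Because this is O(len(a) x len(b)) per read, reserving it for only the shortest and most ambiguous reads, as the abstract describes, is what keeps the overall pipeline fast.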
  • ABSTRACT: The remarkable evolution of graphics processing units (GPUs), together with their good cost/performance ratio and their excellent performance/energy ratio, has made computing based on these devices commonplace. However, although GPUs offer numerous advantages, they also have some drawbacks. One of them is that, in general, they show low utilization. In order to increase the utilization of these accelerators, several GPU virtualization frameworks have been created. Among them, rCUDA stands out as the most modern and the one providing the best performance. rCUDA allows a process running on one cluster node to use remote GPUs located in other nodes. Nevertheless, the GPU virtualization framework must be accompanied by the cluster's job scheduler, such as SLURM, which needs to be extended so that it can properly schedule the use of remote GPUs. In this work we present a study in which we extend SLURM to apply different policies for assigning remote GPUs to jobs. The performance evaluation was carried out on a cluster of 9 nodes interconnected with InfiniBand FDR, each node equipped with an NVIDIA Tesla K20 GPU.
    Full-text · Conference Paper · Sep 2015
  • ABSTRACT: DRAM technology requires refresh operations to avoid data loss caused by capacitor discharge over time. These operations consume a significant amount of dynamic energy and degrade performance, since memory banks cannot be accessed while they are being refreshed. Some recent work has focused on reducing the number of refresh operations in off-chip DRAM. However, this problem also arises in on-chip eDRAM memories, which are being used in some current low-level cache designs. The refresh energy due to these operations can exceed half of the dynamic energy consumed by the entire cache. This work focuses on reducing the number of refresh operations in an on-chip low-level eDRAM cache design optimized to minimize energy consumption. To this end, this paper introduces a selective, distributed refresh policy that exploits data-reuse information to decide whether a block should be refreshed. Experimental results show that, compared to a conventional eDRAM cache, the proposed eDRAM cache achieves refresh energy savings of up to 72% on average, while the total energy reduction across the whole memory hierarchy is 30% on average. Moreover, these benefits are achieved with minimal impact on performance (1.2% on average).
    Full-text · Conference Paper · Sep 2015
  • ABSTRACT: Large cluster-based machines require efficient high-performance interconnection networks, and routing is a key design issue of such networks. Adaptive routing usually outperforms deterministic routing at the expense of introducing out-of-order packet delivery. Many commodity interconnects for clusters are based on fat-trees. The adaptive routing algorithm commonly used in fat-trees is composed of a fully adaptive upward subpath followed by a deterministic downward subpath. As the latter is determined by the former, choosing the most adequate upward path for each packet is critical in fat-trees to achieve good performance. In this paper, we present a mechanism for selecting the upward path in fat-trees that enables optimal use of the available network resources to achieve high network throughput. The proposed path selection is destination-based, which reduces the head-of-line blocking effect. Indeed, the proposed mechanism can be used either as a selection function (the provided path is used as the preferred one) or as a deterministic routing algorithm (the path is the only possible one). The results show that the resulting selection function outperforms any other known selection function. Moreover, the proposed deterministic routing algorithm can achieve a similar, or even higher, level of performance than adaptive routing, while providing in-order packet delivery and a simpler switch implementation.
    No preview · Article · Jul 2015 · The Journal of Supercomputing
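Destination-based upward path selection in fat-trees is often sketched as a d-mod-k style rule, where each switch stage uses a different digit of the destination identifier. The following is a generic illustration of that family of schemes, not the paper's exact selection function:

```python
def up_port(destination, num_up_ports, stage):
    """Destination-based upward port selection for a fat-tree switch.

    The chosen upward port depends only on the packet's destination and
    the switch stage, so all packets to a given destination share one
    upward path, which reduces head-of-line blocking between flows and
    makes the scheme usable as a deterministic routing algorithm.
    """
    return (destination // (num_up_ports ** stage)) % num_up_ports

# Consecutive destinations at stage 0 of a switch with 4 upward ports
# spread evenly across those ports.
print([up_port(d, 4, 0) for d in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```

The even spread across upward ports is what lets such a deterministic rule match, or beat, adaptive routing while preserving in-order delivery.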
  • ABSTRACT: Nowadays, real-time embedded applications have to cope with an increasing demand for functionality, which requires increasing processing capabilities. To this end, real-time systems are being implemented on top of high-performance multicore processors that run multithreaded periodic workloads by allocating threads to individual cores. In addition, to improve both performance and energy savings, the industry is introducing new multicore designs such as ARM's big.LITTLE, which include heterogeneous cores in the same package.
    No preview · Article · Jul 2015 · Future Generation Computer Systems
  • ABSTRACT: Combined switches are an attractive option for building high-radix switches. The idea basically consists in combining several current smaller single-chip switches to obtain switches with a greater number of ports. The performance of these switches varies depending on their internal configuration, because the subnetwork interconnecting the internal switches can become a bottleneck if an inappropriate internal configuration is chosen. In this paper, we show how to obtain the optimal internal switch configuration by applying a specific methodology, and we highlight the impact of the internal switch configuration on network performance by means of case studies.
    No preview · Article · Jul 2015 · The Journal of Supercomputing
  • ABSTRACT: In recent years, embedded Dynamic Random-Access Memory (eDRAM) technology has been implemented in last-level caches due to its low leakage energy consumption and high density. However, the fact that eDRAM presents slower access times than Static RAM (SRAM) technology has prevented its inclusion in higher levels of the cache hierarchy. This paper proposes to mingle SRAM and eDRAM banks within the data array of second-level (L2) caches. The main goal is to achieve the best trade-off among performance, energy, and area. To this end, two main directions have been followed. First, this paper explores the optimal percentage of banks for each technology. Second, the cache controller is redesigned to deal with performance and energy. Performance is addressed by keeping the most likely accessed blocks in fast SRAM banks. In addition, energy savings are further enhanced by avoiding unnecessary destructive reads of eDRAM blocks. Experimental results show that, compared to a conventional SRAM L2 cache, a hybrid approach requiring similar or even lower area speeds up performance on average by 5.9%, while providing total energy savings of 32%. For a 45 nm technology node, the energy-delay-area product confirms that a hybrid cache is a better design than a conventional SRAM cache regardless of the number of eDRAM banks, and also better than a conventional eDRAM cache when the number of SRAM banks is an eighth of the total number of cache banks.
    Full-text · Article · Jun 2015 · IEEE Transactions on Computers
  • ABSTRACT: Current high-performance platforms such as datacenters or High-Performance Computing systems rely on high-speed interconnection networks able to cope with the ever-increasing communication requirements of modern applications. In particular, in high-performance systems that must offer differentiated services to applications involving traffic prioritization, it is almost mandatory that the interconnection network provides some type of Quality-of-Service (QoS) and Congestion-Management mechanism in order to achieve the required network performance. Most current QoS and Congestion-Management mechanisms for high-speed interconnects are based on using the same kind of resources, but with different criteria, resulting in disjoint types of mechanisms. By contrast, we propose in this paper a novel, straightforward solution that leverages the resources already available in InfiniBand components (basically Service Levels and Virtual Lanes) to provide both QoS and Congestion Management at the same time. This proposal is called CHADS (Combined HoL-blocking Avoidance and Differentiated Services), and it can be applied to any network topology. From the results shown in this paper for networks configured with the novel, cost-efficient KNS hybrid topology, we conclude that CHADS is more efficient than other schemes in reducing the interference among packet flows that have the same or different priorities.
    No preview · Article · Jun 2015
  • ABSTRACT: High Performance Computing usually leverages messaging libraries such as MPI, GASNet, or OpenSHMEM, among others, to exchange data among processes in large-scale clusters. Furthermore, these libraries make use of specialized low-level network layers to achieve as much performance as possible from hardware interconnects such as InfiniBand or 40 Gb Ethernet. EXTOLL is an emerging network targeted at high-performance clusters. Specialized low-level network layers require some kind of flow control to prevent buffer overflows at the receiver side. In this paper we present a new end-to-end flow control mechanism that dynamically adapts, at execution time, the buffer resources used by a process according to the communication pattern of the parallel application and the varying activity among communicating peers. Tests carried out on a 64-node 1024-core EXTOLL cluster show that our new dynamic flow control mechanism presents very low overhead with extraordinarily high buffer efficiency, as overall buffer resources are reduced by 4x with respect to the number of buffers required by a static flow control protocol achieving similar low overhead levels.
    No preview · Article · Mar 2015 · Parallel Computing
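The core idea of end-to-end, credit-based flow control with per-peer buffer adaptation can be sketched as follows. This is a toy model under invented names; the adapt() heuristic is illustrative and does not reproduce the paper's actual adaptation policy:

```python
class DynamicFlowControl:
    """End-to-end credit-based flow control with per-peer buffer adaptation.

    A sender may transmit to a peer only while it holds a credit for that
    peer, which guarantees the receiver's buffers never overflow. Credits
    are returned when the receiver drains a buffer. A pool of unassigned
    buffers lets busy peers grow their credit allowance at run time.
    """

    def __init__(self, peers, initial_credits=2, pool=16):
        self.credits = {p: initial_credits for p in peers}
        self.pool = pool - initial_credits * len(peers)  # unassigned buffers
        self.activity = {p: 0 for p in peers}

    def send(self, peer):
        if self.credits[peer] == 0:
            return False            # would overflow the receiver: block
        self.credits[peer] -= 1
        self.activity[peer] += 1
        return True

    def ack(self, peer):
        self.credits[peer] += 1     # receiver drained one buffer

    def adapt(self, peer):
        # Toy policy: grant a pooled buffer to a peer that has been busy.
        if self.activity[peer] > 4 and self.pool > 0:
            self.pool -= 1
            self.credits[peer] += 1

fc = DynamicFlowControl(["a", "b"])
print(fc.send("a"), fc.send("a"), fc.send("a"))  # True True False
```

Starting each peer with few credits and growing only the active ones is what lets total buffer usage shrink relative to a static scheme that must provision every peer for its worst case.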
  • ABSTRACT: DRAM technology requires refresh operations to be performed in order to avoid data loss due to capacitance leakage. Refresh operations consume a significant amount of dynamic energy, which increases with the storage capacity. To reduce this energy, prior work has focused on reducing refreshes in off-chip memories. However, this problem also appears in on-chip eDRAM memories implemented in current low-level caches. The refresh energy can dominate the dynamic consumption when a high percentage of the chip area is devoted to eDRAM cache structures. Replacement algorithms for high-associativity low-level caches select the victim block while avoiding blocks likely to be reused soon. This paper combines the state-of-the-art MRUT replacement algorithm with a novel refresh policy, in which refresh operations are performed based on information produced by the replacement algorithm. The proposed refresh policy is implemented on top of an energy-aware eDRAM cache architecture that implements bank prediction and swap operations to save energy. Experimental results show that, compared to a conventional eDRAM design, the proposed energy-aware cache achieves refresh energy savings of 72%. Considering the consumption of the entire on-chip memory hierarchy, the overall energy savings are 30%. These benefits come with minimal impact on performance (1.2%) and a small area overhead (0.4%).
    Full-text · Article · Feb 2015 · Microprocessors and Microsystems
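A selective refresh policy of the kind described above can be sketched as a simple per-block decision: refresh an expiring block only if the replacement state predicts reuse, otherwise let it die. This is a hand-drawn illustration with invented field names, not the paper's mechanism:

```python
def refresh_action(block, now, retention_time=40_000):
    """Decide what to do with an eDRAM block whose charge is decaying.

    If the block is about to exceed its retention time, refresh it only
    when the replacement algorithm's reuse information says it is likely
    to be accessed again; otherwise invalidate it, saving refresh energy.
    """
    expiring = now - block["last_write_or_refresh"] >= retention_time
    if not expiring:
        return "none"
    return "refresh" if block["likely_reused"] else "invalidate"

blocks = [
    {"last_write_or_refresh": 0, "likely_reused": True},
    {"last_write_or_refresh": 0, "likely_reused": False},
    {"last_write_or_refresh": 39_000, "likely_reused": False},
]
print([refresh_action(b, 40_000) for b in blocks])
# ['refresh', 'invalidate', 'none']
```

Skipping refreshes for dead blocks is where the reported refresh-energy savings come from; invalidating instead of refreshing is safe because a clean block can always be refetched from the next cache level.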
  • ABSTRACT: Buffer resource minimization plays an important role in achieving power-efficient NoC designs. At the same time, advanced switching mechanisms like virtual cut-through (VCT) are appealing due to their inherent benefits (less network contention, higher throughput, and simpler broadcast implementations). Moreover, adaptive routing algorithms exploit the inherent bandwidth of the network, providing higher throughput. In this paper, we propose a novel flow control mechanism, referred to as type-based flow control (TBFC), and a new adaptive routing algorithm for NoCs. First, the new flow control strategy allows using minimal buffer resources while still supporting VCT. Then, on top of TBFC, we implement the safe/unsafe routing algorithm (SUR). This algorithm achieves higher performance than previous proposals because it properly balances the utilization of input port buffers. Results show the same performance as fully adaptive routing algorithms while using fewer resources. When resources are matched, SUR achieves up to 20% throughput improvement.
    No preview · Article · Jan 2015
  • ABSTRACT: The memory hierarchy plays a critical role in the performance of current chip multiprocessors. Main memory is shared by all the running processes, which can cause important bandwidth contention. In addition, when the processor implements SMT cores, the L1 bandwidth becomes shared among the threads running on each core. In such a case, bandwidth-aware schedulers emerge as an interesting approach to mitigate the contention. This work investigates the performance degradation that processes suffer due to memory bandwidth constraints. Experiments show that main memory and L1 bandwidth contention negatively impact process performance; in both cases, performance degradation can grow up to 40% for some applications. To deal with contention, we devise a scheduling algorithm that consists of two policies guided by the bandwidth consumption gathered at runtime. The process selection policy balances the number of memory requests over the execution time to address main memory bandwidth contention. The process allocation policy tackles L1 bandwidth contention by balancing the L1 accesses among the L1 caches. The proposal is evaluated on a Xeon E5645 platform using a wide set of multiprogrammed workloads, achieving performance benefits of up to 6.7% with respect to the Linux scheduler.
    No preview · Article · Jan 2015 · IEEE Transactions on Computers
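The allocation policy described above, balancing measured L1 bandwidth across caches, can be sketched as a greedy longest-processing-time assignment. This is a simplified stand-in for the paper's policy, with invented names, using runtime bandwidth counters as input:

```python
def allocate(threads, num_cores):
    """Assign threads to cores so per-core L1 bandwidth is balanced.

    Greedy sketch: sort threads by measured L1 bandwidth (accesses per
    cycle, as gathered from performance counters at runtime) and always
    place the next thread on the currently least-loaded core.
    """
    load = [0.0] * num_cores
    placement = {}
    for tid, bw in sorted(threads.items(), key=lambda kv: -kv[1]):
        core = min(range(num_cores), key=lambda c: load[c])
        placement[tid] = core
        load[core] += bw
    return placement, load

threads = {"t0": 4.0, "t1": 3.0, "t2": 2.0, "t3": 1.0}
placement, load = allocate(threads, 2)
print(sorted(load))  # [5.0, 5.0]
```

An even per-core load is the goal: when L1 accesses are spread across caches, no single L1 becomes the bandwidth bottleneck that causes the degradation measured in the paper.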
  • ABSTRACT: Interconnection networks are key components in high-performance computing (HPC) systems, and their performance strongly influences overall system performance. However, at high load, congestion and its negative effects (e.g., head-of-line blocking) threaten the performance of the network, and thus of the entire system. Congestion control (CC) is therefore crucial to ensure efficient utilization of the interconnection network during congestion situations. As one major trend is to reduce the effective wiring in interconnection networks to lower cost and power consumption, the network will operate very close to its capacity, so congestion control becomes essential. Existing CC techniques can be divided into two general approaches: one throttles traffic injection at the sources that contribute to congestion, while the other isolates the congested traffic in specially designated resources. However, both approaches have different, non-overlapping weaknesses: injection throttling reacts slowly to congestion, while isolating traffic in special resources may lead the system to run out of those resources. In this paper we propose EcoCC, a new Efficient and Cost-Effective CC technique that combines injection throttling and congested-flow isolation to minimize their respective drawbacks and maximize overall system performance. This new strategy is suitable for current commercial switch architectures, where it could be implemented without significant complexity. Experimental results, using simulations under synthetic and real trace-based traffic patterns, show that this technique improves performance by up to 55 percent over some of the most successful congestion control techniques.
    No preview · Article · Jan 2015 · IEEE Transactions on Parallel and Distributed Systems
  • ABSTRACT: On the one hand, performance and fault tolerance of interconnection networks are key design issues for high-performance computing (HPC) systems. On the other hand, cost must also be considered. Indirect topologies are often chosen in the design of HPC systems; among them, the most commonly used topology is the fat-tree. In this work, we focus on obtaining the maximum benefit from the network resources by designing a simple indirect topology with very good performance and fault-tolerance properties, while keeping the hardware cost as low as possible. To that end, we propose extensions to the fat-tree topology that take full advantage of the hardware resources the topology consumes. In particular, we propose three new topologies with different properties in terms of cost, performance and fault tolerance. All of them achieve performance similar to or better than the fat-tree while providing a good level of fault tolerance and, unlike most available topologies, they also tolerate faults in the links that connect to end nodes.
    No preview · Article · Jan 2015 · IEEE Transactions on Parallel and Distributed Systems
  • ABSTRACT: In this paper we detail the key features, architectural design, and implementation of rCUDA, an advanced framework to enable remote and transparent GPGPU acceleration in HPC clusters. rCUDA allows decoupling GPUs from nodes, forming pools of shared accelerators, which brings enhanced flexibility to cluster configurations. This opens the door to configurations with fewer accelerators than nodes, as well as permits a single node to exploit the whole set of GPUs installed in the cluster. In our proposal, CUDA applications can seamlessly interact with any GPU in the cluster, independently of its physical location. Thus, GPUs can be either distributed among compute nodes or concentrated in dedicated GPGPU servers, depending on the cluster administrator's policy. This proposal leads to savings not only in space but also in energy, acquisition, and maintenance costs. The performance evaluation in this paper with a series of benchmarks and a production application clearly demonstrates the viability of this proposal. Concretely, experiments with the matrix-matrix product reveal excellent performance compared with regular executions on the local GPU; on a much more complex application, the GPU-accelerated LAMMPS, we attain up to 11x speedup employing 8 remote accelerators from a single node with respect to a 12-core CPU-only execution. GPGPU service interaction in compute nodes, remote acceleration in dedicated GPGPU servers, and data transfer performance of similar GPU virtualization frameworks are also evaluated.
    No preview · Article · Dec 2014 · Parallel Computing
  • ABSTRACT: The torus is a subclass of direct topologies that was defined in theory for an arbitrary number of dimensions. Although some supercomputers have recently been built on networks with five or six dimensions, the most common case implements only three dimensions. The market offers low-profile communication expansion cards whose reduced number of ports is not enough to build tori of a certain number of dimensions. In this paper, we deal with four-port expansion cards. With one of these cards per node, a 2D torus topology can be built, but not a 3D torus. However, two of these cards can be used to build each node of a 3D torus: two ports interconnect the two cards, and the other six ports connect to the six neighbor nodes in the 3D torus. Theoretically, there are several ways of assigning the dimension and direction of the ports. This paper presents a detailed study of the possible port configurations and, under specific network conditions, obtains the best of them.
    No preview · Article · Nov 2014 · IEEE Transactions on Computers
  • ABSTRACT: SLURM is a resource manager that can be leveraged to share a collection of heterogeneous resources among the jobs executing in a cluster. However, SLURM is not designed to handle resources such as graphics processing units (GPUs). Concretely, although SLURM can use a generic resource plug-in (GRes) to manage GPUs, with this solution the hardware accelerators can only be accessed by the job executing on the node to which the GPU is attached. This is a serious constraint for remote GPU virtualization technologies, which aim at providing user-transparent access to all GPUs in the cluster, independently of the location of the node where the application is running with respect to the GPU node. In this work we introduce a new type of device in SLURM, "rgpu", in order to gain access from any application node to any GPU node in the cluster, using rCUDA as the remote GPU virtualization solution. With this new scheduling mechanism, a user can access any number of GPUs, as SLURM schedules the tasks taking into account all the graphics accelerators available in the complete cluster. We present experimental results that show the benefits of this new approach in terms of increased flexibility for the job scheduler.
    Full-text · Conference Paper · Oct 2014
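The scheduling difference between node-local GPUs (GRes-style) and pooled remote GPUs (rgpu-style) boils down to a feasibility check. The sketch below is a deliberately minimal model of that difference, not SLURM's actual plug-in logic:

```python
def schedulable(job_gpus, free_gpus_per_node, remote=True):
    """Can a job's GPU request be satisfied by the cluster?

    With node-local GPUs (GRes-style), the job needs a single node with
    enough free GPUs attached. With remote GPU virtualization (rgpu-style,
    e.g. via rCUDA), any free GPU anywhere in the cluster can serve the
    job, so only the cluster-wide total matters.
    """
    if remote:
        return sum(free_gpus_per_node) >= job_gpus
    return max(free_gpus_per_node) >= job_gpus

free = [1, 1, 1]  # one free GPU on each of three nodes
print(schedulable(2, free, remote=False))  # False: no single node has 2
print(schedulable(2, free, remote=True))   # True: remote GPUs can be pooled
```

The example shows the flexibility gain reported in the paper: a two-GPU job that no single node can host becomes schedulable once GPUs are pooled across nodes.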

Publication Stats

9k Citations
182.45 Total Impact Points


  • 1992-2015
    • Universitat Politècnica de València
      • Department of Computer Engineering
Valencia, Spain
    • University of Valencia
      • Department of Informatics
      Valencia, Spain
  • 2003-2011
    • Simula Research Laboratory
Oslo, Norway
  • 2008
    • Keio University
      • Department of Information and Computer Science
      Tokyo, Tokyo-to, Japan
  • 2000-2004
    • University of Oslo
      • Department of Informatics
Oslo, Norway
    • University of Southern California
      • Department of Electrical Engineering
      Los Angeles, CA, United States
    • The Ohio State University
      • Department of Computer Science and Engineering
      Columbus, Ohio, United States
  • 2002
    • Pennsylvania State University
      • Department of Computer Science and Engineering
University Park, PA, United States
    • Polytechnic University of Catalonia
      • Department of Computer Architecture (DAC)
Barcelona, Catalonia, Spain
  • 2000-2002
    • University of Murcia
      • Departamento de Ingeniería y Tecnología de Computadores
      Murcia, Murcia, Spain
  • 1992-2001
    • University of Castilla-La Mancha
      • Departamento de Sistemas Informáticos
Ciudad Real, Castilla-La Mancha, Spain
  • 1999
    • Universidad de Cantabria
      • Computers and Electronics
      Santander, Cantabria, Spain
  • 1995-1997
    • Georgia Institute of Technology
      • School of Electrical & Computer Engineering
      Atlanta, GA, United States
  • 1996
    • CA Technologies
      New York, New York, United States