José Duato

Universitat Politècnica de València, Valencia, Spain

Publications (533) · 149.7 Total Impact

  • ABSTRACT: High Performance Computing usually leverages messaging libraries such as MPI, GASNet, or OpenSHMEM, among others, in order to exchange data among processes in large-scale clusters. Furthermore, these libraries make use of specialized low-level network layers in order to achieve as much performance as possible from hardware interconnects such as InfiniBand or 40Gb Ethernet. EXTOLL is an emerging network targeted at high-performance clusters.
  • ABSTRACT: DRAM technology requires refresh operations to be performed in order to avoid data loss due to capacitance leakage. Refresh operations consume a significant amount of dynamic energy, which increases with the storage capacity. To reduce this energy, prior work has focused on reducing refreshes in off-chip memories. However, the problem also appears in the on-chip eDRAM memories used in current low-level caches, where refresh energy can dominate the dynamic consumption when a high percentage of the chip area is devoted to eDRAM cache structures. Replacement algorithms for high-associativity low-level caches select the victim block while avoiding blocks that are likely to be reused soon. This paper combines the state-of-the-art MRUT replacement algorithm with a novel refresh policy in which refresh operations are performed based on information produced by the replacement algorithm. The proposed refresh policy is implemented on top of an energy-aware eDRAM cache architecture, which uses bank prediction and swap operations to save energy. Experimental results show that, compared to a conventional eDRAM design, the proposed energy-aware cache achieves refresh energy savings of 72%; considering the consumption of the entire on-chip memory hierarchy, the overall energy savings are 30%. These benefits come with minimal impact on performance (1.2%) and area overhead (0.4%). (A simplified sketch of the refresh idea follows the reference below.)
    Microprocessors and Microsystems 02/2015; 39(1):37-48. DOI:10.1016/j.micpro.2014.12.001 · 0.60 Impact Factor
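    The sketch below only illustrates the general idea of letting replacement state steer refresh decisions: clean blocks that the replacement algorithm flags as likely victims are dropped instead of refreshed. The data structures, field names, and constants are assumptions made for exposition, not the MRUT-based policy evaluated in the paper.

      // Illustrative model: one set of an 8-way eDRAM cache whose refresh
      // controller consults replacement state to decide whether a block is
      // worth refreshing. All names and constants here are hypothetical.
      #include <array>
      #include <cstdint>

      struct Line {
          bool     valid = false;
          bool     dirty = false;
          bool     likelyVictim = false;   // hint from the replacement algorithm
          uint64_t lastRefresh = 0;        // cycle of the last refresh
      };

      constexpr uint64_t kRetentionCycles = 100000;  // assumed eDRAM retention time

      struct Set {
          std::array<Line, 8> ways;

          // Called periodically by the refresh controller.
          void refreshTick(uint64_t now) {
              for (Line& line : ways) {
                  if (!line.valid) continue;
                  if (now - line.lastRefresh < kRetentionCycles) continue;
                  if (line.likelyVictim && !line.dirty) {
                      line.valid = false;      // drop it: refresh energy saved
                  } else {
                      line.lastRefresh = now;  // pay for one refresh
                  }
              }
          }
      };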
  • ABSTRACT: In recent years, embedded Dynamic Random-Access Memory (eDRAM) technology has been adopted in last-level caches due to its low leakage energy consumption and high density. However, the fact that eDRAM presents slower access times than Static RAM (SRAM) technology has prevented its inclusion in higher levels of the cache hierarchy. This paper proposes to mingle SRAM and eDRAM banks within the data array of second-level (L2) caches. The main goal is to achieve the best trade-off among performance, energy, and area. To this end, two main directions have been followed. First, this paper explores the optimal percentage of banks for each technology. Second, the cache controller is redesigned to deal with performance and energy. Performance is addressed by keeping the most likely accessed blocks in fast SRAM banks, and energy savings are further enhanced by avoiding unnecessary destructive reads of eDRAM blocks. Experimental results show that, compared to a conventional SRAM L2 cache, a hybrid approach requiring similar or even lower area speeds up performance on average by 5.9%, while total energy savings reach 32%. For a 45 nm technology node, the energy-delay-area product confirms that a hybrid cache is a better design than the conventional SRAM cache regardless of the number of eDRAM banks, and also better than a conventional eDRAM cache when the number of SRAM banks is an eighth of the total number of cache banks. (A simplified sketch of the block-placement idea follows the reference below.)
    IEEE Transactions on Computers 01/2015; DOI:10.1109/TC.2014.2346185 · 1.47 Impact Factor
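    The sketch below only illustrates the placement idea of keeping recently used blocks in the fast SRAM ways of a hybrid set; the way counts, the trivial victim choice, and the swap-on-hit rule are assumptions made for exposition, not the controller proposed in the paper.

      // Illustrative model: a set with one fast SRAM way and several slower
      // eDRAM ways. A hit in an eDRAM way swaps the block with the
      // SRAM-resident block so that hot data tends to stay in SRAM.
      #include <cstdint>
      #include <utility>

      constexpr int kSramWays  = 1;
      constexpr int kEdramWays = 7;

      struct Line { uint64_t tag = 0; bool valid = false; };

      struct HybridSet {
          Line sram[kSramWays];
          Line edram[kEdramWays];

          // Returns true on a hit; promotes eDRAM hits into SRAM.
          bool access(uint64_t tag) {
              for (Line& l : sram)
                  if (l.valid && l.tag == tag) return true;   // fast hit
              for (Line& l : edram)
                  if (l.valid && l.tag == tag) {              // slow hit
                      std::swap(sram[0], l);                  // promote/demote
                      return true;
                  }
              return false;                                    // miss
          }
      };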
  • ABSTRACT: Interconnection networks are key components in high-performance computing (HPC) systems, and their performance strongly influences that of the overall system. At high load, however, congestion and its negative effects (e.g., head-of-line blocking) threaten the performance of the network, and thus that of the entire system. Congestion control (CC) is crucial to ensure an efficient utilization of the interconnection network during congestion situations. As one major trend is to reduce the effective wiring in interconnection networks to lower cost and power consumption, the network will operate very close to its capacity, making congestion control essential. Existing CC techniques can be divided into two general approaches: one throttles traffic injection at the sources that contribute to congestion, and the other isolates the congested traffic in specially designated resources. These approaches have complementary weaknesses: injection throttling reacts slowly to congestion, while isolating traffic in special resources may cause the system to run out of those resources. In this paper we propose EcoCC, a new Efficient and Cost-Effective CC technique that combines injection throttling and congested-flow isolation to minimize their respective drawbacks and maximize overall system performance. This strategy is suitable for current commercial switch architectures, where it could be implemented without significant added complexity. Experimental results, using simulations under synthetic and real trace-based traffic patterns, show that this technique improves performance by up to 55 percent over some of the most successful congestion control techniques. (A behavioral sketch of the combined approach follows the reference below.)
    IEEE Transactions on Parallel and Distributed Systems 01/2015; 26(1):107-119. DOI:10.1109/TPDS.2014.2307851 · 2.17 Impact Factor
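    The sketch below is a purely behavioral illustration of combining the two classic approaches (isolate congested flows while dedicated queues remain, otherwise fall back to injection throttling); the queues, thresholds, and names are hypothetical and do not reproduce the EcoCC design.

      // Behavioral sketch: when packets head to a congested output port,
      // their flow is first isolated in a dedicated queue; if no such queue
      // is free, a throttle request for the source is recorded instead.
      #include <cstdint>
      #include <unordered_set>
      #include <vector>

      class SwitchModel {
      public:
          explicit SwitchModel(unsigned congestedQueues)
              : freeCongestedQueues_(congestedQueues) {}

          // Called for each packet of `flow` that targets a congested port.
          void onCongestedPacket(uint32_t flow) {
              if (isolated_.count(flow)) return;      // already isolated
              if (freeCongestedQueues_ > 0) {
                  --freeCongestedQueues_;
                  isolated_.insert(flow);             // avoid HoL blocking
              } else {
                  throttled_.push_back(flow);         // ask the source to slow down
              }
          }

          const std::vector<uint32_t>& throttleRequests() const { return throttled_; }

      private:
          unsigned freeCongestedQueues_;
          std::unordered_set<uint32_t> isolated_;
          std::vector<uint32_t> throttled_;
      };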
  • IEEE Transactions on Parallel and Distributed Systems 01/2015; DOI:10.1109/TPDS.2015.2430863 · 2.17 Impact Factor
  • IEEE Transactions on Parallel and Distributed Systems 01/2015; DOI:10.1109/TPDS.2015.2412139 · 2.17 Impact Factor
  • ABSTRACT: In this paper we detail the key features, architectural design, and implementation of rCUDA, an advanced framework to enable remote and transparent GPGPU acceleration in HPC clusters. rCUDA allows decoupling GPUs from nodes, forming pools of shared accelerators, which brings enhanced flexibility to cluster configurations. This opens the door to configurations with fewer accelerators than nodes, and also permits a single node to exploit the whole set of GPUs installed in the cluster. In our proposal, CUDA applications can seamlessly interact with any GPU in the cluster, independently of its physical location. Thus, GPUs can be either distributed among compute nodes or concentrated in dedicated GPGPU servers, depending on the cluster administrator's policy. This proposal leads to savings not only in space but also in energy, acquisition, and maintenance costs. The performance evaluation in this paper with a series of benchmarks and a production application clearly demonstrates the viability of the proposal. Concretely, experiments with the matrix-matrix product reveal excellent performance compared with regular executions on the local GPU; on a much more complex application, the GPU-accelerated LAMMPS, we attain up to 11x speedup employing 8 remote accelerators from a single node with respect to a 12-core CPU-only execution. GPGPU service interaction in compute nodes, remote acceleration in dedicated GPGPU servers, and data transfer performance of similar GPU virtualization frameworks are also evaluated. (A minimal CUDA host-code example follows the reference below.)
    Parallel Computing 12/2014; DOI:10.1016/j.parco.2014.09.011 · 1.89 Impact Factor
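    The code below is an ordinary CUDA runtime host program with nothing rCUDA-specific in it; the point, per the abstract above, is that rCUDA intercepts these standard calls so the same unmodified source code can be served by a GPU installed in another node. Device counts and buffer sizes are arbitrary.

      // Plain CUDA host code: allocate, copy in, (launch kernels), copy out.
      #include <cuda_runtime.h>
      #include <cstdio>
      #include <vector>

      int main() {
          int devices = 0;
          cudaGetDeviceCount(&devices);   // under rCUDA this may include remote GPUs
          std::printf("visible GPUs: %d\n", devices);

          std::vector<float> host(1 << 20, 1.0f);
          float* dev = nullptr;
          cudaMalloc(&dev, host.size() * sizeof(float));
          cudaMemcpy(dev, host.data(), host.size() * sizeof(float),
                     cudaMemcpyHostToDevice);
          // ... kernel launches would go here ...
          cudaMemcpy(host.data(), dev, host.size() * sizeof(float),
                     cudaMemcpyDeviceToHost);
          cudaFree(dev);
          return 0;
      }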
  • IEEE Transactions on Computers 11/2014; 63(11):2701-2715. DOI:10.1109/TC.2013.155 · 1.47 Impact Factor
  • ABSTRACT: SLURM is a resource manager that can be leveraged to share a collection of heterogeneous resources among the jobs in execution in a cluster. However, SLURM is not designed to handle resources such as graphics processing units (GPUs). Concretely, although SLURM can use a generic resource plug-in (GRes) to manage GPUs, with this solution the hardware accelerators can only be accessed by the job that is in execution on the node to which the GPU is attached. This is a serious constraint for remote GPU virtualization technologies, which aim at providing user-transparent access to all GPUs in the cluster, independently of the location of the node where the application is running with respect to the GPU node. In this work we introduce a new type of device in SLURM, "rgpu", in order to gain access from any application node to any GPU node in the cluster using rCUDA as the remote GPU virtualization solution. With this new scheduling mechanism, a user can access any number of GPUs, as SLURM schedules the tasks taking into account all the graphics accelerators available in the complete cluster. We present experimental results that show the benefits of this new approach in terms of increased flexibility for the job scheduler.
    2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing; 10/2014
  • ABSTRACT: Graphics processing units (GPUs) are being increasingly embraced by the high-performance computing community as an effective way to reduce execution time by accelerating parts of their applications. Remote CUDA (rCUDA) was recently introduced as a software solution to address the high acquisition costs and energy consumption of GPUs that constrain further adoption of this technology. Specifically, rCUDA is a middleware that allows a reduced number of GPUs to be transparently shared among the nodes in a cluster. Although the initial prototype versions of rCUDA demonstrated its functionality, they also revealed concerns with respect to usability, performance, and support for new CUDA features. In response, in this paper we present a new rCUDA version that (1) improves usability by including a new component that allows an automatic transformation of any CUDA source code so that it conforms to the needs of the rCUDA framework, (2) consistently features low overhead when using remote GPUs thanks to an improved new communication architecture, and (3) supports multithreaded applications and CUDA libraries. As a result, for any CUDA-compatible program, rCUDA now allows the use of remote GPUs within a cluster with low overhead, so that a single application running in one node can use all GPUs available across the cluster, thereby extending the single-node capability of CUDA.
    Concurrency and Computation: Practice and Experience 10/2014 · 0.78 Impact Factor
  • ABSTRACT: A large number of high-performance computing clusters include one or more GPUs per node in order to reduce application execution time. However, these accelerators are, as a rule, utilized less than 100% of the time. In this context, remote GPU virtualization can help reduce the number of GPUs required, lowering both the acquisition cost of these devices and the energy consumption of the system. This paper investigates the overhead and possible bottlenecks of several "heterogeneous" configurations formed by GPU-less client nodes that run CUDA applications on GPUs installed in remote servers. The evaluation is carried out on three general-purpose multicore processors (Intel Xeon, Intel Atom, and ARM Cortex A9), two graphics accelerators (NVIDIA GeForce GTX480 and NVIDIA Quadro M1000), and two scientific applications (CUDASW++ and LAMMPS) used in bioinformatics and molecular dynamics simulations.
    XXV Jornadas del Paralelismo, Valladolid; 09/2014
  • ABSTRACT: SLURM is a resource manager for clusters that allows a set of heterogeneous resources to be shared among running jobs. However, SLURM is not designed to share resources such as graphics processing units (GPUs). In fact, although SLURM supports generic resource plug-ins to manage GPUs, these can only be accessed exclusively by a job running on the node that hosts them. This is a serious drawback for remote GPU virtualization technologies, whose goal is to provide the user with completely transparent access to all GPUs in the cluster, regardless of the specific location of both the job and the GPU. In this work we present a new type of device in SLURM, "rgpu", so that an application can access any GPU in the cluster from its own node by using the remote GPU virtualization technology rCUDA. Moreover, with this new scheduling mechanism, a job can use as many GPUs as exist in the cluster, as long as they are available. Finally, we present the results of several simulations that show the benefits of this new approach in terms of increased job scheduling flexibility.
    XXV Jornadas de Paralelismo, Valladolid; 09/2014
  • ABSTRACT: Low-power modes in current microprocessors use low frequencies and voltages to reduce energy consumption. However, the parameter variation introduced by the fabrication process causes persistent errors at supply voltages below Vccmin. Recent proposals provide fairly low error tolerance, mainly because of the trade-off between coverage and overhead. This paper proposes a new fault-tolerant first-level data cache that combines SRAM and eDRAM cells to provide 100% coverage of the persistent errors produced. Experimental results show that, compared to a conventional cache and assuming a 50% failure probability in low-power mode, leakage and dynamic energy savings are 85% and 62%, respectively, with minimal impact on performance. Keywords: HER cache, way prediction, eDRAM, SRAM, retention time.
    XXV Jornadas de Paralelismo, Valladolid (Spain); 09/2014
  • ABSTRACT: A clear trend has emerged involving the acceleration of scientific applications by using GPUs. However, the capabilities of these devices are still generally underutilized. Remote GPU virtualization techniques can help increase GPU utilization rates, while reducing acquisition and maintenance costs. The overhead of using a remote GPU instead of a local one is introduced mainly by the difference in performance between the internode network and the intranode PCIe link. In this paper we show how using the new InfiniBand Connect-IB network adapters (attaining similar throughput to that of the most recently emerged GPUs) boosts the performance of remote GPU virtualization, reducing the overhead to a mere 0.19% in the application tested.
    IEEE CLUSTER 2014, Madrid, Spain; 09/2014
  • The Journal of Supercomputing 09/2014; 69(3):1410-1444. DOI:10.1007/s11227-014-1223-9 · 0.84 Impact Factor
  • ABSTRACT: To mitigate the impact of bandwidth contention, which for some processes can cause performance degradation of up to 40%, we devise a scheduling algorithm that tackles both main memory and L1 bandwidth contention. Experimental evaluation on a real system shows that the proposal achieves an average speedup of 5% with respect to Linux.
  • ABSTRACT: Many current high-performance clusters include one or more GPUs per node in order to dramatically reduce application execution time, but the utilization of these accelerators is usually far below 100%. In this context, remote GPU virtualization can help to reduce acquisition costs as well as the overall energy consumption. In this paper, we investigate the potential overhead and bottlenecks of several "heterogeneous" scenarios consisting of client GPU-less nodes running CUDA applications and remote GPU-equipped server nodes providing access to NVIDIA hardware accelerators. The experimental evaluation is performed using three general-purpose multicore processors (Intel Xeon, Intel Atom and ARM Cortex A9), two graphics accelerators (NVIDIA GeForce GTX480 and NVIDIA Quadro M1000), and two relevant scientific applications (CUDASW++ and LAMMPS) arising in bioinformatics and molecular dynamics simulations.
    The Fourth International Conference on Smart Grids, Green Communications and IT Energy-aware Technologies (ENERGY 2014), Chamonix, France; 04/2014
  • ABSTRACT: In application-specific SoCs, the irregularity of the topology results in a complex and customized implementation of the routing algorithm, usually relying on routing tables implemented with memory structures at source end nodes. As system size increases, the routing tables also grow, with a non-negligible impact on power, area, and latency. In this paper, we present a routing implementation for application-specific SoCs that efficiently implements a deadlock-free routing algorithm for these irregular networks with no routing tables, using only a small logic block in every switch. The mechanism relies on a tool that maps the initial irregular topology of the SoC system onto a logical regular structure where the mechanism can be applied. We provide details for both the mapping tool and the proposed routing mechanism. Evaluation results show the effectiveness of the mapping tool as well as the low area and timing requirements of the mechanism. With the mapping tool and the routing mechanism, complex irregular SoC topologies can now be supported without the need for routing tables.
    IEEE Transactions on Computers 03/2014; 63(3):557-569. DOI:10.1109/TC.2012.299 · 1.47 Impact Factor
  • ABSTRACT: To improve chip multiprocessor (CMP) performance, recent research has focused on scheduling strategies to mitigate main memory bandwidth contention. Nowadays, commercial CMPs implement multilevel cache hierarchies that are shared by several multithreaded cores. In this microprocessor design, contention points may appear along the whole memory hierarchy. Moreover, this problem is expected to worsen in future technologies, since the number of cores and hardware threads, and consequently the size of the shared caches, increases with each microprocessor generation. This paper characterizes the impact on performance of the different contention points that appear along the memory subsystem. The analysis shows that some benchmarks are more sensitive to contention in higher levels of the memory hierarchy (e.g., a shared L2) than to main memory contention. We propose two generic scheduling strategies for CMPs. The first strategy takes into account the available bandwidth at each level of the cache hierarchy; it selects the processes to be co-scheduled and allocates them to cores to minimize contention effects. The second strategy also considers the performance degradation each process suffers due to contention-aware scheduling. Both proposals have been implemented and evaluated on a commercial single-threaded quad-core processor with a relatively small two-level cache hierarchy. The proposals reach, on average, performance improvements of 5.38 and 6.64 percent over the Linux scheduler, while the improvement of a state-of-the-art memory-contention-aware scheduler is 3.61 percent under the evaluated mixes. (A toy sketch of bandwidth-aware co-scheduling follows the reference below.)
    IEEE Transactions on Parallel and Distributed Systems 03/2014; 25(3):581-590. DOI:10.1109/TPDS.2013.61 · 2.17 Impact Factor
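    The sketch below is a toy greedy heuristic, not the algorithm evaluated in the paper: given per-process estimates of shared-L2 and main-memory bandwidth demand, it fills the free cores with the most demanding processes that still fit within an assumed per-level bandwidth budget.

      // Toy bandwidth-aware co-schedule selection; all figures are illustrative.
      #include <cstddef>
      #include <string>
      #include <vector>

      struct Proc {
          std::string name;
          double l2Bw;    // estimated shared-L2 bandwidth demand (GB/s)
          double memBw;   // estimated main-memory bandwidth demand (GB/s)
      };

      struct Budget {
          double l2Left;   // remaining shared-L2 bandwidth
          double memLeft;  // remaining main-memory bandwidth
      };

      // Returns the indices of the processes chosen for this quantum.
      std::vector<size_t> pickCoschedule(const std::vector<Proc>& ready,
                                         size_t cores, Budget budget) {
          std::vector<size_t> chosen;
          std::vector<bool> used(ready.size(), false);
          for (size_t c = 0; c < cores; ++c) {
              double best = -1.0;
              size_t bestIdx = ready.size();
              for (size_t i = 0; i < ready.size(); ++i) {
                  if (used[i]) continue;
                  if (ready[i].l2Bw > budget.l2Left || ready[i].memBw > budget.memLeft)
                      continue;                       // would saturate a shared level
                  double demand = ready[i].l2Bw + ready[i].memBw;
                  if (demand > best) { best = demand; bestIdx = i; }
              }
              if (bestIdx == ready.size()) break;     // nothing else fits
              used[bestIdx] = true;
              budget.l2Left  -= ready[bestIdx].l2Bw;
              budget.memLeft -= ready[bestIdx].memBw;
              chosen.push_back(bestIdx);
          }
          return chosen;
      }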
  • ABSTRACT: Although performance is a key design issue of interconnection networks, fault tolerance is becoming more important due to the large number of components in large machines. In this paper, we focus on designing a simple indirect topology with both good performance and good fault-tolerance properties. The idea is to take full advantage of the network resources consumed by the topology. To do that, starting from the RUFT topology, which is a simple UMIN topology that does not tolerate any link fault, we first duplicate the injection and ejection links, connecting these extra links in a particular way. The resulting topology tolerates 3 network link faults and also slightly increases performance with a marginal increase in network hardware cost. Most importantly, contrary to most of the available topologies, it is also able to tolerate faults in the links that connect to end nodes. We also propose another topology that additionally duplicates network links, achieving a 2x performance improvement and tolerating up to 7 network link faults. These results are better than the ones obtained by a BMIN with a similar amount of resources.
    Proceedings of the 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing; 02/2014

Publication Stats

8k Citations
149.70 Total Impact Points

Institutions

  • 1992–2015
    • Universitat Politècnica de València
      • Department of Computer Engineering
      Valencia, Valencia, Spain
  • 1992–2014
    • University of Valencia
      • Department of Informatics
      Valencia, Valencia, Spain
  • 2000–2012
    • University of Murcia
      • Department of Computer Engineering and Technology
      Murcia, Murcia, Spain
    • The Ohio State University
      • Department of Computer Science and Engineering
      Columbus, Ohio, United States
  • 2003–2011
    • Simula Research Laboratory
      Oslo, Norway
  • 1992–2011
    • University of Castilla-La Mancha
      • Departamento de Sistemas Informáticos
      Ciudad Real, Castilla-La Mancha, Spain
  • 2008
    • University of Bologna
      Bologna, Emilia-Romagna, Italy
  • 2000–2008
    • University of Oslo
      • Department of Informatics
      Oslo, Norway
  • 2007
    • Texas A&M University
      • Department of Computer Science and Engineering
      College Station, TX, United States
  • 2002–2005
    • Polytechnic University of Catalonia
      Barcelona, Catalonia, Spain
    • Pennsylvania State University
      • Department of Computer Science and Engineering
      University Park, PA, United States
  • 1995–2005
    • Georgia Institute of Technology
      • School of Electrical & Computer Engineering
      Atlanta, Georgia, United States
  • 2000–2003
    • University of Southern California
      • Department of Electrical Engineering
      Los Angeles, CA, United States
  • 1999
    • Universidad de Cantabria
      • Computers and Electronics
      Santander, Cantabria, Spain