José Duato

Universitat Politècnica de València, Valencia, Spain

Publications (544) · 170.55 Total Impact

  • Source
    ABSTRACT: Motivation: DNA methylation analysis suffers from very long processing times, since the advent of Next-Generation Sequencers (NGS) has shifted the bottleneck of genomic studies from the sequencers that obtain the DNA samples to the software that analyzes them. Existing methylation-analysis software scales efficiently with neither the size of the dataset nor the length of the reads to be analyzed. Since sequencers are expected to provide increasingly long reads in the near future, efficient and scalable methylation software should be developed. Results: We present a new software tool, called HPG-Methyl, which efficiently maps bisulfite sequencing reads onto DNA and analyzes DNA methylation. The strategy consists of leveraging the speed of the Burrows-Wheeler Transform to map a large number of DNA fragments (reads) rapidly, together with the accuracy of the Smith-Waterman algorithm, which is employed exclusively for the most ambiguous and shortest reads. Experimental results on platforms with Intel multicore processors show that HPG-Methyl significantly outperforms state-of-the-art software such as Bismark, BS-Seeker or BSMAP in both execution time and sensitivity, particularly for long bisulfite reads. Availability: Software in the form of C libraries and functions, together with instructions to compile and execute it. Available by sftp (password “anonymous”).
    Bioinformatics 10/2015; 31(19):3130–3138. DOI:10.1093/bioinformatics/btv357 · 4.98 Impact Factor
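The two-stage strategy in the abstract above (a fast index-based pass first, reserving the costly Smith-Waterman alignment for reads the fast pass cannot place) can be illustrated with a minimal sketch. This is not the HPG-Methyl code: a plain substring search stands in for the Burrows-Wheeler/FM-index lookup, and all names, scores, and sequences are illustrative.

```python
# Minimal sketch of a two-stage read mapper: cheap exact search first,
# Smith-Waterman local alignment only as a fallback for unplaced reads.

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Classic O(len(a)*len(b)) local-alignment score with a zero floor."""
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def map_read(read, reference):
    """Stage 1: exact search (stand-in for a BWT/FM-index lookup).
    Stage 2: Smith-Waterman fallback when no exact hit exists."""
    pos = reference.find(read)
    if pos != -1:
        return ("exact", pos)
    return ("sw", smith_waterman_score(read, reference))

reference = "ACGTACGTTTGGCCAAACGT"
print(map_read("TTGGCC", reference))   # exact hit, no alignment needed
print(map_read("TTGACC", reference))   # falls back to Smith-Waterman
```

The point of the split is that the quadratic-cost alignment runs only for the small fraction of reads the index cannot resolve, which is where the abstract attributes the speedup.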
  • ABSTRACT: The torus topology is one of the preferred topologies for the interconnection network in high-performance clusters and supercomputers. Cost and scalability are some of the properties that make the torus suitable for systems with a large number of nodes. The 3D torus is the most widespread variant due to its excellent nearest-neighbor communication. However, some recent supercomputers have been built using torus networks with five or six dimensions. To build an n-dimensional torus, 2n ports per node are needed, which can be provided by a single card or by several cards per node. In the latter case, there are multiple ways of assigning the dimension and direction of the card ports. In previous work we defined and characterized the 3D Twin (3DT) torus, which uses two four-port cards per node. In this paper we extend that work to define the n-dimensional Twin (nDT) torus topology. In this case, we formally obtain the optimal port configuration when (n+1)-port cards are used instead of n-port cards. Moreover, we explain how the deadlock problem can appear and propose a simple solution. Finally, we include evaluation results which show that performance increases when an nDT torus is used instead of a torus with fewer dimensions and the same computational resources.
    IEEE Transactions on Computers 10/2015; 64(10):2847-2861. DOI:10.1109/TC.2014.2378267 · 1.66 Impact Factor
  • Source
    ABSTRACT: DRAM technology requires refresh operations to avoid data loss due to capacitor discharge over time. These operations consume a significant amount of dynamic energy and degrade performance, since memory banks cannot be accessed while they are being refreshed. Some recent work has focused on reducing the number of refresh operations in off-chip DRAM memories. However, this problem also arises in on-chip eDRAM memories, which are being used in some current low-level cache designs. The refresh energy due to these operations can exceed half of the dynamic energy consumed by the entire cache. This work focuses on reducing the number of refresh operations in a proposed on-chip low-level eDRAM cache design optimized to minimize energy consumption. To this end, this paper introduces a selective, distributed refresh policy that exploits data-reuse information to decide whether a block must be refreshed. Experimental results show that, compared to a conventional eDRAM cache, the proposed eDRAM cache achieves refresh energy savings of up to 72% on average, while the total energy reduction across the whole memory hierarchy is 30% on average. Moreover, these benefits come with minimal impact on performance (1.2% on average).
    XXVI Jornadas de Paralelismo, Córdoba (Spain); 09/2015
  • ABSTRACT: Combined switches are an attractive option for building high-radix switches. The idea consists in combining several current, smaller single-chip switches to obtain switches with a greater number of ports. The performance of these switches varies depending on their internal configuration, because the subnetwork interconnecting all the internal switches can become a bottleneck if an inappropriate internal configuration is chosen. In this paper, we show how to obtain the optimal internal switch configuration by applying a specific methodology. We highlight the impact of the internal switch configuration on network performance by means of case studies.
    The Journal of Supercomputing 07/2015; 71(7). DOI:10.1007/s11227-015-1408-x · 0.86 Impact Factor
  • Future Generation Computer Systems 07/2015; DOI:10.1016/j.future.2015.06.011 · 2.79 Impact Factor
  • Source
    ABSTRACT: In recent years, embedded Dynamic Random-Access Memory (eDRAM) technology has been implemented in last-level caches due to its low leakage energy consumption and high density. However, the fact that eDRAM presents slower access times than Static RAM (SRAM) technology has prevented its inclusion in higher levels of the cache hierarchy. This paper proposes to mingle SRAM and eDRAM banks within the data array of second-level (L2) caches. The main goal is to achieve the best trade-off among performance, energy, and area. To this end, two main directions have been followed. First, this paper explores the optimal percentage of banks for each technology. Second, the cache controller is redesigned to deal with performance and energy. Performance is addressed by keeping the most likely accessed blocks in fast SRAM banks. In addition, energy savings are further enhanced by avoiding unnecessary destructive reads of eDRAM blocks. Experimental results show that, compared to a conventional SRAM L2 cache, a hybrid approach requiring similar or even lower area speeds up performance on average by 5.9%, while saving 32% of total energy. For a 45nm technology node, the energy-delay-area product confirms that a hybrid cache is a better design than the conventional SRAM cache regardless of the number of eDRAM banks, and also better than a conventional eDRAM cache when the number of SRAM banks is an eighth of the total number of cache banks.
    IEEE Transactions on Computers 06/2015; 64(7):1884-1897. DOI:10.1109/TC.2014.2346185 · 1.66 Impact Factor
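The placement idea in the abstract above (keep the blocks most likely to be accessed in the fast SRAM banks, promoting on eDRAM hits) can be illustrated with a toy model of a single cache set. This is illustrative only, not the paper's controller: it assumes one SRAM way plus three eDRAM ways and a deliberately simplistic victim choice.

```python
# Toy hybrid cache set: the MRU block lives in the single fast SRAM way,
# the remaining ways are slow eDRAM; an eDRAM hit swaps blocks between ways.

class HybridSet:
    def __init__(self, edram_ways=3):
        self.sram = None                  # fast way: holds the MRU block
        self.edram = [None] * edram_ways  # slow ways

    def access(self, tag):
        if tag == self.sram:
            return "sram-hit"
        if tag in self.edram:
            # promote to SRAM, demote the old SRAM block to eDRAM (swap)
            i = self.edram.index(tag)
            self.edram[i], self.sram = self.sram, tag
            return "edram-hit"
        # miss: fill into SRAM, evict the last eDRAM way (simplistic victim)
        self.edram.pop()
        self.edram.insert(0, self.sram)
        self.sram = tag
        return "miss"

s = HybridSet()
print(s.access("A"))  # miss
print(s.access("A"))  # sram-hit: A is now the MRU block in the fast way
print(s.access("B"))  # miss: A is demoted to an eDRAM way
print(s.access("A"))  # edram-hit: A swaps back into the SRAM way
```

The swap on an eDRAM hit is what keeps hot blocks in the fast technology while the slow, dense banks absorb the rest of the working set.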
  • ABSTRACT: High Performance Computing usually leverages messaging libraries such as MPI, GASNet, or OpenSHMEM, among others, to exchange data among processes in large-scale clusters. Furthermore, these libraries rely on specialized low-level network layers to extract as much performance as possible from hardware interconnects such as InfiniBand or 40Gb Ethernet. EXTOLL is an emerging network targeted at high-performance clusters.
    Parallel Computing 03/2015; 46. DOI:10.1016/j.parco.2015.03.006 · 1.51 Impact Factor
  • Source
    ABSTRACT: DRAM technology requires refresh operations to be performed in order to avoid data loss due to capacitance leakage. Refresh operations consume a significant amount of dynamic energy, which increases with the storage capacity. To reduce this energy, prior work has focused on reducing refreshes in off-chip memories. However, this problem also appears in the on-chip eDRAM memories implemented in current low-level caches. The refresh energy can dominate the dynamic consumption when a high percentage of the chip area is devoted to eDRAM cache structures. Replacement algorithms for high-associativity low-level caches select the victim block while avoiding blocks that are likely to be reused soon. This paper combines the state-of-the-art MRUT replacement algorithm with a novel refresh policy: refresh operations are performed based on information produced by the replacement algorithm. The proposed refresh policy is implemented on top of an energy-aware eDRAM cache architecture, which uses bank prediction and swap operations to save energy. Experimental results show that, compared to a conventional eDRAM design, the proposed energy-aware cache achieves refresh energy savings of 72%. Considering the consumption of the entire on-chip memory hierarchy, the overall energy savings are 30%. These benefits come with minimal impact on performance (1.2%) and area overhead (0.4%).
    Microprocessors and Microsystems 02/2015; 39(1):37-48. DOI:10.1016/j.micpro.2014.12.001 · 0.43 Impact Factor
  • M. Gorgues · D. Xiang · J. Flich · Z. Yu · J. Duato
    ABSTRACT: Buffer resource minimization plays an important role in achieving power-efficient NoC designs. At the same time, advanced switching mechanisms like virtual cut-through (VCT) are appealing due to their inherent benefits (less network contention, higher throughput, and simpler broadcast implementations). Moreover, adaptive routing algorithms exploit the inherent bandwidth of the network, providing higher throughput. In this paper, we propose a novel flow control mechanism, referred to as type-based flow control (TBFC), and a new adaptive routing algorithm for NoCs. First, the proposed flow control strategy allows using minimal buffer resources while still supporting VCT. Then, on top of TBFC, we implement the safe/unsafe routing algorithm (SUR). This algorithm achieves higher performance than previous proposals because it properly balances the utilization of input-port buffers. Results show the same performance as fully adaptive routing algorithms but using fewer resources. When resources are matched, SUR achieves up to 20% throughput improvement.
  • ABSTRACT: The memory hierarchy plays a critical role in the performance of current chip multiprocessors. Main memory is shared by all the running processes, which can cause significant bandwidth contention. In addition, when the processor implements SMT cores, the L1 bandwidth is shared among the threads running on each core. In such a case, bandwidth-aware schedulers emerge as an interesting approach to mitigate the contention. This work investigates the performance degradation that processes suffer due to memory bandwidth constraints. Experiments show that main memory and L1 bandwidth contention negatively impact process performance; in both cases, performance degradation can reach up to 40% for some applications. To deal with contention, we devise a scheduling algorithm consisting of two policies guided by the bandwidth consumption gathered at runtime. The process selection policy balances the number of memory requests over the execution time to address main memory bandwidth contention. The process allocation policy tackles L1 bandwidth contention by balancing L1 accesses among the L1 caches. The proposal is evaluated on a Xeon E5645 platform using a wide set of multiprogrammed workloads, achieving performance benefits of up to 6.7% with respect to the Linux scheduler.
    IEEE Transactions on Computers 01/2015; DOI:10.1109/TC.2015.2428694 · 1.66 Impact Factor
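The allocation policy described in the abstract above (balancing L1 accesses among the physical L1 caches using bandwidth gathered at runtime) can be sketched as a greedy placement. The numbers, the bandwidth unit, and the two-threads-per-cache limit are illustrative assumptions, not the paper's algorithm.

```python
# Greedy bandwidth-balancing placement: heaviest consumers are placed first,
# each onto the least-loaded cache that still has a free hardware thread.

def allocate(processes, n_caches, threads_per_cache=2):
    """processes: dict name -> measured L1 bandwidth (illustrative units).
    Returns a dict cache_id -> list of process names."""
    placement = {c: [] for c in range(n_caches)}
    load = {c: 0.0 for c in range(n_caches)}
    for name, bw in sorted(processes.items(), key=lambda kv: -kv[1]):
        # caches with a free SMT hardware thread
        free = [c for c in range(n_caches) if len(placement[c]) < threads_per_cache]
        target = min(free, key=load.get)   # least aggregate L1 pressure
        placement[target].append(name)
        load[target] += bw
    return placement

procs = {"mcf": 90.0, "lbm": 80.0, "gcc": 20.0, "namd": 10.0}
print(allocate(procs, n_caches=2))
# pairs a heavy consumer with a light one on each cache
```

Placing the two heaviest consumers on different caches is exactly the balancing effect the allocation policy aims for; a naive in-order placement could instead put both on the same L1.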
  • ABSTRACT: Interconnection networks are key components in high-performance computing (HPC) systems, since their performance strongly influences that of the overall system. However, at high load, congestion and its negative effects (e.g., head-of-line blocking) threaten the performance of the network, and thus of the entire system. Congestion control (CC) is crucial to ensure an efficient utilization of the interconnection network during congestion situations. As one major trend is to reduce the effective wiring in interconnection networks to cut cost and power consumption, the network will operate very close to its capacity, making congestion control essential. Existing CC techniques can be divided into two general approaches: one throttles traffic injection at the sources that contribute to congestion, and the other isolates the congested traffic in specially designated resources. However, the two approaches have different, complementary weaknesses: injection-throttling techniques react slowly to congestion, while isolating traffic in special resources may lead the system to run out of those resources. In this paper we propose EcoCC, a new Efficient and Cost-Effective CC technique that combines injection throttling and congested-flow isolation to minimize their respective drawbacks and maximize overall system performance. This new strategy is suitable for current commercial switch architectures, where it could be implemented without significant added complexity. Experimental results, using simulations under synthetic and real trace-based traffic patterns, show that this technique improves performance by up to 55 percent over some of the most successful congestion control techniques.
    IEEE Transactions on Parallel and Distributed Systems 01/2015; 26(1):107-119. DOI:10.1109/TPDS.2014.2307851 · 2.17 Impact Factor
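The combined policy the abstract above outlines (prefer isolating a congested flow into a special queue, and fall back to throttling its source once those queues run out) can be sketched schematically. This is not the EcoCC implementation: the occupancy threshold, the queue count, and the per-flow sampling interface are illustrative assumptions.

```python
# Schematic congestion-control policy: isolation first, throttling as fallback.

class CongestionController:
    def __init__(self, special_queues=2, occupancy_threshold=0.8):
        self.free_queues = special_queues   # special queues still available
        self.threshold = occupancy_threshold
        self.isolated = set()
        self.throttled = set()

    def on_queue_sample(self, flow, occupancy):
        """Called per flow with a sampled buffer occupancy in [0.0, 1.0]."""
        if occupancy < self.threshold or flow in self.isolated or flow in self.throttled:
            return "ok"                     # not congested, or already handled
        if self.free_queues > 0:
            # preferred action: isolate, so the flow stops causing
            # head-of-line blocking for well-behaved traffic
            self.free_queues -= 1
            self.isolated.add(flow)
            return "isolate"
        # special queues exhausted: throttle injection at the source
        self.throttled.add(flow)
        return "throttle"

cc = CongestionController()
print(cc.on_queue_sample("f1", 0.5))   # ok: below threshold
print(cc.on_queue_sample("f1", 0.9))   # isolate: first congested flow
print(cc.on_queue_sample("f2", 0.95))  # isolate: second special queue used
print(cc.on_queue_sample("f3", 0.9))   # throttle: queues exhausted
```

Combining the two actions addresses the weaknesses the abstract names: isolation gives a fast reaction while resources last, and throttling bounds the damage once they are gone.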
  • ABSTRACT: On the one hand, performance and fault tolerance of interconnection networks are key design issues for high-performance computing (HPC) systems. On the other hand, cost should also be considered. Indirect topologies are often chosen in the design of HPC systems; among them, the most commonly used is the fat-tree. In this work, we focus on getting the maximum benefit from the network resources by designing a simple indirect topology with very good performance and fault-tolerance properties while keeping the hardware cost as low as possible. To that end, we propose extensions to the fat-tree topology that take full advantage of the hardware resources it consumes. In particular, we propose three new topologies with different properties in terms of cost, performance and fault tolerance. All of them achieve similar or better performance than the fat-tree, provide a good level of fault tolerance and, contrary to most available topologies, also tolerate faults in the links that connect to end nodes.
    IEEE Transactions on Parallel and Distributed Systems 01/2015; DOI:10.1109/TPDS.2015.2430863 · 2.17 Impact Factor
  • IEEE Transactions on Parallel and Distributed Systems 01/2015; DOI:10.1109/TPDS.2015.2412139 · 2.17 Impact Factor
  • ABSTRACT: In this paper we detail the key features, architectural design, and implementation of rCUDA, an advanced framework to enable remote and transparent GPGPU acceleration in HPC clusters. rCUDA allows decoupling GPUs from nodes, forming pools of shared accelerators, which brings enhanced flexibility to cluster configurations. This opens the door to configurations with fewer accelerators than nodes, as well as permits a single node to exploit the whole set of GPUs installed in the cluster. In our proposal, CUDA applications can seamlessly interact with any GPU in the cluster, independently of its physical location. Thus, GPUs can be either distributed among compute nodes or concentrated in dedicated GPGPU servers, depending on the cluster administrator’s policy. This proposal leads to savings not only in space but also in energy, acquisition, and maintenance costs. The performance evaluation in this paper with a series of benchmarks and a production application clearly demonstrates the viability of this proposal. Concretely, experiments with the matrix-matrix product reveal excellent performance compared with regular executions on the local GPU; on a much more complex application, the GPU-accelerated LAMMPS, we attain up to 11x speedup employing 8 remote accelerators from a single node with respect to a 12-core CPU-only execution. GPGPU service interaction in compute nodes, remote acceleration in dedicated GPGPU servers, and data transfer performance of similar GPU virtualization frameworks are also evaluated.
    Parallel Computing 12/2014; 40(10). DOI:10.1016/j.parco.2014.09.011 · 1.51 Impact Factor
  • ABSTRACT: The torus is a subclass of direct topologies that was defined in theory for an arbitrary number of dimensions. Although some supercomputers have recently been built on networks with five or six dimensions, the most common case is the implementation of only three. The market offers low-profile communication expansion cards whose reduced number of ports is not enough to build tori of a certain number of dimensions. In this paper, we deal with four-port expansion cards. With one of these cards per node, a 2D torus topology can be built, but not a 3D torus. However, two such cards can be used to build each node of a 3D torus: two ports interconnect the two cards with each other, and the other six connect to the six neighboring nodes in the 3D torus. Theoretically, there are several ways of assigning the dimension and direction of the ports. This paper presents a detailed study of the possible port configurations and, under specific network conditions, obtains the best of them.
    IEEE Transactions on Computers 11/2014; 63(11):2701-2715. DOI:10.1109/TC.2013.155 · 1.66 Impact Factor
  • Source
    ABSTRACT: SLURM is a resource manager that can be leveraged to share a collection of heterogeneous resources among the jobs running on a cluster. However, SLURM is not designed to handle resources such as graphics processing units (GPUs). Concretely, although SLURM can use a generic resource plug-in (GRes) to manage GPUs, with this solution the hardware accelerators can only be accessed by the job executing on the node to which the GPU is attached. This is a serious constraint for remote GPU virtualization technologies, which aim at providing user-transparent access to all GPUs in the cluster, independently of the location of the node where the application is running with respect to the GPU node. In this work we introduce a new type of device in SLURM, "rgpu", in order to gain access from any application node to any GPU node in the cluster, using rCUDA as the remote GPU virtualization solution. With this new scheduling mechanism, a user can access any number of GPUs, as SLURM schedules the tasks taking into account all the graphics accelerators available in the complete cluster. We present experimental results that show the benefits of this new approach in terms of increased flexibility for the job scheduler.
    2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing; 10/2014
  • ABSTRACT: Graphics processing units (GPUs) are being increasingly embraced by the high-performance computing community as an effective way to reduce execution time by accelerating parts of their applications. Remote CUDA (rCUDA) was recently introduced as a software solution to address the high acquisition costs and energy consumption of GPUs that constrain further adoption of this technology. Specifically, rCUDA is a middleware that allows a reduced number of GPUs to be transparently shared among the nodes in a cluster. Although the initial prototype versions of rCUDA demonstrated its functionality, they also revealed concerns with respect to usability, performance, and support for new CUDA features. In response, in this paper, we present a new rCUDA version that (1) improves usability by including a new component that allows an automatic transformation of any CUDA source code so that it conforms to the needs of the rCUDA framework, (2) consistently features low overhead when using remote GPUs thanks to an improved new communication architecture, and (3) supports multithreaded applications and CUDA libraries. As a result, for any CUDA-compatible program, rCUDA now allows the use of remote GPUs within a cluster with low overhead, so that a single application running in one node can use all GPUs available across the cluster, thereby extending the single-node capability of CUDA.
    Concurrency and Computation Practice and Experience 10/2014; DOI:10.1002/cpe.3409 · 1.00 Impact Factor
  • Source
    ABSTRACT: A large number of high-performance computing clusters include one or more GPUs per node in order to reduce application execution time. However, these accelerators are typically utilized less than 100% of the time. In this context, remote GPU virtualization can help reduce the number of GPUs needed, lowering both the acquisition cost of these devices and the energy consumption of the system. This paper investigates the overhead and possible bottlenecks in several "heterogeneous" configurations consisting of client nodes without GPUs that run CUDA applications on GPUs installed in remote servers. The evaluation is carried out on three general-purpose multicore processors (Intel Xeon, Intel Atom and ARM Cortex A9), two graphics accelerators (NVIDIA GeForce GTX480 and NVIDIA Quadro M1000), and two scientific applications (CUDASW++ and LAMMPS) used in bioinformatics and molecular dynamics simulations.
    XXV Jornadas del Paralelismo, Valladolid; 09/2014
  • Source
    ABSTRACT: SLURM is a cluster resource manager that allows a set of heterogeneous resources to be shared among the jobs in execution. However, SLURM is not designed to share resources such as graphics processors (GPUs). In fact, although SLURM supports generic resource plug-ins to handle GPUs, these can only be accessed exclusively by a job running on the node that hosts them. This is a serious drawback for remote GPU virtualization technologies, whose mission is to provide the user with completely transparent access to all the GPUs in the cluster, independently of the specific location of both the job and the GPU. In this work we present a new type of device in SLURM, "rgpu", so that an application can access any GPU in the cluster from its own node using the remote GPU virtualization technology rCUDA. Moreover, with this new scheduling mechanism, a job can use as many GPUs as exist in the cluster, provided they are available. Finally, we present the results of several simulations that show the benefits of this new approach in terms of increased job scheduling flexibility.
    XXV Jornadas de Paralelismo, Valladolid; 09/2014
  • Source
    ABSTRACT: Low-power modes in current microprocessors use low frequencies and voltages to reduce energy consumption. However, the parameter variation introduced by the fabrication process causes persistent errors at supply voltages below Vccmin. Recent proposals provide fairly low error tolerance, mainly due to the trade-off between coverage and overhead. This paper proposes a new fault-tolerant first-level data cache, which combines SRAM and eDRAM cells to provide 100% coverage of the persistent errors produced. Experimental results show that, compared with a conventional cache and assuming a 50% failure probability in low-power mode, leakage and dynamic energy savings are 85% and 62%, respectively, with minimal impact on performance. Keywords: HER cache, way prediction, eDRAM, SRAM, retention time. I. Introduction: Most current processors support multiple power modes to improve the performance/energy trade-off. In high-performance mode the processor runs at a high frequency together with an elevated supply voltage to improve workload execution time. In low-power mode, low frequency/voltage levels are used to reduce energy consumption. However, as transistor size continues to shrink in future technologies, variations due to the fabrication process will make cells less reliable at low voltages. Furthermore, if the voltage is reduced below a certain safe level, named Vccmin, the failure probability increases exponentially. Microprocessor caches have traditionally been implemented using SRAM (Static Random-Access Memory) cells. However, eDRAM (embedded Dynamic RAM) cells are being used in some modern processors [1]. Despite being slower than SRAM, eDRAM cells improve storage density by a factor of 3x to 4x and also reduce energy. However, eDRAM technology requires refresh operations to prevent the capacitors from losing their state. Since each technology has its advantages and
    XXV Jornadas de Paralelismo, Valladolid (Spain); 09/2014

Publication Stats

8k Citations
170.55 Total Impact Points


  • 1992–2015
    • Universitat Politècnica de València
      • Department of Computer Engineering
      Valencia, Spain
  • 1992–2014
    • University of Valencia
      • Department of Informatics
      Valencia, Spain
  • 2000–2012
    • University of Murcia
      • Department of Computer Engineering and Technology
      Murcia, Murcia, Spain
    • University of Southern California
      • Department of Electrical Engineering
      Los Angeles, CA, United States
    • The Ohio State University
      • Department of Computer Science and Engineering
      Columbus, Ohio, United States
    • University of Oslo
      • Department of Informatics
      Oslo, Norway
  • 2003–2011
    • Simula Research Laboratory
      Oslo, Norway
  • 1992–2011
    • University of Castilla-La Mancha
      • Departamento de Sistemas Informáticos
      Ciudad Real, Castilla-La Mancha, Spain
  • 2008
    • Keio University
      • Department of Information and Computer Science
      Tokyo, Tokyo-to, Japan
    • University of Bologna
      Bologna, Emilia-Romagna, Italy
  • 2007
    • Texas A&M University
      • Department of Computer Science and Engineering
      College Station, TX, United States
  • 2002–2005
    • Polytechnic University of Catalonia
      Barcelona, Catalonia, Spain
    • Pennsylvania State University
      • Department of Computer Science and Engineering
      University Park, PA, United States
  • 1995–2005
    • Georgia Institute of Technology
      • School of Electrical & Computer Engineering
      Atlanta, Georgia, United States
  • 1999
    • Universidad de Cantabria
      • Computers and Electronics
      Santander, Cantabria, Spain
  • 1996
    • CA Technologies
      New York, New York, United States