ABSTRACT: DRAM technology requires refresh operations to be performed in order to avoid data loss due to charge leakage. Refresh operations consume a significant amount of dynamic energy, which grows with the storage capacity. To reduce this energy, prior work has focused on reducing refreshes in off-chip memories. However, the same problem arises in the on-chip eDRAM memories used in current low-level caches. Refresh energy can dominate the dynamic consumption when a high percentage of the chip area is devoted to eDRAM cache structures.
Replacement algorithms for high-associativity low-level caches select the victim block while avoiding blocks that are likely to be reused soon. This paper combines the state-of-the-art MRUT replacement algorithm with a novel refresh policy: refresh operations are performed based on information produced by the replacement algorithm. The proposed refresh policy is implemented on top of an energy-aware eDRAM cache architecture that uses bank prediction and swap operations to save energy.
Experimental results show that, compared to a conventional eDRAM design, the proposed energy-aware cache achieves refresh energy savings of 72%. Considering the consumption of the entire on-chip memory hierarchy, the overall energy savings are 30%. These benefits come with minimal impact on performance (1.2%) and area overhead (0.4%).
Microprocessors and Microsystems 02/2015; 39(1):37-48. · 0.60 Impact Factor
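The reuse-driven refresh idea in the abstract above can be sketched as a toy model: blocks the replacement state marks as recently reused get refreshed, while the rest are allowed to decay instead of consuming refresh energy. All names and the single reuse bit below are illustrative; the paper's actual policy is driven by the MRUT replacement algorithm's reuse information.

```python
# Toy model of a reuse-aware refresh policy for one eDRAM cache set.
# A real implementation would use the replacement algorithm's state
# (e.g., MRU-Tour information) instead of a single "reused" bit.

class Line:
    def __init__(self, tag):
        self.tag = tag
        self.valid = True
        self.reused = False   # set on a hit, cleared at each refresh interval

def refresh_interval(lines):
    """Refresh only lines marked as likely to be reused; let the rest
    decay (invalidate them instead of spending refresh energy)."""
    refreshed, decayed = 0, 0
    for line in lines:
        if not line.valid:
            continue
        if line.reused:
            refreshed += 1        # pay one refresh operation
            line.reused = False   # reuse info restarts for the next interval
        else:
            line.valid = False    # block decays; a later miss refetches it
            decayed += 1
    return refreshed, decayed
```

With four valid lines of which only one was reused, a refresh interval pays for a single refresh and lets the other three decay.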
ABSTRACT: In this paper we detail the key features, architectural design, and implementation of rCUDA, an advanced framework to enable remote and transparent GPGPU acceleration in HPC clusters. rCUDA allows decoupling GPUs from nodes, forming pools of shared accelerators, which brings enhanced flexibility to cluster configurations. This opens the door to configurations with fewer accelerators than nodes, as well as permits a single node to exploit the whole set of GPUs installed in the cluster. In our proposal, CUDA applications can seamlessly interact with any GPU in the cluster, independently of its physical location. Thus, GPUs can be either distributed among compute nodes or concentrated in dedicated GPGPU servers, depending on the cluster administrator’s policy. This proposal leads to savings not only in space but also in energy, acquisition, and maintenance costs. The performance evaluation in this paper with a series of benchmarks and a production application clearly demonstrates the viability of this proposal. Concretely, experiments with the matrix-matrix product reveal excellent performance compared with regular executions on the local GPU; on a much more complex application, the GPU-accelerated LAMMPS, we attain up to 11x speedup employing 8 remote accelerators from a single node with respect to a 12-core CPU-only execution. GPGPU service interaction in compute nodes, remote acceleration in dedicated GPGPU servers, and data transfer performance of similar GPU virtualization frameworks are also evaluated.
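The transparent remote acceleration described above rests on a split-driver scheme: a client-side library marshals each CUDA-like call and ships it to a daemon on the GPU node, which executes it and returns the result. The sketch below illustrates only that marshalling/dispatch shape; the wire format, call names, and return values are invented for illustration, and rCUDA's actual protocol and API coverage are far more extensive.

```python
# Minimal sketch of call forwarding in a remote GPU virtualization
# framework. The dispatch runs in-process here; in practice the request
# would travel over TCP or InfiniBand to the GPU server.

import json

def marshal(call, **args):
    """Client side: serialize an intercepted API call into a request."""
    return json.dumps({"call": call, "args": args})

def gpu_server_dispatch(request):
    """Stand-in for the daemon on the GPU node: decode and execute."""
    req = json.loads(request)
    handlers = {
        # Hypothetical handlers; real ones would invoke the CUDA runtime.
        "cudaMalloc": lambda a: {"devPtr": 0x1000, "status": 0},
        "cudaMemcpy": lambda a: {"status": 0},
    }
    handler = handlers.get(req["call"])
    return handler(req["args"]) if handler else {"status": -1}

# The application believes it is talking to a local GPU.
reply = gpu_server_dispatch(marshal("cudaMalloc", size=1024))
```

Because the interception happens at the API boundary, the application binary needs no changes, which is what makes the acceleration "transparent".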
ABSTRACT: SLURM is a resource manager that can be leveraged to share a collection of heterogeneous resources among the jobs running in a cluster. However, SLURM is not designed to handle resources such as graphics processing units (GPUs). Concretely, although SLURM can use a generic resource plug-in (GRes) to manage GPUs, with this solution the hardware accelerators can only be accessed by the job running on the node to which the GPU is attached. This is a serious constraint for remote GPU virtualization technologies, which aim to provide user-transparent access to all the GPUs in the cluster, independently of the location of the node where the application is running with respect to the GPU node. In this work we introduce a new type of device in SLURM, "rgpu", in order to gain access from any application node to any GPU node in the cluster using rCUDA as the remote GPU virtualization solution. With this new scheduling mechanism, a user can access any number of GPUs, as SLURM schedules the tasks taking into account all the graphics accelerators available in the complete cluster. We present experimental results that show the benefits of this new approach in terms of increased flexibility for the job scheduler.
2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing; 10/2014
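The cluster-wide "rgpu" device described above could be surfaced to users in the familiar GRes style. The configuration keys and option names below are illustrative stand-ins, not the actual patch to SLURM.

```shell
# Hypothetical slurm.conf fragment on the GPU nodes: export each local
# GPU as a cluster-wide "rgpu" resource rather than a node-local "gpu".
GresTypes=rgpu
NodeName=gpunode[01-04] Gres=rgpu:2

# Hypothetical submission: the job may be placed on any node, even one
# without GPUs; rCUDA forwards its CUDA calls to whichever remote GPUs
# the scheduler assigned from the cluster-wide pool.
srun --gres=rgpu:4 ./lammps_gpu_benchmark
```

The key difference from the stock GRes plug-in is that the requested count is satisfied from all GPUs in the cluster, not only those attached to the execution node.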
ABSTRACT: SLURM is a resource manager for clusters that allows sharing a set of heterogeneous resources among running jobs. However, SLURM is not designed to share resources such as graphics processors (GPUs). In fact, although SLURM supports generic resource plug-ins to handle GPUs, these can only be accessed exclusively by a job running on the node that hosts them. This is a serious drawback for remote GPU virtualization technologies, whose goal is to provide the user with completely transparent access to all the GPUs in the cluster, regardless of the specific location of both the job and the GPU. In this work we present a new device type in SLURM, "rgpu", so that an application can access any GPU in the cluster from its own node by means of the remote GPU virtualization technology rCUDA. Moreover, with this new scheduling mechanism, a job can use as many GPUs as exist in the cluster, as long as they are available. Finally, we present the results of several simulations that show the benefits of this new approach in terms of increased job scheduling flexibility.
ABSTRACT: A large number of high-performance computing clusters include one or more GPUs per node in order to reduce application execution time. However, these accelerators are generally utilized less than 100% of the time. In this context, remote GPU virtualization can help reduce the number of GPUs needed, lowering both the acquisition cost of these devices and the energy consumption of the system. This paper investigates the overhead and potential bottlenecks of several "heterogeneous" configurations composed of GPU-less client nodes that run CUDA applications on GPUs installed in remote servers. The evaluation is carried out on three general-purpose multicore processors (Intel Xeon, Intel Atom, and ARM Cortex A9), two graphics accelerators (NVIDIA GeForce GTX480 and NVIDIA Quadro M1000), and two scientific applications (CUDASW++ and LAMMPS) used in bioinformatics and molecular dynamics simulations.
ABSTRACT: A clear trend has emerged involving the acceleration of scientific applications by using GPUs. However, the capabilities of these devices are still generally underutilized. Remote GPU virtualization techniques can help increase GPU utilization rates, while reducing acquisition and maintenance costs. The overhead of using a remote GPU instead of a local one is introduced mainly by the difference in performance between the internode network and the intranode PCIe link. In this paper we show how using the new InfiniBand Connect-IB network adapters (attaining throughput similar to that of the most recently emerged GPUs) boosts the performance of remote GPU virtualization, reducing the overhead to a mere 0.19% in the application tested.
ABSTRACT: To mitigate the impact of bandwidth contention, which can cause performance degradations of up to 40% in some processes, we devise a scheduling algorithm that tackles both main memory and L1 bandwidth contention. Experimental evaluation on a real system shows that the proposal achieves an average speedup of 5% with respect to Linux.
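A bandwidth-aware scheduler of the kind described above must decide which processes may run together. One simple heuristic, sketched below, pairs the heaviest remaining bandwidth consumer with the lightest so that no co-scheduled pair saturates the bus. The numbers and the pairing rule are illustrative only; the actual proposal also accounts for L1 bandwidth contention.

```python
# Toy bandwidth-aware co-scheduler: given each process's measured
# memory-bandwidth demand, build co-run pairs that balance pressure
# on the shared memory bus.

def pair_by_bandwidth(demands):
    """demands: {pid: bandwidth in GB/s}. Returns (light, heavy) pairs."""
    ordered = sorted(demands, key=demands.get)
    pairs = []
    while len(ordered) >= 2:
        light = ordered.pop(0)   # lightest remaining consumer
        heavy = ordered.pop(-1)  # heaviest remaining consumer
        pairs.append((light, heavy))
    return pairs

# Heavy consumers A and C each get a light partner.
pairs = pair_by_bandwidth({"A": 9.0, "B": 1.0, "C": 7.5, "D": 2.0})
```

A real scheduler would recompute demands from hardware performance counters every quantum rather than assume static values.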
ABSTRACT: Many current high-performance clusters include one or more GPUs per node in order to dramatically reduce application execution time, but the utilization of these accelerators is usually far below 100%. In this context, remote GPU virtualization can help to reduce acquisition costs as well as the overall energy consumption. In this paper, we investigate the potential overhead and bottlenecks of several "heterogeneous" scenarios consisting of client GPU-less nodes running CUDA applications and remote GPU-equipped server nodes providing access to NVIDIA hardware accelerators. The experimental evaluation is performed using three general-purpose multicore processors (Intel Xeon, Intel Atom and ARM Cortex A9), two graphics accelerators (NVIDIA GeForce GTX480 and NVIDIA Quadro M1000), and two relevant scientific applications (CUDASW++ and LAMMPS) arising in bioinformatics and molecular dynamics simulations.
The Fourth International Conference on Smart Grids, Green Communications and IT Energy-aware Technologies (ENERGY 2014), Chamonix, France; 04/2014
ABSTRACT: To improve chip multiprocessor (CMP) performance, recent research has focused on scheduling strategies to mitigate main memory bandwidth contention. Nowadays, commercial CMPs implement multilevel cache hierarchies that are shared by several multithreaded cores. In this microprocessor design, contention points may appear along the whole memory hierarchy. Moreover, this problem is expected to worsen in future technologies, since the number of cores and hardware threads, and consequently the size of the shared caches, increases with each microprocessor generation. This paper characterizes the impact on performance of the different contention points that appear along the memory subsystem. The analysis shows that some benchmarks are more sensitive to contention in higher levels of the memory hierarchy (e.g., shared L2) than to main memory contention. In this paper, we propose two generic scheduling strategies for CMPs. The first strategy takes into account the available bandwidth at each level of the cache hierarchy. The strategy selects the processes to be coscheduled and allocates them to cores to minimize contention effects. The second strategy also considers the performance degradation each process suffers due to contention-aware scheduling. Both proposals have been implemented and evaluated in a commercial single-threaded quad-core processor with a relatively small two-level cache hierarchy. The proposals reach, on average, performance improvements of 5.38 and 6.64 percent when compared with the Linux scheduler, while this improvement is 3.61 percent for a state-of-the-art memory-contention-aware scheduler under the evaluated mixes.
IEEE Transactions on Parallel and Distributed Systems 03/2014; 25(3):581-590. · 2.17 Impact Factor
ABSTRACT: Although performance is a key design issue of interconnection networks, fault tolerance is becoming more important due to the large number of components in large machines. In this paper, we focus on designing a simple indirect topology with both good performance and fault-tolerance properties. The idea is to take full advantage of the network resources consumed by the topology. To do so, starting from the RUFT topology, a simple UMIN topology that does not tolerate any link fault, we first duplicate the injection and ejection links, connecting these extra links in a particular way. The resulting topology tolerates 3 network link faults and also slightly increases performance with a marginal increase in network hardware cost. Most importantly, contrary to most available topologies, it is also able to tolerate faults in the links that connect to end nodes. We also propose another topology that duplicates the network links as well, achieving 2x performance improvements and tolerating up to 7 network link faults. These results are better than those obtained by a BMIN with a similar amount of resources.
Proceedings of the 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing; 02/2014
ABSTRACT: The torus topology is widely used in the largest supercomputers, especially the three-dimensional torus. To implement a 3D torus topology, six ports (links) per node are needed, which can be offered by a single communication card or by several. We showed how to build 3D tori by using two four-port low-profile cards per node and, since there are multiple ways of assigning the dimension and direction of the card ports, we found the optimal port configuration. In these tori, routing becomes a challenge because deadlocks can appear due to the use of the link interconnecting the two cards in the nodes. In this paper we study this problem and present two different alternatives based on the DOR routing scheme.
Proceedings of the 8th International Workshop on Interconnection Network Architecture: On-Chip, Multi-Chip; 01/2014
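The DOR scheme that both alternatives above build on can be sketched for a 3D torus: traverse the X ring first, then Y, then Z, taking the shorter direction on each ring. The port/card assignment and the deadlock-avoidance measures for the inter-card link are omitted here.

```python
# Dimension-order routing (DOR) on a 3D torus with k nodes per dimension.

def dor_next_hop(cur, dst, k):
    """cur, dst: (x, y, z) coordinates; k: ring size per dimension.
    Returns the next node's coordinates, or cur if already at dst."""
    for dim in range(3):  # resolve X fully, then Y, then Z
        if cur[dim] != dst[dim]:
            fwd = (dst[dim] - cur[dim]) % k      # hops in the "+" direction
            step = 1 if fwd <= k - fwd else -1   # pick the shorter direction
            nxt = list(cur)
            nxt[dim] = (cur[dim] + step) % k     # wrap around the ring
            return tuple(nxt)
    return cur
```

For example, on a 4-ary ring, going from x=0 to x=3 takes the "-" direction (one wrap-around hop) rather than three "+" hops.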
ABSTRACT: The overall performance of High-Performance Computing applications may depend largely on the performance achieved by the network interconnecting the end nodes; thus high-speed interconnect technologies like InfiniBand are used to provide high throughput and low latency. Nevertheless, network performance may be degraded due to congestion, so using techniques to deal with the problems derived from congestion has become practically mandatory. In this paper we propose a straightforward congestion-management method suitable for fat-tree topologies built from InfiniBand components. Our proposal is based on a traffic-flow-to-service-level mapping that prevents, as much as possible with the resources available in current InfiniBand components (basically Virtual Lanes), the negative impact of the two most common problems derived from congestion: head-of-line blocking and buffer-hogging. We also provide a mathematical approach to analyze the efficiency of our proposal and several others, by means of a set of analytical metrics. In certain traffic scenarios, we observe up to 68% of the ideal performance gain that could be achieved in HoL-blocking and buffer-hogging prevention.
Journal of Parallel and Distributed Computing 01/2014; 74(1):1802–1819. · 1.12 Impact Factor
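The flow-to-service-level idea above can be illustrated with a minimal mapping rule: flows addressed to different destinations are spread across the available InfiniBand Service Levels (and hence Virtual Lanes), so traffic heading to a congested destination cannot HoL-block traffic heading elsewhere. The paper derives its mapping from the fat-tree structure; the simple modulo rule below is only a stand-in.

```python
# Illustrative flow-to-Service-Level mapping. Destination LIDs are spread
# across the usable SLs so that each VL buffers a disjoint set of flows.

NUM_SLS = 8  # a typical number of usable InfiniBand SLs/VLs

def sl_for_flow(src_lid, dst_lid):
    """Map a (source, destination) flow to a Service Level.
    Stand-in rule: destination LID modulo the number of SLs."""
    return dst_lid % NUM_SLS
```

Under this rule, all flows toward one hot destination share a single SL and exhaust only that VL's buffer space, while flows to other destinations keep their own lanes, which is exactly the separation that mitigates HoL blocking and buffer-hogging.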
ABSTRACT: In application-specific SoCs, the irregularity of the topology results in a complex, customized implementation of the routing algorithm, usually relying on routing tables implemented with memory structures at the source end nodes. As system size increases, the routing tables also grow, with a non-negligible impact on power, area, and latency. In this paper, we present a routing implementation for application-specific SoCs that efficiently implements a deadlock-free routing algorithm in these irregular networks, with no routing tables and only a small logic block in every switch. The mechanism relies on a tool that maps the initial irregular topology of the SoC system onto a logical regular structure where the mechanism can be applied. We provide details for both the mapping tool and the proposed routing mechanism. Evaluation results show the effectiveness of the mapping tool as well as the low area and timing requirements of the mechanism. With the mapping tool and the routing mechanism, complex irregular SoC topologies can now be supported without the need for routing tables.
IEEE Transactions on Computers 01/2014; 63(3):557-569. · 1.47 Impact Factor
ABSTRACT: As parallel computing systems increase in size, the interconnection network is becoming a critical subsystem. The current trend in network design is to use as few components as possible to interconnect the end nodes, thereby reducing cost and power consumption. However, this increases the probability of congestion appearing in the network. As congestion may severely degrade network performance, the use of a congestion management mechanism is becoming mandatory in modern interconnects. One of the most cost-effective proposals to deal with the problems derived from congestion situations is the Regional Explicit Congestion Notification (RECN) strategy, based on using special queues to totally isolate the packet flows which contribute to congestion, thereby preventing the Head-of-Line (HoL) blocking effect that these flows may cause to others. Unfortunately, RECN requires the use of source-based routing, and thus is not suitable for interconnects with distributed routing, like InfiniBand. Although some RECN-like mechanisms have been proposed for distributed-routing networks, they are not scalable due to the huge amount of control memory that they require in medium-size or large networks. In this paper, we propose Distributed-Routing-Based Congestion Management (DRBCM), a new scalable technique which, following the RECN principles, totally prevents congestion from producing HoL blocking in multistage interconnection networks (MINs) using tag-based distributed routing. Simulation results indicate that, regardless of network size, DRBCM presents small resource requirements to keep network performance at maximum level even in scenarios of heavy congestion, where it utterly outperforms (with a gain of up to 70 percent) current solutions for distributed-routing networks, like the InfiniBand congestion-control mechanism based on injection throttling. Thus, DRBCM is an efficient, cost-effective, and scalable solution for congestion management.
IEEE Transactions on Parallel and Distributed Systems 10/2013; 24(10):1918-1929. · 2.17 Impact Factor
ABSTRACT: In recent years, large-scale distributed virtual environments have become a major trend in distributed applications, mainly due to the enormous popularity of multi-player online games in the entertainment industry. Thus, scalability has become an essential issue for these highly interactive systems. In this paper, we propose a new synchronization technique for distributed virtual environments based on networked-server architectures. Unlike other methods described in the literature, the proposed technique takes into account the updating messages exchanged by avatars, thus releasing the servers from updating the location of those avatars when synchronizing the state of the system. As a result, the communications required for synchronization are greatly reduced, and the method is more scalable. Also, these communications are distributed along the whole synchronization period, in order to reduce workload peaks. Performance evaluation results show that the proposed approach significantly reduces the percentage of CPU utilization in the servers when compared with other existing methods, thereby supporting a higher number of avatars. Additionally, the system response time is reduced accordingly. Keywords: distributed virtual environments, synchronization technique
Proceedings of XVII Jornadas de Paralelismo. 09/2013;
ABSTRACT: High-radix switches reduce network cost and improve network performance, especially in large switch-based interconnection networks. However, the current integration scale poses problems for implementing such switches on a single chip. An interesting alternative for building high-radix switches consists of combining several current smaller single-chip switches to obtain switches with a greater number of ports. A key design issue of this kind of high-radix switch is the internal switch configuration, specifically, the correspondence between the ports of the high-radix switch and the ports of its smaller internal single-chip switches. In this paper we use artificial intelligence and data mining techniques to obtain the optimal internal configuration of all the switches in the network of large supercomputers running parallel applications. Simulation results show that, using the resulting switch configurations, it is possible to achieve performance similar to that of single-chip switches with the same radix, which would be unfeasible at the current integration scale.
Journal of Parallel and Distributed Computing 09/2013; 73(9):1239-1250. · 1.12 Impact Factor
ABSTRACT: Head-of-Line (HoL) blocking is a well-known phenomenon that may dramatically degrade the performance of modern high-performance interconnection networks. Many techniques have been proposed to solve this problem, most of them based on separating traffic flows into different queues at switch ports. However, the efficiency of these proposals may vary depending on the network topology or routing algorithm, as many of them are not aware of any specific network configuration. By contrast, other schemes are tailored to specific topologies like fat-trees, achieving greater efficiency than "topology-agnostic" schemes. In this paper we propose a straightforward queuing scheme intended for an efficient, recently proposed hybrid topology. Our proposal significantly boosts network performance with respect to other queuing schemes while requiring similar or fewer resources. Moreover, implementing this scheme in InfiniBand-based networks is elementary thanks to the mapping of Service Levels to Virtual Lanes supported by that specification.
Proceedings of the 19th international conference on Parallel Processing; 08/2013
ABSTRACT: This work introduces a novel refresh mechanism that leverages reuse information to decide which blocks should be refreshed in an energy-aware eDRAM last-level cache. Experimental results show that, compared to a conventional eDRAM cache, the energy-aware approach achieves refresh energy savings of up to 71%, while the overall dynamic energy is reduced by 65% with negligible performance losses.
Proceedings of the 27th ACM International Conference on Supercomputing; 06/2013