
Ramon Beivide
- PhD
- Professor (Full) at University of Cantabria
About
163
Publications
27,925
Reads
2,487
Citations
Introduction
Current institution
Additional affiliations
September 1991 - September 2016
Publications (163)
Low-diameter network topologies require non-minimal routing, such as Valiant routing, to avoid network congestion under challenging traffic patterns like the so-called adversarial. However, this mechanism tends to increase the average path length, base latency, and network load. The use of shorter non-minimal paths has the potential to enhance perf...
Since today’s HPC and data center systems can comprise hundreds of thousands of servers and beyond, it is crucial to equip them with a network that provides high performance. New topologies proposed to achieve such performance need to be evaluated under different traffic conditions, aiming to closely replicate real-world scenarios. While most optim...
Heterogeneous systems are nowadays a common choice in the path to Exascale. Through the use of accelerators they offer outstanding energy efficiency. The programming of these devices employs the host-device model, which is suboptimal as the CPU remains idle during kernel executions, but still consumes energy. Making the CPU contribute computing...
Currently, companies in the aquaculture sector must control a large number of factors in order to carry out their activity as productively as possible. Because of this high complexity, companies, in collaboration with universities, seek methods for planning fish farming so that it yields the maximum benefit. In this regard...
To interconnect their growing number of servers, current supercomputers and data centers are starting to adopt low-diameter networks, such as HyperX, Dragonfly and Dragonfly+. These emergent topologies require balancing the load over their links and finding suitable non-minimal routing mechanisms for them becomes particularly challenging. The Valia...
Nowadays, Artificial Intelligence (AI) techniques are applied in enterprise software to solve Big Data and Business Intelligence (BI) problems. But most AI techniques are computationally expensive, and they become infeasible for common business use. Therefore, specific high performance computing is needed to reduce the response time and make the...
Supercomputers and datacenters comprise hundreds of thousands of servers. Different network topologies have been proposed to attain such a high scalability, from Flattened Butterfly and Dragonfly to the most disruptive Jellyfish, which is based on a random graph. The routing problem on such networks remains a challenge that can be tackled either as...
A challenge that heterogeneous system programmers face is leveraging the performance of all the devices that make up the system. This paper presents Sigmoid, a new load balancing algorithm that efficiently co-executes a single OpenCL data-parallel kernel on all the devices of heterogeneous systems. Sigmoid splits the workload proportionally to th...
Many-core processors demand scalable, efficient and low latency NoCs. Bypass routers are an affordable solution to attain low latency in relatively simple topologies like the mesh. SMART improves on traditional bypass routers implementing multi-hop bypass which reduces the importance of the distance between pairs of nodes. Nevertheless, the conserv...
Minimizing latency and power are key goals in the design of NoC routers. Different proposals combine lookahead routing and router bypass to skip the arbitration and buffering, reducing router delay. However, the conditions to use the bypass require completely empty buffers in the intermediate routers. This restricts the amount of flits that use the...
Heterogeneous systems are present everywhere, from powerful supercomputers to desktop computers and mobile devices, thanks to their excellent performance and energy consumption. The ubiquity of these architectures in both desktop systems and medium-sized service servers allows enough variability to exploit a wide range of problems, such as multimedia wo...
Fat-trees (FTs) are widely known topologies that, among other advantages, provide full bisection bandwidth. However, many implementations of FTs are made slimmed to cheapen the infrastructure, since most applications do not make use of this full bisection bandwidth. In this paper Extended Generalized Random Folded Clos (XGRFC) interconnection netwo...
Minimizing latency and power are key goals in the design of NoC routers. Different proposals combine lookahead routing and router bypass to skip the arbitration and buffering, reducing router delay. However, the conditions to use them require completely empty buffers in the intermediate routers. This restricts the amount of flits that use the bypa...
Heterogeneous systems have become one of the most common architectures today, thanks to their excellent performance and energy consumption. However, due to their heterogeneity they are very complex to program and even more to achieve performance portability on different devices. This paper presents EngineCL, a new OpenCL-based runtime system that o...
Current compute-intensive applications largely exceed the resources of single-core processors. To face this problem, multi-core processors along with parallel computing techniques have become a solution to increase the computational performance. Likewise, multi-processors are fundamental to support new technologies and new science applications chal...
Low latency and low implementation cost are two key requirements in NoCs. SMART routers implement multi-hop bypass, obtaining latency values close to an ideal point-to-point interconnect. However, it requires a significant amount of resources such as Virtual Channels (VCs), which are not used as efficiently as possible, preventing bypass in certain...
Low-diameter network topologies require non-minimal routing to avoid network congestion, such as Valiant routing. This increases base latency but avoids congestion issues. Optimized restricted variants focus on reducing path length. However, these optimizations only reduce paths for local traffic, where source and destination of each packet belong...
Asynchronous task-based programming models are gaining popularity to address the programmability and performance challenges in high performance computing. One of the main attractions of these models and runtimes is their potential to automatically expose and exploit overlap of computation with communication. However, we find that inefficient intera...
Heterogeneous systems composed of a CPU and a set of different hardware accelerators are very compelling thanks to their excellent performance and energy consumption features. One of the most important problems of those systems is the workload distribution among their devices. This paper describes an extension of the Maat library to allow the co-ex...
Asynchronous task-based programming models are gaining popularity to address programmability and performance challenges in high performance computing. One of the main attractions of these models and runtimes is their potential to automatically expose and exploit overlap of computation with communication. However, inefficient interactions between su...
Low-diameter networks require non-minimal adaptive routing to deal with varying traffic characteristics and avoid pathological performance. Such routing is based on local estimations of network congestion, based on link-level flow control credits. Dragonfly networks based on the extensions of commodity Ethernet networks using OpenFlow have been pro...
The emergence of heterogeneous systems has been very notable recently. The nodes of the most powerful computers integrate several compute accelerators, like GPUs. Profiting from such node configurations is not a trivial endeavour. OmpSs is a framework for task-based parallel applications that allows the execution of OpenCL kernels on different com...
Minimizing latency and power are key goals in the design of NoC routers. Different proposals combine lookahead routing and router bypass to skip the arbitration and buffering stages of their pipeline, reducing router delay to a single-cycle. However, the conditions to use the bypass are unnecessarily conservative, requiring completely empty buffers...
Heterogeneous systems composed of a CPU and a set of hardware accelerators have become one of the most common architectures today, thanks to their excellent performance and energy consumption. However, due to their heterogeneity they are very complex to program and even more to achieve performance portability on different devices. This paper presen...
The growing complexity of multi-core architectures has motivated a wide range of software mechanisms to improve the orchestration of parallel executions. Task parallelism has become a very attractive approach thanks to its programmability, portability and potential for optimizations. However, with the expected increase in core counts, fine-grained...
Valiant routing randomizes network traffic to avoid pathological congestion issues by diverting traffic to a random intermediate switch. It has received significant attention in recently proposed high-radix, low-diameter topologies, which are prone to congestion issues. It has been implemented obliviously, or as the basis of some non-minimal adapti...
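The mechanism named in the abstract above, diverting each packet through a random intermediate switch, can be illustrated with a minimal Python sketch. This is not the implementation from the paper: the bidirectional ring topology, the function names `ring_minimal_path` and `valiant_path`, and the seeded random generator are all assumptions chosen only to keep the example small.

```python
import random


def ring_minimal_path(src, dst, n):
    """Shortest path on a bidirectional ring of n switches (hypothetical toy topology)."""
    forward = (dst - src) % n
    # Go clockwise when it is at least as short as counter-clockwise.
    step = 1 if forward <= n - forward else -1
    path = [src]
    while path[-1] != dst:
        path.append((path[-1] + step) % n)
    return path


def valiant_path(src, dst, n, rng):
    """Valiant-style route: minimal path to a random intermediate switch, then minimal path to dst."""
    mid = rng.randrange(n)
    first_leg = ring_minimal_path(src, mid, n)
    second_leg = ring_minimal_path(mid, dst, n)
    # Drop the repeated intermediate switch when joining both legs.
    return first_leg + second_leg[1:]


rng = random.Random(0)  # fixed seed so the run is repeatable
path = valiant_path(0, 3, 8, rng)
print(path)
```

The returned path always starts at the source and ends at the destination, but it may be up to twice as long as the minimal one, which is the latency cost that Valiant routing trades for load balance.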
Large scale, high performance, fault tolerance, low cost and graceful expandability are sought-after features in current datacenter networks (DCN). Although there have been many proposals for DCNs, most modern installations are equipped with classical folded Clos networks. Recently, regular random topologies, such as Jellyfish, have been proposed for DC...
The Graph500 benchmark attempts to steer the design of High-Performance Computing systems to maximize the performance under memory-constricted application workloads. A realistic simulation of such benchmarks for architectural research is challenging due to size and detail limitations. By contrast, synthetic traffic workloads constitute one of the l...
Heterogeneous systems are nowadays a common choice in the path to Exascale. Through the use of accelerators they offer outstanding energy efficiency. The programming of these devices employs the host-device model, which is suboptimal as CPU remains idle during kernel executions, but still consumes energy. Making the CPU contribute computing effort...
The interconnection network comprises a significant portion of the cost of large parallel computers, both in economic terms and power consumption. Several previous proposals exploit large-radix routers to build scalable low-distance topologies with the aim of minimizing these costs. However, they fail to consider potential unbalance in the network...
Commodity Ethernet networks are used in many HPC systems. Extensions based on OpenFlow have been proposed for large HPC deployments, considering scalability and power consumption concerns. Such designs employ low-diameter topologies to minimize power consumption, such as Flattened Butterflies or Dragonflies. However, these topologies require non-mi...
The use of heterogeneous systems in supercomputing is on the rise as they improve both performance and energy efficiency. However, the programming of these machines requires considerable effort to get the best results in massively data-parallel applications. Maat is a library that enables OpenCL programmers to efficiently execute single data-parall...
As BigData applications have gained momentum over the last years, the Graph500 benchmark has appeared in an attempt to steer the design of HPC systems to maximize the performance under memory-constricted application workloads. A realistic simulation of such benchmarks for architectural research is challenging due to size and detail limitations, and...
Dragonfly networks arrange network routers in a two-level hierarchy, providing a competitive cost-performance solution for large systems. Non-minimal adaptive routing (adaptive misrouting) is employed to fully exploit the path diversity and increase the performance under adversarial traffic patterns. Network fairness issues arise in the dragonfly f...
High-performance computing (HPC) is recognized as one of the pillars for further progress in science, industry, medicine, and education. Current HPC systems are being developed to overcome emerging architectural challenges in order to reach Exascale level of performance, projected for the year 2020. The much larger embedded and mobile market allows...
This article provides background information about interconnection networks, an analysis of previous developments, and an overview of the state of the art. The main contribution of this article is to highlight the importance of the interpolation and extrapolation of technological changes and physical constraints in order to predict the optimum futu...
Heterogeneous architectures have experienced a tremendous development in the last decade thanks to their excellent cost/performance ratio and low power consumption. But heterogeneity significantly complicates both programming and efficient utilisation of all the available resources. As a result, programmers have ended up using fixed roles for each...
The interconnection network comprises a significant portion of the cost of large parallel computers, both in economic terms and power consumption. Several previous proposals exploit large-radix routers to build scalable low-distance topologies with the aim of minimizing these costs. However, they fail to consider potential unbalance in the network...
Dragonfly networks have a two-level hierarchical arrangement of the network routers, and allow for a competitive cost-performance solution in large systems. Non-minimal adaptive routing is employed to fully exploit the path diversity and increase the performance under adversarial traffic patterns. Throughput unfairness prevents a balanced use of th...
In recent years many different interconnection networks have been used, with two main tendencies. One is characterized by the use of high-degree routers with long wires while the other uses routers of much smaller degree. The latter rely on two-dimensional mesh and torus topologies with shorter local links. This paper focuses on doubling the degre...
In this paper a wide family of identifying codes over regular Cayley graphs of degree four which are built over finite Abelian groups is presented. Some of the codes in this construction are also perfect. The graphs considered include some well-known graphs such as tori, twisted tori and Kronecker products of two cycles. Therefore, the codes can be...
Current High-Performance Computing (HPC) and data center networks rely on large-radix routers. Hamming graphs (Cartesian products of complete graphs) and dragonflies (two-level direct networks with nodes organized in groups) are some direct topologies proposed for such networks. The original definition of the dragonfly topology is very loose, with...
Adaptive deadlock-free routing mechanisms are required to handle variable traffic patterns in dragonfly networks. However, distance-based deadlock avoidance mechanisms typically employed in Dragonflies increase the router cost and complexity as a function of the maximum allowed path length. This paper presents on-the-fly adaptive routing (OFAR), a...
The present work is devoted to characterizing the family of symmetric undirected Cayley graphs over finite Abelian groups for degrees 4 and 6.
Dragonfly topologies are recent network designs that are considered one of the most promising interconnect options for Exascale systems. They offer a low diameter and low network cost, but do so at the expense of path diversity, which makes them vulnerable to certain adversarial traffic patterns. Indirect routing approaches can alleviate the perfor...
Torus networks of moderate degree have been widely used in the supercomputer industry. Tori are superb when used for executing applications that require near-neighbor communications. Nevertheless, they are not so good when dealing with global communications. Hence, typical 3D implementations have evolved to 5D networks, among other reasons, to redu...
Torus networks of moderate degree have been widely used in the supercomputer industry. Tori are superb when used for executing applications that require near-neighbor communications. Nevertheless, they are not so good when dealing with global communications. Hence, typical 3D implementations have evolved to 5D networks, among other reasons, to redu...
High-radix hierarchical networks are cost-effective topologies for large scale computers. In such networks, routers are organized in super nodes, with local and global interconnections. These networks, known as Dragonflies, outperform traditional topologies such as multi-trees or tori, in cost and scalability. However, depending on the traffic patt...
Many current VLSI on-chip multiprocessors and systems-on-chip employ point-to-point switched interconnection networks. Rings and 2D-meshes are among the most popular interconnection topologies for these increasingly important on-chip networks. Nevertheless, rings cannot scale beyond dozens of nodes and meshes are asymmetric. Two of the key features...
Dragonfly networks are appealing topologies for large-scale data center and HPC networks that provide high throughput with low diameter and moderate cost. However, they are prone to congestion under certain frequent traffic patterns that saturate specific network links. Adaptive non-minimal routing can be used to avoid such congestion. That kind o...
A complete family of Cayley graphs of degree four, denoted as L-networks, is considered in this paper. L-networks are 2D mesh-based topologies with wrap-around connections. L-networks constitute a graph-based model which encompasses many previously proposed 2D interconnection networks. Some of them have been extensively used in the industry as the unde...
Twisted torus topologies have been proposed as an alternative to toroidal rectangular networks, improving distance parameters and providing network symmetry. However, twisting is apparently less amenable to task mapping algorithms of real life applications. In this paper we make an analytical study of different mapping and concentration techniques...
Network contention is seen as a major hurdle to achieve higher throughput in today's large-scale high-performance computing systems. Even more so with the current trend of employing blocking networks driven by the need of reducing cost. Additionally, the effect is aggravated by current system schedulers that allocate jobs as soon as nodes become av...
Dragonfly networks are composed of interconnected groups of routers. Adaptive routing allows packets to be forwarded minimally or non-minimally adapting to the traffic conditions in the network. While minimal routing sends traffic directly between groups, non-minimal routing employs an intermediate group to balance network load. A random selection...
Network design aspects that influence cost and performance can be classified according to their distance from the applications, into issues concerning topology, switch technology, link technology, network adapter, and communication library. The network adapter has a privileged position to take decisions with more global information than any other c...
This work attempts to compare size and cost of two network topologies proposed for large-radix routers: concentrated torus and dragonflies. We study and compare the scalability, cost and fault tolerance of each network. On average, we found that a concentrated torus can be a cost-efficient option for middle-range networks.
Dragonfly networks have been recently proposed for the interconnection network of forthcoming exascale supercomputers. Relying on large-radix routers, they build a topology with low diameter and high throughput, divided into multiple groups of routers. While minimal routing is appropriate for uniform traffic patterns, adversarial traffic patterns c...
This paper analyzes the robustness of king networks with respect to fault tolerance. To this end, two well-known fault-tolerant routing algorithms are evaluated in king networks as well as in 2D networks: Immunet, which uses two virtual channels, and Immucube, which performs better while requiring three virtual channels. Experimental results...
Deadlock free routing techniques for torus topologies have been a subject of deep study in the field of HPC interconnects and many proposals exist in the literature. Practical deadlock avoidance techniques can be classified into two main categories, requiring either a segregation of traffic in non-cyclic virtual networks or some form of injection c...
Inter-application network contention is seen as a major hurdle to achieve higher throughput in today's large-scale high-performance capacity systems. This effect is aggravated by current system schedulers that allocate jobs as soon as nodes become available, thus producing job fragmentation, i.e., the tasks of one job might be spread throughout the...
Transactional Memory (TM) intends to simplify the design and implementation of the shared-memory data structures used in parallel software. Many Software TM systems are based on writer-locks to protect the data being modified. Such implementations can suffer from the “privatization” problem, in which transactional and non-transactional accesses to...
Many current parallel computers are built around a torus interconnection network. Machines from Cray, HP, and IBM, among others, make use of this topology. In terms of topological advantages, square (2D) or cubic (3D) tori would be the topologies of choice. However, for different practical reasons, 2D and 3D tori with a different number of nodes per...
Many shared-memory parallel systems use lock-based synchronization mechanisms to provide mutual exclusion or reader-writer access to memory locations. Software locks are inefficient either in memory usage, lock transfer time, or both. Proposed hardware locking mechanisms are either too specific (for example, requiring static assignment of threads t...
A graph-based model of perfect two-dimensional codes is presented in this work. This model facilitates the study of the metric properties of the codes. Signal spaces are modeled by means of Cayley graphs defined over the Gaussian integers and denoted as Gaussian graphs. Codewords of perfect codes will be represented by vertices of a quotient graph...
In this paper we consider perfect codes over two-dimensional QAM-type constellations of any cardinality. Such constellations are modeled by L-graphs, which form the two-dimensional family of multidimensional circulants. We show that Gaussian graphs, Lee graphs and the Kronecker product of two cycles are included in this family. The...
In this paper we propose two new topologies for on-chip networks that we have denoted as king mesh and king torus. These are a higher degree evolution of the classical mesh and torus topologies. In a king network packets can traverse the networks using orthogonal and diagonal movements like the king on a chess board. First we present a topological...
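To make the chess-king analogy in the abstract above concrete, the following Python sketch compares hop counts between two nodes in a classical mesh and in a king mesh. It is only an illustration under stated assumptions: nodes are identified by hypothetical `(x, y)` grid coordinates, there are no wrap-around links (so it does not cover the king torus), and the function names are invented for this example rather than taken from the paper.

```python
def mesh_hops(a, b):
    """Hop count in a classical 2D mesh: only orthogonal moves, so Manhattan distance."""
    (ax, ay), (bx, by) = a, b
    return abs(ax - bx) + abs(ay - by)


def king_mesh_hops(a, b):
    """Hop count in a king mesh: a diagonal move covers one step in x and y at once,
    so the distance is the larger of the two coordinate offsets (Chebyshev distance)."""
    (ax, ay), (bx, by) = a, b
    return max(abs(ax - bx), abs(ay - by))


src, dst = (0, 0), (3, 2)
print(mesh_hops(src, dst), king_mesh_hops(src, dst))
```

For the pair `(0, 0)` and `(3, 2)` the mesh needs five hops while the king mesh needs three, because two of the moves can be taken diagonally.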
The search for perfect error-correcting codes has received intense interest since the seminal work by Hamming. Decades ago, Golomb and Welch studied perfect codes for the Lee metric in multidimensional torus constellations. In this work, we focus our attention on a new class of four-dimensional signal spaces which include tori as subcases. Our cons...
A family of oblivious routing schemes for fat trees and their slimmed versions is presented in this work. First, two popular oblivious routing algorithms, which we refer to as S-mod-k and D-mod-k, are analyzed in detail. S-mod-k is the default routing algorithm given as an example in the first works formally describing fat tree networks. D-mod-k ha...
New static source routing algorithms for High Performance Computing (HPC) are presented in this work. The target parallel architectures are based on the commonly used fat-tree networks and their slimmed versions. The evaluation of such proposals and their comparison against currently used routing mechanisms have been driven by realistic traffic gen...
To deal with the "memory wall" problem, microprocessors include large secondary on-chip caches. But as these caches grow, they create a new latency gap between them and fast L1 caches (inter-cache latency gap). Recently, Non-Uniform Cache Architectures (NUCAs) have been proposed to sustain the size growth trend of secondary caches that is thre...
A complete mechanism for tolerating multiple failures in parallel computer systems, denoted as Immunet, is described in this paper. Immunet can be applied to arbitrary topologies, either regular or irregular, exhibiting in both cases graceful performance degradation. Provided that the network remains connected, Immunet is able to deal with any numb...
In this paper we consider a broad family of toroidal networks, denoted as Gaussian networks, which include many previously proposed and used topologies. We will define such networks by means of the Gaussian integers, the subset of the complex numbers with integer real and imaginary parts. Nodes in Gaussian networks are labeled by Gaussian integers,...
In order to propose a new metric over QAM constellations, diagonal Gaussian graphs defined over quotients of the Gaussian integers are introduced in this paper. Distance properties of the constellations are detailed by means of the vertex-to-vertex distribution of this family of graphs. Moreover, perfect codes for this metric are considered. Finall...
Without care, Hardware Transactional Memory presents several performance pathologies that can degrade its performance. Among them, writers of commonly read variables can suffer from starvation. Though different solutions have been proposed for HTM systems, hybrid systems can still suffer from this performance problem, given that software transactio...
A set of signal points is called a hexagonal constellation if it is possible to define a metric so that each point has exactly six neighbors at distance 1 from it. As sets of signal points, quotient rings of the ring of Eisenstein-Jacobi integers are considered. For each quotient ring, the corresponding graph is defined. In turn, the distance betwe...
An algebraic methodology for defining new metrics over two-dimensional signal spaces is presented in this work. We have mainly considered quadrature amplitude modulation (QAM) constellations which have previously been modeled by quotient rings of Gaussian integers. The metric over these constellations, based on the distance concept in circulant gra...
Although they have been the main server technology for many years, multiprocessors are undergoing a renaissance due to multi-core chips and the attractive scalability properties of combining a number of such multi-core chips into a system. The widespread use of multiprocessor systems will make performance losses due to consistency models and synchr...
Cayley graphs over quotients of the quaternion integers are going to be used to define a new metric over four dimensional lattices. We will consider perfect 1-error correcting codes according to this metric space. We will show that, in some cases, these lattices can be represented as two-dimensional constellations, which allow us to state a relatio...
Many parallel computers use Tori interconnection networks. Machines from Cray, HP and IBM, among others, exploit these topologies. In order to maintain full network symmetry, 2D and 3D Tori must have the same number of nodes (k) per dimension resulting in square or cubic topologies. Nevertheless, for practical reasons, computer engineers have desig...
In this paper we present perfect codes for two-dimensional constellations derived from generalized Gaussian graphs, a family of graphs built over quotient rings of Gaussian integers. Using the generalized Gaussian graphs distance, we solve the problem of finding t-dominating sets and, then, we build new perfect codes over these graphs. The well-kno...
This paper explores the suitability of dense circulant graphs of degree four for the design of on-chip interconnection networks. Networks based on these graphs reduce the torus diameter by a factor of √2, which translates into significant performance gains for unicast traffic. In addition, they are clearly superior to tori when managing collective com...
A strategy to implement adaptive routing in irregular networks is presented and analyzed in this work. A simple and widely applicable deadlock avoidance method, applied to a ring embedded in the network topology, constitutes the basis of this high-performance packet switching. This adaptive router improves the network capabilities by allocating mor...
Chip Multiprocessors (CMPs) are an efficient way to design with and use the huge number of transistors on a chip. Different cores on a chip can compose a shared memory system with a very low-latency interconnect at a very low cost. Unfortunately, consistency models and synchronization styles of popular programming models for multiprocessors impose se...
The basis for designing error-correcting codes for two-dimensional signal sets is considered in this paper. Both algebraic and graph-theoretical approaches are employed in this research for establishing the fundamentals of these codes. We give a solution to the t-dominating set problem in a subfamily of degree four circulant graphs which directly...