About
74 Publications · 19,702 Reads
907 Citations
Introduction
Current institution
Additional affiliations
October 2011 - present
March 2006 - October 2011
Education
March 2006 - October 2011
September 1996 - June 1997
September 1993 - June 1996
Publications (74)
Many HPC applications suffer from a bottleneck in the shared caches, instruction execution units, I/O or memory bandwidth, even though the remaining resources may be underutilized. It is hard for developers and runtime systems to ensure that all critical resources are fully exploited by a single application, so an attractive technique for increasin...
Many server applications suffer from a bottleneck in the shared caches, instruction execution units, I/O or memory bandwidth, even though the remaining resources may be underutilized. It is hard for developers and runtime systems to ensure that all critical resources are fully exploited by a single application, so an attractive technique for increa...
Fourth Strategic Research Agenda (SRA 4) for High Performance Computing (HPC) technology. Edited by ETP4HPC, the European High-Performance Computing Platform, with the support of the EXDCI-2 project.
Many of the important services running on data centres are latency-critical, time-varying, and demand strict user satisfaction. Stringent tail-latency targets for colocated services and increasing system complexity make it challenging to reduce the power consumption of data centres. Data centres typically sacrifice server efficiency to maintain tai...
Heterogeneous multi-core systems such as big/little architectures have been introduced as an attractive server design option with the potential to improve performance under power constraints in data centres. Since both big high-performing and little power-efficient cores can run on the same system sharing the workload processing, thread mapping/sch...
Application performance on novel memory systems is typically estimated using a hardware simulator. The simulation is, however, time-consuming, which limits the number of design options that can be explored within a practical length of time. Also, although memory simulators are typically well validated, current CPU simulators have various shortcoming...
In a virtualized computing server (node) with multiple Virtual Machines (VMs), it is necessary to dynamically allocate memory among the VMs. In many cases, this is done only considering the memory demand of each VM without having a node-wide view. There are many solutions for the dynamic memory allocation problem, some of which use machine learning...
The LEGaTO project leverages task-based programming models to provide a software ecosystem for Made in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC, balanced with the security and resilience challenges. LEGaTO is...
This paper summarizes our two-year study of corrected and uncorrected errors on the MareNostrum 3 supercomputer, covering 2000 billion MB-hours of DRAM in the field. The study analyzes 4.5 million corrected and 71 uncorrected DRAM errors and compares the reliability of DIMMs from all three major memory manufacturers, built in three different te...
Many server applications achieve only a fraction of their theoretical peak performance due to bottlenecks in the shared caches, instruction execution units, I/O or memory bandwidth, even though the remaining resources may be under-utilized. It is very hard for developers and runtime systems to ensure that all these critical resources are fully expl...
The approaching end of DRAM scaling and expansion of emerging memory technologies is motivating a lot of research in future memory systems. Novel memory systems are typically explored by hardware simulators that are slow and often have a simplified or obsolete abstraction of the CPU. This study presents PROFET, an analytical model that predicts how...
Managing memory capacity in virtualized environments is still a challenging problem. Many solutions have been proposed and implemented, including memory ballooning and memory hotplug. But these mechanisms are slow to respond to changes in virtual machine (VM) memory demands. Transcendent Memory (tmem) was introduced to improve responsiveness in mem...
Various extensions of TCP/IP have been proposed to reduce network latency; examples include Explicit Congestion Notification (ECN), Data Center TCP (DCTCP) and several proposals for Active Queue Management (AQM). Combining these techniques requires adjusting various parameters, and recent studies have found that it is difficult to do so while obtai...
While most HPC systems use the traditional "shared nothing" system architecture, with self-contained nodes communicating via the device I/O and associated software layers, there are several ongoing initiatives to build systems that share resources at a coarser granularity, even across the whole machine. Such systems open up new opportunities for r...
Stream programming is a promising step towards portable, efficient, correct use of parallelism. A stream program is built from kernels that communicate only through point-to-point streams. The stream compiler maps a portable stream program onto the target, automatically sizing communications buffers and applying optimizing transformations such as b...
LEGaTO is a three-year EU H2020 project which started in December 2017. The LEGaTO project will leverage task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged...
Radical changes in computing are foreseen for the next decade. The US IEEE society wants to "reboot computing" and the HiPEAC Vision 2017 sees the time to "re-invent computing", both by challenging its basic assumptions. This document presents the "EuroLab-4-HPC Long-Term Vision on High-Performance Computing" of August 2017, a road mapping effort w...
Energy efficiency and performance improvements have been two of the major concerns of current Data Centers. With the advent of Big Data, more information is generated year after year, and even the most aggressive predictions of the largest network equipment manufacturer have been surpassed due to the non-stop growing network traffic generated by cu...
Energy and power are key challenges in high-performance computing. System energy-efficiency must be significantly improved, and this requires greater efficiency in all subcomponents. Interconnects are important targets of optimization since their links are always-on consuming power even during idle periods. The Energy Efficient Ethernet standards T...
In 2013, U.S. data centers accounted for 2.2% of the country’s total electricity consumption, a figure that is projected to increase rapidly over the next decade. Many important data center workloads in cloud computing are interactive, and they demand strict levels of quality-of-service (QoS) to meet user expectations, making it challenging to opti...
Managing memory capacity in cloud environments is a challenging problem, mainly due to the variability in virtual machine (VM) memory demand that sometimes can’t be met by the memory of one node. New architectures have introduced hardware support for a shared global address space that, together with fast interconnects, enables resource sharing amon...
An important challenge of modern data centres running Hadoop workloads is to minimise energy consumption, a significant proportion of which is due to the network. Significant network savings are already possible using Energy Efficient Ethernet, supported by a large number of NICs and switches, but recent work has demonstrated that the packet coales...
An important challenge of modern data centers is to reduce energy consumption, of which a substantial proportion is due to the network. Switches and NICs supporting the recent energy efficient Ethernet (EEE) standard are now available, but current practice is to disable EEE in production use, since its effect on real world application performance i...
Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well studied technique used to alleviate this problem. Prefetching can be performed by the hardware, or be initiated and controlled by software. Among software controlled prefetching we find a wide variety of schemes, inc...
An important aspect of High-Performance Computing (HPC) system design is the choice of main memory capacity. This choice becomes increasingly important now that 3D-stacked memories are entering the market. Compared with conventional Dual In-line Memory Modules (DIMMs), 3D memory chiplets provide better performance and energy efficiency but lower me...
In 2013, U.S. data centers accounted for 2.2% of the country's total electricity consumption, a figure that is projected to increase rapidly over the next decade. Many important workloads are interactive, and they demand strict levels of quality-of-service (QoS) to meet user expectations, making it challenging to reduce power consumption due to inc...
Energy consumption is by far the most important contributor to HPC cluster operational costs, and it accounts for a significant share of the total cost of ownership. Advanced energy-saving techniques in HPC components have received significant research and development effort, but a simple measure that can dramatically reduce energy consumption is o...
The introduction of multicore/multithreaded processors, comprised of a large number of hardware contexts (virtual CPUs) that share resources at multiple levels, has made process scheduling, in particular assignment of running threads to available hardware contexts, an important aspect of system performance. Nevertheless, thread assignment of applic...
One of the greatest challenges in HPC is total system power and energy consumption. Whereas HPC interconnects have traditionally been designed with a focus on bandwidth and latency, there is an increasing interest in minimising the interconnect’s energy consumption. This paper complements ongoing efforts related to power reduction and energy prop...
The backbone of a large-scale supercomputer is the interconnection network. As compute nodes become more energy-efficient, the interconnect is accounting for an increasing proportion of the total system energy consumption. The interconnect’s energy consumption is, however, only starting to receive serious attention. Some hardware-based schemes ha...
Energy costs are an increasing part of the total cost of ownership of HPC systems. As HPC systems become increasingly energy proportional in an effort to reduce energy costs, interconnect links stand out for their inefficiency. Commodity interconnect links remain 'always-on', consuming full power even when no data is being transmitted. Although var...
In the late 1990s, powerful economic forces led to the adoption of commodity desktop processors in high-performance computing. This transformation has been so effective that the June 2013 TOP500 list is still dominated by x86. In 2013, the largest commodity market in computing is not PCs or servers, but mobile computing, comprising smart-phones and...
As of June 2012, 41% of all systems in the TOP500 use Gigabit Ethernet. Ethernet has been a strong contender in the HPC interconnect market for its competitive performance and low cost. However, until recently, little emphasis had been placed on energy-efficient HPC interconnects. To illustrate, in a majority if not all Ethernet b...
One of the greatest challenges in computer architecture is how to write efficient, portable, and correct software for multi-core processors. A promising approach is to expose more parallelism to the compiler, through the use of domain-specific languages. The compiler can then perform complex transformations that the programmer would otherwise have...
Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully parti...
Star Superscalar is a task-based programming model. The programmer starts with an ordinary C program, and adds pragmas to mark functions as tasks, identifying their inputs and outputs. When the main thread reaches a task, an instance of the task is added to a run-time dependency graph, and later scheduled to run on a processor. Variants of Star Sup...
Stream programming is a promising way to expose concurrency to the compiler. A stream program is built from kernels that communicate only via point-to-point streams. The stream compiler statically allocates these kernels to processors, applying blocking, fission and fusion transformations. The compiler determines the sizes of the communication buff...
This paper presents a partitioning and allocation algorithm for an iterative stream compiler, targeting heterogeneous multiprocessors with constrained distributed memory and any communications topology. We introduce a novel definition of connectedness that enables the algorithm to model the capabilities of the compiler. The algorithm uses convexity...
Stream programming offers a portable way for regular applications such as digital video, software radio, multimedia and 3D graphics to exploit a multiprocessor machine. The compiler maps a portable stream program onto the target, automatically sizing communications buffers and applying optimizing transformations such as task fission or fusion, unro...
In this paper we present the initial development of a streaming environment based on a programming model and machine description. The stream programming model consists of an extension to the C language and its translation to a streaming machine. The extensions will be a set of OpenMP-like directives. We show how a serial application can be co...
The market predictions for MP3-based appliances are extremely positive. The ability to maintain impressive sound quality whilst reducing the data requirements by a factor of 1:10 or more, has led to an explosion of content on the Internet. Traditionally, a DSP processor may have been specified as an implementation platform for MP3. However, analysi...
Streaming applications are based on a data-driven approach where compute components consume and produce unbounded data vectors. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a very challenging task, having to care...