Diego R. Llanos
University of Valladolid | UVA · Department of Informatics
PhD in Computer Architecture
Publications: 160
Reads: 19,103
Citations: 928
Additional affiliations
October 2002 - present
Publications (160)
As the interest in FPGA-based accelerators for HPC applications increases, new challenges also arise, especially concerning different programming and portability issues. This paper aims to provide a snapshot of the current state of the FPGA tooling and its problems. To do so, we evaluate the performance portability of two frameworks for developing...
Slides of the conference paper "Challenging Portability Paradigms: FPGA Acceleration Using SYCL and OpenCL", presented at HeteroPar 2024 (Euro-Par workshop) in Madrid, August 2024.
As the computing capabilities of Field Programmable Gate Arrays (FPGAs) continue to grow, so does the interest in building scientific accelerators around them. Tools like Xilinx's High-Level Synthesis (HLS) help to bridge the gap between traditional high-level languages such as C and C++, and low-level hardware description languages such as VHDL and Ver...
Reconfigurable hardware circuits, such as field-programmable gate arrays, have gained popularity in the high-performance computing (HPC) community in recent years. Nevertheless, their real contribution to accelerating HPC workloads is unclear in both potential and extent.
There are many works devoted to improving the matrix product computation, as it is used in a wide variety of scientific applications arising from many different fields. In this work, we propose alternative data distribution policies and communication patterns to reduce the elapsed time when computing triangular matrix products in distributed memory...
Matrix multiplication is one of the most costly linear algebra operations, very often present in scientific computational applications. Current generic linear algebra libraries, such as ScaLAPACK and its recent evolution SLATE, include functionalities for generic and triangular matrix multiplication. They generally rely on block-cyclic partitioning...
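As a rough illustration of the block-cyclic partitioning mentioned above, the following minimal sketch maps global indices to process ranks in one dimension. The function names and the small example sizes are hypothetical, chosen only for illustration; they are not taken from ScaLAPACK, SLATE, or the paper.

```python
def block_cyclic_owner(index, block_size, num_procs):
    """Return the rank owning a global index under 1D block-cyclic partitioning.

    Indices are grouped into consecutive blocks of `block_size` elements,
    and blocks are dealt out to ranks in round-robin order.
    """
    return (index // block_size) % num_procs


def local_indices(rank, n, block_size, num_procs):
    """List the global indices in [0, n) assigned to `rank`."""
    return [i for i in range(n)
            if block_cyclic_owner(i, block_size, num_procs) == rank]


if __name__ == "__main__":
    # Hypothetical example: 10 elements, blocks of 2, 3 processes.
    for rank in range(3):
        print(rank, local_indices(rank, n=10, block_size=2, num_procs=3))
```

With 10 elements, blocks of 2, and 3 processes, rank 2 receives only one block while ranks 0 and 1 receive two each, which hints at why load balance depends on the shape of the data being partitioned.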
Computational platforms for high-performance scientific applications are becoming more heterogeneous, including hardware accelerators such as multiple GPUs. Applications in a wide variety of scientific fields require an efficient and careful management of the computational resources of this type of hardware to obtain the best possible performance. How...
Heterogeneous systems with several kinds of devices, such as multi-core CPUs, GPUs, FPGAs, among others, are now commonplace. Exploiting all these devices with device-oriented programming models, such as CUDA or OpenCL, requires expertise and knowledge about the underlying hardware to tailor the application to each specific device, thus degrading p...
Motion Estimation is one of the main tasks behind any video encoder. It is a computationally costly task; therefore, it is usually delegated to specific or reconfigurable hardware, such as FPGAs. Over the years, multiple FPGA implementations have been developed, mainly using hardware description languages such as Verilog or VHDL. Since programming...
The determination of Lagrangian Coherent Structures (LCS) is becoming very important in several disciplines, including cardiovascular engineering, aerodynamics, and geophysical fluid dynamics. From the computational point of view, the extraction of LCS consists of two main steps: The flowmap computation and the resolution of Finite Time Lyapunov Ex...
Iterative stencil computations are widely used in numerical simulations. They present a high degree of parallelism, high locality and mostly-coalesced memory access patterns. Therefore, GPUs are good candidates to speed up their computation. However, the development of stencil programs that can work with huge grids in distributed systems with multi...
The Computing Journal gratefully acknowledges the editorial work of the scholars listed below on the special issue entitled “Parallel and Distributed Processing: Advances on Architectures and Applications of Parallel Systems”.
Motion estimation is one of the main tasks behind any video encoder. It is a computationally costly task, so it is usually delegated to specific or reconfigurable hardware, such as FPGAs. Over the years, multiple implementations of the algorithm for FPGAs have been developed, mainly using...
The extraction of Lagrangian Coherent Structures (LCS) is common in various fields of fluid dynamics that focus on studying the behavior of the particles making up certain flows found in nature, in human and animal bodies, in certain artificial fluids, etc. Specifically, in the LCS extraction process...
Loops are a rich source of parallelism. Unfortunately, many loops cannot be safely parallelized at compile time because the compiler is not able to guarantee that there will be no dependence violations. Thread-Level Speculation (TLS) techniques, either hardware or software-based, allow the parallel execution of non-analyzable loops, issuing the exe...
Predicting the effectiveness of building evacuations is a very difficult task in the general case. In a previous work, the historical results of 47 evacuation drills in 15 different university buildings, both academic and residential, involving more than 19 000 persons, were analyzed, and a method based on dimensional analysis and statistical regres...
The Raspberry Pi (RPi) family of boards is not only a set of versatile devices suitable for quick prototyping, but also robust, low-cost systems that can be used in production. For example, RPi 3B and RPi 3B+ models have integrated WiFi/Bluetooth interfaces, so they can be used to interact with Bluetooth Low Energy (BLE) beacons. In particular, distance a...
In distributed-memory systems, data redistributions are operations that change the ownership and location of a selected subset of a data structure at runtime. They allow the improvement of the performance of parallel algorithms which operate on changing or partial domains, aiming to create a balanced workload among the active processes. To manually...
Hyperspectral image registration is a relevant task for real-time applications such as environmental disaster management or search and rescue scenarios. The HYFMGPU algorithm was proposed as a single-GPU high-performance solution, but the need for a distributed version has arisen due to the continuous evolution of sensors that generate images with...
Hyperspectral image registration is a relevant task for real-time applications like environmental disaster management or search and rescue scenarios. Traditional algorithms for this problem were not designed for real-time performance. The HYFMGPU algorithm arose as a high-performance GPU-based solution to fill this gap. Nevertheless, a si...
In this paper we present XtremeLoc, a low-cost indoor positioning system designed to work in situations where GPS is not a valid alternative. XtremeLoc relies on the use of portable, low-cost, Bluetooth Low Energy beacons using the iBeacon protocol. Instead of setting these beacons in fixed positions, they are carried by the persons or goods to be...
The time needed to evacuate a building depends on many factors. Some are related to people’s behavior, while others are related to the physical characteristics of the building. This paper analyzes the historical data of 47 evacuation drills in 15 different university buildings, both academic and residential, involving more than 19 000 persons. We p...
Scientific applications are some of the most computationally demanding software pieces. Their core is usually a set of linear algebra operations, which may represent a significant part of the overall run-time of the application. BLAS libraries aim to solve this problem by exposing a set of highly optimized, reusable routines. There are several impl...
In 2018 we introduced STERLING, a framework designed to encourage citizens to develop recycling habits. This framework is composed of a low-cost, low-energy sensor installed in recycling containers to measure fill level and other physical parameters, together with a mobile app and an associated web-based server. The sensor is activated magnetica...
Distributing the workload in heterogeneous systems is a complicated task, since not all nodes of a system contain the same computational resources. The most commonly used type of data distribution is an even split among all processes. However, heterogeneous systems require a policy...
High-performance coprocessors, such as Graphics Processing Units (GPUs), offer a high performance-to-cost ratio together with low energy consumption. For this reason, heterogeneous systems that include them have experienced significant growth. However, programming these devices still poses a...
In this poster we summarize the recent research advances of our group in designing and building a plug-in that enables weighted partitioning of data in the Hitmap library.
The PERIL project starts from the collaboration of the MoBiVAP research group and the Health and Safety Service at the University of Valladolid (Spain), and the Castilla y León regional government. The aim of the project is to keep track of persons inside buildings, with the main goal of facilitating localization in case of an emergency. The PERIL...
Current HPC clusters are composed of several machines with different computation capabilities and different kinds and families of accelerators. Programming efficiently for these heterogeneous systems has become an important challenge. There are many proposals to simplify the programming and management of accelerator devices, and the hybrid programm...
Dataflow programming consists in developing a program by describing its sequential stages and the interactions between them. The runtime systems supporting this kind of programming are responsible for exploiting the parallelism by concurrently executing the different stages as soon as their dependencies are met. In this paper we introduce a new par...
During the first decade of the twenty-first century, multicore processing reached maturity with the help of shared-memory programming models such as OpenMP, which allow both legacy and new C and Fortran applications to be parallelized in shared-memory environments. Meanwhile, message-passing programming models such as MPI all...
During the last decade, parallel programming has evolved in an unprecedented way. Fifteen years ago, the future of parallel computing seemed to consist of the advent of multicore processors with an ever-increasing core count per CPU, and their interconnection to form larger clusters. Programming models, such as OpenMP that allows...
Waste disposal and recycling is becoming one of the main problems in Western countries. Improving both the recycling culture among citizens and the logistics of waste collection and treatment is critical to increase the percentage of waste being recycled. In this paper we present STERLING, an initiative that aims to help in both fields. STERLING is a framew...
The use of high-performance hardware accelerators, such as graphics processing units (GPUs), has been steadily increasing in supercomputing systems. This trend is easily seen in the list of computers in the TOP500 ranking. Programming this type of device is a costly task that requires...
Current HPC clusters are composed of several machines with different computation capabilities and different kinds and families of accelerators. Programming efficiently for these heterogeneous systems has become an important challenge. Generating coordination codes using different vendor-specific programming models and languages to obtain the best p...
Pursuing a college degree is a task that requires a great amount of time and effort. Universities are facing a big challenge to attract students and keep them motivated. The gamification of education is a practice that expects to increase the students’ engagement, which in turn increases learning outcomes. Nevertheless, obtaining beneficial results...
Despite the efforts of the authorities, who promote the use of alternative transportation systems, traffic keeps increasing in European cities, leading not only to traffic jams but also to pollution episodes. Delivery vehicles are part of both problems, because of their intensive use, the advent of e-commerce, the limited number and sizes of lo...
The BLAS linear algebra routines are widely used in scientific applications of all kinds. There are implementations specifically optimized for different types of computing platforms, including accelerators. For example, the implementation contained in the Intel MKL library, besides running on CPUs, includes versions for...
In the computation pattern known as stencil, each element of an array-type data structure is updated iteratively as a function of the values of its neighbors. Among other applications, this pattern makes it possible to numerically solve systems of partial differential equations, so it is of great interest in scientific computing...
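The stencil pattern described in this abstract can be sketched in a few lines. The example below is a minimal, sequential one-dimensional Jacobi-style sweep in which every interior cell is replaced by the mean of its two neighbors. The function names, the averaging rule, and the fixed boundary values are assumptions made for illustration and are not taken from the paper.

```python
def jacobi_step(grid):
    """One stencil sweep: each interior cell becomes the mean of its two neighbors.

    A new list is built so that every update reads only values from the
    previous iteration; the two boundary cells are copied unchanged.
    """
    new = list(grid)
    for i in range(1, len(grid) - 1):
        new[i] = 0.5 * (grid[i - 1] + grid[i + 1])
    return new


def jacobi(grid, iterations):
    """Apply the stencil sweep a fixed number of times."""
    for _ in range(iterations):
        grid = jacobi_step(grid)
    return grid


if __name__ == "__main__":
    # Hypothetical input: fixed boundary values 0 and 4, interior starting at 0.
    print(jacobi([0.0, 0.0, 0.0, 4.0], iterations=2))
```

Because each sweep reads only the previous iteration, all interior cells can be updated independently, which is the property that makes this pattern attractive for GPUs and other parallel devices.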
Locating assets inside buildings is a problem with applications in different fields and activities, such as healthcare, occupational risk prevention, or various commercial activities. In these places, where GPS-based localization is not available, a new solution is needed to solve the problem...
Current High Performance Computing (HPC) systems are typically built as interconnected clusters of shared-memory multicore computers. Several techniques to automatically generate parallel programs from high-level parallel languages or sequential codes have been proposed. To properly exploit the scalability of HPC clusters, these techniques should t...
Supercomputers are becoming more heterogeneous. They are composed of several machines with different computation capabilities and different kinds and families of accelerators, such as GPUs or Intel Xeon Phi coprocessors. Programming these machines is a hard task that requires a deep study of the architectural details, in order to exploit efficient...
OpenACC is a parallel programming model for automatic parallelization of sequential code using compiler directives or pragmas. OpenACC is intended to be used with accelerators such as GPUs and Xeon Phi. The different implementations of the standard, although still in early development, are primarily focused on GPU execution. In this study, we anal...
Intel Xeon Phi accelerators are one of the newest devices used in the field of parallel computing. However, there are comparatively few studies concerning their performance when using most of the existing parallelization techniques. One of them is thread-level speculation, a technique that optimistically tries to extract parallelism of loops withou...
OpenACC is a parallel programming model for hardware accelerators, such as GPUs or Xeon Phi, which has been in development for several years now. During this time, different compilers have appeared, both commercial and open source, which are still in the development stage. Because both the OpenACC standard and its implementations are re...
Parallelization of sequential applications requires extracting information about the loops and how their variables are accessed, and afterwards, augmenting the source code with extra code depending on such information. In this paper we propose a framework that avoids such an error-prone, time-consuming task. Our solution leverages the compile-time...
Pursuing a college degree is a task that requires a great amount of time and effort. Universities are facing a big challenge to attract students and keep them motivated. The gamification of education is a practice that expects to increase the students’ engagement, which in turn increases learning outcomes. Nevertheless, obtaining beneficial results...
We propose to move to runtime part of the compile-time analysis needed to generate the communication code for distributed-memory systems, in order to better exploit the capabilities of the execution platforms.
OpenACC has been in development for a few years now. The OpenACC 2.5 specification was recently made public and there are some initiatives for developing full implementations of the standard to make use of accelerator capabilities. There is much to be done yet, but currently, OpenACC for GPUs is reaching a good maturity level in various implementat...
OpenACC is a parallel programming model for GPU and Xeon Phi accelerators that has been in development for some years. During this time, different compilers have appeared, both commercial and open source, which are still at an early stage of development. Since both the standard and its implementations are relatively...
Transactional Memory (TM) is a technique that aims to mitigate the performance losses that are inherent to the serialization of accesses in critical sections. Some studies have shown that the use of TM may lead to performance improvements, despite the existence of management overheads. However, the relative performance of TM, with respect to classi...
Thread-Level Speculation (TLS) is a promising technique that allows the parallel execution of sequential code without relying on a prior, compile-time dependence analysis. In this work we introduce the technique, present a taxonomy of TLS solutions, and summarize and put into perspective the most relevant advances in this field.
Dataflow programming consists in developing a program by describing its sequential stages and the interactions between them. The runtimes supporting this kind of programming are responsible for exploiting the parallelism by concurrently executing the different stages when their dependencies have been met. In this paper we introduce a new parallel pr...
Programming for distributed-memory systems imposes specific challenges. In these systems, minimizing synchronization and communication overheads is key for performance improvement. A typical approach is to use a message-passing paradigm to exploit static partition policies and to generate coarse-grain computations with aggregated communication phas...
The single-source shortest path (SSSP) problem arises in many different fields. In this paper, we present a GPU SSSP algorithm implementation. Our work significantly speeds up the computation of the SSSP, not only with respect to a CPU-based version, but also to other state-of-the-art GPU implementations based on Dijkstra. Both GPU implementations...
During the last decade, parallel processing architectures have become a powerful tool to deal with massively parallel problems that require high performance computing (HPC). The latest trend in HPC is the use of heterogeneous environments that combine different computational processing devices, such as CPU cores and graphics processing units (GPUs)....
Scheduling is one of the factors that most directly affect performance in Thread-Level Speculation (TLS). Since loops may present dependences that cannot be predicted before runtime, finding a good chunk size is not a simple task. The most used mechanism, Fixed-Size Chunking (FSC), requires many “dry-runs” to set the optimal chunk size. If the loop...
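To make the role of the chunk size concrete, the sketch below shows only the partitioning step of fixed-size chunking: the iteration space of a loop is split into consecutive chunks of equal size, with a possibly shorter final chunk. The function name is hypothetical, and the speculative execution, dependence checking, and "dry-run" tuning discussed in the abstract are deliberately not modeled.

```python
def fixed_size_chunks(num_iterations, chunk_size):
    """Split the range [0, num_iterations) into half-open chunks of fixed size.

    Returns a list of (start, end) pairs; the last chunk may be shorter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, num_iterations))
            for start in range(0, num_iterations, chunk_size)]


if __name__ == "__main__":
    # Hypothetical loop of 10 iterations scheduled in chunks of 4.
    print(fixed_size_chunks(10, 4))
```

In a speculative setting each chunk would be handed to a thread, so a small chunk size means more scheduling overhead and a large one means more work at risk when a dependence violation occurs.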
The polyhedral model can be used to automatically generate distributed-memory communications for affine nested loops. Recently, new communication schemes that reduce the communication volume have been presented. In this paper we study the extra computational effort introduced at run-time by the code generated to manage the communication details acr...
Currently, the generation of parallel codes which are portable to different kinds of parallel computers is a challenge. Many approaches have been proposed in recent years following two different paths: programming from scratch using new programming languages and models that deal with parallelism explicitly, or automatically generating paralle...
Parallelization of sequential applications requires extracting information about the loops and how their variables are accessed, and afterwards, augmenting the source code with extra code depending on such information. In this paper we propose a framework that avoids such an error-prone, time-consuming task. Our solution leverages the compile-time...
Intel Xeon Phi accelerators are one of the newest devices used in the field of parallel computing. However, there are comparatively few studies concerning their performance when using most of the existing parallelization techniques. One of them is thread-level speculation, a technique that optimistically tries to extract parallelism of loops withou...
Software-based thread-level speculation (TLS) is a technique that optimistically executes in parallel loops whose fully parallel semantics cannot be guaranteed at compile time. Modern TLS libraries allow arbitrary data structures to be handled speculatively. This desired feature comes at the high cost of local store and/or remote recovery ti...
OpenMP directives are the de facto standard for shared-memory parallel programming. However, OpenMP does not guarantee the correctness of the parallel execution of a given loop if runtime data dependences arise. Consequently, many highly parallel regions cannot be safely parallelized with OpenMP due to the possibility of a dependence violation. In t...
The hierarchical methods aim to discover a hierarchy in the graph that, for road networks, usually corresponds with their hierarchical nature. A graph hierarchy is a division of the graph nodes into levels. To define it, some precomputation is needed. There are different precomputations that can be done in the graph depending on the hierarchical al...
In this chapter we will briefly compare the search spaces of the query phase of the approaches described. An in-depth comparison of all algorithms reviewed goes beyond the objectives of this paper, because it must take into account the particular distribution of nodes and edges in the graph, the relative positions of source and target nodes, and th...
Many applications in different domains need to calculate the shortest-path between two points in a graph. In this paper we describe this shortest path problem in detail, starting with the classic Dijkstra's algorithm and moving to more advanced solutions that are currently applied to road network routing, including the use of heuristics and precomp...
The non-hierarchical preprocessing methods aim to avoid settling unnecessary nodes using information obtained in a preprocessing phase that does not follow a hierarchical structure. The nature of these approaches is diverse, extracting different sets of data during precomputation. The following subsections will describe the approaches that fall int...
The easiest way to represent the information of a network is to transform it into a graph, where every link will be an edge, and every possible joint to change to another link will be a node. Some groups of nodes could represent cities, stations, or only intersection points, depending on the scenario to be considered. In this chapter, we present a...
In this chapter we will review some classical solutions to the shortest-path problem, including Dijkstra’s algorithm, some improvements on its data structures, and its bidirectional and heuristic variants. None of these solutions require precomputation. To better show how each algorithm works, we will see how they explore the graph proposed as our...
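For reference, the classic Dijkstra's algorithm mentioned in this chapter summary can be sketched as follows, using a binary heap as the priority queue. The adjacency-list format and the four-node example graph are hypothetical and are not the example graph used in the book; the bidirectional and heuristic variants are not shown.

```python
import heapq


def dijkstra(graph, source):
    """Compute shortest-path distances from `source` to every reachable node.

    `graph` maps each node to a list of (neighbor, weight) pairs with
    non-negative weights. Returns a dict of node -> distance.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry: a shorter path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


if __name__ == "__main__":
    # Hypothetical road network: edges carry travel costs.
    roads = {
        "A": [("B", 4), ("C", 1)],
        "B": [("D", 1)],
        "C": [("B", 2), ("D", 7)],
        "D": [],
    }
    print(dijkstra(roads, "A"))
```

No precomputation is involved here: every query explores the graph from scratch, which is the cost that the preprocessing-based methods in the other chapters aim to reduce.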
Current multicomputers are typically built as interconnected clusters of shared-memory multicore computers. A common programming approach for these clusters is to simply use a message-passing paradigm, launching as many processes as cores available. Nevertheless, to better exploit the scalability of these clusters and highly-parallel multicore syst...
In recent years, GPU manycore devices have demonstrated their usefulness to accelerate computationally intensive problems. Although arriving at a parallelization of a highly parallel algorithm is an affordable task, the optimization of GPU codes is a challenging activity. The main reason for this is the number of parameters, programming choic...
Dealing with both dense and sparse data in parallel environments usually leads to two different approaches: To rely on a monolithic, hard-to-modify parallel library, or to code all data management details by hand. In this paper we propose a third approach, that delivers good performance while the underlying library structure remains modular and ext...
OpenMP directives can be regarded as the standard for shared-memory parallel programming. However, OpenMP does not guarantee that the parallel execution of a loop follows sequential semantics if dependences between instructions appear. In this work we propose to extend the functionality of OpenMP by adding support for...
Currently, the computer clusters used for high-performance computing are built by interconnecting shared-memory machines. As a common programming model for this type of