The FLAME approach: From dense linear algebra algorithms to high-performance multi-accelerator implementations
Parallel accelerators are playing an increasingly important role in scientific computing. However, it is perceived that their weakness nowadays is their reduced “programmability” in comparison with traditional general-purpose CPUs. For the domain of dense linear algebra, we demonstrate that this is not necessarily the case. We show how the libflame library carefully layers routines and abstracts details related to storage and computation, so that extending it to take advantage of multiple accelerators is achievable without introducing platform specific complexity into the library code base. We focus on the experience of the library developer as he develops a library routine for a new operation, reduction of a generalized Hermitian positive definite eigenvalue problem to a standard Hermitian form, and configures the library to target a multi-GPU platform. It becomes obvious that the library developer does not need to know about the parallelization or the details of the multi-accelerator platform. Excellent performance on a system with four NVIDIA Tesla C2050 GPUs is reported. This makes libflame the first library to be released that incorporates multi-GPU functionality for dense matrix computations, setting a new standard for performance.
Available from: hal.inria.fr
- "As a result, higher performance portability is also achieved thanks to the hardware abstraction layer introduced by runtime systems . These efforts resulted in the design of the MAGMA library  on top of StarPU, the DPLASMA library  on top of DAGuE and the adaptation of the existing FLAME library  to heterogeneous multicore systems using the SuperMatrix  runtime system. More recently, this approach has been used for more irregular applications. "
[Show abstract] [Hide abstract]
ABSTRACT: To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG-of-tasks parallelism regained popularity in the high performance, scientific computing community. In this context, enabling HPC applications to perform efficiently when dealing with graphs of parallel tasks that could potentially run simultaneously is a great challenge. Even if a uniform runtime system is used underneath, scheduling multiple parallel tasks over the same set of hardware resources introduces many issues, such as undesirable cache flushes or memory bus contention. In this paper, we show how runtime system-based scheduling contexts can be used to dynamically enforce locality of parallel tasks on multicore machines. We extend an existing generic sparse direct solver to use our mechanism and introduce a new decomposition method based on proportional mapping that is used to build the scheduling contexts. We propose a runtime-level dynamic context management policy to cope with the very irregular behaviour of the application. A detailed performance analysis shows significant performance improvements of the solver over various multicore hardware.
Available from: Pierre Ramet
- "These runtimes can be generic, like the two runtimes used in the context of this study (StarPU  or PaRSEC ), or more specialized like QUARK . These efforts resulted in the design of the DPLASMA library  on top of PaRSEC and the adaptation of the existing FLAME library . On the sparse direct methods front, preliminary work has resulted in mono-GPU implementations based on offloading parts of the computations to the GPU   . "
[Show abstract] [Hide abstract]
ABSTRACT: The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical manycore architectures. In this paper, we study the benefits and limits of replacing the highly specialized internal scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and StarPU. The tasks graph of the factorization step is made available to the two runtimes, providing them the opportunity to process and optimize its traversal in order to maximize the algorithm efficiency for the targeted hardware platform. A comparative study of the performance of the PaStiX solver on top of its native internal scheduler, PaRSEC, and StarPU frameworks, on different execution environments, is performed. The analysis highlights that these generic task-based runtimes achieve comparable results to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver on heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer.
Available from: downloads.hindawi.com
- "When a new architecture becomes available, an appropriate implementation is selected. For this purpose, FLAME establishes a separation of concerns between the code and the target architecture by coding the dense linear algebra library at a higher level of abstraction and leaving the computations and data movement in the hands of the runtime system . However, FLAME addresses the issue of how to write the inner kernels on the new architecture (e.g., the GPU), whereas the current paper focuses more on creating a convenient and practical interface that plugs the inner kernels into the main framework after they have been developed. "
[Show abstract] [Hide abstract]
ABSTRACT: We apply object-oriented software design patterns to develop code for scientific software involving sparse matrices. Design patterns arise when multiple independent developments produce similar designs which converge onto a generic solution. We demonstrate how to use design patterns to implement an interface for sparse matrix computations on NVIDIA GPUs starting from PSBLAS, an existing sparse matrix library, and from existing sets of GPU kernels for sparse matrices. We also compare the throughput of the PSBLAS sparse matrix--vector multiplication on two platforms exploiting the GPU with that obtained by a CPU-only PSBLAS implementation. Our experiments exhibit encouraging results regarding the comparison between CPU and GPU executions in double precision, obtaining a speedup of up to 35.35 on NVIDIA GTX 285 with respect to AMD Athlon 7750, and up to 10.15 on NVIDIA Tesla C2050 with respect to Intel Xeon X5650.
Data provided are for informational purposes only. Although carefully collected, accuracy cannot be guaranteed. The impact factor represents a rough estimation of the journal's impact factor and does not reflect the actual current impact factor. Publisher conditions are provided by RoMEO. Differing provisions from the publisher's actual policy or licence agreement may be applicable.