About
376 Publications · 44,747 Reads · 5,573 Citations
Publications (376)
This paper presents an approach for measuring the time spent by HPC applications in the operating system's kernel. We use the SystemTap interface to insert timers before and after system calls, and take advantage of its stability to design a tool that can be used with multiple versions of the kernel. We evaluate its performance overhead, using an OS...
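
A user-space sketch of the measurement principle, assuming only POSIX clock_gettime; the tool itself places its timers in kernel space through SystemTap probes, so this is an illustrative analogue rather than the paper's mechanism:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Time a single system call by bracketing it with monotonic-clock reads,
 * mirroring the before/after timers the tool inserts around syscalls. */
int main(void)
{
    char buf[4096];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);   /* timer before the syscall */
    ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
    clock_gettime(CLOCK_MONOTONIC, &t1);   /* timer after the syscall */

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("read() returned %zd bytes in %.0f ns\n", n, ns);
    return 0;
}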
Distributed data storage services tailored to specific applications have grown popular in the high-performance computing (HPC) community as a way to address I/O and storage challenges. These services offer a variety of specific interfaces, semantics, and data representations. They also expose many tuning parameters, making it difficult for their us...
The field of high-performance computing (HPC) has always challenged the research community to design and develop performance observation technology (based on instrumentation, measurement, and analysis methods), keeping pace with the rapid and aggressive evolution of HPC systems’ hardware and software. While the scope of observational concerns is br...
Application memory access patterns are crucial in deciding how much traffic is served by the cache and forwarded to the dynamic random-access memory (DRAM). However, predicting such memory traffic is difficult because of the interplay of prefetchers, compilers, parallel execution, and innovations in manufacturer-specific micro-architectures. This r...
Benchmarking is an important challenge in HPC, in particular for tuning the basic blocks of the software environment used by applications. The communication library and distributed run-time environment are among the most critical ones. Notably, many of the routines provided by communication libraries can be adjusted using parameters...
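
A minimal sketch of such a tuning-oriented micro-benchmark, timing one collective with MPI_Wtime; the routine, message size, and repetition count are illustrative choices, not the paper's:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    enum { N = 1024, REPS = 100 };
    static double in[N], out[N];
    for (int i = 0; i < N; ++i) in[i] = i;

    MPI_Barrier(MPI_COMM_WORLD);           /* align start times across ranks */
    double t = MPI_Wtime();
    for (int r = 0; r < REPS; ++r)
        MPI_Allreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    t = MPI_Wtime() - t;

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("MPI_Allreduce, %d doubles: %.3f us per call\n", N, 1e6 * t / REPS);
    MPI_Finalize();
    return 0;
}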
This document presents the OpenSHMEM extension for the Special Karlsruhe MPI benchmark and the algorithms used to measure the routines.
The reconfigurable computing paradigm with field programmable gate arrays (FPGAs) has received renewed interest in the high-performance computing field due to FPGAs’ unique combination of performance and energy efficiency. However, difficulties in programming and optimizing FPGAs have prevented them from being widely accepted as general-purpose com...
The backtrace is one of the most common operations performed by profiling and debugging tools. It consists of determining the nesting of functions leading to the current execution state. Frameworks and standard libraries provide facilities enabling this operation; however, it generally incurs both computational and memory costs. Indeed, walking the stac...
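
What the operation amounts to, sketched with glibc's execinfo interface (compile with -rdynamic to resolve names); note that backtrace_symbols allocates, which is part of the memory cost mentioned above:

#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

static void leaf(void)
{
    void *frames[64];
    int depth = backtrace(frames, 64);                /* walk the stack */
    char **names = backtrace_symbols(frames, depth);  /* resolve frames (allocates) */
    for (int i = 0; i < depth; ++i)
        printf("#%d %s\n", i, names[i]);
    free(names);
}

static void middle(void) { leaf(); }

int main(void)
{
    middle();   /* prints the nesting: main -> middle -> leaf */
    return 0;
}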
In this article, we introduce DiPOSH, a multi-network, distributed implementation of the OpenSHMEM standard. The core idea behind DiPOSH is to have an API-to-network software stack as slim as possible, in order to minimize the software overhead. Following the heritage of its non-distributed parent POSH, DiPOSH's communication engine is organized ar...
Integrated shared memory heterogeneous architectures are pervasive because they satisfy the diverse needs of mobile, autonomous, and edge computing platforms. Although specialized processing units (PUs) that share a unified system memory improve performance and energy efficiency by reducing data movement, they also increase contention for this memo...
Because of increasing complexity in the memory hierarchy, predicting the performance of a given application in a given processor is becoming more difficult. The problem is worsened by the fact that the hardware needed to deal with more complex memory traffic also affects energy consumption. Moreover, in a heterogeneous system with shared main memor...
Since its launch in 2010, OpenACC has evolved into one of the most widely used portable programming models for accelerators on HPC systems today. Clacc is a project funded by the US Exascale Computing Project (ECP) to bring OpenACC support for C and C++ to the popular Clang and LLVM compiler infrastructure. In this paper, we describe Clacc's suppor...
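
A minimal OpenACC C loop of the kind such a compiler handles; the pragma is standard OpenACC, while the surrounding code is illustrative:

#include <stdio.h>

int main(void)
{
    enum { N = 1 << 20 };
    static float a[N], b[N];
    for (int i = 0; i < N; ++i) a[i] = (float)i;

    /* Offload the loop; data clauses describe the transfers explicitly. */
    #pragma acc parallel loop copyin(a[0:N]) copyout(b[0:N])
    for (int i = 0; i < N; ++i)
        b[i] = 2.0f * a[i];

    printf("b[42] = %f\n", b[42]);
    return 0;
}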
Heterogeneous systems have become a staple of the HPC environment. Several directive-based solutions, such as OpenMP and OpenACC, have been developed to alleviate the challenges of programming heterogeneous systems, and these standards strive to provide a single portable programming solution across heterogeneous environments. However, in many ways...
Virtual conference presentation available at https://youtu.be/nvIZglD386U
Exascale systems are expected to have fewer bytes of memory per core than present petascale systems. Previous analysis of the Open MPI OpenSHMEM runtime has shown that it allocates some object types which use memory proportional to the square of the number of PEs....
Slides for presentation available at https://youtu.be/nvIZglD386U
A major challenge in high-performance computing is performance portability. Using abstraction layers like SYCL, applications can be developed which can target, with the same code base, different execution environments. However, cross-platform code produced in that way will not necessarily provide acceptable performance on multiple platforms. Perfor...
Large scale parallel applications have evolved beyond the tipping point where there are compelling reasons to analyze, visualize and otherwise process output data from scientific simulations in situ rather than writing data to filesystems for post-processing. This modern approach to in situ integration is served by recently developed technologies s...
The TAU Performance System® provides a multi-level instrumentation strategy for Kokkos applications. Kokkos provides a performance portable API for expressing parallelism at the node level. TAU uses the Kokkos profiling system to expose performance factors using user-specified parallel kernel names for lambda functions or C++ fu...
Since the beginning, MPI has defined the rank as an implicit attribute associated with the MPI process' environment. In particular, each MPI process generally runs inside a given UNIX process and is associated with a fixed identifier in its WORLD communicator. However, this state of things is about to change with the rise of new abstractions such a...
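
The fixed-rank model referred to here, in its simplest form: each UNIX process hosts one MPI process with one immutable identifier in MPI_COMM_WORLD.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* implicit, fixed for the process's lifetime */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("MPI process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}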
As the era of high-frequency, single-core processors has come to a close, the new paradigm of many-core processors has come to dominate. In response to these systems, asynchronous multitasking runtime systems have been developed as a promising solution to efficiently utilize this newly available hardware. Asynchronous multitasking runtime systems...
Several robust performance systems have been created for parallel machines with the ability to observe diverse aspects of application execution on different hardware platforms. All are designed with the objective of supporting measurement methods that are efficient, portable, and scalable. For these reasons, the performance measurement infras...
Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger jobs are, the more computing hours would be wasted by a crash. In this paper, we descr...
Several instrumentation interfaces have been developed for parallel programs to make observable actions that take place during execution and to make accessible information about the program’s behavior and performance. Following in the footsteps of the successful profiling interface for MPI (PMPI), new rich interfaces to expose internal operation of...
In this poster we discuss the tuning of a full CFD application, FDL3DI and a mini-app, consisting of the Data Parallel Line Relaxation (DPLR) implicit time advancement kernel in the ablation-fluid-structure interaction solver (AFSI) under development, for the Intel Xeon Phi second generation, "Knights Landing," architecture. We identify hot-spots u...
Reconfigurable architectures like Field Programmable Gate Arrays (FPGAs) have been used for accelerating computations from several domains because of their unique combination of flexibility, performance, and power efficiency. However, FPGAs have not been widely used for high-performance computing, primarily because of their programming complexity a...
The desire for high performance on scalable parallel systems is increasing the complexity and tunability of MPI implementations. The MPI Tools Information Interface (MPI_T) introduced as part of the MPI 3.0 standard provides an opportunity for performance tools and external software to introspect and understand MPI runtime behavior at a deeper leve...
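
A minimal MPI_T sketch listing the control variables an implementation exposes; the variable count and names differ across MPI libraries:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, ncvar;
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);  /* MPI_T may precede MPI_Init */
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&ncvar);
    for (int i = 0; i < ncvar; ++i) {
        char name[256], desc[256];
        int name_len = sizeof name, desc_len = sizeof desc;
        int verbosity, binding, scope;
        MPI_Datatype dt;
        MPI_T_enum et;
        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dt, &et,
                            desc, &desc_len, &binding, &scope);
        printf("cvar %d: %s\n", i, name);
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}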
This chapter explores present-day challenges and those likely to arise as new hardware and software technologies are introduced on the path to exascale. It covers some of the underlying hardware, software, and techniques that enable tools and debuggers. Performance tools and debuggers are critical components that enable computational scientists to...
This poster shows how TAU Commander can help easily characterize the performance of OpenSHMEM applications operating at extreme scales without modifying the application or relying on tool interfaces like PSHMEM.
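
For contrast, a tool layer built on PSHMEM interposes wrappers like the sketch below, assuming an OpenSHMEM implementation that ships pshmem.h and the shifted pshmem_ entry points; TAU Commander's approach avoids writing such wrappers:

#include <shmem.h>
#include <pshmem.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Tool-provided wrapper: time the call, then forward to the real routine. */
void shmem_putmem(void *dest, const void *src, size_t nelems, int pe)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pshmem_putmem(dest, src, nelems, pe);   /* the implementation's entry point */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    fprintf(stderr, "shmem_putmem: %zu bytes to PE %d in %.2f us\n", nelems, pe, us);
}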
MPI implementations are becoming increasingly complex and highly tunable, and thus scalability limitations can come from numerous sources. The MPI Tools Interface (MPI_T) introduced as part of the MPI 3.0 standard provides an opportunity for performance tools and external software to introspect and understand MPI runtime behavior at a deeper level...
The evolution of parallel architectures towards machines with many-core processors and high node-level concurrency is putting an end to the pure-MPI programming model. Simulation codes must expose multiple levels of parallelism inside and between nodes, combining different programming models (e.g., MPI+X), to productively use current and future s...
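
The MPI+X pattern in its simplest form, with MPI between processes and OpenMP threads inside each:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel   /* node-level parallelism inside the MPI process */
    printf("rank %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}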
Accelerator architectures specialize in executing SIMD (single instruction, multiple data) in lockstep. Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a kernel. CUDAflow is a tool that statically separates CUDA binaries into basic block regions and dynamically m...
Optimizing the performance of GPU kernels is challenging for both human programmers and code generators. For example, CUDA programmers must set thread and block parameters for a kernel, but might not have the intuition to make a good choice. Similarly, compilers can generate working code, but may miss tuning opportunities by not targeting GPU model...
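
The thread/block choice reduced to its arithmetic: pick a block size, derive a grid size that covers the problem. The sweep below is illustrative, not a tuned recommendation:

#include <stdio.h>

/* ceil(n / block): the grid size that covers n elements. */
static long grid_size(long n, long block)
{
    return (n + block - 1) / block;
}

int main(void)
{
    long n = 1000000;   /* problem size */
    for (long block = 64; block <= 1024; block *= 2)
        printf("block=%4ld -> grid=%ld (threads launched: %ld)\n",
               block, grid_size(n, block), block * grid_size(n, block));
    return 0;
}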
FUN3D is an unstructured-grid computational fluid dynamics suite widely used to support major national research and engineering efforts. FUN3D is being applied to analysis and design problems across all the major service branches at the Department of Defense. These applications span the speed range from subsonic to hypersonic flows and include both...
Multi-Application onLine Profiling (MALP) is a performance tool which has been developed as an alternative to the trace-based approach for fine-grained event collection. Any performance and analysis measurement system must address the problem of data management and projection to meaningful forms. Our concept of a valorization chain is introduced to...
This paper addresses two key parallelization challenges in the unstructured mesh-based ocean modeling code MPAS-Ocean, which uses a mesh based on Voronoi tessellations: (1) load imbalance across processes, and (2) unstructured data access patterns that inhibit intra- and inter-node performance. Our work analyzes the load imbalance due to naive parti...
New computing technologies are expected to change the high-performance computing landscape dramatically. Future exascale systems will comprise hundreds of thousands of compute nodes linked by complex networks. Compute nodes are expected to host both general-purpose and special-purpose processors or accelerators, with more complex memory hierarchies...
Current trends for high-performance systems are leading towards hardware overprovisioning where it is no longer possible to run all components at peak power without exceeding a system- or facility-wide power bound. The standard practice of static power scheduling is likely to lead to inefficiencies with over- and under-provisioning of power to comp...
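
A toy illustration of the over- and under-provisioning problem: splitting a power bound evenly ignores demand, while a demand-proportional split does not. All numbers below are invented for illustration:

#include <stdio.h>

int main(void)
{
    double bound = 400.0;                            /* hypothetical system bound, W */
    double demand[] = { 180.0, 60.0, 140.0, 20.0 };  /* hypothetical per-component demand, W */
    int n = sizeof demand / sizeof demand[0];

    double total = 0.0;
    for (int i = 0; i < n; ++i) total += demand[i];

    double even = bound / n;                         /* static schedule: equal shares */
    for (int i = 0; i < n; ++i) {
        double prop = bound * demand[i] / total;     /* demand-proportional share */
        printf("component %d: demand %5.1f W  static %5.1f W  proportional %5.1f W\n",
               i, demand[i], even, prop);
    }
    return 0;
}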
In the race for Exascale, the advent of many-core processors will bring a shift in parallel computing architectures to systems of much higher concurrency, but with a relatively smaller memory per thread. This shift raises concerns for the adaptability of HPC software from the current generation to the brave new world. In this paper, we study domain...
Tuning codes for GPGPU architectures is challenging because few performance tools can pinpoint the exact causes of execution bottlenecks. While profiling applications can reveal execution behavior with a particular architecture, the abundance of collected information can also overwhelm the user. Moreover, performance counters provide cumulative val...
Advances in human brain neuroimaging for high-temporal and high-spatial resolution will depend on localization of Electroencephalography (EEG) signals to their cortex sources. The source localization inverse problem is inherently ill-posed and depends critically on the modeling of human head electromagnetics. We present a systematic methodology to...
Current trends for high-performance systems are leading us towards hardware over-provisioning where it is no longer possible to run each component at peak power without exceeding a system- or facility-wide power bound. In such scenarios, the power consumed by individual components must be artificially limited to guarantee system operation under a gi...
Particle advection is a foundational operation for many flow visualization techniques, including streamlines, Finite-Time Lyapunov Exponents (FTLE) calculation, and stream surfaces. The workload for particle advection problems varies greatly, including significant variation in computational requirements. With this study, we consider the performance...
Producing high-performance implementations from simple, portable computation specifications is a challenge that compilers have tried to address for several decades. More recently, a relatively stable architectural landscape has evolved into a set of increasingly diverging and rapidly changing CPU and accelerator designs, with the main common factor...
Partitioned global address space (PGAS) applications, such as the Tensor Contraction Engine (TCE) in NWChem, often apply a one-process-per-core mapping in which each process iterates through the following work-processing cycle: (1) determine a work-item dynamically, (2) get data via one-sided operations on remote blocks, (3) perform computation on...
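
The work-processing cycle above, sketched with standard OpenSHMEM calls; the shared counter on PE 0, the block layout, and the ownership rule are illustrative assumptions:

#include <shmem.h>
#include <stdio.h>

enum { NITEMS = 64, BLOCK = 1024 };

int main(void)
{
    shmem_init();
    int me = shmem_my_pe(), npes = shmem_n_pes();
    static long counter = 0;   /* symmetric; only PE 0's copy is used */
    double *blocks = shmem_malloc((size_t)NITEMS * BLOCK * sizeof(double));
    double local[BLOCK], sum = 0.0;

    shmem_barrier_all();
    for (;;) {
        long item = shmem_long_atomic_fetch_inc(&counter, 0);  /* (1) claim a work item */
        if (item >= NITEMS) break;
        int owner = (int)(item % npes);
        shmem_getmem(local, &blocks[item * BLOCK],              /* (2) one-sided get */
                     BLOCK * sizeof(double), owner);
        for (int i = 0; i < BLOCK; ++i) sum += local[i];        /* (3) compute */
    }
    shmem_barrier_all();
    printf("PE %d done, partial sum %f\n", me, sum);
    shmem_finalize();
    return 0;
}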
The study of macromolecular systems may require large computer simulations that are too time consuming and resource intensive to execute in full atomic detail. The integral equation coarse-graining approach by Guenza and co-workers enables the exploration of longer time and spatial scales without sacrificing thermodynamic consistency, by approximat...
Many excellent open-source and commercial tools enable the detailed measurement of the performance attributes of applications. However, the process of collecting measurement data and analyzing it remains effort-intensive because of differences in tool interfaces and architectures. Furthermore, insufficient standards and automation may result in los...
Understanding the performance of program execution is essential when optimizing simulations run on high-performance supercomputers. Instrumenting and profiling codes is itself a difficult task and interpreting the resulting complex data is often facilitated through visualization of the gathered measures. However, these measures typically ignore spa...
Electroencephalographic (EEG) oscillations in multiple frequency bands can be observed during functional activity of the cerebral cortex. An important question is whether activity of focal areas of cortex, such as during finger movements, is tracked by focal oscillatory EEG changes. Although a number of studies have compared EEG changes to function...
I/O performance is becoming a key bottleneck in many cases at the extreme scale. As the volume of data and application reads and writes increases, it is important to assess the scalability of I/O operations as a key contributor to overall application performance. Optimizing I/O performance presents unique challenges for application developers a...
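
The collective-I/O pattern at issue, in its simplest MPI-IO form: every rank writes its slice of one shared file in a single coordinated call. The file name and sizes are illustrative:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 1024 };
    double buf[N];
    for (int i = 0; i < N; ++i) buf[i] = rank + i * 1e-6;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)rank * N * sizeof(double);
    MPI_File_write_at_all(fh, off, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);  /* collective write */
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}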
Practice has shown that programming a new multicore system is a greater challenge than previously thought. The challenge is to produce the resulting system in a way that is as easy as sequential programming. This new trend has changed the way we think about the whole development process. The aim of this work is to show that it is possible to deve...
The electrical impedance tomography (EIT) problems in anisotropic inhomogeneous media like head tissues belong to the class of three-dimensional boundary value problems for elliptic equations with mixed derivatives. The efficiency of the numerical methods most discussed and most usable in practice for modeling EIT problems is reviewed in...
A hybrid parallel measurement system offers the potential to fuse the principal advantages of probe-based tools, with their exact measures of performance and ability to capture event semantics, and sampling-based tools, with their ability to observe performance detail with less overhead. Creating a hybrid profiling solution is challenging because i...
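
A toy sketch of the sampling half of such a hybrid tool: a SIGPROF timer attributes samples to whatever region the probe side last recorded; the region bookkeeping here is invented for illustration:

#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

#define NREGIONS 2
static volatile sig_atomic_t current_region = 0;  /* set by probe-style hooks */
static volatile long samples[NREGIONS];

static void on_sample(int sig)
{
    (void)sig;
    samples[current_region]++;                    /* sampling side: cheap attribution */
}

static void busy(long n) { for (volatile long i = 0; i < n; ++i) ; }

int main(void)
{
    struct sigaction sa = { .sa_handler = on_sample };
    sigaction(SIGPROF, &sa, NULL);
    struct itimerval it = { { 0, 1000 }, { 0, 1000 } };  /* sample every 1 ms of CPU time */
    setitimer(ITIMER_PROF, &it, NULL);

    current_region = 0; busy(200000000L);         /* probe: entered region 0 */
    current_region = 1; busy(100000000L);         /* probe: entered region 1 */

    for (int r = 0; r < NREGIONS; ++r)
        printf("region %d: %ld samples\n", r, samples[r]);
    return 0;
}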
The Electrical Impedance Tomography (EIT) and electroencephalography (EEG) forward problems in anisotropic inhomogeneous media like the human head belong to the class of three-dimensional boundary value problems for elliptic equations with mixed derivatives. We introduce and explore the performance of several new promising numerical techniques...
As software complexity increases, the analysis of code behavior during its execution is becoming more important. Instrumentation techniques, through the insertion of code directly into binaries, are essential to program analyses such as performance evaluation and profiling. In the context of high-performance parallel applications, building an inst...