Sameer Shende
University of Oregon | UO · Performance Research Laboratory

PhD

About

168 Publications
38,574 Reads
4,155 Citations
Additional affiliations
January 2014 - present · ParaTools, SAS · President and Director
January 2008 - present · University of Oregon · Director, Performance Research Laboratory
June 2004 - present · ParaTools, Inc. · President and Director

Publications (168)
Chapter
Full-text available
The backtrace is one of the most common operations performed by profiling and debugging tools. It consists of determining the nesting of functions leading to the current execution state. Frameworks and standard libraries provide facilities enabling this operation; however, it generally incurs both computational and memory costs. Indeed, walking the stac...
Conference Paper
Full-text available
Since its launch in 2010, OpenACC has evolved into one of the most widely used portable programming models for accelerators on HPC systems today. Clacc is a project funded by the US Exascale Computing Project (ECP) to bring OpenACC support for C and C++ to the popular Clang and LLVM compiler infrastructure. In this paper, we describe Clacc's suppor...
Conference Paper
Full-text available
Building applications by composing existing libraries and using existing tools can be a tremendous productivity improvement. If existing software is high quality, accessible, and reusable, one is much better off using it than writing one's own. At the same time, as HPC and AI/ML software gets more complex, it is getting harder to m...
Conference Paper
Full-text available
Virtual conference presentation available at https://youtu.be/nvIZglD386U
Exascale systems are expected to have fewer bytes of memory per core available than present petascale systems. Previous analysis of the Open MPI OpenSHMEM runtime has shown that it allocates some object types which use memory proportional to the square of the number of PEs....
Data
Slides for presentation available at https://youtu.be/nvIZglD386U
Conference Paper
Full-text available
A major challenge in high-performance computing is performance portability. Using abstraction layers like SYCL, applications can be developed which can target, with the same code base, different execution environments. However, cross-platform code produced in that way will not necessarily provide acceptable performance on multiple platforms. Perfor...
Conference Paper
Full-text available
The TAU Performance System® provides a multi-level instrumentation strategy for instrumentation of Kokkos applications. Kokkos provides a performance portable API for expressing parallelism at the node level. TAU uses the Kokkos profiling system to expose performance factors using user-specified parallel kernel names for lambda functions or C++ fu...
Conference Paper
Full-text available
Since the beginning, MPI has defined the rank as an implicit attribute associated with the MPI process' environment. In particular, each MPI process generally runs inside a given UNIX process and is associated with a fixed identifier in its WORLD communicator. However, this state of things is about to change with the rise of new abstractions such a...
Conference Paper
Full-text available
Several robust performance systems have been created for parallel machines with the ability to observe diverse aspects of application execution on different hardware platforms. All of these are designed with the objective to support measurement methods that are efficient, portable, and scalable. For these reasons, the performance measurement infras...
Preprint
Full-text available
Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger jobs are, the more computing hours would be wasted by a crash. In this paper, we descr...
Chapter
Full-text available
Several instrumentation interfaces have been developed for parallel programs to make observable actions that take place during execution and to make accessible information about the program’s behavior and performance. Following in the footsteps of the successful profiling interface for MPI (PMPI), new rich interfaces to expose internal operation of...
Chapter
Full-text available
As the exascale era approaches, it is becoming increasingly important that runtimes be able to scale to very large numbers of processing elements. However, by keeping arrays of sizes proportional to the number of PEs, an OpenSHMEM implementation may be limited in its scalability to millions of PEs. In this paper, we describe techniques for tracking...
Poster
Full-text available
In this poster we discuss the tuning of a full CFD application, FDL3DI and a mini-app, consisting of the Data Parallel Line Relaxation (DPLR) implicit time advancement kernel in the ablation-fluid-structure interaction solver (AFSI) under development, for the Intel Xeon Phi second generation, "Knights Landing," architecture. We identify hot-spots u...
Conference Paper
Full-text available
Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger jobs are, the more computing hours would be wasted by a crash. In this paper, we descr...
Article
Full-text available
The desire for high performance on scalable parallel systems is increasing the complexity and tunability of MPI implementations. The MPI Tools Information Interface (MPI_T) introduced as part of the MPI 3.0 standard provides an opportunity for performance tools and external software to introspect and understand MPI runtime behavior at a deeper leve...
Presentation
Full-text available
In this work, core kernels of an existing large-scale unstructured-grid computational fluid dynamics solver are ported to two nascent many-core architectures. Data layout strategies and disparate programming models are detailed, including explicit domain decomposition with message passing and shared-memory approaches. A gamut of optimization techni...
Poster
Full-text available
In the field of computational fluid dynamics (CFD), the Navier-Stokes equations are often solved using an unstructured-grid approach to accommodate geometric complexity. Furthermore, turbulent flows encountered in aerospace applications generally require highly anisotropic meshes, driving the need for implicit solution methodologies to efficiently...
Poster
Full-text available
This poster shows how TAU Commander can help easily characterize the performance of OpenSHMEM applications operating at extreme scales without modifying the application or relying on tool interfaces like PSHMEM.
Conference Paper
Full-text available
MPI implementations are becoming increasingly complex and highly tunable, and thus scalability limitations can come from numerous sources. The MPI Tools Interface (MPI_T) introduced as part of the MPI 3.0 standard provides an opportunity for performance tools and external software to introspect and understand MPI runtime behavior at a deeper level...
Poster
Full-text available
FUN3D is an unstructured-grid computational fluid dynamics suite widely used to support major national research and engineering efforts. FUN3D is being applied to analysis and design problems across all the major service branches at the Department of Defense. These applications span the speed range from subsonic to hypersonic flows and include both...
Conference Paper
Full-text available
The advent of many-core architectures poses new challenges to the MPI programming model which has been designed for distributed memory message passing. It is now clear that MPI will have to evolve in order to exploit shared-memory parallelism, either by collaborating with other programming models (MPI+X) or by introducing new shared-memory approach...
Conference Paper
Full-text available
Developing high performance OpenSHMEM applications routinely involves gaining a deeper understanding of software execution, yet there are numerous hurdles to gathering performance metrics in a production environment. Most OpenSHMEM performance profilers rely on the PSHMEM interface but PSHMEM is an optional and often unavailable feature. We present...
Chapter
Full-text available
Multi-Application onLine Profiling (MALP) is a performance tool which has been developed as an alternative to the trace-based approach for fine-grained event collection. Any performance and analysis measurement system must address the problem of data management and projection to meaningful forms. Our concept of a valorization chain is introduced to...
Conference Paper
Full-text available
New computing technologies are expected to change the high-performance computing landscape dramatically. Future exascale systems will comprise hundreds of thousands of compute nodes linked by complex networks. Compute nodes are expected to host both general-purpose and special-purpose processors or accelerators, with more complex memory hierarchies...
Conference Paper
Full-text available
In the race for Exascale, the advent of many-core processors will bring a shift in parallel computing architectures to systems of much higher concurrency, but with a relatively smaller memory per thread. This shift raises concerns about the adaptability of HPC software from the current generation to the brave new world. In this paper, we study domain...
Conference Paper
Full-text available
Fast, accurate numerical simulations of chemical kinetics are critical to aerospace, manufacturing, materials research, earth systems research, energy research, climate and weather prediction, and air quality prediction. Although these codes address different problems, chemical kinetics simulation often accounts for 60-95% of their runtime. Kppa is...
Article
Full-text available
This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray...
Article
Full-text available
Density functional theory (DFT) is the most widely employed electronic structure method because of its favorable scaling with system size and accuracy for a broad range of molecular and condensed-phase systems. The advent of massively parallel supercomputers has enhanced the scientific community's ability to study larger system sizes. Ground-state DF...
Chapter
Full-text available
I/O performance is becoming a key bottleneck in many cases at the extreme scale. As the volume of data and application reads and writes increases, it is important to assess the scalability of I/O operations as a key contributor to overall application performance. Optimizing I/O performance presents unique challenges for application developers a...
Conference Paper
Full-text available
The ability to measure the performance of OpenMP programs portably across shared memory platforms and across OpenMP compilers is a challenge due to the lack of a widely-implemented performance interface standard. While the OpenMP community is currently evaluating a tools interface specification called OMPT, at present there are different instrume...
Conference Paper
Full-text available
The recent development of a unified SHMEM framework, OpenSHMEM, has enabled further study in the porting and scaling of applications that can benefit from the SHMEM programming model. This paper focuses on non-numerical graph algorithms, which typically have a low FLOPS/byte ratio. An overview of the space and time complexity of Kruskal's and Prim...
Conference Paper
Full-text available
As software complexity increases, the analysis of code behavior during its execution is becoming more important. Instrumentation techniques, through the insertion of code directly into binaries, are essential to program analyses such as performance evaluation and profiling. In the context of high-performance parallel applications, building an inst...
Conference Paper
Full-text available
This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray...
Conference Paper
Full-text available
Good load-balancing methods are required in order to obtain scalability from the NWChem coupled-cluster module, which allows the detailed study of chemical problems by iteratively solving the Schrödinger equation with an accurate ansatz. In this application, a relatively large amount of task information can be obtained at minimal cost, which sugges...
Conference Paper
Full-text available
Memory errors, such as an invalid memory access, misaligned allocation, or write to deallocated memory, are among the most difficult problems to debug because popular debugging tools do not fully support state inspection when examining failures. This is particularly true for applications written in a combination of Python, C++, C, and Fortran. We p...
Article
Full-text available
Extreme-scale computing requires a new perspective on the role of performance observation in the Exascale system software stack. Because of the anticipated high concurrency and dynamic operation in these systems, it is no longer reasonable to expect that a post-mortem performance measurement and analysis methodology will suffice. Rather, there is a...
Article
Full-text available
Developing effective yet scalable load-balancing methods for irregular computations is critical to the successful application of simulations in a variety of disciplines at petascale and beyond. This paper explores a set of static and dynamic scheduling algorithms for block-sparse tensor contractions within the NWChem computational chemistry code fo...
Conference Paper
Full-text available
This tutorial presents state-of-the-art performance tools for leading-edge HPC systems founded on the Score-P community instrumentation and measurement infrastructure, demonstrating how they can be used for performance engineering of effective scientific applications based on standard MPI or OpenMP and now common mixed-mode hybrid parallelizations....
Conference Paper
Full-text available
Traditional debugging tools do not fully support state inspection while examining failures in multi-language applications written in a combination of Python, C++, C, and Fortran. When an application experiences a runtime fault, such as numerical or memory error, it is difficult to relate the location of the fault to the original source code and exa...
Article
Full-text available
The use of global address space languages and one-sided communication for complex applications is gaining attention in the parallel computing community. However, lack of good evaluative methods to observe multiple levels of performance makes it difficult to isolate the cause of performance deficiencies and to understand the fundamental limitations...
Chapter
Full-text available
This paper gives an overview of the Score-P performance measurement infrastructure, which is being jointly developed by leading HPC performance tools groups. It motivates the advantages of the joint undertaking from both the developer and the user perspectives, and presents the design and components of the newly developed Score-P performance m...
Chapter
Full-text available
The rapidly growing number of cores on modern supercomputers imposes scalability demands not only on applications but also on the software tools needed for their development. At the same time, increasing application and system complexity makes the optimization of parallel codes more difficult, creating a need for scalable performance-analysis techn...
Chapter
Full-text available
Evolution and growth of parallel systems requires continued advances in the tools to measure, characterize, and understand parallel performance. Five recent developments in the TAU Performance System are reported. First, an update is given on support for heterogeneous systems with GPUs. Second, event-based sampling is being integrated in TAU to add...