
Dr. Rolf Rabenseifner
Universität Stuttgart · High Performance Computing Center Stuttgart
About
71 Publications · 11,781 Reads
2,508 Citations (since 2017)
Publications (71)
Event traces are helpful in understanding the performance behavior of parallel applications since they allow the in-depth analysis of communication and synchronization patterns. However, the absence of synchronized clocks on most cluster systems may render the analysis ineffective because inaccurate relative event timings may misrepresent the logic...
The Message-passing Interface (MPI) standard provides basic means for adaptations of the mapping of MPI process ranks to processing elements to better match the communication characteristics of applications to the capabilities of the underlying systems. The MPI process topology mechanism enables the MPI implementation to rerank processes by creatin...
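A minimal sketch of how an application hands this adaptation to the library, assuming a 2-D grid as the communication pattern (the dimensionality and the printed mapping are illustrative only): with reorder set to 1, MPI_Cart_create may rerank the processes so that grid neighbors are placed close together in the machine.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int dims[2]    = {0, 0};   /* let MPI_Dims_create factorize the process count */
        int periods[2] = {0, 0};   /* non-periodic grid */
        int size, old_rank, new_rank;
        MPI_Comm cart;

        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &old_rank);
        MPI_Dims_create(size, 2, dims);

        /* reorder = 1 permits the implementation to rerank processes so that
           grid neighbors land close together in the hardware topology */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
        MPI_Comm_rank(cart, &new_rank);

        printf("world rank %d -> cartesian rank %d\n", old_rank, new_rank);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }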
Event traces are helpful in understanding the performance behavior of message-passing applications since they allow the in-depth analysis of communication and synchronization patterns. However, the absence of synchronized clocks may render the analysis ineffective because inaccurate relative event timings may misrepresent the logical event order an...
Today most systems in high-performance computing (HPC) feature a hierarchical hardware design: Shared memory nodes with several multi-core CPUs are connected via a network infrastructure. Parallel programming must combine distributed memory parallelization on the node interconnect with shared memory parallelization inside each node. We describe p...
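The combination described here can be outlined in a few lines; a hedged sketch, assuming the common case where only the master thread communicates (MPI_THREAD_FUNNELED):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank;

        /* FUNNELED: only the master thread will make MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* shared memory parallelization inside the node */
        #pragma omp parallel
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());

        /* distributed memory communication on the node interconnect,
           issued by the master thread outside the parallel region */
        int token = (rank == 0) ? 42 : 0;
        MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }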
Event traces are helpful in understanding the performance behavior of message-passing applications since they allow in-depth analyses of communication and synchronization patterns. However, the absence of synchronized hardware clocks may render the analysis ineffective because inaccurate relative event timings can misrepresent the logical event ord...
In the future, most systems in high-performance computing (HPC) will have a hierarchical hardware design, e.g., a cluster of ccNUMA or shared memory nodes with each node having several multi-core CPUs. Parallel programming must combine the distributed memory parallelization on the node inter-connect with the shared memory parallelization inside eac...
To support the development of efficient parallel codes on cluster systems, event tracing is a widely used technique with a broad spectrum of applications ranging from performance analysis, performance prediction and modeling to debugging. Usually, events are recorded along with the time of their occurrence to measure the temporal distance between t...
This paper provides a comprehensive performance evaluation of the NEC SX-8 system at the High Performance Computing Center Stuttgart which has been in operation since July 2005. It provides a description of the installed hardware together with its performance for some synthetic benchmarks and five real world applications. All the applications achie...
Identifying wait states in event traces of message-passing applications requires measuring temporal displacements between concurrent events. In the absence of synchronized hardware clocks, linear interpolation techniques can already account for differences in offset and drift, assuming that the drift of an individual processor is not time dependent...
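The linear interpolation idea fits in one function; a toy sketch with illustrative names, assuming two calibration points at which the master time was observed:

    /* Map a local timestamp t onto the master time base, assuming constant
       offset and drift between two calibration points (t1, m1) and (t2, m2):
       m1, m2 are the master times observed at local times t1, t2. */
    double to_master_time(double t, double t1, double m1,
                          double t2, double m2) {
        double drift = (m2 - m1) / (t2 - t1);  /* relative clock rate */
        return m1 + drift * (t - t1);          /* linear interpolation */
    }

A time-dependent drift violates exactly this constant-slope assumption, which is why the work goes beyond plain interpolation.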
Many scientific applications running on today's supercomputers deal with increasingly large data sets and are correspondingly bottlenecked by the time it takes to read or write the data from/to the file system. We therefore undertook a study to characterize the parallel I/O performance of two of today's leading parallel supercomputers: the Columbia...
The HPC Challenge (HPCC) Benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of five leading supercomputers—SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon Cluster, and NEC SX-8. These five systems use five different networks (SGI...
Most HPC systems are clusters of shared memory nodes. Such systems can be PC clusters with dual or quad boards, but also “constellation” type systems with large SMP nodes. Parallel programming must combine the distributed memory parallelization on the node inter-connect with the shared memory parallelization inside of each node.
The paper focuses on a parallel implementation of a simulated annealing algorithm. In order to take advantage of the properties of modern clustered SMP architectures a hybrid method using a combination of OpenMP nested in MPI is advocated. The development of the reference implementation is proposed. Furthermore, a few load balancing strategies...
The technology advances made in supercomputers and high performance computing clusters over the past few years have been tremendous. Clusters are the most common solution for high performance computing at the present time. In these kinds of systems, an important subject is the parallel I/O subsystem design. Parallel file systems (GPFS, PVFS, Lustre,...
The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of seven leading supercomputers - SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, NEC SX-8, Cray XT3 and IBM Blue Gene/L. These seven systems use also...
In 2003, DARPA's High Productivity Computing Systems (HPCS) program released the HPCC suite. It examines the performance of HPC architectures using well-known computational kernels with various memory access patterns. Consequently, HPCC results bound the performance of real applications as a function of memory access characteristics and define perf...
The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of five leading supercomputers - SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SG...
The HPC Challenge benchmark suite (HPCC) was released to analyze the performance of high-performance computing architectures using several kernels to measure different memory and hardware access patterns comprising latency based measurements, memory streaming, inter-process communication and floating point computation. HPCC defines a set of benchmarks...
Concurrent computing can be applied to heuristic methods for combinatorial optimization to shorten computation time, or equivalently, to improve the solution when time is fixed. This paper presents several communication schemes for parallel simulated annealing, focusing on a combination of OpenMP nested in MPI. Strikingly, even though many public...
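One of the simplest such schemes can be outlined as follows; this is a hedged toy example, not the paper's reference implementation: each MPI rank runs its own Metropolis chain on an illustrative quadratic objective, the cost evaluation is threaded with OpenMP, and the ranks finally exchange results with MPI_Allreduce.

    #include <mpi.h>
    #include <omp.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1024                 /* problem size, illustrative */

    /* Toy objective sum_i (x[i]-1)^2; the evaluation is OpenMP-threaded. */
    static double cost(const double *x) {
        double c = 0.0;
        #pragma omp parallel for reduction(+:c)
        for (int i = 0; i < N; i++)
            c += (x[i] - 1.0) * (x[i] - 1.0);
        return c;
    }

    int main(int argc, char **argv) {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        srand(rank + 1);           /* independent annealing chain per rank */

        static double x[N];        /* starts at 0 */
        double cur = cost(x), temp = 1.0;

        for (int step = 0; step < 1000; step++) {
            int i = rand() % N;
            double old = x[i];
            x[i] += (double)rand() / RAND_MAX - 0.5;   /* random move */
            double c = cost(x);
            if (c < cur || exp((cur - c) / temp) > (double)rand() / RAND_MAX)
                cur = c;                               /* accept */
            else
                x[i] = old;                            /* reject */
            temp *= 0.995;                             /* geometric cooling */
        }

        /* communication scheme: keep only the globally best cost */
        double best;
        MPI_Allreduce(&cur, &best, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
        if (rank == 0) printf("global best cost: %g\n", best);

        MPI_Finalize();
        return 0;
    }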
Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside of each node. Various hybrid MPI+OpenMP programming models are compared with pure MPI. Benchmark results of several platforms are presented. This paper...
We describe our work on improving the performance of collective communication operations in MPICH for clusters connected by switched networks. For each collective operation, we use multiple algorithms depending on the message size, with the goal of minimizing latency for short messages and minimizing bandwidth use for long messages. Although we...
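The dispatch idea can be sketched for broadcast; the threshold value is an illustrative tuning parameter, the library's MPI_Bcast stands in for a bandwidth-optimized long-message algorithm (e.g., scatter plus allgather), and the byte-wise binomial tree assumes a contiguous datatype. This is not MPICH's internal code, only the technique in miniature.

    #include <mpi.h>
    #include <stdio.h>

    #define SHORT_MSG_THRESHOLD 2048   /* bytes; real switch points are tuned */

    /* Binomial-tree broadcast over raw bytes: O(log p) steps keep latency
       low for short messages. */
    static void bcast_binomial(void *buf, int nbytes, int root, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int rel = (rank - root + size) % size;      /* rank relative to root */

        for (int mask = 1; mask < size; mask <<= 1) {
            if (rel < mask) {                       /* already has the data */
                int dst = rel + mask;
                if (dst < size)
                    MPI_Send(buf, nbytes, MPI_BYTE, (dst + root) % size, 0, comm);
            } else if (rel < 2 * mask) {            /* receives in this step */
                MPI_Recv(buf, nbytes, MPI_BYTE, (rel - mask + root) % size, 0,
                         comm, MPI_STATUS_IGNORE);
            }
        }
    }

    /* Dispatch on message size: latency-optimal tree for short messages,
       the library broadcast as a stand-in for a bandwidth-optimal one. */
    static void my_bcast(void *buf, int count, MPI_Datatype type, int root,
                         MPI_Comm comm) {
        int type_size;
        MPI_Type_size(type, &type_size);
        if (count * type_size <= SHORT_MSG_THRESHOLD)
            bcast_binomial(buf, count * type_size, root, comm);
        else
            MPI_Bcast(buf, count, type, root, comm);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, value = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) value = 42;
        my_bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("rank %d: value = %d\n", rank, value);
        MPI_Finalize();
        return 0;
    }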
Nonlinear registration is one of the most computationally intensive tasks in image registration. This often precludes the use of these methods in time-critical clinical applications. In this work, a very efficient parallelized and vectorized implementation of the diffusive registration algorithm is presented. Tes...
This paper describes a finite element geo-process modeling software, which is capable of solving multiphysics problems in the area of geoscience. First results of a water resources management model for the Jordan Valley area are presented. These kinds of problems are very demanding in terms of CPU time and memory space, which are typically not availab...
The HPC Challenge benchmark suite has been released by the DARPA HPCS program to help define the performance boundaries of future Petascale computing systems. HPC Challenge is a suite of tests that examine the performance of HPC architectures using kernels with memory access patterns more challenging than those of the High Performance Linpack (HPL)...
We present improved algorithms for global reduction operations for message-passing systems. Each of p processors has a vector of m data items, and we want to compute the element-wise “sum” under a given, associative function of the p vectors. The result, which is also a vector of m items, is to be stored at either a given root processor (MPI_Reduce...
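In MPI terms, the two variants of the operation look like this minimal usage example (vector length and values are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    #define M 4   /* items per process, illustrative */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double v[M], sum[M];
        for (int i = 0; i < M; i++) v[i] = rank + i;   /* this rank's vector */

        /* element-wise sum of the p vectors, result stored at root 0 only */
        MPI_Reduce(v, sum, M, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        /* the same reduction, result available on all p processes */
        MPI_Allreduce(v, sum, M, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) printf("sum[0] = %g\n", sum[0]);

        MPI_Finalize();
        return 0;
    }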
A 5-year profiling in production mode at the University of Stuttgart has shown that more than 40% of the execution time of Message Passing Interface (MPI) routines is spent in the collective communication routines MPI_Allreduce and MPI_Reduce. Although MPI implementations are now available for about 10 years and all vendors are committed to thi...
This paper deals with the parallel numerical simulation of cavitating flows. The governing equations are the compressible, time-dependent Euler equations for a homogeneous two-phase mixture. These equations are solved by an explicit finite volume approach. In contrast to the ideal gas, after each time step fluid properties, namely pressure and tempera...
This paper analyzes the strength and weakness of several parallel programming models on clusters of SMP nodes. Benchmark results show that the hybrid masteronly programming model can be used more efficiently on some vector-type systems, although this model suffers from sleeping application threads while the master thread communicates. This paper analy...
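In outline, the masteronly model alternates node-wide threaded computation with single-threaded communication; a hedged sketch with an illustrative ring-style halo exchange:

    #include <mpi.h>
    #include <omp.h>

    #define N 100000

    int main(int argc, char **argv) {
        int provided, rank, size;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        static double u[N];
        double halo = 0.0;
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        for (int iter = 0; iter < 10; iter++) {
            /* compute phase: all threads of the node work */
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                u[i] += 1.0;

            /* communication phase: only the master thread calls MPI; the
               remaining threads idle here - the "sleeping application
               threads" drawback named in the abstract */
            MPI_Sendrecv(&u[N - 1], 1, MPI_DOUBLE, right, 0,
                         &halo,     1, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }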
Summary Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node inter-connect with the shared memory parallelization inside of each node. The hybrid MPI+OpenMP programming model is compared with pure MPI, compiler based parallelization, and other parallel progr...
This paper deals with the parallel numerical simulation of cavitating flows. The governing equations are the compressible, time dependent Euler equations for a homogeneous two-phase mixture. The equations of state for the density and internal energy are more complicated than for the ideal gas. These equations are solved by an explicit finite volume app...
One of the crucial problems in image processing is Image Matching, i.e., to match two images, or in our case, to match a model with the given image. Since this problem is highly computation intensive, parallel processing is essential to obtain solutions in time under real-world constraints. The Hausdorff method is used to locate human beings in ima...
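The underlying measure is the directed Hausdorff distance h(A,B) = max over a in A of min over b in B of d(a,b); a naive O(|A||B|) sketch on 2-D points with made-up sample data, whose nested loops are a natural target for parallelization:

    #include <math.h>
    #include <stdio.h>

    typedef struct { double x, y; } Point;

    /* Directed Hausdorff distance from point set A to point set B. */
    double hausdorff(const Point *A, int na, const Point *B, int nb) {
        double h = 0.0;
        for (int i = 0; i < na; i++) {
            double dmin = INFINITY;
            for (int j = 0; j < nb; j++) {
                double dx = A[i].x - B[j].x, dy = A[i].y - B[j].y;
                double d = sqrt(dx * dx + dy * dy);
                if (d < dmin) dmin = d;
            }
            if (dmin > h) h = dmin;   /* worst-matched model point */
        }
        return h;
    }

    int main(void) {
        Point model[] = {{0, 0}, {1, 0}, {0, 1}};             /* made-up data */
        Point image[] = {{0.1, 0.1}, {1.1, 0.0}, {0.0, 0.9}};
        printf("h(model, image) = %.3f\n", hausdorff(model, 3, image, 3));
        return 0;
    }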
Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node inter-connect with the shared memory parallelization inside of each node. The hybrid MPI+OpenMP programming model is compared with pure MPI and compiler based parallelization. The paper focuses on bandwidth and...
Published in the proceedings of WOMPEI 2002, International Workshop on OpenMP: Experiences and Implementations, part of ISHPC-IV, International Symposium on High Performance Computing, May 15-17, 2002, Kansai Science City, Japan. LNCS, Springer-Verlag, 2002. © Springer-Verlag, http://www.springer.de/comp/lncs/index.html Abstract. Most HPC syste...
The parallel effective I/O bandwidth benchmark (b_eff_io) is aimed at producing a characteristic average number of the I/O bandwidth achievable with parallel MPI-I/O applications exhibiting various access patterns and using various buffer lengths. It is designed so that 15 minutes should be sufficient for a first pass of all access patterns. First...
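One of the simpler patterns such a benchmark exercises is "separate segments with explicit offsets": each process writes its own contiguous block of a shared file. A hedged sketch, where the file name and segment size are illustrative:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define SEGMENT (1 << 20)   /* 1 MiB per process, an illustrative buffer length */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(SEGMENT);
        for (int i = 0; i < SEGMENT; i++) buf[i] = (char)rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "testfile",   /* illustrative name */
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* process i writes bytes [i*SEGMENT, (i+1)*SEGMENT) of the file */
        MPI_Offset off = (MPI_Offset)rank * SEGMENT;
        double t0 = MPI_Wtime();
        MPI_File_write_at(fh, off, buf, SEGMENT, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_sync(fh);                 /* force data to the file system */
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("wrote %d bytes in %.4f s\n", SEGMENT, t1 - t0);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }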
We describe the design and MPI implementation of two benchmarks created to characterize the balanced system performance of high-performance clusters and supercomputers: b_eff, the communication-specific benchmark examines the parallel message passing performance of a system, and b_eff_io, which characterizes the effective I/O bandwidth. Both benchm...
We describe the design and MPI implementation of two benchmarks created to characterize the balanced system performance of high-performance clusters and supercomputers. We start with a communication-specific benchmark, called b_eff that characterizes the message passing performance of a system. Following the same line of development, we extend this...
We describe the design and MPI implementation of two benchmarks created to characterize the balanced system performance of high-performance clusters and supercomputers. We start with a communication-specific benchmark, called b_eff, that characterizes the message passing performance of a system. Following the same line of development, we exten...
The effective I/O bandwidth benchmark (b_eff_io) covers two goals: (1) to achieve a characteristic average number for the I/O bandwidth achievable with parallel MPI-I/O applications, and (2) to get detailed information about several access patterns and buffer lengths. The benchmark examines "first write", "rewrite" and "read" access, strided (indiv...
This paper presents an automatic counter instrumentation and profiling module added to the MPI library on Cray T3E and SGI Origin2000 systems. A detailed summary of the hardware performance counters and the MPI calls of any MPI production program is gathered during execution and written in MPI_Finalize on a special syslog file. The user can get the...
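The standard mechanism for this kind of module is the MPI profiling interface: each MPI entry point can be intercepted and forwarded to its PMPI_ twin. A minimal sketch that counts MPI_Send calls and reports them in MPI_Finalize; a real module would cover all routines and write to the syslog file:

    #include <mpi.h>
    #include <stdio.h>

    static long send_calls = 0, send_bytes = 0;

    /* Intercept MPI_Send: update the counters, then forward to PMPI_Send. */
    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm) {
        int size;
        PMPI_Type_size(type, &size);
        send_calls++;
        send_bytes += (long)count * size;
        return PMPI_Send(buf, count, type, dest, tag, comm);
    }

    /* Report the totals when the application shuts MPI down. */
    int MPI_Finalize(void) {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        fprintf(stderr, "rank %d: %ld MPI_Send calls, %ld bytes\n",
                rank, send_calls, send_bytes);
        return PMPI_Finalize();
    }

Linked ahead of the MPI library, such wrappers replace the profiled entry points while the PMPI_ names keep the original functionality available.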
105 pages. Recording and visualizing the program flow and the message exchange of parallel applications is difficult if each processor has its own clock and these clocks are not synchronized. Several strategies for constructing a global time are surveyed, and their limits are pointed out. The...
This paper presents an automatic counter instrumentation and profiling module added to the MPI library on Cray T3E systems. A detailed summary of the hardware performance counters and the MPI calls of any MPI production program is gathered during execution and written on a special syslog file. The user can get the same information in a different...
This paper presents an automatic counter instrumentation and profiling module added to the MPI library on Cray T3E and SGI Origin2000 systems. A detailed summary of the hardware performance counters and the MPI calls of any MPI production program is gathered during execution and written in MPI_Finalize on a special syslog file. The user can get t...
This paper presents an automatic counter instrumentation and profiling module added to the MPI library on our Cray T3E. A statistical summary of the MPI calls of any MPI partition is gathered during execution and written in MPI_Finalize on a special syslog file. The user can get the same statistical information on a file. Weekly and monthly a stati...
In early July 1996, the new high-performance computer NEC SX-4/32 was installed at the computing center of the Universität Stuttgart. The article describes the use of MPI on this shared-memory machine and shows first results. It is meant to illustrate how MPI is used on the NEC SX-4 and where problems arise with the currently available implementation...
The CRAY T3E-512 is currently the most powerful machine available at RUS/hww. Although it provides support for shared memory the natural programming model for the machine is message passing. Since RUS has decided to support primarily the MPI standard we have found it useful to test the performance of MPI on the machine for several standard message...
Several metacomputing projects try to implement MPI for homogeneous and heterogeneous clusters of parallel systems. MPI-GLUE is the first approach which exports nearly full MPI 1.1 to the user's application without losing the efficiency of the vendors' MPI implementations. Inside of each MPP or PVP system the vendor's MPI implementation is used. Be...
Event tracing and monitoring of parallel applications are difficult if each processor has its own unsynchronized clock. A survey is given on several strategies to generate a global time, and their limits are discussed. The controlled logical clock is a new method based on Lamport's logical clock and provides a method to modify inexact timestamps of...
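The core rule inherited from Lamport's logical clock is that a receive must be timestamped later than its matching send; a toy sketch of the correction step, with an illustrative minimal message delay (the controlled logical clock additionally adjusts subsequent timestamps to avoid artificial jumps):

    #define MIN_MSG_DELAY 1e-6   /* seconds, illustrative lower bound */

    /* If a receive timestamp is not later than the matching send timestamp
       plus a minimal message delay, advance it to restore the logical order. */
    double correct_recv_time(double recv_ts, double send_ts) {
        if (recv_ts < send_ts + MIN_MSG_DELAY)
            return send_ts + MIN_MSG_DELAY;
        return recv_ts;          /* timestamp already consistent */
    }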
DFN-RPC, a remote procedure call tool, was designed to distribute scientific applications across workstations and compute servers. This document describes the methods by which the DFN-RPC tool supports parallel and distributed applications. Asynchronous RPC's are enhanced into parallel RPC's and combined with data pipes. The startup of processes...
The DFN-RPC, a remote procedure call tool, was developed to distribute scientific-technical application programs between workstations and compute servers. This report describes the methods by which the DFN-RPC tool supports parallel and distributed applications. Asynchronous RPCs were extended into parallel RPCs and...
Taking part in the Early Participation Program of OSF/DCE on IBM RS/6000 workstations, we have examined the RPC of DCE between workstation and compute server under the aspects of performance, capability and functionality for scientific-technical applications programmed in Fortran, running under a user account. A brief introduction shows the demands expected from...
Hybrid MPI/OpenMP and pure MPI on clusters of multi-core SMP nodes involve several mismatch problems between the parallel programming models and the hardware architectures. Measurements of communication characteristics between cores on the same socket, on the same SMP node, and between SMP nodes on several platforms (including Cray XT4 and XT5)...
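Such communication characteristics are typically measured with a ping-pong kernel; a minimal sketch between ranks 0 and 1, where process pinning decides whether the path under test is intra-socket, intra-node, or inter-node (message size and repetition count are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    #define NBYTES 1024   /* vary to probe latency vs. bandwidth */
    #define REPS   1000

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char buf[NBYTES] = {0};
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }

        if (rank == 0)
            printf("average round trip: %.2f us\n",
                   (MPI_Wtime() - t0) / REPS * 1e6);

        MPI_Finalize();
        return 0;
    }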
Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node inter-connect with the shared memory parallelization inside of each node. The hybrid MPI+OpenMP programming model is compared with pure MPI, compiler based parallelization, and other parallel programming model...
This report presents development, implementation and use of High Performance Computing (HPC) methods in order to solve modelling problems in applied geoscience. These applications are mostly not academic problems but have arisen from field work at several 'real' investigation sites. 'Real world problems' are still very difficult to simulate, because...
Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node inter-connect with the shared memory parallelization inside of each node. Various hybrid MPI+OpenMP programming models are compared with pure MPI. Benchmark results of several platforms are presented. This pape...
Based on results reported by the HPC Challenge benchmark suite (HPCC), the balance between computational speed, communication bandwidth, and memory bandwidth is analyzed for HPC systems from Cray, NEC, IBM, and other vendors, and clusters with various network interconnects. Strength and weakness of the communication interconnect is examined for thr...