Dr. Rolf Rabenseifner
Universität Stuttgart · High Performance Computing Center Stuttgart

About

71 Publications · 10,178 Reads · 2,332 Citations

Publications (71)
Article
Event traces are helpful in understanding the performance behavior of parallel applications since they allow the in-depth analysis of communication and synchronization patterns. However, the absence of synchronized clocks on most cluster systems may render the analysis ineffective because inaccurate relative event timings may misrepresent the logic...
Article
The Message-passing Interface (MPI) standard provides basic means for adaptations of the mapping of MPI process ranks to processing elements to better match the communication characteristics of applications to the capabilities of the underlying systems. The MPI process topology mechanism enables the MPI implementation to rerank processes by creatin...
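As an illustration of why such reranking can pay off, the toy metric below counts how many nearest-neighbour pairs of a 2D Cartesian grid end up on the same SMP node under two rank placements, row-major versus blocked. All names and the 4×4 grid / 4-cores-per-node configuration are hypothetical; this sketches the motivation, not MPI's actual reordering logic:

```python
def on_node_pairs(coords_of_rank, node_of_rank, nx, ny):
    """Count nearest-neighbour pairs of an nx-by-ny grid whose two
    ranks sit on the same SMP node (toy metric for a rank mapping)."""
    rank_of = {coords_of_rank[r]: r for r in range(nx * ny)}
    same = 0
    for r in range(nx * ny):
        x, y = coords_of_rank[r]
        for dx, dy in ((1, 0), (0, 1)):  # each pair counted once
            nb = rank_of.get((x + dx, y + dy))
            if nb is not None and node_of_rank[r] == node_of_rank[nb]:
                same += 1
    return same

nx = ny = 4
cores_per_node = 4  # 16 ranks, 4 ranks per SMP node
node = {r: r // cores_per_node for r in range(nx * ny)}

# Default placement: row-major coordinates -> each node holds one grid row.
rowmajor = {r: (r // ny, r % ny) for r in range(nx * ny)}
# Reordered placement: each node holds a contiguous 2x2 block of the grid.
blocked = {r: (2 * (r // 8) + (r % 4) // 2, 2 * ((r // 4) % 2) + r % 2)
           for r in range(nx * ny)}

print(on_node_pairs(rowmajor, node, nx, ny))  # 12 on-node neighbour pairs
print(on_node_pairs(blocked, node, nx, ny))   # 16 on-node neighbour pairs
```

The blocked placement keeps more neighbour traffic inside a node, which is exactly the kind of gain a topology-aware MPI implementation can deliver through reranking.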
Conference Paper
Event traces are helpful in understanding the performance behavior of parallel applications since they allow the in-depth analysis of communication and synchronization patterns. However, the absence of synchronized clocks on most cluster systems may render the analysis ineffective because inaccurate relative event timings may misrepresent the logic...
Article
Event traces are helpful in understanding the performance behavior of message-passing applications since they allow the in-depth analysis of communication and synchronization patterns. However, the absence of synchronized clocks may render the analysis ineffective because inaccurate relative event timings may misrepresent the logical event order an...
Conference Paper
Full-text available
Today most systems in high-performance computing (HPC) feature a hierarchical hardware design: shared memory nodes with several multi-core CPUs are connected via a network infrastructure. Parallel programming must combine distributed memory parallelization on the node interconnect with shared memory parallelization inside each node. We describe p...
Conference Paper
Event traces are helpful in understanding the performance behavior of message-passing applications since they allow in-depth analyses of communication and synchronization patterns. However, the absence of synchronized hardware clocks may render the analysis ineffective because inaccurate relative event timings can misrepresent the logical event ord...
Conference Paper
In the future, most systems in high-performance computing (HPC) will have a hierarchical hardware design, e.g., a cluster of ccNUMA or shared memory nodes with each node having several multi-core CPUs. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside eac...
Conference Paper
To support the development of efficient parallel codes on cluster systems, event tracing is a widely used technique with a broad spectrum of applications ranging from performance analysis, performance prediction and modeling to debugging. Usually, events are recorded along with the time of their occurrence to measure the temporal distance between t...
Article
This paper provides a comprehensive performance evaluation of the NEC SX-8 system at the High Performance Computing Center Stuttgart which has been in operation since July 2005. It provides a description of the installed hardware together with its performance for some synthetic benchmarks and five real world applications. All the applications achie...
Conference Paper
Identifying wait states in event traces of message-passing applications requires measuring temporal displacements between concurrent events. In the absence of synchronized hardware clocks, linear interpolation techniques can already account for differences in offset and drift, assuming that the drift of an individual processor is not time-depen...
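The linear interpolation mentioned above can be sketched in a few lines of Python: from two offset measurements of a local clock against a reference clock, fit the model t_ref = a + b·t_local (with b = 1 + drift) and map event timestamps onto the reference timebase. The function names and sample values are illustrative, not taken from the paper:

```python
def fit_linear_clock(t_local_1, t_ref_1, t_local_2, t_ref_2):
    """Fit t_ref = a + b * t_local from two offset measurements taken
    some time apart; b = 1 + drift, a = the initial offset."""
    b = (t_ref_2 - t_ref_1) / (t_local_2 - t_local_1)
    a = t_ref_1 - b * t_local_1
    return a, b

def to_reference_time(t_local, a, b):
    """Map a locally recorded event timestamp onto the reference timebase."""
    return a + b * t_local

# Illustrative values: a clock that starts 5 s ahead and runs 0.01% fast.
a, b = fit_linear_clock(0.0, 5.0, 1000.0, 1005.1)
print(to_reference_time(500.0, a, b))  # roughly 505.05
```

The assumption criticized in the paper is visible here: the model uses a single constant b, i.e., it breaks down when a processor's drift varies over time.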
Article
Full-text available
Many scientific applications running on today's supercomputers deal with increasingly large data sets and are correspondingly bottlenecked by the time it takes to read or write the data from/to the file system. We therefore undertook a study to characterize the parallel I/O performance of two of today's leading parallel supercomputers: the Columbia...
Article
Full-text available
The HPC Challenge (HPCC) Benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of five leading supercomputers—SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon Cluster, and NEC SX-8. These five systems use five different networks (SGI...
Conference Paper
Most HPC systems are clusters of shared memory nodes. Such systems can be PC clusters with dual or quad boards, but also “constellation” type systems with large SMP nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node.
Conference Paper
The paper focuses on a parallel implementation of a simulated annealing algorithm. In order to take advantage of the properties of modern clustered SMP architectures a hybrid method using a combination of OpenMP nested in MPI is advocated. The development of the reference implementation is proposed. Furthermore, a few load balancing strategies...
Conference Paper
Full-text available
The technology advances made in supercomputers and high performance computing clusters over the past few years have been tremendous. Clusters are the most common solution for high performance computing at the present time. In this kind of systems, an important subject is the parallel I/O subsystem design. Parallel file systems (GPFS, PVFS, Lustre,...
Article
Full-text available
The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of six leading supercomputers - SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, NEC SX-8, Cray XT3 and IBM Blue Gene/L. These six systems use also...
Conference Paper
Full-text available
In 2003, DARPA's High Productivity Computing Systems (HPCS) program released the HPCC suite. It examines the performance of HPC architectures using well-known computational kernels with various memory access patterns. Consequently, HPCC results bound the performance of real applications as a function of memory access characteristics and define perf...
Conference Paper
Full-text available
The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of five leading supercomputers - SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SG...
Article
W15P7T-05-C-D001 Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States
Conference Paper
The HPC Challenge benchmark suite (HPCC) was released to analyze the performance of high-performance computing architectures using several kernels to measure different memory and hardware access patterns comprising latency based measurements, memory streaming, inter-process communication and floating point computation. HPCC defines a set of benchmarks...
Conference Paper
Concurrent computing can be applied to heuristic methods for combinatorial optimization to shorten computation time, or equivalently, to improve the solution when time is fixed. This paper presents several communication schemes for parallel simulated annealing, focusing on a combination of OpenMP nested in MPI. Strikingly, even though many public...
Chapter
Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside of each node. Various hybrid MPI+OpenMP programming models are compared with pure MPI. Benchmark results of several platforms are presented. This paper...
Article
We describe our work on improving the performance of collective communication operations in MPICH for clusters connected by switched networks. For each collective operation, we use multiple algorithms de- pending on the message size, with the goal of min- imizing latency for short messages and minimizing bandwidth use for long messages. Although we...
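The size-dependent switching described above can be illustrated with a toy selector. The threshold value and the algorithm set are illustrative assumptions, not MPICH's actual internals:

```python
def select_bcast_algorithm(message_bytes, short_threshold=12288):
    """Choose a broadcast algorithm by message size (sketch only):
    - short messages: a binomial tree minimizes latency (log p steps);
    - long messages: scatter followed by allgather minimizes bandwidth
      use, since each process forwards only about 1/p of the data.
    The 12 KiB threshold is an assumed, illustrative crossover point."""
    if message_bytes <= short_threshold:
        return "binomial_tree"
    return "scatter_allgather"

print(select_bcast_algorithm(1024))     # binomial_tree
print(select_bcast_algorithm(1 << 20))  # scatter_allgather
```

In practice such crossover points are tuned per platform, since the latency/bandwidth trade-off depends on the interconnect.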
Conference Paper
Full-text available
Nonlinear registration is one of the most computationally intensive tasks in image registration, which often precludes the use of these methods in time-critical clinical applications. This work presents a highly efficient parallelized and vectorized implementation of the diffusive registration algorithm. ...
Article
This paper describes a finite element geo-process modeling software, which is capable of solving multiphysics problems in the area of geoscience. First results of a water resources management model for the Jordan Valley area are presented. These kinds of problems are very demanding in terms of CPU time and memory space, which are typically not availab...
Article
Full-text available
The HPC Challenge benchmark suite has been released by the DARPA HPCS program to help define the performance boundaries of future Petascale computing systems. HPC Challenge is a suite of tests that examine the performance of HPC architectures using kernels with memory access patterns more challenging than those of the High Performance Linpack (HPL)...
Conference Paper
We present improved algorithms for global reduction operations for message-passing systems. Each of p processors has a vector of m data items, and we want to compute the element-wise “sum” under a given, associative function of the p vectors. The result, which is also a vector of m items, is to be stored at either a given root processor (MPI_Reduce...
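As background for such reduction algorithms, a classic butterfly scheme, recursive doubling over p = 2^k processes, can be simulated sequentially in Python. This is a sketch of the general technique, not the paper's improved algorithms:

```python
def allreduce_recursive_doubling(vectors, op=lambda x, y: x + y):
    """Simulate a recursive-doubling allreduce on p = 2^k 'processes'.

    vectors: list of p equal-length lists (one vector per process).
    In round r, process i exchanges its partial result with partner
    i XOR 2^r and combines element-wise under the associative op;
    after log2(p) rounds every process holds the full reduction."""
    p = len(vectors)
    assert p & (p - 1) == 0, "p must be a power of two"
    data = [list(v) for v in vectors]
    dist = 1
    while dist < p:
        # Build the next round from the previous round's partial results.
        data = [[op(a, b) for a, b in zip(data[i], data[i ^ dist])]
                for i in range(p)]
        dist *= 2
    return data

result = allreduce_recursive_doubling([[1, 2], [3, 4], [5, 6], [7, 8]])
print(result[0])  # [16, 20] - every process ends with the element-wise sum
```

For MPI_Reduce, only the root needs the result, so variants such as reduce-scatter followed by a gather can halve the communicated volume for long vectors.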
Conference Paper
A five-year profiling in production mode at the University of Stuttgart has shown that more than 40% of the execution time of Message Passing Interface (MPI) routines is spent in the collective communication routines MPI_Allreduce and MPI_Reduce. Although MPI implementations have been available for about 10 years and all vendors are committed to thi...
Conference Paper
Full-text available
This paper deals with the parallel numerical simulation of cavitating flows. The governing equations are the compressible, time-dependent Euler equations for a homogeneous two-phase mixture. These equations are solved by an explicit finite volume approach. In contrast to the ideal gas, after each time step fluid properties, namely pressure and tempera...
Article
This paper analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes. Benchmark results show that the hybrid masteronly programming model can be used more efficiently on some vector-type systems, although this model suffers from sleeping application threads while the master thread communicates. This paper analy...
Article
Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node. The hybrid MPI+OpenMP programming model is compared with pure MPI, compiler based parallelization, and other parallel progr...
Conference Paper
This paper deals with the parallel numerical simulation of cavitating flows. The governing equations are the compressible, time dependent Euler equations for a homogeneous two-phase mixture. The equations of state for the density and internal energy are more complicated than for the ideal gas. These equations are solved by an explicit finite volume app...
Conference Paper
Full-text available
One of the crucial problems in image processing is image matching, i.e., matching two images or, in our case, matching a model with a given image. Since this problem is highly computation-intensive, parallel processing is essential to obtain solutions in time under real-world constraints. The Hausdorff method is used to locate human beings in ima...
Conference Paper
Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node inter-connect with the shared memory parallelization inside of each node. The hybrid MPI+OpenMP programming model is compared with pure MPI and compiler based parallelization. The paper focuses on bandwidth and...
Conference Paper
Published in the proceedings of WOMPEI 2002, International Workshop on OpenMP: Experiences and Implementations, part of ISHPC-IV, International Symposium on High Performance Computing, May 15-17, 2002, Kansai Science City, Japan. LNCS, Springer-Verlag, 2002. © Springer-Verlag, http://www.springer.de/comp/lncs/index.html Abstract: Most HPC syste...
Article
Full-text available
The parallel effective I/O bandwidth benchmark (b_eff_io) is aimed at producing a characteristic average number of the I/O bandwidth achievable with parallel MPI-I/O applications exhibiting various access patterns and using various buffer lengths. It is designed so that 15 minutes should be sufficient for a first pass of all access patterns. First...
Conference Paper
We describe the design and MPI implementation of two benchmarks created to characterize the balanced system performance of high-performance clusters and supercomputers: b_eff, a communication-specific benchmark that examines the parallel message passing performance of a system, and b_eff_io, which characterizes the effective I/O bandwidth. Both benchm...
Article
Full-text available
We describe the design and MPI implementation of two benchmarks created to characterize the balanced system performance of high-performance clusters and supercomputers. We start with a communication-specific benchmark, called b_eff that characterizes the message passing performance of a system. Following the same line of development, we extend this...
Conference Paper
Full-text available
We describe the design and MPI implementation of two benchmarks created to characterize the balanced system performance of high-performance clusters and supercomputers. We start with a communication-specific benchmark, called b_eff, that characterizes the message passing performance of a system. Following the same line of development, we exten...
Conference Paper
The effective I/O bandwidth benchmark (b_eff_io) covers two goals: (1) to achieve a characteristic average number for the I/O bandwidth achievable with parallel MPI-I/O applications, and (2) to get detailed information about several access patterns and buffer lengths. The benchmark examines "first write", "rewrite" and "read" access, strided (indiv...
Article
Full-text available
The effective I/O bandwidth benchmark (b_eff_io) covers two goals: (1) to achieve a characteristic average number for the I/O bandwidth achievable with parallel MPI-I/O applications, and (2) to get detailed information about several access patterns and buffer lengths. The benchmark examines "first write", "rewrite" and "read" access, strided (indiv...
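The three access types named in the abstract can be mimicked with a minimal timing loop. This is a toy in the spirit of b_eff_io, not the benchmark itself; the file name and buffer parameters are arbitrary:

```python
import os
import tempfile
import time

def measure(path, buffer_len=1 << 20, count=16):
    """Time 'first write', 'rewrite' and 'read' of count buffers of
    buffer_len bytes each, returning MB/s per phase (toy sketch)."""
    buf = b"x" * buffer_len
    total_mb = buffer_len * count / 1e6
    rates = {}
    for phase in ("first write", "rewrite", "read"):
        mode = "rb" if phase == "read" else "wb"
        t0 = time.perf_counter()
        with open(path, mode) as f:
            for _ in range(count):
                if phase == "read":
                    f.read(buffer_len)
                else:
                    f.write(buf)
            if mode == "wb":
                f.flush()
                os.fsync(f.fileno())  # force data out of OS buffers
        rates[phase] = total_mb / (time.perf_counter() - t0)
    return rates

with tempfile.TemporaryDirectory() as d:
    print(measure(os.path.join(d, "b_eff_io_toy.dat")))
```

The real benchmark additionally runs these phases with MPI-I/O across many processes, over strided and segmented patterns and a spectrum of buffer lengths, which is what makes its 15-minute averaged figure characteristic of a system.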
Article
This paper presents an automatic counter instrumentation and profiling module added to the MPI library on Cray T3E and SGI Origin2000 systems. A detailed summary of the hardware performance counters and the MPI calls of any MPI production program is gathered during execution and written in MPI_Finalize on a special syslog file. The user can get the...
Article
105 pages. Recording and visualizing the program flow and the message exchange of parallel applications is difficult when each processor has its own clock and these clocks are not synchronized. Several strategies for constructing a global time are surveyed, and their limits are pointed out. ...
Article
This paper presents an automatic counter instrumentation and profiling module added to the MPI library on Cray T3E systems. A detailed summary of the hardware performance counters and the MPI calls of any MPI production program is gathered during execution and written on a special syslog file. The user can get the same information in a different...
Conference Paper
This paper presents an automatic counter instrumentation and profiling module added to the MPI library on Cray T3E and SGI Origin2000 systems. A detailed summary of the hardware performance counters and the MPI calls of any MPI production program is gathered during execution and written in MPI_Finalize on a special syslog file. The user can get t...
Article
This paper presents an automatic counter instrumentation and profiling module added to the MPI library on our Cray T3E. A statistical summary of the MPI calls of any MPI partition is gathered during execution and written in MPI_Finalize on a special syslog file. The user can get the same statistical information on a file. Weekly and monthly a stati...
Article
At the beginning of July 1996, the new high-performance computer NEC SX-4/32 was installed at the computing center of the Universität Stuttgart. This article describes the use of MPI on this shared-memory system and presents first results. It is intended to illustrate how MPI is used on the NEC SX-4 and where problems arise with the currently available implementation...
Article
Full-text available
The CRAY T3E-512 is currently the most powerful machine available at RUS/hww. Although it provides support for shared memory the natural programming model for the machine is message passing. Since RUS has decided to support primarily the MPI standard we have found it useful to test the performance of MPI on the machine for several standard message...
Conference Paper
Several metacomputing projects try to implement MPI for homogeneous and heterogeneous clusters of parallel systems. MPI-GLUE is the first approach that exports nearly the full MPI 1.1 to the user's application without losing the efficiency of the vendors' MPI implementations. Inside each MPP or PVP system the vendor's MPI implementation is used. Be...
Article
The CRAY T3E-512 is currently the most powerful machine available at RUS/hww. Although it provides support for shared memory the natural programming model for the machine is message passing. Since RUS has decided to support primarily the MPI standard we have found it useful to test the performance of MPI on the machine for several standard message...
Article
Event tracing and monitoring of parallel applications are difficult if each processor has its own unsynchronized clock. A survey is given on several strategies to generate a global time, and their limits are discussed. The controlled logical clock is a new method based on Lamport's logical clock and provides a method to modify inexact timestamps of...
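The central invariant behind the controlled logical clock, that a receive event must be timestamped later than its matching send, can be sketched as follows. This is a much-simplified illustration of the clock condition, not the paper's full algorithm; the function and field names are hypothetical:

```python
def correct_timestamps(events, min_latency=1e-6):
    """Enforce the clock condition on one process's trace, Lamport-style.

    events: time-ordered list of dicts with 'time' and, for receive
    events, 'send_time' (the timestamp of the matching send on the
    other process). A receive that appears to precede its send is
    shifted forward by at least min_latency; the accumulated shift is
    applied to all later local events so their order is preserved."""
    shift = 0.0
    corrected = []
    for ev in events:
        t = ev["time"] + shift
        if "send_time" in ev and t < ev["send_time"] + min_latency:
            shift += ev["send_time"] + min_latency - t
            t = ev["send_time"] + min_latency
        corrected.append(t)
    return corrected

# A receive stamped t=1.0 whose matching send carries t=1.5 (a faster
# remote clock) is moved after the send; the next local event keeps
# its relative distance of 1.0 s.
print(correct_timestamps([{"time": 1.0, "send_time": 1.5},
                          {"time": 2.0}]))
```

The controlled logical clock goes further than this sketch by bounding how much local intervals are stretched, so that corrected traces remain useful for quantitative timing analysis.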
Article
At the beginning of July 1996, the new high-performance computer NEC SX-4/32 was installed at the computing center of the Universität Stuttgart. This article describes the use of MPI on this shared-memory system and presents first results. It is intended to illustrate how MPI is used on the NEC SX-4 and where problems arise with the currently available implementation...
Article
DFN-RPC, a remote procedure call tool, was designed to distribute scientific applications across workstations and compute servers. This document describes the methods by which the DFN-RPC tool supports parallel and distributed applications. Asynchronous RPCs are enhanced into parallel RPCs and combined with data pipes. The startup of processes...
Article
The DFN-RPC, a remote procedure call tool, was developed to distribute scientific-technical application programs between workstations and compute servers. This report describes the methods by which the DFN-RPC tool supports parallel and distributed applications. Asynchronous RPCs were extended into parallel RPCs and...
Conference Paper
Taking part in the Early Participation Program of OSF/DCE on IBM RS/6000 workstations, we have examined the RPC of DCE between workstation and compute server under aspects of performance, capability and functionality for scientific-technical applications programmed in Fortran, under user-account. A brief introduction shows the demands expected from...
Article
Full-text available
Hybrid MPI/OpenMP and pure MPI on clusters of multi-core SMP nodes involve several mismatch problems between the parallel programming models and the hardware architectures. Measurements of communication characteristics between cores on the same socket, on the same SMP node, and between SMP nodes on several platforms (including Cray XT4 and XT5)...
Article
Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node. The hybrid MPI+OpenMP programming model is compared with pure MPI, compiler based parallelization, and other parallel programming model...
Article
This report presents the development, implementation and use of High Performance Computing (HPC) methods to solve modelling problems in applied geoscience. These applications are mostly not academic problems but have arisen from field work at several 'real' investigation sites. 'Real world problems' are still very difficult to simulate, because...
Article
Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node. Various hybrid MPI+OpenMP programming models are compared with pure MPI. Benchmark results of several platforms are presented. This pape...
Article
Based on results reported by the HPC Challenge benchmark suite (HPCC), the balance between computational speed, communication bandwidth, and memory bandwidth is analyzed for HPC systems from Cray, NEC, IBM, and other vendors, and clusters with various network interconnects. Strengths and weaknesses of the communication interconnect are examined for thr...

Projects (1)

Adaptive High-Performance I/O Systems