Network Bandwidth Measurements and Ratio
Analysis with the HPC Challenge Benchmark
Rolf Rabenseifner, Sunil R. Tiyyagura, and Matthias Müller
High-Performance Computing-Center (HLRS), University of Stuttgart,
Allmandring 30, D-70550 Stuttgart, Germany,
rabenseifner | sunil | email@example.com,
www.hlrs.de/people/rabenseifner/ | .../sunil/ | .../mueller/
Proceedings, EuroPVM/MPI 2005, Sep. 18-21, Sorrento, Italy, LNCS, Springer-Verlag, 2005.
© Springer-Verlag, http://www.springer.de/comp/lncs/index.html
Abstract. The HPC Challenge benchmark suite (HPCC) was released
to analyze the performance of high-performance computing architectures
using several kernels to measure different memory and hardware access
patterns comprising latency based measurements, memory streaming,
inter-process communication and floating point computation. HPCC de-
fines a set of benchmarks augmenting the High Performance Linpack used
in the Top500 list. This paper describes the inter-process communication
benchmarks of this suite. Based on the effective bandwidth benchmark, a
special parallel random and natural ring communication benchmark has
been developed for HPCC. Ping-Pong benchmarks on a set of process
pairs can be used for further characterization of a system. This paper
analyzes first results achieved with HPCC. The focus of this paper is
on the balance between computational speed, memory bandwidth, and inter-process
communication bandwidth.
Keywords. HPCC, network bandwidth, effective bandwidth, Linpack,
HPL, STREAM, DGEMM, PTRANS, FFTE, latency, benchmarking.
1 Introduction and Related Work
The HPC Challenge benchmark suite (HPCC) [5,6] was designed to provide
benchmark kernels that examine different aspects of the execution of real appli-
cations. The first aspect is benchmarking the system with different combinations
of high and low temporal and spatial locality of the memory access. HPL (High
Performance Linpack) [4], DGEMM [2,3], PTRANS (parallel matrix transpose),
STREAM [1], FFTE (Fast Fourier Transform) [11], and RandomAccess are
dedicated to this task. Other aspects are measuring basic parameters like achiev-
able computational performance (again HPL), the bandwidth of the memory ac-
cess (STREAM copy or triad), and latency and bandwidth of the inter-process
communication based on ping-pong benchmarks and on parallel effective band-
width benchmarks [7,9].
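To illustrate the memory access pattern behind the STREAM bandwidth numbers referred to above, a minimal triad-like loop is sketched here. It is not the official STREAM code [1]; the array length, the single untimed initialization pass, and the reported MB/s formula are illustrative assumptions only.

/* Minimal sketch of a STREAM-triad-like measurement (not the official
   STREAM benchmark [1]); the array length N is an illustrative assumption,
   chosen only to be larger than typical caches. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];          /* triad: two loads, one store */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double t = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

    /* three arrays of 8-byte words are touched once per iteration */
    printf("triad bandwidth: %.1f MB/s\n", 3.0 * 8.0 * N / t / 1e6);
    free(a); free(b); free(c);
    return 0;
}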
This paper describes in Section 2 the latency and bandwidth benchmarks
used in the HPCC suite. Section 3 analyzes bandwidth and latency measure-
ments submitted to the HPCC web interface. In Section 4, the ratio between
computational performance, memory and inter-process bandwidth is analyzed
to compare system architectures (and not only specific systems). In Section 5,
the ratio analysis is extended to the whole set of benchmarks to compare the
largest systems in the list and also different network types.
2 Latency and Bandwidth Benchmark
The latency and bandwidth benchmark measures two different communication
patterns. First, it measures the single-process-pair latency and bandwidth, and
second, it measures the parallel all-processes-in-a-ring latency and bandwidth.
For the first pattern, ping-pong communication is used on a pair of processes.
Several different pairs of processes are used, and the maximal latency and minimal
bandwidth over all pairs are reported. While the ping-pong benchmark is executed
on one process pair, all other processes are waiting in a blocking receive. To limit
the total benchmark time used for this first pattern to 30 sec, only a subset of
the set of possible pairs is used. The communication is implemented with MPI
standard blocking send and receive.
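As an illustration of this first pattern, the following sketch shows how one ping-pong measurement between two ranks could look. It is a simplified outline and not the HPCC source; the function name, message size, and repetition count are assumptions, and the handling of the remaining, waiting processes is omitted.

/* Sketch of a single ping-pong measurement between the ranks p0 and p1,
   using MPI standard blocking send and receive as described above;
   a simplified outline, not the HPCC implementation. */
#include <mpi.h>
#include <stdlib.h>

double pingpong_latency(int p0, int p1, int msg_bytes, int nrepeat)
{
    int rank;
    char *buf = malloc(msg_bytes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < nrepeat; i++) {
        if (rank == p0) {            /* ping ... */
            MPI_Send(buf, msg_bytes, MPI_CHAR, p1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_bytes, MPI_CHAR, p1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == p1) {     /* ... and pong */
            MPI_Recv(buf, msg_bytes, MPI_CHAR, p0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_bytes, MPI_CHAR, p0, 0, MPI_COMM_WORLD);
        }
    }
    double t = MPI_Wtime() - t0;
    free(buf);
    /* time per one-way transfer; bandwidth follows as msg_bytes / latency */
    return t / (2.0 * nrepeat);
}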
In the second pattern, all processes are arranged in a ring topology and
each process sends a message to and receives a message from both its left and its
right neighbor in parallel. Two types of rings are reported: a naturally ordered
ring (i.e., ordered by the process ranks in MPI_COMM_WORLD), and a randomly
ordered ring, for which the geometric mean of the bandwidth over ten different
randomly chosen process orderings is reported. The communication is implemented
(a) with MPI standard non-blocking receive and send, and (b) with two calls to
MPI_Sendrecv, one for each direction in the ring. The faster of these two variants
is always used. For the latency and bandwidth measurements, each ring measurement
is repeated 8 or 3 times, respectively – for the random ring with different randomly
chosen orderings – and only the best result is chosen. With this type of
parallel communication, the bandwidth per process is defined as the total amount
of message data divided by the number of processes and by the maximal time
needed over all processes. The latency is defined as the maximal time needed over
all processes divided by the number of calls to MPI_Sendrecv (or MPI_Isend) in
each process. This definition is similar to the ping-pong definition, where the time
is measured for the sequence of a send and a receive, and again a send and a
receive, and then divided by 2. In the ring benchmark, the same pattern is executed by
all processes instead of a pair of processes. This benchmark is based on patterns
studied in the effective bandwidth communication benchmark [7,9].
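The following sketch illustrates variant (b) for a naturally ordered ring, together with the per-process bandwidth definition given above. It is a simplified outline under assumed names and parameters, not the HPCC implementation; the non-blocking variant (a), the random orderings, and the repetition and selection logic are omitted.

/* Sketch of one naturally ordered ring measurement with two MPI_Sendrecv
   calls per iteration (variant (b) above); a simplified outline, not the
   HPCC code. */
#include <mpi.h>
#include <stdlib.h>

double ring_bandwidth(int msg_bytes, int nrepeat)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    char *sbuf = malloc(msg_bytes), *rbuf = malloc(msg_bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < nrepeat; i++) {
        /* one call per ring direction */
        MPI_Sendrecv(sbuf, msg_bytes, MPI_CHAR, right, 0,
                     rbuf, msg_bytes, MPI_CHAR, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(sbuf, msg_bytes, MPI_CHAR, left,  1,
                     rbuf, msg_bytes, MPI_CHAR, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double t = MPI_Wtime() - t0, tmax;
    /* the maximal time over all processes defines the result */
    MPI_Allreduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    free(sbuf); free(rbuf);
    /* each process sends 2*nrepeat messages; bandwidth per process =
       total data / (#processes * tmax) = 2*nrepeat*msg_bytes / tmax;
       the corresponding latency would be tmax / (2*nrepeat)            */
    return 2.0 * nrepeat * (double)msg_bytes / tmax;
}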
For benchmarking latency and bandwidth, 8-byte and 2,000,000-byte
messages are used. The major results reported by this benchmark are:
• maximal ping-pong latency,
• average latency of parallel communication in randomly ordered rings,
• minimal ping-pong bandwidth,
• bandwidth per process in the naturally ordered ring,
• average bandwidth per process in randomly ordered rings.
Additionally reported values are the latency of the naturally ordered ring, and
the remaining values in the set of minimum, maximum, and average of the ping-
pong latency and bandwidth.
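For the random ring values, a possible aggregation of the ten orderings mentioned in the description above is sketched here; the function name and the fixed sample count are assumptions, not the HPCC code.

/* Sketch of the geometric-mean aggregation over the ten randomly ordered
   rings described above; name and sample count are assumptions. */
#include <math.h>

double random_ring_average(const double bw[], int n)   /* n = 10 here */
{
    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(bw[i]);
    return exp(log_sum / n);   /* geometric mean of the n measurements */
}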
Acknowledgments
The authors would like to thank all persons and institutions that have uploaded
data to the HPCC database, Jack Dongarra and Piotr Luszczek for the invitation
to Rolf Rabenseifner to include his effective bandwidth benchmark into
the HPCC suite, Holger Berger for the HPCC results on the NEC SX-6+ and
helpful discussions on the HPCC analysis, David Koester for his helpful remarks
on the HPCC Kiviat diagrams, and Gerrit Schulz and Michael Speck, student
co-workers, who have implemented parts of the software.
References
1. John McCalpin: STREAM: Sustainable Memory Bandwidth in High Performance
Computers. http://www.cs.virginia.edu/stream/
2. Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, Iain S. Duff: A set of level
3 basic linear algebra subprograms. ACM Transactions on Mathematical Software
(TOMS), 16(1):1–17, March 1990.
3. Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, Iain S. Duff: Algorithm
679; a set of level 3 basic linear algebra subprograms: model implementation and
test programs. ACM Transactions on Mathematical Software (TOMS), 16(1):18–
28, March 1990.
4. Jack J. Dongarra, Piotr Luszczek, and Antoine Petitet: The LINPACK benchmark:
Past, present, and future. Concurrency and Computation: Practice and Experience.
5. Jack Dongarra and Piotr Luszczek: Introduction to the HPC Challenge Benchmark
Suite. Computer Science Department Tech Report UT-CS-05-544, 2005.
6. Panel on HPC Challenge Benchmarks: An Expanded View of High End Computers.
SC2004, November 12, 2004.
7. Alice E. Koniges, Rolf Rabenseifner, and Karl Solchenbach: Benchmark Design
for Characterization of Balanced High-Performance Architectures. In proceedings
of the 15th International Parallel and Distributed Processing Symposium
(IPDPS'01), Workshop on Massively Parallel Processing (WMPP), April 23-27,
2001, San Francisco, USA, Vol. 3, IEEE Computer Society Press
(http://www.computer.org/proceedings/).
8. Parallel Kernels and Benchmarks (PARKBENCH)
9. Rolf Rabenseifner and Alice E. Koniges: Effective Communication and File-I/O
Bandwidth Benchmarks. In J. Dongarra and Yiannis Cotronis (Eds.), Recent Ad-
vances in Parallel Virtual Machine and Message Passing Interface, proceedings of
the 8th European PVM/MPI Users’ Group Meeting, EuroPVM/MPI 2001, Sep. 23-
26, Santorini, Greece, pp. 24-35.
10. Rolf Rabenseifner: Hybrid Parallel Programming on HPC Platforms. In proceed-
ings of the Fifth European Workshop on OpenMP, EWOMP ’03, Aachen, Germany,
Sept. 22-26, 2003, pp. 185-194.
11. Daisuke Takahashi, Yasumasa Kanada: High-Performance Radix-2, 3 and 5 Par-
allel 1-D Complex FFT Algorithms for Distributed-Memory Parallel Computers.
Journal of Supercomputing, 15(2):207–228, Feb. 2000.
12. Nathan Wichmann: Cray and HPCC: Benchmark Developments and Results from
the Past Year. Proceedings of CUG 2005, May 16-19, Albuquerque, NM, USA.