Hybrid MPI/OpenMP Parallel Programming
on Clusters of Multi-Core SMP Nodes
Rolf Rabenseifner
High Performance Computing Center Stuttgart (HLRS), Germany
rabenseifner@hlrs.de
Georg Hager
Erlangen Regional Computing Center (RRZE), Germany
georg.hager@rrze.uni-erlangen.de
Gabriele Jost
Texas Advanced Computing Center (TACC), Austin, TX
gjost@tacc.utexas.edu
Abstract
Today most systems in high-performance computing (HPC) feature a hierarchical hardware design: Shared memory nodes with several multi-core CPUs are connected via a network infrastructure. Parallel programming must combine distributed memory parallelization on the node interconnect with shared memory parallelization inside each node. We describe potentials and challenges of the dominant programming models on hierarchically structured hardware: Pure MPI (Message Passing Interface), pure OpenMP (with distributed shared memory extensions) and hybrid MPI+OpenMP in several flavors. We pinpoint cases where a hybrid programming model can indeed be the superior solution because of reduced communication needs and memory consumption, or improved load balance. Furthermore we show that machine topology has a significant impact on performance for all parallelization strategies and that topology awareness should be built into all applications in the future. Finally we give an outlook on possible standardization goals and extensions that could make hybrid programming easier to do with performance in mind.
1. Mainstream HPC architecture
Today scientists who wish to write efficient parallel soft-
ware for high performance systems have to face a highly
hierarchical system design, even (or especially) on “com-
modity” clusters (Fig. 1 (a)). The price/performance sweet
spot seems to have settled at a point where multi-socket
multi-core shared-memory compute nodes are coupled via
high-speed interconnects. Inside the node, details like UMA
(Uniform Memory Access) vs. ccNUMA (cache coherent
Non-Uniform Memory Access) characteristics, number of
cores per socket and/or ccNUMA domain, shared and sepa-
rate caches, or chipset and I/O bottlenecks complicate mat-
ters further. Communication between nodes usually shows a rich set of performance characteristics because global, non-blocking crossbar networks have grown out of the affordable range.
This trend will continue into the foreseeable future,
broadening the available range of hardware designs even
when looking at high-end systems. Consequently, it seems
natural to employ a hybrid programming model which uses
OpenMP for parallelization inside the node and MPI for
message passing between nodes. However, there is always
the option to use pure MPI and treat every CPU core as
a separate entity with its own address space. And finally,
looking at the multitude of hierarchies mentioned above, the
question arises whether it might be advantageous to employ
a “mixed model” where more than one MPI process with
multiple threads runs on a node so that there is at least some
explicit intra-node communication (Fig. 1 (b)–(d)).
It is not a trivial task to determine the optimal model to
use for some specific application. There seems to be a gen-
eral lore that pure MPI can often outperform hybrid, but
counterexamples do exist and results tend to vary with in-
put data, problem size etc. even for a given code [1]. This
paper discusses potential reasons for this; in order to get op-
timal scalability one should in any case try to implement the
following strategies: (a) reduce synchronization overhead (see Sect. 3.5), (b) reduce load imbalance (Sect. 4.2), (c) reduce computational overhead and memory consumption (Sect. 4.3), and (d) minimize MPI communication overhead (Sect. 4.4).

Figure 1. A typical multi-socket multi-core SMP cluster (a), and three possible parallel programming models that can be mapped onto it: (b) pure MPI, (c) fully hybrid MPI/OpenMP, (d) mixed model with more than one MPI process per node.
There are some strong arguments in favor of a hybrid
model which tend to underline the assumption that it should
lead to improved parallel efficiency as compared to pure
MPI. In the following sections we will shed some light on
most of these statements and discuss their validity.
This paper is organized as follows: In Sect. 2 we outline
the available programming models on hybrid/hierarchical
parallel platforms, briefly describing their main strengths
and weaknesses. Sect. 3 concentrates on mismatch prob-
lems between parallel models and the parallel hardware:
Insufficient topology awareness of parallel runtime environ-
ments, issues with intra-node message passing, and subop-
timal network saturation. The additional complications that
arise from the necessity to optimize the OpenMP part of a
hybrid code are discussed in Sect. 3.5. In Sect. 4 we then
turn to the benefits that may be expected from employing
hybrid parallelization. In the final sections we address pos-
sible future developments in standardization which could
help address some of the problems described and close with
a summary.
Figure 2. Taxonomy of parallel programming models on hybrid platforms: pure MPI (one MPI process per core); hybrid MPI+OpenMP (MPI for inter-node communication, OpenMP inside each node), either masteronly (MPI only outside parallel regions, no communication/computation overlap) or overlapping (MPI communication by one or a few threads while the others compute); and pure "OpenMP" on top of a distributed virtual shared memory layer.
2. Parallel programming models on hybrid
platforms
Fig. 2 shows a taxonomy of parallel programming mod-
els on hybrid platforms. We have added an “OpenMP only”
branch because “distributed virtual shared memory” tech-
nologies like Intel Cluster OpenMP [2] allow the use of
OpenMP-like parallelization even beyond the boundaries of
a single cluster node. See Sect. 2.4 for more information.
This overview ignores the details about how exactly the
threads and processes of a hybrid program are to be mapped
onto hierarchical hardware. The mismatch problems which
are caused by the various alternatives to perform this map-
ping are discussed in detail in Sect. 3.
When using any combination of MPI and OpenMP, the
MPI implementation must feature some kind of threading
support. The MPI-2.1 standard defines the following levels:
MPI_THREAD_SINGLE: Only one thread will execute.
MPI_THREAD_FUNNELED: The process may be multi-
threaded, but only the main thread will make MPI
calls.
MPI_THREAD_SERIALIZED: The process may be
multi-threaded, and multiple threads may make MPI
calls, but only one at a time: MPI calls are not made
concurrently from two distinct threads.
MPI_THREAD_MULTIPLE: Multiple threads may call
MPI, with no restrictions.
Any hybrid code should always check for the required level of threading support using the MPI_Init_thread() call.
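As an illustration (this sketch is ours, not part of any code discussed in the paper), a masteronly-style code could request and verify the required threading level as follows; the error handling is merely one possible choice:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int required = MPI_THREAD_FUNNELED;  /* masteronly-style hybrid code */
        int provided, rank;

        MPI_Init_thread(&argc, &argv, required, &provided);

        if (provided < required) {           /* thread levels are ordered integers */
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            if (rank == 0)
                fprintf(stderr, "MPI provides only thread level %d, "
                                "but %d is required\n", provided, required);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* ... hybrid MPI+OpenMP work ... */

        MPI_Finalize();
        return 0;
    }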
2.1. Pure MPI
From a programmer’s point of view, pure MPI ignores
the fact that cores inside a single node work on shared mem-
ory. It can be employed right away on the hierarchical sys-
tems discussed above (see Fig. 1 (b)) without changes to ex-
isting code. Moreover, it is not required for the MPI library
and underlying software layers to support multi-threaded
applications, which simplifies implementation (Optimiza-
tions on the MPI level regarding the inner topology of the
node interconnect, e.g., fat tree or torus, may still be useful
or necessary).
On the other hand, a pure MPI programming model
implicitly assumes that message passing is the correct
paradigm to use for all levels of parallelism available in
the application and that the application “topology” can be
mapped efficiently to the hardware topology. This may not
be true in all cases, see Sect. 3 for details. Furthermore, all
communication between processes on the same node goes
through the MPI software layers, which adds to overhead.
Hopefully the library is able to use “shortcuts” via shared
memory in this case, choosing ways of communication that
effectively use shared caches, hardware assists for global
operations, and the like. Such optimizations are usually out
of the programmer’s influence, but see Sect. 5 for some dis-
cussion regarding this point.
2.2. Hybrid masteronly
The hybrid masteronly model uses one MPI process per
node and OpenMP on the cores of the node, with no MPI
calls inside parallel regions. A typical iterative domain de-
composition code could look like the following:
for (iteration = 1 ... N)
{
    #pragma omp parallel
    {
        /* numerical code */
    }
    /* on master thread only */
    MPI_Send(bulk data to halo areas in other nodes)
    MPI_Recv(halo data from the neighbors)
}
This resembles parallel programming on distributed-
memory parallel vector machines. In that case, the inner
layers of parallelism are not exploited by OpenMP but by
vectorization and multi-track pipelines.
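For concreteness, a compilable variant of this pattern for a 1-D decomposition might look as follows. This sketch is ours (not from any of the codes discussed here); the subdomain size and the Jacobi-like update are arbitrary assumptions, and left/right may be MPI_PROC_NULL at the domain boundaries:

    #include <mpi.h>

    #define NLOC 1000000                 /* assumed local subdomain size */
    static double u[NLOC + 2], unew[NLOC + 2];   /* one halo cell per side */

    void timestep_loop(int nsteps, int left, int right, MPI_Comm comm)
    {
        for (int step = 0; step < nsteps; step++) {
            /* numerical code: all threads */
            #pragma omp parallel for
            for (long i = 1; i <= NLOC; i++)
                unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

            /* master thread only: exchange halo cells with the neighbors */
            MPI_Sendrecv(&unew[NLOC], 1, MPI_DOUBLE, right, 0,
                         &unew[0],    1, MPI_DOUBLE, left,  0,
                         comm, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&unew[1],        1, MPI_DOUBLE, left,  1,
                         &unew[NLOC + 1], 1, MPI_DOUBLE, right, 1,
                         comm, MPI_STATUS_IGNORE);

            /* array swap would follow here; omitted for brevity */
        }
    }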
As there is no intra-node message passing, MPI opti-
mizations and topology awareness for this case are not re-
quired. Of course, the OpenMP parts should be optimized
for the topology at hand, e.g., by employing parallel first-
touch initialization on ccNUMA nodes or using thread-core
affinity mechanisms.
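As an illustration of the first-touch technique mentioned above (our own sketch, with an assumed array size), the data should be initialized with the same OpenMP worksharing pattern that later accesses it, so that each page is placed in the ccNUMA domain of the thread that will use it:

    #include <omp.h>
    #include <stdlib.h>

    #define N 100000000L     /* assumed array length */

    double *a, *b;

    void setup_and_compute(void)
    {
        a = malloc(N * sizeof(double));
        b = malloc(N * sizeof(double));

        /* parallel first-touch initialization: each page is mapped into the
           ccNUMA domain of the thread that touches it first */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++) {
            a[i] = 0.0;
            b[i] = 1.0;
        }

        /* subsequent compute loops must use the same static distribution so
           that every thread works mostly on locally placed memory */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 2.0 * b[i];
    }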
There are, however, some major problems connected with masteronly mode:
- All other threads are idle during communication phases of the master thread, which could lead to a strong impact of communication overhead on scalability. Alternatives are discussed in Sect. 3.1.3 and Sect. 3.3 below.
- The full inter-node MPI bandwidth might not be saturated by using a single communicating thread.
- The MPI library must be thread-aware on a simple level by providing MPI_THREAD_FUNNELED. Actually, a lower thread-safety level would suffice for masteronly, but the MPI-2.1 standard does not provide an appropriate level less than MPI_THREAD_FUNNELED.
2.3. Hybrid with overlap
One way to avoid idling compute threads during MPI
communication is to split off one or more threads of the
OpenMP team to handle communication in parallel with
useful calculation:
if (my_thread_ID < ...) {
    /* communication threads: */
    /* transfer halo */
    MPI_Send( halo data )
    MPI_Recv( halo data )
} else {
    /* compute threads: */
    /* execute code that does not need halo data */
}
/* all threads: */
/* execute code that needs halo data */
A possible reason to use more than one communication
thread could arise if a single thread cannot saturate the full
communication bandwidth of a compute node (see Sect. 3.3
for details). There is, however, a trade-off because the more
threads are sacrificed for MPI, the fewer are available for
overlapping computation.
2.4. Pure OpenMP on clusters
A lot of research has been invested into the implemen-
tation of distributed virtual shared memory software [3]
which allows near-shared-memory programming on dis-
tributed memory parallel machines, notably clusters. Since
2006, Intel has offered the “Cluster OpenMP” compiler add-
on, enabling the use of OpenMP (with minor restrictions)
across the nodes of a cluster [2]. Therefore, OpenMP has
literally become a possible programming model for those
machines. It is, to some extent, a hybrid model, being iden-
tical to plain OpenMP inside a shared-memory node but em-
ploying a sophisticated protocol that keeps “shared” memory pages coherent between nodes at explicit or automatic
OpenMP flush points.
With Cluster OpenMP, frequent page synchronization or
erratic access patterns to shared data must be avoided by all
means. If this is not possible, communication can poten-
tially become much more expensive than with plain MPI.
3. Mismatch problems
It should be evident by now that the main issue with
getting good performance on hybrid architectures is that
none of the programming models at one’s disposal fits op-
timally to the hierarchical hardware. In the following sec-
tions we will elaborate on these mismatch problems. How-
ever, as sketched above, one can also expect hybrid models
to have positive effects on parallel performance (as shown
in Sect. 4). Most hybrid applications suffer from the for-
mer and benefit from the latter to varying degrees, so it is nearly impossible to make a quantitative judgement with-
out thorough benchmarking.
3.1. The mapping problem: Machine topology
As a prototype mismatch problem we consider the map-
ping of a two-dimensional Cartesian domain decomposition
with 80 sub-domains, organized in a 5×16 grid, on a ten-
node dual-socket quad-core cluster like the one in Fig. 1 (a).
We will analyze the communication behavior of this appli-
cation with respect to the required inter-socket and inter-
node halo exchanges, presupposing that inter-core commu-
nication is fastest, hence favorable. See Sect. 3.2 for a dis-
cussion on the validity of this assumption.
3.1.1. Mapping problem with pure MPI
We assume here that the MPI start mechanism is able to
establish some affinity between processes and cores, i.e.
it is not left to chance which rank runs on which core of
a node. However, defaults vary across implementations.
Fig. 3 shows that there is an immense difference between
sequential and round-robin ranking, which is reflected in the
number of required inter-node and inter-socket connections.
In Fig. 3 (a), ranks are mapped to cores, sockets and nodes (A...J) in sequential order, i.e., ranks 0...7 go to the first node, etc. This leads to at most 17 inter-node and one inter-socket halo exchange per node, neglecting boundary effects. If the default is to place MPI ranks in round-robin order across nodes (Fig. 3 (b)), i.e., ranks 0...9 are mapped to the first core of each node, all the halo communication uses inter-node connections, which leads to 32 inter-node and no inter-socket exchanges. Whether the difference matters or not depends, of course, on the ratio of computational effort versus amount of halo data, both per process, and the characteristics of the network.

Figure 3. Influence of ranking order on the number of inter-socket (double lines, blue) and inter-node (single lines, red) halo communications when using pure MPI. (a) Sequential mapping, (b) round-robin mapping.
What is the best ranking order for the domain decom-
position at hand? It is important to realize that the hier-
archical node structure enforces multilevel domain decom-
position which can be optimized for minimizing inter-node
communication: It seems natural to try to reduce the socket
“surface area” exposed to the node boundary, as shown in
Fig. 4 (a), which yields ten inter-node and four inter-socket
halo exchanges per node at maximum. But still there is op-
timization potential, because this process can be iterated to
the socket level (Fig. 4 (b)), cutting the number of inter-
socket connections in half. Comparing Figs. 3 (a), (b) and
Figs. 4 (a), (b), this is the best possible rank order for pure
MPI.
The above considerations should make it clear that it can be vital to know about the default rank placement used in a particular parallel environment and to modify it if required. Unfortunately, many commodity clusters are still run today without a clear concept of rank-core affinity and without a user-friendly way to influence it.
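One simple way to at least verify the placement in practice is to have every rank report where it runs. The following sketch is our own illustration and assumes a Linux/glibc environment for sched_getcpu(); the reported core is only meaningful if process pinning is actually in effect:

    #define _GNU_SOURCE
    #include <mpi.h>
    #include <sched.h>    /* sched_getcpu(), glibc extension */
    #include <stdio.h>

    /* call after MPI_Init: each rank reports its host and current core */
    void report_placement(void)
    {
        int rank, len, cpu;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);
        cpu = sched_getcpu();     /* core the calling thread runs on right now */

        printf("rank %4d on host %s, core %d\n", rank, host, cpu);
    }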
Figure 4. Two possible mappings for multi-level domain decomposition with pure MPI.
3.1.2. Mapping problem with fully hybrid MPI+OpenMP
Hybrid MPI+OpenMP enforces the domain decomposition
to be a two-level algorithm. On MPI level, a coarse-grained
domain decomposition is performed. Parallelization on
OpenMP level implies a second level domain decomposi-
tion, which may be implicit (loop level parallelization) or
explicit as shown in Fig. 5.
In principle, hybrid MPI+OpenMP presents similar
challenges in terms of topology awareness, i.e. optimal
rank/thread placement, as pure MPI. There is, however, the
added complexity that standard OpenMP parallelization is
based on loop-level worksharing, which is, albeit easy to ap-
ply, not always the optimal choice. On ccNUMA systems,
for instance, it might be better to drop the worksharing con-
cept in favor of thread-level domain decomposition in order
to reduce inter-domain NUMA traffic (see below). On top
of this, proper first-touch page placement is required to get
scalable bandwidth inside a node, and thread-core affinity
must be employed. Still one should note that those issues
are not specific to hybrid MPI+OpenMP programming but
apply to pure OpenMP as well.
In contrast to pure MPI, hybrid parallelization of the above
domain decomposition enforces a 2×5 MPI domain grid,
leading to oblong OpenMP subdomains (if explicit domain
decomposition is used on this level, see Fig. 5). Optimal
rank ordering leads to only three inter-node halo exchanges
per node, but each with about four times the data volume.
Thus we arrive at a slightly higher communication effort compared to pure MPI (with optimal rank order), a consequence of the non-square domains.

Figure 5. Hybrid OpenMP+MPI two-level domain decomposition with a 2×5 MPI domain grid and eight OpenMP threads per node. Although there are fewer inter-node connections than with optimal MPI rank order (see Fig. 4 (b)), the aggregate halo size is slightly larger.
Beyond the requirements of hybrid MPI+OpenMP,
multi-level domain decomposition may be beneficial when
taking cache optimization into account: On the outermost
level the domain is divided into subdomains, one for each
MPI process. On the next level, these are again split into
portions for each thread, and then even further to fit into
successive cache levels (L3, L2, L1). This strategy ensures
maximum access locality, a minimum of cache misses,
NUMA traffic, and inter-node communication, but it must
be performed by the application, especially in the case
of unstructured grids. For portable software development,
standardized methods are desirable for the application to
detect the system topology and characteristic sizes (see also
Sect. 5).
3.1.3. Mapping problem with mixed model
The mixed model (see Fig. 1 (d)) represents a sort of com-
promise between pure MPI and fully hybrid models, featur-
ing potential advantages in terms of network saturation (see
Sect. 3.3 below). It suffers from the same basic drawbacks
as the fully hybrid model, although the impact of a loss of
thread-core affinity may be larger because of the possibly
significant differences in OpenMP performance and, more
importantly, MPI communication characteristics for intra-
node message transfer. Fig. 6 shows a possible scenario
where we contrast two alternatives for thread placement. In
Fig. 6 (a), intra-node MPI uses the inter-socket connection
only and shared memory access with OpenMP is kept inside
of each multi-core socket, whereas in Fig. 6 (b) all intra-
node MPI (with masteronly style) is handled inside sock-
ets. However, due to the spreading of the OpenMP threads
belonging to a particular process across two sockets there
is the danger of increased OpenMP startup overhead (see
Sect. 3.5) and NUMA traffic.
Figure 6. Two different mappings of threads to cores for the mixed model with two MPI processes per eight-core, two-socket node.
As with pure MPI, the message-passing subsystem
should be topology-aware in the sense that optimiza-
tion opportunities for intra-node transfers are actually ex-
ploited. The following section provides some more infor-
mation about performance characteristics of intra-node ver-
sus inter-node MPI.
3.2. Issues with intra-node MPI communication
The question whether the benefits or disadvantages of
different hybrid programming models in terms of communi-
cation behavior really impact application performance can-
not be answered in general since there are far too many pa-
rameters involved. Even so, knowing the characteristics of
the MPI system at hand, one may at least arrive at an ed-
ucated guess. As an example we choose the well-known
PingPong benchmark from the Intel MPI benchmark (IMB)
suite, performed on RRZE’s “Woody” cluster [4] (Fig. 7).
As expected, there are vast differences in achievable band-
widths for in-cache message sizes; surprisingly, starting at
a message size of 43 kB, inter-node communication outper-
forms inter-socket transfer, saturating at a bandwidth ad-
vantage of roughly a factor of two for large messages. Even
intra-socket communication is slower than IB in this case.
This behavior, which may be attributed to additional copy
operations through shared memory buffers and can be ob-
served in similar ways on many clusters, shows that simplis-
tic assumptions about superior performance of intra-node
connections may be false. Rank ordering should be chosen
accordingly. Please note also that more elaborate low-level
benchmarks than PingPong may be advisable to arrive at
a more complete picture about communication characteris-
tics.
Figure 7. IMB PingPong bandwidth versus message size for inter-node, inter-socket, and intra-socket communication on a two-socket dual-core Xeon 5160 cluster with DDR-IB interconnect, using Intel MPI. The original plot marks the inter-node/inter-socket crossover at 43 kB and the DDR-IB/PCIe 8x bandwidth limit.
At small message sizes, MPI communication is latency-
dominated. For the setup described above we measure the
following latency numbers:
Mode            Latency [µs]
IB inter-node   3.22
inter-socket    0.62
intra-socket    0.24
In strong scaling scenarios it is often quite likely that one
“rides the PingPong curve” towards a latency-driven regime
as processor numbers increase, possibly rendering the care-
fully tuned process/thread placement useless.
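A minimal ping-pong kernel along the lines of the IMB measurement can be used to reproduce such curves for different core pairs once the two processes are pinned appropriately. The following sketch is our own, the repetition count is arbitrary, and it assumes it is called on ranks 0 and 1 only:

    #include <mpi.h>
    #include <stdlib.h>

    /* Ping-pong between ranks 0 and 1; returns bandwidth in MByte/s.
       Pin the two processes to the desired cores before calling. */
    double pingpong(size_t bytes, int reps, int rank)
    {
        char *buf = malloc(bytes);
        double t0, t1;

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        free(buf);

        /* two messages of 'bytes' per round trip */
        return 2.0 * (double)bytes * reps / (t1 - t0) / 1.0e6;
    }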
3.3. Network saturation and sleeping threads with
the masteronly model
The masteronly variant, in which no MPI calls are issued
inside OpenMP-parallel regions, can be used with fully hy-
brid as well as the mixed model. Although being the easiest
way of implementing a hybrid MPI+OpenMP code, it has
two important shortcomings:
1. In the fully hybrid case, a single communicating thread
may not be able to saturate the node’s network connec-
tion. Using a mixed model (see Sect. 3.1.3) with more
than one MPI process per node might solve this prob-
lem, but one has to be aware of possible rank/thread
ordering problems as described in Sect. 3.1. On flat-
memory SMP nodes with no intra-node hierarchical
structure, this may be an attractive and easy to use op-
tion [5]. However, the number of systems with such
characteristics is waning. Current hierarchical archi-
tectures require some more effort in terms of thread-
/core affinity (see Sect. 4.1 for benchmark results in
mixed mode on a contemporary cluster).
2. While the master thread executes MPI code, all other
threads sleep. This effectively makes communica-
tion a purely serial component in terms of Amdahl’s
Law. Overlapping communication with computation
may provide a solution here (see Sect. 3.4 below).
One should note that on many commodity clusters to-
day (including those featuring high-speed interconnects like
InfiniBand), saturation of a network port can usually be
achieved by a single thread. However, this may change if,
e.g., multiple network controllers or ports are available per
node. As for the second drawback above, one may argue
that MPI provides non-blocking point-to-point operations
which should generally be able to achieve the desired over-
lap. Even so, many MPI implementations allow communi-
cation progress, i.e., actual data transfer, only inside MPI
calls so that real background communication is ruled out.
The non-availability of non-blocking collectives in the cur-
rent MPI standard adds to the problem.
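To make the intended (but not guaranteed) overlap concrete, the usual non-blocking pattern is sketched below. This is our own illustration; the helper routines compute_interior() and compute_boundary() are assumptions, and whether any transfer happens during compute_interior() depends on the MPI implementation's progress engine:

    #include <mpi.h>

    /* hypothetical application routines */
    extern void compute_interior(void);   /* work that does not need the halo */
    extern void compute_boundary(void);   /* work that needs the received halo */

    void exchange_and_compute(double *send_halo, double *recv_halo,
                              int halo_count, int neighbor)
    {
        MPI_Request req[2];

        MPI_Irecv(recv_halo, halo_count, MPI_DOUBLE, neighbor, 0,
                  MPI_COMM_WORLD, &req[0]);
        MPI_Isend(send_halo, halo_count, MPI_DOUBLE, neighbor, 0,
                  MPI_COMM_WORLD, &req[1]);

        compute_interior();    /* intended overlap; actual transfer may still
                                  progress only inside MPI calls */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        compute_boundary();
    }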
3.4. Overlapping communication and computation
It seems feasible to “split off” one or more OpenMP
threads in order to execute MPI calls, letting the rest
do the actual computations. Just as with the fully hybrid model, this requires the MPI library to support at least the MPI_THREAD_FUNNELED level. However, work distribution
across the non-communicating threads is not straightfor-
ward with this variant, because standard OpenMP work-
sharing works on the whole team of threads only. Nested
parallelism is not an alternative due to its performance
drawbacks and limited availability. Therefore, manual
worksharing must be applied:
if (my_thread_ID < 1) {
    /* communication thread: transfer halo */
    MPI_Send( halo data )
    MPI_Recv( halo data )
} else {
    /* compute threads: manual worksharing over [low, high) */
    my_range = (high - low - 1) / (num_threads - 1) + 1;
    my_low   = low + (my_thread_ID - 1) * my_range;
    my_high  = my_low + my_range;
    my_high  = min(high, my_high);
    for (i = my_low; i < my_high; i++) {
        /* computation */
    }
}
Apart from the additional programming effort for divid-
ing the computation into halo-dependent and non-halo-
dependent parts (see Sect. 2.3), directives for loop work-
sharing cannot be used any more, making “dynamic” or
“guided” schemes that are essential to use in poorly load-
balanced situations very hard to implement. Thread sub-
teams [6] have been proposed as a possible addition to the
future OpenMP 3.x/4.x standard and would ameliorate the
problem significantly. OpenMP tasks, which are part of the
recently passed OpenMP 3.0 standard, also form an elegant
alternative but presume that dynamic scheduling (which is
inherent to the task concept) is acceptable for the applica-
tion.
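As an illustration of the task-based alternative (our own sketch, not from the paper), the communication can be wrapped in one task while halo-independent work is split into further tasks. Since the communication task may be executed by any thread of the team, at least MPI_THREAD_SERIALIZED would be required here, and the block-wise helper routines are assumptions:

    #include <mpi.h>
    #include <omp.h>

    /* hypothetical application routines */
    extern void exchange_halo(void);          /* MPI halo transfer */
    extern void compute_block(int b);         /* halo-independent work on block b */
    extern void compute_halo_blocks(void);    /* halo-dependent work */

    void timestep(int nblocks)
    {
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task                      /* communication task */
            exchange_halo();

            for (int b = 0; b < nblocks; b++) {
                #pragma omp task firstprivate(b)  /* compute tasks, scheduled dynamically */
                compute_block(b);
            }

            #pragma omp taskwait                  /* wait for halo and interior updates */
            compute_halo_blocks();
        }
    }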
See Ref. [5] for performance models and measure-
ments comparing parallelization with masteronly style ver-
sus overlapping communication and computation on SMP
clusters with flat intra-node structure.
3.5. OpenMP performance pitfalls
As with standard (non-hybrid) OpenMP, hybrid
MPI+OpenMP is prone to some common performance pit-
falls. Just by switching on OpenMP, some compilers re-
frain from some loop optimizations which may cause a sig-
nificant performance hit. A prominent example is SIMD
vectorization of parallel loops on x86 architectures, which
gives best performance when using 16-byte aligned load/store instructions. If the compiler cannot apply dynamic loop
peeling [7], a loop parallelized with OpenMP can only be
vectorized using unaligned loads and stores (verified with
several releases of the Intel compilers, up to version 10.1).
The situation seems to improve gradually, though.
Thread creation/wakeup overhead and frequent synchro-
nization are further typical sources of performance prob-
lems with OpenMP, because they add to serial execution
and thus contribute to Amdahl’s Law on the node level. On
ccNUMA architectures correct first-touch page placement
must be employed in order to achieve scalable performance
across NUMA locality domains. In this respect one should
also keep in mind that communicating threads, inside or
outside of parallel regions, may have to partly access non-
local MPI buffers (i.e. from other NUMA domains).
Due to, e.g., limited memory bandwidth, it may be preferable in terms of performance or power consumption to
use fewer threads than available cores inside of each MPI
process [8]. This leads again to several affinity options (sim-
ilar to Fig. 6 (a) and (b)) and may impact MPI inter-node
communication.
4. Expected hybrid parallelization benefits
We have made it clear in the previous section that the par-
allel programming models described so far do not really fit
onto standard hybrid hardware. Consequently, one should
always try to optimize the parallel environment, especially
in terms of thread/core mapping and the correct choice of
hybrid execution mode, in order to minimize the mismatch
problems.
On the other hand, as pointed out in the introduction,
several real benefits can be expected from hybrid program-
ming models as opposed to pure MPI. We will elaborate on
the most important aspects in the following sections.
4.1. Additional levels of parallelism
In some applications, there is a coarse outer level of par-
allelism which can be easily exploited by message passing,
but is strictly limited to a certain number of workers. In such
a case, a viable way to improve scalability beyond this limit
is to use OpenMP in order to speed up each MPI process,
e.g. by identifying parallelizable loops at an inner level. A
prominent example is the BT-MZ benchmark from the NPB
(Multi-Zone NAS Parallel Benchmarks) suite. See Sect. 4.1
for details.
Benchmark results on “Ranger” at TACC
Here we present some performance results that were ob-
tained on a “Sun Constellation Cluster” named “Ranger”
[9], a high-performance compute resource at the Texas Ad-
vanced Computing Center (TACC) in Austin. It comprises
a DDR InfiniBand network which connects 3936 ccNUMA
compute blades (nodes), each with four 2.3GHz AMD
Opteron “Barcelona” quad-core chips and 32GB of mem-
ory. This allows for 16-way shared memory programming
within each node. At four flops per cycle, the overall peak
performance is 579 TFlop/s. For compiling the benchmarks
we employed PGI’s F90 compiler in version 7.1, directing
it to optimize for Barcelona processors. MVAPICH was
used for MPI communication, and numactl for implement-
ing thread-core and thread-memory affinity.
The NAS Parallel Benchmark (NPB) Multi-Zone (MZ)
[10] codes BT-MZ and SP-MZ (class E) were chosen to ex-
emplify the benefits and limitations of hybrid mode. The
purpose of the NPB-MZ is to capture the multiple levels of
parallelism inherent in many full scale applications. Each
benchmark exposes a different challenge to scalability: BT-
MZ is a block tridiagonal simulated CFD code. The size of
the zones varies widely, with a ratio of about 20 between the
largest and the smallest zone. This poses a load balancing
problem when only coarse-grained parallelism is exploited
on a large number of cores. SP-MZ is a scalar pentadiago-
nal simulated CFD code with equally sized zones, so from
a workload point of view the best performance should be
achieved by pure MPI. A detailed discussion of the per-
formance characteristics of these codes is presented in [11].
The class E problem size for both benchmarks comprises
an aggregate grid size of 4224×3456×92 points and a total
number of 4096 zones. Each MPI process is assigned a set
of zones to work on, according to a bin-packing algorithm
to achieve a balanced workload. Static worksharing is used on the OpenMP level. Due to the implementation of the benchmarks, the maximum number of MPI processes is limited to the number of zones for SP-MZ as well as BT-MZ.

Figure 8. NPB BT-MZ and SP-MZ (class E) performance on Ranger for mixed hybrid and pure MPI modes (see text for details on the mixed setup). There is no pure MPI data for 8192 cores as the number of MPI processes is limited to 4096 (zones) in that case.
Fig. 8 shows results at 1024 to 8192 cores. For both
BT-MZ and SP-MZ the mixed hybrid mode enables scala-
bility beyond the number of zones. In the case of BT-MZ,
reducing the number of MPI processes and using OpenMP
threads allows for better load balancing while maintaining a
high level of parallelism. SP-MZ scales well with pure MPI,
but reducing the number of MPI processes cuts down on the
amount of data to be communicated and the total number of
MPI calls. At 4096 cores the hybrid version is 9.6 % faster.
Thus, for both benchmarks, hybrid MPI+OpenMP outper-
forms pure MPI. SP-MZ shows best results with mixed hy-
brid mode using half of the maximum possible MPI pro-
cesses at 2 threads each. The best mixed hybrid mode for
BT-MZ depends on the coarse grain load balancing that can
be achieved and varies with the number of available cores.
We must emphasize that the use of affinity mechanisms
(numactl, in this particular case) is absolutely essential for
getting good performance and reproducibility on this cc-
NUMA architecture.
4.2. Improved load balancing
If the problem at hand has load balancing issues, some
kind of dynamic balancing should be implemented. In MPI,
this is a problem for which no generic recipes exist. It is
highly dependent on the numerics and potentially requires
significant communication overhead. It is therefore hard to
implement in production codes.
One big advantage of OpenMP over MPI lies in the pos-
sible use of “dynamic” or “guided” loop scheduling. No ad-
ditional programming effort or data movement is required.
However, one should be aware that non-static scheduling is
suboptimal for memory-bound code on ccNUMA systems
because of unpredictable (and non-reproducible) access pat-
terns; if a guided or dynamic schedule is unavoidable, one
should at least employ round-robin page placement for ar-
ray data in order to get some level of parallel data access.
For the hybrid case, simple static load balancing on the
outer (MPI) level and dynamic/guided loop scheduling for
OpenMP can be used as a compromise. Note that if dy-
namic OpenMP load balancing is prohibitive because of
NUMA locality constraints, a mixed model (Fig. 1 (d)) may
be advisable where one MPI process runs in each NUMA
locality domain and dynamic scheduling is applied to the
threads therein.
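A minimal sketch of this compromise, assuming a per-process array of cells with strongly varying per-cell cost (the kernel, cell count, and chunk size are our own illustrative choices), could look as follows:

    #include <omp.h>

    /* hypothetical per-cell kernel with strongly varying cost */
    extern void update_cell(int cell);

    void process_local_domain(int ncells_local)
    {
        /* MPI level: a static decomposition has already assigned
           ncells_local cells to this process */

        /* OpenMP level: dynamic scheduling absorbs the remaining load
           imbalance inside the process (chunk size 64 is arbitrary) */
        #pragma omp parallel for schedule(dynamic, 64)
        for (int cell = 0; cell < ncells_local; cell++)
            update_cell(cell);
    }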
4.3. Reduced memory consumption
Although one might believe that there should be no data
duplication or, more generally, data overhead between MPI
processes, this is not true in reality. E.g., in domain de-
composition scenarios, the more MPI domains a problem is
divided into, the larger the aggregated surface and thus the
larger the amount of memory required for halos. Other data
like buffers internal to the MPI subsystem, but also lookup
tables, global constants, and everything that is usually du-
plicated for efficiency reasons, adds to memory consump-
tion. This pertains to redundant computation as well.
On the other hand, if there are multiple (t) threads per MPI process, duplicated data is reduced by a factor of t (this
is also true for halo layers if not using domain decomposi-
tion on the OpenMP level). Although this may seem like a
small advantage today, one must keep in mind that the num-
ber of cores per CPU chip is constantly increasing. In the
future, tens and even hundreds of cores per chip may lead
to a dramatic reduction of available memory per core.
It should be clear from the considerations in the previous
sections that it is not straightforward to pick the optimal
number of OpenMP threads per MPI process for a given
problem and system. Even assuming that mismatch/affinity
problems can be kept under control, using too many threads
can have negative effects on network saturation, whereas
too many MPI processes might lead to intolerable memory
consumption.
4.4. Further opportunities
Using multiple threads per process may have some ben-
efits on the algorithmic side due to larger physical domains
inside of each MPI process. This can happen whenever a
larger domain is advisable in order to get improved numer-
ical accuracy or convergence properties. Examples are:
- A multigrid algorithm is employed only per MPI domain, i.e. inside each process, but not between domains.
- Separate preconditioners are used inside and between MPI processes.
- MPI domain decomposition is based on physical zones.
An often used argument in favor of hybrid programming
is the potential reduction in MPI communication in compar-
ison to pure MPI. As shown in Sect. 3.1 and 5, this point
deserves some scrutiny because one must compare optimal
domain decompositions for both alternatives. However, the
number of messages sent and received per node does de-
crease which helps to reduce the adverse effects of MPI la-
tency. The overall aggregate message size is diminished as
well if intra-process “messages”, i.e. NUMA traffic, are not
counted. In the fully hybrid case, no intra-node MPI is re-
quired at all, which may allow the use of a simpler (and
hopefully more efficient) variant of the message-passing li-
brary, e.g., by not loading the shmem device driver. And
finally, a hybrid model enables incorporation of functional
parallelism in a very straightforward way: Just like using
one thread per process for concurrent communication/com-
putation as described above, one can equally well split off
another thread for, e.g., I/O or other chores that would be
hard to incorporate into the parallel workflow with pure
MPI. This could even reduce the non-parallelizable part of
the computation and thus enhance overall scalability.
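A sketch of such functional parallelism inside one MPI process is given below. It is our own illustration; the helper routines are assumptions, at least three OpenMP threads per process are presumed, and MPI calls remain funneled through the master thread:

    #include <mpi.h>
    #include <omp.h>

    /* hypothetical application routines */
    extern void exchange_halo(void);                          /* MPI, master thread only */
    extern void write_output(int step);                       /* file I/O for previous step */
    extern void compute_interior(int worker, int nworkers);   /* halo-independent work */
    extern void compute_boundary(int worker, int nworkers);   /* halo-dependent work */

    void hybrid_step(int step)
    {
        /* assumes at least three OpenMP threads per MPI process */
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            int n   = omp_get_num_threads();

            if (tid == 0) {
                exchange_halo();                  /* MPI funneled through master thread */
            } else if (tid == 1) {
                write_output(step - 1);           /* I/O overlapped with computation */
            } else {
                compute_interior(tid - 2, n - 2); /* remaining threads compute */
            }
            #pragma omp barrier
            compute_boundary(tid, n);             /* all threads: work needing the halo */
        }
    }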
5. Aspects of future standardization efforts
In Sect. 3 we have argued that mismatch problems need
special care, not only with hybrid programming, but also
under pure MPI. However, correct rank ordering and the
decisions between pure and mixed models cannot be op-
timized without knowledge about machine characteristics.
This includes, among other things, inter-node, inter-socket
and intra-socket communication bandwidths and latencies,
and information on the hardware topology in and between
nodes (cores per chip, chips per socket, shared caches,
NUMA domains and networks, and message-passing net-
work topology). Today, the programmer is often forced
to use non-portable interfaces in order to acquire this data
(examples under Linux are libnuma/numactl and the Intel
“cpuinfo” tool; other tools exist for other architectures and
operating systems) or perform their own low-level bench-
marks to figure out topology features.
What is needed for the future is a standardized interface
with an abstraction layer that shifts the non-portable pro-
gramming effort to a library provider. In our opinion, the
right place to provide such an interface is the MPI library,
which has to be adapted to the specific hardware anyway.
At least the most basic topology and (quantitative) communication performance characteristics could be provided inside MPI at little cost. Thus we propose the inclusion of a topol-
ogy/performance interface into the future MPI 3.0 standard,
see also [12].
As mentioned in Sect. 3.4, there are already some efforts to include a subteam feature into upcoming OpenMP
standards. We believe this feature to be essential for hybrid
programming on current and future architectures, because
it will greatly facilitate functional parallelism and enable
standard dynamic load balancing inside multi-threaded MPI
processes.
6. Conclusions
In this paper we have pinpointed the issues and poten-
tials in developing high performance parallel codes on cur-
rent and future hierarchical systems. Mismatch problems,
i.e. the unsuitability of current hybrid hardware for running
highly parallel workloads, are often hard to solve, let alone
in a portable way. However, the potential gains in scalabil-
ity and absolute performance may be worth the significant
coding effort. New features in future MPI and OpenMP
standards may constitute a substantial improvement in that
respect.
Acknowledgements
We greatly appreciate the excellent support and the com-
puting time provided by the HPC group at the Texas Ad-
vanced Computing Center. Fruitful discussions with Rainer
Keller and Gerhard Wellein are gratefully acknowledged.
References
[1] R. Loft, S. Thomas, J. Dennis: Terascale Spectral Element
Dynamical Core for Atmospheric General Circulation
Models. Proceedings of SC2001, Denver, USA.
[2] Cluster OpenMP for Intel compilers. http://
software.intel.com/en-us/articles/
cluster-openmp-for-intel-compilers
[3] C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R.
Rajamony, W. Yu, W. Zwaenepoel: TreadMarks: Shared
Memory Computing on Networks of Workstations. IEEE
Computer 29(2), 18–28 (1996).
[4] http://www.hpc.rrze.uni-erlangen.de/systeme/
woodcrest-cluster.shtml
[5] R. Rabenseifner, G. Wellein: Communication and Op-
timization Aspects of Parallel Programming Models on
Hybrid Architectures. International Journal of High Per-
formance Computing Applications 17(1), 49–62 (2003).
[6] B. M. Chapman, L. Huang, H. Jin, G. Jost, B. R. de
Supinski: Toward Enhancing OpenMP’s Work-Sharing
Directives. In W. E. Nagel et al. (Eds.): Proceedings of
Euro-Par 2006, LNCS 4128, 645–654. Springer (2006).
[7] M. Stürmer, G. Wellein, G. Hager, H. Köstler, U. Rüde:
Challenges and potentials of emerging multicore archi-
tectures. In: S. Wagner et al. (Eds.), High Performance
Computing in Science and Engineering, Garching/Munich
2007, 551–566, Springer (2009).
[8] M. Curtis-Maury, A. Shah, F. Blagojevic, D. S.
Nikolopoulos, B. R. de Supinski, M. Schulz: Predic-
tion Models for Multi-dimensional Power-Performance
Optimization on Many Cores. In D. Tarditi, K. Olukotun
(Eds.), Proceedings on the Seventeenth International Con-
ference on Parallel Architectures and Compilation Tech-
niques (PACT08), Toronto, Canada, Oct. 25–29, 2008.
[9] http://www.tacc.utexas.edu/services/
userguides/ranger/
[10] R. F. Van Der Wijngaart, H. Jin: NAS Parallel Bench-
marks, Multi-Zone Versions. NAS Technical Report NAS-
03-010, NASA Ames Research Center, Moffett Field, CA,
2003.
[11] H. Jin, R. F. Van Der Wijngaart: Performance Character-
istics of the multi-zone NAS Parallel Benchmarks. Journal
of Parallel and Distributed Computing, Vol. 66, Special Is-
sue: 18th International Parallel and Distributed Processing
Symposium, pp. 674–685, May 2006.
[12] MPI Forum: MPI-2.0 Journal of Development (JOD),
Sect. 5.3 “Cluster Attributes”, http://www.mpi-forum.
org, July 18, 1997.
[13] R. Rabenseifner, G. Hager, G. Jost, R. Keller: Hybrid MPI
and OpenMP Parallel Programming. Half-day Tutorial
No. S-10 at SC07, Reno, NV, Nov. 10–16, 2007.
[14] R. Rabenseifner: Some Aspects of Message-Passing on
Future Hybrid Systems. Invited talk at 15th European
PVM/MPI Users’ Group Meeting, EuroPVM/MPI 2008,
Sep. 7–10, 2008, Dublin, Ireland. LNCS 5205, pp 8–10,
Springer (2008).
... Therefore, one can easily convert a serial programing code in OpenMP into a parallel code by appending some instructions. Moreover, OpenMP has standard and simple conventions, provides a crossplatform approach, and is popular among professional programmers [23,24] . ...
... The rule sets it can produce are of three types: Firewall rules (FW), Access Control List (ACL), and IP Chains (IPC). The second module creates a collection of random packets using the statistical specifications of the first module's classifiers [23] . Below we briefly describe the abovementioned rule set types. ...
Article
Full-text available
The network switches in the data plane of SDN are empowered by an elementary process, in which enormous number of packets which resemble big volumes of data, are classified into specific flows by matching them against a set of dynamic rules. This basic process accelerates the processing of data so that instead of processing singular packets repeatedly, corresponding actions are performed on corresponding flows of packets. In this paper, first, we address limitations on a typical packet classification algorithm like tuple space search (TSS). Then, we present a set of different scenarios to parallelize it on different parallel processing platforms including graphics processing units (GPUs), clusters of central processing units (CPUs), and hybrid clusters. Experimental results show that the hybrid cluster provides the best platform for parallelizing packet classification algorithms, which promises the average throughput rate of 4.2 million packets per second (Mpps). That is, the hybrid cluster produced by the integration of CUDA, MPI, and OpenMP programming model could classify 0.24 million packets per second more than the GPU cluster scheme. Such a packet classifier satisfies the required processing speed in the programmable network systems that would be used to communicate big medical data.
... Generally speaking, multi-core optimization with distributed memory parallelization alone suffers from high memory usage and overhead. On the other hand, shared memory parallelization has limited capacity in comparison to distributed memory parallelization [90] and can suffer from load balancing issues [91]. Therefore, the main goal of the multi-core optimization was to implement hybrid parallelization, i.e., distributed memory parallelization (MPI) and shared memory parallelization together. ...
Article
Full-text available
Molecular Dynamics simulations study material structure and dynamics at the atomic level. X-ray and neutron scattering experiments probe exactly the same time- and length scales as the simulations. In order to benchmark simulations against measured scattering data, a program is required that computes scattering patterns from simulations with good single-core performance and support for parallelization. In this work, the existing program Sassena is used as a potent solution to this requirement for a range of scattering methods, covering pico- to nanosecond dynamics, as well as the structure from some Ångströms to hundreds of nanometers. In the case of nanometer-level structures, the finite size of the simulation box, which is referred to as the finite size effect, has to be factored into the computations for which a method is described and implemented into Sassena. Additionally, the single-core and parallelization performance of Sassena is investigated, and several improvements are introduced.
... • parallelization, leading to a reduction of the computational time proportional to the number of employed processors with excellent scaling properties [39,52,55,68,73,81,82]; • structure-preserving schemes, to preserve physical properties at the discrete level without excessive mesh refinements, e.g., positivity preserving schemes [27,53,60,66,67], well-balanced schemes [12, 22, 25-28, 47, 59, 64, 85], TVD or maximum principle preserving schemes [10,48,49,86], and entropy conservative/dissipating schemes [4-7, 23, 24, 40, 44, 45, 56-58, 65, 69, 83, 84]; • high-order methods, which guarantee higher accuracy for coarser meshes and shorter computational times, on smooth problems, as they are able to catch complicated physical structures that low-order methods struggle to obtain, e.g., finite-element-based methods [1,3,8,54,57,70,71], finite volume methods [10,12,22,64,75,85], and discontinuous Galerkin (DG) methods [17,19,23,35,[44][45][46]51]. ...
Article
Full-text available
We propose a new paradigm for designing efficient p -adaptive arbitrary high-order methods. We consider arbitrary high-order iterative schemes that gain one order of accuracy at each iteration and we modify them to match the accuracy achieved in a specific iteration with the discretization accuracy of the same iteration. Apart from the computational advantage, the newly modified methods allow to naturally perform the p -adaptivity, stopping the iterations when appropriate conditions are met. Moreover, the modification is very easy to be included in an existing implementation of an arbitrary high-order iterative scheme and it does not ruin the possibility of parallelization, if this was achievable by the original method. An application to the Arbitrary DERivative (ADER) method for hyperbolic Partial Differential Equations (PDEs) is presented here. We explain how such a framework can be interpreted as an arbitrary high-order iterative scheme, by recasting it as a Deferred Correction (DeC) method, and how to easily modify it to obtain a more efficient formulation, in which a local a posteriori limiter can be naturally integrated leading to the p -adaptivity and structure-preserving properties. Finally, the novel approach is extensively tested against classical benchmarks for compressible gas dynamics to show the robustness and the computational efficiency.
... In distributed systems, nested parallelism is typically achieved by combining different programming models, one supporting the distributed system part and another dealing with the execution within each shared memory system. This is the case of the hybrid MPI + OpenMP model [18], StarSs [17] or the COMPSs + OmpSs combination [6]. Since the runtime systems supporting these models do not share information, developers must master several models and manage the coordination of different levels of parallelism. ...
Chapter
The scale and heterogeneity of exascale systems increment the complexity of programming applications exploiting them. Task-based approaches with support for nested tasks are a good-fitting model for them because of the flexibility lying in the task concept. Resembling the hierarchical organization of the hardware, this paper proposes establishing a hierarchy in the application workflow for mapping coarse-grain tasks to the broader hardware components and finer-grain tasks to the lowest levels of the resource hierarchy to benefit from lower-latency and higher-bandwidth communications and exploiting locality. Building on a proposed mechanism to encapsulate within the task the management of its finer-grain parallelism, the paper presents a hierarchical peer-to-peer engine orchestrating the execution of workflow hierarchies with fully-decentralized management. The tests conducted on the MareNostrum 4 supercomputer using a prototype implementation prove the validity of the proposal supporting the execution of up to 707,653 tasks using 2,400 cores and achieving speedups of up to 106 times faster than executions of a single workflow and centralized management.Keywordsdistributed systemsexascaletask-basedprogramming modelworkflowhierarchyruntime systempeer-to-peerdecentralized management
... The earliest remote sensing models were limited by the performance of desktop softwares running on a single machine. With the development of high-performance computing (HPC), some HPC-based methods have been considered as effective solutions for solving computational challenges, such as MPI/OpenMP [26], Hadoop [27], Spark [28], and CUDA programming supported by GPUs [29]. Many HPC frameworks targeting remote sensing and geographic information computing scenarios have also been proposed, such as Spa-tialHadoop [30], Hadoop GIS [31], GeoFlink [32], etc. ...
Article
Full-text available
Land cover mapping plays a pivotal role in global resource monitoring, sustainable development research, and effective management. However, the complexity of the mapping process, coupled with significant computational and data storage requirements, often leads to delays between data processing and product publication, thereby bringing challenges to creating multi-timesteps large-area products for monitoring dynamic land cover. Therefore, improving the efficiency of each stage in land cover mapping and automating the mapping process is currently an urgent issue to be addressed. This study proposes a high-performance automated large-area land cover mapping framework (HALF). By leveraging Docker and workflow technologies, the HALF effectively tackles model heterogeneity in complex land cover mapping processes, thereby simplifying model deployment and achieving a high degree of decoupling between production models. It optimizes key processes by incorporating high-performance computing techniques. To validate these methods, this study utilized Landsat imagery data and extracted samples using GLC_FCS and FROM_GLC, all of which were acquired at a spatial resolution of 30 m. Several 10° × 10° regions were chosen globally to illustrate the viability of generating large-area land cover using the HALF. In the sample collection phase, the HALF introduced an automated method for generating samples, which overlayed multiple prior products to generate a substantial number of samples, thus saving valuable manpower resources. Additionally, the HALF utilized high-performance computing technology to enhance the efficiency of the sample–image matching phase, thereby achieving a speed that was ten times faster than traditional matching methods. In the mapping stage, the HALF employed adaptive classification models to train the data in each region separately. Moreover, to address the challenge of handling a large number of classification results in a large area, the HALF utilized a parallel mosaicking method for classification results based on the concept of grid division, and the average processing time for a single image was approximately 6.5 s.
... The earliest remote sensing models were limited by the performance of desktop soft-94 wares running on a single machine. With the development of high-performance computing 95 (HPC), some HPC-based methods have been considered as effective solutions for solving 96 computational challenges, such as MPI/OpenMP [24], Hadoop [25], Spark [26], and CUDA 97 programming supported by GPUs [27]. Many HPC frameworks targeting remote sensing 98 and geographic information computing scenarios have also been proposed, such as Spa-99 tialHadoop [28], Hadoop GIS [29], GeoFlink [30], etc. ...
Preprint
Full-text available
Large-scale land cover plays a crucial role in global resource monitoring and management, as well as research on sustainable development. However, the complexity of the mapping process, coupled with significant computational and data storage requirements, often leads to delays between data processing and product publication, creating challenges for dynamic monitoring of large-scale land cover. Therefore, improving the efficiency of each stage in large-scale land cover mapping and automating the mapping process is currently an urgent and critical issue that needs to be addressed. We propose a high-performance automated large-scale land cover mapping framework(HALF) that introduces high-performance computing technology to the field of land cover production. HALF optimizes key processes, such as automated sample point extraction, sample-remote sensing image matching, and large-scale classification result mosaic and update. We selected several 10°×10° regions globally and the research makes several significant contributions:(1)We design HALF for land cover mapping based on docker and CWL-Airflow, which solves the heterogeneity of models between complex processes in land cover mapping and simplifies the model deployment process. By introducing workflow organization, this method achieves a high degree of decoupling between the production models of each stage and the overall process, enhancing the scalability of the framework. (2)HALF propose an automatic sample points method that generates a large number of samples by overlaying and analyzing multiple prior products, thus saving the cost of manual sample selection. Using high-performance computing technology improved the computational efficiency of sample-image matching and feature extraction phase, with 10 times faster than traditional matching methods.(3)HALF propose a high-performance classification result mosaic method based on the idea of grid division. By quickly establishing the spatial relationship between the image and the product and performing parallel computing, the efficiency of the mosaicking in large areas is significantly improved. The average processing time for a single image is around 6.5 seconds.
Chapter
We present performance results on two current multicore architectures, an STI (Sony, Toshiba, and IBM) Cell processor included in the new PlayStation™ 3 and a Sun UltraSPARC T2 (“Niagara 2”) machine. On the Niagara 2 we analyze typical performance patterns that emerge from the peculiar way the memory controllers are activated on this chip, using the standard STREAM benchmark and a shared-memory parallel lattice Boltzmann code. On the Cell processor we measure the memory bandwidth and run performance tests for LBM simulations. Additionally, we show results for an application in image processing on the Cell processor, where it is required to solve nonlinear anisotropic PDEs.
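The bandwidth measurements referred to here follow the familiar STREAM pattern. Below is a minimal OpenMP triad sketch, assuming an arbitrary array length and repetition count; it is not the benchmark code used in that chapter.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N      (40L * 1000 * 1000)   /* assumed array length       */
#define NTIMES 10                    /* assumed number of repeats  */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    /* first-touch initialization places pages in the ccNUMA domain of the using thread */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.5; }

    double best = 0.0;
    for (int k = 0; k < NTIMES; k++) {
        double t = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];                      /* triad kernel */
        t = omp_get_wtime() - t;
        double gbs = 3.0 * N * sizeof(double) / t / 1.0e9; /* counts 2 loads + 1 store */
        if (gbs > best) best = gbs;
    }
    printf("best triad bandwidth: %.1f GB/s\n", best);
    free(a); free(b); free(c);
    return 0;
}

First-touch initialization matters on ccNUMA nodes: pages are mapped into the domain of the thread that initializes them, which is why the initialization loop is parallelized the same way as the kernel.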
Conference Paper
Summary form only given. We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of meshes, but had not previously been captured in benchmarks. The new suite, named NPB (NAS parallel benchmarks) multizone, is extended from the NPB suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the message passing interface (MPI) and OpenMP, and another hybrid using a shared memory multilevel programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on three different parallel computers. We also use an empirical formula to investigate the performance characteristics of the hybrid parallel codes.
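The coarse-grain structure described here, with zones solved independently and boundary data exchanged after each time step, can be sketched in hybrid MPI+OpenMP form as follows; solve_zone(), exchange_boundaries(), and all sizes are placeholders rather than the NPB Multi-Zone code.

#include <mpi.h>
#include <omp.h>

#define NZONES_LOCAL 4        /* assumed zones per MPI rank */
#define ZONE_SIZE    100000   /* assumed points per zone    */
#define NSTEPS       10

static double zones[NZONES_LOCAL][ZONE_SIZE];

/* fine-grain OpenMP parallelism inside one zone */
static void solve_zone(double *z, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        z[i] = 0.25 * z[i];            /* placeholder zone update */
}

/* placeholder: NPB-MZ sends zone face data to the ranks owning neighbor zones */
static void exchange_boundaries(void)
{
    MPI_Barrier(MPI_COMM_WORLD);
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    for (int step = 0; step < NSTEPS; step++) {
        for (int z = 0; z < NZONES_LOCAL; z++)   /* coarse grain: loop over local zones */
            solve_zone(zones[z], ZONE_SIZE);
        exchange_boundaries();                    /* boundary exchange after each time step */
    }

    MPI_Finalize();
    return 0;
}

The two levels map naturally onto a cluster of SMP nodes: zones are distributed across MPI ranks, while the loops inside each zone use the threads of one node.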
Conference Paper
Power has become a primary concern for HPC systems. Dynamic voltage and frequency scaling (DVFS) and dynamic concurrency throttling (DCT) are two software tools (or knobs) for reducing the dynamic power consumption of HPC systems. To date, few works have considered the synergistic integration of DVFS and DCT in performance-constrained systems, and, to the best of our knowledge, no prior research has developed application-aware simultaneous DVFS and DCT controllers in real systems and parallel programming frameworks. We present a multi-dimensional, online performance predictor, which we deploy to address the problem of simultaneous runtime optimization of DVFS and DCT on multi-core systems. We present results from an implementation of the predictor in a runtime library linked to the Intel OpenMP environment and running on an actual dual-processor quad-core system. We show that our predictor derives near-optimal settings of the power-aware program adaptation knobs that we consider. Our overall framework achieves significant reductions in energy (19% mean) and ED2 (40% mean), through simultaneous power savings (6% mean) and performance improvements (14% mean). We also find that our framework outperforms earlier solutions that adapt only DVFS or DCT, as well as one that sequentially applies DCT then DVFS. Further, our results indicate that prediction-based schemes for runtime adaptation compare favorably and typically improve upon heuristic search-based approaches in both performance and energy savings.
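The concurrency-throttling knob (DCT) on its own can be illustrated with a short hedged sketch: successive OpenMP phases simply run with different thread counts chosen by a stand-in policy. The paper's online predictor is not reproduced here, and DVFS would additionally require the platform's frequency-scaling interface.

#include <omp.h>
#include <stdio.h>

/* stand-in for an online performance predictor: a fixed, hypothetical policy
 * that gives (assumed) memory-bound phases fewer threads */
static int choose_thread_count(int phase)
{
    return (phase % 2 == 0) ? 4 : 8;
}

int main(void)
{
    for (int phase = 0; phase < 6; phase++) {
        omp_set_num_threads(choose_thread_count(phase));   /* DCT: throttle concurrency */
        #pragma omp parallel
        {
            #pragma omp single
            printf("phase %d runs with %d threads\n", phase, omp_get_num_threads());
            /* ... the phase's actual computation would go here ... */
        }
    }
    return 0;
}

The point is simply that thread counts can be adapted between parallel regions at run time; deciding when fewer threads are enough is exactly what the predictor in the cited work addresses.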
Conference Paper
Climate modeling is a grand challenge problem where scientific progress is measured not in terms of the largest problem that can be solved but by the highest achievable integration rate. These models have been notably absent in previous Gordon Bell competitions due to their inability to scale to large processor counts. A scalable and efficient spectral element atmospheric model is presented. A new semi-implicit time stepping scheme accelerates the integration rate relative to an explicit model by a factor of two, achieving 130 years per day at T63L30 equivalent resolution. Execution rates are reported for the standard shallow water and Held-Suarez climate benchmarks on IBM SP clusters. The explicit T170 equivalent multi-layer shallow water model sustains 343 Gflops at NERSC, 206 Gflops at NPACI (SDSC) and 127 Gflops at NCAR. An explicit Held-Suarez integration sustains 369 Gflops on 128 16-way IBM nodes at NERSC.
Article
Shared memory facilitates the transition from sequential to parallel processing. Since most data structures can be retained, simply adding synchronization achieves correct, efficient programs for many applications. We discuss our experience with parallel computing on networks of workstations using the TreadMarks distributed shared memory system. DSM allows processes to assume a globally shared virtual memory even though they execute on nodes that do not physically share memory. We illustrate a DSM system consisting of N networked workstations, each with its own memory. The DSM software provides the abstraction of a globally shared memory, in which each processor can access any data item without the programmer having to worry about where the data is or how to obtain its value
Conference Paper
Most HPC systems are clusters of shared memory nodes. Such systems can be PC clusters with dual or quad boards, but also “constellation” type systems with large SMP nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node.
Conference Paper
OpenMP provides a portable programming interface for shared memory parallel computers (SMPs). Although this interface has proven successful for small SMPs, it requires greater flexibility in light of the steadily growing size of individual SMPs and the recent advent of multithreaded chips. In this paper, we describe two application development experiences that exposed these expressivity problems in the current OpenMP specification. We then propose mechanisms to overcome these limitations, including thread subteams and thread topologies. Thus, we identify language features that improve OpenMP application performance on emerging and large-scale platforms while preserving ease of programming.
Article
Summary: Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node. The hybrid MPI+OpenMP programming model is compared with pure MPI, compiler-based parallelization, and other parallel programming models on hybrid architectures. The paper focuses on bandwidth and latency aspects, and also on whether programming paradigms can separate the optimization of communication and computation. Benchmark results are presented for hybrid and pure MPI communication. This paper analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes.
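One hybrid flavor analyzed in this line of work, overlapping communication and computation under MPI_THREAD_FUNNELED, can be sketched as follows; the halo sizes, the ring neighbor topology, and the placeholder updates are assumptions, not code from the paper.

#include <mpi.h>
#include <omp.h>

#define N    1000000   /* assumed local field size   */
#define HALO 1000      /* assumed halo/boundary size */

static double field[N], halo_in[HALO], halo_out[HALO];

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            /* master thread exchanges halos (FUNNELED: only this thread calls MPI) */
            MPI_Sendrecv(halo_out, HALO, MPI_DOUBLE, right, 0,
                         halo_in,  HALO, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            /* remaining threads update the interior in the meantime */
            int nt = omp_get_num_threads() - 1;
            int id = omp_get_thread_num() - 1;
            long chunk = (N - 2L * HALO) / nt;
            long lo = HALO + id * chunk;
            long hi = (id == nt - 1) ? N - HALO : lo + chunk;
            for (long i = lo; i < hi; i++)
                field[i] = 0.5 * field[i];   /* placeholder interior update */
        }
        #pragma omp barrier
        /* boundary cells that depend on halo_in would be updated after the barrier */
    }

    MPI_Finalize();
    return 0;
}

The split illustrates the separation question raised above: the master thread's MPI call proceeds while the other threads keep the cores busy, but the benefit depends on whether the MPI library actually progresses communication asynchronously.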
Article
We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of grids, but had not previously been captured in benchmarks. The new suite, named NPB Multi-Zone, is extended from the NAS Parallel Benchmarks suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the Message Passing Interface (MPI) and OpenMP, and another hybrid using a shared memory multi-level programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on three different parallel computers. We also use an empirical formula to investigate the performance characteristics of the multi-zone benchmarks.
R. F. Van Der Wijngaart, H. Jin: NAS Parallel Benchmarks, Multi-Zone Versions. NAS Technical Report NAS-03-010, NASA Ames Research Center, Moffett Field, CA, 2003.