Hybrid MPI/OpenMP Parallel Programming
on Clusters of Multi-Core SMP Nodes
Rolf Rabenseifner
High Performance Computing Center Stuttgart (HLRS), Germany
rabenseifner@hlrs.de
Georg Hager
Erlangen Regional Computing Center (RRZE), Germany
georg.hager@rrze.uni-erlangen.de
Gabriele Jost
Texas Advanced Computing Center (TACC), Austin, TX
gjost@tacc.utexas.edu
Abstract
Today most systems in high-performance computing (HPC) feature a hierarchical hardware design: Shared memory nodes with several multi-core CPUs are connected via a network infrastructure. Parallel programming must combine distributed memory parallelization on the node interconnect with shared memory parallelization inside each node. We describe potentials and challenges of the dominant programming models on hierarchically structured hardware: Pure MPI (Message Passing Interface), pure OpenMP (with distributed shared memory extensions) and hybrid MPI+OpenMP in several flavors. We pinpoint cases where a hybrid programming model can indeed be the superior solution because of reduced communication needs and memory consumption, or improved load balance. Furthermore we show that machine topology has a significant impact on performance for all parallelization strategies and that topology awareness should be built into all applications in the future. Finally we give an outlook on possible standardization goals and extensions that could make hybrid programming easier to do with performance in mind.
1. Mainstream HPC architecture
Today scientists who wish to write efficient parallel soft-
ware for high performance systems have to face a highly
hierarchical system design, even (or especially) on “com-
modity” clusters (Fig. 1 (a)). The price/performance sweet
spot seems to have settled at a point where multi-socket
multi-core shared-memory compute nodes are coupled via
high-speed interconnects. Inside the node, details like UMA
(Uniform Memory Access) vs. ccNUMA (cache coherent
Non-Uniform Memory Access) characteristics, number of
cores per socket and/or ccNUMA domain, shared and sepa-
rate caches, or chipset and I/O bottlenecks complicate mat-
ters further. Communication between nodes usually shows a rich set of performance characteristics because global, non-blocking crossbar networks have grown out of the affordable range.
This trend will continue into the foreseeable future,
broadening the available range of hardware designs even
when looking at high-end systems. Consequently, it seems
natural to employ a hybrid programming model which uses
OpenMP for parallelization inside the node and MPI for
message passing between nodes. However, there is always
the option to use pure MPI and treat every CPU core as
a separate entity with its own address space. And finally,
looking at the multitude of hierarchies mentioned above, the
question arises whether it might be advantageous to employ
a “mixed model” where more than one MPI process with
multiple threads runs on a node so that there is at least some
explicit intra-node communication (Fig. 1 (b)–(d)).
It is not a trivial task to determine the optimal model to
use for some specific application. There seems to be a gen-
eral lore that pure MPI can often outperform hybrid, but
counterexamples do exist and results tend to vary with in-
put data, problem size etc. even for a given code [1]. This
paper discusses potential reasons for this; in order to get op-
timal scalability one should in any case try to implement the
following strategies: (a) reduce synchronization overhead (see Sect. 3.5), (b) reduce load imbalance (Sect. 4.2), (c) reduce computational overhead and memory consumption (Sect. 4.3), and (d) minimize MPI communication overhead (Sect. 4.4).

Figure 1. A typical multi-socket multi-core SMP cluster (a), and three possible parallel programming models that can be mapped onto it: (b) pure MPI, (c) fully hybrid MPI/OpenMP, (d) mixed model with more than one MPI process per node.
There are some strong arguments in favor of a hybrid
model which tend to underline the assumption that it should
lead to improved parallel efficiency as compared to pure
MPI. In the following sections we will shed some light on
most of these statements and discuss their validity.
This paper is organized as follows: In Sect. 2 we outline
the available programming models on hybrid/hierarchical
parallel platforms, briefly describing their main strengths
and weaknesses. Sect. 3 concentrates on mismatch prob-
lems between parallel models and the parallel hardware:
Insufficient topology awareness of parallel runtime environ-
ments, issues with intra-node message passing, and subop-
timal network saturation. The additional complications that
arise from the necessity to optimize the OpenMP part of a
hybrid code are discussed in Sect. 3.5. In Sect. 4 we then
turn to the benefits that may be expected from employing
hybrid parallelization. In the final sections we address pos-
sible future developments in standardization which could
help address some of the problems described and close with
a summary.
Figure 2. Taxonomy of parallel programming models on hybrid platforms: pure MPI (one MPI process per core); hybrid MPI+OpenMP (MPI for inter-node communication, OpenMP inside each node), either masteronly (MPI only outside parallel regions, no communication/computation overlap) or overlapping (MPI communication by one or a few threads while the others compute); and pure "OpenMP" on top of a distributed virtual shared memory layer.
2. Parallel programming models on hybrid
platforms
Fig. 2 shows a taxonomy of parallel programming mod-
els on hybrid platforms. We have added an “OpenMP only”
branch because “distributed virtual shared memory” tech-
nologies like Intel Cluster OpenMP [2] allow the use of
OpenMP-like parallelization even beyond the boundaries of
a single cluster node. See Sect. 2.4 for more information.
This overview ignores the details about how exactly the
threads and processes of a hybrid program are to be mapped
onto hierarchical hardware. The mismatch problems which
are caused by the various alternatives to perform this map-
ping are discussed in detail in Sect. 3.
When using any combination of MPI and OpenMP, the
MPI implementation must feature some kind of threading
support. The MPI-2.1 standard defines the following levels:
MPI_THREAD_SINGLE: Only one thread will execute.
MPI_THREAD_FUNNELED: The process may be multi-
threaded, but only the main thread will make MPI
calls.
MPI_THREAD_SERIALIZED: The process may be
multi-threaded, and multiple threads may make MPI
calls, but only one at a time: MPI calls are not made
concurrently from two distinct threads.
MPI_THREAD_MULTIPLE: Multiple threads may call
MPI, with no restrictions.
Any hybrid code should always check for the required level of threading support using the MPI_Init_thread() call.
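As an illustration (this sketch is ours, not part of any code discussed in the paper), a masteronly-style code could request and verify the required threading level as follows; the error handling is merely one possible choice:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int required = MPI_THREAD_FUNNELED;  /* masteronly-style hybrid code */
        int provided, rank;

        MPI_Init_thread(&argc, &argv, required, &provided);

        if (provided < required) {           /* thread levels are ordered integers */
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            if (rank == 0)
                fprintf(stderr, "MPI provides only thread level %d, "
                                "but %d is required\n", provided, required);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* ... hybrid MPI+OpenMP work ... */

        MPI_Finalize();
        return 0;
    }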
2.1. Pure MPI
From a programmer’s point of view, pure MPI ignores
the fact that cores inside a single node work on shared mem-
ory. It can be employed right away on the hierarchical sys-
tems discussed above (see Fig. 1 (b)) without changes to ex-
isting code. Moreover, it is not required for the MPI library
and underlying software layers to support multi-threaded
applications, which simplifies implementation (Optimiza-
tions on the MPI level regarding the inner topology of the
node interconnect, e.g., fat tree or torus, may still be useful
or necessary).
On the other hand, a pure MPI programming model
implicitly assumes that message passing is the correct
paradigm to use for all levels of parallelism available in
the application and that the application “topology” can be
mapped efficiently to the hardware topology. This may not
be true in all cases, see Sect. 3 for details. Furthermore, all
communication between processes on the same node goes
through the MPI software layers, which adds to overhead.
Hopefully the library is able to use “shortcuts” via shared
memory in this case, choosing ways of communication that
effectively use shared caches, hardware assists for global
operations, and the like. Such optimizations are usually out
of the programmer’s influence, but see Sect. 5 for some dis-
cussion regarding this point.
2.2. Hybrid masteronly
The hybrid masteronly model uses one MPI process per
node and OpenMP on the cores of the node, with no MPI
calls inside parallel regions. A typical iterative domain de-
composition code could look like the following:
for (iteration = 1 ... N)
{
    #pragma omp parallel
    {
        /* numerical code */
    }
    /* on master thread only */
    MPI_Send(bulk data to halo areas in other nodes)
    MPI_Recv(halo data from the neighbors)
}
This resembles parallel programming on distributed-
memory parallel vector machines. In that case, the inner
layers of parallelism are not exploited by OpenMP but by
vectorization and multi-track pipelines.
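For concreteness, a compilable variant of this pattern for a 1-D decomposition might look as follows. This sketch is ours (not from any of the codes discussed here); the subdomain size and the Jacobi-like update are arbitrary assumptions, and left/right may be MPI_PROC_NULL at the domain boundaries:

    #include <mpi.h>

    #define NLOC 1000000                 /* assumed local subdomain size */
    static double u[NLOC + 2], unew[NLOC + 2];   /* one halo cell per side */

    void timestep_loop(int nsteps, int left, int right, MPI_Comm comm)
    {
        for (int step = 0; step < nsteps; step++) {
            /* numerical code: all threads */
            #pragma omp parallel for
            for (long i = 1; i <= NLOC; i++)
                unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

            /* master thread only: exchange halo cells with the neighbors */
            MPI_Sendrecv(&unew[NLOC], 1, MPI_DOUBLE, right, 0,
                         &unew[0],    1, MPI_DOUBLE, left,  0,
                         comm, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&unew[1],        1, MPI_DOUBLE, left,  1,
                         &unew[NLOC + 1], 1, MPI_DOUBLE, right, 1,
                         comm, MPI_STATUS_IGNORE);

            /* array swap would follow here; omitted for brevity */
        }
    }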
As there is no intra-node message passing, MPI opti-
mizations and topology awareness for this case are not re-
quired. Of course, the OpenMP parts should be optimized
for the topology at hand, e.g., by employing parallel first-
touch initialization on ccNUMA nodes or using thread-core
affinity mechanisms.
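As an illustration of the first-touch technique mentioned above (our own sketch, with an assumed array size), the data should be initialized with the same OpenMP worksharing pattern that later accesses it, so that each page is placed in the ccNUMA domain of the thread that will use it:

    #include <omp.h>
    #include <stdlib.h>

    #define N 100000000L     /* assumed array length */

    double *a, *b;

    void setup_and_compute(void)
    {
        a = malloc(N * sizeof(double));
        b = malloc(N * sizeof(double));

        /* parallel first-touch initialization: each page is mapped into the
           ccNUMA domain of the thread that touches it first */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++) {
            a[i] = 0.0;
            b[i] = 1.0;
        }

        /* subsequent compute loops must use the same static distribution so
           that every thread works mostly on locally placed memory */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 2.0 * b[i];
    }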
There are, however, some major problems connected with masteronly mode:
- All other threads are idle during communication phases of the master thread, which could lead to a strong impact of communication overhead on scalability. Alternatives are discussed in Sect. 3.1.3 and Sect. 3.3 below.
- The full inter-node MPI bandwidth might not be saturated by using a single communicating thread.
- The MPI library must be thread-aware on a simple level by providing MPI_THREAD_FUNNELED. Actually, a lower thread-safety level would suffice for masteronly, but the MPI-2.1 standard does not provide an appropriate level less than MPI_THREAD_FUNNELED.
2.3. Hybrid with overlap
One way to avoid idling compute threads during MPI
communication is to split off one or more threads of the
OpenMP team to handle communication in parallel with
useful calculation:
if (my_thread_ID < ...) {
    /* communication threads: */
    /* transfer halo */
    MPI_Send( halo data )
    MPI_Recv( halo data )
} else {
    /* compute threads: */
    /* execute code that does not need halo data */
}
/* all threads: */
/* execute code that needs halo data */
A possible reason to use more than one communication
thread could arise if a single thread cannot saturate the full
communication bandwidth of a compute node (see Sect. 3.3
for details). There is, however, a trade-off because the more
threads are sacrificed for MPI, the fewer are available for
overlapping computation.
2.4. Pure OpenMP on clusters
A lot of research has been invested into the implemen-
tation of distributed virtual shared memory software [3]
which allows near-shared-memory programming on dis-
tributed memory parallel machines, notably clusters. Since
2006, Intel has offered the “Cluster OpenMP” compiler add-
on, enabling the use of OpenMP (with minor restrictions)
across the nodes of a cluster [2]. Therefore, OpenMP has
literally become a possible programming model for those
machines. It is, to some extent, a hybrid model, being iden-
tical to plain OpenMP inside a shared-memory node but em-
ploying a sophisticated protocol that keeps “shared” memory pages coherent between nodes at explicit or automatic
OpenMP flush points.
With Cluster OpenMP, frequent page synchronization or
erratic access patterns to shared data must be avoided by all
means. If this is not possible, communication can poten-
tially become much more expensive than with plain MPI.
3. Mismatch problems
It should be evident by now that the main issue with
getting good performance on hybrid architectures is that
none of the programming models at one’s disposal fits op-
timally to the hierarchical hardware. In the following sec-
tions we will elaborate on these mismatch problems. How-
ever, as sketched above, one can also expect hybrid models
to have positive effects on parallel performance (as shown
in Sect. 4). Most hybrid applications suffer from the for-
mer and benefit from the latter to varying degrees, so it is nearly impossible to make a quantitative judgement with-
out thorough benchmarking.
3.1. The mapping problem: Machine topology
As a prototype mismatch problem we consider the map-
ping of a two-dimensional Cartesian domain decomposition
with 80 sub-domains, organized in a 5×16 grid, on a ten-
node dual-socket quad-core cluster like the one in Fig. 1 (a).
We will analyze the communication behavior of this appli-
cation with respect to the required inter-socket and inter-
node halo exchanges, presupposing that inter-core commu-
nication is fastest, hence favorable. See Sect. 3.2 for a dis-
cussion on the validity of this assumption.
3.1.1. Mapping problem with pure MPI
We assume here that the MPI start mechanism is able to
establish some affinity between processes and cores, i.e.
it is not left to chance which rank runs on which core of
a node. However, defaults vary across implementations.
Fig. 3 shows that there is an immense difference between
sequential and round-robin ranking, which is reflected in the
number of required inter-node and inter-socket connections.
In Fig. 3 (a), ranks are mapped to cores, sockets and nodes (A...J) in sequential order, i.e., ranks 0...7 go to the first node, etc. This leads to at most 17 inter-node and one inter-socket halo exchange per node, neglecting boundary effects. If the default is to place MPI ranks in round-robin order across nodes (Fig. 3 (b)), i.e., ranks 0...9 are mapped to the first core of each node, all the halo communication uses inter-node connections, which leads to 32 inter-node and no inter-socket exchanges. Whether the difference matters or not depends, of course, on the ratio of computational effort versus amount of halo data, both per process, and the characteristics of the network.

Figure 3. Influence of ranking order on the number of inter-socket (double lines, blue) and inter-node (single lines, red) halo communications when using pure MPI. (a) Sequential mapping, (b) round-robin mapping.
What is the best ranking order for the domain decom-
position at hand? It is important to realize that the hier-
archical node structure enforces multilevel domain decom-
position which can be optimized for minimizing inter-node
communication: It seems natural to try to reduce the socket
“surface area” exposed to the node boundary, as shown in
Fig. 4 (a), which yields ten inter-node and four inter-socket
halo exchanges per node at maximum. But still there is op-
timization potential, because this process can be iterated to
the socket level (Fig. 4 (b)), cutting the number of inter-
socket connections in half. Comparing Figs. 3 (a), (b) and
Figs. 4 (a), (b), this is the best possible rank order for pure
MPI.
The above considerations should make it clear that it can be vital to know about the default rank placement used in a particular parallel environment and to modify it if required. Unfortunately, many commodity clusters are still run today without a clear concept of rank-core affinity and without a user-friendly way to influence it.
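One simple way to at least verify the placement in practice is to have every rank report where it runs. The following sketch is our own illustration and assumes a Linux/glibc environment for sched_getcpu(); the reported core is only meaningful if process pinning is actually in effect:

    #define _GNU_SOURCE
    #include <mpi.h>
    #include <sched.h>    /* sched_getcpu(), glibc extension */
    #include <stdio.h>

    /* call after MPI_Init: each rank reports its host and current core */
    void report_placement(void)
    {
        int rank, len, cpu;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);
        cpu = sched_getcpu();     /* core the calling thread runs on right now */

        printf("rank %4d on host %s, core %d\n", rank, host, cpu);
    }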
Figure 4. Two possible mappings for multi-level domain decomposition with pure MPI.
3.1.2. Mapping problem with fully hybrid MPI+OpenMP
Hybrid MPI+OpenMP enforces the domain decomposition
to be a two-level algorithm. On MPI level, a coarse-grained
domain decomposition is performed. Parallelization on
OpenMP level implies a second level domain decomposi-
tion, which may be implicit (loop level parallelization) or
explicit as shown in Fig. 5.
In principle, hybrid MPI+OpenMP presents similar
challenges in terms of topology awareness, i.e. optimal
rank/thread placement, as pure MPI. There is, however, the
added complexity that standard OpenMP parallelization is
based on loop-level worksharing, which is, albeit easy to ap-
ply, not always the optimal choice. On ccNUMA systems,
for instance, it might be better to drop the worksharing con-
cept in favor of thread-level domain decomposition in order
to reduce inter-domain NUMA traffic (see below). On top
of this, proper first-touch page placement is required to get
scalable bandwidth inside a node, and thread-core affinity
must be employed. Still one should note that those issues
are not specific to hybrid MPI+OpenMP programming but
apply to pure OpenMP as well.
In contrast to pure MPI, hybrid parallelization of the above
domain decomposition enforces a 2×5 MPI domain grid,
leading to oblong OpenMP subdomains (if explicit domain
decomposition is used on this level, see Fig. 5). Optimal
rank ordering leads to only three inter-node halo exchanges
per node, but each with about four times the data volume.
Thus we arrive at a slightly higher communication effort compared to pure MPI (with optimal rank order), a consequence of the non-square domains.

Figure 5. Hybrid OpenMP+MPI two-level domain decomposition with a 2×5 MPI domain grid and eight OpenMP threads per node. Although there are fewer inter-node connections than with optimal MPI rank order (see Fig. 4 (b)), the aggregate halo size is slightly larger.
Beyond the requirements of hybrid MPI+OpenMP,
multi-level domain decomposition may be beneficial when
taking cache optimization into account: On the outermost
level the domain is divided into subdomains, one for each
MPI process. On the next level, these are again split into
portions for each thread, and then even further to fit into
successive cache levels (L3, L2, L1). This strategy ensures
maximum access locality, a minimum of cache misses,
NUMA traffic, and inter-node communication, but it must
be performed by the application, especially in the case
of unstructured grids. For portable software development,
standardized methods are desirable for the application to
detect the system topology and characteristic sizes (see also
Sect. 5).
3.1.3. Mapping problem with mixed model
The mixed model (see Fig. 1 (d)) represents a sort of com-
promise between pure MPI and fully hybrid models, featur-
ing potential advantages in terms of network saturation (see
Sect. 3.3 below). It suffers from the same basic drawbacks
as the fully hybrid model, although the impact of a loss of
thread-core affinity may be larger because of the possibly
significant differences in OpenMP performance and, more
importantly, MPI communication characteristics for intra-
node message transfer. Fig. 6 shows a possible scenario
where we contrast two alternatives for thread placement. In
Fig. 6 (a), intra-node MPI uses the inter-socket connection
only and shared memory access with OpenMP is kept inside
of each multi-core socket, whereas in Fig. 6 (b) all intra-
node MPI (with masteronly style) is handled inside sock-
ets. However, due to the spreading of the OpenMP threads
belonging to a particular process across two sockets there
is the danger of increased OpenMP startup overhead (see
Sect. 3.5) and NUMA traffic.
Figure 6. Two different mappings of threads to cores for the mixed model with two MPI processes per eight-core, two-socket node.
As with pure MPI, the message-passing subsystem
should be topology-aware in the sense that optimiza-
tion opportunities for intra-node transfers are actually ex-
ploited. The following section provides some more infor-
mation about performance characteristics of intra-node ver-
sus inter-node MPI.
3.2. Issues with intra-node MPI communication
The question whether the benefits or disadvantages of
different hybrid programming models in terms of communi-
cation behavior really impact application performance can-
not be answered in general since there are far too many pa-
rameters involved. Even so, knowing the characteristics of
the MPI system at hand, one may at least arrive at an ed-
ucated guess. As an example we choose the well-known
PingPong benchmark from the Intel MPI benchmark (IMB)
suite, performed on RRZE’s “Woody” cluster [4] (Fig. 7).
As expected, there are vast differences in achievable band-
widths for in-cache message sizes; surprisingly, starting at
a message size of 43 kB, inter-node communication outper-
forms inter-socket transfer, saturating at a bandwidth ad-
vantage of roughly a factor of two for large messages. Even
intra-socket communication is slower than IB in this case.
This behavior, which may be attributed to additional copy
operations through shared memory buffers and can be ob-
served in similar ways on many clusters, shows that simplis-
tic assumptions about superior performance of intra-node
connections may be false. Rank ordering should be chosen
accordingly. Please note also that more elaborate low-level
benchmarks than PingPong may be advisable to arrive at
a more complete picture about communication characteris-
tics.
Figure 7. IMB PingPong bandwidth versus message size for inter-node, inter-socket, and intra-socket communication on a two-socket dual-core Xeon 5160 cluster with DDR-IB interconnect, using Intel MPI. The original plot marks the inter-node/inter-socket crossover at 43 kB and the DDR-IB/PCIe 8x bandwidth limit.
At small message sizes, MPI communication is latency-
dominated. For the setup described above we measure the
following latency numbers:
Mode            Latency [µs]
IB inter-node   3.22
inter-socket    0.62
intra-socket    0.24
In strong scaling scenarios it is often quite likely that one
“rides the PingPong curve” towards a latency-driven regime
as processor numbers increase, possibly rendering the care-
fully tuned process/thread placement useless.
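A minimal ping-pong kernel along the lines of the IMB measurement can be used to reproduce such curves for different core pairs once the two processes are pinned appropriately. The following sketch is our own, the repetition count is arbitrary, and it assumes it is called on ranks 0 and 1 only:

    #include <mpi.h>
    #include <stdlib.h>

    /* Ping-pong between ranks 0 and 1; returns bandwidth in MByte/s.
       Pin the two processes to the desired cores before calling. */
    double pingpong(size_t bytes, int reps, int rank)
    {
        char *buf = malloc(bytes);
        double t0, t1;

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        free(buf);

        /* two messages of 'bytes' per round trip */
        return 2.0 * (double)bytes * reps / (t1 - t0) / 1.0e6;
    }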
3.3. Network saturation and sleeping threads with
the masteronly model
The masteronly variant, in which no MPI calls are issued
inside OpenMP-parallel regions, can be used with fully hy-
brid as well as the mixed model. Although being the easiest
way of implementing a hybrid MPI+OpenMP code, it has
two important shortcomings:
1. In the fully hybrid case, a single communicating thread
may not be able to saturate the node’s network connec-
tion. Using a mixed model (see Sect. 3.1.3) with more
than one MPI process per node might solve this prob-
lem, but one has to be aware of possible rank/thread
ordering problems as described in Sect. 3.1. On flat-
memory SMP nodes with no intra-node hierarchical
structure, this may be an attractive and easy to use op-
tion [5]. However, the number of systems with such
characteristics is waning. Current hierarchical archi-
tectures require some more effort in terms of thread-
/core affinity (see Sect. 4.1 for benchmark results in
mixed mode on a contemporary cluster).
2. While the master thread executes MPI code, all other
threads sleep. This effectively makes communica-
tion a purely serial component in terms of Amdahl’s
Law. Overlapping communication with computation
may provide a solution here (see Sect. 3.4 below).
One should note that on many commodity clusters to-
day (including those featuring high-speed interconnects like
InfiniBand), saturation of a network port can usually be
achieved by a single thread. However, this may change if,
e.g., multiple network controllers or ports are available per
node. As for the second drawback above, one may argue
that MPI provides non-blocking point-to-point operations
which should generally be able to achieve the desired over-
lap. Even so, many MPI implementations allow communi-
cation progress, i.e., actual data transfer, only inside MPI
calls so that real background communication is ruled out.
The non-availability of non-blocking collectives in the cur-
rent MPI standard adds to the problem.
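To make the intended (but not guaranteed) overlap concrete, the usual non-blocking pattern is sketched below. This is our own illustration; the helper routines compute_interior() and compute_boundary() are assumptions, and whether any transfer happens during compute_interior() depends on the MPI implementation's progress engine:

    #include <mpi.h>

    /* hypothetical application routines */
    extern void compute_interior(void);   /* work that does not need the halo */
    extern void compute_boundary(void);   /* work that needs the received halo */

    void exchange_and_compute(double *send_halo, double *recv_halo,
                              int halo_count, int neighbor)
    {
        MPI_Request req[2];

        MPI_Irecv(recv_halo, halo_count, MPI_DOUBLE, neighbor, 0,
                  MPI_COMM_WORLD, &req[0]);
        MPI_Isend(send_halo, halo_count, MPI_DOUBLE, neighbor, 0,
                  MPI_COMM_WORLD, &req[1]);

        compute_interior();    /* intended overlap; actual transfer may still
                                  progress only inside MPI calls */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        compute_boundary();
    }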
3.4. Overlapping communication and computation
It seems feasible to “split off” one or more OpenMP
threads in order to execute MPI calls, letting the rest
do the actual computations. Just as with the fully hybrid model, this requires the MPI library to support at least the MPI_THREAD_FUNNELED level. However, work distribution
across the non-communicating threads is not straightfor-
ward with this variant, because standard OpenMP work-
sharing works on the whole team of threads only. Nested
parallelism is not an alternative due to its performance
drawbacks and limited availability. Therefore, manual
worksharing must be applied:
if (my_thread_ID < 1) {
    /* communication thread: transfer halo */
    MPI_Send( halo data )
    MPI_Recv( halo data )
} else {
    /* compute threads: manual worksharing over [low, high) */
    my_range = (high - low - 1) / (num_threads - 1) + 1;
    my_low   = low + (my_thread_ID - 1) * my_range;
    my_high  = my_low + my_range;
    my_high  = min(high, my_high);
    for (i = my_low; i < my_high; i++) {
        /* computation */
    }
}
Apart from the additional programming effort for divid-
ing the computation into halo-dependent and non-halo-
dependent parts (see Sect. 2.3), directives for loop work-
sharing cannot be used any more, making “dynamic” or
“guided” schemes that are essential to use in poorly load-
balanced situations very hard to implement. Thread sub-
teams [6] have been proposed as a possible addition to the
future OpenMP 3.x/4.x standard and would ameliorate the
problem significantly. OpenMP tasks, which are part of the
recently passed OpenMP 3.0 standard, also form an elegant
alternative but presume that dynamic scheduling (which is
inherent to the task concept) is acceptable for the applica-
tion.
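As an illustration of the task-based alternative (our own sketch, not from the paper), the communication can be wrapped in one task while halo-independent work is split into further tasks. Since the communication task may be executed by any thread of the team, at least MPI_THREAD_SERIALIZED would be required here, and the block-wise helper routines are assumptions:

    #include <mpi.h>
    #include <omp.h>

    /* hypothetical application routines */
    extern void exchange_halo(void);          /* MPI halo transfer */
    extern void compute_block(int b);         /* halo-independent work on block b */
    extern void compute_halo_blocks(void);    /* halo-dependent work */

    void timestep(int nblocks)
    {
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task                      /* communication task */
            exchange_halo();

            for (int b = 0; b < nblocks; b++) {
                #pragma omp task firstprivate(b)  /* compute tasks, scheduled dynamically */
                compute_block(b);
            }

            #pragma omp taskwait                  /* wait for halo and interior updates */
            compute_halo_blocks();
        }
    }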
See Ref. [5] for performance models and measure-
ments comparing parallelization with masteronly style ver-
sus overlapping communication and computation on SMP
clusters with flat intra-node structure.
3.5. OpenMP performance pitfalls
As with standard (non-hybrid) OpenMP, hybrid
MPI+OpenMP is prone to some common performance pit-
falls. Just by switching on OpenMP, some compilers re-
frain from some loop optimizations which may cause a sig-
nificant performance hit. A prominent example is SIMD
vectorization of parallel loops on x86 architectures, which
gives best performance when using 16-byte aligned load/store instructions. If the compiler cannot apply dynamic loop
peeling [7], a loop parallelized with OpenMP can only be
vectorized using unaligned loads and stores (verified with
several releases of the Intel compilers, up to version 10.1).
The situation seems to improve gradually, though.
Thread creation/wakeup overhead and frequent synchro-
nization are further typical sources of performance prob-
lems with OpenMP, because they add to serial execution
and thus contribute to Amdahl’s Law on the node level. On
ccNUMA architectures correct first-touch page placement
must be employed in order to achieve scalable performance
across NUMA locality domains. In this respect one should
also keep in mind that communicating threads, inside or
outside of parallel regions, may have to partly access non-
local MPI buffers (i.e. from other NUMA domains).
Due to, e.g., limited memory bandwidth, it may be preferable in terms of performance or power consumption to
use fewer threads than available cores inside of each MPI
process [8]. This leads again to several affinity options (sim-
ilar to Fig. 6 (a) and (b)) and may impact MPI inter-node
communication.
4. Expected hybrid parallelization benefits
We have made it clear in the previous section that the par-
allel programming models described so far do not really fit
onto standard hybrid hardware. Consequently, one should
always try to optimize the parallel environment, especially
in terms of thread/core mapping and the correct choice of
hybrid execution mode, in order to minimize the mismatch
problems.
On the other hand, as pointed out in the introduction,
several real benefits can be expected from hybrid program-
ming models as opposed to pure MPI. We will elaborate on
the most important aspects in the following sections.
4.1. Additional levels of parallelism
In some applications, there is a coarse outer level of par-
allelism which can be easily exploited by message passing,
but is strictly limited to a certain number of workers. In such
a case, a viable way to improve scalability beyond this limit
is to use OpenMP in order to speed up each MPI process,
e.g. by identifying parallelizable loops at an inner level. A
prominent example is the BT-MZ benchmark from the NPB
(Multi-Zone NAS Parallel Benchmarks) suite. See Sect. 4.1
for details.
Benchmark results on “Ranger” at TACC
Here we present some performance results that were ob-
tained on a “Sun Constellation Cluster” named “Ranger”
[9], a high-performance compute resource at the Texas Ad-
vanced Computing Center (TACC) in Austin. It comprises
a DDR InfiniBand network which connects 3936 ccNUMA
compute blades (nodes), each with four 2.3GHz AMD
Opteron “Barcelona” quad-core chips and 32GB of mem-
ory. This allows for 16-way shared memory programming
within each node. At four flops per cycle, the overall peak
performance is 579 TFlop/s. For compiling the benchmarks
we employed PGI’s F90 compiler in version 7.1, directing
it to optimize for Barcelona processors. MVAPICH was
used for MPI communication, and numactl for implement-
ing thread-core and thread-memory affinity.
The NAS Parallel Benchmark (NPB) Multi-Zone (MZ)
[10] codes BT-MZ and SP-MZ (class E) were chosen to ex-
emplify the benefits and limitations of hybrid mode. The
purpose of the NPB-MZ is to capture the multiple levels of
parallelism inherent in many full scale applications. Each
benchmark exposes a different challenge to scalability: BT-
MZ is a block tridiagonal simulated CFD code. The size of
the zones varies widely, with a ratio of about 20 between the
largest and the smallest zone. This poses a load balancing
problem when only coarse-grained parallelism is exploited
on a large number of cores. SP-MZ is a scalar pentadiago-
nal simulated CFD code with equally sized zones, so from
a workload point of view the best performance should be
achieved by pure MPI. A detailed discussion of the per-
formance characteristics of these codes is presented in [11].
The class E problem size for both benchmarks comprises
an aggregate grid size of 4224×3456×92 points and a total
number of 4096 zones. Each MPI process is assigned a set
of zones to work on, according to a bin-packing algorithm
to achieve a balanced workload. Static worksharing is used on the OpenMP level. Due to the implementation of the benchmarks, the maximum number of MPI processes is limited to the number of zones for SP-MZ as well as BT-MZ.

Figure 8. NPB BT-MZ and SP-MZ (class E) performance on Ranger for mixed hybrid and pure MPI modes (see text for details on the mixed setup). There is no pure MPI data for 8192 cores as the number of MPI processes is limited to 4096 (zones) in that case.
Fig. 8 shows results at 1024 to 8192 cores. For both
BT-MZ and SP-MZ the mixed hybrid mode enables scala-
bility beyond the number of zones. In the case of BT-MZ,
reducing the number of MPI processes and using OpenMP
threads allows for better load balancing while maintaining a
high level of parallelism. SP-MZ scales well with pure MPI,
but reducing the number of MPI processes cuts down on the
amount of data to be communicated and the total number of
MPI calls. At 4096 cores the hybrid version is 9.6 % faster.
Thus, for both benchmarks, hybrid MPI+OpenMP outper-
forms pure MPI. SP-MZ shows best results with mixed hy-
brid mode using half of the maximum possible MPI pro-
cesses at 2 threads each. The best mixed hybrid mode for
BT-MZ depends on the coarse grain load balancing that can
be achieved and varies with the number of available cores.
We must emphasize that the use of affinity mechanisms
(numactl, in this particular case) is absolutely essential for
getting good performance and reproducibility on this cc-
NUMA architecture.
4.2. Improved load balancing
If the problem at hand has load balancing issues, some
kind of dynamic balancing should be implemented. In MPI,
this is a problem for which no generic recipes exist. It is
highly dependent on the numerics and potentially requires
significant communication overhead. It is therefore hard to
implement in production codes.
One big advantage of OpenMP over MPI lies in the pos-
sible use of “dynamic” or “guided” loop scheduling. No ad-
ditional programming effort or data movement is required.
However, one should be aware that non-static scheduling is
suboptimal for memory-bound code on ccNUMA systems
because of unpredictable (and non-reproducible) access pat-
terns; if a guided or dynamic schedule is unavoidable, one
should at least employ round-robin page placement for ar-
ray data in order to get some level of parallel data access.
For the hybrid case, simple static load balancing on the
outer (MPI) level and dynamic/guided loop scheduling for
OpenMP can be used as a compromise. Note that if dy-
namic OpenMP load balancing is prohibitive because of
NUMA locality constraints, a mixed model (Fig. 1 (d)) may
be advisable where one MPI process runs in each NUMA
locality domain and dynamic scheduling is applied to the
threads therein.
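A minimal sketch of this compromise, assuming a per-process array of cells with strongly varying per-cell cost (the kernel, cell count, and chunk size are our own illustrative choices), could look as follows:

    #include <omp.h>

    /* hypothetical per-cell kernel with strongly varying cost */
    extern void update_cell(int cell);

    void process_local_domain(int ncells_local)
    {
        /* MPI level: a static decomposition has already assigned
           ncells_local cells to this process */

        /* OpenMP level: dynamic scheduling absorbs the remaining load
           imbalance inside the process (chunk size 64 is arbitrary) */
        #pragma omp parallel for schedule(dynamic, 64)
        for (int cell = 0; cell < ncells_local; cell++)
            update_cell(cell);
    }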
4.3. Reduced memory consumption
Although one might believe that there should be no data
duplication or, more generally, data overhead between MPI
processes, this is not true in reality. E.g., in domain de-
composition scenarios, the more MPI domains a problem is
divided into, the larger the aggregated surface and thus the
larger the amount of memory required for halos. Other data
like buffers internal to the MPI subsystem, but also lookup
tables, global constants, and everything that is usually du-
plicated for efficiency reasons, adds to memory consump-
tion. This pertains to redundant computation as well.
On the other hand, if there are multiple (t) threads per MPI process, duplicated data is reduced by a factor of t (this
is also true for halo layers if not using domain decomposi-
tion on the OpenMP level). Although this may seem like a
small advantage today, one must keep in mind that the num-
ber of cores per CPU chip is constantly increasing. In the
future, tens and even hundreds of cores per chip may lead
to a dramatic reduction of available memory per core.
It should be clear from the considerations in the previous
sections that it is not straightforward to pick the optimal
number of OpenMP threads per MPI process for a given
problem and system. Even assuming that mismatch/affinity
problems can be kept under control, using too many threads
can have negative effects on network saturation, whereas
too many MPI processes might lead to intolerable memory
consumption.
4.4. Further opportunities
Using multiple threads per process may have some ben-
efits on the algorithmic side due to larger physical domains
inside of each MPI process. This can happen whenever a
larger domain is advisable in order to get improved numer-
ical accuracy or convergence properties. Examples are:
- A multigrid algorithm is employed only per MPI domain, i.e. inside each process, but not between domains.
- Separate preconditioners are used inside and between MPI processes.
- MPI domain decomposition is based on physical zones.
An often used argument in favor of hybrid programming
is the potential reduction in MPI communication in compar-
ison to pure MPI. As shown in Sect. 3.1 and 5, this point
deserves some scrutiny because one must compare optimal
domain decompositions for both alternatives. However, the
number of messages sent and received per node does de-
crease which helps to reduce the adverse effects of MPI la-
tency. The overall aggregate message size is diminished as
well if intra-process “messages”, i.e. NUMA traffic, are not
counted. In the fully hybrid case, no intra-node MPI is re-
quired at all, which may allow the use of a simpler (and
hopefully more efficient) variant of the message-passing li-
brary, e.g., by not loading the shmem device driver. And
finally, a hybrid model enables incorporation of functional
parallelism in a very straightforward way: Just like using
one thread per process for concurrent communication/com-
putation as described above, one can equally well split off
another thread for, e.g., I/O or other chores that would be
hard to incorporate into the parallel workflow with pure
MPI. This could even reduce the non-parallelizable part of
the computation and thus enhance overall scalability.
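A sketch of such functional parallelism inside one MPI process is given below. It is our own illustration; the helper routines are assumptions, at least three OpenMP threads per process are presumed, and MPI calls remain funneled through the master thread:

    #include <mpi.h>
    #include <omp.h>

    /* hypothetical application routines */
    extern void exchange_halo(void);                          /* MPI, master thread only */
    extern void write_output(int step);                       /* file I/O for previous step */
    extern void compute_interior(int worker, int nworkers);   /* halo-independent work */
    extern void compute_boundary(int worker, int nworkers);   /* halo-dependent work */

    void hybrid_step(int step)
    {
        /* assumes at least three OpenMP threads per MPI process */
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            int n   = omp_get_num_threads();

            if (tid == 0) {
                exchange_halo();                  /* MPI funneled through master thread */
            } else if (tid == 1) {
                write_output(step - 1);           /* I/O overlapped with computation */
            } else {
                compute_interior(tid - 2, n - 2); /* remaining threads compute */
            }
            #pragma omp barrier
            compute_boundary(tid, n);             /* all threads: work needing the halo */
        }
    }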
5. Aspects of future standardization efforts
In Sect. 3 we have argued that mismatch problems need
special care, not only with hybrid programming, but also
under pure MPI. However, correct rank ordering and the
decisions between pure and mixed models cannot be op-
timized without knowledge about machine characteristics.
This includes, among other things, inter-node, inter-socket
and intra-socket communication bandwidths and latencies,
and information on the hardware topology in and between
nodes (cores per chip, chips per socket, shared caches,
NUMA domains and networks, and message-passing net-
work topology). Today, the programmer is often forced
to use non-portable interfaces in order to acquire this data
(examples under Linux are libnuma/numactl and the Intel
“cpuinfo” tool; other tools exist for other architectures and
operating systems) or perform their own low-level bench-
marks to figure out topology features.
What is needed for the future is a standardized interface
with an abstraction layer that shifts the non-portable pro-
gramming effort to a library provider. In our opinion, the
right place to provide such an interface is the MPI library,
which has to be adapted to the specific hardware anyway.
At least the most basic topology and (quantitative) communication performance characteristics could be provided inside MPI at little cost. Thus we propose the inclusion of a topol-
ogy/performance interface into the future MPI 3.0 standard,
see also [12].
As mentioned in Sect. 3.4, there are already some efforts to include a subteam feature into upcoming OpenMP
standards. We believe this feature to be essential for hybrid
programming on current and future architectures, because
it will greatly facilitate functional parallelism and enable
standard dynamic load balancing inside multi-threaded MPI
processes.
6. Conclusions
In this paper we have pinpointed the issues and poten-
tials in developing high performance parallel codes on cur-
rent and future hierarchical systems. Mismatch problems,
i.e. the unsuitability of current hybrid hardware for running
highly parallel workloads, are often hard to solve, let alone
in a portable way. However, the potential gains in scalabil-
ity and absolute performance may be worth the significant
coding effort. New features in future MPI and OpenMP
standards may constitute a substantial improvement in that
respect.
Acknowledgements
We greatly appreciate the excellent support and the com-
puting time provided by the HPC group at the Texas Ad-
vanced Computing Center. Fruitful discussions with Rainer
Keller and Gerhard Wellein are gratefully acknowledged.
References
[1] R. Loft, S. Thomas, J. Dennis: Terascale Spectral Element
Dynamical Core for Atmospheric General Circulation
Models. Proceedings of SC2001, Denver, USA.
[2] Cluster OpenMP for Intel compilers. http://
software.intel.com/en-us/articles/
cluster-openmp-for-intel-compilers
[3] C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R.
Rajamony, W. Yu, W. Zwaenepoel: TreadMarks: Shared
Memory Computing on Networks of Workstations. IEEE
Computer 29(2), 18–28 (1996).
[4] http://www.hpc.rrze.uni-erlangen.de/systeme/
woodcrest-cluster.shtml
[5] R. Rabenseifner, G. Wellein: Communication and Op-
timization Aspects of Parallel Programming Models on
Hybrid Architectures. International Journal of High Per-
formance Computing Applications 17(1), 49–62 (2003).
[6] B. M. Chapman, L. Huang, H. Jin, G. Jost, B. R. de
Supinski: Toward Enhancing OpenMP’s Work-Sharing
Directives. In W. E. Nagel et al. (Eds.): Proceedings of
Euro-Par 2006, LNCS 4128, 645–654. Springer (2006).
[7] M. Stürmer, G. Wellein, G. Hager, H. Köstler, U. Rüde:
Challenges and potentials of emerging multicore archi-
tectures. In: S. Wagner et al. (Eds.), High Performance
Computing in Science and Engineering, Garching/Munich
2007, 551–566, Springer (2009).
[8] M. Curtis-Maury, A. Shah, F. Blagojevic, D. S.
Nikolopoulos, B. R. de Supinski, M. Schulz: Predic-
tion Models for Multi-dimensional Power-Performance
Optimization on Many Cores. In D. Tarditi, K. Olukotun
(Eds.), Proceedings on the Seventeenth International Con-
ference on Parallel Architectures and Compilation Tech-
niques (PACT08), Toronto, Canada, Oct. 25–29, 2008.
[9] http://www.tacc.utexas.edu/services/
userguides/ranger/
[10] R. F. Van Der Wijngaart, H. Jin: NAS Parallel Bench-
marks, Multi-Zone Versions. NAS Technical Report NAS-
03-010, NASA Ames Research Center, Moffett Field, CA,
2003.
[11] H. Jin, R. F. Van Der Wijngaart: Performance Character-
istics of the multi-zone NAS Parallel Benchmarks. Journal
of Parallel and Distributed Computing, Vol. 66, Special Is-
sue: 18th International Parallel and Distributed Processing
Symposium, pp. 674–685, May 2006.
[12] MPI Forum: MPI-2.0 Journal of Development (JOD),
Sect. 5.3 “Cluster Attributes”, http://www.mpi-forum.
org, July 18, 1997.
[13] R. Rabenseifner, G. Hager, G. Jost, R. Keller: Hybrid MPI
and OpenMP Parallel Programming. Half-day Tutorial
No. S-10 at SC07, Reno, NV, Nov. 10–16, 2007.
[14] R. Rabenseifner: Some Aspects of Message-Passing on
Future Hybrid Systems. Invited talk at 15th European
PVM/MPI Users’ Group Meeting, EuroPVM/MPI 2008,
Sep. 7–10, 2008, Dublin, Ireland. LNCS 5205, pp 8–10,
Springer (2008).
... Therefore, one can easily convert a serial programing code in OpenMP into a parallel code by appending some instructions. Moreover, OpenMP has standard and simple conventions, provides a crossplatform approach, and is popular among professional programmers [23,24] . ...
... The rule sets it can produce are of three types: Firewall rules (FW), Access Control List (ACL), and IP Chains (IPC). The second module creates a collection of random packets using the statistical specifications of the first module's classifiers [23] . Below we briefly describe the abovementioned rule set types. ...
Article
Full-text available
The network switches in the data plane of SDN are empowered by an elementary process, in which enormous number of packets which resemble big volumes of data, are classified into specific flows by matching them against a set of dynamic rules. This basic process accelerates the processing of data so that instead of processing singular packets repeatedly, corresponding actions are performed on corresponding flows of packets. In this paper, first, we address limitations on a typical packet classification algorithm like tuple space search (TSS). Then, we present a set of different scenarios to parallelize it on different parallel processing platforms including graphics processing units (GPUs), clusters of central processing units (CPUs), and hybrid clusters. Experimental results show that the hybrid cluster provides the best platform for parallelizing packet classification algorithms, which promises the average throughput rate of 4.2 million packets per second (Mpps). That is, the hybrid cluster produced by the integration of CUDA, MPI, and OpenMP programming model could classify 0.24 million packets per second more than the GPU cluster scheme. Such a packet classifier satisfies the required processing speed in the programmable network systems that would be used to communicate big medical data.
... Generally speaking, multi-core optimization with distributed memory parallelization alone suffers from high memory usage and overhead. On the other hand, shared memory parallelization has limited capacity in comparison to distributed memory parallelization [90] and can suffer from load balancing issues [91]. Therefore, the main goal of the multi-core optimization was to implement hybrid parallelization, i.e., distributed memory parallelization (MPI) and shared memory parallelization together. ...
Article
Full-text available
Molecular Dynamics simulations study material structure and dynamics at the atomic level. X-ray and neutron scattering experiments probe exactly the same time- and length scales as the simulations. In order to benchmark simulations against measured scattering data, a program is required that computes scattering patterns from simulations with good single-core performance and support for parallelization. In this work, the existing program Sassena is used as a potent solution to this requirement for a range of scattering methods, covering pico- to nanosecond dynamics, as well as the structure from some Ångströms to hundreds of nanometers. In the case of nanometer-level structures, the finite size of the simulation box, which is referred to as the finite size effect, has to be factored into the computations for which a method is described and implemented into Sassena. Additionally, the single-core and parallelization performance of Sassena is investigated, and several improvements are introduced.
... • parallelization, leading to a reduction of the computational time proportional to the number of employed processors with excellent scaling properties [39,52,55,68,73,81,82]; • structure-preserving schemes, to preserve physical properties at the discrete level without excessive mesh refinements, e.g., positivity preserving schemes [27,53,60,66,67], well-balanced schemes [12, 22, 25-28, 47, 59, 64, 85], TVD or maximum principle preserving schemes [10,48,49,86], and entropy conservative/dissipating schemes [4-7, 23, 24, 40, 44, 45, 56-58, 65, 69, 83, 84]; • high-order methods, which guarantee higher accuracy for coarser meshes and shorter computational times, on smooth problems, as they are able to catch complicated physical structures that low-order methods struggle to obtain, e.g., finite-element-based methods [1,3,8,54,57,70,71], finite volume methods [10,12,22,64,75,85], and discontinuous Galerkin (DG) methods [17,19,23,35,[44][45][46]51]. ...
Article
Full-text available
We propose a new paradigm for designing efficient p -adaptive arbitrary high-order methods. We consider arbitrary high-order iterative schemes that gain one order of accuracy at each iteration and we modify them to match the accuracy achieved in a specific iteration with the discretization accuracy of the same iteration. Apart from the computational advantage, the newly modified methods allow to naturally perform the p -adaptivity, stopping the iterations when appropriate conditions are met. Moreover, the modification is very easy to be included in an existing implementation of an arbitrary high-order iterative scheme and it does not ruin the possibility of parallelization, if this was achievable by the original method. An application to the Arbitrary DERivative (ADER) method for hyperbolic Partial Differential Equations (PDEs) is presented here. We explain how such a framework can be interpreted as an arbitrary high-order iterative scheme, by recasting it as a Deferred Correction (DeC) method, and how to easily modify it to obtain a more efficient formulation, in which a local a posteriori limiter can be naturally integrated leading to the p -adaptivity and structure-preserving properties. Finally, the novel approach is extensively tested against classical benchmarks for compressible gas dynamics to show the robustness and the computational efficiency.
... In distributed systems, nested parallelism is typically achieved by combining different programming models, one supporting the distributed system part and another dealing with the execution within each shared memory system. This is the case of the hybrid MPI + OpenMP model [18], StarSs [17] or the COMPSs + OmpSs combination [6]. Since the runtime systems supporting these models do not share information, developers must master several models and manage the coordination of different levels of parallelism. ...
Chapter
The scale and heterogeneity of exascale systems increment the complexity of programming applications exploiting them. Task-based approaches with support for nested tasks are a good-fitting model for them because of the flexibility lying in the task concept. Resembling the hierarchical organization of the hardware, this paper proposes establishing a hierarchy in the application workflow for mapping coarse-grain tasks to the broader hardware components and finer-grain tasks to the lowest levels of the resource hierarchy to benefit from lower-latency and higher-bandwidth communications and exploiting locality. Building on a proposed mechanism to encapsulate within the task the management of its finer-grain parallelism, the paper presents a hierarchical peer-to-peer engine orchestrating the execution of workflow hierarchies with fully-decentralized management. The tests conducted on the MareNostrum 4 supercomputer using a prototype implementation prove the validity of the proposal supporting the execution of up to 707,653 tasks using 2,400 cores and achieving speedups of up to 106 times faster than executions of a single workflow and centralized management.Keywordsdistributed systemsexascaletask-basedprogramming modelworkflowhierarchyruntime systempeer-to-peerdecentralized management
... The earliest remote sensing models were limited by the performance of desktop softwares running on a single machine. With the development of high-performance computing (HPC), some HPC-based methods have been considered as effective solutions for solving computational challenges, such as MPI/OpenMP [26], Hadoop [27], Spark [28], and CUDA programming supported by GPUs [29]. Many HPC frameworks targeting remote sensing and geographic information computing scenarios have also been proposed, such as Spa-tialHadoop [30], Hadoop GIS [31], GeoFlink [32], etc. ...
Article
Full-text available
Land cover mapping plays a pivotal role in global resource monitoring, sustainable development research, and effective management. However, the complexity of the mapping process, coupled with significant computational and data storage requirements, often leads to delays between data processing and product publication, thereby bringing challenges to creating multi-timesteps large-area products for monitoring dynamic land cover. Therefore, improving the efficiency of each stage in land cover mapping and automating the mapping process is currently an urgent issue to be addressed. This study proposes a high-performance automated large-area land cover mapping framework (HALF). By leveraging Docker and workflow technologies, the HALF effectively tackles model heterogeneity in complex land cover mapping processes, thereby simplifying model deployment and achieving a high degree of decoupling between production models. It optimizes key processes by incorporating high-performance computing techniques. To validate these methods, this study utilized Landsat imagery data and extracted samples using GLC_FCS and FROM_GLC, all of which were acquired at a spatial resolution of 30 m. Several 10° × 10° regions were chosen globally to illustrate the viability of generating large-area land cover using the HALF. In the sample collection phase, the HALF introduced an automated method for generating samples, which overlayed multiple prior products to generate a substantial number of samples, thus saving valuable manpower resources. Additionally, the HALF utilized high-performance computing technology to enhance the efficiency of the sample–image matching phase, thereby achieving a speed that was ten times faster than traditional matching methods. In the mapping stage, the HALF employed adaptive classification models to train the data in each region separately. Moreover, to address the challenge of handling a large number of classification results in a large area, the HALF utilized a parallel mosaicking method for classification results based on the concept of grid division, and the average processing time for a single image was approximately 6.5 s.
... The earliest remote sensing models were limited by the performance of desktop soft-94 wares running on a single machine. With the development of high-performance computing 95 (HPC), some HPC-based methods have been considered as effective solutions for solving 96 computational challenges, such as MPI/OpenMP [24], Hadoop [25], Spark [26], and CUDA 97 programming supported by GPUs [27]. Many HPC frameworks targeting remote sensing 98 and geographic information computing scenarios have also been proposed, such as Spa-99 tialHadoop [28], Hadoop GIS [29], GeoFlink [30], etc. ...
Preprint
Full-text available
Large-scale land cover plays a crucial role in global resource monitoring and management, as well as research on sustainable development. However, the complexity of the mapping process, coupled with significant computational and data storage requirements, often leads to delays between data processing and product publication, creating challenges for dynamic monitoring of large-scale land cover. Therefore, improving the efficiency of each stage in large-scale land cover mapping and automating the mapping process is currently an urgent and critical issue that needs to be addressed. We propose a high-performance automated large-scale land cover mapping framework(HALF) that introduces high-performance computing technology to the field of land cover production. HALF optimizes key processes, such as automated sample point extraction, sample-remote sensing image matching, and large-scale classification result mosaic and update. We selected several 10°×10° regions globally and the research makes several significant contributions:(1)We design HALF for land cover mapping based on docker and CWL-Airflow, which solves the heterogeneity of models between complex processes in land cover mapping and simplifies the model deployment process. By introducing workflow organization, this method achieves a high degree of decoupling between the production models of each stage and the overall process, enhancing the scalability of the framework. (2)HALF propose an automatic sample points method that generates a large number of samples by overlaying and analyzing multiple prior products, thus saving the cost of manual sample selection. Using high-performance computing technology improved the computational efficiency of sample-image matching and feature extraction phase, with 10 times faster than traditional matching methods.(3)HALF propose a high-performance classification result mosaic method based on the idea of grid division. By quickly establishing the spatial relationship between the image and the product and performing parallel computing, the efficiency of the mosaicking in large areas is significantly improved. The average processing time for a single image is around 6.5 seconds.
Chapter
We present performance results on two current multicore architectures, an STI (Sony, Toshiba, and IBM) Cell processor included in the new PlayStation™ 3 and a Sun UltraSPARC T2 (“Niagara 2”) machine. On the Niagara 2 we analyze typical performance patterns that emerge from the peculiar way the memory controllers are activated on this chip, using the standard STREAM benchmark and a shared-memory parallel lattice Boltzmann code. On the Cell processor we measure the memory bandwidth and run performance tests for LBM simulations. Additionally, we show results for an application in image processing on the Cell processor, where it is required to solve nonlinear anisotropic PDEs.
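The bandwidth measurements referred to here follow the familiar STREAM pattern. Below is a minimal OpenMP triad sketch, assuming an arbitrary array length and repetition count; it is not the benchmark code used in that chapter.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N      (40L * 1000 * 1000)   /* assumed array length       */
#define NTIMES 10                    /* assumed number of repeats  */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    /* first-touch initialization places pages in the ccNUMA domain of the using thread */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.5; }

    double best = 0.0;
    for (int k = 0; k < NTIMES; k++) {
        double t = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];                      /* triad kernel */
        t = omp_get_wtime() - t;
        double gbs = 3.0 * N * sizeof(double) / t / 1.0e9; /* counts 2 loads + 1 store */
        if (gbs > best) best = gbs;
    }
    printf("best triad bandwidth: %.1f GB/s\n", best);
    free(a); free(b); free(c);
    return 0;
}

First-touch initialization matters on ccNUMA nodes: pages are mapped into the domain of the thread that initializes them, which is why the initialization loop is parallelized the same way as the kernel.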
Conference Paper
Summary form only given. We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of meshes, but had not previously been captured in benchmarks. The new suite, named NPB (NAS parallel benchmarks) multizone, is extended from the NPB suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the message passing interface (MPI) and OpenMP, and another hybrid using a shared memory multilevel programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on three different parallel computers. We also use an empirical formula to investigate the performance characteristics of the hybrid parallel codes.
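The coarse-grain structure described here, with zones solved independently and boundary data exchanged after each time step, can be sketched in hybrid MPI+OpenMP form as follows; solve_zone(), exchange_boundaries(), and all sizes are placeholders rather than the NPB Multi-Zone code.

#include <mpi.h>
#include <omp.h>

#define NZONES_LOCAL 4        /* assumed zones per MPI rank */
#define ZONE_SIZE    100000   /* assumed points per zone    */
#define NSTEPS       10

static double zones[NZONES_LOCAL][ZONE_SIZE];

/* fine-grain OpenMP parallelism inside one zone */
static void solve_zone(double *z, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        z[i] = 0.25 * z[i];            /* placeholder zone update */
}

/* placeholder: NPB-MZ sends zone face data to the ranks owning neighbor zones */
static void exchange_boundaries(void)
{
    MPI_Barrier(MPI_COMM_WORLD);
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    for (int step = 0; step < NSTEPS; step++) {
        for (int z = 0; z < NZONES_LOCAL; z++)   /* coarse grain: loop over local zones */
            solve_zone(zones[z], ZONE_SIZE);
        exchange_boundaries();                    /* boundary exchange after each time step */
    }

    MPI_Finalize();
    return 0;
}

The two levels map naturally onto a cluster of SMP nodes: zones are distributed across MPI ranks, while the loops inside each zone use the threads of one node.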
Conference Paper
Power has become a primary concern for HPC systems. Dynamic voltage and frequency scaling (DVFS) and dynamic concurrency throttling (DCT) are two software tools (or knobs) for reducing the dynamic power consumption of HPC systems. To date, few works have considered the synergistic integration of DVFS and DCT in performance-constrained systems, and, to the best of our knowledge, no prior research has developed application-aware simultaneous DVFS and DCT controllers in real systems and parallel programming frameworks. We present a multi-dimensional, online performance predictor, which we deploy to address the problem of simultaneous runtime optimization of DVFS and DCT on multi-core systems. We present results from an implementation of the predictor in a runtime library linked to the Intel OpenMP environment and running on an actual dual-processor quad-core system. We show that our predictor derives near-optimal settings of the power-aware program adaptation knobs that we consider. Our overall framework achieves significant reductions in energy (19% mean) and ED2 (40% mean), through simultaneous power savings (6% mean) and performance improvements (14% mean). We also find that our framework outperforms earlier solutions that adapt only DVFS or DCT, as well as one that sequentially applies DCT then DVFS. Further, our results indicate that prediction-based schemes for runtime adaptation compare favorably and typically improve upon heuristic search-based approaches in both performance and energy savings.
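The concurrency-throttling knob (DCT) on its own can be illustrated with a short hedged sketch: successive OpenMP phases simply run with different thread counts chosen by a stand-in policy. The paper's online predictor is not reproduced here, and DVFS would additionally require the platform's frequency-scaling interface.

#include <omp.h>
#include <stdio.h>

/* stand-in for an online performance predictor: a fixed, hypothetical policy
 * that gives (assumed) memory-bound phases fewer threads */
static int choose_thread_count(int phase)
{
    return (phase % 2 == 0) ? 4 : 8;
}

int main(void)
{
    for (int phase = 0; phase < 6; phase++) {
        omp_set_num_threads(choose_thread_count(phase));   /* DCT: throttle concurrency */
        #pragma omp parallel
        {
            #pragma omp single
            printf("phase %d runs with %d threads\n", phase, omp_get_num_threads());
            /* ... the phase's actual computation would go here ... */
        }
    }
    return 0;
}

The point is simply that thread counts can be adapted between parallel regions at run time; deciding when fewer threads are enough is exactly what the predictor in the cited work addresses.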
Conference Paper
Climate modeling is a grand challenge problem where scientific progress is measured not in terms of the largest problem that can be solved but by the highest achievable integration rate. These models have been notably absent in previous Gordon Bell competitions due to their inability to scale to large processor counts. A scalable and efficient spectral element atmospheric model is presented. A new semi-implicit time stepping scheme accelerates the integration rate relative to an explicit model by a factor of two, achieving 130 years per day at T63L30 equivalent resolution. Execution rates are reported for the standard shallow water and Held-Suarez climate benchmarks on IBM SP clusters. The explicit T170 equivalent multi-layer shallow water model sustains 343 Gflops at NERSC, 206 Gflops at NPACI (SDSC) and 127 Gflops at NCAR. An explicit Held-Suarez integration sustains 369 Gflops on 128 16-way IBM nodes at NERSC.
Article
Shared memory facilitates the transition from sequential to parallel processing. Since most data structures can be retained, simply adding synchronization achieves correct, efficient programs for many applications. We discuss our experience with parallel computing on networks of workstations using the TreadMarks distributed shared memory system. DSM allows processes to assume a globally shared virtual memory even though they execute on nodes that do not physically share memory. We illustrate a DSM system consisting of N networked workstations, each with its own memory. The DSM software provides the abstraction of a globally shared memory, in which each processor can access any data item without the programmer having to worry about where the data is or how to obtain its value
Conference Paper
Most HPC systems are clusters of shared memory nodes. Such systems can be PC clusters with dual or quad boards, but also “constellation” type systems with large SMP nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node.
Conference Paper
OpenMP provides a portable programming interface for shared memory parallel computers (SMPs). Although this interface has proven successful for small SMPs, it requires greater flexibility in light of the steadily growing size of individual SMPs and the recent advent of multithreaded chips. In this paper, we describe two application development experiences that exposed these expressivity problems in the current OpenMP specification. We then propose mechanisms to overcome these limitations, including thread subteams and thread topologies. Thus, we identify language features that improve OpenMP application performance on emerging and large-scale platforms while preserving ease of programming.
Article
Summary: Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node. The hybrid MPI+OpenMP programming model is compared with pure MPI, compiler-based parallelization, and other parallel programming models on hybrid architectures. The paper focuses on bandwidth and latency aspects, and also on whether programming paradigms can separate the optimization of communication and computation. Benchmark results are presented for hybrid and pure MPI communication. This paper analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes.
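One hybrid flavor analyzed in this line of work, overlapping communication and computation under MPI_THREAD_FUNNELED, can be sketched as follows; the halo sizes, the ring neighbor topology, and the placeholder updates are assumptions, not code from the paper.

#include <mpi.h>
#include <omp.h>

#define N    1000000   /* assumed local field size   */
#define HALO 1000      /* assumed halo/boundary size */

static double field[N], halo_in[HALO], halo_out[HALO];

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            /* master thread exchanges halos (FUNNELED: only this thread calls MPI) */
            MPI_Sendrecv(halo_out, HALO, MPI_DOUBLE, right, 0,
                         halo_in,  HALO, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            /* remaining threads update the interior in the meantime */
            int nt = omp_get_num_threads() - 1;
            int id = omp_get_thread_num() - 1;
            long chunk = (N - 2L * HALO) / nt;
            long lo = HALO + id * chunk;
            long hi = (id == nt - 1) ? N - HALO : lo + chunk;
            for (long i = lo; i < hi; i++)
                field[i] = 0.5 * field[i];   /* placeholder interior update */
        }
        #pragma omp barrier
        /* boundary cells that depend on halo_in would be updated after the barrier */
    }

    MPI_Finalize();
    return 0;
}

The split illustrates the separation question raised above: the master thread's MPI call proceeds while the other threads keep the cores busy, but the benefit depends on whether the MPI library actually progresses communication asynchronously.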
Article
We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of grids, but had not previously been captured in benchmarks. The new suite, named NPB Multi-Zone, is extended from the NAS Parallel Benchmarks suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the Message Passing Interface (MPI) and OpenMP, and another hybrid using a shared memory multi-level programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on three different parallel computers. We also use an empirical formula to investigate the performance characteristics of the multi-zone benchmarks.
R. F. Van Der Wijngaart, H. Jin: NAS Parallel Benchmarks, Multi-Zone Versions. NAS Technical Report NAS-03-010, NASA Ames Research Center, Moffett Field, CA, 2003.