Static Process Mapping Heuristics for MPI Parallel Processes in
Homogeneous Multi-core Clusters*
Manuela K. Ferreira, Vicente S. Cruz, Rodrigo
Virote Kassick, Philippe O. A. Navaux
Instituto de Informática – Universidade Federal
do Rio Grande do Sul (UFRGS)
Caixa Postal 15.064 – 91.501-970 – Porto Alegre
– RS – Brazil
{mkferreira, vscruz, rodrigovirote.kassick,
navaux}@inf.ufrgs.br
Henrique Cota de Freitas
Instituto de Ciências Exatas e Informática –
Pontifícia Universidade Católica de Minas Gerais
(PUC Minas) – Av. Dom José Gaspar 500 – CEP
30.535-901 – Belo Horizonte – MG – Brazil
cota@pucminas.br
Abstract
An important factor that must be considered to achieve high performance in parallel applications is the mapping of processes to cores. However, since this is an NP-hard problem, it requires different mapping heuristics that depend on the application and on the hardware to which it will be mapped. Current architectures can have more than one multi-core processor per node, and consequently the process mapping can consider three process communication types: intrachip, intranode and internode. This work compares two static process mapping heuristics, Maximum Weighted Perfect Matching (MWPM) and Dual Recursive Bipartitioning (DRB), with the best mapping found by Exhaustive Search (ES) in homogeneous multi-core clusters using MPI. The objective is to compare the performance improvement of MWPM with the already established DRB. The analysis with 16 processes shows very similar performance improvements for both heuristics, with an average of 13.79% for MWPM and 14.07% for DRB.
1. Introduction
Multi-core processors can be found from personal computers to high-performance computers (HPC). Four to eight cores on a single chip are common today, and the trend is toward more cores per chip [1] as long as Moore's law can be applied to CMPs [2]. Consequently, old methods of developing and executing programs do not make reasonable use of the resources of new architectures, e.g. leaving some cores wasting energy in an idle state. Hence, new methods are required to make effective use of these resources.
When we place the processes of a parallel application on a current multi-core environment, we have an instance of the process mapping problem [3], which is the act of deciding on which processing unit each parallel process will execute, with the goal of achieving the best performance. It is an NP-hard problem and consequently has no generic solution, so we must consider the software details and the characteristics of the hardware platform on which it will execute in order to find a local solution.
Process communication has a significant impact on the performance of parallel applications and must be taken into consideration in the development of process mapping heuristics in order to decrease communication time and increase performance [4][5]. Processes with higher communication between them must be mapped to the processors where that communication has the smallest cost.
Since version 1.3, the default communication method of MPICH2 has been Nemesis [16], which combines the previously used socket communication with shared-memory communication and takes into account whether processes share memory to decide which mechanism to use. Thus, when MPICH2 processes run on a CMP cluster, the faster shared-memory communication is used both when processes exchange data within a chip (intrachip communication) and when they do so within a node (intranode communication), while socket communication is used when a process must send a message to one located on a different node of the cluster (internode communication).

* Partially sponsored by CNPq, CAPES, FAPERGS, FINEP and Microsoft.
This paper evaluates two static process mapping heuristics for a cluster with multi-core processors, using MPI and the NAS Parallel Benchmarks as workload [6]. The first is the Maximum Weighted Perfect Matching (MWPM) algorithm [7], which uses the application communication pattern to build the mapping based on a variant of the Edmonds-Karp algorithm. The second method applies the Dual Recursive Bipartitioning (DRB) algorithm, performing the mapping from the application graph onto the machine architecture graph with a divide-and-conquer method [8]. Both heuristics are compared with the Exhaustive Search (ES), which builds the mapping by exhaustively combining every pair of processes on every pair of cores that share the same level of the memory hierarchy, finding the combination of process pairs with the maximum communication volume.
The objective of this evaluation is to analyze how close MWPM and DRB come to the best mapping method, i.e. ES. The latter has factorial complexity, so it is not viable when the number of processes increases. Hence, by evaluating other methods that produce a mapping solution in a shorter time, we get an idea of which of them can be used for most static mappings on homogeneous multi-core clusters.
The organization of this paper is as follows. In Section 2, we present the characteristics of static process mapping on multi-core clusters. Section 3 explains the benchmark suite used to compare the different mappings. Section 4 presents the three mapping methods compared in this work. The performance results of the heuristics are presented and analyzed in Section 5. In Section 6, we discuss related work that uses static process mapping. Finally, Section 7 is dedicated to conclusions and future work.
2. Process Mapping Overview
The process mapping of parallel applications is an NP-hard problem [3], which means that there is no generic optimal solution for it. For that reason, we must work on a method that considers the details of a specific application, as well as the characteristics of the hardware on which the processes will be mapped, in order to find a local and suitable mapping heuristic for this specific environment and achieve reasonable performance. For instance, on a heterogeneous platform, processes whose tasks are specialized could be mapped to a core whose Instruction Set Architecture (ISA) is suited to their execution. Another example is the communication latency, which challenges the process mapping algorithm because it must place the processes on cores such that the communication overhead is minimized. These two examples can affect each other, because a process placed on a specialized core may communicate very often with another process that was mapped to a distant, also specialized, core.
In current multiprocessed architectures, such as clusters, multi-core processors are common. Thus, process mapping for high-performance computing must consider at least three process communication levels. The first is internode, in which processes placed on different nodes send messages to each other through the interconnection network. Intranode is the second, and it happens when two processes within a node but on different processors communicate using shared memory or the interconnection bus. Finally, the last one is intrachip, which arose with multi-core processors. In this communication type, the cache memory is shared between the processes attached to their respective cores. Figure 1 shows the three communication types.
Figure 1: Internode, Intranode and Intrachip
Communication Types [4].
To explain the types of process placement we show, in Figure 2, the topology of the Intel Xeon E5405 quad-core processor used in our experiments. If two processes communicate intensively with each other, with small to medium message sizes, then they should be placed on cores that share the L2 cache level, so that the message passing latency is reduced [4].
Figure 2: Intel Xeon E5405 Quad-Core Processor
Topology.
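As a side note, the L2 sharing shown in Figure 2 can be discovered programmatically on Linux by reading the sysfs cache topology. The short Python sketch below is only an illustration of that idea and is not part of the mapping tools described in this paper.

    # Sketch: discover which cores share an L2 cache on Linux by reading sysfs.
    # Paths follow the standard Linux cpu/cache layout; adjust if unavailable.
    import glob

    def l2_sharing():
        """Return one 'shared_cpu_list' string per distinct L2 cache."""
        groups = set()
        for cache_dir in glob.glob("/sys/devices/system/cpu/cpu*/cache/index*"):
            with open(cache_dir + "/level") as f:
                level = f.read().strip()
            if level != "2":
                continue
            with open(cache_dir + "/shared_cpu_list") as f:
                groups.add(f.read().strip())
        return groups

    # On the Xeon E5405 topology of Figure 2, each reported group contains two
    # cores, since every L2 cache is shared by a pair of cores.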
3. Workload Description
For the evaluation of the process mapping heuristics, we used the Numerical Aerodynamic Simulation Parallel Benchmarks (NAS-NPB) workload, version 3.3 for MPI [6]. This benchmark suite was selected because of its generality and its heterogeneous communication patterns, so we can measure the performance of the three methods in a general way.
The Block Tridiagonal (BT), Scalar Pentadiagonal (SP) and Lower-Upper triangular systems (LU) applications are parallelized by dividing the space domain, with the sharing of their respective borders. The Conjugate Gradient (CG) has a random data access pattern that is well suited for memory performance analysis. Multigrid (MG), in turn, does not have a well-defined access pattern, since the information is fetched in a non-linear way in the beginning and becomes more linear during the execution. The Embarrassingly Parallel (EP) application is used to measure the peak performance of computational systems, since its data processing is almost independent; it is the opposite of Integer Sort (IS), where there is high data dependency. Finally, the Fast Fourier Transform (FT) merges linear data access with data sharing.
Almost all of these applications were developed in Fortran 90 because of its 64-bit floating-point instructions; only IS was implemented in C, since it operates only on integer data. As explained in more detail in Section 5, the binaries were generated with a hybrid C-Fortran compilation.
4. Exhaustive Search and Static Process
Mapping Heuristics
Our work consists in evaluating two heuristics based on the volume of process communication and comparing their respective performance. The heuristics are Maximum Weighted Perfect Matching (MWPM) and Dual Recursive Bipartitioning (DRB). The objective of these approaches is to minimize the communication latency of processes that exchange a considerable bulk of messages.
For the description of both heuristics, it is important to establish some definitions. Let A(P, B) be the application communication graph, where P is the set of processes, representing the vertices of A, and B is the set of edges of A, in which B(px, py), with px, py ∈ P, represents the communication bulk exchanged between the processes px and py. In general, the three methods presented offer a way of placing a pair of processes (px, py) on cores that share a resource which minimizes the communication latency, that is, in this case, a level of the memory hierarchy.
A previous execution of the applications to be mapped is necessary in order to gather the process communication pattern, i.e. the amount of messages transferred between each pair of processes. This information is stored in the set of edges B of graph A, represented as an adjacency matrix, and it is used as an input parameter for both mapping heuristics.
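As an illustration of this step, the sketch below aggregates a per-message trace into such an adjacency matrix; the trace format and function name are hypothetical, since the actual collection is done inside the MPE library (Section 5).

    # Sketch (not the authors' tool): build the adjacency matrix B of A(P, B)
    # from a hypothetical per-message trace of (source rank, destination rank, bytes).
    import numpy as np

    def build_comm_matrix(trace, num_procs):
        """Aggregate traced messages into a symmetric communication-volume matrix."""
        B = np.zeros((num_procs, num_procs))
        for src, dst, nbytes in trace:
            # Volume is accumulated symmetrically, since the heuristics only
            # care about how much a pair of processes exchanges in total.
            B[src, dst] += nbytes
            B[dst, src] += nbytes
        return B

    # Example with 4 processes and a few traced messages:
    trace = [(0, 1, 4096), (1, 0, 4096), (2, 3, 65536), (0, 2, 512)]
    B = build_comm_matrix(trace, num_procs=4)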
The Exhaustive Search (ES) is used as the performance baseline. It provides the best process mapping, but it has factorial time complexity and is not suited for a considerable number of processes and cores. Therefore, a good way to overcome this issue is to give up the best mapping and use another method that finds a reasonably good mapping in polynomial time. The reduction in application performance is then amortized by the decrease in the time needed to define the allocation of the application processes. We compare the MWPM and DRB heuristics with ES to examine how close to the best solution the application performance obtained by their respective mappings comes.
All these methods focus on mapping processes to cores that share the L2 cache level. As output, they generate an affinity file containing the ranks of the processes and the numbers of the cores to which they should be mapped. The following subsections describe the ES method and the two heuristics, MWPM and DRB.
4.1. Exhaustive Search
This method ensures the best process placement because it obtains the maximum communication volume by exhaustively searching all combinations of process pairs that can be mapped onto pairs of cores. After finding the best combination, it maps each pair of processes onto a respective pair of cores that shares the L2 cache.
The algorithm receives a list of all possible combinations of process pairings, where each list element, that is, a possible pairing of the tasks, has n pairs. The best mapping is defined by selecting the element of the combination list for which the sum of the communication volumes of all its process pairs is maximum. Finally, these pairs are mapped to pairs of cores that share the L2 cache.
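A minimal sketch of this search, assuming the communication matrix B described above and an even number of processes, could look like the following (illustrative code, not the authors' implementation):

    # Sketch: exhaustive search over all perfect pairings of processes.
    # B is the symmetric communication-volume matrix indexed by process rank.
    def all_pairings(processes):
        """Yield every way of splitting the process list into unordered pairs."""
        if not processes:
            yield []
            return
        first, rest = processes[0], processes[1:]
        for i, partner in enumerate(rest):
            for tail in all_pairings(rest[:i] + rest[i + 1:]):
                yield [(first, partner)] + tail

    def exhaustive_search(B):
        procs = list(range(len(B)))
        best_pairs, best_volume = None, -1.0
        # Factorial growth: for 16 processes there are already 2,027,025 pairings.
        for pairing in all_pairings(procs):
            volume = sum(B[p][q] for p, q in pairing)
            if volume > best_volume:
                best_pairs, best_volume = pairing, volume
        return best_pairs  # each pair is then bound to two cores sharing an L2 cache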
Although the ES method provides the best process mapping, the search comprises all task placement possibilities, which has factorial time complexity. Since the number of cores available today is considerably high, the search time of this method quickly becomes impracticable.
4.2. Maximum Weighted Perfect Matching
MWPM defines a process placement in polynomial time by modeling the mapping problem as a maximum-weight graph matching. This approach is a feasible alternative and works in three steps. In the first step, the algorithm groups the processes that exchange a significant amount of messages into pairs and places each pair on a pair of cores that shares an L2 cache level, taking advantage of intrachip communication. The second step allocates each pair of process pairs to the processors, using intranode communication. The third step distributes the groups of process pairs over the nodes, considering internode communication [15].
The algorithm works by choosing, from the application graph A(P, B), the processes that should stay closer in the memory hierarchy, based on the amount of communication. The goal is to find a subset M of B such that every process p ∈ P is incident to exactly one edge b ∈ M, and the sum of the weights of the edges in M is maximum. This problem is solved by a variant of the Edmonds-Karp algorithm with time complexity O(N³), and it is applied three times to find the proper mapping for 16 processes.
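For illustration, the first step can be reproduced with a general-purpose matching routine; the sketch below uses networkx's blossom-based max_weight_matching as a stand-in for the Edmonds-Karp variant cited above.

    # Sketch of the first MWPM step using networkx's matching routine
    # (a stand-in for the algorithm variant used in the paper).
    import networkx as nx

    def pair_processes(B):
        """Return the set of process pairs whose total communication volume is maximum."""
        G = nx.Graph()
        n = len(B)
        for p in range(n):
            for q in range(p + 1, n):
                G.add_edge(p, q, weight=B[p][q])
        # maxcardinality=True forces a perfect matching on the complete graph.
        return nx.max_weight_matching(G, maxcardinality=True)

    # Each returned pair (px, py) is then pinned to two cores sharing an L2 cache.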
The first application of the algorithm is straightforward and discovers the pairs of processes to be placed on pairs of cores that share the L2 cache level. The next application of the MWPM algorithm defines how to map each pair of process pairs onto a pair of processors. A reasonable way to achieve this is to define a sharing matrix, where each row and column represents one of the process pairs defined by the edges of M. This matrix is used as input for the next execution of the algorithm, and each of its elements stores the amount of data shared between two process pairs. The equation used to generate the matrix elements is shown below, where (x, y) and (z, k) are the process pairs attached to the edges of the subset M returned by the first execution, and M(p1, p2) denotes the data bulk exchanged between processes p1 and p2.
H((x,y),(z,k)) = M(x,z) + M(x,k) + M(y,z) + M(y,k)
The third application of the algorithm is a repetition of the second step, but it uses the sharing matrix generated in step two to build a new matrix that represents the amount of communication between groups of process pairs.
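The sketch below shows how this sharing matrix could be built from the pairs returned by the first matching, using the process-level communication matrix B as the weight function that the equation denotes by M (illustrative code only):

    # Sketch: build the sharing matrix H between process pairs, following the
    # equation above. `pairs` is the list of (px, py) tuples from the matching,
    # and B is the process-level communication matrix.
    import numpy as np

    def build_sharing_matrix(pairs, B):
        n = len(pairs)
        H = np.zeros((n, n))
        for i, (x, y) in enumerate(pairs):
            for j, (z, k) in enumerate(pairs):
                if i == j:
                    continue  # a pair exchanges no data with itself here
                H[i, j] = B[x][z] + B[x][k] + B[y][z] + B[y][k]
        return H

    # H is then fed to the next matching step, which groups pairs of pairs.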
4.3. Dual Recursive Bipartitioning
The DRB process mapping approach is based on a divide-and-conquer method for graph mapping [8]. Along with A(P, B), let T(C, L) be the target hardware platform graph, where the vertex set C represents the processing units (the cores) of the hardware, and L is the set of edges of T, with L(cx, cy) defined as the latency of the communication channel between the cores cx and cy. Consequently, to generate the graph T it is necessary to measure the communication latency between each pair of cores.
To create the mapping, the algorithm begins with the process set P of the application graph A and a domain structure comprising the whole set of cores, that is, the vertex set C of the graph T. Then, it applies the domain bipartitioning and process bipartitioning functions to the domain structure and the process set, respectively, dividing the first into two core subdomains and the second into two disjoint process subsets. These functions are based on graph partitioning algorithms and ensure that processes which exchange a considerable number of messages are kept in the same subset, and that processors with low communication latency between them are kept in the same subdomain. The next step of the algorithm consists in attaching each of the new process subsets to a subdomain, thereby minimizing the communication volume between the two process subsets. These stages are repeated recursively until each disjoint process set has only one process (the base case) and the subdomain has only one core, so each singleton process subset is mapped to a singleton core subdomain [11]. In the end, the pairs of processes with the highest number of messages exchanged are mapped to processors with the lowest communication latency.
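A highly simplified sketch of this recursion is given below; it uses networkx's Kernighan-Lin bisection in place of Scotch's partitioning routines, splits the core set naively in half at each level (thus ignoring the latency graph T), and assumes the numbers of processes and cores are equal powers of two.

    # Simplified sketch of dual recursive bipartitioning (Scotch does the real work).
    import networkx as nx
    from networkx.algorithms.community import kernighan_lin_bisection

    def drb_map(app_graph, cores, mapping=None):
        """Recursively co-partition processes and cores; returns {process: core}."""
        if mapping is None:
            mapping = {}
        procs = list(app_graph.nodes)
        if len(procs) == 1:                # base case: singleton process subset
            mapping[procs[0]] = cores[0]   # ...mapped to a singleton core subdomain
            return mapping
        # Keep heavily-communicating processes together by minimizing the cut weight.
        left, right = kernighan_lin_bisection(app_graph, weight="weight")
        half = len(cores) // 2             # naive bipartition of the core domain
        drb_map(app_graph.subgraph(left).copy(), cores[:half], mapping)
        drb_map(app_graph.subgraph(right).copy(), cores[half:], mapping)
        return mapping

    # Usage: app_graph is an nx.Graph whose nodes are process ranks and whose edge
    # attribute 'weight' holds the exchanged bytes; cores is the list of core ids.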
We used the DRB mapping method through Scotch, a software package that implements this algorithm. Unlike the other allocation techniques, Scotch generates a file containing the node on which each process should be mapped. Since the processors of the nodes have more than one core, we used an operating system syscall to place the processes directly on cores.
5. Experimental Results
To generate the application performance results, we used the MPE library of the MPICH2 1.2 MPI distribution. This library is a feature of MPI distributions used for logging specific events, and it was modified to collect the amount of data exchanged between each pair of processes. Thus, by executing the applications beforehand with MPE, we captured their communication patterns in order to generate the respective graphs A(P, B) and, finally, apply the mapping heuristics.
After obtaining the affinity files, we modified the NAS-NPB-MPI applications to include the operating system syscall sched_setaffinity, declared in the C header sched.h, to attach each process to its respective core. It is important to note that, since the NAS applications were written in Fortran, the code that uses sched.h was compiled with a C compiler and later linked with the NAS applications, generating a single executable file for each application.
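For illustration only, the same binding can be expressed in Python through the equivalent os.sched_setaffinity wrapper; the affinity-file format below (one "rank core" pair per line) is an assumption, since the paper does not specify it, and the actual implementation is the C call linked into the Fortran binaries.

    # Sketch: pin the calling process to the core listed for its rank in the
    # affinity file (hypothetical "rank core" format, one pair per line).
    import os

    def apply_affinity(affinity_file, my_rank):
        with open(affinity_file) as f:
            for line in f:
                rank, core = map(int, line.split())
                if rank == my_rank:
                    # Restrict this process to a single core, as the
                    # sched_setaffinity() call does in the modified NAS codes.
                    os.sched_setaffinity(0, {core})
                    return

    # Each MPI process would call apply_affinity("affinity.txt", rank) after MPI_Init.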
All benchmarks were compiled with class size C, which is an intermediate standard size in this benchmark suite. The BT, CG, EP, FT, IS, LU, SP and MG benchmarks were executed with 16 processes. To exercise the three levels of communication, intrachip, intranode and internode, the applications with 16 processes ran on a cluster with two nodes, each one with two Intel Xeon E5405 quad-core processors. Figure 3 shows the percentage of gain for ES and the two heuristics, MWPM and DRB.
The EP and FT benchmarks present negative or very small gains because all the processes exchange the same amount of bytes, i.e. all the positions in the communication adjacency matrix have the same value. The CG benchmark has the best gains for both heuristics, because it has the largest differences in the amount of communication between specific pairs of processes.
As expected, the results show that, in most cases, ES has the best gain compared with the default mapping of the MPICH2 process manager, MWPM and DRB. However, the performance improvements of the latter two heuristics are close to it. For the execution on 16 cores, ES had an average improvement of 14.54%, compared to 13.79% for MWPM and 14.07% for DRB.
Figure 3: Percentage of time gain of the three methods running NAS-NPB-MPI with 16 processes.

Table 1 presents the execution times, for each NAS application with 16 processes, of the MWPM and DRB mapping heuristics compared to ES and the MPICH2 default mapping. We performed multiple executions, and the table reports 95% confidence intervals.

Table 1: Execution time with a 95% confidence interval for 16 processes, where Default is the MPICH2 process manager mapping.

          Default                   ES                        MWPM                      DRB
    BT    373.37 [373.04, 373.69]   360.29 [359.68, 360.90]   361.14 [360.87, 361.40]   361.09 [360.79, 361.40]
    CG    184.42 [184.32, 184.52]   109.50 [109.12, 109.88]   110.08 [109.84, 110.31]   110.09 [109.80, 110.38]
    EP     34.54 [34.49, 34.59]      34.53 [34.26, 34.79]      34.79 [34.38, 35.20]      34.66 [34.31, 35.01]
    FT    203.02 [202.04, 204.01]   201.42 [200.49, 202.34]   203.48 [202.43, 204.52]   202.60 [201.96, 203.24]
    IS     18.03 [17.89, 18.16]      17.30 [17.20, 17.40]      17.32 [17.20, 17.44]      17.31 [17.20, 17.43]
    LU    239.46 [239.02, 239.89]   216.09 [215.70, 216.48]   216.77 [216.46, 217.08]   217.06 [216.84, 217.28]
    MG     40.69 [40.58, 40.80]      33.28 [33.26, 33.30]      34.00 [33.63, 34.37]      33.52 [33.21, 33.82]
    SP    521.71 [520.13, 523.30]   491.60 [490.92, 492.27]   491.60 [490.92, 492.27]   492.43 [491.85, 493.01]
Even though MWPM and DRB present equivalent results, the former was developed for homogeneous architectures in which each pair of cores shares an L2 cache. DRB can be applied to generic hardware architectures, hence it is recommended in most static process mapping situations.
6. Related Work
Several static process mapping heuristics have been proposed for Symmetric Multiprocessing (SMP) architectures [9][10], but since those compute nodes are composed of single-core processors, these heuristics are not concerned with resource sharing between multiple cores within a chip. This implies that such techniques may not improve application performance on clusters of multi-core processors, which are the focus of our work.
A strategy to reduce the communication latency of parallel applications that have a static communication pattern, identified by a previous execution, is presented in [11]. It also uses the DRB algorithm, but applies it to a meteorology application. Once the communication pattern is gathered, the processes are mapped onto multi-core clusters considering the cache memory sharing within the chip. The authors reached a speed-up of 9% with this technique; nevertheless, it is advantageous only for applications that are executed repeatedly, where the time spent in the previous execution is compensated by the speed-up achieved in the following runs. In our work, this algorithm is not restricted to a single application; it is executed in a generic way in order to compare the heuristics.
The work presented in [4] evaluates multi-core architectures through a benchmark set that uses a data tiling algorithm to avoid the data contention in cache and main memory that occurs when several simultaneous accesses to the same memory address are made [12]. An interesting conclusion is that, when the message size is greater than the shared cache size, the benchmarks obtain better performance using the InfiniBand interconnect than using shared memory. Nonetheless, using this algorithm requires modifying the application source code, which can be very difficult in real programs.
The work in [5] evaluates process affinity in static process mapping with MPI for SMP-CMP clusters, and the authors argue that its impact on the performance of parallel applications is not very clear. Since there is no clear understanding of its usage, a heuristic may be needed to define whether process affinity is useful for a specific case. The NAS benchmarks are then executed to analyze characteristics, such as scalability and communication latency, that indicate when affinity is worthwhile. In our work, we also analyze the impact of process affinity, but through three process mapping methods that exploit the communication latency characteristic. These methods aim to minimize this latency by placing the processes on cores that share main memory and the L2 cache level.
Hardware Locality (hwloc) is a software package presented in [13] that collects hardware information about cores, caches and memory and provides it to the mapping application. It was evaluated with pure MPI and hybrid MPI/OpenMP applications, which use the information fetched by this software to perform static and dynamic process mapping. The performance of the static case was checked by comparing a Round-Robin process mapping with a DRB process mapping built from the data collected by hwloc, which obtained a speed-up of 26%. The experiments presented in our work could be extended by using this tool to recognize the computer architecture when it is not known beforehand. In addition, we also use the DRB method in an evaluation with two other mapping approaches to verify whether it maintains this speed-up.
7. Conclusion
In this paper we have presented a performance evaluation of two static process mapping heuristics based on process communication volume. The Exhaustive Search is used as a baseline and presents the best performance, but its search comprises all task placement possibilities, which has factorial time complexity, so it is not viable when the number of processes increases.
The results show that, in most cases, both MWPM and DRB provide gains compared with the default mapping of the MPICH2 process manager, and the performance improvements of these two heuristics are close to that of ES. For the execution on 16 cores, ES had an average improvement of 14.54%, compared to 13.79% for MWPM and 14.07% for DRB. On average, the applications mapped with both heuristics achieved equivalent performance improvements. However, MWPM was developed for homogeneous architectures in which each pair of cores shares an L2 cache. DRB does not have this restriction, because it can be applied to generic hardware architectures. This leads us to conclude that the latter heuristic can be used in most situations.
As future work, we intend to evaluate these heuristics taking into account the communication contention caused by the parallel execution of other processes that do not belong to the evaluated benchmark.
References
[1] Asanovic, K. et al., The Landscape of Parallel Computing Research: A View from Berkeley. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS-2006-183, December 2006.
[2] Kumar, R., Tullsen, D. M., Jouppi, N. P. and Ranganathan, P., Heterogeneous Chip Multiprocessors. IEEE Computer, vol. 38, no. 11, pp. 32-38, Nov. 2005.
[3] Bokhari, S. H., On the Mapping Problem. IEEE Transactions on Computers, vol. C-30, no. 5, pp. 207-214, 1981.
[4] Chai, L., Gao, Q. and Panda, D., Understanding the Impact of Multi-core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System. In Cluster Computing and the Grid (CCGRID), IEEE International Symposium on, pp. 471-478, 2007.
[5] Zhang, C., Yuan, X. and Srinivasan, A., Processor Affinity and MPI Performance on SMP-CMP Clusters. 11th IPDPS Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC), 2010.
[6] National Aeronautics and Space Administration (NASA), NAS Parallel Benchmarks (NPB 3.3). Available at http://www.nas.nasa.gov/Resources/Software/npb.html, accessed May 2010.
[7] Osiakwan, C. and Akl, S., The Maximum Weight Perfect Matching Problem for Complete Weighted Graphs is in PC. In Parallel and Distributed Processing, Proceedings of the Second IEEE Symposium on, pp. 880-887, 1990.
[8] Pellegrini, F. and Roman, J., Experimental Analysis of the Dual Recursive Bipartitioning Algorithm for Static Mapping. Research Report, pp. 11-38, 1996.
[9] Bhanot, G. et al., Optimizing Task Layout on the Blue Gene/L Supercomputer. IBM Journal of Research and Development, pp. 489-500, 2005.
[10] Bollinger, S. and Midkiff, S., Heuristic Technique for Processor and Link Assignment in Multicomputers. IEEE Transactions on Computers, vol. 40, issue 3, March 1991.
[11] Rodrigues, E. et al., Multicore Aware Process Mapping and its Impact on Communication Overhead of Parallel Applications. In Computers and Communications (ISCC 2009), IEEE Symposium on, pp. 811-817, 2009.
[12] Drweiga, T., Shared Memory Contention and its Impact on Multiprocessor Call Control Throughput. Lecture Notes in Computer Science, Performance Engineering, pp. 257-266, 2001.
[13] Broquedis, F. et al., hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications. In 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp. 180-186, 2010.
[14] Righi, R. R. et al., Applying Processes Rescheduling over Irregular BSP Application. In Computational Science – ICCS 2009, volume 5544 of LNCS, pp. 213-223. Springer, 2009.
[15] Cruz, E. H. M., Alves, M. A. Z. and Navaux, P. O. A., Process Mapping Based on Memory Access Traces. In 2010 11th Symposium on Computing Systems (WSCAD-SCC), pp. 72-79, 2010.
[16] Buntinas, D., Mercier, G. and Gropp, W., Implementation and Evaluation of Shared-Memory Communication and Synchronization Operations in MPICH2 using the Nemesis Communication Subsystem. Parallel Computing, vol. 33, issue 9, pp. 634-644, Sep. 2007.