Static Process Mapping Heuristics for MPI Parallel Processes in
Homogeneous Multi-core Clusters*
Manuela K. Ferreira, Vicente S. Cruz, Rodrigo Virote Kassick, Philippe O. A. Navaux
Instituto de Informática – Universidade Federal do Rio Grande do Sul (UFRGS)
Caixa Postal 15.064 – 91.501-970 – Porto Alegre – RS – Brazil
{mkferreira, vscruz, rodrigovirote.kassick, navaux}@inf.ufrgs.br

Henrique Cota de Freitas
Instituto de Ciências Exatas e Informática – Pontifícia Universidade Católica de Minas Gerais (PUC Minas)
Av. Dom José Gaspar 500 – CEP 30.535-901 – Belo Horizonte – MG – Brazil
cota@pucminas.br
Abstract
An important factor that must be considered to achieve high performance in parallel applications is the mapping of processes onto cores. Since this is an NP-Hard problem, it requires mapping heuristics that depend on the application and on the hardware on which it will be mapped. Current architectures can have more than one multi-core processor per node, and consequently process mapping can consider three process communication types: intrachip, intranode and internode. This work compares two static process mapping heuristics, Maximum Weighted Perfect Matching (MWPM) and Dual Recursive Bipartitioning (DRB), with the best mapping found by Exhaustive Search (ES) in homogeneous multi-core clusters using MPI. The objective is to compare the performance improvement of MWPM with that of the already established DRB. The analysis with 16 processes shows very close performance improvements for both heuristics, with an average of 13.79% for MWPM and 14.07% for DRB.
1. Introduction
Multi-core processors can be found from personal computers to high performance computing (HPC) systems. Four to eight cores in a single chip are common today, and the trend is that we will have more cores per chip [1] as long as Moore's law can still be applied to CMPs [2]. Consequently, old methods of developing and executing programs do not make reasonable use of the resources of the new architectures, e.g. leaving some cores wasting energy in an idle state. Hence, new methods are required to make effective use of these resources.
When we place the processes of a parallel application on this current multi-core environment, we have an instance of the process mapping problem [3], that is, the act of deciding on which processing unit each parallel process will execute, with the goal of achieving the best performance. It is an NP-Hard problem and consequently does not have a generic solution, so we must consider the software details and the characteristics of the hardware platform on which the application will execute to find a local solution.
Process communication has a significant impact on the performance of parallel applications and must be taken into consideration in the development of process mapping heuristics, in order to decrease communication time and increase performance [4][5]. Processes with higher communication between them must be mapped to the processors where that communication has the smallest cost.
Since version 1.3 of MPICH2, its default communication method has been Nemesis [16], which combines the previously used socket communication with shared memory communication and takes into account whether the processes share memory to decide which mechanism to use. So, when MPICH2 processes run in a CMP cluster, the faster shared memory communication is used both when processes exchange data within a chip – intrachip communication – and within a node – intranode communication. Socket communication is used when a process must send a message to one located on a different node of the cluster – internode communication.

* Partially sponsored by CNPq, CAPES, FAPERGS, FINEP and Microsoft.
This paper presents an evaluation of two static process mapping heuristics for a cluster with multi-core processors, using MPI and the NAS Parallel Benchmarks as workload [6]. The first is the Maximum Weighted Perfect Matching (MWPM) algorithm [7], which uses the application communication pattern to build the mapping based on a variant of the Edmonds-Karp algorithm. The second method applies the Dual Recursive Bipartitioning (DRB) algorithm, performing the mapping from the application graph onto the machine architecture graph with a divide and conquer method [8]. Both heuristics are compared with the Exhaustive Search (ES), which builds the mapping by exhaustively combining every pair of processes on pairs of cores that share the same level of the memory hierarchy, finding the combination of process pairs with the maximum communication volume.
The objective of this evaluation is to analyze how close MWPM and DRB come to the best mapping method, i.e. the ES. The latter has factorial complexity, so it is not viable when the number of processes increases. Hence, by evaluating other methods that give a mapping solution in a shorter time, we can get an idea of which of them can be used in most static mappings on homogeneous multi-core clusters.
The organization of this paper is the following. In Section 2, we present the characteristics of static process mapping on multi-core clusters. Section 3 explains the benchmark used to compare the different mappings. Section 4 presents the three mapping methods compared in this work. The performance results of the heuristics are presented and analyzed in Section 5. In Section 6, we discuss related work that uses static process mapping. Finally, Section 7 is dedicated to conclusions and future work.
2. Process Mapping Overview
The process mapping of parallel applications is an NP-Hard problem [3], which means that there is no generic optimal solution for it. For that reason, we must work on a method that considers the details of a specific application, as well as the characteristics of the hardware on which the processes will be mapped, to find a local and suitable mapping heuristic for this specific environment and achieve reasonable performance. For instance, on a heterogeneous platform, processes whose tasks are specialized could be mapped to a core whose Instruction Set Architecture (ISA) is suited to their execution. Another example is the communication latency, which challenges the process mapping algorithm because it must place the processes on cores so that the communication overhead is minimized. These two examples can affect each other, because a process placed on a specialized core may communicate very often with another process that was mapped to a farther, also specialized, core.
In current multiprocessor architectures, such as clusters, multi-core processors are common. Thus, process mapping for high performance computing must consider at least three process communication levels. The first one is internode, in which processes placed on different nodes send messages to each other through the interconnection network. Intranode is the second one, and it happens when two processes within a node but on different processors communicate using shared memory or the interconnection bus. Finally, the last one is intrachip, which arose with multi-core processors. In this communication type, the cache memory is shared between the processes attached to their respective cores. Figure 1 shows the three communication types.

Figure 1: Internode, Intranode and Intrachip Communication Types [4].
To explain the types of process placement we show, in Figure 2, the topology of the Intel Xeon E5405 quad-core processor used in our experiments. If two processes communicate intensively with each other, with small to medium message sizes, they should be placed on cores that share the L2 cache level, so that the message passing latency is reduced [4].

Figure 2: Intel Xeon E5405 Quad-Core Processor Topology.
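As a side illustration (not part of the paper's experiments), the Linux sysfs interface exposes this cache sharing information; the C sketch below assumes the common layout in which cache index2 is the L2 cache, which matches processors like the Xeon E5405 but should be verified on other machines.

#include <stdio.h>

int main(void)
{
    char path[128], line[256];

    for (int cpu = 0; cpu < 8; cpu++) {
        /* index2 is usually the unified L2 cache on this kind of machine */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cache/index2/shared_cpu_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;                    /* different layout: just skip */
        if (fgets(line, sizeof(line), f) != NULL)
            printf("cpu%d shares its L2 with cpus %s", cpu, line);
        fclose(f);
    }
    return 0;
}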
3. Workload Description
For the purpose of evaluating the process mapping heuristics, we used the Numerical Aerodynamic Simulation Parallel Benchmark (NAS-NPB) workload, version 3.3 for MPI [6]. This benchmark suite was selected because of its generality and its heterogeneous communication patterns, so we can measure the performance of the three methods in a general way.
The Block Tridiagonal (BT), Scalar Pentadiagonal (SP) and Lower and Upper triangular systems (LU) applications are parallelized by dividing the space domain, with the sharing of their respective borders. The Conjugate Gradient (CG) has a random data access pattern, which is well suited for memory performance analysis. Multigrid (MG), in its turn, does not have a well defined access pattern, since information is fetched in a non-linear way in the beginning and access becomes more linear during the execution. The Embarrassingly Parallel (EP) application is used to measure the peak performance of computational systems, since its data processing is almost independent, which is the inverse of Integer Sort (IS), where there is high data dependency. Finally, Fast Fourier Transform (FT) merges linear data access with data sharing.
Almost all of these applications were developed in Fortran-90 because of its 64-bit floating-point instructions; only IS was implemented in C, since it operates only on integer data. As will be explained in more detail in Section 5, the binaries were generated with a hybrid C-Fortran compilation.
4. Exhaustive Search and Static Process
Mapping Heuristics
Our work consists of evaluating two heuristics based on the volume of process communication and comparing their respective performance. The heuristics are Maximum Weighted Perfect Matching (MWPM) and Dual Recursive Bipartitioning (DRB). The objective of these approaches is to minimize the communication latency of processes which have a considerable message-exchange volume.
For the description of both heuristics, it is important to make some definitions. Let A(P, B) be the application communication graph, where P is the set of processes representing the vertices of A, and B is the set of edges of A, in which B(px, py), with px, py ∈ P, represents the communication volume exchanged between processes px and py. In general, the three methods presented offer a way of placing a pair of processes (px, py) on cores that share a resource which minimizes the communication latency, which in this case is a level of the memory hierarchy.
A previous execution of the applications to be mapped is necessary in order to gather the process communication pattern – the amount of data transferred between each pair of processes. This information is stored in the set of edges B of graph A, represented as an adjacency matrix, and it is used as an input parameter for both mapping heuristics.
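For illustration only, the sketch below shows one possible in-memory representation of this adjacency matrix in C; the file name, file format and variable names are hypothetical and stand in for whatever the profiling step actually produces.

#include <stdio.h>
#include <stdlib.h>

#define NPROCS 16

/* comm[i][j] holds the number of bytes exchanged between ranks i and j,
 * i.e. the edge weight B(pi, pj) of the application graph A(P, B). */
static long long comm[NPROCS][NPROCS];

/* Read the matrix written by the profiling run: NPROCS x NPROCS integers. */
static int load_comm_matrix(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    for (int i = 0; i < NPROCS; i++)
        for (int j = 0; j < NPROCS; j++)
            if (fscanf(f, "%lld", &comm[i][j]) != 1) {
                fclose(f);
                return -1;
            }
    fclose(f);
    return 0;
}

int main(void)
{
    if (load_comm_matrix("comm_matrix.txt") != 0) {
        perror("comm_matrix.txt");
        return EXIT_FAILURE;
    }
    printf("bytes exchanged between ranks 0 and 1: %lld\n", comm[0][1]);
    return 0;
}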
The Exhaustive Search (ES) is used as the performance baseline. It provides the best process mapping, but it has factorial time complexity and is not suited for a considerable number of processes and cores. Therefore, a good way to overcome this issue is to give up the best mapping and use another method which finds a reasonably good mapping in polynomial time. In this way, the reduction in application performance is amortized by the shorter time spent defining the allocation of the application processes. We compare the MWPM and DRB heuristics with ES to examine how close the application performance obtained with their respective mappings comes to the best solution.
All these methods focus on mapping processes onto cores which share the L2 cache memory level. As output, they generate an affinity file containing the ranks of the processes and the numbers of the cores on which they should be mapped. The following subsections describe the ES method and the two heuristics: MWPM and DRB.
4.1. Exhaustive Search
This method ensures the best process placement because it obtains the maximum communication volume by exhaustively searching all combinations of process pairs that can be mapped onto pairs of cores. After finding the best combination, it maps each pair of processes to a respective pair of cores which shares the L2 cache.

The algorithm receives a list of all possible process pairing combinations, where each list element, that is, a possible pairing of the tasks, has n pairs. The best mapping is defined by selecting the element of the combination list where the sum of the communication volumes of all process pairs is maximum. Finally, these pairs are mapped to pairs of cores which share the L2 cache.
Although the ES method provides the best process mapping, the search comprises all task placement possibilities, which takes factorial time. Unfortunately, the number of cores that we have today is considerably high, which makes the search time of this method impracticable.
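As an illustration of this idea (not the authors' implementation), the following C sketch enumerates every way of splitting 16 processes into 8 pairs and keeps the pairing whose total intra-pair communication volume is largest; the comm matrix is assumed to come from the profiling step described above.

#include <stdio.h>
#include <string.h>

#define NPROCS 16

/* Communication matrix gathered in the profiling step (filled elsewhere). */
static long long comm[NPROCS][NPROCS];

static int best_pair[NPROCS];     /* best_pair[p] = partner of process p */
static long long best_volume = -1;

/* Recursively enumerate every perfect matching of the NPROCS processes,
 * keeping the one whose total intra-pair communication volume is largest. */
static void search(int paired[], int cur_pair[], long long volume)
{
    int first = -1;
    for (int i = 0; i < NPROCS; i++)
        if (!paired[i]) { first = i; break; }

    if (first < 0) {              /* every process is paired: leaf */
        if (volume > best_volume) {
            best_volume = volume;
            memcpy(best_pair, cur_pair, sizeof best_pair);
        }
        return;
    }

    paired[first] = 1;
    for (int j = first + 1; j < NPROCS; j++) {
        if (paired[j])
            continue;
        paired[j] = 1;
        cur_pair[first] = j;
        cur_pair[j] = first;
        search(paired, cur_pair, volume + comm[first][j]);
        paired[j] = 0;
    }
    paired[first] = 0;
}

int main(void)
{
    int paired[NPROCS] = {0}, cur_pair[NPROCS] = {0};

    search(paired, cur_pair, 0);
    /* Each selected pair would then be bound to a pair of cores sharing L2. */
    for (int p = 0; p < NPROCS; p++)
        if (p < best_pair[p])
            printf("pair (%d, %d)\n", p, best_pair[p]);
    return 0;
}

For 16 processes this amounts to 15 x 13 x ... x 1 = 2,027,025 pairings, which is still tractable, but the count grows factorially with the number of processes, which is why ES quickly becomes impracticable.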
4.2. Maximum Weighted Perfect Matching
MWPM defines a process placement in polynomial time by modeling the mapping problem as a maximum weight perfect matching on the communication graph. This approach is a feasible alternative and works in three steps. In the first step, the algorithm groups into pairs the processes that exchange a significant amount of messages, and places each pair on a pair of cores that shares an L2 cache, taking advantage of intrachip communication. The second step allocates each pair of process-pairs to the processors, using intranode communication. The third step distributes the pairs of process-pairs over the nodes, considering internode communication [15].
The algorithm works by choosing, from the application graph A(P, B), the processes that should stay closer within the memory hierarchy, based on the amount of communication between them. The purpose is to find a subset M of B such that for every process p ∈ P there is exactly one edge b ∈ M incident to p, and the sum of the weights of all edges b ∈ M is maximum. This problem is solved by the Edmonds-Karp algorithm in polynomial time, and it is applied three times to find the proper mapping with 16 processes.
The first application of the algorithm is straightforward and discovers the pairs of processes to be placed on pairs of cores that share the L2 cache level. The next application of the MWPM algorithm defines how to map each pair of process-pairs onto a pair of processors. A reasonable approach to achieve this is to define a sharing matrix whose rows and columns represent the process pairs defined by the edges of M. This matrix is used as input for the next execution of the algorithm, and each of its elements stores the amount of data shared by a pair of process-pairs. The equation defined to generate the matrix elements is shown in the formula below, where (x, y) and (z, k) are the process pairs attached to the edges of the subset M returned by the first execution of the algorithm, and M(p1, p2) is an edge of that subset weighted with the data volume exchanged between processes p1 and p2.
H((x,y),(z,k)) = M(x,z) + M(x,k) + M(y,z) + M(y,k)
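The sketch below shows this construction in C (variable names are hypothetical): pair[a] holds the two processes grouped by the first matching pass, comm[][] plays the role of M, and H[a][b] is filled exactly as in the formula above.

#include <stdio.h>

#define NPROCS 16
#define NPAIRS (NPROCS / 2)

static long long comm[NPROCS][NPROCS];  /* B(px, py) from the profiling run */
static int pair[NPAIRS][2];             /* process pairs from the first pass */
static long long H[NPAIRS][NPAIRS];     /* sharing matrix for the second pass */

/* H[a][b] = M(x,z) + M(x,k) + M(y,z) + M(y,k), where (x,y) is pair a and
 * (z,k) is pair b: the total traffic between the two groups of processes. */
static void build_sharing_matrix(void)
{
    for (int a = 0; a < NPAIRS; a++) {
        for (int b = 0; b < NPAIRS; b++) {
            if (a == b) {
                H[a][b] = 0;
                continue;
            }
            int x = pair[a][0], y = pair[a][1];
            int z = pair[b][0], k = pair[b][1];
            H[a][b] = comm[x][z] + comm[x][k] + comm[y][z] + comm[y][k];
        }
    }
}

int main(void)
{
    /* Toy input: pair i groups processes 2i and 2i+1. */
    for (int i = 0; i < NPAIRS; i++) {
        pair[i][0] = 2 * i;
        pair[i][1] = 2 * i + 1;
    }
    comm[0][2] = 100;                    /* some traffic between pairs 0 and 1 */
    build_sharing_matrix();
    printf("H[0][1] = %lld\n", H[0][1]); /* prints 100 */
    return 0;
}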
The third application of the algorithm repeats step two, but uses the matrix generated in step two to build a new matrix that represents the amount of communication between pairs of process-pairs, so that they can be distributed over the nodes.
4.3. Dual Recursive Bipartitioning
The DRB process mapping approach is based on a divide and conquer method to generate graph mappings [8]. Along with A(P, B), let T(C, L) be the target hardware platform graph, where the vertex set C represents the processing units – the cores of the hardware – and L is the set of edges of T, where L(cx, cy) is defined as the latency of the communication channel between the cores cx and cy. Consequently, to generate the graph T it is necessary to measure the communication latency between each pair of cores.
To create the mapping, the algorithm begins with the process set P of the application graph A, and a domain structure comprising the whole set of cores, that is, the vertex set C of the graph T. Then, it applies the functions domain bipartitioning and process bipartitioning to the domain structure and the process set, respectively, dividing the first into two core subdomains and the second into two disjoint process subsets. These functions are based on graph partitioning algorithms and ensure that processes which exchange a considerable number of messages among themselves are kept in the same subset, and that processors which have low latency to communicate with each other are kept in the same subdomain. The next step of the algorithm consists of attaching each of the new process subsets to a subdomain, thereby minimizing the communication volume between these two process subsets. These stages are repeated recursively until each disjoint process set has only one process – the base case – and the subdomain has only one core, so the singleton process subset is mapped to the singleton core subdomain [11]. In the end, the pairs of processes with the highest number of messages exchanged are mapped to processors with the lowest communication latency.
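The C sketch below only conveys the shape of this recursion; it is not Scotch's implementation, and the naive half-and-half split stands in for the graph partitioning step that actually keeps heavily communicating processes together and low-latency cores together.

#include <stdio.h>

/* Recursively co-bipartition the process set and the core domain, mapping
 * each singleton process subset to a singleton core subdomain. */
static void drb(const int *procs, int nprocs, const int *cores, int ncores)
{
    if (nprocs == 1 && ncores == 1) {     /* base case: singleton sets */
        printf("process %d -> core %d\n", procs[0], cores[0]);
        return;
    }
    /* Placeholder bipartition: the real method cuts the process graph so
     * heavily-communicating processes stay together, and cuts the core
     * domain so low-latency cores stay together. */
    int np_left = nprocs / 2, nc_left = ncores / 2;
    drb(procs, np_left, cores, nc_left);
    drb(procs + np_left, nprocs - np_left, cores + nc_left, ncores - nc_left);
}

int main(void)
{
    int procs[16], cores[16];

    for (int i = 0; i < 16; i++) {
        procs[i] = i;
        cores[i] = i;
    }
    drb(procs, 16, cores, 16);
    return 0;
}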
We used the DRB mapping method through Scotch, a software package that implements this algorithm. Unlike the other allocation techniques, Scotch generates a file containing the node on which each process should be mapped. Since the node processors are composed of more than one core, we used an operating system syscall to place the processes directly on cores.
5. Experimental Results
To generate the application performance results, we used the MPE library of the MPI distribution MPICH2-1.2. This library is a feature of MPI distributions used for logging specific events, and it was modified to collect the amount of data exchanged between each pair of processes. Thus, by executing the applications beforehand with MPE, we captured their communication patterns to generate the respective graphs A(P, B) and, finally, applied the mapping heuristics.
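The sketch below illustrates, in C, the general idea behind such instrumentation using the standard PMPI profiling interface; it is not the modified MPE used in the paper, and a complete tool would also need to intercept collective and non-blocking operations.

#include <mpi.h>
#include <stdio.h>

#define MAX_RANKS 16

static long long bytes_to[MAX_RANKS];   /* bytes this rank sent to each peer */

/* Wrapper: count the payload, then forward to the real MPI_Send (PMPI_Send).
 * Note: MPI-2 era headers declare buf as "void *" rather than "const void *". */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int type_size;
    MPI_Type_size(datatype, &type_size);
    if (dest >= 0 && dest < MAX_RANKS)
        bytes_to[dest] += (long long)count * type_size;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

/* Dump the per-peer totals when the application shuts MPI down. */
int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int dest = 0; dest < MAX_RANKS; dest++)
        if (bytes_to[dest] > 0)
            printf("rank %d -> rank %d: %lld bytes\n", rank, dest, bytes_to[dest]);
    return PMPI_Finalize();
}

Linking a wrapper like this in front of the MPI library makes every MPI_Send pass through it, and the per-destination totals from all ranks can then be merged into the adjacency matrix used by the heuristics.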
After obtaining the affinity files, we modified the NAS-NPB-MPI applications to include the operating system syscall sched_setaffinity, provided by the C header sched.h, to attach each process to its respective core. It is important to explain that, since the NAS applications are written in Fortran, the code using sched.h was compiled with a C compiler and later linked with the NAS applications, generating a single executable file for each application.
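A minimal sketch of such a C helper is shown below; the function name and the Fortran binding convention (trailing underscore, argument passed by reference) are assumptions about a typical Linux toolchain, not the authors' exact code.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to the core given by *core. The pointer argument
 * and the trailing underscore follow the usual Fortran calling convention
 * on Linux, so a Fortran program can CALL PIN_TO_CORE(mycore). */
void pin_to_core_(int *core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(*core, &set);
    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");
}

The Fortran benchmark would then call PIN_TO_CORE with the core number read from the affinity file, shortly after MPI initialization.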
All benchmarks were compiled with class size C, which is an intermediate standard size in this benchmark set. The BT, CG, EP, FT, IS, LU, SP and MG benchmarks were executed with 16 processes. To evaluate the three levels of communication – intrachip, intranode and internode – the applications with 16 processes ran on a cluster containing two nodes, each one with two Intel Xeon E5405 quad-core processors. Figure 3 shows the percentage of gain for ES and both heuristics, MWPM and DRB.
The EP and FT benchmarks present negative or very small gains because all the processes exchange the same amount of bytes, i.e. all the positions in the communication adjacency matrix have the same value. The CG benchmark has the best gains for both heuristics, because it has a larger difference in the amount of communication between specific pairs of processes.
As expected, the results show that ES has the best gain compared with the MPICH2 process manager default mapping, MWPM and DRB in most cases. However, the performance improvements of these last two heuristics are close to it. For the execution on 16 cores, ES had an average improvement of 14.54%, compared to 13.79% for MWPM and 14.07% for DRB.
Figure 3: Percentage of time gain of the three methods running NAS-NPB-MPI with 16 processes.

Table 1 presents the execution times of the MWPM and DRB mapping heuristics for each NAS application with 16 processes, compared to ES and to the MPICH2 default mapping. We performed multiple executions, and the table presents confidence intervals of 95%.
Even though MWPM and DRB present equivalent results, the first one was developed for homogeneous architectures that share the L2 cache between each pair of cores, while DRB can be applied to generic hardware architectures; hence DRB is recommended in most static process mapping situations.
6. Related Works
Several static process mapping heuristics have been proposed for Symmetric Multiprocessing (SMP) architectures [9][10], but since their compute nodes are composed of single-core processors, these heuristics are not concerned with resource sharing between multiple cores within a chip. This implies that these techniques may not improve application performance on clusters of multi-core processors, which is the focus of our work.
A strategy to reduce the communication latency of parallel applications that have a static communication pattern, identified by a previous execution, is presented in [11]. It also uses the DRB algorithm, but applied to a meteorology application. Once the communication pattern has been gathered, it maps the processes on multi-core clusters considering the cache memory sharing within the chip. The authors reached a speed-up of 9% using this technique; nevertheless, it is advantageous only for applications which are executed repeatedly, where the time spent in the previous execution is compensated by the speed-up achieved on the next runs. Moreover, in our work this algorithm is not restricted to one application, but is executed in a generic way to compare the heuristics.
The work presented in [4] evaluates multi-core architectures through a benchmark set that uses a data tiling algorithm to avoid the contention in cache and main memory that occurs when several simultaneous accesses to the same memory address are made [12]. An interesting conclusion is that when the message size is greater than the shared cache size, the benchmarks achieve higher performance using the InfiniBand interconnect than using shared memory. Nonetheless, using this algorithm requires modifying the application source code, which can be very difficult in real programs.
The work in [5] evaluates process affinity in static process mapping with MPI for SMP-CMP clusters, and the authors argue that its impact on the performance of parallel applications is not very clear. Since there is no clear understanding of its usage, a heuristic may be needed to define whether process affinity is useful for a specific case. The NAS benchmarks are then executed to analyze characteristics, such as scalability and communication latency, that indicate when affinity is worthwhile. In our work, we also analyze the impact of process affinity, but through three process mapping methods that exploit the communication latency characteristic. These methods aim to minimize communication latency by placing the processes on cores that share main memory and the L2 cache level.
Table 1: Execution time in seconds (mean and 95% confidence interval) for 16 processes, where Default is the MPICH2 process manager mapping.

     Default                    ES                         MWPM                       DRB
BT   373,37 [373,04; 373,69]   360,29 [359,68; 360,90]    361,14 [360,87; 361,40]    361,09 [360,79; 361,40]
CG   184,42 [184,32; 184,52]   109,50 [109,12; 109,88]    110,08 [109,84; 110,31]    110,09 [109,80; 110,38]
EP    34,54 [34,49; 34,59]      34,53 [34,26; 34,79]       34,79 [34,38; 35,20]       34,66 [34,31; 35,01]
FT   203,02 [202,04; 204,01]   201,42 [200,49; 202,34]    203,48 [202,43; 204,52]    202,60 [201,96; 203,24]
IS    18,03 [17,89; 18,16]      17,30 [17,20; 17,40]       17,32 [17,20; 17,44]       17,31 [17,20; 17,43]
LU   239,46 [239,02; 239,89]   216,09 [215,70; 216,48]    216,77 [216,46; 217,08]    217,06 [216,84; 217,28]
MG    40,69 [40,58; 40,80]      33,28 [33,26; 33,30]       34,00 [33,63; 34,37]       33,52 [33,21; 33,82]
SP   521,71 [520,13; 523,30]   491,60 [490,92; 492,27]    491,60 [490,92; 492,27]    492,43 [491,85; 493,01]

Hardware Locality (hwloc) is a software package presented in [13] that collects hardware information related to cores, caches and memory, and provides it to the mapping application. It was evaluated with pure MPI and hybrid MPI/OpenMP applications, which use the information fetched by this software to perform static and dynamic process mapping. The performance of the static mapping was checked by comparing
a Round-Robin process mapping with the DRB process mapping built using the data collected by hwloc, which obtained a speed-up of 26%. The experiments presented in our work could be extended by using this tool to discover the computer architecture when it is not known beforehand. In addition, we use the DRB method in an evaluation against two other mapping methods to verify whether it maintains this speed-up.
7. Conclusion
In this paper we have performed a performance evaluation of two static process mapping heuristics based on process communication volume. The Exhaustive Search is used as a baseline and presents the best performance, but its search comprises all task placement possibilities, which takes factorial time; it is therefore not viable when the number of processes increases.
The results show that both MWPM and DRB present gains compared with the MPICH2 process manager default mapping in most cases, and the performance improvements of these two heuristics are close to that of ES. For the execution on 16 cores, ES had an average improvement of 14.54%, compared to 13.79% for MWPM and 14.07% for DRB. On average, the applications mapped with both heuristics achieved equivalent performance improvements. However, MWPM was developed for homogeneous architectures that share the L2 cache between each pair of cores, while DRB does not have this restriction and can be applied to generic hardware architectures. This leads us to conclude that the latter heuristic should be used in most situations.
As future work, we intend to evaluate these heuristics taking into account the contention on the communication channels caused by the parallel execution of other processes that do not belong to the evaluated benchmark.
References
[1] Asanovic, K. et al. The Landscape of Parallel Computing Research: A View from Berkeley. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS-2006-183, December 2006.
[2] Kumar, R., Tullsen, D. M., Jouppi, N. P. and Ranganathan, P., Heterogeneous Chip Multiprocessors, Computer, vol. 38, issue 11, pp. 32-38, Nov. 2005.
[3] Bokhari, S.H. On the mapping problem. IEEE Trans.
Comput. C-30, 5, 207-214, 1981.
[4] Chai, L., Gao, Q., Panda, D. Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System. In Cluster Computing and the Grid (CCGRID), IEEE International Symposium on, pp. 471-478, 2007.
[5] Zhang, C.,Yuan, X., and Srinivasan, A. Processor
Affinity and MPI Performance on SMP-CMP Clusters, the
11th IPDPS Workshop on Parallel and Distributed Scientific
and Engineering Computing (PDSEC), 2010.
[6] National Aeronautics and Space Administration (NASA). NAS Parallel Benchmarks (NPB3.3). Available at http://www.nas.nasa.gov/Resources/Software/npb.html, accessed in May 2010.
[7] C. Osiakwan and S. Akl. The maximum weight perfect matching problem for complete weighted graphs is in PC. In Parallel and Distributed Processing, Proceedings of the Second IEEE Symposium on, pages 880-887, 1990.
[8] Pellegrini, F., Roman, J., Experimental Analysis of the
Dual Recursive Bipartitioning Algorithm for Static Mapping.
Research Report, pages 11-38, 1996.
[9] Bhanot, G. et al., Optimizing Task Layout on the Blue Gene/L Supercomputer, IBM Journal of Research and Development, pp. 489-500, 2005.
[10] Bollinger, S., Midkiff, S., Heuristic Technique for Processor and Link Assignment in Multicomputers. IEEE Transactions on Computers, vol. 40, issue 3, March 1991.
[11] Rodrigues, E., et al. Multicore Aware Process Mapping
and its Impact on Communication Overhead of Parallel
Applications. In Computers and Communications, ISCC
2009. IEEE Symposium on, pages 811-817, 5-8 2009.
[12] Drweiga, T., Shared Memory Contention and its Impact
on Multiprocessor Call Control Throughput. Lecture Notes
in Computer Science. Performance Engineering, p. 257-266,
2001.
[13] Broquedis, F., et al. hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications. In 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp. 180-186, 2010.
[14] Righi, R. R., et al. Applying Processes Rescheduling
Over Irregular BSP Application. In Computational Science –
ICCS 2009, volume 5544 of LNCS, pp. 213-223. Springer,
2009.
[15] Cruz, E. H. M., Alves, M. A. Z., Navaux, P. O. A., Process Mapping Based on Memory Access Traces. In 11th Symposium on Computing Systems (WSCAD-SCC), pp. 72-79, 2010.
[16] Buntinas, D., Mercier, G. and Gropp, W., Implementation and Evaluation of Shared-Memory Communication and Synchronization Operations in MPICH2 using the Nemesis Communication Subsystem, Parallel Computing, vol. 33, issue 9, pp. 634-644, Sep. 2007.
This paper shows an evaluation of processes rescheduling over an irregular BSP (Bulk Synchronous Parallel) application. Such application is based on dynamic programming and its irregularity is presented through the variation of computation density along the matrix’ cells. We are using MigBSP model for processes rescheduling, which combines multiple metrics - Computation, Communication and Memory - to decide about processes migration. The main contribution of this paper includes the viability to use processes migration on irregular BSP applications. Instead to adjust the load of each process by hand, we presented that automatic processes rebalancing is an effortless technique to obtain performance. The results showed gains greater than 10% over our multi-cluster architecture. Moreover, an acceptable overhead from MigBSP was observed when no migrations happen during application execution.