PARMETIS
Parallel Graph Partitioning and Sparse Matrix Ordering Library
Version 4.0
George Karypis and Kirk Schloegel
University of Minnesota, Department of Computer Science and Engineering
Minneapolis, MN 55455
karypis@cs.umn.edu
March 30, 2013
PARMETIS is copyrighted by the Regents of the University of Minnesota.
Contents
1 Introduction
2 Changes Across Key Releases
2.1 Changes between 4.0 and 3.2
2.2 Changes between 3.2 and 3.1
2.3 Changes between 3.0/3.1 and 2.0
3 Algorithms Used in PARMETIS
3.1 Unstructured Graph Partitioning
3.2 Partitioning Meshes Directly
3.3 Partitioning Adaptively Refined Meshes
3.4 Partition Refinement
3.5 Partitioning for Multi-phase and Multi-physics Computations
3.6 Partitioning for Heterogeneous Computing Architectures
3.7 Computing Fill-Reducing Orderings
4 PARMETIS’ API
4.1 Header files
4.2 Input and Output Formats used by PARMETIS
4.2.1 Format of the Input Graph
4.2.2 Format of Vertex Coordinates
4.2.3 Format of the Input Mesh
4.2.4 Format of the Computed Partitionings and Orderings
4.3 Numbering and Memory Allocation
5 Calling Sequence of the Routines in PARMETIS
5.1 Graph Partitioning
    ParMETIS_V3_PartKway
    ParMETIS_V3_PartGeomKway
    ParMETIS_V3_PartGeom
    ParMETIS_V3_PartMeshKway
5.2 Graph Repartitioning
    ParMETIS_V3_AdaptiveRepart
5.3 Partitioning Refinement
    ParMETIS_V3_RefineKway
5.4 Fill-reducing Orderings
    ParMETIS_V3_NodeND
    ParMETIS_V32_NodeND
5.5 Mesh to Graph Translation
    ParMETIS_V3_Mesh2Dual
6 Restrictions & Limitations
7 Hardware & Software Requirements, and Contact Information
1 Introduction
PARMETIS is an MPI-based parallel library that implements a variety of algorithms for partitioning and repartitioning
unstructured graphs and for computing fill-reducing orderings of sparse matrices. PARMETIS is particularly suited for
parallel numerical simulations involving large unstructured meshes. In this type of computation, PARMETIS dramatically
reduces the time spent in communication by computing mesh decompositions such that the number of interface
elements is minimized.
The algorithms in PARMETIS are based on the multilevel partitioning and fill-reducing ordering algorithms that are
implemented in the widely-used serial package METIS [5]. However, PARMETIS extends the functionality provided by
METIS and includes routines that are especially suited for parallel computations and large-scale numerical simulations.
In particular, PARMETIS provides the following functionality:
• Partition unstructured graphs and meshes.
• Repartition graphs that correspond to adaptively refined meshes.
• Partition graphs for multi-phase and multi-physics simulations.
• Improve the quality of existing partitionings.
• Compute fill-reducing orderings for sparse direct factorization.
• Construct the dual graphs of meshes.
The rest of this manual is organized as follows. Section 2 briefly describes the differences between major versions
of PARMETIS. Section 3 describes the various algorithms that are implemented in PARMETIS. Section 4.2 describes
the format of the basic parameters that need to be supplied to the routines. Section 5 provides a detailed description
of the calling sequences for the major routines in PARMETIS. Finally, Section 7 describes software and hardware
requirements and provides contact information.
2 Changes Across Key Releases
2.1 Changes between 4.0 and 3.2
The 4.0 release of PARMETIS represents a major code refactoring to allow full support of 64-bit architectures. As part
of that refactoring, no additional capabilities have been added to the library. However, since the 4.0 release relies on
the latest version of METIS, it allows for better support of multi-constraint partitioning. Here is the list of the major
changes in version 4.0:
• Support for 64-bit architectures by explicitly defining the width of the scalar “integer” data type (idx_t) used
  to store the adjacency structure of the graph.
• It is based on the 5.0 distribution of METIS, which itself contains many enhancements over the previous version.
• A complete rewrite of its internal memory management, which resulted in lower memory requirements.
• Better quality partitionings for multi-constraint partitioning problems.
2.2 Changes between 3.2 and 3.1
The major change in version 3.2 is its better support for computing fill-reducing orderings of sparse matrices.
Specifically, version 3.2 contains the following enhancements/additions:
• A new parallel separator refinement algorithm that leads to smaller separators and less fill-in.
• Parallel orderings can now be computed on a number of processors that is not a power of two.
• It provides support for computing multiple separators at each level (both during the parallel and the serial
  phases). The smallest separator among these multiple runs is selected.
• There is a new API routine, ParMETIS_V32_NodeND, that exposes additional parameters to the user in order
  to better control various aspects of the algorithm. The old API routine (ParMETIS_V3_NodeND) is still valid
  and is mapped to the new ordering routine.
The end result of these enhancements is that the quality of the orderings computed by PARMETIS is now comparable
to that of the orderings computed by METIS’ nested dissection routines. In addition, version 3.2 contains a number
of bug fixes and documentation corrections. Note that changes in the documentation are marked using change-bars.
2.3 Changes between 3.0/3.1 and 2.0
Version 3.x contains a number of changes over the previous major release (version 2.x). These changes include the
following:
Version 1.0     Version 2.0                  Version 3.0
PARKMETIS       ParMETIS_PartKway            ParMETIS_V3_PartKway
PARGKMETIS      ParMETIS_PartGeomKway        ParMETIS_V3_PartGeomKway
PARGMETIS       ParMETIS_PartGeom            ParMETIS_V3_PartGeom
PARGRMETIS      Not available                Not available
PARRMETIS       ParMETIS_RefineKway          ParMETIS_V3_RefineKway
PARUAMETIS      ParMETIS_RepartLDiffusion    ParMETIS_V3_AdaptiveRepart
PARDAMETIS      ParMETIS_RepartGDiffusion    ParMETIS_V3_AdaptiveRepart
Not available   ParMETIS_RepartRemap         ParMETIS_V3_AdaptiveRepart
Not available   ParMETIS_RepartMLRemap       ParMETIS_V3_AdaptiveRepart
PAROMETIS       ParMETIS_NodeND              ParMETIS_V3_NodeND
Not available   Not available                ParMETIS_V3_PartMeshKway
Not available   Not available                ParMETIS_V3_Mesh2Dual
Table 1: The relationships between the names of the routines in the different versions of PARMETIS.
• The names and calling sequences of all the routines have changed due to the expanded functionality that has been
  provided in this release. Table 1 shows how the names of the various routines map from version to version. Note
  that Version 3.0 is fully backwards compatible with all previous versions of PARMETIS. That is, the old API
  calls have been mapped to the new routines. However, the expanded functionality provided with this release is
  only available by using the new calling sequences.
• The four adaptive repartitioning routines, ParMETIS_RepartLDiffusion, ParMETIS_RepartGDiffusion,
  ParMETIS_RepartRemap, and ParMETIS_RepartMLRemap, have been replaced by a (single) implementation
  of a unified repartitioning algorithm [15], ParMETIS_V3_AdaptiveRepart, that combines the best features
  of the previous routines.
• Multiple vertex weights/balance constraints are supported for most of the routines. This allows PARMETIS to be
  used to partition graphs for multi-phase and multi-physics simulations.
• In order to optimize partitionings for specific heterogeneous computing architectures, it is now possible to
  specify the target sub-domain weights for each of the sub-domains and for each balance constraint. This feature,
  for example, allows the user to compute a partitioning in which one of the sub-domains is twice the size of each
  of the others.
• The number of sub-domains has been de-coupled from the number of processors in both the static and the
  adaptive partitioning schemes. Hence, it is now possible to use the parallel partitioning and repartitioning
  algorithms to compute a k-way partitioning independent of the number of processors that are used. Note that
  Version 2.0 provided this functionality for the static partitioning schemes only.
• Routines are provided both for directly partitioning a finite element mesh and for constructing the dual graph
  of a mesh in parallel. In version 3.1 these routines have been extended to support mixed element meshes.
3 Algorithms Used in PARMETIS
PARMETIS provides a variety of routines that can be used to compute different types of partitionings and repartitionings
as well as fill-reducing orderings. Figure 1 provides an overview of the functionality provided by PARMETIS as well
as a guide to its use.
[Figure 1: flowchart that maps each operation (partition a graph, partition a mesh, refine the quality of a
partitioning, repartition a graph corresponding to an adaptively refined mesh, compute a fill-reducing ordering,
construct a graph from a mesh) to the routine that implements it.]
Figure 1: A brief overview of the functionality provided by PARMETIS. The shaded boxes correspond to the actual
routines in PARMETIS that implement each particular operation.
3.1 Unstructured Graph Partitioning
ParMETIS_V3_PartKway is the routine in PARMETIS that is used to partition unstructured graphs. This routine takes
a graph and computes a k-way partitioning (where k is equal to the number of sub-domains desired) while attempting
to minimize the number of edges that are cut by the partitioning (i.e., the edge-cut). ParMETIS_V3_PartKway makes
no assumptions on how the graph is initially distributed among the processors. It can effectively partition a graph that
is randomly distributed as well as a graph that is well distributed.¹ If the graph is initially well distributed among the
processors, ParMETIS_V3_PartKway will take less time to run. However, the quality of the computed partitionings
does not depend on the initial distribution.
[Figure 2: diagram of multilevel k-way partitioning, showing the coarsening phase, the initial partitioning phase,
and the uncoarsening phase on the graphs G_0 to G_4.]
Figure 2: The three phases of multilevel k-way graph partitioning. During the coarsening phase, the size of the graph
is successively decreased. During the initial partitioning phase, a k-way partitioning is computed. During the multilevel
refinement (or uncoarsening) phase, the partitioning is successively refined as it is projected to the larger graphs.
G_0 is the input graph, which is the finest graph. G_{i+1} is the next level coarser graph of G_i. G_4 is the coarsest graph.
¹ The reader should note the difference between the terms graph distribution and graph partition. A partitioning is a
mapping of the vertices to the processors that results in a distribution. In other words, a partitioning specifies a
distribution. In order to partition a graph in parallel, an initial distribution of the nodes and edges of the graph among
the processors is required. For example, consider a graph that corresponds to the dual of a finite-element mesh. This
graph could initially be partitioned simply by mapping groups of n/p consecutively numbered elements to each processor,
where n is the number of elements and p is the number of processors. Of course, this naive approach is not likely to
result in a very good distribution because elements that belong to a number of different regions of the mesh may get
mapped to the same processor. (That is, each processor may get a number of small sub-domains as opposed to a single
contiguous sub-domain.) Hence, you would want to compute a new high-quality partitioning for the graph and then
redistribute the mesh accordingly. Note that it may also be the case that the initial graph is well distributed, as when
meshes are adaptively refined and repartitioned.
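The naive initial distribution mentioned in footnote 1 (groups of about n/p consecutively numbered elements per processor) can be sketched as follows. This is only an illustration: the helper names block_starts and block_owner are hypothetical, idx_t is a local stand-in for the type defined in metis.h, and the choice to give the first n mod p processors one extra element is an assumption, since the text only asks for groups of roughly n/p elements.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t idx_t;  /* stand-in for idx_t from metis.h (assumes a 32-bit build) */

/* Fill starts[0..p] so that processor r owns elements starts[r] .. starts[r+1]-1.
 * The n elements are split into p consecutive blocks whose sizes differ by at most one. */
void block_starts(idx_t n, idx_t p, idx_t *starts)
{
    assert(n >= 0 && p > 0);
    idx_t base = n / p;   /* every processor gets at least this many elements */
    idx_t extra = n % p;  /* the first 'extra' processors get one more */
    starts[0] = 0;
    for (idx_t r = 0; r < p; r++)
        starts[r + 1] = starts[r] + base + (r < extra ? 1 : 0);
}

/* Return the processor that owns element e under the distribution in starts[]. */
idx_t block_owner(idx_t e, idx_t p, const idx_t *starts)
{
    assert(e >= starts[0] && e < starts[p]);
    idx_t r = 0;
    while (e >= starts[r + 1])
        r++;
    return r;
}
```

With n = 10 and p = 3 this yields the blocks [0,4), [4,7), and [7,10). Such a distribution is cheap to set up but, as the footnote explains, is usually a poor partitioning; it only serves as the starting point for the parallel partitioner.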
The parallel graph partitioning algorithm used in ParMETIS_V3_PartKway is based on the serial multilevel k-way
partitioning algorithm described in [6,7] and parallelized in [4,14]. This algorithm has been shown to quickly
produce partitionings that are of very high quality. It consists of three phases: graph coarsening, initial partitioning,
and uncoarsening/refinement. In the graph coarsening phase, a series of graphs is constructed by collapsing together
adjacent vertices of the input graph in order to form a related coarser graph. Computation of the initial partitioning
is performed on the coarsest (and hence smallest) of these graphs, and so is very fast. Finally, partition refinement is
performed on each level graph, from the coarsest to the finest (i.e., original graph) using a KL/FM-type refinement
algorithm [2,9]. Figure 2 illustrates the multilevel graph partitioning paradigm.
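The quantity that the partitioning and refinement phases try to minimize, the edge-cut, can be computed for any given partitioning. The sketch below does this serially for a graph stored in the xadj/adjncy arrays of the CSR format that Section 4.2.1 describes. The function name edge_cut is hypothetical, idx_t is a local stand-in for the type from metis.h, and the code assumes an undirected graph in which every edge is stored in both directions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t idx_t;  /* stand-in for idx_t from metis.h (assumes a 32-bit build) */

/* Return the edge-cut of a partitioning of an undirected graph in CSR form.
 * xadj has nvtxs+1 entries; the neighbors of vertex v are adjncy[xadj[v] .. xadj[v+1]-1].
 * adjwgt holds one weight per adjacency entry, or is NULL if every edge has weight 1.
 * part[v] is the sub-domain assigned to vertex v. */
idx_t edge_cut(idx_t nvtxs, const idx_t *xadj, const idx_t *adjncy,
               const idx_t *adjwgt, const idx_t *part)
{
    assert(nvtxs >= 0 && xadj != NULL && adjncy != NULL && part != NULL);
    idx_t twice_cut = 0;
    for (idx_t v = 0; v < nvtxs; v++)
        for (idx_t j = xadj[v]; j < xadj[v + 1]; j++)
            if (part[v] != part[adjncy[j]])
                twice_cut += adjwgt ? adjwgt[j] : 1;
    return twice_cut / 2;  /* each cut edge was counted from both of its ends */
}
```

For the path graph 0-1-2-3, splitting the vertices as {0,1} and {2,3} cuts one edge, while the alternating assignment {0,2} and {1,3} cuts all three.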
PARMETIS provides the ParMETIS_V3_PartGeomKway routine for computing partitionings for graphs derived
from finite element meshes in which the vertices have coordinates associated with them. Given a graph that is
distributed among the processors and the coordinates of the vertices, ParMETIS_V3_PartGeomKway quickly computes
an initial partitioning using a space-filling curve method, redistributes the graph according to this partitioning, and
then calls ParMETIS_V3_PartKway to compute the final high-quality partitioning. Our experiments have shown that
ParMETIS_V3_PartGeomKway is often two times faster than ParMETIS_V3_PartKway, and achieves identical
partition quality. Note that depending on how the graph is constructed from the underlying mesh, the coordinates can
correspond to either the actual node coordinates of the mesh (nodal graphs) or the coordinates of
the element centers (dual graphs).
PARMETIS also provides the ParMETIS_V3_PartGeom function for partitioning unstructured graphs when coordinates
for the vertices are available. ParMETIS_V3_PartGeom computes a partitioning based only on the space-filling
curve method. Therefore, it is extremely fast (often 5 to 10 times faster than ParMETIS_V3_PartGeomKway), but it
computes poor quality partitionings (it may cut 2 to 10 times more edges than ParMETIS_V3_PartGeomKway). This
routine can be useful for certain computations in which the use of space-filling curves is the appropriate partitioning
technique (e.g., n-body computations).
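To make the idea of a space-filling curve partitioning concrete, the following serial sketch orders 2D points along a Z-order (Morton) curve and cuts the ordering into equally sized contiguous pieces. This manual does not say which curve PARMETIS uses, so the Morton curve here is an assumption chosen for brevity, as are the function names and the use of integer grid coordinates (real coordinates would first have to be quantized).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Spread the 32 bits of v over the even bit positions of a 64-bit word. */
static uint64_t spread_bits(uint32_t v)
{
    uint64_t x = v;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FFULL;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FULL;
    x = (x | (x << 2))  & 0x3333333333333333ULL;
    x = (x | (x << 1))  & 0x5555555555555555ULL;
    return x;
}

/* Z-order key: bits of x on even positions, bits of y on odd positions. */
uint64_t morton2d(uint32_t x, uint32_t y)
{
    return spread_bits(x) | (spread_bits(y) << 1);
}

typedef struct { uint64_t key; int idx; } keyed_vertex;

static int cmp_key(const void *a, const void *b)
{
    const keyed_vertex *ka = a, *kb = b;
    if (ka->key != kb->key)
        return ka->key < kb->key ? -1 : 1;
    return ka->idx - kb->idx;  /* deterministic tie-break on the vertex index */
}

/* Assign each of the n points to one of nparts sub-domains by sorting the points
 * along the curve and cutting the sorted sequence into contiguous chunks.
 * Returns 0 on success and -1 if memory could not be allocated. */
int sfc_partition(int n, const uint32_t *x, const uint32_t *y, int nparts, int *part)
{
    assert(n > 0 && nparts > 0);
    keyed_vertex *kv = malloc((size_t)n * sizeof *kv);
    if (kv == NULL)
        return -1;
    for (int i = 0; i < n; i++) {
        kv[i].key = morton2d(x[i], y[i]);
        kv[i].idx = i;
    }
    qsort(kv, (size_t)n, sizeof *kv, cmp_key);
    for (int rank = 0; rank < n; rank++)
        part[kv[rank].idx] = (int)((int64_t)rank * nparts / n);
    free(kv);
    return 0;
}
```

Because the method looks only at coordinates and never at the edges of the graph, it is fast but gives no guarantee about the edge-cut, which matches the speed and quality trade-off described above.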
3.2 Partitioning Meshes Directly
PARMETIS also provides routines that support the computation of partitionings and repartitionings given meshes (and
not graphs) as inputs. In particular, ParMETIS_V3_PartMeshKway takes a mesh as input and computes a partitioning
of the mesh elements. Internally, ParMETIS_V3_PartMeshKway uses a mesh-to-graph routine and then calls the
same core partitioning routine that is used by ParMETIS_V3_PartKway.
PARMETIS provides no such routines for computing adaptive repartitionings directly from meshes. However, it
does provide the routine ParMETIS_V3_Mesh2Dual for constructing a dual graph given a mesh, quickly and in
parallel. Since the construction of the dual graph is in parallel, it can be used to construct the input graph for
ParMETIS_V3_AdaptiveRepart.
3.3 Partitioning Adaptively Refined Meshes
For large-scale scientific simulations, the computational requirements of techniques relying on globally refined meshes
become very high, especially as the complexity and size of the problems increase. By locally refining and de-refining
the mesh either to capture flow-field phenomena of interest [1] or to account for variations in errors [11], adaptive
methods make standard computational methods more cost effective. The efficient execution of such adaptive scientific
simulations on parallel computers requires a periodic repartitioning of the underlying computational mesh. These
repartitionings should minimize both the inter-processor communications incurred in the iterative mesh-based
computation and the data redistribution costs required to balance the load. Hence, adaptive repartitioning is a
multi-objective optimization problem. PARMETIS provides the routine ParMETIS_V3_AdaptiveRepart for repartitioning
such adaptively refined meshes. This routine assumes that the mesh is well distributed among the processors, but that
(due to mesh refinement and de-refinement) this distribution is poorly load balanced.
Repartitioning algorithms fall into two general categories. The first category balances the computation by
incrementally diffusing load from those sub-domains that have more work to adjacent sub-domains that have less work.
These schemes are referred to as diffusive schemes. The second category balances the load by computing an entirely
new partitioning, and then intelligently mapping the sub-domains of the new partitioning to the processors such that
the redistribution cost is minimized. These schemes are generally referred to as remapping schemes. Remapping
schemes typically lead to repartitionings that have smaller edge-cuts, while diffusive schemes lead to repartitionings
that incur smaller redistribution costs. However, since these results can vary significantly among different types of
applications, it can be difficult to select the best repartitioning scheme for the job.
ParMETIS_V3_AdaptiveRepart is a parallel implementation of the Unified Repartitioning Algorithm [15] for
adaptive repartitioning that combines the best characteristics of remapping and diffusion-based repartitioning schemes.
A key parameter used by this algorithm is the ITR Factor. This parameter describes the ratio between the time
required for performing the inter-processor communications incurred during parallel processing compared to the time
to perform the data redistribution associated with balancing the load. As such, it allows us to compute a single metric
that describes the quality of the repartitioning, even though adaptive repartitioning is a multi-objective optimization
problem.
ParMETIS_V3_AdaptiveRepart is based on the multilevel partitioning algorithm, and so is similar in nature
to the algorithm implemented in ParMETIS_V3_PartKway. However, this routine uses a technique known as
local coarsening. Here, only vertices that have been distributed onto the same processor are coarsened together. On
the coarsest graph, an initial partitioning need not be computed, as one can either be derived from the initial graph
distribution (in the case when sub-domains are coupled to processors), or else one needs to be supplied as an input to
the routine (in the case when sub-domains are de-coupled from processors). However, this partitioning does need to
be balanced. The balancing phase is performed on the coarsest graph twice by alternative methods. That is, optimized
variants of remapping and diffusion algorithms [16] are both used to compute new partitionings. A quality metric
for each of these partitionings is then computed (using the ITR Factor) and the partitioning with the highest quality
is selected. This technique tends to give very good points from which to start multilevel refinement, regardless of
the type of repartitioning problem or the value of the ITR Factor. Note that the fact that the algorithm computes
two initial partitionings does not impact its scalability as long as the size of the coarsest graph is suitably small [8].
Finally, multilevel refinement is performed on the balanced partitioning in order to further improve its quality. Since
ParMETIS_V3_AdaptiveRepart starts from a graph that is already well distributed, it is extremely fast.
Appropriate values to pass for the ITR Factor parameter can easily be determined depending on the times required
to perform (i) all inter-processor communications that have occurred since the last repartitioning, and (ii) the data
redistribution associated with the last repartitioning/load balancing phase. Simply divide the first time by the second.
The result is the correct ITR Factor. In case these times cannot be ascertained (e.g., for the first repartitioning/load
balancing phase), our experiments have shown that values between 100 and 1000 work well for a variety of situations.
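The rule for choosing the ITR Factor can be written down directly. In the sketch below, itr_factor divides the measured communication time by the measured redistribution time and falls back to a caller-supplied value when the times are not available. The second half illustrates how a single quality metric might be formed from two objectives. The cost formula (ITR Factor times edge-cut plus the amount of data moved, lower being better) is an assumption based on the published Unified Repartitioning formulation and is not stated in this manual. All function names are hypothetical.

```c
#include <assert.h>

/* ITR Factor = (time spent in inter-processor communication since the last
 * repartitioning) / (time spent in the last data redistribution).
 * If either time is unavailable (passed as <= 0), return 'fallback' instead. */
double itr_factor(double comm_time, double redist_time, double fallback)
{
    assert(fallback > 0.0);
    if (comm_time <= 0.0 || redist_time <= 0.0)
        return fallback;  /* e.g. the first repartitioning/load balancing phase */
    return comm_time / redist_time;
}

/* ASSUMPTION: combined cost of a repartitioning, lower is better.
 * edgecut stands for the communication volume, moved for the redistributed data. */
double repartition_cost(double itr, double edgecut, double moved)
{
    return itr * edgecut + moved;
}

/* Return 0 if candidate A has the lower (or equal) cost, 1 if candidate B does. */
int pick_candidate(double itr, double cut_a, double moved_a,
                   double cut_b, double moved_b)
{
    return repartition_cost(itr, cut_a, moved_a) <=
           repartition_cost(itr, cut_b, moved_b) ? 0 : 1;
}
```

Under this assumed metric, a large ITR Factor favors the candidate with the smaller edge-cut (typically the remapping result), while a small ITR Factor favors the candidate that moves less data (typically the diffusion result).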
ParMETIS_V3_AdaptiveRepart can be used to load balance the mesh either before or after mesh adaptation. In
the latter case, each processor first locally adapts its mesh, leading to different processors having different numbers of
elements. ParMETIS_V3_AdaptiveRepart can then compute a partitioning in which the load is balanced. However,
load balancing can also be done before adaptation if the degree of refinement for each element can be estimated a
priori. That is, if we know ahead of time into how many new elements each old element will subdivide, we can use
these estimates as the weights of the vertices for the graph that corresponds to the dual of the mesh. In this case,
the mesh can be redistributed before adaptation takes place. This technique can significantly reduce data redistribution
times [10].
3.4 Partition Refinement
ParMETIS_V3_RefineKway is the routine provided by PARMETIS to improve the quality of an existing partitioning.
Once a graph is partitioned (and has been redistributed accordingly), ParMETIS_V3_RefineKway can be called to
compute a new partitioning that further improves the quality. ParMETIS_V3_RefineKway can be used to improve
the quality of partitionings that are produced by other partitioning algorithms (such as the technique discussed in
Section 3.1 that is used in ParMETIS_V3_PartGeom). ParMETIS_V3_RefineKway can also be used repeatedly to
further improve the quality of a partitioning. However, each successive call to ParMETIS_V3_RefineKway will tend
to produce smaller improvements in quality.
3.5 Partitioning for Multi-phase and Multi-physics Computations
The traditional graph partitioning problem formulation is limited in the types of applications that it can effectively
model because it specifies that only a single quantity be load balanced. Many important types of multi-phase and
multi-physics computations require that multiple quantities be load balanced simultaneously. This is because
synchronization steps exist between the different phases of the computations, and so, each phase must be individually
load balanced. That is, it is not sufficient to simply sum up the relative times required for each phase and to compute
a partitioning based on this sum. Doing so may lead to some processors having too much work during one phase of the
computation (and so, these may still be working after other processors are idle), and not enough work during another.
Instead, it is critical that every processor have an equal amount of work from each phase of the computation.
Two examples are particle-in-cells [17] and contact-impact simulations [3]. Figure 3 illustrates the characteristics
of partitionings that are needed for these simulations. Figure 3(a) shows a mesh for a particle-in-cells computation.
Assuming that a synchronization separates the mesh-based computation from the particle computation, a partitioning
is required that balances both the number of mesh elements and the number of particles across the sub-domains.
Figure 3(b) shows a mesh for a contact-impact simulation. During the contact detection phase, computation is performed
only on the surface (i.e., lightly shaded) elements, while during the impact phase, computation is performed on all of
the elements. Therefore, in order to ensure that both phases are load balanced, a partitioning must balance both the
total number of mesh elements and the number of surface elements across the sub-domains. The solid partitioning in
Figure 3(b) does this. The dashed partitioning is similar to what a traditional graph partitioner might compute. This
partitioning balances only the total number of mesh elements. The surface elements are imbalanced by over 50%.
A new formulation of the graph partitioning problem is presented in [6] that is able to model the problem of
balancing multiple computational phases simultaneously, while also minimizing the inter-processor communications.
In this formulation, a weight vector of size m is assigned to each vertex of the graph. The multi-constraint graph
partitioning problem then is to compute a partitioning such that the edge-cut is minimized and that every sub-domain
has approximately the same amount of each of the vertex weights. The routines ParMETIS_V3_PartKway,
ParMETIS_V3_PartGeomKway, ParMETIS_V3_RefineKway, and ParMETIS_V3_AdaptiveRepart are all able to
compute partitionings that satisfy multiple balance constraints.
[Figure 3: two computational meshes, panels (a) and (b).]
Figure 3: A computational mesh for a particle-in-cells simulation (a) and a computational mesh for a contact-impact
simulation (b). The particle-in-cells mesh is partitioned so that both the number of mesh elements and the number of
particles are balanced across the sub-domains. Two partitionings are shown for the contact-impact mesh. The dashed
partitioning balances only the number of mesh elements. The solid partitioning balances both the number of mesh
elements and the number of surface (lightly shaded) elements across the sub-domains.
[Figure 4: a dual graph whose vertices carry two-component weight vectors such as (1, 0) and (1, 4), shown next to
the partitioned mesh.]
Figure 4: A dual graph with vertex weight vectors of size two is constructed from the particle-in-cells mesh from
Figure 3. A multi-constraint partitioning has been computed for this graph, and this partitioning has been projected
back to the mesh.
Figure 4 gives the dual graph for the particle-in-cells mesh shown in Figure 3. Each vertex has two weights here.
The first represents the work associated with the mesh-based computation for the corresponding element. (These are all
ones because we assume in this case that all of the elements have the same amount of mesh-based work associated with
them.) The second weight represents the work associated with the particle-based computation. This value is estimated
by the number of particles that fall within each element. A multi-constraint partitioning is shown that balances both of
these weights.
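Whether a given partitioning balances each of the vertex weights can be checked one constraint at a time. The sketch below computes, for a chosen constraint, the ratio of the heaviest sub-domain to the average sub-domain weight, so that 1.0 means perfect balance. The function name is hypothetical, idx_t is a local stand-in for the type from metis.h, and the code assumes that the ncon weights of each vertex are stored contiguously (vwgt[v*ncon + c]).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef int32_t idx_t;  /* stand-in for idx_t from metis.h (assumes a 32-bit build) */

/* Load imbalance of constraint c: (max sub-domain weight) / (average sub-domain weight).
 * vwgt[v*ncon + c] is the c-th weight of vertex v; part[v] is its sub-domain.
 * Returns 1.0 if the total weight is zero and -1.0 if memory could not be allocated. */
double load_imbalance(idx_t nvtxs, int ncon, const idx_t *vwgt,
                      const idx_t *part, idx_t nparts, int c)
{
    assert(nvtxs > 0 && nparts > 0 && c >= 0 && c < ncon);
    int64_t *pwgt = calloc((size_t)nparts, sizeof *pwgt);
    if (pwgt == NULL)
        return -1.0;
    int64_t total = 0;
    for (idx_t v = 0; v < nvtxs; v++) {
        int64_t w = vwgt[(int64_t)v * ncon + c];
        pwgt[part[v]] += w;
        total += w;
    }
    int64_t maxw = 0;
    for (idx_t p = 0; p < nparts; p++)
        if (pwgt[p] > maxw)
            maxw = pwgt[p];
    free(pwgt);
    if (total == 0)
        return 1.0;  /* nothing to balance for this constraint */
    return (double)maxw * (double)nparts / (double)total;
}
```

For four elements with weights (1,0), (1,4), (1,0), (1,4), the partitioning {0,0,1,1} is balanced in both constraints, whereas {0,1,0,1} balances the mesh work but puts all of the particle work on one sub-domain, giving an imbalance of 2.0 for the second constraint.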
3.6 Partitioning for Heterogeneous Computing Architectures
Complex, heterogeneous computing platforms, such as groups of tightly-coupled shared-memory nodes that are
loosely connected via high bandwidth and high latency interconnection networks, and/or processing nodes that have
complex memory hierarchies, are becoming more common, as they display competitive cost-to-performance ratios.
The same is true of platforms that are geographically distributed. Most existing parallel simulation codes can easily
be ported to a wide range of parallel architectures as they employ a standard messaging layer such as MPI. However,
complex and heterogeneous architectures present new challenges to the scalable execution of such codes, since many
of the basic parallel algorithm design assumptions are no longer valid.
We have taken the first steps toward developing architecture-aware graph-partitioning algorithms. These are able
to compute partitionings that allow computations to achieve the highest levels of performance regardless of the
computing platform. Specifically, we have enabled ParMETIS_V3_PartKway, ParMETIS_V3_PartGeomKway,
ParMETIS_V3_PartMeshKway, ParMETIS_V3_RefineKway, and ParMETIS_V3_AdaptiveRepart to compute
efficient partitionings for networks of heterogeneous processors. To do so, these routines require an additional array
(tpwgts) to be passed as a parameter. This array describes the fraction of the total vertex weight each sub-domain
should contain. For example, if you have a network of four processors, the first three of which are of equal
processing speed, and the fourth of which is twice as fast as the others, you would pass an array containing the
values (0.2, 0.2, 0.2, 0.4). Note that by allowing users to specify target sub-domain weights as such, heterogeneous
processing power can be taken into account when computing a partitioning. However, this does not allow us to take
heterogeneous network bandwidths and latencies into account. Optimizing partitionings for heterogeneous networks
is still the focus of ongoing research.
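The target weights in the example above follow from normalizing the relative processor speeds so that they sum to one. A minimal sketch is given below. The function name is hypothetical, real_t is a local stand-in for the type from metis.h, and the layout tpwgts[p*ncon + c] (all constraints of a sub-domain stored contiguously) is an assumption that should be checked against the description of the tpwgts parameter in Section 5.

```c
#include <assert.h>

typedef float real_t;  /* stand-in for real_t from metis.h (assumes single precision) */

/* Fill tpwgts so that sub-domain p receives the fraction speed[p] / sum(speed)
 * of the total vertex weight for every one of the ncon balance constraints.
 * ASSUMPTION: the entry for sub-domain p and constraint c is tpwgts[p*ncon + c]. */
void tpwgts_from_speeds(int nparts, int ncon, const double *speed, real_t *tpwgts)
{
    assert(nparts > 0 && ncon > 0);
    double total = 0.0;
    for (int p = 0; p < nparts; p++) {
        assert(speed[p] > 0.0);
        total += speed[p];
    }
    for (int p = 0; p < nparts; p++)
        for (int c = 0; c < ncon; c++)
            tpwgts[p * ncon + c] = (real_t)(speed[p] / total);
}
```

Passing the relative speeds (1, 1, 1, 2) with a single constraint reproduces the values (0.2, 0.2, 0.2, 0.4) from the text. Equal speeds give 1/nparts for every entry.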
3.7 Computing Fill-Reducing Orderings
ParMETIS_V3_NodeND and ParMETIS_V32_NodeND are the routines provided by PARMETIS for computing
fill-reducing orderings, suited for Cholesky-based direct factorization algorithms. Note that ParMETIS_V3_NodeND is
simply a wrapper around the more general ParMETIS_V32_NodeND routine and is included for backward
compatibility. ParMETIS_V32_NodeND makes no assumptions on how the graph is initially distributed among the
processors. It can effectively compute fill-reducing orderings for graphs that are randomly distributed as well as
graphs that are well distributed.
The algorithm implemented by ParMETIS_V32_NodeND is based on a multilevel nested dissection algorithm.
This algorithm has been shown to produce low fill orderings for a wide variety of matrices. Furthermore, it leads
to balanced elimination trees that are essential for parallel direct factorization. ParMETIS_V32_NodeND uses a
multilevel node-based refinement algorithm that is particularly suited for directly refining the size of the separators.
To achieve high performance, ParMETIS_V32_NodeND first uses ParMETIS_V3_PartKway to compute a
high-quality partitioning and redistributes the graph accordingly. Next it proceeds to compute the ⌊log p⌋ levels of the
elimination tree concurrently. When the graph has been separated into p parts (where p is the number of processors),
the graph is redistributed among the processors so that each processor receives a single subgraph, and METIS’ serial
nested dissection ordering algorithm is used to order these smaller subgraphs.
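The number of elimination tree levels that are computed in parallel, ⌊log p⌋, is easy to evaluate with integer arithmetic. The helper below is a hypothetical illustration and assumes that the logarithm is taken base two, on the grounds that each dissection level splits a subgraph into two.

```c
#include <assert.h>

/* Return floor(log2(nprocs)), the assumed number of nested dissection levels
 * that are computed in parallel before the serial ordering takes over. */
int parallel_nd_levels(int nprocs)
{
    assert(nprocs >= 1);
    int levels = 0;
    while (nprocs > 1) {
        nprocs /= 2;  /* one more level of bisection */
        levels++;
    }
    return levels;
}
```

For example, 8 processors give 3 parallel levels, and 6 processors give 2.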
4 PARMETIS’ API
The various routines implemented in PARMETIS can be accessed from a C, C++, or Fortran program by using the
supplied library. In the rest of this section we describe PARMETIS’ API by first describing various calling and usage
conventions and the various data structures used to pass information into and get information out of the routines,
followed by a detailed description of the calling sequence of the various routines.
4.1 Header files
Any program using PARMETIS’ API needs to include the parmetis.h header file. This file provides function
prototypes for the various API routines and defines the various data types and constants used by these routines.
During PARMETIS’ installation, the metis/include/metis.h file defines two important data types and their
widths. These are the idx_t data type for storing integer quantities and the real_t data type for storing floating
point quantities. The idx_t data type can be defined to be either a 32- or 64-bit signed integer, whereas the real_t
data type can be defined to be either a single or double precision floating point number. All of PARMETIS’ API routines
take as input arrays and/or scalars that are of these two data types.
4.2 Input and Output Formats used by PARMETIS
4.2.1 Format of the Input Graph
All of the graph routines in PARMETIS take as input the adjacency structure of the graph, the weights of the vertices and edges (if any), and an array describing how the graph is distributed among the processors. Note that depending on the application this graph can represent different things. For example, when PARMETIS is used to compute fill-reducing orderings, the graph corresponds to the non-zero structure of the matrix (excluding the diagonal entries). In the case of finite element computations, the vertices of the graph can correspond to nodes (points) in the mesh while edges represent the connections between these nodes. Alternatively, the graph can correspond to the dual of the finite element mesh. In this case, each vertex corresponds to an element and two vertices are connected via an edge if the corresponding elements share an edge (in 2D) or a face (in 3D). Also, the graph can be similar to the dual, but be more or less connected. That is, instead of limiting edges to those elements that share a face, edges can connect any two elements that share even a single node. However the graph is constructed, it is usually undirected (see footnote 2). That is, for every pair of connected vertices v and u, it contains both edges (v, u) and (u, v).
In PARMETIS, the structure of the graph is represented by the compressed sparse row (CSR) storage format, extended for the context of parallel distributed-memory computing. We will first describe the CSR format for serial graphs and then describe how it has been extended for storing graphs that are distributed among processors.
Serial CSR Format. The CSR format is a widely-used scheme for storing sparse graphs. Here, the adjacency structure of a graph is represented by two arrays, xadj and adjncy. Weights on the vertices and edges (if any) are represented by using two additional arrays, vwgt and adjwgt. For example, consider a graph with n vertices and m edges. In the CSR format, this graph can be described using arrays of the following sizes:
xadj[n + 1], vwgt[n], adjncy[2m], and adjwgt[2m]
Note that both adjncy and adjwgt are of size 2m because every edge is listed twice (i.e., as (v, u) and (u, v)). Also note that if the graph is unweighted (i.e., all vertices and/or edges have the same weight), then either or both of the arrays vwgt and adjwgt can be set to NULL. ParMETIS_V3_AdaptiveRepart additionally requires a vsize array. This array is similar to the vwgt array, except that instead of describing the amount of work that is associated with each vertex, it describes the amount of memory that is associated with each vertex.
The adjacency structure of the graph is stored as follows. Assuming that vertex numbering starts from 0 (C style), the adjacency list of vertex i is stored in array adjncy starting at index xadj[i] and ending at (but not including) index xadj[i + 1] (in other words, adjncy[xadj[i]] up through and including adjncy[xadj[i + 1] - 1]). Hence, the adjacency lists for each vertex are stored consecutively in the array adjncy. The array xadj is used to point to where the list for each specific vertex begins and ends. Figure 5(b) illustrates the CSR format for the 15-vertex graph shown in Figure 5(a). If the graph has weights on the vertices, then vwgt[i] is used to store the weight of vertex i. Similarly, if the graph has weights on the edges, then the weight of edge adjncy[j] is stored in adjwgt[j]. This is the same format that is used by the (serial) METIS library routines.
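The serial CSR layout can be made concrete with a small sketch that builds xadj and adjncy for a rows × cols grid graph such as the 3 × 5 grid of Figure 5(a). The function names and the 32-bit idx_t typedef are assumptions made so that the sketch compiles without parmetis.h; they are not part of the PARMETIS API.

```c
#include <stdint.h>
#include <stdlib.h>

typedef int32_t idx_t;  /* stand-in (assumption); real programs get idx_t from parmetis.h */

/* Build the 0-based CSR arrays of a rows x cols grid graph (rows, cols >= 1)
 * in which vertex r*cols + c is adjacent to the vertices above, left of,
 * right of, and below it. On success *xadj holds n + 1 entries and *adjncy
 * holds 2m entries; the caller frees both. Returns n, or -1 on failure. */
idx_t grid_to_csr(idx_t rows, idx_t cols, idx_t **xadj, idx_t **adjncy)
{
    idx_t n = rows * cols;
    idx_t m = rows * (cols - 1) + cols * (rows - 1);  /* undirected edges */
    idx_t *xa = malloc((size_t)(n + 1) * sizeof *xa);
    idx_t *ad = malloc((size_t)(m > 0 ? 2 * m : 1) * sizeof *ad);
    if (xa == NULL || ad == NULL) {
        free(xa);
        free(ad);
        return -1;
    }

    idx_t k = 0;  /* next free slot in adjncy */
    for (idx_t r = 0; r < rows; r++) {
        for (idx_t c = 0; c < cols; c++) {
            idx_t v = r * cols + c;
            xa[v] = k;                              /* list of v starts here */
            if (r > 0)        ad[k++] = v - cols;   /* neighbor above  */
            if (c > 0)        ad[k++] = v - 1;      /* neighbor left   */
            if (c < cols - 1) ad[k++] = v + 1;      /* neighbor right  */
            if (r < rows - 1) ad[k++] = v + cols;   /* neighbor below  */
        }
    }
    xa[n] = k;  /* equals 2m: every edge is listed twice */

    *xadj = xa;
    *adjncy = ad;
    return n;
}

/* Number of neighbors of vertex i in a 0-based CSR graph. */
idx_t csr_degree(const idx_t *xadj, idx_t i)
{
    return xadj[i + 1] - xadj[i];
}
```

Calling grid_to_csr(3, 5, &xadj, &adjncy) yields the xadj and adjncy arrays listed for the serial case in Figure 5(b); the neighbors of vertex i are adjncy[xadj[i]] through adjncy[xadj[i + 1] - 1].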
Distributed CSR Format. PARMETIS uses an extension of the CSR format that allows the vertices of the graph and their adjacency lists to be distributed among the processors. In particular, PARMETIS assumes that each processor $P_i$ stores $n_i$ consecutive vertices of the graph and the corresponding $m_i$ edges, so that $n = \sum_i n_i$ and $2m = \sum_i m_i$. Here, each processor stores its local part of the graph in the four arrays xadj[n_i + 1], vwgt[n_i], adjncy[m_i], and adjwgt[m_i], using the CSR storage scheme. Again, if the graph is unweighted, the arrays vwgt and adjwgt can be set to NULL.
The straightforward way to distribute the graph for PARMETIS is to take n/p consecutive adjacency lists from adjncy and store them on consecutive processors (where p is the number of processors). In addition, each processor needs its local xadj array to point to where each of its local vertices' adjacency lists begin and end. Thus, if we take all the local adjncy arrays and concatenate them, we will get exactly the same adjncy array that is used in the serial CSR. However, concatenating the local xadj arrays will not give us the serial xadj array. This is because the entries in each local xadj must point into the local adjncy array, and so xadj[0] is zero on all processors.
In addition to these four arrays, each processor also requires the array vtxdist[p + 1] that indicates the range of vertices that are local to each processor. In particular, processor $P_i$ stores the vertices from vtxdist[i] up to (but not including) vertex vtxdist[i + 1].
Figure 5(c) illustrates the distributed CSR format by an example on a three-processor system. The 15-vertex graph in Figure 5(a) is distributed among the processors so that each processor gets 5 vertices and their corresponding adjacency lists. That is, Processor Zero gets vertices 0 through 4, Processor One gets vertices 5 through 9, and Processor Two gets vertices 10 through 14. This figure shows the xadj, adjncy, and vtxdist arrays for each processor. Note that the vtxdist array will always be identical for every processor.

(a) A sample graph (3 × 5 grid of vertices)
     0  1  2  3  4
     5  6  7  8  9
    10 11 12 13 14
(b) Serial CSR format: description of the graph on a serial computer (serial METIS)
    xadj:    0 2 5 8 11 13 16 20 24 28 31 33 36 39 42 44
    adjncy:  1 5 0 2 6 1 3 7 2 4 8 3 9 0 6 10 1 5 7 11 2 6 8 12 3 7 9 13 4 8 14 5 11 6 10 12 7 11 13 8 12 14 9 13
(c) Distributed CSR format: description of the graph on a parallel computer with 3 processors (ParMETIS)
    Processor 0:
        xadj:    0 2 5 8 11 13
        adjncy:  1 5 0 2 6 1 3 7 2 4 8 3 9
        vtxdist: 0 5 10 15
    Processor 1:
        xadj:    0 3 7 11 15 18
        adjncy:  0 6 10 1 5 7 11 2 6 8 12 3 7 9 13 4 8 14
        vtxdist: 0 5 10 15
    Processor 2:
        xadj:    0 2 5 8 11 13
        adjncy:  5 11 6 10 12 7 11 13 8 12 14 9 13
        vtxdist: 0 5 10 15
Figure 5: An example of the parameters passed to PARMETIS in a three-processor case. The arrays vwgt and adjwgt are assumed to be NULL.

Footnote 2: Multi-constraint and multi-objective graph partitioning formulations [6, 13] can get around this requirement for some applications. These also allow the computation of partitionings for bipartite graphs, as well as for graphs corresponding to non-square and non-symmetric matrices.
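The relation between the serial and the distributed arrays can be sketched as follows: vtxdist is filled for a block distribution, and the rows owned by one processor are copied out of a serial CSR graph with the local xadj rebased so that it starts at zero. Both helper names and the idx_t typedef are assumptions for illustration; an application would normally assemble its local arrays directly rather than from a serial copy.

```c
#include <stdint.h>

typedef int32_t idx_t;  /* stand-in (assumption); real programs get idx_t from parmetis.h */

/* Fill vtxdist[0..p] so that n vertices are split into p consecutive blocks
 * whose sizes differ by at most one. Processor i owns vertices
 * vtxdist[i] .. vtxdist[i+1]-1. */
void make_vtxdist(idx_t n, idx_t p, idx_t *vtxdist)
{
    for (idx_t i = 0; i <= p; i++)
        vtxdist[i] = (idx_t)(((int64_t)n * i) / p);
}

/* Copy the rows owned by processor `rank` out of a serial 0-based CSR graph.
 * local_xadj needs n_i + 1 entries and local_adjncy needs m_i entries.
 * Returns m_i, the number of local adjacency entries. */
idx_t extract_local_csr(const idx_t *xadj, const idx_t *adjncy,
                        const idx_t *vtxdist, idx_t rank,
                        idx_t *local_xadj, idx_t *local_adjncy)
{
    idx_t first = vtxdist[rank];
    idx_t last = vtxdist[rank + 1];
    idx_t base = xadj[first];           /* offset of the first local list */

    for (idx_t v = first; v <= last; v++)
        local_xadj[v - first] = xadj[v] - base;   /* local_xadj[0] == 0 */

    idx_t mi = xadj[last] - base;
    for (idx_t j = 0; j < mi; j++)
        local_adjncy[j] = adjncy[base + j];       /* global vertex numbers are kept */

    return mi;
}
```

Note that the entries of the local adjncy array remain global vertex numbers; only xadj is rebased.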
When multiple vertex weights are used for multi-constraint partitioning, the c vertex weights for each vertex are stored contiguously in the vwgt array. In this case, the vwgt array is of size $n \cdot c$, where n is the number of locally-stored vertices and c is the number of vertex weights (and also the number of balance constraints).
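As a minimal sketch of this interleaved layout, weight j of local vertex i sits at vwgt[i*c + j], where c is passed to the routines as ncon. The helper names below are hypothetical and the idx_t typedef is a stand-in for the one in parmetis.h.

```c
#include <stdint.h>

typedef int32_t idx_t;  /* stand-in (assumption); real programs get idx_t from parmetis.h */

/* Weight j (0 <= j < ncon) of local vertex i in the interleaved vwgt array. */
idx_t vertex_weight(const idx_t *vwgt, idx_t ncon, idx_t i, idx_t j)
{
    return vwgt[i * ncon + j];
}

/* Sum each of the ncon weights over the nlocal local vertices.
 * totals must have room for ncon entries. */
void sum_vertex_weights(const idx_t *vwgt, idx_t nlocal, idx_t ncon, idx_t *totals)
{
    for (idx_t j = 0; j < ncon; j++)
        totals[j] = 0;
    for (idx_t i = 0; i < nlocal; i++)
        for (idx_t j = 0; j < ncon; j++)
            totals[j] += vwgt[i * ncon + j];
}
```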
4.2.2 Format of Vertex Coordinates
As discussed in Section 3.1, PARMETIS provides routines that use the coordinate information of the vertices to quickly pre-distribute the graph, and so speed up the execution of the parallel k-way partitioning. These coordinates are specified in an array called xyz of type real_t. If d is the number of dimensions of the mesh (i.e., d = 2 for 2D meshes or d = 3 for 3D meshes), then each processor requires an array of size $d \cdot n_i$, where $n_i$ is the number of locally-stored vertices. (Note that the number of dimensions of the mesh, d, is required as a parameter to the routine.) In this array, the coordinates of vertex i are stored starting at location xyz[i*d] up to (but not including) location xyz[i*d + d]. For example, if d = 3, then the x, y, and z coordinates of vertex i are stored at xyz[3*i], xyz[3*i+1], and xyz[3*i+2], respectively.
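The following sketch fills the local xyz array for a block of vertices of a 2D grid such as the one in Figure 5(a), using the column index as x and the row index as y. The coordinate choice, the helper name, and the two typedefs are assumptions made for illustration.

```c
#include <stdint.h>

typedef int32_t idx_t;  /* stand-in (assumption); real programs get idx_t from parmetis.h */
typedef float real_t;   /* stand-in (assumption); real programs get real_t from parmetis.h */

/* Fill xyz (size 2 * nlocal) with the 2D coordinates of the nlocal vertices
 * whose global numbers start at `first`, for a grid with `cols` columns that
 * is numbered row by row. Coordinates of local vertex i go to xyz[i*d]
 * and xyz[i*d + 1] with d = 2. */
void grid_coords_2d(idx_t first, idx_t nlocal, idx_t cols, real_t *xyz)
{
    const idx_t d = 2;
    for (idx_t i = 0; i < nlocal; i++) {
        idx_t g = first + i;                  /* global vertex number */
        xyz[i * d + 0] = (real_t)(g % cols);  /* x: column */
        xyz[i * d + 1] = (real_t)(g / cols);  /* y: row    */
    }
}
```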
4.2.3 Format of the Input Mesh
The routine ParMETIS_V3_PartMeshKway takes a distributed mesh and computes its partitioning, while ParMETIS_V3_Mesh2Dual takes a distributed mesh and constructs a distributed dual graph. Both of these routines require an elmdist array that specifies the distribution of the mesh elements, but that is otherwise identical to the vtxdist array. They also require a pair of arrays called eptr and eind, as well as the integer parameter ncommonnodes.
The eptr and eind arrays are similar in nature to the xadj and adjncy arrays used to specify the adjacency list of a graph, but here they specify, for each element, the set of nodes that make up that element. Specifically, the set of nodes that belong to element i is stored in array eind starting at index eptr[i] and ending at (but not including) index eptr[i + 1] (in other words, eind[eptr[i]] up through and including eind[eptr[i + 1] - 1]). Hence, the node lists for each element are stored consecutively in the array eind. This format allows the specification of meshes that contain elements of mixed type.
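To illustrate how eptr and eind accommodate mixed element types, the sketch below appends elements of arbitrary size to a mesh one at a time. The helper append_element is hypothetical and idx_t is a stand-in typedef; neither comes from the library.

```c
#include <stdint.h>

typedef int32_t idx_t;  /* stand-in (assumption); real programs get idx_t from parmetis.h */

/* Append one element with nn nodes to a 0-based mesh that currently holds
 * ne elements. eptr[0] must already be 0, and both arrays must have room for
 * the new entries. Returns the new element count. */
idx_t append_element(idx_t *eptr, idx_t *eind, idx_t ne,
                     const idx_t *nodes, idx_t nn)
{
    idx_t start = eptr[ne];            /* node list of the new element starts here */
    for (idx_t k = 0; k < nn; k++)
        eind[start + k] = nodes[k];
    eptr[ne + 1] = start + nn;         /* one past its last node */
    return ne + 1;
}
```

For example, appending a quadrilateral with nodes 0, 1, 4, 3 followed by a triangle with nodes 1, 2, 4 gives eptr = {0, 4, 7} and eind = {0, 1, 4, 3, 1, 2, 4}.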
The ncommonnodes parameter specifies the degree of connectivity that is desired between the vertices of the dual graph. Specifically, an edge is placed between two vertices if their corresponding mesh elements share at least g nodes, where g is the ncommonnodes parameter. Hence, this parameter can be set to result in a traditional dual graph (e.g., a value of two for a triangle mesh or a value of four for a hexahedral mesh). However, it can also be set higher or lower for increased or decreased connectivity.
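The rule for placing dual-graph edges can be sketched with a brute-force test on two elements. This only illustrates the definition; it is not how ParMETIS_V3_Mesh2Dual is implemented, and the helper names and the idx_t typedef are assumptions.

```c
#include <stdint.h>

typedef int32_t idx_t;  /* stand-in (assumption); real programs get idx_t from parmetis.h */

/* Number of nodes shared by elements a and b of a 0-based mesh
 * (each element's node list is assumed to contain no duplicates). */
idx_t shared_nodes(const idx_t *eptr, const idx_t *eind, idx_t a, idx_t b)
{
    idx_t count = 0;
    for (idx_t i = eptr[a]; i < eptr[a + 1]; i++)
        for (idx_t j = eptr[b]; j < eptr[b + 1]; j++)
            if (eind[i] == eind[j])
                count++;
    return count;
}

/* Nonzero if the dual graph has an edge between distinct elements a and b,
 * i.e. if they share at least ncommonnodes nodes. */
int has_dual_edge(const idx_t *eptr, const idx_t *eind,
                  idx_t a, idx_t b, idx_t ncommonnodes)
{
    return a != b && shared_nodes(eptr, eind, a, b) >= ncommonnodes;
}
```

With three triangles {0, 1, 2}, {1, 3, 2}, and {2, 3, 4}, a value of two connects only the pairs that share a full edge, while a value of one also connects the first and third triangles, which meet only at node 2.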
Additionally, ParMETIS_V3_PartMeshKway requires an elmwgt array that is analogous to the vwgt array.
4.2.4 Format of the Computed Partitionings and Orderings
Format of the Partitioning Array. The partitioning and repartitioning routines require that an array (called part) of size $n_i$ (where $n_i$ is the number of local vertices) be passed as a parameter on each processor. Upon completion of the PARMETIS routine, for each vertex j, the sub-domain number (i.e., the processor label) to which this vertex belongs will have been written to part[j]. Note that PARMETIS does not redistribute the graph according to the new partitioning; it simply computes the partitioning and writes it to the part array.
Additionally, whenever the number of sub-domains does not equal the number of processors that are used to compute a repartitioning, ParMETIS_V3_RefineKway and ParMETIS_V3_AdaptiveRepart require that the previously computed partitioning be passed as a parameter via the part array. (This is also required whenever the user chooses to de-couple the sub-domains from the processors. See the discussion in Section 5.2.) This is because the initial partitioning needs to be obtained from the values supplied in the part array. If the numbers of sub-domains and processors are equal, then the initial partitioning can be obtained from the initial graph distribution, and so this information need not be supplied. (In this case, for each processor i, every element of part would be set to i.)
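A short sketch of working with the part array follows: one helper writes the implicit partitioning of the coupled case (every local vertex belongs to the sub-domain numbered like its processor), and another tallies how many local vertices a computed partitioning assigns to each sub-domain. Both helpers are hypothetical, assume C-style numbering, and use a stand-in idx_t typedef.

```c
#include <stdint.h>

typedef int32_t idx_t;  /* stand-in (assumption); real programs get idx_t from parmetis.h */

/* Coupled case: every one of the nlocal local vertices starts in the
 * sub-domain whose number equals the rank of the owning processor. */
void init_part_coupled(idx_t rank, idx_t nlocal, idx_t *part)
{
    for (idx_t j = 0; j < nlocal; j++)
        part[j] = rank;
}

/* Count the local vertices assigned to each of the nparts sub-domains
 * (0-based sub-domain numbers). counts must have room for nparts entries. */
void count_per_subdomain(const idx_t *part, idx_t nlocal, idx_t nparts, idx_t *counts)
{
    for (idx_t s = 0; s < nparts; s++)
        counts[s] = 0;
    for (idx_t j = 0; j < nlocal; j++)
        counts[part[j]]++;
}
```

Because the graph itself is not redistributed, such per-processor counts have to be combined across processors (for example with a reduction) before they describe the global sub-domain sizes.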
Format of the Ordering and Separator Sizes Arrays. Each processor running ParMETIS_V3_NodeND (and ParMETIS_V32_NodeND) writes its portion of the computed fill-reducing ordering to an array called order. Similar to the part array, the size of order is equal to the number of vertices stored at each processor. Upon completion, for each vertex j, order[j] stores the new global number of this vertex in the fill-reducing permutation.
Besides the ordering vector, ParMETIS_V3_NodeND also returns information about the sizes of the different sub-domains as well as the separators at different levels. This array is called sizes and is of size 2p (where p is the number of processors). Every processor must supply this array and upon return, all of the sizes arrays are identical.
To accommodate runs in which the number of processors is not a power of two, ParMETIS_V3_NodeND performs $\lfloor \log p \rfloor$ levels of nested dissection. Accordingly, let $p' = 2^{\lfloor \log p \rfloor}$ be the largest power of two that does not exceed p.
Given the above definition of p', the format of the sizes array is as follows. The first p' entries of sizes, from index 0 to p' - 1, store the number of nodes in each one of the p' sub-domains. The remaining p' - 1 entries of this array, from sizes[p'] up to sizes[2p' - 2], store the sizes of the separators at the log p' levels of nested dissection. In particular, sizes[2p' - 2] stores the size of the top-level separator, and sizes[2p' - 4] and sizes[2p' - 3] store the sizes of the two separators at the second level (from left to right). Similarly, sizes[2p' - 8] through sizes[2p' - 5] store the sizes of the four separators of the third level (from left to right), and so on. This array can be used to quickly construct the separator tree (a form of an elimination tree) for direct factorization. Given this separator tree and the sizes of the sub-domains, the nodes in the ordering produced by ParMETIS_V3_NodeND are numbered in a postorder fashion. Figure 6 illustrates the sizes array and the postorder ordering.
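The index arithmetic for the sizes array can be written out as a sketch. With levels counted from 0 at the top, level l holds 2^l separators that start at index 2p' - 2^(l+1), which reproduces the positions listed above. The helper names and the idx_t typedef are assumptions for illustration.

```c
#include <stdint.h>

typedef int32_t idx_t;  /* stand-in (assumption); real programs get idx_t from parmetis.h */

/* Largest power of two that does not exceed p (p >= 1); this is p'. */
idx_t largest_pow2_le(idx_t p)
{
    idx_t q = 1;
    while (q <= p / 2)
        q *= 2;
    return q;
}

/* Index in sizes of separator j (0-based, left to right) at dissection
 * level `level`, where level 0 is the top-level separator. */
idx_t separator_index(idx_t pprime, idx_t level, idx_t j)
{
    return 2 * pprime - ((idx_t)2 << level) + j;
}

/* Total number of vertices described by sizes: the p' sub-domains plus
 * the p' - 1 separators. */
idx_t total_vertices(const idx_t *sizes, idx_t pprime)
{
    idx_t total = 0;
    for (idx_t k = 0; k < 2 * pprime - 1; k++)
        total += sizes[k];
    return total;
}
```

For the four-processor example of Figure 6, p' = 4, the top-level separator is sizes[6], and the two second-level separators are sizes[4] and sizes[5].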
order: 1 0 14 6 7 4 5 13 10 11 2 3 12 8 9
sizes: 2 2 2 2 2 2 3
Figure 6: An example of the ordering produced by ParMETIS_V3_NodeND. Consider the simple 3 × 5 grid and assume that we have four processors. ParMETIS_V3_NodeND finds the three separators that are shaded. It first finds the big separator and then for each of the two sub-domains it finds the smaller. At the end of the ordering, the order vector concatenated over all the processors will be the one shown. Similarly, the sizes arrays will all be identical to the one shown, corresponding to the regions pointed to by the arrows.
4.3 Numbering and Memory Allocation
PARMETIS allows the user to specify a graph whose numbering starts either at 0 (C style) or at 1 (Fortran style). Of course, PARMETIS requires that the same numbering scheme be used consistently for all the arrays passed to it, and it writes to the part and order arrays using that same scheme.
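As a sketch of what consistent numbering means for the graph arrays, the helper below converts a local distributed CSR graph between the two schemes in place by shifting every entry of vtxdist, xadj, and adjncy. The helper is hypothetical and idx_t is a stand-in typedef; the arrays themselves are still indexed from 0 in C memory regardless of the scheme.

```c
#include <stdint.h>

typedef int32_t idx_t;  /* stand-in (assumption); real programs get idx_t from parmetis.h */

/* Shift the numbering of a local CSR graph in place: shift = +1 turns
 * 0-based (numflag 0) arrays into 1-based (numflag 1) arrays, and
 * shift = -1 does the reverse. npes is the number of processors and
 * nlocal the number of locally-stored vertices. */
void shift_numbering(idx_t shift, idx_t npes, idx_t nlocal,
                     idx_t *vtxdist, idx_t *xadj, idx_t *adjncy)
{
    /* The difference of the end points is the entry count in either scheme. */
    idx_t nedges = xadj[nlocal] - xadj[0];

    for (idx_t i = 0; i <= npes; i++)
        vtxdist[i] += shift;
    for (idx_t i = 0; i <= nlocal; i++)
        xadj[i] += shift;
    for (idx_t j = 0; j < nedges; j++)
        adjncy[j] += shift;
}
```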
PARMETIS allocates all the memory that it requires dynamically. This has the advantage that the user does not have to provide workspace. However, if there is not enough memory on the machine, the routines in PARMETIS will abort.
Note that the routines in PARMETIS do not modify the arrays that store the graph (e.g., xadj and adjncy). They only modify the part and order arrays.
5 Calling Sequence of the Routines in PARMETIS
The calling sequences of the PARMETIS routines are described in this section.
5.1 Graph Partitioning
int ParMETIS_V3_PartKway(
    idx_t *vtxdist, idx_t *xadj, idx_t *adjncy, idx_t *vwgt, idx_t *adjwgt, idx_t *wgtflag,
    idx_t *numflag, idx_t *ncon, idx_t *nparts, real_t *tpwgts, real_t *ubvec, idx_t *options,
    idx_t *edgecut, idx_t *part, MPI_Comm *comm
)
Description
This routine is used to compute a k-way partitioning of a graph on p processors using the multilevel k-way multi-constraint partitioning algorithm.
Parameters
vtxdist This array describes how the vertices of the graph are distributed among the processors (see the discussion in Section 4.2.1). Its contents are identical for every processor.
xadj, adjncy These store the (local) adjacency structure of the graph at each processor (see the discussion in Section 4.2.1).
vwgt, adjwgt These store the weights of the vertices and edges (see the discussion in Section 4.2.1).
wgtflag This is used to indicate if the graph is weighted. wgtflag can take one of four values:
    0 No weights (vwgt and adjwgt are both NULL).
    1 Weights on the edges only (vwgt is NULL).
    2 Weights on the vertices only (adjwgt is NULL).
    3 Weights on both the vertices and edges.
numflag This is used to indicate the numbering scheme that is used for the vtxdist, xadj, adjncy, and part arrays. numflag can take one of two values:
    0 C-style numbering that starts from 0.
    1 Fortran-style numbering that starts from 1.
ncon This is used to specify the number of weights that each vertex has. It is also the number of balance constraints that must be satisfied.
nparts This is used to specify the number of sub-domains that are desired. Note that the number of sub-domains is independent of the number of processors that call this routine.
tpwgts An array of size ncon × nparts that is used to specify the fraction of vertex weight that should be distributed to each sub-domain for each balance constraint. If all of the sub-domains are to be of the same size for every vertex weight, then each of the ncon × nparts elements should be set to a value of 1/nparts. If ncon is greater than 1, the target sub-domain weights for each sub-domain are stored contiguously (similar to the vwgt array). Note that the sum of all of the tpwgts for a given vertex weight should be one.
ubvec An array of size ncon that is used to specify the imbalance tolerance for each vertex weight, with 1 being perfect balance and nparts being perfect imbalance. A value of 1.05 for each of the ncon weights is recommended.
options This is an array of integers that is used to pass additional parameters for the routine. The first element (i.e., options[0]) can take either the value of 0 or 1. If it is 0, then the default values are used; otherwise the remaining two elements of options are interpreted as follows:
    options[1] This specifies the level of information to be returned during the execution of the algorithm. Timing information can be obtained by setting this to 1. Additional options for this parameter can be obtained by looking at parmetis.h. The numerical values there should be added to obtain the correct value. The default value is 0.
    options[2] This is the random number seed for the routine.
edgecut Upon successful completion, the number of edges that are cut by the partitioning is written to this parameter.
part This is an array of size equal to the number of locally-stored vertices. Upon successful completion the partition vector of the locally-stored vertices is written to this array (see the discussion in Section 4.2.4).
comm This is a pointer to the MPI communicator of the processes that call PARMETIS. For most programs this will point to MPI_COMM_WORLD.
Returns
METIS_OK Indicates that the function returned normally.
METIS_ERROR Indicates some other type of error.
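The weight-related arguments of this routine can be prepared with a few lines of code. The sketch below fills tpwgts for equally sized sub-domains, sets every entry of ubvec to one tolerance, and derives wgtflag from which weight arrays are present. It does not call PARMETIS; the helper names and the two typedefs are assumptions, and the indexing tpwgts[s*ncon + j] follows the statement above that the weights of each sub-domain are stored contiguously.

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t idx_t;  /* stand-in (assumption); real programs get idx_t from parmetis.h */
typedef float real_t;   /* stand-in (assumption); real programs get real_t from parmetis.h */

/* Fill tpwgts (ncon * nparts entries) with 1/nparts and ubvec (ncon entries)
 * with the same tolerance, e.g. 1.05. */
void uniform_targets(idx_t ncon, idx_t nparts, real_t tol,
                     real_t *tpwgts, real_t *ubvec)
{
    for (idx_t k = 0; k < ncon * nparts; k++)
        tpwgts[k] = (real_t)1.0 / (real_t)nparts;
    for (idx_t j = 0; j < ncon; j++)
        ubvec[j] = tol;
}

/* Sum of the target fractions of constraint j over all sub-domains;
 * this should be (close to) one for every j. */
real_t target_sum(const real_t *tpwgts, idx_t ncon, idx_t nparts, idx_t j)
{
    real_t sum = 0;
    for (idx_t s = 0; s < nparts; s++)
        sum += tpwgts[s * ncon + j];
    return sum;
}

/* wgtflag value matching the weight arrays that are supplied:
 * 0 none, 1 edges only, 2 vertices only, 3 both. */
idx_t make_wgtflag(const idx_t *vwgt, const idx_t *adjwgt)
{
    return (vwgt != NULL ? 2 : 0) + (adjwgt != NULL ? 1 : 0);
}
```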
int ParMETIS_V3_PartGeomKway(
    idx_t *vtxdist, idx_t *xadj, idx_t *adjncy, idx_t *vwgt, idx_t *adjwgt, idx_t *wgtflag,
    idx_t *numflag, idx_t *ndims, real_t *xyz, idx_t *ncon, idx_t *nparts, real_t *tpwgts,
    real_t *ubvec, idx_t *options, idx_t *edgecut, idx_t *part, MPI_Comm *comm
)
Description
This routine is used to compute a k-way partitioning of a graph on p processors by combining the coordinate-based and multi-constraint k-way partitioning schemes.
Parameters
vtxdist This array describes how the vertices of the graph are distributed among the processors (see the discussion in Section 4.2.1). Its contents are identical for every processor.
xadj, adjncy These store the (local) adjacency structure of the graph at each processor (see the discussion in Section 4.2.1).
vwgt, adjwgt These store the weights of the vertices and edges (see the discussion in Section 4.2.1).
wgtflag This is used to indicate if the graph is weighted. wgtflag can take one of four values:
    0 No weights (vwgt and adjwgt are both NULL).
    1 Weights on the edges only (vwgt is NULL).
    2 Weights on the vertices only (adjwgt is NULL).
    3 Weights on both the vertices and edges.
numflag This is used to indicate the numbering scheme that is used for the vtxdist, xadj, adjncy, and part arrays. numflag can take one of two values:
    0 C-style numbering that starts from 0.
    1 Fortran-style numbering that starts from 1.
ndims The number of dimensions of the space in which the graph is embedded.
xyz The array storing the coordinates of the vertices (described in Section 4.2.2).
ncon This is used to specify the number of weights that each vertex has. It is also the number of balance constraints that must be satisfied.
nparts This is used to specify the number of sub-domains that are desired. Note that the number of sub-domains is independent of the number of processors that call this routine.
tpwgts An array of size ncon × nparts that is used to specify the fraction of vertex weight that should be distributed to each sub-domain for each balance constraint. If all of the sub-domains are to be of the same size for every vertex weight, then each of the ncon × nparts elements should be set to a value of 1/nparts. If ncon is greater than one, the target sub-domain weights for each sub-domain are stored contiguously (similar to the vwgt array). Note that the sum of all of the tpwgts for a given vertex weight should be one.
ubvec An array of size ncon that is used to specify the imbalance tolerance for each vertex weight, with 1 being perfect balance and nparts being perfect imbalance. A value of 1.05 for each of the ncon weights is recommended.
options This is an array of integers that is used to pass parameters to the routine. Their meanings are identical to those of ParMETIS_V3_PartKway.
edgecut Upon successful completion, the number of edges that are cut by the partitioning is written to this parameter.
part This is an array of size equal to the number of locally-stored vertices. Upon successful completion the partition vector of the locally-stored vertices is written to this array (see the discussion in Section 4.2.4).
comm This is a pointer to the MPI communicator of the processes that call PARMETIS. For most programs this will point to MPI_COMM_WORLD.
Returns
METIS_OK Indicates that the function returned normally.
METIS_ERROR Indicates some other type of error.
Note
The quality of the partitionings computed by ParMETIS_V3_PartGeomKway is comparable to that of the partitionings produced by ParMETIS_V3_PartKway. However, the routine may run up to twice as fast.
int ParMETIS_V3_PartGeom(
    idx_t *vtxdist, idx_t *ndims, real_t *xyz, idx_t *part, MPI_Comm *comm
)
Description
This routine is used to compute a p-way partitioning of a graph on p processors using a coordinate-based space-filling curves method.
Parameters
vtxdist This array describes how the vertices of the graph are distributed among the processors (see the discussion in Section 4.2.1). Its contents are identical for every processor.
ndims The number of dimensions of the space in which the graph is embedded.
xyz The array storing the coordinates of the vertices (described in Section 4.2.2).
part This is an array of size equal to the number of locally-stored vertices. Upon successful completion, it stores the partition vector of the locally-stored graph (described in Section 4.2.4).
comm This is a pointer to the MPI communicator of the processes that call PARMETIS. For most programs this will point to MPI_COMM_WORLD.
Returns
METIS_OK Indicates that the function returned normally.
METIS_ERROR Indicates some other type of error.
Note
The quality of the partitionings computed by ParMETIS_V3_PartGeom is significantly worse than that of the partitionings produced by ParMETIS_V3_PartKway and ParMETIS_V3_PartGeomKway.
int ParMETIS_V3_PartMeshKway(
    idx_t *elmdist, idx_t *eptr, idx_t *eind, idx_t *elmwgt, idx_t *wgtflag, idx_t *numflag,
    idx_t *ncon, idx_t *ncommonnodes, idx_t *nparts, real_t *tpwgts, real_t *ubvec,
    idx_t *options, idx_t *edgecut, idx_t *part, MPI_Comm *comm
)
Description
This routine is used to compute a k-way partitioning of a mesh on p processors. The mesh can contain elements of different types.
Parameters
elmdist This array describes how the elements of the mesh are distributed among the processors. It is analogous to the vtxdist array. Its contents are identical for every processor (see the discussion in Section 4.2.3).
eptr, eind These arrays specify the elements that are stored locally at each processor (see the discussion in Section 4.2.3).
elmwgt This array stores the weights of the elements (see the discussion in Section 4.2.3).
wgtflag This is used to indicate if the elements of the mesh have weights associated with them. wgtflag can take two values:
    0 No weights (elmwgt is NULL).
    2 Weights on the elements (i.e., on the vertices of the dual graph) only.
numflag This is used to indicate the numbering scheme that is used for the elmdist, element (eptr and eind), and part arrays. numflag can take one of two values:
    0 C-style numbering that starts from 0.
    1 Fortran-style numbering that starts from 1.
ncon This is used to specify the number of weights that each vertex has. It is also the number of balance constraints that must be satisfied.
ncommonnodes This parameter determines the degree of connectivity among the vertices in the dual graph. Specifically, an edge is placed between any two elements if and only if they share at least this many nodes. This value should be greater than 0, and for most meshes a value of two will create reasonable dual graphs. However, depending on the type of elements in the mesh, values greater than 2 may also be valid choices. For example, for meshes containing only triangular, tetrahedral, hexahedral, or rectangular elements, this parameter can be set to two, three, four, or two, respectively. Note that setting this parameter to a small value will increase the number of edges in the resulting dual graph and the corresponding partitioning time.
nparts This is used to specify the number of sub-domains that are desired. Note that the number of sub-domains is independent of the number of processors that call this routine.
tpwgts An array of size ncon × nparts that is used to specify the fraction of vertex weight that should be distributed to each sub-domain for each balance constraint. If all of the sub-domains are to be of the same size for every vertex weight, then each of the ncon × nparts elements should be set to a value of 1/nparts. If ncon is greater than 1, the target sub-domain weights for each sub-domain are stored contiguously (similar to the vwgt array). Note that the sum of all of the tpwgts for a given vertex weight should be one.
ubvec An array of size ncon that is used to specify the imbalance tolerance for each vertex weight, with 1 being perfect balance and nparts being perfect imbalance. A value of 1.05 for each of the ncon weights is recommended.
options This is an array of integers that is used to pass parameters to the routine. Their meanings are identical to those of ParMETIS_V3_PartKway.
edgecut Upon successful completion, the number of edges that are cut by the partitioning is written to this parameter.
part This is an array of size equal to the number of locally-stored vertices. Upon successful completion the partition vector of the locally-stored vertices is written to this array (see the discussion in Section 4.2.4).
comm This is a pointer to the MPI communicator of the processes that call PARMETIS. For most programs this will point to MPI_COMM_WORLD.
Returns
METIS_OK Indicates that the function returned normally.
METIS_ERROR Indicates some other type of error.
5.2 Graph Repartitioning
int ParMETIS_V3_AdaptiveRepart(
    idx_t *vtxdist, idx_t *xadj, idx_t *adjncy, idx_t *vwgt, idx_t *vsize, idx_t *adjwgt,
    idx_t *wgtflag, idx_t *numflag, idx_t *ncon, idx_t *nparts, real_t *tpwgts, real_t *ubvec,
    real_t *itr, idx_t *options, idx_t *edgecut, idx_t *part, MPI_Comm *comm
)
Description
This routine is used to balance the work load of a graph that corresponds to an adaptively refined mesh.
Parameters
vtxdist This array describes how the vertices of the graph are distributed among the processors. (See discus-
sion in Section 4.2.1). Its contents are identical for every processor.
These store the (local) adjacency structure of the graph at each processor. (See discussion in Sec-
tion 4.2.1).
These store the weights of the vertices and edges. (See discussion in Section 4.2.1).
vsize This array stores the size of the vertices with respect to redistribution costs. Hence, vertices associ-
ated with mesh elements that require a lot of memory will have larger corresponding entries in this
array. Otherwise, this array is similar to the vwgt array. (See discussion in Section 4.2.1).
wgtﬂag This is used to indicate if the graph is weighted. wgtﬂag can take one of four values:
0No weights (vwgt and adjwgt are both NULL).
1Weights on the edges only (vwgt is NULL).
2Weights on the vertices only (adjwgt is NULL).
3Weights on both the vertices and edges.
numﬂag This is used to indicate the numbering scheme that is used for the vtxdist,xadj,adjncy, and part
arrays. numﬂag can take the following two values:
0C-style numbering is assumed that starts from 0
1Fortran-style numbering is assumed that starts from 1
ncon This is used to specify the number of weights that each vertex has. It is also the number of balance
constraints that must be satisﬁed.
nparts This is used to specify the number of sub-domains that are desired. Note that the number of sub-
domains is independent of the number of processors that call this routine.
tpwgts An array of size ncon ×nparts that is used to specify the fraction of vertex weight that should
be distributed to each sub-domain for each balance constraint. If all of the sub-domains are to be of
the same size for every vertex weight, then each of the ncon ×nparts elements should be set to a
value of 1/nparts. If ncon is greater than one, the target sub-domain weights for each sub-domain
are stored contiguously (similar to the vwgt array). Note that the sum of all of the tpwgts for a
give vertex weight should be one.
ubvec An array of size ncon that is used to specify the imbalance tolerance for each vertex weight, with 1
being perfect balance and nparts being perfect imbalance. A value of 1.05 for each of the ncon
weights is recommended.
23
itr This parameter describes the ratio of inter-processor communication time compared to data redistri-
bution time. It should be set between 0.000001 and 1000000.0. If ITR is set high, a repartitioning
with a low edge-cut will be computed. If it is set low, a repartitioning that requires little data redistri-
bution will be computed. Good values for this parameter can be obtained by dividing inter-processor
communication time by data redistribution time. Otherwise, a value of 1000.0 is recommended.
options This is an array of integers that is used to pass additional parameters for the routine. The ﬁrst element
(i.e., options[0]) can take either the value of 0or 1. If it is 0, then the default values are used,
otherwise the remaining three elements of options are interpreted as follows:
options[1] This speciﬁes the level of information to be returned during the execution of the
algorithm. Timing information can be obtained by setting this to 1. Additional
options for this parameter can be obtained by looking at parmetis.h. The nu-
merical values there should be added to obtain the correct value. The default value
is 0.
options[2] This is the random number seed for the routine.
options[3] This speciﬁes whether the sub-domains and processors are coupled or un-coupled.
If the number of sub-domains desired (i.e., nparts) and the number of processors
that are being used is not the same, then these must be un-coupled. However,
if nparts equals the number of processors, these can either be coupled or de-
coupled. If sub-domains and processors are coupled, then the initial partitioning
will be obtained implicitly from the graph distribution. However, if sub-domains
are un-coupled from processors, then the initial partitioning needs to be obtained
from the initial values assigned to the part array.
A value of PARMETIS PSR COUPLED indicates that sub-domains and processors
are coupled and a value of PARMETIS PSR UNCOUPLED indicates that these are
de-coupled.
The default value is PARMETIS PSR COUPLED if nparts equals the number
of processors and PARMETIS PSR UNCOUPLED (un-coupled) otherwise. These
constants are deﬁned in parmetis.h.
edgecut Upon successful completion, the number of edges that are cut by the partitioning is written to this
parameter.
part This is an array of size equal to the number of locally-stored vertices. Upon successful completion the
partition vector of the locally-stored vertices is written to this array. (See discussion in Section 4.2.4).
If the number of processors is not equal to the number of sub-domains and/or options[3] is set
to PARMETIS_PSR_UNCOUPLED, then the previously computed partitioning must be passed to the
routine as a parameter via this array.
comm This is a pointer to the MPI communicator of the processes that call PARMETIS. For most programs
this will point to MPI_COMM_WORLD.
Returns
METIS_OK Indicates that the function returned normally.
METIS_ERROR Indicates some other type of error.
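The options entries described above can be prepared with a small helper. The following is a minimal sketch, not part of the library: the enum values are local stand-ins for PARMETIS_PSR_COUPLED and PARMETIS_PSR_UNCOUPLED (their numeric values are assumed here, so real code should include parmetis.h), int stands in for idx_t, and the helper names default_coupling and fill_options are hypothetical. Setting options[0] to 1 assumes, following the description of the options array earlier in this manual, that a non-zero first entry makes the routine read the remaining entries.

```c
#include <assert.h>

/* Local stand-ins for PARMETIS_PSR_COUPLED and PARMETIS_PSR_UNCOUPLED.
 * The numeric values are assumptions made for this sketch; real code
 * should include parmetis.h and use the library's own constants. */
enum { PSR_COUPLED = 1, PSR_UNCOUPLED = 2 };

/* Coupling mode that the text gives as the default for options[3]. */
int default_coupling(int nparts, int nprocs)
{
    return (nparts == nprocs) ? PSR_COUPLED : PSR_UNCOUPLED;
}

/* Fills options[0..3]. Returns 0 on success, or -1 when coupling is
 * requested although nparts differs from the number of processors. */
int fill_options(int options[4], int dbglvl, int seed, int coupling,
                 int nparts, int nprocs)
{
    assert(coupling == PSR_COUPLED || coupling == PSR_UNCOUPLED);
    if (coupling == PSR_COUPLED && nparts != nprocs)
        return -1;
    options[0] = 1;        /* assumed: non-zero selects user-supplied values */
    options[1] = dbglvl;   /* 0 = no output, 1 = timing information */
    options[2] = seed;     /* random number seed */
    options[3] = coupling; /* sub-domain/processor coupling */
    return 0;
}
```

When the sub-domains are uncoupled from the processors, the part array must also hold the previously computed partitioning before the call, as noted in the description of part.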
5.3 Partitioning Refinement
int ParMETIS_V3_RefineKway (
idx_t *vtxdist, idx_t *xadj, idx_t *adjncy, idx_t *vwgt, idx_t *adjwgt, idx_t *wgtflag,
idx_t *numflag, idx_t *ncon, idx_t *nparts, real_t *tpwgts, real_t *ubvec, idx_t *options,
idx_t *edgecut, idx_t *part, MPI_Comm *comm
)
Description
This routine is used to improve the quality of an existing k-way partitioning on p processors using the
multilevel k-way refinement algorithm.
Parameters
vtxdist This array describes how the vertices of the graph are distributed among the processors. (See discussion
in Section 4.2.1). Its contents are identical for every processor.
xadj, adjncy
These store the (local) adjacency structure of the graph at each processor. (See discussion in
Section 4.2.1).
vwgt, adjwgt
These store the weights of the vertices and edges. (See discussion in Section 4.2.1).
ncon This is used to specify the number of weights that each vertex has. It is also the number of balance
constraints that must be satisfied.
nparts This is used to specify the number of sub-domains that are desired. Note that the number of
sub-domains is independent of the number of processors that call this routine.
wgtflag This is used to indicate if the graph is weighted. wgtflag can take one of four values:
0 No weights (vwgt and adjwgt are both NULL).
1 Weights on the edges only (vwgt is NULL).
2 Weights on the vertices only (adjwgt is NULL).
3 Weights on both the vertices and edges.
numflag This is used to indicate the numbering scheme that is used for the vtxdist, xadj, adjncy, and part
arrays. numflag can take the following two values:
0 C-style numbering is assumed that starts from 0
1 Fortran-style numbering is assumed that starts from 1
tpwgts An array of size ncon × nparts that is used to specify the fraction of vertex weight that should
be distributed to each sub-domain for each balance constraint. If all of the sub-domains are to be of
the same size for every vertex weight, then each of the ncon × nparts elements should be set to
a value of 1/nparts. If ncon is greater than 1, the target sub-domain weights for each sub-domain
are stored contiguously (similar to the vwgt array). Note that the sum of all of the tpwgts for a
given vertex weight should be one.
ubvec An array of size ncon that is used to specify the imbalance tolerance for each vertex weight, with 1
being perfect balance and nparts being perfect imbalance. A value of 1.05 for each of the ncon
weights is recommended.
options This is an array of integers that is used to pass parameters to the routine. Their meanings are identical
to those of ParMETIS_V3_AdaptiveRepart.
edgecut Upon successful completion, the number of edges that are cut by the partitioning is written to this
parameter.
part This is an array of size equal to the number of locally-stored vertices. Upon successful completion the
partition vector of the locally-stored vertices is written to this array. (See discussion in Section 4.2.4).
If the number of processors is not equal to the number of sub-domains and/or options[3] is set to
PARMETIS_PSR_UNCOUPLED, then the previously computed partitioning must be passed to the
routine as a parameter via this array.
comm This is a pointer to the MPI communicator of the processes that call PARMETIS. For most programs
this will point to MPI_COMM_WORLD.
Returns
METIS_OK Indicates that the function returned normally.
METIS_ERROR Indicates some other type of error.
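The wgtflag, tpwgts, and ubvec arguments described above follow simple rules that can be sketched in a few helper functions. This is an illustration under stated assumptions rather than library code: the helper names are hypothetical, float stands in for real_t and int for idx_t, and the layout tpwgts[p * ncon + c] reflects the statement that the weights of each sub-domain are stored contiguously.

```c
#include <stdlib.h>

/* Returns ncon*nparts target fractions, each 1/nparts, laid out so that the
 * ncon entries of one sub-domain are adjacent. float stands in for real_t.
 * The caller frees the result; NULL is returned on bad input or failure. */
float *make_uniform_tpwgts(int ncon, int nparts)
{
    if (ncon < 1 || nparts < 1)
        return NULL;
    float *tpwgts = malloc((size_t)ncon * (size_t)nparts * sizeof *tpwgts);
    if (tpwgts == NULL)
        return NULL;
    for (int p = 0; p < nparts; p++)
        for (int c = 0; c < ncon; c++)
            tpwgts[p * ncon + c] = 1.0f / (float)nparts;
    return tpwgts;
}

/* Sum over all sub-domains of the fractions for constraint c (should be 1). */
double tpwgts_sum(const float *tpwgts, int ncon, int nparts, int c)
{
    double sum = 0.0;
    for (int p = 0; p < nparts; p++)
        sum += tpwgts[p * ncon + c];
    return sum;
}

/* Returns ncon copies of the tolerance tol (1.05 is the recommended value). */
float *make_ubvec(int ncon, float tol)
{
    if (ncon < 1)
        return NULL;
    float *ubvec = malloc((size_t)ncon * sizeof *ubvec);
    if (ubvec == NULL)
        return NULL;
    for (int c = 0; c < ncon; c++)
        ubvec[c] = tol;
    return ubvec;
}

/* Encodes wgtflag: 1 for edge weights, 2 for vertex weights, 3 for both. */
int make_wgtflag(int has_vwgt, int has_adjwgt)
{
    return (has_vwgt ? 2 : 0) + (has_adjwgt ? 1 : 0);
}
```

Checking tpwgts_sum for every constraint before the call is a cheap way to catch target fractions that do not add up to one.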
5.4 Fill-reducing Orderings
int ParMETIS_V3_NodeND (
idx_t *vtxdist, idx_t *xadj, idx_t *adjncy, idx_t *numflag, idx_t *options, idx_t *order,
idx_t *sizes, MPI_Comm *comm
)
Description
This routine is used to compute a fill-reducing ordering of a sparse matrix using multilevel nested dissection.
Parameters
vtxdist This array describes how the vertices of the graph are distributed among the processors. (See discussion
in Section 4.2.1). Its contents are identical for every processor.
xadj, adjncy
These store the (local) adjacency structure of the graph at each processor. (See discussion in
Section 4.2.1).
numflag This is used to indicate the numbering scheme that is used for the vtxdist, xadj, adjncy, and order
arrays. numflag can take the following two values:
0 C-style numbering is assumed that starts from 0
1 Fortran-style numbering is assumed that starts from 1
options This is an array of integers that is used to pass parameters to the routine. Their meanings are identical
to those of ParMETIS_V3_PartKway.
order This array returns the result of the ordering (described in Section 4.2.4).
sizes This array returns the number of nodes for each sub-domain and each separator (described in
Section 4.2.4).
comm This is a pointer to the MPI communicator of the processes that call PARMETIS. For most programs
this will point to MPI_COMM_WORLD.
Returns
METIS_OK Indicates that the function returned normally.
METIS_ERROR Indicates some other type of error.
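The numflag parameter above selects between 0-based and 1-based numbering for the input arrays. As a minimal sketch of what switching schemes involves, the hypothetical helper below shifts a 0-based distributed graph to 1-based numbering in place. It assumes the array layout of Section 4.2.1 (vtxdist has one entry per processor plus one, xadj has one entry per local vertex plus one, and adjncy holds xadj[nlocal] entries), and int stands in for idx_t.

```c
/* Shifts a 0-based distributed CSR graph to 1-based numbering in place.
 * nprocs: number of processors (vtxdist has nprocs + 1 entries)
 * nlocal: number of locally stored vertices (xadj has nlocal + 1 entries) */
void to_fortran_numbering(int nprocs, int nlocal,
                          int *vtxdist, int *xadj, int *adjncy)
{
    int nedges = xadj[nlocal]; /* read before xadj itself is shifted */

    for (int i = 0; i <= nprocs; i++)
        vtxdist[i] += 1;
    for (int i = 0; i <= nlocal; i++)
        xadj[i] += 1;
    for (int i = 0; i < nedges; i++)
        adjncy[i] += 1;
}
```

After such a shift the routine would be called with numflag set to 1, and the returned order array would then be interpreted with the same numbering.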
int ParMETIS_V32_NodeND (
idx_t *vtxdist, idx_t *xadj, idx_t *adjncy, idx_t *vwgt, idx_t *numflag, idx_t *mtype,
idx_t *rtype, idx_t *p_nseps, int *s_nseps, real_t *ubfrac, idx_t *seed, idx_t *dbglvl,
idx_t *order, idx_t *sizes, MPI_Comm *comm
)
Description
This routine is used to compute a fill-reducing ordering of a sparse matrix using multilevel nested dissection.
Parameters
vtxdist This array describes how the vertices of the graph are distributed among the processors. (See discussion
in Section 4.2.1). Its contents are identical for every processor.
xadj, adjncy
These store the (local) adjacency structure of the graph at each processor. (See discussion in
Section 4.2.1).
vwgt This array stores the weights of the vertices. A value of NULL indicates that each vertex has unit weight.
(See discussion in Section 4.2.1).
numflag This is used to indicate the numbering scheme that is used for the vtxdist, xadj, adjncy, and order
arrays. The possible values are:
0 C-style numbering is assumed that starts from 0
1 Fortran-style numbering is assumed that starts from 1
mtype This is used to indicate the scheme to be used for computing the matching. The possible values,
defined in parmetis.h, are:
PARMETIS_MTYPE_LOCAL A local matching scheme is used in which each pair of matched
vertices resides on the same processor.
PARMETIS_MTYPE_GLOBAL A global matching scheme is used in which the pairs of matched
vertices can reside on different processors. This is the default value
if a NULL value is passed.
rtype This is used to indicate the separator refinement scheme that will be used. The possible values,
defined in parmetis.h, are:
PARMETIS_SRTYPE_GREEDY Uses a simple greedy refinement algorithm.
PARMETIS_SRTYPE_2PHASE Uses a higher quality refinement algorithm, which is somewhat
slower. This is the default value if a NULL value is passed.
p_nseps Specifies the number of different separators that will be computed during each bisection at the first
⌊log p⌋ levels of the nested dissection (these are computed in parallel among the processors). The
bisection that achieves the smallest separator is selected. The default value is 1 (when NULL is
supplied), but values greater than 1 can lead to better quality orderings. However, this is a
time-quality trade-off.
s_nseps Specifies the number of different separators that will be computed during each bisection at
the remaining levels of the nested dissection (when the matrix has been divided among
the processors and each processor proceeds independently to order its portion of the matrix). The
bisection that achieves the smallest separator is selected. The default value is 1 (when NULL is
supplied), but values greater than 1 can lead to better quality orderings. However, this is a
time-quality trade-off.
ubfrac This value indicates how unbalanced the two partitions are allowed to get during each bisection level.
The default value (when NULL is supplied) is 1.05, but higher values (typical ranges 1.05–1.25) can
lead to smaller separators.
seed This is the seed for the random number generator. When NULL is supplied, a default seed is used.
dbglvl This specifies the level of information to be returned during the execution of the algorithm. This is
identical to the options[1] parameter of the other routines. When NULL is supplied, a value of 0
is used.
order This array returns the result of the ordering (described in Section 4.2.4).
sizes This array returns the number of nodes for each sub-domain and each separator (described in
Section 4.2.4).
comm This is a pointer to the MPI communicator of the processes that call PARMETIS. For most programs
this will point to MPI_COMM_WORLD.
Returns
METIS_OK Indicates that the function returned normally.
METIS_ERROR Indicates some other type of error.
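Two conventions in the parameter list above lend themselves to a short illustration: the number of nested dissection levels that are handled in parallel, and the rule that a NULL pointer selects the documented default. The sketch below is hypothetical helper code, not part of the library. It assumes that the logarithm in the p_nseps description is taken base 2, since each level is a bisection, and it uses int and float as stand-ins for idx_t and real_t.

```c
#include <stddef.h>

/* floor(log2(nprocs)) for nprocs >= 1, found by counting halvings.
 * Assumption: the "log p" in the p_nseps description is base 2. */
int parallel_bisection_levels(int nprocs)
{
    int levels = 0;
    while (nprocs > 1) {
        nprocs /= 2;
        levels++;
    }
    return levels;
}

/* NULL selects the documented default of 1 separator per bisection. */
int nseps_or_default(const int *nseps)
{
    return (nseps != NULL) ? *nseps : 1;
}

/* NULL selects the documented default imbalance of 1.05. */
float ubfrac_or_default(const float *ubfrac)
{
    return (ubfrac != NULL) ? *ubfrac : 1.05f;
}
```

For example, with 12 processors the first three levels would be governed by p_nseps and all deeper levels by s_nseps, under the base-2 assumption stated above.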
5.5 Mesh to Graph Translation
int ParMETIS_V32_Mesh2Dual (
idx_t *elmdist, idx_t *eptr, idx_t *eind, idx_t *numflag, idx_t *ncommonnodes,
idx_t **xadj, idx_t **adjncy, MPI_Comm *comm
)
Description
This routine is used to construct a distributed graph given a distributed mesh. It can be used in conjunction with
other routines in the PARMETIS library. The mesh can contain elements of different types.
Parameters
elmdist This array describes how the elements of the mesh are distributed among the processors. It is
analogous to the vtxdist array. Its contents are identical for every processor. (See discussion in
Section 4.2.3).
eptr, eind
These arrays specify the elements that are stored locally at each processor. (See discussion in
Section 4.2.3).
numflag This is used to indicate the numbering scheme that is used for the elmdist, eptr, eind, xadj, and adjncy
arrays. numflag can take one of two values:
0 C-style numbering that starts from 0.
1 Fortran-style numbering that starts from 1.
ncommonnodes
This parameter determines the degree of connectivity among the vertices in the dual graph. Specifically,
an edge is placed between any two elements if and only if they share at least this many nodes.
This value should be greater than 0, and for most meshes a value of two will create reasonable dual
graphs. However, depending on the type of elements in the mesh, values greater than 2 may also
be valid choices. For example, for meshes containing only triangular, tetrahedral, hexahedral, or
rectangular elements, this parameter can be set to two, three, four, or two, respectively.
Note that setting this parameter to a small value will increase the number of edges in the resulting
dual graph and the corresponding partitioning time.
xadj, adjncy
Upon the successful completion of the routine, pointers to the constructed xadj and adjncy arrays
will be written to these parameters. (See discussion in Section 4.2.1). The calling program is
responsible for freeing this memory by calling the METIS_Free routine described in METIS’ manual.
comm This is a pointer to the MPI communicator of the processes that call PARMETIS. For most programs
this will point to MPI_COMM_WORLD.
Returns
METIS_OK Indicates that the function returned normally.
METIS_ERROR Indicates some other type of error.
Note
This routine can be used in conjunction with ParMETIS_V3_PartKway, ParMETIS_V3_PartGeomKway, or
ParMETIS_V3_AdaptiveRepart. It typically runs in half the time required by ParMETIS_V3_PartKway.
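The effect of ncommonnodes can be illustrated with a small serial sketch. The code below is not the library's algorithm: it is a brute-force, single-process stand-in with hypothetical function names that applies the stated rule (two elements are connected if and only if they share at least ncommonnodes nodes) to a mesh given in the eptr/eind form of Section 4.2.3. It assumes 0-based numbering, that each element lists its nodes without repetition, and it uses int in place of idx_t. Its output arrays come from malloc and calloc and are released with free, unlike the arrays returned by the library routine.

```c
#include <stdlib.h>

/* Counts the nodes shared by elements a and b of an eptr/eind mesh. */
int shared_nodes(const int *eptr, const int *eind, int a, int b)
{
    int count = 0;
    for (int i = eptr[a]; i < eptr[a + 1]; i++)
        for (int j = eptr[b]; j < eptr[b + 1]; j++)
            if (eind[i] == eind[j])
                count++;
    return count;
}

/* Builds the dual graph of ne elements in CSR form by comparing all pairs.
 * Returns 0 on success and -1 on allocation failure. On success the caller
 * releases *xadj and *adjncy with free(). */
int build_dual_serial(int ne, const int *eptr, const int *eind, int ncommon,
                      int **xadj, int **adjncy)
{
    int *xa = calloc((size_t)ne + 1, sizeof *xa);
    if (xa == NULL)
        return -1;

    /* Pass 1: count the neighbours of every element. */
    for (int a = 0; a < ne; a++)
        for (int b = 0; b < ne; b++)
            if (a != b && shared_nodes(eptr, eind, a, b) >= ncommon)
                xa[a + 1]++;

    /* A prefix sum turns the counts into CSR row offsets. */
    for (int a = 0; a < ne; a++)
        xa[a + 1] += xa[a];

    int *adj = malloc((size_t)(xa[ne] > 0 ? xa[ne] : 1) * sizeof *adj);
    if (adj == NULL) {
        free(xa);
        return -1;
    }

    /* Pass 2: record the neighbours in element order. */
    for (int a = 0; a < ne; a++) {
        int pos = xa[a];
        for (int b = 0; b < ne; b++)
            if (a != b && shared_nodes(eptr, eind, a, b) >= ncommon)
                adj[pos++] = b;
    }

    *xadj = xa;
    *adjncy = adj;
    return 0;
}
```

For three triangles with nodes {0,1,2}, {1,2,3}, and {3,4,5}, a threshold of two connects only the first pair, which shares an edge, while a threshold of one also connects the second and third triangles through their single common node. This matches the remark that small values of ncommonnodes increase the number of edges in the dual graph.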
6 Restrictions & Limitations
The following is a list of restrictions and limitations imposed by the current release of PARMETIS. Note that these
restrictions are on top of any other restrictions described with each API function.
1. The graph must be initially distributed among the processors such that each processor has at least one vertex.
Substantially better performance will be achieved if the vertices are distributed so that each processor gets an
equal number of vertices.
2. The routines must be called by at least two processors. That is, PARMETIS cannot be used on a single processor.
If you need to partition on a single processor use METIS.
3. The partitioning routines in PARMETIS switch to a purely serial implementation (via a call to the corresponding
METIS routine) when the following conditions are met: (i) the graph/matrix contains fewer than 10000 vertices,
(ii) the graph contains no edges, and (iii) the number of vertices in the graph is less than 20 × p, where p is the
number of processors.
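The first restriction above, that every processor holds at least one vertex and preferably an equal share, can be met by constructing vtxdist as in the following sketch. The helper name is hypothetical and int stands in for idx_t. The sketch assumes the convention of Section 4.2.1 that processor i owns the vertices numbered vtxdist[i] up to vtxdist[i+1] - 1, with 0-based numbering.

```c
/* Fills vtxdist[0..nprocs] so that the vertex counts of any two processors
 * differ by at most one. Returns 0 on success, or -1 if some processor
 * would receive no vertex (nvtxs < nprocs) or nprocs is not positive. */
int make_balanced_vtxdist(int nvtxs, int nprocs, int *vtxdist)
{
    if (nprocs < 1 || nvtxs < nprocs)
        return -1;

    int base = nvtxs / nprocs;  /* every processor gets at least this many */
    int extra = nvtxs % nprocs; /* the first 'extra' processors get one more */

    vtxdist[0] = 0;
    for (int p = 0; p < nprocs; p++)
        vtxdist[p + 1] = vtxdist[p] + base + (p < extra ? 1 : 0);
    return 0;
}
```

Because the contents of vtxdist must be identical on every processor, each process can compute the array independently from the same nvtxs and nprocs.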
7 Hardware & Software Requirements, and Contact Information
PARMETIS is written in ANSI C and uses MPI for inter-processor communication. Instructions on how to build
PARMETIS are available in the Install.txt file. In the directory called Graphs, you will find some graphs that
can be used to test PARMETIS using the test programs that are built with PARMETIS.
In order to use PARMETIS in your application you need to have a copy of the serial METIS library and link your
program with both libraries (i.e., libparmetis.a and libmetis.a). Note that the PARMETIS package already
contains the source code for the METIS library. The included build system automatically constructs both libraries.
PARMETIS has been extensively tested on a number of different parallel computers. However, even though
PARMETIS contains no known bugs, this does not mean that all of its bugs have been found and fixed. If you have any
problems, please send email to karypis@cs.umn.edu with a brief description of the problem.
PARMETIS is copyrighted by the Regents of the University of Minnesota. It can be freely used for educational and
research purposes by non-profit institutions and US government agencies only. Other organizations are allowed to
use PARMETIS only for evaluation purposes, and any further uses will require prior approval. The software may not
be sold or redistributed without prior approval. One may make copies of the software for their use provided that the
copies are not sold or distributed and are used under the same terms and conditions.
As unestablished research software, this code is provided on an “as is” basis without warranty of any kind, either
expressed or implied. Downloading or executing any part of this software constitutes an implicit agreement to
these terms. These terms and conditions are subject to change at any time without prior notice.
References
[1] R. Biswas and R. Strawn. A new procedure for dynamic adaption of three-dimensional unstructured grids. Applied Numerical
Mathematics, 13:437–452, 1994.
[2] C. Fiduccia and R. Mattheyses. A linear time heuristic for improving network partitions. In Proc. 19th IEEE Design
Automation Conference, pages 175–181, 1982.
[3] J. Fingberg, A. Basermann, G. Lonsdale, J. Clinckemaillie, J. Gratien, and R. Ducloux. Dynamic load-balancing for parallel
structural mechanics simulations with DRAMA. ECT2000, 2000.
[4] G. Karypis and V. Kumar. A coarse-grain parallel multilevel k-way partitioning algorithm. In Proceedings of the 8th SIAM
conference on Parallel Processing for Scientific Computing, 1997.
[5] G. Karypis and V. Kumar. METIS: A software package for partitioning unstructured graphs, partitioning meshes, and
computing fill-reducing orderings of sparse matrices, version 4.0. Technical report, Univ. of MN, Dept. of Computer Sci. and Engr.,
1998.
[6] G. Karypis and V. Kumar. Multilevel algorithms for multi-constraint graph partitioning. In Proc. Supercomputing ’98, 1998.
[7] G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed
Computing, 48(1), 1998.
[8] G. Karypis and V. Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. SIAM Review, 41(2):278–300,
1999.
[9] B. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal,
49(2):291–307, 1970.
[10] L. Oliker and R. Biswas. PLUM: Parallel load balancing for adaptive unstructured meshes. Journal of Parallel and Distributed
Computing, 52(2):150–177, 1998.
[11] A. Patra and D. Kim. Efficient mesh partitioning for adaptive hp finite element meshes. Technical report, Dept. of Mech.
Engr., SUNY at Buffalo, 1999.
[12] A. Pothen, H. Simon, L. Wang, and S. Bernard. Towards a fast implementation of spectral nested dissection. In
Supercomputing ’92 Proceedings, pages 42–51, 1992.
[13] K. Schloegel, G. Karypis, and V. Kumar. A new algorithm for multi-objective graph partitioning. In Proc. EuroPar ’99, pages
322–331, 1999.
[14] K. Schloegel, G. Karypis, and V. Kumar. Parallel multilevel algorithms for multi-constraint graph partitioning. In Proc.
EuroPar-2000, 2000. Accepted as a Distinguished Paper.
[15] K. Schloegel, G. Karypis, and V. Kumar. A unified algorithm for load-balancing adaptive scientific simulations. In Proc.
Supercomputing 2000, 2000.
[16] K. Schloegel, G. Karypis, and V. Kumar. Wavefront diffusion and LMSR: Algorithms for dynamic repartitioning of adaptive
meshes. IEEE Transactions on Parallel and Distributed Systems, 12(5):451–466, 2001.
[17] J. Watts, M. Rieffel, and S. Taylor. A load balancing technique for multi-phase computations. Proc. of High Performance
Computing ’97, pages 15–20, 1997.
32
... These steps are iterated until the label of vertices ceases modifying and the algorithm converges. ParMETIS [119], PT-Scotch [120], KaPPa [121], JOSTLE [122], JA-BE-JA [123], Blp [124], BS [125], XTRAPULP [126], and Spinner [96] are examples of ODVP. ParMETIS [119] is MPI-based parallel partitioning that implements several methods for partitioning unstructured graphs and computing sparse matrices fill-reducing orderings. ...
... ParMETIS [119], PT-Scotch [120], KaPPa [121], JOSTLE [122], JA-BE-JA [123], Blp [124], BS [125], XTRAPULP [126], and Spinner [96] are examples of ODVP. ParMETIS [119] is MPI-based parallel partitioning that implements several methods for partitioning unstructured graphs and computing sparse matrices fill-reducing orderings. It adopts the popular multilevel partitioning METIS [41] by including routines explicitly designed for parallel computations and large-scale numerical simulations. ...
... First, the input graph G is changed to a hypergraph (Hg) via the split and connect method, and then the Hg is partitioned via vertex partitioning. dSPAC+X partitions billions of edges by integrating parallel vertex partitionings like ParMETIS [119] and ParHIP [127]. DNE [132] is a distributed version of NE [49] and introduces a parallel expansion heuristic. ...
Full-text available
Article
Graphs are a tremendously suitable data representations that model the relationships of entities in many application domains, such as recommendation systems, machine learning, computational biology, social network analysis, and other application domains. Graphs with many vertices and edges have become quite prevalent in recent years. Therefore, graph computing systems with integrated various graph partitioning techniques have been envisioned as a promising paradigm to handle large-scale graph analytics in these application domains. However, scalable processing of large-scale graphs is challenging due to their high volume and inherent irregular structure of the real-world graphs. Hence, industry and academia have been recently proposing graph partitioning and computing systems to process and analyze large-scale graphs efficiently. The graph partitioning and computing systems have been designed to improve scalability issues and reduce processing time complexity. This paper presents an overview, classification, and investigation of the most popular graph partitioning and computing systems. The various methods and approaches of graph partitioning and diverse categories of graph computing systems are presented. Finally, we discuss main challenges and future research directions in graph partitioning and computing systems.
...  It follows multilevel approach along with spectral partitioning technique [24].  Calculation of eigenvalues for spectral partitioning made efficiently in Chaco [51].  It is flexible i.e., supplementary method used if one of the methods fails to give result. ...
...  It is flexible i.e., supplementary method used if one of the methods fails to give result. Actually, the results of different methods can be compared and the best possible one selected even if a method does not fail [51].  Simple, Inertial, Spectral, Kemighan-Lin and Multilevel Kemighan-Lin algorithms are implemented in chaco. ...
...  It generates partitions by applying iterative and recursive approach. While applying this it confirms load is equally balanced among processors by minimizing edge cut value [51]. [35]. ...
Article
In this paper the authors have used a systematic literature review to provide benchmarking on influencing parameters for graph partitioning tools, which is the principal contribution of the present paper. Tools are compared on the basis of parameters which will impact the performance of tool. The paper elucidates about the tools and techniques along with their features, merits and demerits and also highlighted on influencing parameters which is missing in other reviews. These techniques are analysed by identifying merits and demerits of each technique. This research paper can help the researchers to choose the appropriate tool or technique for their own partitioning problems. Also authors have suggested future research directions and anomalies for improvement in tools and techniques for Graph Partitioning.
... In order to partition the dual graph corresponding to the decomposition into the subdomains {Ω i0 } i=1,...,N 0 , we use the Trilinos package Zoltan2 [17]. It provides an interface to the partitioning algorithms from the older Zoltan package and can also be linked to third party libraries such as ParMETIS [13]. We solve the linear system assembled by Galeri using the conjugate gradient (PCG) method from the Trilinos package Belos, preconditioned by FROSch. ...
Full-text available
Preprint
Different graph partitioning methods, i.e., linear partioning, parallel hypergraph (PHG) partioning, and two approaches using ParMETIS, are considered to generate an unstructured decomposition of the second-level coarse operator of three-level FROSch (Fast and Robust Overlapping Schwarz) preconditioners in the Trilinos software library. In our context, the parallel hypergraph method shows the most consistent results.
... In the case of large-graphs, training GNN models requires partitioning them into smaller sub-graphs, and assigning each sub-graph to a CPU; it also requires splitting training example vertices in a balanced manner into the specified number of parts. Minimum-cut algorithms can partition such graphs along vertices [14] or edges [7,8] into a specified number of sub-graphs. In the case of millions of small graphs (e.g., molecule graphs) training GNN models is an embarrassingly parallel problem, as each node in a cluster can feed a batch of example graphs in parallel to the model. ...
Full-text available
Preprint
Training Graph Neural Networks, on graphs containing billions of vertices and edges, at scale using minibatch sampling poses a key challenge: strong-scaling graphs and training examples results in lower compute and higher communication volume and potential performance loss. DistGNN-MB employs a novel Historical Embedding Cache combined with compute-communication overlap to address this challenge. On a 32-node (64-socket) cluster of $3^{rd}$ generation Intel Xeon Scalable Processors with 36 cores per socket, DistGNN-MB trains 3-layer GraphSAGE and GAT models on OGBN-Papers100M to convergence with epoch times of 2 seconds and 4.9 seconds, respectively, on 32 compute nodes. At this scale, DistGNN-MB trains GraphSAGE 5.2x faster than the widely-used DistDGL. DistGNN-MB trains GraphSAGE and GAT 10x and 17.2x faster, respectively, as compute nodes scale from 2 to 32.
... Then, we use a multilevel k-way partitioning method [34] to generate the subdomains considering optimization criteria, including minimizing the resulting subdomain connectivity graph and the contiguous partition enforcement. Figure 3 depicts the mesh partitioned into sixteen subdomains using the ParMetis library [35]. We then obtain the displacements of the model using non-linear finite element analysis. ...
Full-text available
Article
Underwater manipulation with current robotics technology is a challenging task with significant limits in versatility and robustness terms. Such functionality has tremendous potential covering a broad spectrum of applications, mainly replacing divers performing hazardous jobs. Soft robotics provides an efficient solution for operating in these scenarios and adapting to uncertain environmental conditions. This paper presents the design and fabrication of a simple, low-cost, and easily deployable soft gripper for underwater manipulation. We use modelling and simulation techniques for designing the soft fluidic elastomer actuators that compose the soft gripper and additive manufacturing techniques for rapid test cycles and validation. These techniques allow for a fast redesign depending on the application requirements. The proposal combines materials and fabrication techniques to take advantage of their strengths. We validate the feasibility and ability of the proposed soft gripper in a challenging underwater scenario using a subaquatic vehicle.
... The graph is redistributed according to this partition and ParMETIS_V3_PartKway is then called to compute the final high quality partition. Information about the partitioning process of Parmetis is available in [25]. ...
Full-text available
Article
Numerical simulation of thermal hydraulics of nuclear reactors is widely concerned, but large-scale fluid simulation is still prohibited due to the complexity of components and huge computational effort. Some applications of open source CFD programs still have a large gap in terms of comprehensiveness of physical models, computational accuracy and computational efficiency compared with commercial CFD programs. Therefore, it is necessary to improve the computational performance of in-house CFD software (YHACT, the parallel analysis code of thermohydraulices) to obtain the processing capability of large-scale mesh data and better parallel efficiency. In this paper, we will form a unified framework of meshing and mesh renumbering for solving fluid dynamics problems with unstructured meshes. Meanwhile, the effective Greedy, RCM (reverse Cuthill-Mckee), and CQ (cell quotient) grid renumbering algorithms are integrated into YHACT software. An important judgment metric, named median point average distance (MDMP), is applied as the discriminant of sparse matrix quality to select the renumbering methods with better effect for different physical models. Finally, a parallel test of the turbulence model with 39.5 million grid volumes is performed using a pressurized water reactor engineering case component with 3*3 rod bundles. The computational results before and after renumbering are also compared to verify the robustness of the program. Experiments show that the CFD framework integrated in this paper can correctly perform simulations of the thermal engineering hydraulics of large nuclear reactors. The parallel size of the program reaches a maximum of 3072 processes. The renumbering acceleration effect reaches its maximum at a parallel scale of 1536 processes, 56.72%. It provides a basis for our future implementation of open-source CFD software that supports efficient large-scale parallel simulations.
... Computations were performed on INM cluster using 40 -600 cores. In order to obtain better balanced partitioning of the mesh, the ParMETIS [23] partitioner was used, interface to which is provided by INMOST. Calculated water head and displacement distributions are depicted in figure 7. Time measurements and linear iterations count are presented in tables 4 and 5. Total speed-up is presented in figure 8 and shows superlinear scaling with monolithic strategy reaching slightly larger speed-up. ...
Full-text available
Preprint
Poroelasticity is an example of coupled processes which are crucial for many applications including safety assessment of radioactive waste repositories. Numerical solution of poroelasticity problems discretized with finite volume-virtual element scheme leads to systems of algebraic equations, which may be solved simultaneously or iteratively. In this work, parallel scalability of the monolithic strategy and of the fixed-strain splitting strategy is examined, which depends mostly on linear solver performance. It was expected that splitting strategy would show better scalability due to better performance of a black-box linear solver on systems with simpler structure. However, this is not always the case.
... The triangular resulting mesh is composed of 14 372 triangles. However, the generality of the PolyDG method allows us to use mesh elements of any shape, for this reason, we agglomerate the mesh by using ParMETIS [37] and we obtain a polygonal mesh of 51 elements, as shown in Figure 5. This mesh is then used to perform a convergence Error Convergence for P q − P q+1 elements To test the convergence we consider the following exact solution for a case with two pressure fields (|J| = 2): ...
Full-text available
Preprint
We introduce and analyze a discontinuous Galerkin method for the numerical modelling of the equations of Multiple-Network Poroelastic Theory (MPET) in the dynamic formulation. The MPET model can comprehensively describe functional changes in the brain considering multiple scales of fluids. Concerning the spatial discretization, we employ a high-order discontinuous Galerkin method on polygonal and polyhedral grids and we derive stability and a priori error estimates. The temporal discretization is based on a coupling between a Newmark $\beta$-method for the momentum equation and a $\theta$-method for the pressure equations. After the presentation of some verification numerical tests, we perform a convergence analysis using an agglomerated mesh of a geometry of a brain slice. Finally we present a simulation in a three dimensional patient-specific brain reconstructed from magnetic resonance images. The model presented in this paper can be regarded as a preliminary attempt to model the perfusion in the brain.
Article
The numerical simulation of blood flows in the human body with a certain level of clinical accuracy is important for the understanding of the human physiology. The success of the modeling relies on a robust numerical method with the corresponding software that can handle the complex geometry, the complex fluid flows and run efficiently on a supercomputer. In this work, we introduce a highly parallel domain decomposition method to solve the three-dimensional incompressible Navier-Stokes equations on a patient-specific artery at the full-body scale from neck to feet with 222 outlets and a minimum diameter around 1.0 mm. A locally refined, unstructured mesh is used to resolve the complex fluid flow. Moreover, a two-level method is introduced to determine the model parameters in the Windkessel outlet boundary condition to guarantee clinically correct flow distributions to 14 major regions. A fully implicit Newton-Krylov-Schwarz method is used to solve the nonlinear algebraic system at each time step and numerical experiments show that the proposed method is robust with respect to the complex geometry, the graph-based partition of the complex mesh, the ill-conditioned sparse systems with locally dense blocks, and different model parameters and is scalable with up to 15,360 processor cores. With the proposed method, one simulation of the blood flow in a full-body arterial network can be obtained in about 8 hours per cardiac cycle, which enables its potential use in a wide range of clinical scenarios.
Article
The discrete unified gas kinetic scheme (DUGKS) is a recently devised approach to simulate multiscale flows based on the kinetic models, which also shows distinct features for continuum flows. Most of the existing DUGKS are sequential or based on structured grids, thus limiting their scope of application in engineering. In this paper, a parallel DUGKS for inviscid high-speed compressible flows on unstructured grids is proposed. In the framework of the DUGKS, the gradients of the distribution functions are calculated by a least-square method. To parallelize the method, a graph-based partitioning method is employed to guarantee the load balancing and minimize the communication among processors. The method is validated by several benchmark problems, i.e., a two-dimensional (2D) Riemann problem, 2D subsonic flows passing two benchmark airfoils, a 2D regular shock reflection problem, 2D supersonic flows (Mach numbers are 3 and 5) around a cylinder, an explosion in a three-dimensional (3D) box, a 3D subsonic flow around the Office National d'Etudes et de Recherches Aérospatiales M6 wing, a 3D hypersonic flow (Mach number is 10) around a hemisphere, and a supersonic flow over the Northrop YF-17 fighter model. The numerical results show good agreement with the published results, and the present method is robust for a wide range of Mach numbers, from subsonic to hypersonic. The parallel performance results show that the proposed method is highly parallel scalable, where an almost linear scalability with 93% parallel efficiency is achieved for a 3D problem with over 55 × 10 ⁶ tetrahedrons on a supercomputer with up to 4800 processors.
Conference Paper
In this paper we present a parallel formulation of a multilevel k-way graph partitioning algorithm. The multilevel k-way partitioning algorithm reduces the size of the graph by collapsing vertices and edges (coarsening phase), finds a k-way partition of the smaller graph, and then constructs a k-way partition for the original graph by projecting and refining the partition to successively finer graphs (uncoarsening phase). A key innovative feature of our parallel formulation is that it utilizes graph coloring to effectively parallelize both the coarsening and the refinement during the uncoarsening phase. Our algorithm is able to achieve a high degree of concurrency while maintaining the high quality partitions produced by the serial algorithm. We test our scheme on a large number of graphs from finite element methods and transportation domains. On a Cray T3D, our parallel formulation produces high quality 128-way partitions on 128 processors in a little over two seconds for graphs with a million vertices. Thus our parallel algorithm makes it possible to perform dynamic graph partitioning in adaptive computations without compromising quality.
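The role of graph coloring in the abstract above can be illustrated with a small sketch. Vertices that share a color are pairwise non-adjacent, so they can be matched or moved at the same time without conflicting decisions. The sequential greedy coloring below is an assumption chosen for brevity; the abstract does not say which coloring algorithm the parallel formulation uses, and the function names are hypothetical.

```python
def greedy_coloring(adj):
    """Give each vertex the smallest color not used by an already-colored neighbor.

    adj maps a vertex to the list of its neighbors (undirected graph).
    """
    color = {}
    for v in sorted(adj):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


def color_classes(color):
    """Group vertices by color; each group is an independent set."""
    classes = {}
    for v, c in color.items():
        classes.setdefault(c, []).append(v)
    return [classes[c] for c in sorted(classes)]


# A 4-cycle 0-1-2-3-0: opposite corners are non-adjacent.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
color = greedy_coloring(adj)
classes = color_classes(color)
```

In a refinement step built on this idea, one color class would be processed at a time, with all vertices in that class handled concurrently.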
Article
Introduction. The use of domain decomposition solvers presumes the existence of a partitioning of the domain that distributes the computational effort equitably. Adaptive hp finite elements, which are capable of delivering solution accuracies far superior to classical h- or p-version finite element methods for a given discretization size [BS94], create special difficulties in generating such load balanced partitions. Two major difficulties that arise in partitioning such meshes are a) the choice of a good a priori measure of computational effort, which can be equidistributed among the processors and b) minimizing the data migration among processors as the mesh changes and is repartitioned. In uniform meshes using simple solvers, computational effort is directly related to the degrees of freedom in each sub-domain. In schemes using hp meshes, and domain decomposition solvers (e
Article
In this paper we present and study a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, find a k-way partitioning of the smaller graph, and then uncoarsen and refine it to construct a k-way partitioning for the original graph. These algorithms compute a k-way partitioning of a graph G = (V, E) in O(|E|) time, which is faster by a factor of O(log k) than previously proposed multilevel recursive bisection algorithms. A key contribution of our work is in finding a high quality and computationally inexpensive refinement algorithm that can improve upon an initial k-way partitioning. We also study the effectiveness of the overall scheme for a variety of coarsening schemes. We present experimental results on a large number of graphs arising in various domains including finite element methods, linear programming, VLSI, and transportation. Our experiments show that this new scheme produces partitions that are of comparable or better q...
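A single coarsening level of the kind described above can be sketched as follows. The abstract mentions several coarsening schemes without naming them here; heavy-edge matching is used below as one well-known choice, and the function names and the tie-breaking rule are assumptions made for illustration.

```python
def heavy_edge_matching(adj):
    """Pair each unmatched vertex with its unmatched neighbor of heaviest edge.

    adj maps a vertex to {neighbor: edge_weight}. Returns a map from each
    fine vertex to the id of the coarse vertex it collapses into.
    """
    coarse_of = {}
    next_id = 0
    for v in sorted(adj):
        if v in coarse_of:
            continue
        candidates = [(w, u) for u, w in adj[v].items() if u not in coarse_of]
        coarse_of[v] = next_id
        if candidates:
            # Heaviest edge wins; ties go to the smaller vertex id.
            _, u = max(candidates, key=lambda t: (t[0], -t[1]))
            coarse_of[u] = next_id
        next_id += 1
    return coarse_of


def contract(adj, coarse_of):
    """Build the coarse graph, summing weights of parallel edges."""
    coarse = {}
    for v, nbrs in adj.items():
        cv = coarse_of[v]
        coarse.setdefault(cv, {})
        for u, w in nbrs.items():
            cu = coarse_of[u]
            if cu != cv:  # edges inside a collapsed pair disappear
                coarse[cv][cu] = coarse[cv].get(cu, 0) + w
    return coarse


# Square 0-1-3-2-0 with two heavy edges (0-2 and 1-3).
adj = {
    0: {1: 1, 2: 4},
    1: {0: 1, 3: 3},
    2: {0: 4, 3: 1},
    3: {1: 3, 2: 1},
}
coarse_of = heavy_edge_matching(adj)
coarse = contract(adj, coarse_of)
```

Collapsing the heavy edges hides them inside coarse vertices, so the coarse graph exposes only the light edges (total weight 2) to the partitioner.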
Article
We consider the problem of partitioning the nodes of a graph with costs on its edges into subsets of given sizes so as to minimize the sum of the costs on all edges cut. This problem arises in several physical situations—for example, in assigning the components of electronic circuits to circuit boards to minimize the number of connections between boards. This paper presents a heuristic method for partitioning arbitrary graphs which is both effective in finding optimal partitions, and fast enough to be practical in solving large problems.
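The objective stated above, the sum of costs on cut edges, and the effect of exchanging one pair of nodes can be sketched in a few lines. The abstract does not give the formula; the sketch uses the standard formulation of this heuristic, in which D(v) is the external minus the internal cost of v and the gain of swapping a and b is D(a) + D(b) - 2c(a, b). Only a single best swap is shown, not the full pass with locking, and all names are hypothetical.

```python
def cut_cost(adj, side):
    """Total cost of edges whose endpoints lie in different subsets."""
    return sum(c for v in adj for u, c in adj[v].items()
               if u > v and side[u] != side[v])


def d_value(adj, side, v):
    """External cost minus internal cost of vertex v."""
    external = sum(c for u, c in adj[v].items() if side[u] != side[v])
    internal = sum(c for u, c in adj[v].items() if side[u] == side[v])
    return external - internal


def swap_gain(adj, side, a, b):
    """Reduction in cut cost if a and b (on opposite sides) are exchanged."""
    return d_value(adj, side, a) + d_value(adj, side, b) - 2 * adj[a].get(b, 0)


adj = {
    0: {1: 1, 2: 3},
    1: {0: 1, 3: 3},
    2: {0: 3, 3: 1},
    3: {1: 3, 2: 1},
}
side = {0: 0, 1: 0, 2: 1, 3: 1}

before = cut_cost(adj, side)
gain, a, b = max((swap_gain(adj, side, a, b), a, b)
                 for a in adj for b in adj
                 if side[a] == 0 and side[b] == 1)
side[a], side[b] = side[b], side[a]  # subset sizes stay fixed under a swap
after = cut_cost(adj, side)
```

Because nodes are exchanged in pairs, the subset sizes given in the problem statement are preserved automatically.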
Conference Paper
An iterative mincut heuristic for partitioning networks is presented whose worst case computation time, per pass, grows linearly with the size of the network. In practice, only a very small number of passes are typically needed, leading to a fast approximation algorithm for mincut partitioning. To deal with cells of various sizes, the algorithm progresses by moving one cell at a time between the blocks of the partition while maintaining a desired balance based on the size of the blocks rather than the number of cells per block. Efficient data structures are used to avoid unnecessary searching for the best cell to move and to minimize unnecessary updating of cells affected by each move.
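One pass of a single-cell-move heuristic with a size-based balance limit might look like the sketch below. It is a simplification under stated assumptions: it works on an ordinary weighted graph rather than a network of multi-pin nets, it scans all cells to find the best move instead of using the efficient bucket structures the abstract refers to (so it is not linear time), and the names `fm_pass` and `max_block` are hypothetical.

```python
def move_gain(adj, side, v):
    """Cut reduction if v moves to the other block."""
    return sum(w if side[u] != side[v] else -w for u, w in adj[v].items())


def fm_pass(adj, size, side, max_block):
    """Move one cell at a time, lock it, and keep the best prefix of moves."""
    side = dict(side)
    block_size = [0, 0]
    for v in adj:
        block_size[side[v]] += size[v]
    locked = set()
    total, best_total, best_side = 0, 0, dict(side)
    while True:
        # A move is legal if the destination block stays within max_block.
        legal = [(move_gain(adj, side, v), v) for v in adj
                 if v not in locked
                 and block_size[1 - side[v]] + size[v] <= max_block]
        if not legal:
            break
        gain, v = max(legal, key=lambda t: (t[0], -t[1]))
        block_size[side[v]] -= size[v]
        side[v] = 1 - side[v]
        block_size[side[v]] += size[v]
        locked.add(v)
        total += gain
        if total > best_total:
            best_total, best_side = total, dict(side)
    return best_side, best_total


adj = {
    0: {1: 1, 2: 3},
    1: {0: 1, 3: 3},
    2: {0: 3, 3: 1},
    3: {1: 3, 2: 1},
}
size = {0: 2, 1: 1, 2: 1, 3: 2}  # cells of different sizes
start = {0: 0, 1: 0, 2: 1, 3: 1}
best_side, best_total = fm_pass(adj, size, start, max_block=4)
```

Here the large cells 0 and 3 can never move because either move would push a block past the size limit, which shows balance being enforced on block size rather than on cell count.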
Conference Paper
The authors describe the spectral nested dissection (SND) algorithm, a novel algorithm for computing orderings appropriate for parallel factorization of sparse, symmetric matrices. The algorithm makes use of spectral properties of the Laplacian matrix associated with the given matrix to compute separators. The authors evaluate the quality of the spectral orderings with respect to several measures: fill, elimination tree height, height and weight balances of elimination trees, and clique tree heights. They use some very large structural analysis problems as test cases and demonstrate on these real applications that spectral orderings compare quite favorably with commonly used orderings, outperforming them by a wide margin for some of these measures. The only disadvantage of SND is its relatively long execution time.
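The idea of using spectral properties of the Laplacian to find a separator can be sketched as follows. The sketch assumes the usual approach of splitting vertices by the sign of the eigenvector of the second-smallest Laplacian eigenvalue (the Fiedler vector). The eigenvector is approximated here by a shifted power iteration, chosen only to keep the example dependency-free; the abstract does not describe the eigensolver, and the step from the edge cut to a vertex separator is deliberately naive.

```python
def fiedler_vector(adj, iterations=500):
    """Approximate the Fiedler vector of a connected graph.

    adj is a list of neighbor lists. Power iteration is applied to
    shift*I - L while the constant vector is projected out.
    """
    n = len(adj)
    shift = 2 * max(len(nbrs) for nbrs in adj)  # bound on the largest eigenvalue of L
    x = [float(i) for i in range(n)]
    for _ in range(iterations):
        mean = sum(x) / n
        x = [xi - mean for xi in x]  # remove the constant component
        # (L x)_v = deg(v) * x_v - sum of x_u over neighbors u of v
        y = [shift * x[v] - (len(adj[v]) * x[v] - sum(x[u] for u in adj[v]))
             for v in range(n)]
        norm = sum(yi * yi for yi in y) ** 0.5
        x = [yi / norm for yi in y]
    return x


# Path graph 0-1-2-3-4-5.
adj = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
x = fiedler_vector(adj)
left = [v for v in range(len(adj)) if x[v] < 0]
right = [v for v in range(len(adj)) if x[v] >= 0]
# Naive vertex separator: vertices on one side that touch the other side.
separator = sorted(v for v in right if any(u in left for u in adj[v]))
```

In a nested dissection ordering, the separator vertices would be numbered last and the two remaining halves ordered recursively.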
Article
Parallel computations comprised of multiple, tightly interwoven phases of computation may require a different approach to dynamic load balancing than single-phase computations. This paper presents a load sharing method based on the view of load as a vector, rather than as a scalar. This approach allows multiphase computations to achieve higher efficiency on large-scale multicomputers than possible with traditional techniques. Results are presented for two large-scale particle simulations running on 128 nodes of an Intel Paragon and on 256 processors of a Cray T3D, respectively.
INTRODUCTION
Load balancing techniques already in the literature have concentrated entirely on single-phase computations (Boillat 1990; Cybenko 1989; Evans and Butt 1993; Heirich and Taylor 1995; Horton 1993; Kohring 1995; Lin and Keller 1987; Muniz and Zaluska 1995; Song 1994; Walshaw and Berzins 1995; Watts et al. 1996; Willebeek-LeMair and Reeves 1993; Williams 1991; Xu and Lau 1997). That is, they work only ...
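The difference between treating load as a scalar and as a vector can be made concrete with a small sketch. Each task carries one weight per phase, and the imbalance is measured separately for each phase as the maximum processor load divided by the average. The function name and the imbalance measure are assumptions for illustration, not taken from the paper.

```python
def phase_imbalance(weights, part, nparts):
    """Return max-load / average-load for each phase.

    weights maps a task to a tuple with one weight per phase;
    part maps a task to the processor that owns it.
    """
    nphases = len(next(iter(weights.values())))
    load = [[0] * nphases for _ in range(nparts)]
    for task, w in weights.items():
        for p in range(nphases):
            load[part[task]][p] += w[p]
    result = []
    for p in range(nphases):
        total = sum(load[k][p] for k in range(nparts))
        result.append(max(load[k][p] for k in range(nparts)) * nparts / total)
    return result


# Tasks 0 and 1 only work in phase 0; tasks 2 and 3 only in phase 1.
weights = {0: (2, 0), 1: (2, 0), 2: (0, 2), 3: (0, 2)}

# Balanced as a scalar (4 units each), but each phase runs on one processor.
scalar_balanced = {0: 0, 1: 0, 2: 1, 3: 1}
# Balanced as a vector: every phase is split evenly.
vector_balanced = {0: 0, 1: 1, 2: 0, 3: 1}

imb_scalar = phase_imbalance(weights, scalar_balanced, 2)
imb_vector = phase_imbalance(weights, vector_balanced, 2)
```

If the phases are separated by synchronization, the first assignment leaves one processor idle in every phase even though the total work per processor is equal.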