Thune et al., Journal of Mathematics in Industry (2021) 11:12
https://doi.org/10.1186/s13362-021-00108-5
RESEARCH  Open Access
On the impact of heterogeneity-aware mesh partitioning and non-contributing computation removal on parallel reservoir simulations
Andreas Thune^{1,2,*}, Xing Cai^{1,2} and Alf Birger Rustad^{3}

*Correspondence: andreast@simula.no
1 Simula Research Laboratory, Martin Linges vei 25, 1364 Fornebu, Norway
2 University of Oslo, Oslo, Norway
Full list of author information is available at the end of the article
Abstract
Parallel computations have become standard practice for simulating the complicated
multi-phase flow in a petroleum reservoir. Increasingly sophisticated numerical
techniques have been developed in this context. In the pursuit of algorithmic
superiority, however, there is a risk of losing sight of the ultimate goal, namely, to
efficiently simulate real-world reservoirs on realistic parallel hardware platforms. In this
paper, we quantitatively analyse the negative performance impact caused by
non-contributing computations that are associated with the “ghost computational
cells” per subdomain, which is an insufficiently studied subject in parallel reservoir
simulation. We also show how these non-contributing computations can be avoided
by reordering the computational cells of each subdomain, such that the ghost cells
are grouped together. Moreover, we propose a new graph-edge weighting scheme
that can improve the mesh partitioning quality, aiming at a balance between
handling the heterogeneity of geological properties and restricting the
communication overhead. To put the study in a realistic setting, we enhance the
open-source Flow simulator from the OPM framework, and provide comparisons with
industry-standard simulators for real-world reservoir models.
Keywords: Reservoir simulation; High performance computing; Mesh partitioning;
Norne reservoir model
1 Introduction and motivation
Computer simulation is extensively used in the oil industry to predict and analyse the
flow of fluids in petroleum reservoirs. The multi-phase flow in such porous media is mathematically described by a complicated system of partial differential equations (PDEs), which is only numerically solvable for realistic cases. At the same time, the quest for realism in reservoir simulation leads to using a large number of grid cells and thereby many degrees of freedom. Parallel computing is thus indispensable for achieving large scale and fast simulation times.
The fundamental step of parallelization is to divide the total number of degrees of free-
dom among multiple hardware processing units. Each processing unit is responsible for
computing its assigned degrees of freedom, in collaboration with the other units. To use
distributed-memory mainstream parallel computers for mesh-based computations, such
as in a reservoir simulation, the division of the degrees of freedom is most naturally
achieved by partitioning the global computational mesh. Each processing unit is therefore assigned a sub-mesh consisting of two types of grid cells: interior and ghost. The distributed ownership of the interior cells gives a disjoint division of all the degrees of freedom among the processing units. The ghost cells per sub-mesh, which constitute one or several layers around the interior cells, are needed to maintain the PDE-induced coupling between the neighboring sub-meshes.
One major benefit of such a work division, based on mesh partitioning, is that each
processing unit can independently discretize the PDEs restricted to its assigned sub-mesh.
Any global linear or nonlinear system will thus only exist logically, collectively represented
by a set of sub-systems of linear or nonlinear equations. The overall computing speed of a
parallel simulator, however, hinges upon the quality of mesh partitioning. Apart from the
usual objective of minimizing the inter-process communication volume, it is also desirable to avoid assigning strongly coupled grid cells to different processes. The latter is
important for the effectiveness of parallel preconditioners that are essential for iteratively
solving the linear systems arising from the discretized PDEs. The two objectives are not
easy to achieve simultaneously.
It is therefore necessary to revisit the topic of mesh partitioning as the foundation of
parallel reservoir simulations. In particular, the interplay between minimizing commu-
nication overhead and maximizing numerical effectiveness, especially in the presence of
reservoir-characteristic features, deserves a thorough investigation. The novelty of this
paper includes a study of how different edge-weighting schemes, which can be used in
a graph-based method of mesh partitioning, will influence numerical effectiveness and
communication overhead. We also quantify the negative performance impact caused by
non-contributing computations that are associated with the ghost degrees of freedom.
This subject is typically neglected by the practitioners of parallel reservoir simulations.
Moreover, we present a simple strategy to avoid the non-contributing computations based
on a reordering of the interior and ghost grid cells per subdomain.
The remainder of the paper is organized as follows. Section 2 gives a very brief introduction to the mathematical model and the numerical solution strategy for reservoir flow simulations. Then, Sect. 3 explains the parallelization with a focus on how to avoid non-contributing computations related to the unavoidable ghost grid cells. Thereafter, Sect. 4 devotes its attention to the details of mesh partitioning and the corresponding graph partitioning problem, with the presentation of a new edge-weighting scheme. The impacts of removing non-contributing computations and applying the new edge-weighting scheme are demonstrated by numerical experiments in Sect. 5, whereas Sects. 6 and 7, respectively, address the related work and provide concluding remarks.
2 Mathematical model and numerical strategy
In this section, we will give a very brief introduction to the most widely used mathematical
model of petroleum reservoirs and a standard numerical solution strategy that is based on
corner-point grids and cell-centered finite volume discretization.
2.1 The black-oil model
The standard mathematical model used in reservoir simulation is the black-oil model [1,
2]. It is a system of nonlinear PDEs governing three-phase fluid flow in porous media. The
equations are derived from Darcy's law and conservation of mass. The model assumes that the different chemical species found in the reservoir can be separated into three categories of fluid phases, α ∈ {w, o, g}: water (w), oil (o) and gas (g). There are consequently three main equations in the black-oil model, one for each phase:
$$\frac{\partial}{\partial t}\left(\frac{\phi S_o}{B_o}\right)=\nabla\cdot\left(\frac{k_{ro}K}{\mu_o B_o}\nabla\Phi_o\right)+q_o, \qquad (1)$$

$$\frac{\partial}{\partial t}\left(\frac{\phi S_w}{B_w}\right)=\nabla\cdot\left(\frac{k_{rw}K}{\mu_w B_w}\nabla\Phi_w\right)+q_w, \qquad (2)$$

$$\frac{\partial}{\partial t}\left(\phi\left(\frac{R_s S_o}{B_o}+\frac{S_g}{B_g}\right)\right)=\nabla\cdot\left(\frac{R_s k_{ro}K}{\mu_o B_o}\nabla\Phi_o+\frac{k_{rg}K}{\mu_g B_g}\nabla\Phi_g\right)+R_s q_o+q_{fg}. \qquad (3)$$
Here, φ, K and R_s are porosity, permeability and gas solubility. They describe the geological properties of a reservoir. For each phase α, the terms S_α, μ_α, B_α and k_{rα} denote saturation, viscosity, formation volume factor and relative permeability. The phase potential Φ_α is defined by the phase pressure p_α and phase density ρ_α:

$$\Phi_\alpha = p_\alpha + \rho_\alpha \gamma z, \qquad (4)$$
where γ and z are the gravitational constant and reservoir depth. The unknowns of the black-oil model are the saturation and pressure of each phase, so the following three relations are needed to complete Eqs. (1)-(3):

$$S_o + S_w + S_g = 1, \qquad (5)$$

$$p_w = p_o - p_{cow}(S_w), \qquad (6)$$

$$p_g = p_o + p_{cog}(S_g). \qquad (7)$$

The dependencies of the capillary pressures p_cow and p_cog upon the saturations S_w and S_g, used in Eq. (6) and Eq. (7), are typically based on empirical models.
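To spell out how these relations are used (our own illustration, anticipating the choice of primary unknowns in Sect. 2.3): once p_o, S_w and S_g are known in a cell, Eqs. (5)-(7) immediately give the remaining saturation and pressures,

$$S_o = 1 - S_w - S_g, \qquad p_w = p_o - p_{cow}(S_w), \qquad p_g = p_o + p_{cog}(S_g),$$

and Eq. (4) then provides the phase potentials Φ_α that enter Eqs. (1)-(3). This is why p_o, S_w and S_g can serve as primary unknowns in the numerical strategy of Sect. 2.3.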
2.2 Well modelling
The right-hand sides of the black-oil model (Eqs. (1)-(3)) contain source/sink terms qα,
which represent either production or injection wells in a reservoir. The wells affect the
fluid flow on a much finer scale than what is captured by the resolution of the computa-
tional mesh for the reservoir. Special well models, such as the Peaceman model [3], are
incorporated to model important phenomena, such as sharp pressure drops, in proximity to well in- and outflow. In the Peaceman well model, the pressure drop is modelled
by introducing new variables and equations in cells that contain a well bottom-hole. The
related well equations numerically couple all the grid cells perforated by each well.
2.3 Corner-point grid and discretization
It is common to use the 3D corner-point grid format [4] to represent a reservoir mesh.
A corner-point grid is a set of hexahedral cells logically aligned in a Cartesian fashion.
The actual geometry of the grid is defined by a set of inclined vertical pillars, such that
each grid cell in the mesh is initially formed by eight corner points on four of these pillars.
Deformation and shifting of the sides of a cell are allowed independently of the horizontal
neighboring cells. Moreover, a realistic reservoir model may render some of the cells inactive.
The combined consequence is that the resulting computational mesh is unstructured. For
example, a cell can have fewer than six sides, and there can be more than one neighboring
cell on each side.
A standard cell-centred finite volume scheme, using two-point flux approximation with upwind mobility weighting [5], can be applied on a corner-point grid to discretize the PDEs of the black-oil model. The time integration is fully implicit to ensure numerical stability. Take for instance the water equation (Eq. (2)). Let S_w^{ℓ,i} denote S_w at time step ℓ inside cell C_i, which has V_i as its volume. Suppose the neighboring cells of C_i are denoted as C_{j_0}, ..., C_{j_i}; the discretization result of Eq. (2) restricted to cell C_i is thus

$$\frac{V_i\phi_i}{\Delta t}\left[\left(\frac{S_w}{B_w}\right)^{\ell+1,i}-\left(\frac{S_w}{B_w}\right)^{\ell,i}\right]+\sum_{j=j_0}^{j_i}\lambda_w^{\ell+1,ij}\,T_{ij}\left(\Phi_w^{\ell+1,i}-\Phi_w^{\ell+1,j}\right)=V_i\,q_w^{\ell+1,i}. \qquad (8)$$
Here,

$$\lambda_w^{\ell+1,ij} = \frac{k_{rw}^{\ell+1,ij}}{\mu_w^{\ell+1,ij}\,B_w^{\ell+1,ij}}$$

denotes the water mobility on the face intersection Γ_ij between a pair of neighboring cells C_i and C_j, whereas T_ij is the static transmissibility on Γ_ij:

$$T_{ij} = m_{ij}\,|\Gamma_{ij}|\left(\frac{|\vec{c}_i|^2}{\vec{n}_i\cdot K_i\vec{c}_i}+\frac{|\vec{c}_j|^2}{\vec{n}_j\cdot K_j\vec{c}_j}\right)^{-1}. \qquad (9)$$
The m_ij term denotes a transmissibility multiplier, for incorporating the effect of faults. For example, when a fault acts as a barrier between cells C_i and C_j, we have m_ij = 0. Figure 1 illustrates the geometric terms c_i, n_i and Γ_ij involved in the transmissibility calculation.
Figure 1 A sketch of the geometric properties needed to calculate the static transmissibility (Eq. (9)) between two neighboring grid cells C_i and C_j

A typical scenario of reservoir simulation is that S_w, S_g and p_o are chosen as the primary unknowns, and the cell-centered finite volume method is applied to the three main equations Eqs. (1)-(3) on all the grid cells. As a result, we get a system of nonlinear algebraic equations per time step. The total number of degrees of freedom is three times the number of active grid cells. Newton iterations are needed at each time level, such that a series of linear systems Ax = b will be solved by an iterative method, such as BiCGStab or GMRES [6], which is accelerated by some preconditioner. The linear systems are sparse, often ill-conditioned, and non-symmetric due to the influence of well models. Although the nonzero values in the matrix A change with the time level and Newton iteration, the sparsity pattern remains unchanged (as long as the corner-point grid is fixed). This allows for a static partitioning of the computational mesh needed for parallelization. In this context, the static transmissibility T_ij defined in Eq. (9) is an important measure of the coupling strength between a pair of neighboring cells C_i and C_j.
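To make the transmissibility calculation concrete, the following C++ sketch evaluates Eq. (9) from the geometric quantities illustrated in Fig. 1. It is a minimal sketch under our own naming (halfTrans, transmissibility, faceArea, the vectors c and n, and the 3×3 cell permeability tensors K); it is not code from the Flow code base.

#include <array>

using Vec3 = std::array<double, 3>;
using Tensor3 = std::array<std::array<double, 3>, 3>;

// Half-transmissibility of one cell with respect to the shared face:
// t = (n . K c) / |c|^2, i.e. the reciprocal of one term inside the
// bracket of Eq. (9).
static double halfTrans(const Vec3& c, const Vec3& n, const Tensor3& K)
{
    Vec3 Kc{0.0, 0.0, 0.0};
    for (int r = 0; r < 3; ++r)
        for (int s = 0; s < 3; ++s)
            Kc[r] += K[r][s] * c[s];
    double nKc = 0.0, c2 = 0.0;
    for (int r = 0; r < 3; ++r) {
        nKc += n[r] * Kc[r];
        c2  += c[r] * c[r];
    }
    return nKc / c2;
}

// Static transmissibility T_ij of Eq. (9): the harmonic average of the two
// half-transmissibilities, scaled by the face area |Gamma_ij| and the
// fault multiplier m_ij.
double transmissibility(double multiplier, double faceArea,
                        const Vec3& ci, const Vec3& ni, const Tensor3& Ki,
                        const Vec3& cj, const Vec3& nj, const Tensor3& Kj)
{
    const double ti = halfTrans(ci, ni, Ki);
    const double tj = halfTrans(cj, nj, Kj);
    return multiplier * faceArea / (1.0 / ti + 1.0 / tj);
}

A zero multiplier (a sealing fault) gives T_ij = 0, in which case the corresponding graph edge is dropped during mesh partitioning, cf. Sect. 4.2.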
3 Efficient parallelization of reservoir simulation
To parallelize the numerical strategy outlined in the preceding section, several steps are
needed. The main focus of this section will be on two topics. First, we explain why ghost
grid cells need to be added per sub-mesh after the global computational mesh is non-
overlappingly partitioned. Second, we pinpoint the various types of non-contributing
computations that arise due to the ghost cells, and show how these can be avoided for better computational efficiency. We remark that the exact amount and spread of ghost cells
among the sub-meshes are determined by the details of the mesh-partitioning scheme,
which will be the subject of Sect. 4.
3.1 Parallelization based on division of cells
Numerical solution of the black-oil model consists of a time integration procedure, where
during each time step several Newton iterations are invoked to linearize the nonlinear PDE
system in Eqs. (1)-(3). The linearized equations are then discretized and solved numeri-
cally. The main computational work inside every Newton iteration is the construction and
subsequent solution of a linear system of the form Ax =b. Typically, the 3D corner-point
grid remains unchanged throughout the entire simulation. It is thus customary to start
the parallelization by a static, non-overlapping division of the grid cells evenly among a
prescribed number of processes. Suppose N denotes the total number of active cells in the global corner-point grid, and N_p is the number of cells assigned to process p. Then we have

$$\sum_{p=1}^{P} N_p = N,$$

where P denotes the total number of processes. Process p is responsible for computing the 3N_p degrees of freedom that live on its N_p designated cells. The global linear system Ax = b that needs to be calculated and solved inside each Newton iteration will only exist logically, i.e., it is collectively composed of the 3N_p rows of A and the 3N_p entries of x and b that are owned by every process p = 1, 2, ..., P.
3.2 The need for ghost cells
The non-overlapping cell division gives a "clean-cut" distribution of the computational responsibility among the P processes, specifically, through a divided ownership of the x entries. There are however two practical problems for parallel computing associated with
tries. There are however two practical problems for parallel computing associated with
such a non-overlapping division.
Figure 2 An illustrative example of 4-way mesh partitioning. The interior cells of each of the four subdomains are colored green, while ghost cells are colored red

First, we recall that the numerical coupling between two neighboring cells C_i and C_j is expressed by the static transmissibility T_ij as defined in Eq. (9). In case C_i and C_j belong to two different processes, inter-process exchange of data is required in the parallel solution procedure. If each process has a local data structure storing only its designated 3N_p entries of x, the inter-process communication will be in the form of many individual 3-value exchanges, resulting in a drastic communication overhead. It is thus common practice to let each process extend its portion of the x vector by 3N_p^G entries, where N_p^G denotes the number of ghost cells that are not owned by process p but border its internal boundary. For the finite volume method considered in this paper, only one layer of ghost cells is needed. Figure 2 demonstrates how ghost cells are added to the local grid of each process. The extended local data structure will allow aggregated inter-process communication, i.e., all the values needed by process p from process q are sent in one batch, at a much lower communication overhead compared with the individual-exchange counterpart. Specifically, whenever x has been distributedly updated, process p needs to receive in total 3N_p^G values of x from its neighbors. To distinguish between the two types of cells, we will from now on denote the originally designated N_p cells from the non-overlapping division as interior cells on process p.
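In practice such aggregated exchanges are provided by the underlying parallel linear algebra library (in the Flow simulator used later in this paper, by DUNE's copyOwnerToAll function, cf. Sect. 5.3.1). Purely as an illustration of the idea, a hand-rolled C++ sketch of one such aggregated ghost update could look as follows; the function and variable names (updateGhostValues, sendIdx, recvIdx) are our own, and the per-neighbor index lists are assumed to have been built once from the mesh partitioning.

#include <mpi.h>
#include <map>
#include <vector>

// One aggregated update of the ghost entries of the distributed vector x
// (3 values per cell). For each neighboring rank q, sendIdx[q] lists the
// local interior cells whose values rank q needs, and recvIdx[q] lists the
// local ghost cells that are owned by rank q.
void updateGhostValues(std::vector<double>& x,
                       const std::map<int, std::vector<int>>& sendIdx,
                       const std::map<int, std::vector<int>>& recvIdx)
{
    std::vector<MPI_Request> requests;
    std::map<int, std::vector<double>> sendBuf, recvBuf;

    // Post one receive per neighboring rank (one batch per neighbor).
    for (const auto& [q, cells] : recvIdx) {
        recvBuf[q].resize(3 * cells.size());
        requests.emplace_back();
        MPI_Irecv(recvBuf[q].data(), static_cast<int>(recvBuf[q].size()),
                  MPI_DOUBLE, q, 0, MPI_COMM_WORLD, &requests.back());
    }
    // Pack and send one batch per neighboring rank.
    for (const auto& [q, cells] : sendIdx) {
        std::vector<double>& buf = sendBuf[q];
        buf.reserve(3 * cells.size());
        for (int cell : cells)
            for (int d = 0; d < 3; ++d)
                buf.push_back(x[3 * cell + d]);
        requests.emplace_back();
        MPI_Isend(buf.data(), static_cast<int>(buf.size()),
                  MPI_DOUBLE, q, 0, MPI_COMM_WORLD, &requests.back());
    }
    MPI_Waitall(static_cast<int>(requests.size()), requests.data(),
                MPI_STATUSES_IGNORE);

    // Unpack the received batches into the ghost entries of x.
    for (const auto& [q, cells] : recvIdx) {
        const std::vector<double>& buf = recvBuf[q];
        for (std::size_t k = 0; k < cells.size(); ++k)
            for (int d = 0; d < 3; ++d)
                x[3 * cells[k] + d] = buf[3 * k + d];
    }
}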
Second, and perhaps more importantly, if a local discretization is carried out on process p by restricting to its designated N_p interior cells, the resulting local part of A, which is of dimension 3N_p × 3N_p, will be incomplete on the rows that correspond to the cells that have one or more of their neighboring cells owned by a process other than p. A similar problem also applies to the corresponding local entries in the vector b. Elaborate inter-process communication can be used to expand the sub-matrix on process p to be of dimension 3N_p × 3(N_p + N_p^G), for fully accommodating the numerical coupling between process p and all its neighboring processes. However, a communication-free and thereby more efficient local discretization approach is to let each process also include its ghost cells. More specifically, the local discretization per process is independently done on a sub-mesh that comprises both the interior and ghost cells. This computation can reuse a sequential discretization code, without the need of writing a specialized subdomain discretization procedure. The resulting sub-matrix A_p will therefore be of dimension 3(N_p + N_p^G) × 3(N_p + N_p^G) and the sub-vector b_p of length 3(N_p + N_p^G). We note that the 3N_p^G "extra" rows (or entries) in A_p (or b_p) that correspond to the N_p^G ghost cells will be incomplete/incorrect, but they do not actively participate in the parallel computation later. One particular benefit of having a square A_p is when a parallelized iterative solver for Ax = b relies on a parallel preconditioner that adopts some form of incomplete factorization per process. The latter typically requires each local matrix A_p to be of a (logically) square shape.
In the following, we will discuss what types of computation and memory overhead can
arise due to the ghost cells and how to alleviate them.
3.3 Non-contributing computation and memory overhead due to ghost cells
While promoting communication-free discretizations per sub-mesh and aggregated inter-process exchanges of data, the ghost cells (and the associated ghost degrees of freedom)
on every sub-mesh do bring disadvantages. If not treated appropriately, these can lead to
wasteful computations that are discarded later, as well as memory usage overhead. Such
issues normally receive little attention in parallel reservoir simulators. To fully identify
these performance obstacles, we will now dive into some of the numerical and program-
ming details related to solving Ax =bin parallel.
For any Krylov-subspace iterative solver for Ax =b, such as BiCGStab and GMRES [6],
the following four computational kernels must be parallelized:
Vector addition: w = u + v. If all the involved vectors are distributed among the processes in the same way as for x and b, then no inter-process communication is needed for a parallel vector addition operation. Each process simply executes w_p = u_p + v_p independently, involving the sub-vectors. However, unless the result vector w is used as the input vector to a subsequent matrix-vector multiplication (see below), the floating-point operations and memory traffic associated with the ghost-cell entries are wasted. It is indeed possible to test for each entry of w_p whether it is an interior-cell value or not, thus avoiding the non-contributing floating-point operations, but such an entry-wise if-test may dramatically slow down the overall execution of the parallel vector addition. Moreover, the memory traffic overhead due to the ghost-cell entries cannot be avoided on a cacheline-based memory system, if the ghost-cell and interior-cell entries are "intermingled" in memory.
Inner product: u · v. Again, we assume that both sub-vectors u_p and v_p have 3(N_p + N_p^G) entries on process p. It is in fact numerically incorrect to let each process simply compute its local inner product u_p · v_p, before summing up the local contributions from all the processes by a collective communication (such as the MPI_Allreduce function). The remedy is to let each process "skip" over the ghost-cell entries in u_p and v_p. In a typical scenario where the ghost-cell entries are mixed with interior-cell entries in u_p and v_p, some extra implementation effort is needed. For example, an assistant integer array named mask can be used, which is of length N_p + N_p^G, where mask[i]==1 means cell i is interior and mask[i]==0 means otherwise. Assuming the three degrees of freedom per cell are stored contiguously in memory, the following code segment is a possible implementation of the parallel inner product:
double sub_dot_p = 0, global_dot_p;
for (int i = 0; i < sub_num_cells; i++) {
    /* mask[i] is 1 for an interior cell and 0 for a ghost cell, so the
       three DoFs of a ghost cell contribute nothing to the local sum. */
    sub_dot_p += mask[i]*(sub_u[3*i]*sub_v[3*i]
                        + sub_u[3*i+1]*sub_v[3*i+1]
                        + sub_u[3*i+2]*sub_v[3*i+2]);
}
/* Sum the local contributions of all processes into the global result. */
MPI_Allreduce(&sub_dot_p, &global_dot_p, 1, MPI_DOUBLE,
              MPI_SUM, MPI_COMM_WORLD);
For example, the well-known DUNE software framework [7] adopts a similar implementation. It is clear that the floating-point operations associated with the ghost-cell entries in u_p and v_p, as well as all the multiplications associated with the array mask, are non-contributing work. Allocating the array mask also incurs memory usage and traffic overhead.
Sparse matrix-vector multiplication: u = Av. Here, we recall that the global matrix A is logically represented by a sub-matrix A_p per process, arising from a communication-free discretization that is restricted to a sub-mesh comprising both interior and ghost cells. The dimension of A_p is 3(N_p + N_p^G) × 3(N_p + N_p^G). Moreover, we assume that all the ghost-cell entries in the sub-vector v_p are consistent with their "master copies" that are owned by other processes as interior-cell entries. This can be ensured by an aggregated inter-process data exchange. Then, a parallel matrix-vector multiplication can be easily realized by letting each process independently execute u_p = A_p v_p. We note that the ghost-cell entries in u_p will not be correctly computed (an aggregated inter-process data exchange is needed if, e.g., u_p is later used as the input to another matrix-vector multiplication). Therefore, the floating-point operations and memory traffic associated with the ghost-cell entries in u_p and the ghost-cell rows in A_p are non-contributing.
Preconditioning operation: w = M^{-1}u. For faster and more robust convergence of a Krylov-subspace iterative solver, it is customary to apply a preconditioning operation to the result vector of a preceding matrix-vector multiplication. That is, a mathematically equivalent but numerically more effective linear system M^{-1}Ax = M^{-1}b is solved in reality. The action of a parallelized preconditioner M^{-1} is typically to apply w_p = Ã_p^{-1} u_p per sub-mesh, where Ã_p^{-1} denotes an inexpensive numerical approximation of the inverse of A_p. One commonly used strategy for constructing Ã_p^{-1} is to carry out an incomplete LU (ILU) factorization [6] of A_p. Similar to the case of parallel matrix-vector multiplication, the floating-point operations and memory traffic associated with the ghost-cell entries in w_p and the ghost-cell rows in A_p are non-contributing.
3.4 Handling non-contributing computation and memory overhead
The negative impact on the overall parallel performance, caused by the various types of
non-contributing computation and memory usage/traffic overhead, can be large. This is
especially true when the non-overlapping mesh partitioning is of insufficient quality (de-
tails will be discussed in Sect. 4). It is thus desirable to eliminate, as much as possible, the
non-contributing computation and memory overhead.
A closer look at the four kernels that are needed in the parallel solution of Ax = b reveals
the actual “evil”. Namely, the interior-cell entries and ghost-cell entries are intermingled.
This is a general situation if the interior and ghost cells of a sub-mesh are ordered to obey
the original numbering sequence of the corresponding cells in the global 3D corner-point
grid. (This is standard practice in parallel PDE solver software.) Hence, the key to avoiding
non-contributing computation and memory overhead is a separation of the interior-cell
entries from the ghost-cell counterparts in memory. This can be achieved per sub-mesh
by deliberately numbering all the ghost cells after all the interior cells, which only needs
to be done once and for all. If such a local numbering constraint is enforced, the non-
contributing computation and memory overhead can be almost completely eliminated.
Specifically, the parallel vector addition and inner-product can now simply stop at the
last interior degree of freedom. The array mask is thus no longer needed in the parallel
inner-product operation. For the parallel matrix-vector multiplication, the per-process
computation can stop at the last interior-cell row of A_p. In effect, the local computation only touches the upper 3N_p × 3(N_p + N_p^G) segment of A_p. The last 3N_p^G rows of A_p are not used. This also offers an opportunity to save the memory storage related to these "non-contributing" rows. More specifically, each of the last 3N_p^G rows can be zeroed out and replaced with a single value of 1 on the main diagonal. As a result, the sub-matrix A_p on process p is of the following new form:
process pis of the following new form:
Ap=AII
pAIG
p
0I
, (10)
where the AII
pblock is of dimension 3Np×3Npand stores the numerical coupling among
the 3Npinterior degrees of freedom, whereas the AIG
pblock is of dimension 3Np×3NG
p
and stores the numerical coupling between the 3Npinterior degrees of freedom and the
3NG
pghost degrees of freedom.
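As a minimal sketch of what the ghost-last ordering buys (our own simplified code, assuming one scalar unknown per row and plain CSR storage, whereas Flow/DUNE, cf. Sect. 5, operate on block-structured matrices), the interior-only kernels reduce to loops that simply stop at the last interior row:

// Local part of the inner product u . v with ghost-last ordering: the first
// n_interior entries are interior DoFs, so no mask array is needed.
double localDot(const double* u, const double* v, int n_interior)
{
    double s = 0.0;
    for (int k = 0; k < n_interior; ++k)
        s += u[k] * v[k];
    return s;  // the global result follows from an MPI_Allreduce over all processes
}

// SpMV restricted to the interior rows of the condensed A_p (CSR format).
// The trailing ghost rows, which are identity rows in Eq. (10), are never
// touched, although the column indices may still refer to ghost entries of v.
void interiorSpMV(const int* rowPtr, const int* colIdx, const double* val,
                  const double* v, double* u, int n_interior)
{
    for (int row = 0; row < n_interior; ++row) {
        double sum = 0.0;
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
            sum += val[k] * v[colIdx[k]];
        u[row] = sum;
    }
}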
The "condensed" sub-mesh matrix A_p in Eq. (10) is still of a square shape. This is mainly motivated by the situations where an incomplete factorization (such as ILU) of A_p is used as M^{-1} restricted to sub-mesh p in a parallel preconditioner setting. Clearly, having only a nonzero diagonal for the last 3N_p^G rows of A_p means that there is effectively no computational work associated with these rows in an ILU, which is a part of the preparation work of a Krylov-subspace solver before starting the linear iterations. Moreover, the forward-backward substitutions, which are executed within each preconditioning operation, also have negligible work associated with the "ghost" rows in the condensed A_p. Compared with the non-condensed version of A_p, which arises directly from a local discretization on the sub-mesh comprising both interior and ghost cells, the condensed A_p is superior in the amount of computational work, the amount of memory usage and traffic, as well as the preconditioning effect. The latter is due to the fact that a "natural" no-flux boundary condition is implicitly enforced on the ghost rows of the non-condensed version of A_p. This is, e.g., incompatible with a parallel Block-Jacobi preconditioner that effectively requires M^{-1} to be of the form:

$$M^{-1} = \begin{pmatrix} (A_{p=1}^{II})^{-1} & 0 & \cdots & 0 \\ 0 & (A_{p=2}^{II})^{-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & \cdots & (A_{p=P}^{II})^{-1} \end{pmatrix}. \qquad (11)$$
4 Mesh partitioning
As mentioned in the previous section, the first step of parallelizing a reservoir simulator
is a disjoint division of all the cells in the global corner-point grid, i.e., a non-overlapping
mesh partitioning. We have shown how to eliminate the non-contributing computation
and memory overhead, which are associated with the necessary inclusion of one layer of
ghost cells per sub-mesh. The actual amount of ghost cells per sub-mesh depends on the
non-overlapping division, which will be the subject of this section. One aim is to keep the
number of resulting ghost cells low, for limiting the overhead of aggregated inter-process
communication. At the same time, we want to ensure good convergence effectiveness of a
parallel preconditioner such as the Block-Jacobi method that uses ILU as the approximate
inverse of the sub-matrix A_p (Eq. (10)) per process.
For the general case of an unstructured global corner-point grid, the standard strategy
for a disjoint division of the grid cells is through partitioning a corresponding graph. The
graph is translated from the global grid by turning each grid cell into a graph vertex. If
grid cells i and j share an interface, it is translated into a (weighted) edge between vertex i and vertex j in the graph. This standard graph-based partitioning approach is traditionally focused on load balance and low communication volume. The convergence effectiveness of a resulting parallel preconditioner is normally not considered. We will therefore propose a new edge-weighting scheme to be used in the graph partitioner, which targets specifically the reservoir simulation scenario. The objective is to provide a balance between the pure
mesh-partitioning quality metrics and the convergence effectiveness.
4.1 Graph partitioning
A graph G = (V, E) is composed of a set of vertices V and a set of edges E ⊆ V × V connecting pairs of vertices in V. If weights are assigned to each member of V and E through weighting functions σ : V → R and ω : E → R, then we get a weighted graph G = (V, E, σ, ω). The P-way graph partitioning problem is defined as follows: partition the vertex set V into P subsets V_1, V_2, ..., V_P of approximately equal size, while minimizing the summed weight of all the "cut edges" e = (v_i, v_j) ∈ E connecting vertices belonging to different vertex subsets. Suppose C denotes the cut set of a partitioned G, containing all the cut edges. The graph partitioning problem can be formulated more precisely as a constrained optimization problem, where the objective function J is the sum of the weights of all members of C, also called the edge-cut:

$$\min J(C) = \sum_{e \in C} \omega(e), \qquad (12)$$

$$\text{subject to} \quad \frac{\max_p \sum_{v \in V_p} \sigma(v)}{\frac{1}{P}\sum_{v \in V} \sigma(v)} < \epsilon, \qquad (13)$$

where ε ≥ 1 is the imbalance tolerance of the load balancing constraint.
When the edge-weight function ω is uniform, i.e., each edge has a unit weight, the edge-cut is an approximation of the total volume of communication needed, e.g., before each parallel matrix-vector multiplication (see Sect. 3.3). When the edge-weight function ω is non-uniform, however, the objective function is no longer an approximation of communication overhead. As demonstrated in e.g. [8, 9], adopting non-uniform edge
weights in the partitioning graph can be beneficial when partitioning linear systems with
highly heterogeneous coefficients. Assigning edge weights based on the “between-cell
coupling strength” can improve the quality of parallel preconditioners, such as Block-
Jacobi (Eq. (11)).
Because the graph partitioning problem in Eqs. (12)-(13) is NP-complete, solving it exactly is practically impossible. Many existing algorithms find good approximate solutions, and implementations of these are available in several open software libraries such as Metis [10], Scotch [11] and Zoltan [12].
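For concreteness, the following C++ sketch shows what a call to such a library could look like, here using the METIS API as an example (Flow itself relies on Zoltan, see Sect. 5); all variable names are our own. Note that METIS expects positive integer edge weights, so the real-valued weighting schemes of Sect. 4.2 would have to be scaled and rounded before being passed in.

#include <metis.h>
#include <vector>

// Partition a cell-connectivity graph (CSR arrays xadj/adjncy with edge
// weights adjwgt) into nparts subdomains. Returns the subdomain index of
// every cell (graph vertex) in 'part'.
std::vector<idx_t> partitionGraph(std::vector<idx_t>& xadj,
                                  std::vector<idx_t>& adjncy,
                                  std::vector<idx_t>& adjwgt,
                                  idx_t nparts)
{
    idx_t nvtxs = static_cast<idx_t>(xadj.size()) - 1; // number of vertices (= active cells)
    idx_t ncon = 1;                                     // one balance constraint (cell count)
    idx_t objval = 0;                                   // resulting edge-cut, cf. Eq. (12)
    real_t ubvec = 1.05;                                // imbalance tolerance epsilon of Eq. (13)
    std::vector<idx_t> part(nvtxs);

    METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                        /*vwgt=*/nullptr, /*vsize=*/nullptr, adjwgt.data(),
                        &nparts, /*tpwgts=*/nullptr, &ubvec,
                        /*options=*/nullptr, &objval, part.data());
    return part;
}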
4.2 Edge-weighting strategies in graph partitioning for reservoir simulation
Although the black-oil equations are nonlinear and time dependent, and the values in the global linear system Ax = b vary with each Newton iteration and time step, the global corner-point grid remains unchanged. The required mesh partitioning can thus be done once and for all, at the beginning of the simulation. When translating the corner-point grid to a corresponding graph, there are two reservoir-specific tasks. The first is that in case two neighboring cells i and j have a zero value for the static transmissibility T_ij (e.g., due to a barrier fault), the corresponding edge in the graph is removed, because such a pair of cells is not numerically coupled. The second is to include additional edges connecting vertex pairs corresponding to all the cells penetrated by a common well, because these grid cells are numerically coupled. We denote the set of well-related edges as E_w. The set containing the other regular edges is denoted by E_f. It is practical to avoid dividing a well (the penetrated cells) among multiple subdomains, and one way to achieve this is to ascribe a large edge weight to the edges in E_w. A corresponding uniform edge-weighting strategy for the edges in E_f, while ensuring that no well is partitioned between subdomains, is defined as follows:

$$\omega_u(e) = \begin{cases} \infty & e \in E_w, \\ 1 & e \in E_f. \end{cases} \qquad (14)$$
In the above formula we ascribe weights of ∞ to the well edges. When implementing the edge-weighting scheme, the ∞-weights must be replaced by a large numerical value. We choose to use the largest possible value on a computer for the edge-weight data type.
To ensure good convergence effectiveness of a parallel preconditioner, we can modify
the above uniform edge-weighting strategy by using the static transmissibility Tij that lives
on each cell interface (Eq. (9)). This is because Tij can be used to estimate the between-
cell flux, and is hence directly related to the magnitude of the off-diagonal elements in A. Therefore,
T_ij can be considered to describe the strength of the between-cell coupling. A commonly used edge-weighting strategy based on transmissibility is thus

$$\omega_t(e) = \begin{cases} \infty & e \in E_w, \\ T_{ij} & e = (v_i, v_j) \in E_f. \end{cases} \qquad (15)$$
The transmissibility values can vary greatly in many realistic reservoir cases. One ex-
ample of transmissibility heterogeneity can be seen with the Norne case, displayed in the
histogram plot in Fig. 3b. Here, we observe a factor of more than 10^12 between the small-
est and largest transmissibility values. Using these transmissibilities directly as the edge
weights, we can get partitioning results with a potentially large communication volume.
Down-scaling the weights in Eq. (15) may help to decrease the communication overhead,
while still producing partitions that yield better numerical performance than the uniform-
weighted graph partitioning. We therefore propose an alternative edge-weighting strategy
by using the logarithm of T_ij as the weight of edge e = (v_i, v_j):

$$\omega_l(e) = \begin{cases} \infty & e \in E_w, \\ \log\left(\frac{T_{ij}}{T_{\min}}\right) & e = (v_i, v_j) \in E_f. \end{cases} \qquad (16)$$
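A sketch of how the three edge-weighting strategies in Eqs. (14)-(16) could be realized in code is given below; the function and type names are our own, float is chosen here as the edge-weight data type, and the ∞-weight of the well edges is replaced by the largest representable value, as described after Eq. (14).

#include <cmath>
#include <limits>

enum class WeightScheme { Uniform, Transmissibility, LogTransmissibility };

// Edge weight for a regular (flux) edge between two neighboring cells,
// following Eqs. (14)-(16). Tij is the static transmissibility on the shared
// face and Tmin the smallest positive transmissibility in the model.
float fluxEdgeWeight(WeightScheme scheme, double Tij, double Tmin)
{
    switch (scheme) {
    case WeightScheme::Uniform:            return 1.0f;                    // Eq. (14)
    case WeightScheme::Transmissibility:   return static_cast<float>(Tij); // Eq. (15)
    case WeightScheme::LogTransmissibility:
        return static_cast<float>(std::log(Tij / Tmin));                   // Eq. (16)
    }
    return 1.0f;
}

// Well edges get the "infinite" weight of Eqs. (14)-(16), replaced in
// practice by the largest representable value of the weight type.
float wellEdgeWeight()
{
    return std::numeric_limits<float>::max();
}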
5 Numerical experiments
In this section, we will investigate the effect of removing the non-contributing computa-
tions in solving Ax = b (see Sect. 3), as well as using different edge-weighting strategies
for graph partitioning (see Sect. 4), on parallel simulations of a realistic reservoir model.
The main objective of the experiments is to quantify the impact on the overall simulation
execution time. Moreover, for the different edge-weighting strategies, we will also exam-
ine the resulting numerical effectiveness and parallel efficiency. We have conducted our
experiments on the publicly available Norne model, which we will describe in Sect. 5.1. In Sect. 5.4 we will compare the parallel performance of the thus improved open-source simu-
lator with industry-standard commercial reservoir simulators.
We employ the open-source reservoir simulator Flow [13] to conduct our experiments
and test our alternative implementations and methods. Flow is provided by the Open
Porous Media (OPM) initiative [14], which is a software collaboration in the domain of
porous media fluid flow. Flow offers fully-implicit discretizations of the black-oil model,
and accepts the industry standard ECLIPSE input format. The linear algebra and grid im-
plementations are based on DUNE [7]. In our experiments we restrict ourselves to the
BiCGStab iterative solver, combined with a parallel Block-Jacobi preconditioner that uses
ILU(0) as the approximate subdomain solver. Although Flow supports more sophisticated
preconditioners, we stick to Flow's recommended default option. So far, experience with Flow shows that ILU(0) achieves better overall performance than the alternatives on
most relevant models.
Parallelization of Flow is mostly enabled by using the Message Passing Interface (MPI) library. For matrix assembly and I/O, shared-memory parallelism with the OpenMP threading library is also available. Graph partitioning in Flow is performed by Zoltan. The well
implementation in Flow does not allow for the division of a well over multiple subdo-
mains. In addition to the well-related edge-weights that discourage cutting wells, a post-processing procedure in Flow ensures that the cells perforated by a well are always contained on a single subdomain.
5.1 The Norne field model
To study reservoir simulation in a realistic setting we require real-world reservoir mod-
els. One openly available model that fits this criterion is the Norne benchmark case [15],
which is a black-oil model case based on a real oil field in the Norwegian Sea. The model
grid consists of 44,420 active cells, and has a heterogeneous and anisotropic permeabil-
ity distribution. The model also includes 36 wells, which have changing controls during
the simulation. Figure 3 depicts the Norne mesh colored by the x-directed permeability
values, plus a histogram of the static transmissibility values.
In the histogram in Fig. 3b, we can see a huge span of the transmissibilities (Eq. (9)) of
the Norne benchmark case. The majority of transmissibilities have a value between 10^{-16} and 10^{-9}. This large variation is a result of the heterogeneous distribution of permeability
and cell size shown in Fig. 3a.
Figure 3 Permeability distribution (a) in the x direction of the Norne grid. A histogram plot (b) of the
transmissibility associated with cell interfaces from the Norne reservoir simulation case. The spike in the left
end of the plot represents transmissibilities with value 0
Figure 4 Ratio of the ghost cells related to partitioning the Norne reservoir mesh
Because of the modest grid size of the Norne model, we will also consider a refined
version [16], where the cells are halved in each direction, resulting in a model with 8 times
as many grid cells as the original model. The number of active cells in the refined Norne
model is 355,360.
We conducted our experiments on the Norne models using two computing clusters: Abel [17] and Saga [18]. The Abel cluster consists of nodes with dual Intel Xeon E5-2670 2.6 GHz 8-core CPUs interconnected with an FDR InfiniBand (56 Gbit/s) network, whereas Saga is a cluster of more modern dual-socket Intel Xeon Gold 6138 2.0 GHz 20-core CPUs interconnected with an EDR InfiniBand (100 Gbit/s) network. The nodes on
Abel and Saga have a total of 16 and 40 cores respectively. All experiments that use more
MPI-processes than there are cores available on a single node are conducted on multiple
nodes.
In all experiments using Flow on the original and refined Norne models, we have turned
off the OpenMP multithreading for system assembly inside Flow.
5.2 Impact of removing non-contributing computations
We study the impact of non-contributing computations on the performance of linear al-
gebra kernels and the overall performance of Flow, by carrying out experiments on the original Norne model described above. For these experiments we have used the trans-
missibility edge weights in Eq. (15), the default choice of Flow, in the graph partitioning
scheme.
Figure 5 Time measurements (in seconds) of parallel linear algebra kernels for the Norne case, obtained on
the Abel cluster, with and without the non-contributing computations. The GL prefix means the ghost-related
non-contributing computations are removed
Figure 4 shows the ratio of ghost cells, i.e., (Σ_{p=1}^{P} N_p^G)/N, as a function of the number of subdomains used. We can see that the ghost cells can make up a significant proportion. For example, with P = 128, the ghost cell ratio is as high as 55%. As discussed in Sect. 3, non-contributing computations can arise due to the ghost cells. Because ghost cells increase as a proportion of total cells with an increasing number of MPI-processes, avoiding ghost-cell related non-contributing computations is crucial for achieving good strong scaling.
To show the negative impact of non-contributing computations on the performance
of solving Ax = b, we present the execution time of the linear algebra kernels used in the original Norne case in Fig. 5. It displays time measurements of sparse matrix-vector
multiplication (SpMV), ILU’s forward-backward substitution and inner product (IP), with
and without the non-contributing computations. These time measurements are attained
on the Abel cluster using selected matrices and vectors from the Flow Norne simulation.
Using these matrices and vectors, SpMV, ILU forward-backward substitution and inner
product operations are executed and timed.
Figure 5 shows a significant time improvement for the linear algebra kernels, due to removing the non-contributing computations as proposed in Sect. 3. Because the IP operation includes a collective reduction communication, it does not scale as well as the SpMV and ILU operations. We therefore observe that the relative improvement attained by removing the non-contributing IP computations starts to decrease when the number of
processes is higher than 16. A closer look at the performance of the linear algebra kernels
is presented in Fig. 6 for the case of 16 MPI-processes. Here, the per-process execution
times of SpMV, ILU and IP are displayed, with and without the non-contributing compu-
tations.
The top right plot of Fig. 6 displays the per-process numbers of interior and ghost cells. We notice a very uneven distribution of ghost cells among the processes. We also observe that the load imbalance induced by the ghost cells impacts the performance of the linear algebra operations, especially for SpMV.

Figure 6 Per-process time usage of linear algebra kernels on an Abel node, and distribution of ghost/interior cells when the number of MPI-processes is 16

Figure 7 Overall execution time in seconds (a) and total number of BiCGStab iterations (b) of the 2018.10 release of Flow and our improved implementation, when applied to the Norne model. The simulations were conducted on the Abel cluster, and when the number of MPI-processes exceeded 16 we used multiple computational nodes
The results displayed in Fig. 5 and Fig. 6 show the impact of ghost cells on the performance of key linear algebra operations, and the benefit of avoiding the non-contributing computations. To see the impact on the overall simulation time, we compare our improved implementation of Flow with the 2018.10 release of OPM's Flow, which is the latest release that does not contain any of the optimizations mentioned in this paper. The comparison results are displayed in Fig. 7, where the overall execution time and total number of BiCGStab iterations are presented.

In Fig. 7 we observe that our improved implementation of Flow achieves a significant improvement in both execution time and total iteration count. The latter is due to using the condensed local sub-matrix A_p of form Eq. (10), instead of the non-condensed version of A_p, in the ILU subdomain solver. This leads to better convergence of the Block-Jacobi preconditioner. Note that removing ghost-related non-contributing computations and improving convergence only impact the performance of the simulator's linear solver. The main computational work of the reservoir simulator also includes a system assembly part. Therefore, for example, we only achieve a speedup of around 3.5 in the P = 128 case, despite a 2.9 times smaller iteration count, and about 2 to 3 times faster SpMV and ILU operations.

Figure 8 Resulting subdomains when partitioning the Norne mesh into 8 parts using uniform, logarithmic and transmissibility edge weights. Each color represents one subdomain
5.3 Impact of different edge-weighting strategies
In this subsection we will study and compare the different edge-weighting strategies presented in Sect. 4. We are ultimately interested in how the uniform weights (Eq. (14)), transmissibility weights (Eq. (15)) or logarithmic transmissibility weights (Eq. (16)) impact the overall performance of the reservoir simulation, but we will also consider their impact on partitioning quality and numerical effectiveness. The strategies are enforced before passing the graph to the Zoltan graph partitioner inside Flow, and in our experiments we have used a 5% imbalance tolerance (ε = 1.05). All simulation results in this subsection are attained with the improved Flow, where we use the ghost-last procedures described in Sect. 3.

Figure 8 displays a visualization of a P = 8 partitioning of the Norne mesh for the different edge-weighting strategies. We observe that the subdomains generated by the logarithmic and transmissibility edge-weighted partitioning schemes are less clean-cut and more disconnected than the subdomains resulting from the uniform edge-weighted partitioning scheme.
5.3.1 Mesh partitioning quality
We start our tests by focusing on how the different edge-weighting strategies affect par-
titioning quality, when used to partition the original and refined Norne meshes. In Fig. 9
we report the total communication volume for the three partitioning schemes. The total
communication volume is equal to the sum of the DoFs associated with ghost cells over
all processes, 3pNG
p, multiplied with the data size of double precision floats, which
is 8 bytes. The partitioning results for the original Norne mesh are displayed in Fig. 9a,
whereas Fig. 9b is for the refined Norne mesh.
Figure 9 Total communication volume in bytes for the partitioned Norne mesh (a) and the refined Norne
mesh (b) when using different edge-weighting strategies
Figure 10 Communication overhead, i.e., time usage of DUNE's copyOwnerToAll function on the Abel (a) and Saga (b) clusters. Multiple computational nodes are used when the number of MPI-processes exceeded 16 on Abel (a) and 40 on Saga (b)
In the plots of Fig. 9 we observe that the partitions obtained using the transmissibility edge-weights yield significantly higher communication volume than the two other alternatives. This holds for both the original and the refined Norne models, and for all counts of
MPI-processes P. We also notice that although the uniform edge-weighting strategy out-
performs the logarithmic scheme, the differences in the resulting communication volume
are relatively small.
To precisely measure how the difference in communication volume affects the actual
communication overhead, we consider the execution time of the MPI data transfer operations that must be performed before a parallel SpMV. The data transfer operations are implemented in the DUNE function copyOwnerToAll. In Fig. 10 we present the execution
time of copyOwnerToAll related to the original and refined Norne meshes on the Abel
and Saga clusters, corresponding to the communication volumes shown in Fig. 9.
The plots in Fig. 10 demonstrate that using transmissibility edge-weights yields larger
communication overhead than the uniform and logarithmic alternatives on both the Abel
and Saga clusters. The relatively high execution time of copyOwnerToAll, resulting
from the transmissibility edge-weighted partitioning scheme, can partially be explained
by the communication volume displayed in Fig. 9. However, the copyOwnerToAll execution time is also affected by the hardware. For example, the jump in execution time between P = 16 and P = 32, observed in Fig. 10a, occurs when we start using two instead of one computational node, and there is thus inter-node communication over the network. A similar jump in execution time is not observed in Fig. 10b on the Saga cluster between P = 40 and P = 80, because the Saga interconnect is better than the Abel interconnect (100 Gbit/s vs. 56 Gbit/s).

Figure 11 Total number of BiCGStab iterations to run the Norne black-oil benchmark case for three different partitioning strategies and varying numbers of MPI-processes
5.3.2 Numerical and overall performance
We can find the impact of the edge-weighting schemes on Flow’s numerical performance
by looking at the total number of Block-Jacobi/ILU0 preconditioned BiCGStab iterations
needed to complete each simulation of the original Norne benchmark case. This iteration
count is displayed in Fig. 11 for the three edge-weighting strategies.
One interesting finding from Fig. 11 is that the transmissibility weighted partitioning
strategy can keep the number of BiCGStab iterations almost completely independent of the number of subdomains. It means that the transmissibility edge weights are indeed good
for the convergence of the parallel Block-Jacobi preconditioner. On the opposite side, the
uniform edge-weighting scheme leads to a large increase in the BiCGStab iterations, when
the number of subdomains is large. This is in contrast to its ability to keep the communication overhead low. The logarithmic transmissibility edge-weighting scheme seems a
good compromise, which is confirmed by Fig. 12 showing the total simulation time.
In Fig. 12 we observe that simulations using the logarithmic edge-weight strategy outperform simulations using the transmissibility and uniform edge-weights for all numbers
of processes. Although significant, the improvements achieved by logarithmic weights in
comparison to transmissibility weights are modest in absolute terms. However, the relative improvement in execution time increases with the number of processes involved. For 2
processes we observe a 4.7% reduction in simulation execution time. For 48 processes the
reduction is 24.5%. The BiCGStab iteration count required to complete the simulation of
the Norne case is significantly higher when using logarithmic and uniform edge-weights
instead of transmissibility edge-weights, especially when using more than 16 processes.
Despite higher iteration counts, logarithmic and uniform edge-weights yield equally good
or better performance than transmissibility edge-weights. There are two reasons for this.
First, as demonstrated in Fig. 10, using logarithmic and uniform edge-weights results in lower communication overhead, which gives a lower execution time per BiCGStab iteration. Second, the uniform and logarithmic edge-weights result in a lower number of
per-process ghost cells. This has a positive impact on system assembly performance.
Figure 12 Overall time usage of the parallel Flow reservoir simulator when applied to the Norne benchmark
case, measured on the Abel cluster. Beyond 16 MPI-processes, the simulations were conducted on multiple
computational nodes
Figure 13 Execution time (a) and iteration count (b) of the Flow reservoir simulator when applied to the
refined Norne benchmark case on the Saga cluster. Beyond 40 MPI-processes the simulations were
conducted on multiple computational nodes
We also notice little or no reduction in execution time beyond 48 processes for all edge-
weighting strategies. From P = 64 to P = 128 we even see an increase in simulation execution time. This is not unexpected, because at this point the mesh partitioning produces an average of only 44,420/128 ≈ 347 cells per process, which corresponds to around 1041 DoFs. When the number of DoFs per process reaches this point we are beyond the strong scaling limit,
so adding more hardware resources yields no benefit.
Performance results for the refined Norne model, measured on the Saga cluster, are
displayed in Fig. 13. Here, we have used P = 2, 4, 8, 10, 20, 40 MPI-processes on a single compute node, as well as P = 80, 120, 160, 200, 240 MPI-processes on two to six nodes.
Execution times are presented in Fig. 13a and BiCGStab iterations in Fig. 13b.
The results for the refined Norne model attained on the Saga cluster, displayed in Fig. 13, are similar to the Norne results on the Abel cluster. Transmissibility edge-weights yield better linear solver convergence than the logarithmic and uniform alternatives, but the overall performance improves when using logarithmic and uniform edge-weights. We observe that simulations run with logarithmic edge-weights had the lowest execution time for all numbers of MPI-processes except P = 2 and P = 8. The improvement over pure transmissibility edge-weights again appears quite modest. However, the benefit increases with the number of processes. For P = 2, 4 and 8, logarithmic transmissibility edge-weights yield a 1.3%, 9.1% and 3.6% reduction in execution time, while for P = 80, 120 and 160 the improvement is 40.1%, 29.5% and 47.6%, respectively.

In Fig. 13a we observe an expected diminishing parallel efficiency for an increasing number of MPI-processes. Simulation execution time increases for all edge-weight schemes between 200 and 240 processes. At P = 240 there are around 1500 cells and 4500 DoFs per process, and we have reached the strong scaling limit.
5.4 Comparing Flow with industry-standard simulators
The Norne model exhibits several features rarely found in academically available data sets. It is therefore interesting to compare the performance of the improved Flow simulator, when applied to Norne, with commercial alternatives. We consider two industry-standard
simulators, Eclipse 100 version 2018.2 and Intersect version 2019.1. For our experiments
Eclipse and Intersect were used in their default configurations. All simulations presented
in this subsection were done on a dual socket workstation with two Intel E5-2687W pro-
cessors with a total of 16 cores available. The system has 128 GB of memory, which was
sufficient for all simulations. The CPUs have a base frequency of 3.1 GHz and a turbo
frequency rating of 3.8 GHz. The operating system was Red Hat 6.
We should note that the three simulators have different numerics internally. Unfortu-
nately, only Flow has open source code, so we cannot investigate the implementation details of the other two. We refer the reader to the publicly available information on the nu-
merics of the two proprietary simulators. While mpirun is handled internally by Eclipse
and Intersect, we use mpirun directly for parallel simulations with Flow. The only other
command-line option used by mpirun is -map-by numa. This option is particularly important for runs with two processes, since it ensures that the two are distributed on dif-
ferent sockets, taking advantage of cache and memory bandwidth on both. Without the
option, the runtime may put both MPI processes on the same NUMA-node. Results from
the previous subsection have shown that the parallel Flow simulator works best with log-
arithmic transmissibility edge-weights for the Norne case. The Flow simulation results
presented in this subsection therefore use the logarithmic transmissibility edge-weighting
scheme.
The MPI implementation in Eclipse is simplistic. According to the documentation, it simply performs domain decomposition by dividing the mesh cells evenly along one axis dimension. Nevertheless, for some models (the Norne model is actually one of them) this works well.
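To make the idea concrete, the following is a minimal sketch of such a one-axis "slab" decomposition; it is not Eclipse's actual code, and the grid dimensions used are merely illustrative Norne-like values (assumptions).

```python
# Hedged illustration of a one-axis ("slab") domain decomposition:
# cells are split into nprocs contiguous slabs along the J (y) axis.
def slab_owner(j, ny, nprocs):
    """Return the rank that owns all cells with y-index j."""
    return min(j * nprocs // ny, nprocs - 1)

nx, ny, nz, nprocs = 46, 112, 22, 8   # assumption: illustrative Norne-like dimensions
cells_per_rank = [0] * nprocs
for j in range(ny):
    cells_per_rank[slab_owner(j, ny, nprocs)] += nx * nz
print(cells_per_rank)   # roughly equal cell counts per rank, ignoring inactive cells
```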
The comparison in parallel performance between Flow, Eclipse and Intersect on the Norne benchmark case is presented in Fig. 14. In this plot we also include results from simulations where multithreading is activated with two OpenMP threads per MPI-process for Flow and Intersect. Multithreading is not activated when 16 MPI-processes are used, because of a lack of remaining hardware resources.
The results presented in Fig. 14 show that the parallel Flow simulator outperforms Eclipse and Intersect for P ≥ 4. Although Eclipse is faster than Flow for serial runs, Flow scales better than Eclipse on the Norne model. Flow achieves a speedup of 7.6 for P = 16 compared to Eclipse's speedup of 4.4. Intersect performs rather poorly on Norne, but it scales better than Eclipse. Activating multithreading yields improved performance for both Flow and Intersect.
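Expressed as parallel efficiency, E(P) = S(P)/P with speedup S(P) = T(1)/T(P), the quoted speedups translate directly into the following figures; this is pure arithmetic on the numbers above, not an additional measurement.

```latex
% Parallel efficiency implied by the quoted speedups at P = 16.
\[
  E_{\mathrm{Flow}}(16) = \frac{7.6}{16} \approx 0.48, \qquad
  E_{\mathrm{Eclipse}}(16) = \frac{4.4}{16} \approx 0.28 .
\]
```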
Figure 14 Execution time (in seconds) of Flow, Eclipse and Intersect on the Norne benchmark case. We include results for Flow and Intersect simulations where multithreading with two OpenMP threads per MPI-process is activated
6 Related work
Because of the demand for large-scale reservoir simulations in the petroleum industry,
there exist several commercial and in-house simulators that are able to take advantage of
parallel computing platforms. Examples include the Saudi Aramco POWERS [19–21] and
GigaPOWERS [22] simulators, which are able to run simulations on reservoir grids with
billions of cells.
Graph partitioning is often used to enable parallel reservoir simulation [23–26]. However, no considerations specific to reservoir simulation were made in these cases. In [27] the authors suggest two guiding principles for achieving good load balancing when performing mesh partitioning for thermal reservoir simulation. First, grid cells and simulation wells should be evenly distributed between the processors. Second, if faults are present in the reservoir, they should serve as subdomain boundaries between the processes.
In the PhD thesis [28] a mesh partitioning scheme based on edge-weighted graph partitioning is described. The edge weights are formed based on the transmissibility on the interface of the cell blocks in the reservoir mesh. Additionally, the presence of wells in the reservoir is accounted for by modifying the partitioning graph. In [29] the authors derive similar strategies for partitioning non-uniform meshes in the context of computational fluid dynamics. A graph partitioning approach with edge-weights corresponding to the cell face area is implemented. The aim of this approach is to improve solver convergence, by accounting for the coefficient matrix heterogeneity introduced by the non-uniform mesh. The authors of [29] not only focus on how this edge-weighted approach can affect the numerical performance, but also consider measures of partitioning quality, such as edges cut and number of processor neighbors. Despite poorer partitioning quality, the edge-weighted graph partitioning scheme gave the best overall performance.
Several attempts at incorporating coefficient matrix information into the partitioning of the linear systems have been made [8, 9, 30–32]. In [8] the authors add coefficient-based edge weights to the partitioning graph, and demonstrate improved numerical
performance in comparison with standard non-weighted schemes. The previously mentioned paper [9] presents a spectral graph partitioning scheme that outperforms standard graph partitioners, even with weighted edges, for symmetric positive definite systems with heterogeneous coefficients.
7 Conclusion
In this paper, we have given a detailed description of the domain decomposition strategy used to parallelize the simulation of fluid flow in petroleum reservoirs. We proposed an improved parallel implementation of the linear solver based on a local “ghost last” reordering of the grid cells. We also investigated the use of edge-weighted graph partitioning for dividing the reservoir mesh. A new edge-weighting scheme was devised with the purpose of maintaining a balance between the numerical effectiveness of the Block-Jacobi preconditioner and the communication overhead.
Through experiments based on the Norne black-oil benchmark case, we showed that the ghost cells make up an increasing proportion of the cells in the decomposed sub-meshes when the number of processes increases. Further, we found that removing non-contributing calculations related to these ghost cells can give a significant improvement in the parallel simulator performance.
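The sketch below is a minimal illustration of the ghost-last idea, not the actual Flow/DUNE implementation: local cells are permuted so that owned cells precede ghost cells, after which kernels such as the sparse matrix-vector product only loop over the first n_owned rows, while the ghost rows are filled by communication instead of being recomputed.

```python
import numpy as np

# Hedged sketch of a "ghost last" local ordering and a truncated SpMV;
# illustrative only, not the implementation used in Flow.
def ghost_last_permutation(is_ghost):
    """Return a cell permutation placing owned cells first, ghost cells last."""
    owned = [c for c, g in enumerate(is_ghost) if not g]
    ghost = [c for c, g in enumerate(is_ghost) if g]
    return owned + ghost

def spmv_owned_rows(indptr, indices, data, x, n_owned):
    """Multiply only the owned rows of a CSR matrix; ghost rows are skipped."""
    y = np.zeros(n_owned)
    for row in range(n_owned):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

print(ghost_last_permutation([False, True, False, False, True]))  # -> [0, 2, 3, 1, 4]
```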
For the Norne black-oil benchmark case, which has extremely heterogeneous petrophysical properties, using edge-weights directly derived from these properties can have negative consequences for the overall performance. Although this approach yields satisfactory numerical effectiveness, it does not make up for the increase in the communication overhead. The large communication overhead associated with the default transmissibility-weighting scheme is due to poor partitioning quality, in particular with respect to the communication volume and number of messages. The scaled logarithmic transmissibility approach, which also uses non-uniform edge-weights, yields a much better partitioning quality. Even though the numerical effectiveness of the logarithmic scheme may be lower than that of the transmissibility-weighted scheme, it can still result in a better overall performance.
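For intuition, the sketch below contrasts the two weighting ideas. The exact scaling used in the paper is defined in its earlier sections; the constants and the log1p form here are assumptions chosen only to show how a logarithm compresses the orders-of-magnitude spread of transmissibilities into partitioner-friendly integer edge weights.

```python
import math

# Hedged sketch of two edge-weighting ideas for a graph partitioner that
# expects positive integer edge weights (e.g. METIS/Zoltan). Illustrative
# constants only; not the exact scheme used in the paper.
def transmissibility_weight(t_ij):
    # "pure" transmissibility weights: a few very large values dominate
    return max(1, round(t_ij))

def log_transmissibility_weight(t_ij, scale=100):
    # scaled logarithmic weights: compress the dynamic range so that
    # high-transmissibility connections are still kept inside subdomains
    # without ruining the edge-cut and communication-volume quality
    return max(1, round(scale * math.log1p(t_ij)))

for t in (1e-3, 1.0, 1e3, 1e6):
    print(f"T={t:g}: pure={transmissibility_weight(t)}, "
          f"log={log_transmissibility_weight(t)}")
```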
Acknowledgements
This research was conducted using the supercomputers Abel and Saga in Norway.
Funding
This research was funded by the SIRIUS Centre for Scalable Data Access.
Abbreviations
PDE, partial differential equation; ILU, incomplete LU; OPM, open porous media; MPI, Message Passing Interface; SpMV,
sparse matrix-vector multiplication; IP, inner product.
Availability of data and materials
Data and source codes are available upon reasonable request.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
All authors contributed to the writing of the manuscript. AT implemented the improved version of Flow and performed
most of the numerical experiments. XC and ABR provided research advice and guidance. ABR performed experiments
with industry-standard simulators. All authors read and approved the final manuscript.
Author details
1Simula Research Laboratory, Martin Linges vei 25, 1364 Fornebu, Norway. 2University of Oslo, Oslo, Norway. 3Equinor
Research Centre, Ranheim, Norway.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Received: 31 August 2020 Accepted: 18 June 2021
References
1. Chen Z, Huan G, Ma Y. Computational methods for multiphase flows in porous media. vol. 2. Philadelphia: SIAM; 2006.
2. Aziz K, Settari A. Petroleum reservoir simulation. London: Applied Science Publishers; 1979.
3. Peaceman DW et al. Interpretation of well-block pressures in numerical reservoir simulation (includes associated
paper 6988). Soc Pet Eng J. 1978;18(03):183–94.
4. Ponting DK. Corner point geometry in reservoir simulation. In: ECMOR I - 1st European conference on the
mathematics of oil recovery. 1989.
5. Lie K-A. An introduction to reservoir simulation using Matlab/GNU octave. Cambridge: Cambridge University Press;
2019.
6. Saad Y. Iterative methods for sparse linear systems. Philadelphia: SIAM; 2003.
7. Bastian P, Blatt M, Dedner A, Engwer C, Klöfkorn R, Kornhuber R, Ohlberger M, Sander O. A generic grid interface for
parallel and adaptive scientific computing. Part II: implementation and tests in DUNE. Computing.
2008;82(2–3):121–38.
8. Cullum JK, Johnson K, Tuma M. Effects of problem decomposition (partitioning) on the rate of convergence of
parallel numerical algorithms. Numer Linear Algebra Appl. 2003;10(5–6):445–65.
9. Vecharynski E, Saad Y, Sosonkina M. Graph partitioning using matrix values for preconditioning symmetric positive
definite systems. SIAM J Sci Comput. 2014;36(1):63–87.
10. Karypis G, Kumar V. A software package for partitioning unstructured graphs, partitioning meshes, and computing
fill-reducing orderings of sparse matrices. University of Minnesota, Department of Computer Science and
Engineering, Army HPC Research Center, Minneapolis, MN 38 (1998)
11. Pellegrini F, Roman J. Scotch: a software package for static mapping by dual recursive bipartitioning of process and
architecture graphs. In: International conference on high-performance computing and networking. Berlin: Springer;
1996. p. 493–8.
12. Boman E, Devine K, Fisk LA, Heaphy R, Hendrickson B, Vaughan C, Catalyurek U, Bozdag D, Mitchell W, Teresco J.
Zoltan 3.0: parallel partitioning, load-balancing, and data management services; user’s guide. Sandia National
Laboratories, Albuquerque, NM. 2007;2(3).
13. Rasmussen AF, Sandve TH, Bao K, Lauser A, Hove J, Skaflestad B, Klöfkorn R, Blatt M, Rustad AB, Sævareid O et al. The
open porous media flow reservoir simulator. Comput Math Appl. 2021;81:159–85.
14. Open Porous Media Initiative. http://opm-project.org (2017). Accessed 2017-07-26.
15. Norne reservoir simulation benchmark. https://github.com/OPM/opm-data (2019)
16. Refined Norne reservoir simulation deck.
https://github.com/andrthu/opm-data/tree/refined-222-norne/refined-norne (2020)
17. The Abel computer cluster. https://www.uio.no/english/services/it/research/hpc/abel/ (2019)
18. The Saga computer cluster. https://documentation.sigma2.no/quick/saga.html (2019)
19. Dogru A, Li K, Sunaidi H, Habiballah W, Fung L, Al-Zamil N, Shin D, McDonald A, Srivastava N. A massively parallel
reservoir simulator for large scale reservoir simulation. In: SPE symposium on reservoir simulation. 1999. p. 73–92.
20. Dogru AH, Sunaidi H, Fung L, Habiballah WA, Al-Zamel N, Li K et al. A parallel reservoir simulator for large-scale
reservoir simulation. SPE Reserv Eval Eng. 2002;5(1):11–23.
21. Al-Shaalan TM, Fung LS, Dogru AH et al. A scalable massively parallel dual-porosity dual-permeability simulator for
fractured reservoirs with super-k permeability. In: SPE annual technical conference and exhibition. Society of
Petroleum Engineers; 2003.
22. Dogru AH, Fung LSK, Middya U, Al-Shaalan T, Pita JA et al. A next-generation parallel reservoir simulator for giant
reservoirs. In: SPE reservoir simulation symposium. Society of Petroleum Engineers; 2009.
23. Elmroth E. On grid partitioning for a high-performance groundwater simulation software. In: Simulation and
visualization on the grid. Berlin: Springer; 2000. p. 221–34.
24. Wu Y-S, Zhang K, Ding C, Pruess K, Elmroth E, Bodvarsson G. An efficient parallel-computing method for modeling
nonisothermal multiphase flow and multicomponent transport in porous and fractured media. Adv Water Resour.
2002;25(3):243–61.
25. Guo X, Wang Y, Killough J. The application of static load balancers in parallel compositional reservoir simulation on
distributed memory system. J Nat Gas Sci Eng. 2016;28:447–60.
26. Maliassov S, Shuttleworth R et al. Partitioners for parallelizing reservoir simulations. In: SPE reservoir simulation
symposium. Society of Petroleum Engineers; 2009.
27. Ma Y, Chen Z. Parallel computation for reservoir thermal simulation of multicomponent and multiphase fluid flow. J
Comput Phys. 2004;201(1):224–37.
28. Zhong H. Development of a new parallel thermal reservoir simulator. PhD thesis. University of Calgary; 2016.
29. Wang M, Xu X, Ren X, Li C, Chen J, Yang X. Mesh partitioning using matrix value approximations for parallel
computational fluid dynamics simulations. Adv Mech Eng. 2017;9(11):1687814017734109.
30. Saad Y, Sosonkina M. Non-standard parallel solution strategies for distributed sparse linear systems. In: International
conference of the Austrian center for parallel computation. Berlin: Springer; 1999. p. 13–27.
31. Duff IS, Kaya K. Preconditioners based on strong subgraphs. Electron Trans Numer Anal. 2013;40:225–48.
32. Janna C, Castelletto N, Ferronato M. The effect of graph partitioning techniques on parallel block FSAI
preconditioning: a computational study. Numer Algorithms. 2015;68(4):813–36.