NUMA-Awareness as a Plug-In for an
Eventify-based Fast Multipole Method
Laura Morgenstern1,2, David Haensel1, Andreas Beckmann1, and Ivo Kabadshow1
1 Forschungszentrum Jülich, Jülich Supercomputing Centre, 52425 Jülich, Germany
2 Technische Universität Chemnitz, Faculty of Computer Science, 09111 Chemnitz, Germany
Abstract. Following the trend towards Exascale, today’s supercomput-
ers consist of increasingly complex and heterogeneous compute nodes.
To exploit the performance of these systems, research software in HPC
needs to keep up with the rapid development of hardware architectures.
Since manual tuning of software to each and every architecture is neither
sustainable nor viable, we aim to tackle this challenge through appro-
priate software design. In this article, we aim to improve the perfor-
mance and sustainability of FMSolvr, a parallel Fast Multipole Method
for Molecular Dynamics, by adapting it to Non-Uniform Memory Access
architectures in a portable and maintainable way. The parallelization
of FMSolvr is based on Eventify, an event-based tasking framework we
co-developed with FMSolvr. We describe a layered software architecture
that enables the separation of the Fast Multipole Method from its paral-
lelization. The focus of this article is on the development and analysis of
a reusable NUMA module that improves performance while keeping both
layers separated to preserve maintainability and extensibility. By means
of the NUMA module we introduce diverse NUMA-aware data distri-
bution, thread pinning and work stealing policies for FMSolvr. During
the performance analysis the modular design of the NUMA module was
advantageous since it facilitates combination, interchange and redesign
of the developed policies. The performance analysis reveals that the runtime of FMSolvr is reduced by 21% from 1.48 ms to 1.16 ms through the NUMA module.
Keywords: non-uniform memory access · multicore programming · software architecture · fast multipole method
1 Introduction

The trend towards higher clock rates stagnates and heralds the start of the exascale era. The resulting supply of higher core counts leads to the rise of
Non-Uniform Memory Access (NUMA) systems for reasons of scalability. This
requires not only the exploitation of ﬁne-grained parallelism, but also the han-
dling of hierarchical memory architectures in a sustainable and thus portable
way. We aim to tackle this challenge through suitable software design since
manual adjustment of research software to each and every architecture is neither
sustainable nor viable for reasons of development time and staﬀ expenses.
Our use case is FMSolvr, a parallel Fast Multipole Method (FMM) for Molec-
ular Dynamics (MD). The parallelization of FMSolvr is based on Eventify, a
tailor-made tasking library that allows for the description of ﬁne-grained task
graphs through events [7,8]. Eventify and FMSolvr are published as open source
under LGPL v2.1 and available at www.fmsolvr.org. In this article, we aim to
improve the performance and sustainability of FMSolvr by adapting it to NUMA
architectures through the following contributions:
1. A layered software design that separates the algorithm from its paralleliza-
tion and thus facilitates the development of new features and the support of
new hardware architectures.
2. A reusable NUMA module for Eventify that models hierarchical memory architectures in software and enables rapid development of algorithm-dependent NUMA policies.
3. Diverse NUMA-aware data distribution, thread pinning and work stealing
policies for FMSolvr based on the NUMA module.
4. A detailed comparative performance analysis of the developed NUMA poli-
cies for the FMM on two diﬀerent NUMA machines.
1.2 State of the Art
MD has become a vital research method in biochemistry, pharmacy and mate-
rials science. MD simulations target strong scaling since their problem size is
typically small. Thus, the computational eﬀort per compute node is very low
and MD applications tend to be latency- and synchronization-critical. To ex-
ploit the performance of NUMA systems, MD applications need to adapt to the
diﬀerences in latency and throughput caused by hierarchical memory.
In this work, we focus on the FMM with computational complexity O(N).
We consider the hierarchical structure of the FMM as a good ﬁt for hierarchical
memory architectures. We focus on the analysis of NUMA eﬀects on FMSolvr
as a demonstrator for latency- and synchronization-critical applications.
We aim at a NUMA module for FMSolvr since various research works [1, 3, 4]
prove the positive impact of NUMA awareness on performance and scalability of
the FMM. Subsequently, we summarize the efforts to support NUMA in current FMM implementations.
ScalFMM is a parallel C++ FMM library. The main objectives of its
software architecture are maintainability and understandability. A lot of research
about task-based and data-driven FMMs is based on ScalFMM. The authors
devise the parallel data-flow of the FMM for shared memory architectures in [3] and for distributed memory architectures in [2]. In ScalFMM, NUMA policies
can be set by the user via the OpenMP environment variables OMP_PROC_BIND
and OMP_PLACES. However, to the best of our knowledge, there is no performance
analysis regarding these policies for ScalFMM.
KIFMM [15] is a kernel-independent FMM. In [1], NUMA awareness is analyzed dependent on work-unit granularity and a speed-up of 4 on 48 cores is reported.
However, none of the considered works provides an implementation and com-
parison of diﬀerent NUMA-aware thread pinning and work stealing policies.
From our point of view, the implementation and comparison of diﬀerent policies
is of interest since NUMA awareness is dependent on the properties of the in-
put data set, the FMM operators and the hardware. Therefore, the focus of the
current work is to provide a software design that enables the rapid development
and testing of multiple NUMA policies for the FMM.
2.1 Sustainability

We follow the definition of sustainability provided in [11]:
Deﬁnition 1. Sustainability. A long-living software system is sustainable if
it can be cost-eﬃciently maintained and evolved over its entire life-cycle.
According to [11], a software system is moreover long-living if it must be operated for more than 15 years. Due to a relatively stable problem set, a large user
base and the great performance optimization eﬀorts, HPC software is typically
long-living. This holds e.g. for the molecular dynamics software GROMACS or
Coulomb-solvers such as ScaFaCos. Fortran FMSolvr (included in ScaFaCos),
the predecessor of C++ FMSolvr, is roughly 20 years old. Hence, sustainability
is our main concern regarding C++ FMSolvr.
According to [11], sustainability comprises non-functional requirements such
as maintainability, modiﬁability, portability and evolvability. We add perfor-
mance and performance portability to the properties of sustainability, to adjust
the term to software development in HPC.
2.2 Performance Portability
Regarding performance portability, we follow the definition provided in [14]:
Definition 2. Performance Portability. For a given set of platforms H, the performance portability Φ of an application a solving problem p is

    Φ(a, p, H) = |H| / Σ_{i∈H} (1 / e_i(a, p))   if i is supported ∀ i ∈ H, and 0 otherwise,

where e_i(a, p) is the performance efficiency of application a solving problem p on platform i.
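Definition 2 is a harmonic mean over the per-platform efficiencies. The following minimal sketch (our illustration, not code from the article; an unsupported platform is modeled as e_i = 0) makes this concrete:

```cpp
#include <vector>

// Performance portability per Definition 2: harmonic mean of the per-platform
// performance efficiencies e_i; an unsupported platform (e_i == 0 here)
// yields Phi = 0.
double performance_portability(const std::vector<double>& efficiencies) {
    double inverse_sum = 0.0;
    for (double e : efficiencies) {
        if (e <= 0.0) return 0.0;  // application unsupported on some platform i
        inverse_sum += 1.0 / e;
    }
    return static_cast<double>(efficiencies.size()) / inverse_sum;
}
```

For instance, two platforms with efficiencies 0.5 each give Φ = 0.5, while a single unsupported platform forces Φ = 0.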
2.3 Non-Uniform Memory Access
NUMA is a shared memory architecture for modern multi-processor and multi-
core systems. A NUMA system consists of several NUMA nodes. Each NUMA
node is a set of cores together with their local memory. NUMA nodes are con-
nected via a NUMA interconnect such as Intel’s Quick Path Interconnect (QPI)
or AMD’s HyperTransport.
The cores of each NUMA node can access the memory of remote NUMA
nodes only by traversing the NUMA interconnect. However, this exhibits no-
tably higher memory access latencies and lower bandwidths than accessing local
memory. According to , remote memory access latencies are about 50% higher
than local memory access latencies. Hence, memory-bound applications have to
take data locality into account to exploit the performance and scalability of
2.4 Fast Multipole Method
Sequential Algorithm. The fast multipole method for MD is a hierarchical
fast summation method (HFSM) for the evaluation of Coulombic interactions
in N-body simulations. The FMM computes the Coulomb force F_i acting on each particle i, the electrostatic field E and the Coulomb potential Φ in each
time step of the simulation. The FMM reduces the computational complexity of
classical Coulomb solvers from O(N2) to O(N) by use of hierarchical multipole
expansions for the computation of long-range interactions.
The algorithm starts out with a hierarchical space decomposition to group
particles. This is done by recursively bisecting the simulation box in each of its
three dimensions. We refer to the developing octree as FMM tree. The input data
set to create the FMM tree consists of location x_i and charge q_i of each particle i in the system as well as the algorithmic parameters multipole order p, maximal tree depth d_max and well-separateness criterion ws. All three algorithmic
parameters inﬂuence the time to solution and the precision of the results.
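The recursive bisection can be sketched as follows (our illustration, not FMSolvr's actual indexing): after d_max bisections there are 2^d_max boxes per dimension, so a particle at position x in [0,1)^3 falls into the lowest-level box with per-dimension coordinates floor(x_i * 2^d_max).

```cpp
#include <array>
#include <cstdint>

// Illustrative lowest-level box lookup for a particle in the unit box:
// d_max recursive bisections yield 2^d_max boxes per dimension.
std::array<std::uint32_t, 3> lowest_level_box(const std::array<double, 3>& x,
                                              unsigned d_max) {
    const double boxes_per_dim = static_cast<double>(1u << d_max);
    std::array<std::uint32_t, 3> coords{};
    for (int i = 0; i < 3; ++i)
        coords[i] = static_cast<std::uint32_t>(x[i] * boxes_per_dim);  // floor for x in [0,1)
    return coords;
}
```

With d_max = 3, as in the measurements in Section 5, this gives 8 boxes per dimension and thus 512 boxes on the lowest tree level.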
We subsequently introduce the relations between the boxes of the FMM tree, since the data dependencies between the steps of the algorithm are based on them:
– Parent-child relation: We refer to box x as the parent box of box y if x and y are directly connected when moving towards the root of the tree.
– Near neighbor: We refer to two boxes as near neighbors if they are at the same refinement level and share a boundary point.
– Interaction set: We refer to the interaction set of box i as the set consisting of those children of the near neighbors of i's parent box which are well separated from i.
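For boxes given as integer coordinates on the same refinement level, the relations above can be sketched as simple predicates (our illustration under the classical criterion; the exact form in FMSolvr may differ):

```cpp
#include <array>
#include <cstdlib>

// Two boxes on the same level share a boundary point iff their Chebyshev
// distance is at most 1 in box coordinates.
bool near_neighbors(const std::array<int, 3>& a, const std::array<int, 3>& b) {
    for (int i = 0; i < 3; ++i)
        if (std::abs(a[i] - b[i]) > 1)
            return false;   // too far apart to share a boundary point
    return true;
}

// With the well-separateness criterion ws, two boxes are well separated if
// they are more than ws boxes apart in some dimension (ws = 1 recovers the
// classical "not a near neighbor" criterion).
bool well_separated(const std::array<int, 3>& a, const std::array<int, 3>& b,
                    int ws) {
    for (int i = 0; i < 3; ++i)
        if (std::abs(a[i] - b[i]) > ws)
            return true;
    return false;
}
```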
Based on the FMM tree, the sequential workflow of the FMM is stated in Algorithm 1, with steps 1 to 5 computing far-field interactions and step 6 computing near-field interactions.
Algorithm 1 Fast Multipole Method
Input: Positions and charges of particles
Output: Electrostatic ﬁeld E, Coulomb forces F, Coulomb potential Φ
0. Create FMM Tree:
Hierarchical space decomposition
1. Particle to Multipole (P2M):
Expansion of the particles in each box on the lowest level of the FMM tree into multipole moments ω relative to the center of the box.
2. Multipole to Multipole (M2M):
Accumulative upwards-shift of the multipole moments ω to the centers of the parent boxes.
3. Multipole to Local (M2L):
Translation of the multipole moments ω of the boxes covered by the interaction set of box i into a local moment µ for i.
4. Local to Local (L2L):
Accumulative downwards-shift of the local moments µ to the centers of the child boxes.
5. Local to Particle (L2P):
Transformation of the local moment µ of each box i on the lowest level into the far-field force for each particle in i.
6. Particle to Particle (P2P):
Evaluation of the near-field forces between the particles contained in box i and its near neighbors, for each box on the lowest level, by computing the direct interactions.
FMSolvr: An Eventify-based FMM. FMSolvr is a task-parallel implemen-
tation of the FMM. According to the tasking approach for FMSolvr with Even-
tify [7, 8], the steps of the sequential algorithm do not have to be computed
completely sequentially, but may overlap. Based on the sequential workﬂow, we
diﬀerentiate six types of tasks (P2M, M2M, M2L, L2L, L2P and P2P) that span
a tree-structured task graph. Since HFSMs such as the FMM are based on a hi-
erarchical decomposition, they exhibit tree-structured, acyclic task graphs and
use tree-based data structures. Figure 1 provides a schematic overview of the
horizontal and vertical task dependencies of the FMM implied by this paral-
lelization scheme. For reasons of comprehensible illustration, the dependencies
are depicted for a binary tree and thus a one-dimensional system, even though
the FMM simulates three-dimensional systems and thus works with octrees. In
this work, we focus on the tree-based properties of the FMM since these are
decisive for its adaption to NUMA architectures. For further details on the func-
tioning of Eventify and its usage for FMSolvr please refer to [7,8].
3 Software Architecture
3.1 Layering: Separation of Algorithm and Parallelization
Figure 2 provides an excerpt of the software architecture of FMSolvr by means
of UML. FMSolvr is divided into two main layers: the algorithm layer and the
Fig. 1. Exemplary horizontal and vertical data dependencies that lead to inter-node data transfer.
Fig. 2. Layer-based software architecture of FMSolvr. The light gray layer shows an
excerpt of the UML-diagram of the algorithm layer that encapsulates the algorithmic
details of the FMM. The dark gray layer provides an excerpt of the UML-diagram of
the parallelization layer that encapsulates hardware and parallelization details. Both
layers are coupled by using the interfaces FMMHandle and TaskingHandle.
parallelization layer. The algorithm layer encapsulates the mathematical details
of the FMM, e. g. exchangeable FMM operators with diﬀerent time complexities
and memory footprints. The parallelization layer hides the details of parallel
hardware from the algorithm developer. It contains modules for threading and
vectorization and is designed to be extended with specialized modules, e. g. for
locking policy, NUMA-, and GPU-support. In this article, we focus on the de-
scription of the NUMA module. Hence, Figure 2 contains only that part of the
software architecture which is relevant to NUMA. We continuously work on fur-
ther decoupling of the parallelization layer as independent parallelization library,
namely Eventify, to increase its reusability and the reusability of its modules,
e. g. the NUMA module described in Section 4.
3.2 NUMA-Module: Modeling NUMA in Software
Figure 3 shows the software architecture of the NUMA module and its integration
in the software architecture of FMSolvr. Regarding the integration, we aim to
Fig. 3. Software architecture of the NUMA module and its connection to FMMHandle
preserve the layering and the corresponding separation of concerns as effectively
as possible. However, eﬀective NUMA-awareness requires information from the
data layout, which is part of the algorithm layer, and from the hardware, which
is part of the parallelization layer. Hence, a slight blurring of both layers is
unfortunately unavoidable in this case. After all, we preserve modularization by
providing an interface for the algorithm layer, namely NumaAllocator, and an
interface for the parallelization layer, namely NumaModule.
Based on the hardware information provided by the NUMA distance table,
we model the NUMA architecture of the compute node in software. Hence, a
compute node is a Resource that consists of NumaNodes, which consist of Cores,
which in turn consist of ProcessingUnits. To reuse the NUMA module for other
parallel applications, developers are required to redevelop or adjust the NUMA
policies (see Section 4) contained in class NumaModule only.
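The Resource hierarchy described above can be sketched as a simple composition of types (our illustration: the type names mirror the classes in Figure 3, but all members are hypothetical, not FMSolvr's API):

```cpp
#include <cstddef>
#include <vector>

// Hedged sketch of the NUMA model: a compute node (Resource) consists of
// NumaNodes, which consist of Cores, which consist of ProcessingUnits.
struct ProcessingUnit { int os_id = 0; };
struct Core           { std::vector<ProcessingUnit> pus; };
struct NumaNode       { std::vector<Core> cores; };

struct Resource {                                // one compute node
    std::vector<NumaNode> nodes;
    std::vector<std::vector<int>> distance;      // NUMA distance table

    std::size_t num_cores() const {
        std::size_t n = 0;
        for (const NumaNode& node : nodes)
            n += node.cores.size();
        return n;
    }
};
```

A 2-NUMA-node system with 12 cores per node, like the JURECA node used in Section 5, would then be a Resource with two NumaNodes of 12 Cores each.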
4 Applying the NUMA-Module
4.1 Data Distribution
As described in Section 2.4, the FMM exhibits a tree-structured task graph. As
shown in Figure 1, this task graph and its dedicated data are distributed to
NUMA nodes through an equal partitioning of the tasks on each tree level. To
assure data locality, a thread and the data it works on are assigned to the same
NUMA node. Even though this is an important step to improve data locality,
there are still task dependencies that lead to data transfer between NUMA nodes.
In order to reduce this inter-node data transfer, we present different thread pinning and work stealing policies in the following sections.
4.2 Thread Pinning Policies
Algorithm 2 Equal Pinning
r = t mod n
for i = 0; i < n; i++ do
    if i < r then
        // number of threads per node for nodes 0, . . . , r − 1
        tpn = ⌊t/n⌋ + 1
        Assign threads i · tpn, . . . , (i · tpn) + tpn − 1 to node i
    else
        // number of threads per node for nodes r, . . . , n − 1
        tpn = ⌊t/n⌋
        Assign threads i · tpn + r, . . . , (i · tpn + r) + tpn − 1 to node i
Equal Pinning. With Equal Pinning (EP), we pursue a classical load balancing
approach (cf. Scatter Principally). This means that the threads are equally
distributed among the NUMA nodes. Hence, this policy is suitable for NUMA
systems with homogeneous NUMA nodes only.
Algorithm 2 shows the pseudocode for NUMA-aware thread pinning via pol-
icy EP. Let t be the number of threads, n be the number of NUMA nodes and tpn be the number of threads per NUMA node.
The determined number of threads is mapped to the cores of each NUMA
node in an ascending order of physical core-ids. This means that threads with
successive logical ids are pinned to neighboring cores. This is reasonable with re-
gard to architectures where neighboring cores share a cache-module. In addition,
strict pinning to cores serves the avoidance of side-eﬀects due to the behavior
of the process scheduler, which may vary dependent on the system's state and
complicate an analysis of NUMA-eﬀects.
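Algorithm 2 can be sketched as a function that maps each thread id to its NUMA node (our illustration, not Eventify's implementation):

```cpp
#include <vector>

// Equal Pinning (Algorithm 2): distribute t threads as evenly as possible
// over n NUMA nodes; the first r = t mod n nodes receive one extra thread.
// Returns the NUMA node id for each thread id.
std::vector<int> equal_pinning(int t, int n) {
    std::vector<int> node_of_thread(t);
    const int base = t / n;
    const int r    = t % n;
    int thread = 0;
    for (int i = 0; i < n; ++i) {
        const int tpn = base + (i < r ? 1 : 0);  // threads per node for node i
        for (int j = 0; j < tpn && thread < t; ++j)
            node_of_thread[thread++] = i;        // consecutive ids stay neighbors
    }
    return node_of_thread;
}
```

For t = 5 threads on n = 2 nodes, node 0 receives threads 0 to 2 and node 1 receives threads 3 and 4.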
Compact Pinning. The thread pinning policy Compact Pinning (CP) com-
bines the advantages of the NUMA-aware thread pinning policies Equal Pinning
and Compact Ideally. The aim of CP is to avoid data transfer via the
NUMA interconnect by using as few NUMA nodes as possible while avoiding
the use of SMT.
Algorithm 3 shows the pseudocode for the NUMA-aware thread pinning policy CP. Let c be the total number of cores of the NUMA system and cn be the number of cores per NUMA node excluding SMT-threads. With CP, threads are assigned to a single NUMA node as long as that node has cores to which no thread is assigned. Only when a thread is assigned to each core of a NUMA node is the next NUMA node filled up with threads. This means especially that data transfer via the NUMA interconnect is fully avoided if t ≤ cn holds. If t ≥ c holds, thread pinning policy EP becomes effective to reduce the usage of SMT. For this policy we apply strict pinning based on the neighboring-cores principle as described in Section 4.2, too.
CP is tailor-made for the FMM as well as for tree-structured task graphs
with horizontal and vertical task dependencies in general. CP aims to keep the
Algorithm 3 Compact Pinning
if t < c then
    for i = 0; i < n; i++ do
        Assign threads i · cn, . . . , (i · cn) + cn − 1 to node i
else
    Use policy Equal Pinning
vertical cut through the task graph as short as possible. Hence, as few as possible
task dependencies require data transfer via the NUMA interconnect.
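Analogously to EP, Algorithm 3 can be sketched as a thread-to-node mapping (our illustration; the Equal-Pinning fallback is simplified to round-robin here):

```cpp
#include <vector>

// Compact Pinning (Algorithm 3): with t < c = n * cn threads, fill the cn
// cores (SMT excluded) of one NUMA node completely before using the next
// node; otherwise fall back to an Equal-Pinning-style distribution.
std::vector<int> compact_pinning(int t, int n, int cn) {
    std::vector<int> node_of_thread(t);
    const int c = n * cn;
    for (int i = 0; i < t; ++i) {
        if (t < c)
            node_of_thread[i] = i / cn;  // node i is filled before node i+1
        else
            node_of_thread[i] = i % n;   // simplified Equal-Pinning fallback
    }
    return node_of_thread;
}
```

On a node like the 2-NUMA-node system (cn = 12), all threads stay on node 0 for t ≤ 12, so no data crosses the NUMA interconnect.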
4.3 Work Stealing Policies
Local and Remote Node. Local and Remote Node (LR) is the default work stealing policy; it does not consider the NUMA architecture of the system at all.
Prefer Local Node. Applying the NUMA-aware work stealing policy Prefer Local Node (PL) means that threads preferably steal tasks from threads located on the same NUMA node. However, threads are allowed to steal tasks from threads located on remote NUMA nodes if no tasks are available on the local node.
Local Node Only. If the NUMA-aware work stealing policy Local Node Only
(LO) is applied, threads are allowed to steal tasks from threads that are located
on the same NUMA node only. We would expect this policy
to improve performance if stealing across NUMA nodes is more expensive than
idling. This means that stealing a task, e. g. transferring its data, takes too long
in comparison to the execution time of the task.
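The three policies differ only in how a steal victim is selected, which the following hedged sketch illustrates (our simplification; Eventify's actual scheduler is more involved):

```cpp
#include <optional>
#include <vector>

enum class StealPolicy { LR, PL, LO };  // the three policies described above

// Given each thread's NUMA node and which threads currently hold stealable
// tasks, pick a victim for the thief according to the policy, or nothing
// (i.e. idle) if the policy forbids the available victims.
std::optional<int> pick_victim(int thief,
                               const std::vector<int>& node_of_thread,
                               const std::vector<bool>& has_tasks,
                               StealPolicy policy) {
    const int my_node = node_of_thread[thief];
    std::optional<int> remote;                      // first remote candidate seen
    for (int v = 0; v < static_cast<int>(has_tasks.size()); ++v) {
        if (v == thief || !has_tasks[v]) continue;
        if (policy == StealPolicy::LR) return v;    // NUMA-oblivious: any victim
        if (node_of_thread[v] == my_node) return v; // PL/LO: local victim wins
        if (!remote) remote = v;
    }
    if (policy == StealPolicy::PL) return remote;   // fall back to a remote node
    return std::nullopt;                            // LO: idle instead of remote steal
}
```

With two threads per node and tasks available only on the remote node, PL steals remotely while LO chooses to idle.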
5 Performance Analysis

5.1 Measurement Approach
The runtime measurements for the performance analysis were conducted on a
2-NUMA-node system and a 4-NUMA-node system. The 2-NUMA-node system
is a single compute node of JURECA. The dual-socket system is equipped with
two Intel Xeon E5-2680 v3 CPUs (Haswell) which are connected via QPI. Each
CPU consists of 12 two-way SMT-cores, meaning, 24 processing units. Hence,
the system provides 48 processing units overall. Each core owns an L1 data and
instruction cache with 32 kB each. Furthermore, it owns an L2 cache with 256
kB. Hence, each two processing units share an L1 and an L2 cache. L3 cache
and main memory are shared between all cores of a CPU.
The 4-NUMA-node system is a quad-socket system equipped with four Intel
Xeon E7-4830 v4 CPUs (Haswell) which are connected via QPI. Each CPU
10 L. Morgenstern et al.
consists of 14 two-way SMT-cores, meaning, 28 SMT-threads. Hence, the system
provides 112 SMT-threads overall. Each core owns an L1 data and instruction
cache with 32 kB each. Furthermore, it owns an L2 cache with 256 kB. L3 cache
and main memory are shared between all cores of a CPU.
During the measurements Intel’s Turbo Boost was disabled. Turbo Boost
is a hardware feature that accelerates applications by varying clock frequencies
dependent on the number of active cores and the workload. Even though Turbo
Boost is a valuable, runtime-saving feature in production runs, it distorts scaling
measurements by boosting the sequential run through a higher clock frequency.
The runtime measurements are performed with high_resolution_clock from std::chrono. FMSolvr was executed 1000× for each measuring point, with each
execution covering the whole workﬂow of the FMM for a single time step of the
simulation. Afterwards, the 75%-quantile of these measuring points was com-
puted. This means that 75% of the measured runtimes were below the plotted
value. This procedure leads to stable timing results since the inﬂuence of clock
frequency variations during the starting phase of the measurements is eliminated.
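The measurement procedure can be sketched as follows (our illustration of the described methodology, not the article's harness):

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Execute a kernel `runs` times, time each run with std::chrono's
// high_resolution_clock, and report the q-quantile, i.e. the runtime below
// which a fraction q of all measured runtimes falls.
template <class Kernel>
double quantile_runtime_ms(Kernel&& kernel, int runs = 1000, double q = 0.75) {
    std::vector<double> ms(runs);
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::high_resolution_clock::now();
        kernel();                                   // e.g. one full FMM time step
        const auto stop = std::chrono::high_resolution_clock::now();
        ms[i] = std::chrono::duration<double, std::milli>(stop - start).count();
    }
    std::sort(ms.begin(), ms.end());
    return ms[static_cast<std::size_t>(q * (runs - 1))];
}
```

Taking the 75%-quantile instead of the minimum or mean damps the warm-up effect of clock-frequency variations during the starting phase, as described above.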
As input data set a homogeneous particle ensemble with only a thousand
particles is used. The values of positions and charges of the particles are deﬁned
by random numbers in the range [0, 1). The measurements are performed with multipole order p = 0 and tree depth d = 3. Due to the small input data set and the
chosen FMM parameters, the computational eﬀort is very low. With this setup,
FMSolvr tends to be latency- and synchronization-critical. Hence, this setup is
most suitable to analyze the inﬂuence of NUMA-eﬀects on applications that aim
for strong scaling and small computational eﬀort per compute node.
NUMA-aware Thread Pinning We consider the runtime plot of FMSolvr
in Figure 4 (top) to analyze the impact of the NUMA-aware thread pinning
policies EP and CP on the 2-NUMA-node system without applying a NUMA-
aware work stealing. It can be seen that both policies lead to an increase in
runtime in comparison to the non-NUMA-aware base implementation for the
vast majority of cases. The most considerable runtime improvement occurs for
pinning policy CP at #Threads = 12 with a speed-up of 1.6. The reason for this
is that data transfer via the NUMA interconnect is fully avoided by CP since all
threads ﬁt on a single NUMA node for #Threads ≤12. Nevertheless, the best
runtime is not reached for 12, but for 47 threads with policy CP. Hence, the
practically relevant speed-up in comparison to the best runtime with the base
implementation is 1.19.
Figure 4 (bottom) shows the runtime plot of FMSolvr with the thread pinning
policies EP and CP on the 4-NUMA-node system without applying NUMA-
aware work stealing. Here, too, the most considerable runtime improvement is
reached for pinning policy CP if all cores of a single NUMA node are in use.
Accordingly, the highest speed-up on the 4-NUMA-node system is reached at
#Threads = 14 with a value of 2.1. In this case, none of the NUMA-aware
implementations of FMSolvr outperforms the base implementation. Even though
the best runtime is reached by the base implementation at #Threads = 97 with 1.77 ms, we get close to this runtime with 1.83 ms using only 14 threads. Hence,
we can save compute resources and energy by applying CP.
Fig. 4. Comparison of NUMA-aware thread pinning policies EP and CP with work
stealing policy LR on 2-NUMA-node and 4-NUMA-node system.
NUMA-aware Work Stealing Figure 5 (top) shows the runtime of FMSolvr
for CP in combination with the work stealing policies LR, PL and LO on the 2-
NUMA-node system dependent on the number of threads. The minimal runtime
of FMSolvr is reached with 1.16 ms for 48 threads if CP is applied in combination
with LO. The practically relevant runtime reduction in comparison with the minimal runtime of the base implementation is 21%.
Figure 5 (bottom) shows the runtime plot of FMSolvr on the 4-NUMA-node
for CP in combination with the work stealing policies LR, PL and LO. As already
observed for the NUMA-aware thread pinning policies in Section 5.2, none of
the implemented NUMA-awareness policies outperforms the minimal runtime of
the base implementation with 1.76 ms. Hence, supplementing the NUMA-aware
thread pinning policies with NUMA-aware work stealing policies is not suﬃcient
and there is still room for the improvements described in Section 7.
Fig. 5. Comparison of NUMA-aware work stealing policies LR, PL and LO based on
NUMA-aware thread pinning policy CP on a 2-NUMA-node and 4-NUMA-node system
5.3 Performance Portability
We evaluate performance portability based on Deﬁnition 2 with two diﬀerent
performance efficiency metrics e: architectural efficiency for Φ_arch and strong scaling efficiency for Φ_scal.
Assumptions regarding Φ_arch: The theoretical double precision peak performance P_2NN of the 2-NUMA-node system is 960 GFLOPS (2 sockets with 480 GFLOPS [9] each), while the theoretical peak performance P_4NN of the 4-NUMA-node system is 1792 GFLOPS (4 sockets with 448 GFLOPS [9] each). With the considered input data set, FMSolvr executes 2.4 million floating point operations per time step.
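Under these assumptions, the per-platform architectural efficiency behind Φ_arch is simply the achieved floating point rate divided by the peak rate (our arithmetic sketch; the article does not show this computation):

```cpp
// Architectural efficiency e_i(a, p): achieved floating point rate divided
// by the theoretical peak rate of platform i.
double arch_efficiency(double flop_per_step, double runtime_ms,
                       double peak_gflops) {
    const double achieved_gflops = flop_per_step / (runtime_ms * 1e-3) / 1e9;
    return achieved_gflops / peak_gflops;          // fraction of peak
}
```

For example, 2.4 MFLOP in 1.16 ms on the 960 GFLOPS system yields roughly 0.22% of peak; Φ_arch in Table 1 aggregates such per-platform efficiencies over both systems via Definition 2.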
Assumptions regarding Φ_scal: For each platform and application, the strong scaling efficiency is determined at the lowest runtime and the corresponding number of threads for which this runtime is reached.
As can be seen from Table 1, the determined performance portability varies greatly depending on the applied performance efficiency metric. However, the
NUMA Plug-In increases performance portability in both cases.
6 Threats to Validity
Even though we aimed at a well-defined description of our main research objective, a sustainable support of NUMA architectures for FMSolvr, the evaluation of this objective is subject to several limitations.
Table 1. Performance portabilities Φ_arch and Φ_scal for Eventify FMSolvr with and without NUMA Plug-In.

                               Φ_arch  Φ_scal
FMSolvr without NUMA Plug-In    0.10%  11.22%
FMSolvr with NUMA Plug-In       0.11%  30.41%
We did not fully prove that FMSolvr and the presented NUMA module are
sustainable since we analyzed only performance and performance portability in
a quantitative way. In extending this work, we should quantitatively analyze
maintainability, modiﬁability and evolvability as the remaining properties of
sustainability according to Deﬁnition 1.
Our results regarding the evaluation of the NUMA module are not gener-
alizable to its use in other applications or on other NUMA architectures since
we so far tested it for FMSolvr on a limited set of platforms only. Hence, the
NUMA module should be applied and evaluated within further applications and
on further platforms.
The chosen input data set is deliberately very small and exhibits a very low number of FLOPs, in order to analyze the parallelization overhead of Eventify and the occurring NUMA effects. For lack of a performance metric for latency- and synchronization-
critical applications, we applied double precision peak performance as reference
value to preserve comparability with compute-bound inputs and applications.
However, Eventify FMSolvr is not driven by FLOPs, e.g. it does not yet make
explicit use of vectorization. In extending this work, we should reconsider the
performance eﬃciency metrics applied to evaluate performance portability.
7 Conclusion and Future Work
Based on the properties of NUMA systems and the FMM, we described NUMA-aware data distribution and work stealing policies with respect to [13]. Furthermore, we presented the NUMA-aware thread pinning policy CP based on the performance analysis provided in [13].
We found that the minimal runtime of FMSolvr is reached on the 2-NUMA-
node system when thread pinning policy CP is applied in combination with work
stealing policy LO. The minimal runtime is then 1.16 ms and is reached for 48
threads. However, none of the described NUMA-awareness policies outperforms
the non-NUMA-aware base implementation on the 4-NUMA-node system. This
is unexpected and needs further investigation since the performance analysis on
another NUMA-node system provided in [13] revealed that NUMA-awareness
leads to increasing speed-ups with an increasing number of NUMA nodes. Nevertheless, we can save compute resources and energy by applying CP since the policy leads to a runtime close to the minimal one with considerably fewer cores.
Next up on our agenda is the implementation of a NUMA-aware pool al-
locator, the determination of more accurate NUMA distance information and
the implementation of a more balanced task graph partitioning. In this article,
suitable software design paid oﬀ regarding development time and application
performance. Hence, we aim at further decoupling of the parallelization layer
Eventify and its modules to be reusable by other research software engineers.
References

1. AbdulJabbar, M.A., Al Farhan, M., Yokota, R., Keyes, D.E.: Performance evalu-
ation of computation and communication kernels of the fast multipole method on
Intel manycore architecture (2017). https://doi.org/10.1007/978-3-319-64203-1_40
2. Agullo, E., Bramas, B., Coulaud, O., Khannouz, M., Stanisic, L.: Task-based fast
multipole method for clusters of multicore processors. Research Report RR-8970,
Inria Bordeaux Sud-Ouest (Mar 2017), https://hal.inria.fr/hal-01387482
3. Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., Takahashi, T.: Task-based FMM for multicore architectures. SIAM Journal on Scientific Computing
36(1), C66–C93 (2014). https://doi.org/10.1137/130915662
4. Amer, A., Matsuoka, S., Pericàs, M., Maruyama, N., Taura, K., Yokota, R., Balaji, P.: Scaling FMM with data-driven OpenMP tasks on multicore architectures. In: International Workshop on OpenMP (IWOMP) (2016)
5. Beatson, R., Greengard, L.: A short course on fast multipole methods. Wavelets,
multilevel methods and elliptic PDEs 1, 1–37 (1997)
6. Greengard, L., Rokhlin, V.: A fast algorithm for particle simulations. J. Comput.
Phys. 73(2), 325–348 (Dec 1987). https://doi.org/10.1016/0021-9991(87)90140-9
7. Haensel, D.: A C++-based MPI-enabled Tasking Framework to Efficiently Parallelize Fast Multipole Methods for Molecular Dynamics. Ph.D. thesis, TU Dresden (2018)
8. Haensel, D., Morgenstern, L., Beckmann, A., Kabadshow, I., Dachsel, H.: Eventify:
Event-Based Task Parallelism for Strong Scaling. Accepted at PASC (2020)
9. Intel: APP Metrics for Intel Microprocessors (2020)
10. Kabadshow, I.: Periodic Boundary Conditions and the Error-Controlled Fast Multipole Method. Ph.D. thesis, Bergische Universität Wuppertal (2012)
11. Koziolek, H.: Sustainability evaluation of software architectures: A system-
atic review. In: Proceedings of the Joint ACM SIGSOFT Conference (2011).
12. Lameter, C.: NUMA (Non-Uniform Memory Access): An Overview. Queue 11(7),
40:40–40:51 (Jul 2013). https://doi.org/10.1145/2508834.2513149
13. Morgenstern, L.: A NUMA-Aware Task-Based Load-Balancing Scheme for the Fast
Multipole Method. Master’s thesis, TU Chemnitz (2017)
14. Pennycook, S.J., Sewall, J.D., Lee, V.W.: A metric for performance portability.
CoRR (2016), http://arxiv.org/abs/1611.07409
15. Ying, L., Biros, G., Zorin, D.: A kernel-independent adaptive fast multipole al-
gorithm in two and three dimensions. Journal of Computational Physics (2004).