from book A Bucket Sort Algorithm for the Particle-In-Cell Method on Manycore Architectures (pp.356-365)
FEniCS-HPC: Automated Predictive High-Performance Finite Element Computing with Applications in Aerodynamics
Developing multiphysics finite element methods (FEM) and scalable HPC implementations can be very challenging in terms of software complexity and performance, even more so with the addition of goal-oriented adaptive mesh refinement. To manage the complexity we in this work present general adaptive stabilized methods with automated implementation in the FEniCS-HPC automated open source software framework. This allows taking the weak form of a partial differential equation (PDE) as input in near-mathematical notation and automatically generating the low-level implementation source code and auxiliary equations and quantities necessary for the adaptivity. We demonstrate new optimal strong scaling results for the whole adaptive framework applied to turbulent flow on massively parallel architectures down to 25000 vertices per core with ca. 5000 cores with the MPI-based PETSc backend and for assembly down to 500 vertices per core with ca. 20000 cores with the PGAS-based JANPACK backend. As a demonstration of the power of the combination of the scalability together with the adaptive methodology allowing prediction of gross quantities in turbulent flow we present an application in aerodynamics of a full DLR-F11 aircraft in connection with the HiLift-PW2 benchmarking workshop with good match to experiments.
Johan Hoffman1,2,3, Johan Jansson2,1,4, and Niclas Jansson1,5
1 Computational Technology Laboratory, School of Computer Science and Communication, KTH, Stockholm, Sweden
2 BCAM - Basque Center for Applied Mathematics, Bilbao, Spain
Keywords: FEM, adaptive, turbulence
As computational methods are applied to simulate ever more advanced problems of coupled physical processes, and supercomputing hardware develops towards massively parallel heterogeneous systems, it is a major challenge to manage the complexity and performance of methods, algorithms and software implementations. Adaptive methods based on quantitative error control pose additional challenges. For simulation based on partial differential equation (PDE) models, the finite element method (FEM) offers a general approach to numerical discretisation, which opens the way for automation of algorithms and software implementation.
In this paper we present the FEniCS-HPC open source software framework, whose goal is to combine the generality of FEM with performance through optimisation of generic algorithms [4, 2, 13]. We demonstrate the performance of FEniCS-HPC in an application to subsonic aerodynamics.
We give an overview of the methodology and the FEniCS-HPC framework. Key aspects of the framework include:
1. Automated discretization, where the weak form of a PDE in mathematical notation is translated into a system of algebraic equations using code generation.
2. Automated error control, which ensures that the discretization error e = u - U in a given quantity is smaller than a given tolerance by adaptive mesh refinement based on duality-based a posteriori error estimates. An a posteriori error estimate and error indicators are automatically generated from the weak form of the PDE, by directly using the error representation.
3. Automated modeling, which includes a residual-based implicit turbulence model, where the turbulent dissipation comes only from the numerical stabilization, as well as treating the fluid and solid in fluid-structure interaction (FSI) as one continuum with a phase indicator function tracked by a moving mesh and implicitly modeling contact.
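The error-control loop of item 2 can be illustrated by its marking step. The following is a minimal sketch and not the FEniCS-HPC implementation: it assumes the per-cell error indicators are already available as a list of numbers, and the function name `mark_cells` as well as the fixed-fraction marking strategy are hypothetical choices made for illustration.

```python
def mark_cells(indicators, tolerance, fraction=0.1):
    """Return indices of cells to refine, or [] if the estimate meets the tolerance.

    indicators: per-cell error indicators (assumed to be given).
    fraction:   share of cells with the largest indicators to mark
                (an assumed strategy, not taken from the paper).
    """
    # The error estimate is the sum of the local contributions.
    estimate = abs(sum(indicators))
    if estimate < tolerance:
        return []
    n_mark = max(1, int(fraction * len(indicators)))
    ranked = sorted(range(len(indicators)),
                    key=lambda i: abs(indicators[i]), reverse=True)
    return sorted(ranked[:n_mark])
```

In an adaptive loop this function would be called once per iteration, after the primal and adjoint solves, and the loop would stop as soon as it returns an empty list.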
We demonstrate new optimal strong scaling results for the whole adaptive
framework applied to turbulent ﬂow on massively parallel architectures down to
25000 vertices per core with ca. 5000 cores with the MPI-based PETSc backend
and for assembly down to 500 vertices per core with ca. 20000 cores with the
PGAS-based JANPACK backend. We also present an application in aerodynamics
of a full DLR-F11 aircraft in connection with the HiLift-PW2 benchmarking
workshop with good match to experiments.
1.1 The FEniCS project and state of the art
The software described here is part of the FEniCS project [2], with the goal to automate the scientific software process by relying on general implementations and code generation, for robustness and to enable high speed of software development.
Deal.II [1] is a software framework with a similar goal, implementing solvers for general PDE based on FEM in C++, where users write the “numerical integration loop” for weak forms for computing the linear systems. The framework runs on supercomputers with optimal strong scaling. Deal.II is based on quadrilateral (2D) and hexahedral (3D) meshes, whereas FEniCS is based on simplicial meshes (triangles in 2D and tetrahedra in 3D).
Another FEM software framework with a similar goal is FreeFEM++ [3], which has a high-level syntax close to mathematical notation, and has demonstrated optimal strong scaling up to ca. 100 cores.
2 The FEniCS-HPC framework
FEniCS-HPC is a problem-solving environment (PSE) for automated solution of PDE by the FEM with a high-level interface for the basic concepts of FEM: weak forms, meshes, refinement and sparse linear algebra, with HPC concepts such as partitioning and load balancing abstracted away.
The framework is based on components with clearly defined responsibilities. A compact description of the main components follows, with their dependencies shown in the dependency diagram in Figure 1:
Automated generation of finite element spaces $V$ and basis functions on the reference cell and numerical integration with FInite element Automated Tabulator (FIAT) [13, 12]:
\[ e = (K, V, L) \]
where $K$ is a cell in a mesh, $V$ is a finite-dimensional function space, and $L$ is a set of degrees of freedom.
Automated evaluation of weak forms in mathematical notation on one cell based on code generation with Unified Form Language (UFL) and FEniCS Form Compiler (FFC) [13, 11], using the basis functions from FIAT. For example, in the case of the Laplacian operator
\[ A^K_{ij} = a_K(\phi_i, \phi_j) = \int_K \nabla\phi_i \cdot \nabla\phi_j \, dx = \int_K \dots \]
where $A^K$ is the element stiffness matrix and $r(\cdot, \cdot)$ is the weak residual.
Automated high performance assembly of weak forms and interface to linear algebra of discrete systems and mesh refinement on a distributed mesh $\mathcal{T}_\Omega$:
for all cells $K \in \mathcal{T}_\Omega$
Automated Unified Continuum modeling with Unicorn, choosing a specific weak residual form for incompressible balance equations of mass and momentum, with example visualizations of aircraft simulation below left and turbulent FSI in vocal folds below right:
\[ r_{UC}((v, q), (u, p)) = (v, \rho(\partial_t u + (u \cdot \nabla)u) + \nabla \cdot \sigma - g) + (q, \nabla \cdot u) + LS((v, q), (u, p)) \]
where LS is a least-squares stabilizing term described in .
Fig. 1: FEniCS-HPC component dependency diagram.
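To make the element stiffness matrix of the Laplacian and the loop over all cells concrete, here is a small serial sketch in plain Python. It is written by hand for linear (P1) triangles and is only an illustration of what the generated code and the assembler compute; the function names `element_stiffness` and `assemble` are hypothetical and are not part of FIAT, FFC or DOLFIN-HPC.

```python
def element_stiffness(p0, p1, p2):
    """Element matrix A^K_ij = integral over K of grad(phi_i) . grad(phi_j)
    for linear basis functions on the triangle with vertices p0, p1, p2."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    area = abs(det) / 2.0
    # Gradients of the three hat functions are constant on the cell.
    grads = [((y1 - y2) / det, (x2 - x1) / det),
             ((y2 - y0) / det, (x0 - x2) / det),
             ((y0 - y1) / det, (x1 - x0) / det)]
    return [[area * (gi[0] * gj[0] + gi[1] * gj[1]) for gj in grads]
            for gi in grads]


def assemble(vertices, cells):
    """Add every element matrix into a dense global matrix."""
    n = len(vertices)
    A = [[0.0] * n for _ in range(n)]
    for cell in cells:
        AK = element_stiffness(*(vertices[i] for i in cell))
        for a, i in enumerate(cell):
            for b, j in enumerate(cell):
                A[i][j] += AK[a][b]
    return A


# Unit square split into two triangles.
vertices = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
cells = [(0, 1, 2), (0, 2, 3)]
A = assemble(vertices, cells)
```

The resulting matrix is symmetric and each row sums to zero, as expected for a pure Laplacian without boundary conditions. A production assembler would use a sparse, distributed matrix instead of nested lists.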
A user of FEniCS-HPC writes the weak forms in the UFL language, compiles them with FFC, and includes them in a high-level “solver” written in C++ in DOLFIN-HPC to read in a mesh, assemble the forms, solve linear systems, refine the mesh, etc. The Unicorn solver for adaptive computation of turbulent flow and FSI is developed as part of FEniCS-HPC.
2.1 Solving PDE problems in FEniCS-HPC
To solve Poisson’s equation in weak form
\[ (\nabla u, \nabla v) - (f, v) = 0 \]
in the framework, we first define the weak form in a UFL “form file”, closely mapping mathematical notation (see Figure 2). The form file is then compiled with FFC to low-level C++ source code for assembling the local element matrix and vector. Finally we use DOLFIN-HPC to write a high-level “solver” in C++, composing the different abstractions, where a mesh is defined, the global matrix and vector are assembled by interfacing to the generated source code, the linear system is solved by an abstract parallel linear algebra interface (using PETSc as back-end by default), and then the solution function is saved to disk. The source code for an example solver is presented in Figure 2.
    Q = FiniteElement("CG", "tetrahedron", 1)
    v = TestFunction(Q)   # test basis function
    u = TrialFunction(Q)  # trial basis function
    f = Coefficient(Q)    # function
    # Bilinear and linear forms
    a = dot(grad(v), grad(u))*dx
    L = v*f*dx

    // Define mesh, BCs and coefficients
    PoissonBoundary boundary;
    DirichletBC bc(u0, mesh, boundary);
    // Define PDE
    LinearPDE pde(a, L, mesh, bc);
    // Solve PDE
    Function u;
    pde.solve(u);
    // Save solution to file
    File file("poisson.pvd");
    file << u;
Fig. 2: Poisson solver in FEniCS-HPC with the weak form in the UFL language
(left) and the solver in C++ using DOLFIN-HPC (right).
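The same workflow (define a mesh, assemble cell by cell, apply Dirichlet conditions, solve the linear system) can be followed by hand on a one-dimensional model problem. The sketch below is an assumption-laden illustration in plain Python, not DOLFIN-HPC code: it solves -u'' = f on (0, 1) with u(0) = u(1) = 0 using linear elements and a direct tridiagonal solve, and the function name `solve_poisson_1d` is hypothetical.

```python
def solve_poisson_1d(n_cells, f=1.0):
    """Solve -u'' = f on (0, 1), u(0) = u(1) = 0, with P1 elements (n_cells >= 2)."""
    h = 1.0 / n_cells
    n = n_cells + 1
    diag = [0.0] * n        # main diagonal of the global matrix
    off = [0.0] * n_cells   # off[c] couples nodes c and c + 1 (symmetric)
    b = [0.0] * n           # load vector
    for c in range(n_cells):  # assemble cell by cell
        diag[c] += 1.0 / h
        diag[c + 1] += 1.0 / h
        off[c] -= 1.0 / h
        b[c] += 0.5 * f * h
        b[c + 1] += 0.5 * f * h
    # Homogeneous Dirichlet conditions: keep only the interior unknowns.
    d, r, o = diag[1:-1], b[1:-1], off[1:-1]
    m = len(d)
    for i in range(1, m):  # forward elimination
        w = o[i - 1] / d[i - 1]
        d[i] -= w * o[i - 1]
        r[i] -= w * r[i - 1]
    u = [0.0] * m
    u[-1] = r[-1] / d[-1]
    for i in range(m - 2, -1, -1):  # back substitution
        u[i] = (r[i] - o[i] * u[i + 1]) / d[i]
    return [0.0] + u + [0.0]


u = solve_poisson_1d(4)
```

For f = 1 the exact solution is u(x) = x(1 - x)/2, and in one dimension the linear finite element solution reproduces it at the mesh nodes up to rounding.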
The incompressible Navier-Stokes equations
We formulate the General Galerkin (G2) method for incompressible Navier-Stokes equations in UFL by a direct input of the weak residual. We can automatically derive the Jacobian in a quasi-Newton fixed-point formulation and also automatically linearize and generate the adjoint problem needed for adaptive error control. These examples are presented in Figure 3.
    V = VectorElement("CG", "tetrahedron", 1)
    Q = FiniteElement("CG", "tetrahedron", 1)
    v = TestFunction(V); q = TestFunction(Q)
    u_ = TrialFunction(V); p_ = TrialFunction(Q)
    u = Coefficient(V); p = Coefficient(Q)
    u0 = Coefficient(V); um = 0.5*(u + u0)
    # Momentum and continuity weak residuals
    r_m = (inner(u - u0, v)/k + \
        (nu*inner(grad(um), grad(v)) + \
        inner(grad(p) + grad(um)*um, v)))*dx + LS_u*dx
    r_c = inner(div(u), q)*dx + LS_p*dx
    # Newton's method Ju_i+1 = Ju_i - F(u_i)
    a = derivative(r_m, u, u_)
    L = action(a, u) - r_m
    # Adjoint problem (stationary part) for r_m
    a_adjoint = adjoint(derivative(r_m - inner(u, v)/k*dx, u))
    L_adjoint_c = derivative(action(r_c, p), u, v)
    L_adjoint = inner(psi_m, v)*dx - L_adjoint_c
Fig. 3: Example of weak forms in UFL notation for the cG(1)cG(1) method for
incompressible Navier-Stokes equations (left) together with the adjoint problem
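The Newton update written in the UFL comment, J u_{i+1} = J u_i - F(u_i), can be tried out on a scalar residual. This is a minimal sketch under stated assumptions: the Jacobian is approximated by a central finite difference, which merely stands in for the symbolic differentiation that `derivative` performs in UFL, and the function name `newton` and its parameters are hypothetical.

```python
def newton(F, u0, tol=1e-12, max_iter=50, eps=1e-7):
    """Iterate J*u_new = J*u - F(u) for a scalar residual F."""
    u = u0
    for _ in range(max_iter):
        r = F(u)
        if abs(r) < tol:
            break
        # Finite-difference Jacobian (assumption: stand-in for derivative()).
        J = (F(u + eps) - F(u - eps)) / (2.0 * eps)
        # Same form as in the listing: right-hand side is action(J, u) - residual.
        u = (J * u - r) / J
    return u


root = newton(lambda u: u * u - 2.0, 1.0)
```

Here the residual u^2 - 2 has the root sqrt(2). In the solver the scalar division is replaced by the solution of a linear system with the assembled Jacobian matrix.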
3 Parallelization strategy and performance
The parallelization is based on a fully distributed mesh approach, where everything from preprocessing and assembly of linear systems to postprocessing and refinement is performed in parallel, without representing the entire problem or any pre-/postprocessing step on a single core.
Initial data distribution is defined by the graph partitioning of the corresponding dual graph of the mesh. Each core is assigned a set of whole elements and the vertex overlap between cores is represented as ghosted entities.
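The notion of vertex overlap can be illustrated in a few lines. The sketch below is serial and purely illustrative: it assumes the partitioner has already produced one owning core per cell, and the function name `shared_vertices` is hypothetical rather than a DOLFIN-HPC routine.

```python
def shared_vertices(cells, owner):
    """Map each vertex touched by more than one core to the set of those cores.

    cells: list of vertex-index tuples, one per element.
    owner: owner[c] is the core that element c is assigned to (assumed given).
    """
    touched = {}
    for cell, rank in zip(cells, owner):
        for v in cell:
            touched.setdefault(v, set()).add(rank)
    return {v: ranks for v, ranks in touched.items() if len(ranks) > 1}


# Two triangles sharing the edge (0, 2), assigned to cores 0 and 1.
overlap = shared_vertices([(0, 1, 2), (0, 2, 3)], [0, 1])
```

Vertices 0 and 2 lie on the interface between the two cores; these are the entities that would need a ghosted representation.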
3.1 Parallel assembly
The assembly of the global matrix is performed in a straightforward fashion. Each core computes the local matrices of its local elements and adds them to the global matrix. Since we assign whole elements to each core, we can minimize data dependency during assembly. Furthermore, we renumber all the degrees of freedom such that a minimal amount of communication is required when modifying entries in the sparse matrix.
3.2 Solution of discrete system
The FEM discretization generates a non-linear algebraic equation system to be
solved for each time step. In Unicorn we solve this by iterating between the
velocity and pressure equations by a Picard or quasi-Newton iteration .
Each iteration in turn generates a linear system to be solved. We use simple
Krylov solvers and preconditioners which scale well to many cores, typically
BiCGSTAB with a block-Jacobi preconditioner, where each block is solved with
3.3 Mesh reﬁnement
Local mesh refinement is based on a parallelization of the well-known recursive longest edge bisection method [15]. The parallelization splits the refinement into two phases. First, a local serial refinement phase bisects all elements marked for refinement on each core (concurrently), leaving several hanging nodes on the shared interface between cores. The second phase propagates these hanging nodes onto adjacent cores.
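A single bisection step, the building block of the local phase, can be sketched for a triangle in 2D. This is an illustration only: it performs one split and leaves out the recursive propagation to neighbouring elements, and the function names `bisect_longest_edge` and `area` are hypothetical.

```python
import math


def bisect_longest_edge(tri):
    """Split a triangle (three (x, y) tuples) at the midpoint of its longest edge.

    Returns the two child triangles and the new midpoint vertex.
    """
    edges = [(0, 1), (1, 2), (2, 0)]
    i, j = max(edges, key=lambda e: math.dist(tri[e[0]], tri[e[1]]))
    k = 3 - i - j  # index of the vertex opposite the longest edge
    mid = ((tri[i][0] + tri[j][0]) / 2.0, (tri[i][1] + tri[j][1]) / 2.0)
    return [(tri[i], mid, tri[k]), (mid, tri[j], tri[k])], mid


def area(tri):
    (x0, y0), (x1, y1), (x2, y2) = tri
    return abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)) / 2.0


children, new_vertex = bisect_longest_edge(((0.0, 0.0), (1.0, 0.0), (0.0, 1.0)))
```

If the bisected edge lies on the interface between two cores, the new midpoint vertex is a hanging node for the neighbouring core until the second phase has propagated it.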
The algorithm iterates between local refinement and global propagation until all cores are free of hanging nodes. For an efficient implementation, one has to detect when all cores are idling at the same time. Our implementation uses a fully distributed termination detection scheme, which includes termination detection in the global propagation step by using recursive doubling or hypercube exchange type communication patterns. Also, the termination detection algorithm does not have a central point of control, hence no bottlenecks, less message contention, and no problems with load imbalance.
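The recursive doubling pattern can be simulated serially to see why no central coordinator is needed. The sketch below is a simplified model, not the implementation used in DOLFIN-HPC: it assumes a power-of-two number of cores, each holding one Boolean flag ("I still have hanging nodes"), and the function name `recursive_doubling_or` is hypothetical.

```python
def recursive_doubling_or(flags):
    """Simulate a hypercube exchange: after log2(p) rounds every rank holds
    the logical OR of all initial flags. Returns (final flags, rounds)."""
    p = len(flags)
    assert p > 0 and p & (p - 1) == 0, "sketch assumes a power-of-two core count"
    vals = list(flags)
    step, rounds = 1, 0
    while step < p:
        # In round k, rank r exchanges its value with partner r XOR 2^k.
        vals = [vals[r] or vals[r ^ step] for r in range(p)]
        step *= 2
        rounds += 1
    return vals, rounds


flags = [False] * 8
flags[5] = True  # only core 5 still has work to propagate
result, rounds = recursive_doubling_or(flags)
```

With eight cores every rank learns after three pairwise exchanges that at least one core is not yet finished. The iteration terminates once the combined flag is false on all ranks.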
Dynamic load balancing
In order to sustain good load balance across several adaptive iterations, dynamic load balancing is needed. DOLFIN-HPC is equipped with a scratch-and-remap type load balancer, based on the widely used PLUM scheme [14], where the new partitions are assigned in an optimal way by solving the maximally weighted bipartite graph problem. We have improved the scheme such that it scales linearly to thousands of cores [10, 8].
Furthermore, we have extended the load balancer with an a priori workload estimation. With a dry run of the refinement algorithm, we add weights to a dual graph of the mesh, corresponding to the workload after refinement. Finally, we repartition the unrefined mesh according to the weighted dual graph and redistribute the new partitions before the refinement.
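The remapping step can be illustrated on a tiny example. The sketch below is a brute-force stand-in that tries every assignment, which is feasible only for a handful of cores; it is neither the PLUM heuristic nor the scalable scheme described above. The similarity matrix and the function name `remap_partitions` are hypothetical.

```python
from itertools import permutations


def remap_partitions(similarity):
    """Assign new partitions to cores so that as much data as possible stays put.

    similarity[i][j]: amount of data of new partition j already stored on core i.
    Returns (assignment, kept) where assignment[i] is the partition given to core i.
    """
    p = len(similarity)
    best = max(permutations(range(p)),
               key=lambda perm: sum(similarity[i][perm[i]] for i in range(p)))
    kept = sum(similarity[i][best[i]] for i in range(p))
    return list(best), kept


# Hypothetical overlap between three old and three new partitions.
S = [[1, 8, 2],
     [7, 1, 1],
     [2, 3, 9]]
assignment, kept = remap_partitions(S)
```

For this matrix the best assignment gives partition 1 to core 0, partition 0 to core 1 and partition 2 to core 2, keeping 24 units of data in place. Maximizing the retained data is equivalent to minimizing the amount that must be migrated.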
4 Strong scalability
To be able to take advantage of available supercomputers today, the entire solver in FEniCS-HPC needs to demonstrate good strong scaling to at least several thousands of cores. For planned “exascale” systems with many million cores, strong scalability has to be attained for at least hundreds of thousands of cores.
In this section we analyze scaling results using the PETSc parallel linear algebra backend based on pure MPI and the JANPACK backend based on PGAS.
In Figure 4 we present strong scalability results with the PETSc pure MPI backend for the full G2 method for turbulent incompressible Navier-Stokes equations (assembling linear systems and solving the momentum and continuity equations) in 3D on a mesh with 147M vertices on the Hornet Cray XC40 computer. We observe near-optimal scaling to ca. 4.6 kcores for all the main algorithms (assembly and linear solves). Going from 4.6 kcores to 9.2 kcores we start to see a degradation in the scaling with a speedup of ca. 0.7, and from 9.2 kcores to 18.4 kcores the speedup is 0.5. It is clear that it is mainly the assembly that shows degraded scaling.
In Figure 5 we present results for assembling four different equations using the JANPACK backend, where FEniCS-HPC is running in a hybrid MPI+PGAS mode. We observe that for a large number of cores, the low-latency one-sided communication of PGAS languages in combination with our new sparse matrix format [9] greatly improves the scalability.
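The local problem size behind these core counts follows from simple arithmetic. The sketch below uses the mesh size and the approximate core counts quoted in the text; the helper `parallel_efficiency` and the timings passed to it are hypothetical and only show how a strong-scaling efficiency would be computed from measured wall-clock times.

```python
def vertices_per_core(n_vertices, n_cores):
    """Average number of mesh vertices handled by each core."""
    return n_vertices / n_cores


def parallel_efficiency(t_ref, cores_ref, t, cores):
    """Strong-scaling efficiency relative to a reference run (1.0 is optimal)."""
    return (t_ref * cores_ref) / (t * cores)


# 147M vertices on ca. 4.6, 9.2 and 18.4 kcores (values from the text).
for cores in (4600, 9200, 18400):
    print(cores, round(vertices_per_core(147_000_000, cores)))

# Hypothetical timings: halving the run time on twice the cores is optimal.
print(parallel_efficiency(100.0, 1000, 50.0, 2000))
```

With these rounded core counts the local problem size drops from roughly 32000 to roughly 8000 vertices per core over the three runs, which is the regime where the degradation in scaling is reported.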
Fig. 4: Strong scalability test for the full G2 method for incompressible turbulent Navier-Stokes equations (assemble linear systems and solve momentum and continuity) in 3D on a Cray XC40.
5 Unicorn simulation of a full aircraft
In the Unicorn component we implement the full G2 method and fix the weak residual to the cG(1)cG(1) stabilized space-time method for incompressible Navier-Stokes equations (or a general stress for FSI).
Fig. 5: Sparse matrix assembly timings for four different equations on a Cray. Panels: 2D convection-diffusion (214M cells), 3D Poisson (317M cells), 3D Navier-Stokes (80M cells), 3D linear elasticity (14M cells).
In a cG(1)cG(1) method we seek an approximate space-time solution $\hat{U} = (U, P)$ which is continuous piecewise linear in space and time (equivalent to the implicit Crank-Nicolson method). With $I$ a time interval with subintervals $I_n = (t_{n-1}, t_n)$, $W^n$ a standard spatial finite element space of continuous piecewise linear functions, and $W^n_0$ the functions in $W^n$ which are zero on the boundary $\Gamma$, the cG(1)cG(1) method for constant density incompressible flow with homogeneous Dirichlet boundary conditions for the velocity takes the form: for $n = 1, ..., N$, find $(U^n, P^n) \equiv (U(t_n), P(t_n))$ with $U^n \in V^n_0 \equiv [W^n_0]^3$ and $P^n \in W^n$, such that
\[ r((U, P), (v, q)) = ((U^n - U^{n-1}) k_n^{-1} + (\bar{U}^n \cdot \nabla)\bar{U}^n, v) + (2\nu\epsilon(\bar{U}^n), \epsilon(v)) - (P, \nabla \cdot v) + (\nabla \cdot \bar{U}^n, q) + LS = 0, \quad \forall \hat{v} = (v, q) \in V^n_0 \times W^n \qquad (1) \]
where $\bar{U}^n = \frac{1}{2}(U^n + U^{n-1})$ is piecewise constant in time over $I_n$ and LS a
least-squares stabilizing term described in .
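The equivalence with the Crank-Nicolson method mentioned above can be checked on a scalar model problem. The sketch below is an illustration under simplifying assumptions: it replaces the Navier-Stokes residual by the linear decay equation u' = -lam*u and evaluates the right-hand side at the average of two consecutive time levels, in the same way as `um = 0.5*(u + u0)` in Figure 3. The function name `crank_nicolson_decay` is hypothetical.

```python
import math


def crank_nicolson_decay(lam, k, n_steps, u0=1.0):
    """Integrate u' = -lam*u with the midpoint (Crank-Nicolson) scheme
    (U_n - U_{n-1})/k + lam*(U_n + U_{n-1})/2 = 0."""
    u = u0
    for _ in range(n_steps):
        u = u * (1.0 - 0.5 * k * lam) / (1.0 + 0.5 * k * lam)
    return u


exact = math.exp(-1.0)
err_coarse = abs(crank_nicolson_decay(1.0, 0.1, 10) - exact)
err_fine = abs(crank_nicolson_decay(1.0, 0.05, 20) - exact)
ratio = err_coarse / err_fine
```

Halving the time step reduces the error at the final time by a factor close to four, consistent with the second-order accuracy of the scheme.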
We formulate a new general adjoint-based method for adaptive error control based on the following error representation and adjoint weak bilinear and linear forms with the error $\hat{e} = \hat{u} - \hat{U}$, adjoint solution $\hat{\phi}$, output quantity $(\hat{U}, \psi)$ and a hat signifying the full velocity-pressure vector $\hat{U} = (U, P)$, with $r_G = r - LS$:
\[ (\hat{e}, \psi) = r'(\hat{e}, \hat{\phi}) = r_G(\hat{U}, \hat{\phi}), \qquad a_{adjoint}(v, \hat{\phi}) = r'(v, \hat{\phi}), \qquad L_{adjoint}(v) = (v, \psi) \qquad (2) \]
We have used our adaptive finite element methodology for turbulent flow and FEniCS-HPC software to solve the incompressible Navier-Stokes equations of the flow past a full high-lift aircraft model (DLR-F11) with complex geometry at realistic Reynolds number for take-off and landing. This work is an extension of our contributed simulation results to the 2nd AIAA CFD High-Lift Prediction Workshop (HiLiftPW-2), in San Diego, California, in 2013.
In the following results we focus on the angle of attack
. To quantify
mesh-convergence we plot the coeﬃcients and their relative error compared to
the experimental values (serving as the reference) versus the number of vertices
in the meshes, and plot meshes and volume renderings of quantities related to
the adaptivity in Figure 6.
We see that our adaptive computational results come very close to the
experimental results on the ﬁnest mesh, with a relative error under 1% for cl and
cd. For other angles we observe similar results presented in .
Fig. 6: Plots for the aircraft simulation at the angle of attack in focus. Lift coefficient, cl, vs. angle of attack for the different meshes from the iterative adaptive method (left). Slice aligned with the angle of attack showing the tetrahedra of the starting mesh versus the finest adaptive mesh (top right). Volume rendering of the velocity residual and adjoint velocity magnitude (bottom right).
We have given an overview of the general FEniCS-HPC software framework for automated solution of PDE, taking the weak form as input in near-mathematical notation, with automated discretization and a new simple method for adaptive error control, suitable for parallel implementation. On the Hornet Cray XC40 supercomputer we demonstrate new optimal strong scaling results for the whole adaptive framework applied to turbulent flow on massively parallel architectures down to 25000 vertices per core with ca. 5000 cores with the MPI-based PETSc backend and for assembly down to 500 vertices per core with ca. 20000 cores with the PGAS-based JANPACK backend.
Using the Unicorn component in FEniCS-HPC we have simulated the aerodynamics of a full DLR-F11 aircraft in connection with the HiLift-PW2 benchmarking workshop. We find that the simulation results compare very well with experimental data; moreover, we show mesh-convergence by the adaptive method, while using a low number of spatial degrees of freedom.
This research has been supported by EU-FET grant EUNISON 308874, the European Research Council, the Swedish Foundation for Strategic Research, the Swedish Research Council, the Basque Excellence Research Center (BERC 2014-2017) program by the Basque Government, the Spanish Ministry of Economy and Competitiveness MINECO: BCAM Severo Ochoa accreditation SEV-2013-0323 and the Project of the Spanish MINECO: MTM2013-40824.
We acknowledge PRACE for awarding us access to the supercomputer resources Hermit, Hornet and SuperMUC based in Germany at The High Performance Computing Center Stuttgart (HLRS) and Leibniz Supercomputing Center (LRZ), from the Swedish National Infrastructure for Computing (SNIC) at PDC – Center for High-Performance Computing and on resources provided by the “Red Española de Supercomputación” and the “Barcelona Supercomputing Center - Centro Nacional de Supercomputación”.
We would also like to acknowledge the FEniCS and FEniCS-HPC developers
1. W. Bangerth, R. Hartmann, and G. Kanschat. deal.II - a general-purpose object-oriented finite element library. ACM Trans. Math. Softw., 33(4), 2007.
2. FEniCS. FEniCS project, 2003. http://www.fenicsproject.org.
3. F. Hecht. New development in freefem++. J. Numer. Math., 20, 2012.
4. J. Hoffman, J. Jansson, R. Vilela de Abreu, N. C. Degirmenci, N. Jansson, K. Müller, M. Nazarov, and J. H. Spühler. Unicorn: Parallel adaptive finite element simulation of turbulent flow and fluid-structure interaction for deforming domains and complex geometry. Comput. Fluids, 80(0):310-319, 2013.
5. J. Hoffman, J. Jansson, N. Jansson, and R. Vilela De Abreu. Towards a parameter-free method for high Reynolds number turbulent flow simulation based on adaptive finite element approximation. Computer Methods in Applied Mechanics and Engineering, 288(0):60-74, 2015.
6. J. Hoffman, J. Jansson, and M. Stöckli. Unified continuum modeling of fluid-structure interaction. Math. Mod. Meth. Appl. S., 2011.
7. Johan Hoffman and Claes Johnson. Computational Turbulent Incompressible Flow, volume 4 of Applied Mathematics: Body and Soul. Springer, 2007.
8. Niclas Jansson. High Performance Adaptive Finite Element Methods: With Applications in Aerodynamics. PhD thesis, KTH Royal Institute of Technology, 2013.
9. Niclas Jansson. Optimizing Sparse Matrix Assembly in Finite Element Solvers with One-sided Communication. In High Performance Computing for Computational Science - VECPAR 2012, volume 7851 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2013.
10. Niclas Jansson, Johan Hoffman, and Johan Jansson. Framework for Massively Parallel Adaptive Finite Element Computational Fluid Dynamics on Tetrahedral Meshes. SIAM J. Sci. Comput., 34(1):C24-C41, 2012.
11. R. C. Kirby and A. Logg. A compiler for variational forms. ACM Transactions on Mathematical Software, 32(3):417-444, 2006.
12. Robert C Kirby. Algorithm 839: Fiat, a new paradigm for computing finite element basis functions. ACM Transactions on Mathematical Software (TOMS), 2004.
13. Anders Logg, Kent-Andre Mardal, Garth N. Wells, et al. Automated Solution of Differential Equations by the Finite Element Method. Springer, 2012.
14. Leonid Oliker. PLUM parallel load balancing for unstructured adaptive meshes. Technical Report RIACS-TR-98-01, RIACS, NASA Ames Research Center, 1998.
15. MC Rivara. New longest-edge algorithms for the refinement and/or improvement of unstructured triangulations. Int. J. Numer. Meth. Eng., 1997.