Special Issue Paper

The static parallel distribution algorithms for hybrid density-functional calculations in HONPAS package

Xinming Qin¹, Honghui Shang², Lei Xu², Wei Hu¹, Jinlong Yang¹, Shigang Li² and Yunquan Zhang²
Abstract
Hybrid density-functional calculation is one of the most commonly adopted electronic structure theories in computational
chemistry and materials science because of its balance between accuracy and computational cost. Recently, we have
developed a novel scheme called NAO2GTO to achieve linear scaling (Order-N) calculations for hybrid density-
functionals. In our scheme, the most time-consuming step is the calculation of the electron repulsion integrals (ERIs)
part, so creating an even distribution of these ERIs in parallel implementation is an issue of particular importance. Here, we
present two static scalable distributed algorithms for the ERIs computation. In the first, the ERIs are distributed over shell pairs; in the second, over shell quartets. In both algorithms, the calculations of the ERIs are independent of each other, so the communication time is minimized. We show our speedup results to demonstrate the performance of
these static parallel distributed algorithms in the Hefei Order-N packages for ab initio simulations.
Keywords
Distributed algorithms, hybrid density-functional calculations, HONPAS package, electron repulsion integrals, parallel
implementation
1. Introduction
The electronic structure calculations based on density func-
tional theory (DFT) (Hohenberg and Kohn, 1964; Kohn and
Sham, 1965; Parr and Yang, 1989) are the workhorse of
computational chemistry and materials science. However,
widely used semi-local density functionals can underestimate band gaps because of their inclusion of the unphysical self-interaction (Mori-Sánchez et al., 2008). A
possible solution is to add the nonlocal Hartree–Fock-
type exchange (HFX) into semi-local density-functionals
to construct hybrid functionals (Becke, 1993; Delhalle and
Calais, 1987; Frisch et al., 2009; Gell-Mann and Brueckner,
1957; Heyd et al., 2003; Heyd et al., 2006; Janesko et al.,
2009; Krukau et al., 2006; Monkhorst, 1979; Paier et al.,
2006; Paier et al., 2009; Stephens et al., 1994). However,
the drawback of hybrid density-functionals is that they are significantly more expensive than conventional DFT. The
most time-consuming part in hybrid density-functional cal-
culations becomes the construction of HFX matrix, even
with the appearance of fast linear scaling algorithms that
overcome the bottlenecks encountered in conventional
methods (Burant et al., 1996; Guidon et al., 2010; Merlot
et al., 2014; Ochsenfeld et al., 1998; Polly et al., 2004;
Schwegler and Challacombe, 1996; Schwegler et al.,
1997; Sodt and Head-Gordon, 2008; Tymczak and Challa-
combe, 2005). As a result, hybrid density-functional calcu-
lations must make efficient use of parallel computing
resources in order to reduce the execution time of HFX
matrix construction.
The implementation of hybrid density-functionals for
solid-state physics calculations is mostly based on plane
waves (PW) (Gonze et al., 2002, 2016; Paier et al., 2006)
or the linear combination of atomic orbitals (LCAO)
(Dovesi et al., 2006; Frisch et al., 2009; Krukau et al.,
2006) method. The atomic orbitals basis set is efficient for
real-space formalisms, which have attracted considerable
interest for DFT calculations because of their favorable
1 Hefei National Laboratory for Physical Sciences at Microscale, Department of Chemical Physics, and Synergetic Innovation Center of Quantum Information and Quantum Physics, University of Science and Technology of China, Hefei, Anhui, China
2 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Corresponding author:
Honghui Shang, State Key Laboratory of Computer Architecture, Institute
of Computing Technology, Chinese Academy of Sciences, Beijing, China.
Email: shanghui.ustc@gmail.com
The International Journal of High Performance Computing Applications 1–10
© The Author(s) 2019
Article reuse guidelines: sagepub.com/journals-permissions
DOI: 10.1177/1094342019845046
journals.sagepub.com/home/hpc
scaling with respect to the number of atoms and their poten-
tial for massively parallel implementations for large-scale
calculations (Blum et al., 2009; Delley, 1990; Enkovaara
et al., 2010; Frisch et al., 2009; Havu et al., 2009; Mohr
et al., 2014; Ren et al., 2012; Shang et al., 2011; Soler et al.,
2002). Unlike the plane wave method, when constructing
HFX matrix within the LCAO method, we must first cal-
culate the electron repulsion integrals (ERIs) via the atomic
orbitals. There are currently two types of atomic orbitals that are most commonly used. The first is the Gaussian-type orbital (GTO), as adopted in Gaussian (Frisch et al., 2009) and CRYSTAL (Dovesi et al., 2006); its advantage is that the ERIs can be calculated analytically. The second is the numerical atomic
orbital (NAO), which is adopted in SIESTA (Soler et al.,
2002), DMOL (Delley, 1990), and OPENMX (Ozaki,
2003). The advantage of NAO is its strict locality, which
naturally leads to lower order scaling of computational time
versus system size. In order to take advantage of both
types of atomic orbitals, we have proposed a new scheme
called NAO2GTO (Shang et al., 2011), in which GTO can
be used for analytical computation of ERIs in a straightfor-
ward and efficient way, while NAO can be employed to set
the strict cutoff for atomic orbitals. After employing several
ERI screening techniques, the construction of HFX matrix
can be very efficient and scale linearly (Qin et al., 2014;
Shang et al., 2011).
Parallelization of the HFX matrix construction faces two major problems: load imbalance and high communication
cost. The load imbalance arises from the irregularity of the
independent tasks available in the computation, which is
due to the screening procedure and different types of shell
quartets distributed among processes. The high communi-
cation cost is from interprocessor communication of the
density and/or HFX matrices, which is associated with the
data access pattern. It is well known that NWChem (Valiev et al., 2010) and CP2K/Quickstep (VandeVondele et al., 2005) are among the most outstanding software packages in the field of high performance parallel quantum chemical computing, and both of them use GTOs to construct the HFX matrix. In
NWChem, the parallelization of HFX matrix construction
is based on a static partitioning of work followed by a work
stealing phase (Chow et al., 2015; Liu et al., 2014). The
tasks are statically partitioned throughout the set of shell (or
atom) quartets, and then the work stealing phase acts to
polish the load balance. As a result, this parallel implemen-
tation gives very good parallel scalability of Hartree–Fock
calculations (Liu et al., 2014). In CP2K/Quickstep, the
HFX parallelization strategy is to replicate the global den-
sity and HFX matrix on each MPI process in order to
reduce communication. A load balance optimization based
on simulated annealing and a binning procedure to coarse
grain the load balancing problem have been developed
(Guidon et al., 2008). However, this approach may limit
both system size and ultimately parallel scalability.
As the ERIs calculation is the most computationally demanding step in the NAO2GTO scheme, the development of new parallel algorithms is of particular importance. Previously, for codes using localized atomic orbitals, the parallelization of the ERIs has mainly been implemented to treat finite, isolated systems (Alexeev et al., 2002; Chow et al., 2015; Liu et al., 2014; Schmidt et al., 1993); only a few literature reports exist for the treatment of periodic boundary conditions with such basis sets (Bush et al., 2011; Guidon et al., 2008), and in those the Order-N screening for the ERIs calculations has not been considered. The purpose of this work is to present the static parallel distribution algorithms for the NAO2GTO scheme (Shang et al., 2011) with Order-N performance in the Hefei Order-N packages for ab initio simulations code (Qin et al., 2014). In our approaches, the calculations of the ERIs are not replicated but are distributed over CPU cores; as a result, both the memory and the CPU requirements of the ERIs calculation are parallelized. The efficiency and scalability of these algorithms are demonstrated by benchmark timings on a periodic solid system with 64 silicon atoms in the unit cell.
The outline of this article is as follows: In Section 2, we
begin with a description of the theory of hybrid functionals.
In Section 3, we describe the detailed implementation of
our parallel distribution. In Section 4, we present the bench-
mark results and the efficiency of our scheme. In Section 5,
we summarize the whole paper and show the future
research directions.
2. Fundamental theoretical framework
Before addressing the parallel algorithms, we recall the
basic equations used in this work. A spin-unpolarized nota-
tion is used throughout the text for the sake of simplicity,
but a formal generalization to the collinear spin case is
straightforward. In Kohn–Sham DFT, the total-energy
functional is given as
E_KS = T_s[n] + E_ext[n] + E_H[n] + E_xc[n] + E_nuc-nuc   (1)

Here, n(r) is the electron density and T_s is the kinetic energy of noninteracting electrons, while E_ext is the external energy stemming from the electron-nuclear attraction, E_H is the Hartree energy, E_xc is the exchange-correlation energy, and E_nuc-nuc is the nucleus–nucleus repulsion energy.

The ground state electron density n_0(r) (and the associated ground state total energy) is obtained by variationally minimizing equation (1) under the constraint that the number of electrons N_e is conserved. This yields the chemical potential μ = δE_KS/δn of the electrons and the Kohn–Sham single particle equations

ĥ_KS ψ_p = [ t̂_s + v_ext(r) + v_H + v_xc ] ψ_p = ε_p ψ_p   (2)

for the Kohn–Sham Hamiltonian ĥ_KS. In equation (2), t̂_s denotes the kinetic energy operator, v_ext the external potential, v_H the Hartree potential, and v_xc the exchange-correlation potential. Solving equation (2) yields the Kohn–Sham single particle states ψ_p and their eigenenergies ε_p. The single particle states determine the electron density via
n(r) = Σ_i f_i |ψ_i|²   (3)

in which f_i denotes the Fermi–Dirac distribution function.
To solve equation (2) in numerical implementations, the Kohn–Sham states are expanded in a finite basis set. For periodic systems, the crystalline orbital ψ_i(k, r) normalized in all space is a linear combination of Bloch functions φ_μ(k, r), defined in terms of atomic orbitals φ_μ^R(r):

ψ_i(k, r) = Σ_μ c_{μ,i}(k) φ_μ(k, r)   (4)

φ_μ(k, r) = (1/√N) Σ_R φ_μ^R(r) e^{ik·(R + r_μ)}   (5)

where the Greek letter μ is the index of atomic orbitals, i is the suffix for different bands, R is the origin of a unit cell, and N is the number of unit cells in the system. φ_μ^R(r) = φ_μ(r − R − r_μ) is the μth atomic orbital, whose center is displaced from the origin of the unit cell at R by r_μ. c_{μ,i}(k) is the wave function coefficient, which is obtained by solving the following equation

H(k) c(k) = E(k) S(k) c(k)   (6)

[H(k)]_{μν} = Σ_R H_{μν}^R e^{ik·(R + r_ν − r_μ)}   (7)

H_{μν}^R = ⟨φ_μ^0 | Ĥ | φ_ν^R⟩   (8)

[S(k)]_{μν} = Σ_R S_{μν}^R e^{ik·(R + r_ν − r_μ)}   (9)

S_{μν}^R = ⟨φ_μ^0 | φ_ν^R⟩   (10)
In equation (8), H_{μν}^R is a matrix element of the one-electron Hamiltonian operator Ĥ between the atomic orbital φ_μ located in the central unit cell 0 and φ_ν located in the unit cell R.
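Equation (6) is a generalized eigenvalue problem at each k point. As an illustrative sketch (not the solver actually used in HONPAS, which is not described here), it can be solved by Löwdin orthogonalization, shown below with a toy 2×2 Hamiltonian and overlap matrix:

```python
import numpy as np

def solve_generalized(H, S):
    """Solve H c = E S c by Lowdin orthogonalization:
    form S^(-1/2), then diagonalize S^(-1/2) H S^(-1/2)."""
    # Diagonalize the (Hermitian) overlap matrix S
    s_eval, s_evec = np.linalg.eigh(S)
    # Build S^(-1/2); assumes S is well-conditioned (no tiny eigenvalues)
    S_inv_sqrt = s_evec @ np.diag(s_eval**-0.5) @ s_evec.conj().T
    # Transform to an orthonormal basis and diagonalize there
    Ht = S_inv_sqrt @ H @ S_inv_sqrt
    E, ct = np.linalg.eigh(Ht)
    # Back-transform the eigenvectors to the original (nonorthogonal) basis
    c = S_inv_sqrt @ ct
    return E, c

# Toy 2x2 matrices (made-up numbers, not a real Hamiltonian)
H = np.array([[1.0, 0.2], [0.2, 2.0]])
S = np.array([[1.0, 0.1], [0.1, 1.0]])
E, c = solve_generalized(H, S)
# Verify the residual of H c = E S c for the lowest state
print(np.allclose(H @ c[:, 0], E[0] * S @ c[:, 0]))
```

The back-transformed eigenvectors satisfy cᵀ S c = I, i.e. the states are orthonormal with respect to the overlap metric, as required for nonorthogonal atomic-orbital bases.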
It should be noted that the exchange-correlation potential v_xc is local and periodic in semi-local DFT, while in Hartree–Fock and hybrid functionals the HFX potential matrix element is defined as

V^{X,G}_{μλ} = −(1/2) Σ_{νσ} Σ_{N,H} P^{HN}_{νσ} ( φ_μ^0 φ_ν^N | φ_λ^G φ_σ^H )   (11)

where G, N, and H represent different unit cells. The density matrix element P^N_{νσ} is computed by integration over the Brillouin zone (BZ)

P^N_{νσ} = Σ_j ∫_BZ c*_{ν,j}(k) c_{σ,j}(k) θ(ε_F − ε_j(k)) e^{ik·N} dk   (12)

where θ is the step function, ε_F is the Fermi energy, and ε_j(k) is the jth eigenvalue at point k.

In order to calculate the following ERI in equation (11)

( φ_μ^0 φ_ν^N | φ_λ^G φ_σ^H ) = ∫∫ φ_μ^0(r) φ_ν^N(r) φ_λ^G(r′) φ_σ^H(r′) / |r − r′| dr dr′   (13)

we use the NAO2GTO scheme described in the following section.
Following the flowchart in Figure 1, the NAO2GTO scheme first fits the NAO with GTOs, and then calculates the ERIs analytically with the fitted GTOs. A NAO is a product of a numerical radial function and a spherical harmonic

φ_{Ilmn}(r) = φ_{Iln}(r) Y_{lm}(r̂)   (14)

The radial part of the NAO, φ_{Iln}(r), is calculated by the following equation

[ −(1/2) (1/r) (d²/dr²) r + l(l + 1)/(2r²) + V(r) + V_cut ] φ_{Iln}(r) = ε_l φ_{Iln}(r)   (15)

where V(r) denotes the electrostatic potential for orbital φ_{Iln}(r) and V_cut ensures a smooth decay of each radial function, which is strictly zero outside a confining radius r_cut.
3. Methods
3.1. NAO2GTO scheme
In our NAO2GTO approach, the radial part of the NAO φ_{Iln}(r) is fitted by a sum of several GTOs, denoted as χ(r):

χ(r) ≈ Σ_m D_m r^l exp(−α_m r²)   (16)

The parameters α_m and D_m are determined by minimizing the residual sum of squares of the difference

Σ_i [ χ(r_i)/r_i^l − φ_{Iln}(r_i)/r_i^l ]²   (17)

Figure 1. The flowchart of the NAO2GTO scheme in HONPAS. HONPAS: Hefei Order-N packages for ab initio simulations; NAO: numerical atomic orbital; ERI: electron repulsion integral; GTO: Gaussian-type orbital.
In practical solid-state calculations, a basis set that is too diffuse may cause convergence problems (Guidon et al., 2008). As a result, exponents smaller than 0.10 are usually not needed, and we impose the constraint α > 0.1 during the minimization. The flowchart is shown in Algorithm 1. First, we use a constrained genetic algorithm (Conn et al., 1991; Goldberg, 1989) to perform a global search for the initial guess, and then carry out a constrained local minimization using the trust-region-reflective algorithm, which is a subspace trust-region method based on the interior-reflective Newton method described in Coleman and Li (1994, 1996). Each iteration involves the approximate solution of a large linear system using the method of preconditioned conjugate gradients. We perform N (N > 500) global searches to make sure that a global minimum is found.
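For fixed exponents α_m, the fit of equations (16) and (17) is linear in the coefficients D_m, so the inner step of the search reduces to a linear least-squares solve. The sketch below illustrates only this inner step on a made-up Slater-like target function, with three hand-picked exponents (all above the 0.1 constraint); the genetic-algorithm and trust-region-reflective searches over the exponents themselves are omitted:

```python
import numpy as np

def fit_coefficients(r, target, alphas, l=0):
    """Given fixed exponents alpha_m, find D_m minimizing
    sum_i [ sum_m D_m exp(-alpha_m r_i^2) - target(r_i)/r_i^l ]^2,
    i.e. the linear inner step of the NAO2GTO fit (equation (17))."""
    # Design matrix: one Gaussian column per exponent
    A = np.exp(-np.outer(r**2, alphas))
    D, *_ = np.linalg.lstsq(A, target / r**l, rcond=None)
    return D

# Target: a Slater-like radial function exp(-r), a stand-in for a NAO
r = np.linspace(0.05, 6.0, 200)
target = np.exp(-r)
alphas = np.array([0.15, 0.6, 2.5])   # hand-picked, all > 0.1
D = fit_coefficients(r, target, alphas)
fit = np.exp(-np.outer(r**2, alphas)) @ D
err = np.sum((fit - target)**2)
print(err < 1.0)   # crude but reasonable fit with 3 fixed Gaussians
```

Because the problem is linear in D_m once the α_m are fixed, the expensive global search only has to explore the exponent space, which is the structure exploited by the genetic-algorithm outer loop in Algorithm 1.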
Using the above fitting parameters, we build all the ERIs needed for the HFX. In our implementation, the full permutation symmetry of the ERIs has been considered for solid systems:

(μ^0 ν^H | λ^G σ^N) = (μ^0 ν^H | σ^N λ^G)
= (ν^0 μ^{−H} | λ^{G−H} σ^{N−H}) = (ν^0 μ^{−H} | σ^{N−H} λ^{G−H})
= (λ^0 σ^{N−G} | μ^{−G} ν^{H−G}) = (λ^0 σ^{N−G} | ν^{H−G} μ^{−G})
= (σ^0 λ^{G−N} | μ^{−N} ν^{H−N}) = (σ^0 λ^{G−N} | ν^{H−N} μ^{−N})   (18)
In this way, we save a factor of 8 in the number of
integrals that have to be calculated.
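For a single unit cell (the molecular limit), the factor-of-8 saving can be checked by counting canonical quartets with I ≥ J, K ≥ L, and pair (I,J) ≥ (K,L). A small illustrative sketch (the periodic case of equation (18) additionally shifts the cell indices, which is not modeled here):

```python
def canonical_quartets(n):
    """Enumerate the unique shell quartets (I,J,K,L) under the eightfold
    permutation symmetry (IJ|KL) = (JI|KL) = (IJ|LK) = (KL|IJ) = ...
    Canonical choice: I >= J, K >= L, and bra pair (I,J) >= ket pair (K,L)."""
    quartets = []
    for i in range(n):
        for j in range(i + 1):
            for k in range(n):
                for l in range(k + 1):
                    if (i, j) >= (k, l):
                        quartets.append((i, j, k, l))
    return quartets

n = 20
unique = len(canonical_quartets(n))
print(unique, n**4, n**4 / unique)   # the ratio approaches 8 for large n
```

For n = 20 shells there are 22,155 canonical quartets versus 160,000 unrestricted index combinations, a ratio of about 7.2 that tends to 8 as n grows.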
When calculating the ERIs with GTOs, the atomic orbitals are grouped into shells according to their angular momentum. The list that needs to be distributed in parallel is in fact the list of shell quartets. For a shell with angular momentum l, the number of atomic orbital basis functions is 2l + 1, so in a shell quartet integral (IJ|KL) we calculate in total (2l_I + 1)(2l_J + 1)(2l_K + 1)(2l_L + 1) atomic basis orbital ERIs together. As a result, the computational expense depends strongly on the angular momenta of the shell quartet. It is a challenge to distribute these shell ERIs evenly, not only by their number but also by their time-weight.
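The cost formula above is easy to tabulate; this small sketch shows why an even count of quartets is not an even amount of work:

```python
def eris_in_quartet(l_I, l_J, l_K, l_L):
    """Number of atomic-orbital ERIs contained in one shell quartet,
    (2l_I+1)(2l_J+1)(2l_K+1)(2l_L+1) for spherical-harmonic shells."""
    count = 1
    for l in (l_I, l_J, l_K, l_L):
        count *= 2 * l + 1
    return count

# An (ss|ss) quartet holds one integral, a (dd|dd) quartet 625:
print(eris_in_quartet(0, 0, 0, 0))  # 1
print(eris_in_quartet(1, 1, 1, 1))  # 81
print(eris_in_quartet(2, 2, 2, 2))  # 625
```

A static distribution that balances the number of quartets per core can thus still be off by orders of magnitude in actual work when s, p, and d shells are mixed.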
In our NAO2GTO scheme, two shell pair lists (list-IJ and list-KL) are first preselected according to Schwarz screening (Häser and Ahlrichs, 1989)

|(μν|λσ)| ≤ √( (μν|μν) (λσ|λσ) )   (19)

and only the shell list indexes with (IJ|IJ) > τ or (KL|KL) > τ (here τ is the drop tolerance) are stored. As shown in equation (11), the first index I runs only within the unit cell, while the indexes (J, K, L) run over the whole supercell, so list-IJ is smaller than list-KL. Then, in the ERIs calculations, the loops run over these two shell lists.
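The pair-list preselection can be sketched as follows, with the Schwarz diagonals (IJ|IJ) supplied as a made-up dictionary (in the real code they come from actual integrals) and τ the drop tolerance:

```python
def build_pair_list(shells, diag, tau):
    """Keep only shell pairs whose Schwarz diagonal (IJ|IJ) exceeds the
    drop tolerance tau. Pairs below tau can never produce a significant
    ERI, because |(IJ|KL)| <= sqrt((IJ|IJ)(KL|KL)) by equation (19)."""
    return [(I, J) for I in shells for J in shells
            if diag[(I, J)] > tau]

# Hypothetical diagonal values for 3 shells, decaying with separation
shells = [0, 1, 2]
diag = {(I, J): 1.0 / (1 + abs(I - J))**4 for I in shells for J in shells}
pair_list = build_pair_list(shells, diag, tau=0.05)
print(pair_list)   # the well-separated pairs (0,2) and (2,0) are dropped
```

In the actual scheme the same construction is applied twice, once for the compact list-IJ (I restricted to the unit cell) and once for the larger list-KL over the supercell.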
Then, before the calculation of every ERI, we use the Schwarz inequality (equation (19)) again to estimate a rigorous upper bound, so that only the ERIs with non-negligible contributions are calculated; we denote this screening method as Schwarz screening. Using the exponential decay of the charge distributions, the Schwarz screening reduces the total number of ERIs to be computed from O(N⁴) to O(N²). The Schwarz screening tolerance is set to 10⁻⁵ in the following calculations. In addition, NAO screening is also adopted, as the NAO is strictly truncated (Shang et al., 2011). The NAO screening is safe in the calculation of the short-range ERIs because in this case the HFX Hamiltonian matrix is sparse due to the screened Coulomb potential (Izmaylov et al., 2006). As a result, we store this HFX Hamiltonian with a sparse matrix data structure.
In practice, it should also be noted that, as the angular part of the NAOs is a spherical harmonic while the GTOs are Cartesian Gaussians, we need a transformation between Cartesian and spherical harmonic functions. The difference between these two types of functions is the number of atomic orbitals included in shells whose angular momenta are larger than 1. For example, a d-shell has five spherical orbitals but six Cartesian orbitals. A detailed transformation method can be found in the study by Schlegel and Frisch (1995).
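The shell sizes behind this example follow closed forms: a Cartesian shell of angular momentum l contains (l+1)(l+2)/2 functions, while a spherical-harmonic shell contains 2l+1. A short sketch:

```python
def n_cartesian(l):
    """Number of Cartesian Gaussians in a shell: monomials x^a y^b z^c
    with a + b + c = l, i.e. (l+1)(l+2)/2."""
    return (l + 1) * (l + 2) // 2

def n_spherical(l):
    """Number of (real) spherical harmonics for angular momentum l."""
    return 2 * l + 1

# Identical for s and p shells; the counts first differ at l = 2 (d: 6 vs 5)
for l in range(4):
    print(l, n_cartesian(l), n_spherical(l))
```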
3.1.1. Parallel schemes. The ERIs have four indexes that can be parallelized. One possible parallel scheme is to distribute just one shell index; however, as the number of indexes in one shell is too small to distribute over CPU cores, such a shell-one distribution may cause serious load imbalance.
The other parallel scheme is to distribute the shell pairs (list-KL). This is a straightforward way to parallelize the ERIs, as in practice we loop over two pair lists. However, although the shell pairs can be distributed evenly before the ERIs calculation, ERI screening is still needed during the ERIs calculation, which also causes load imbalance. The practical implementation of the described formalism closely follows the flowchart shown in Algorithm 2. After the building of list-KL is completed, we distribute it over CPU cores, with list-KL-local at every core. Then, in the following ERIs screening and calculation, each core loops only over its list-KL-local. The advantage of this scheme is that it naturally bypasses the communication process: every CPU core only traverses and computes its assigned list-KL-local. However, although list-KL is distributed evenly over processors, the ERI screening happens after the parallelization, so the number of shell ERIs that actually need to be calculated differs from processor to processor. This differing number of shell ERIs produces load imbalance.
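The parallel-pair idea can be sketched as follows (hypothetical helper names; the screening test is stubbed out, since only the loop and distribution structure matter here):

```python
def local_block(list_KL, n_cores, core):
    """Static block distribution of the KL shell-pair list:
    core `core` of `n_cores` owns one contiguous slice."""
    n = len(list_KL)
    lo = core * n // n_cores
    hi = (core + 1) * n // n_cores
    return list_KL[lo:hi]

def compute_local_eris(list_IJ, list_KL_local, survives):
    """Loop pattern of the parallel-pair scheme: each core screens and
    'computes' only quartets built from its local KL block."""
    done = []
    for IJ in list_IJ:
        for KL in list_KL_local:
            if survives(IJ, KL):       # Schwarz + NAO screening (stub)
                done.append((IJ, KL))  # stands in for the real ERI work
    return done

list_IJ = list(range(4))
list_KL = list(range(10))
# Every core gets a disjoint block; together the blocks cover the full list
blocks = [local_block(list_KL, 3, c) for c in range(3)]
print(blocks)   # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

No communication is needed because the blocks are fixed before the loop starts; the imbalance appears only once `survives` rejects different fractions of each block.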
In order to achieve load balance, distributing the individual shell quartets (IJ|KL) after the ERI screening process is a possible choice. Even though the computational time is non-uniform for different shell types, this distribution can still yield an even time over CPU cores because of its small distribution chunks. The practical implementation of the shell-quartet algorithm is shown in Algorithm 3. Every CPU core traverses the global pair lists (list-IJ and list-KL) and performs the ERI screening to determine which ERIs need to be computed. A global counter then counts the surviving ERIs, and this counter is distributed over the CPU cores to make sure that the number of calculated ERIs is evenly distributed. The disadvantage of this algorithm is that every processor needs to perform the whole ERI screening, while in the abovementioned parallel-pair algorithm only the ERI screening of the local list is needed. This globally replicated ERI screening decreases the parallel efficiency.
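The global counter of the parallel-quartet scheme can be sketched in the same style (again with the screening stubbed out); note that every core executes the full double loop and the full screening, which is exactly the replicated work discussed above:

```python
def quartets_for_core(list_IJ, list_KL, survives, n_cores, core):
    """Loop pattern of the parallel-quartet scheme: a global counter over
    surviving quartets, assigned round-robin, so the number of computed
    shell ERIs differs by at most one between cores."""
    mine = []
    i = 0
    for IJ in list_IJ:
        for KL in list_KL:
            if survives(IJ, KL):          # Schwarz + NAO screening (stub)
                if i % n_cores == core:
                    mine.append((IJ, KL))
                i += 1                     # every core advances the counter
    return mine

list_IJ, list_KL = list(range(4)), list(range(10))
survives = lambda IJ, KL: (IJ + KL) % 3 != 0   # made-up screening rule
per_core = [quartets_for_core(list_IJ, list_KL, survives, 3, c)
            for c in range(3)]
counts = [len(m) for m in per_core]
print(counts, max(counts) - min(counts) <= 1)
```

Because the counter is advanced identically on every core, no communication is required to agree on the assignment, at the price of each core repeating the O(pair² ) screening loop.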
4. Results and discussions
In order to demonstrate the performance of the abovementioned two static parallel schemes, we use silicon bulk containing 64 atoms in the unit cell as a test case, as shown in Figure 2. Norm-conserving pseudopotentials generated with the Troullier–Martins scheme (Troullier and Martins, 1991), in the fully separable form developed by Kleinman and Bylander (1982), are used to represent the interaction between the core ions and the valence electrons. The screened hybrid functional HSE06 (Krukau et al., 2006) was used in the following calculations. Both a single-zeta (SZ) basis set containing s and p shells and a double-zeta plus polarization (DZP) basis set containing s, p, and d shells are considered. All calculations were carried out on the Tianhe-2 supercomputer located in the National Supercomputer Center in Guangzhou, China; the configuration of the machine is shown in Table 1. The Intel Math Kernel Library (version 10.0.3.020) is used in the calculations.
For the parallel-pair and parallel-quartet algorithms, which are fully parallelized and involve no communication, load imbalance is one of the factors that may affect the parallel efficiency. To examine the load balance, the timing
Algorithm 3. Flowchart of the parallel-quartet algorithm for ERIs. Here N refers to the total number of CPU cores and current-core refers to the index of the current processor. The description of Schwarz screening and NAO screening is in the text.

for list-IJ in shell-pair-list-IJ do
  for list-KL in shell-pair-list-KL do
    if Schwarz screening .and. NAO screening then
      i++
      if i mod N eq current-core then
        compute shell ERI (IJ|KL)
      end if
    end if
  end for
end for
Figure 2. The silicon bulk containing 64 atoms in the unit cell used as the benchmark system in this work. In the table below, the number of shells as well as the number of NAO basis functions for the different basis sets are listed.

       Basis set  Shells  NAOs
Si 64  SZ         128     256
Si 64  DZP        320     832
Algorithm 2. Flowchart of the parallel-pair algorithm for ERIs. Here shell-pair-list-KL-local means the portion of shell-list-KL distributed to each CPU core at the beginning. The description of Schwarz screening and NAO screening is in the text.

get shell-pair-list-KL-local
for list-IJ in shell-pair-list-IJ do
  for list-KL in shell-pair-list-KL-local do
    if Schwarz screening .and. NAO screening then
      compute shell ERI (IJ|KL)
    end if
  end for
end for
Table 1. The parameters for each node of Tianhe-2.
Component Value
CPU Intel(R) Xeon(R) CPU E5-2692
Frequency (GHz) 2.2
Cores 12
Algorithm 1. The algorithm of the NAO2GTO fitting scheme.

for iter = 1 to N do
  constrained genetic algorithm to get initial α_m^iter and D_m^iter
  constrained local minimal search to update α_m^iter and D_m^iter
  err = Σ_i [ Σ_m D_m^iter exp(−α_m^iter r_i²) − φ_{Iln}(r_i)/r_i^l ]²
  if iter = 1 .or. err < best_err then
    best_err = err
    α_m = α_m^iter and D_m = D_m^iter
  end if
end for
at every core is shown in Figures 3 to 6. It is clearly shown that the parallel-pair algorithm (red line) is load imbalanced; for the SZ basis set, the time difference between cores is around 10% on 12 cores (Figure 3) and around 80% on 192 cores (Figure 4). For the DZP basis set, d shells have been considered, which causes more serious load imbalance; the time difference between cores is even around 22% on 12 cores (Figure 5) and around 100% on
[Plot: CPU Time (s) versus CPU core index; curves: 1. parallel-pair, 2. parallel-quartet; SZ basis set, 12 cores]
Figure 3. The load balance for bulk silicon supercell with
64 atoms using SZ basis set at 12 CPU cores. SZ: single-zeta.
[Plot: CPU Time (s) versus CPU core index; curves: 1. parallel-pair, 2. parallel-quartet; SZ basis set, 192 cores]
Figure 4. The load balance for bulk silicon supercell with
64 atoms using SZ basis set at 192 CPU cores. SZ: single-zeta.
[Plot: CPU Time (s) versus CPU core index; curves: 1. parallel-pair, 2. parallel-quartet; DZP basis set, 12 cores]
Figure 5. The load balance for bulk silicon supercell with
64 atoms using DZP basis set at 12 CPU cores. DZP: double-zeta
plus polarization.
[Plot: CPU Time (s) versus CPU core index; curves: 1. parallel-pair, 2. parallel-quartet; DZP basis set, 192 cores]
Figure 6. The load balance for bulk silicon supercell with 64
atoms using DZP basis set at 192 CPU cores. DZP: double-zeta
plus polarization.
Table 2. The CPU time (in seconds) for the calculation of the ERIs of the Si 64 system with the SZ basis set using different parallel schemes.
Cores Parallel-pair Parallel-quartet
12 110.9 108.2
24 55.6 57.0
96 15.5 18.2
192 8.6 11.6
ERI: electron repulsion integral; SZ: single-zeta.
[Plot: speedup (left) and parallel efficiency (right) versus number of CPU cores; curves: 1. parallel-pair, 2. parallel-quartet, ideal; SZ basis set]
Figure 7. (Color online) Parallel speedups and efficiency for the ERIs calculation using different parallel schemes. Speedups were obtained on Tianhe-2 for the bulk silicon supercell with 64 atoms using the SZ basis set. The speedup is referenced to a run on 12 CPUs. ERI: electron repulsion integral; SZ: single-zeta.
192 cores (Figure 6). On the other hand, the load balance of the parallel-quartet algorithm is quite good; the time difference between cores is within 1%. However, in this algorithm, the ERIs screening is performed for the whole set of ERIs by every CPU core, which takes a constant time even as the number of CPU cores increases; these replicated ERI screening calculations decrease the parallel efficiency. As shown in Figure 4, for the small basis set at a large number of CPU cores, the average CPU time of the parallel-quartet algorithm is around twice that of the parallel-pair algorithm.
Such global ERI screening in the parallel-quartet algorithm also contributes significantly to lowering the parallel speedup and efficiency for the SZ basis set, as shown in Table 2 and Figure 7. The parallel efficiency is only 58% at 192 CPU cores for the parallel-quartet algorithm, while it holds at around 80% for the parallel-pair algorithm.
For the DZP basis set, the load imbalance caused by d shells becomes another factor lowering the parallel efficiency, so in this case, as shown in Table 3 and Figure 8, the parallel speedup and efficiency are similar for both the parallel-pair and parallel-quartet algorithms, at around 80% at 192 CPU cores.
5. Conclusions
In summary, we have presented two static parallel algorithms for the ERIs calculations in the NAO2GTO method, and we have analyzed the load balance and parallel efficiency of both. On the basis of our results, the static distribution of ERI shell pairs produces load imbalance that causes the efficiency to decrease, limiting the number of CPU cores that can be utilized. On the other hand, the static distribution of ERI shell quartets yields very good load balance; however, because of the need for the global ERI screening calculation, the parallel efficiency is dramatically reduced for small basis sets. We have also tried another static method that first creates a need-to-calculate ERIs list by considering all the screening methods as well as the eightfold permutational symmetry, and then distributes the ERIs in the need-to-calculate list over the processes. However, we find that the time to build the need-to-calculate ERIs list is even larger than that of the global ERI screening calculation. As a next step, we need to distribute the ERI screening calculation while keeping the load balance of the ERI calculation; a dynamic distribution could enable load balance with little loss of efficiency.
Acknowledgement
The authors thank the Tianhe-2 Supercomputer Center for
computational resources.
Author Contributions
XQ and HS contributed equally to this work.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest
with respect to the research, authorship, and/or publication
of this article.
Funding
The author(s) disclosed receipt of the following financial
support for the research, authorship, and/or publication of
this article: This work was supported by the National Key
Research and Development Program of China (grant no.
2017YFB0202302), the Special Fund for Strategic Pilot
Technology of Chinese Academy of Sciences (grant no.
XDC01040000), the National Natural Science Foundation
of China (grant nos 61502450, 21803066), Research Start-
Up Grants (grant no. KY2340000094) from University of
Science and Technology of China, and Chinese Academy
of Sciences Pioneer Hundred Talents Program.
References
Alexeev Y, Kendall RA and Gordon MS (2002) The distributed
data SCF. Computer Physics Communications 143(1): 69–82.
Becke AD (1993) Density-functional thermochemistry. III. The
role of exact exchange. The Journal of Chemical Physics
98(7): 5648–5652.
Blum V, Gehrke R, Hanke F, et al. (2009) Ab initio molecular
simulations with numeric atom-centered orbitals. Computer
Physics Communications 180(11): 2175–2196.
[Plot: speedup (left) and parallel efficiency (right) versus number of CPU cores; curves: 1. parallel-pair, 2. parallel-quartet, ideal; DZP basis set]
Figure 8. (Color online) Parallel speedups and efficiency for ERIs
calculation using different parallel schemes. Speedups were
obtained on Tianhe-2 for bulk silicon supercell with 64 atoms
using DZP basis set. The speedup is referenced to a run on
12 CPUs. ERI: electron repulsion integral; DZP: double-zeta plus
polarization.
Table 3. The CPU time (in seconds) for the calculation of the ERIs of the Si 64 system with the DZP basis set using different parallel schemes.
Cores Parallel-pair Parallel-quartet
12 1645.1 1572.9
24 904.0 806.1
96 251.3 225.9
192 129.6 128.0
ERI: electron repulsion integral; DZP: double-zeta plus polarization.
Burant JC, Scuseria GE and Frisch MJ (1996) A linear scaling
method for Hartree–Fock exchange calculations of large mole-
cules. The Journal of Chemical Physics 105(19): 8969–8972.
Bush IJ, Tomic S, Searle BG, et al. (2011) Parallel implementa-
tion of the ab initio CRYSTAL program: electronic structure
calculations for periodic systems. Proceedings of the Royal
Society A: Mathematical, Physical and Engineering Sciences
467(2131): 2112–2126.
Chow E, Liu X, Smelyanskiy M, et al. (2015) Parallel scalability
of Hartree–Fock calculations. The Journal of Chemical Phy-
sics 142(10): 104103.
Coleman TF and Li Y (1994) On the convergence of interior-
reflective newton methods for nonlinear minimization subject
to bounds. Mathematical Programming 67(1-3): 189–224.
Coleman TF and Li Y (1996) An interior trust region approach for
nonlinear minimization subject to bounds. SIAM Journal on
Optimization 6(2): 418–445.
Conn AR, Gould NIM and Toint P (1991) A globally convergent
augmented Lagrangian algorithm for optimization with gen-
eral constraints and simple bounds. SIAM Journal on Numer-
ical Analysis 28(2): 545–572.
Delhalle J and Calais JL (1987) Direct-space analysis of the
Hartree-Fock energy bands and density of states for metallic
extended systems. Physical Review B 35(18): 9460–9466.
Delley B (1990) An all-electron numerical method for solving the
local density functional for polyatomic molecules. The Jour-
nal of Chemical Physics 92(1): 508–517.
Dovesi R, Saunders VR, Roetti C, et al. (2006) Crystal06 Users
Manual. Torino: University of Torino.
Enkovaara J, Rostgaard C, Mortensen JJ, et al. (2010) Electronic
structure calculations with GPAW: a real-space implementa-
tion of the projector augmented-wave method. Journal of Phy-
sics: Condensed Matter 22(25): 253202.
Frisch MJ, Trucks GW, Schlegel HB, et al. (2009) Gaussian 09,
Revision B.01. Wallingford, CT: Gaussian.
Gell-Mann M and Brueckner KA (1957) Correlation energy of an
electron gas at high density. Physical Review 106(2): 364–368.
Goldberg DE (1989) Genetic Algorithms in Search, Optimization and Machine Learning, 1st ed. Boston: Addison-Wesley Longman.
Gonze X, Beuken JM, Caracas R, et al. (2002) First-principles
computation of material properties: the ABINIT software proj-
ect. Computational Materials Science 25(3): 478–492.
Gonze X, Jollet F, Abreu Araujo F, et al. (2016) Recent develop-
ments in the ABINIT software package. Computer Physics
Communications 205: 106–131.
Guidon M, Hutter J and VandeVondele J (2010) Auxiliary density
matrix methods for Hartree-Fock exchange calculations. Jour-
nal of Chemical Theory and Computation 6(8): 2348–2364.
Guidon M, Schiffmann F, Hutter J, et al. (2008) Ab initio mole-
cular dynamics using hybrid density functionals. The Journal
of Chemical Physics 128(21): 214104.
Havu V, Blum V, Havu P, et al. (2009) Efficient integration for
all-electron electronic structure calculation using numeric
basis functions. Journal of Computational Physics 228(22):
8367–8379.
Heyd J, Scuseria GE and Ernzerhof M (2003) Hybrid functionals
based on a screened Coulomb potential. The Journal of Chem-
ical Physics 118(18): 8207–8215.
Heyd J, Scuseria GE and Ernzerhof M (2006) Erratum: “Hybrid
functionals based on a screened Coulomb potential” [The Jour-
nal of Chemical Physics 118: 8207 (2003)]. The Journal of
Chemical Physics 124(21): 219906.
Hohenberg P and Kohn W (1964) Inhomogeneous electron gas.
Physical Review 136(3B): B864–B871.
Häser M and Ahlrichs R (1989) Improvements on the direct SCF
method. Journal of Computational Chemistry 10(1): 104–111.
Izmaylov AF, Scuseria GE and Frisch MJ (2006) Efficient eva-
luation of short-range Hartree-Fock exchange in large mole-
cules and periodic systems. The Journal of Chemical Physics
125(10): 104103.
Janesko BG, Henderson TM and Scuseria GE (2009) Screened
hybrid density functionals for solid-state chemistry and phy-
sics. Physical Chemistry Chemical Physics 11(3): 443–454.
Kleinman L and Bylander DM (1982) Efficacious form for model
pseudopotentials. Physical Review Letters 48(20): 1425–1428.
Kohn W and Sham LJ (1965) Self-consistent equations including
exchange and correlation effects. Physical Review 140(4A):
A1133–A1138.
Krukau AV, Vydrov OA, Izmaylov AF, et al. (2006) Influence of
the exchange screening parameter on the performance of
screened hybrid functionals. The Journal of Chemical Physics
125(22): 224106.
Liu X, Patel A and Chow E (2014) A new scalable parallel algo-
rithm for Fock matrix construction. In: 28th IEEE Interna-
tional Parallel & Distributed Processing Symposium
(IPDPS), Phoenix, AZ, 19–23 May 2014.
Merlot P, Izsák R, Borgoo A, et al. (2014) Charge-constrained
auxiliary-density-matrix methods for the Hartree-Fock
exchange contribution. The Journal of Chemical Physics
141(9): 094104.
Mohr S, Ratcliff LE, Boulanger P, et al. (2014) Daubechies wave-
lets for linear scaling density functional theory. The Journal of
Chemical Physics 140(20): 204110.
Monkhorst HJ (1979) Hartree-Fock density of states for extended
systems. Physical Review B 20(4): 1504–1513.
Mori-Sánchez P, Cohen AJ and Yang W (2008) Localization and
delocalization errors in density functional theory and implica-
tions for band-gap prediction. Physical Review Letters
100(14): 146401.
Ochsenfeld C, White CA and Head-Gordon M (1998) Linear and
sublinear scaling formation of Hartree–Fock-type exchange
matrices. The Journal of Chemical Physics 109(5):
1663–1669.
Ozaki T (2003) Variationally optimized atomic orbitals for large-
scale electronic structures. Physical Review B 67(15): 155108.
Paier J, Diaconu CV, Scuseria GE, et al. (2009) Accurate Hartree-
Fock energy of extended systems using large Gaussian basis
sets. Physical Review B 80(17): 174114.
Paier J, Marsman M, Hummer K, et al. (2006) Screened hybrid
density functionals applied to solids. The Journal of Chemical
Physics 124(15): 154709.
The International Journal of High Performance Computing Applications XX(X)
Parr RG and Yang W (1989) Density Functional Theory of Atoms
and Molecules. New York: Oxford University Press.
Polly R, Werner HJ, Manby FR, et al. (2004) Fast Hartree–Fock
theory using local density fitting approximations. Molecular
Physics 102(21-22): 2311–2321.
Qin X, Shang H, Xiang H, et al. (2014) HONPAS: a linear
scaling open-source solution for large system simulations.
International Journal of Quantum Chemistry 115(10):
647–655.
Ren X, Rinke P, Blum V, et al. (2012) Resolution-of-identity
approach to Hartree–Fock, hybrid density functionals, RPA,
MP2 and GW with numeric atom-centered orbital basis func-
tions. New Journal of Physics 14(5): 053020.
Schlegel HB and Frisch MJ (1995) Transformation between Car-
tesian and pure spherical harmonic Gaussians. International
Journal of Quantum Chemistry 54(2): 83–87.
Schmidt MW, Baldridge KK, Boatz JA, et al. (1993) General
atomic and molecular electronic structure system. Journal of
Computational Chemistry 14(11): 1347–1363.
Schwegler E and Challacombe M (1996) Linear scaling computa-
tion of the Hartree–Fock exchange matrix. The Journal of
Chemical Physics 105(7): 2726–2734.
Schwegler E, Challacombe M and Head-Gordon M (1997) Linear
scaling computation of the Fock matrix. II. Rigorous bounds
on exchange integrals and incremental Fock build. The Jour-
nal of Chemical Physics 106(23): 9708–9717.
Shang H, Li Z and Yang J (2011) Implementation of screened
hybrid density functional for periodic systems with numerical
atomic orbitals: basis function fitting and integral screening.
The Journal of Chemical Physics 135(3): 034110.
Sodt A and Head-Gordon M (2008) Hartree-Fock exchange
computed using the atomic resolution of the identity
approximation. The Journal of Chemical Physics 128(10):
104106.
Soler JM, Artacho E, Gale JD, et al. (2002) The SIESTA method
for ab initio order-N materials simulation. Journal of Physics:
Condensed Matter 14(11): 2745–2779.
Stephens PJ, Devlin FJ, Chabalowski CF, et al. (1994) Ab initio
calculation of vibrational absorption and circular dichroism
spectra using density functional force fields. The Journal of
Physical Chemistry 98(45): 11623–11627.
Troullier N and Martins JL (1991) Efficient pseudopotentials for
plane-wave calculations. Physical Review B 43(3):
1993–2006.
Tymczak CJ and Challacombe M (2005) Linear scaling computa-
tion of the Fock matrix. VII. Periodic density functional theory
at the Γ point. The Journal of Chemical Physics 122(13):
134102.
Valiev M, Bylaska EJ, Govind N, et al. (2010) NWChem: a com-
prehensive and scalable open-source solution for large scale
molecular simulations. Computer Physics Communications
181(9): 1477–1489.
VandeVondele J, Krack M, Mohamed F, et al. (2005) Quickstep:
fast and accurate density functional calculations using a mixed
Gaussian and plane waves approach. Computer Physics Com-
munications 167(2): 103–128.
Author biographies
Xinming Qin is a PhD student in chemical physics at the
University of Science and Technology of China (USTC) under
the supervision of Prof. Jinlong Yang. He received his BS
degree in chemistry in July 2009 from the same university.
His main research interests are developing and applying new
algorithms for large-scale electronic structure calculations.
He is the main developer of HONPAS.
Honghui Shang is an associate professor in computer sci-
ence at the Institute of Computing Technology, Chinese
Academy of Sciences, China. She received her BS degree
in physics from the University of Science and Technology
of China in 2006 and her PhD degree in physical chemistry
from the same university in 2011. Between 2012 and 2018,
she worked as a postdoctoral research assistant at the Fritz
Haber Institute of the Max Planck Society, Berlin, Germany,
where she was responsible for the Hybrid Inorganic/Organic
Systems for Opto-Electronics (HIOS) project and the Novel
Materials Discovery (NOMAD) project. Her main research
interests are developing physical algorithms and numerical
methods for first-principles calculations, as well as accel-
erating these applications on high-performance computers.
Currently, she is a main developer of HONPAS (leader of the
hybrid density-functional part) and FHI-aims (leader of the
density-functional perturbation theory part).
Lei Xu is currently working toward the BS degree in the
Department of Physics at Sichuan University. His current
research interest is parallel programming in the high-
performance computing domain.
Wei Hu is currently a professor in the division of theoretical
and computational sciences at the Hefei National Laboratory
for Physical Sciences at the Microscale (HFNL) at USTC. He
received his BS degree in chemistry from USTC in 2007
and his PhD degree in physical chemistry from the same
university in 2013. From 2014 to 2018, he worked as a
postdoctoral fellow in the Scalable Solvers Group of the
Computational Research Division at Lawrence Berkeley
National Laboratory (LBNL), Berkeley, United States.
During his postdoctoral research, he developed a new mas-
sively parallel methodology, DGDFT (Discontinuous
Galerkin Method for Density Functional Theory), for
large-scale DFT calculations. His main research interests
focus on the development, parallelization, and application
of advanced and highly efficient DFT methods and codes
for accurate first-principles modeling and simulation of
nanomaterials.
Jinlong Yang is a full professor of chemistry and executive
dean of the School of Chemistry and Material Sciences at
USTC. He obtained his PhD degree in 1991 from USTC.
He was awarded the Outstanding Youth Foundation of China
in 2000, selected as a Changjiang Scholars Program Chair
Professor in 2001, and elected a fellow of the American
Physical Society (APS) in 2011. He is the second awardee
of the 2005 National Award for Natural Science (second
prize) and the principal contributor to the 2014 Outstanding
Science and Technology Achievement Prize of the Chinese
Academy of Sciences (CAS). His research mainly focuses
on the development of first-principles methods and their
application to clusters, nanostructures, solid materials,
surfaces, and interfaces. He is the initiator and leader of
HONPAS.
Shigang Li received his BS degree in computer science and
technology and his PhD degree in computer architecture
from the University of Science and Technology Beijing,
China, in 2009 and 2014, respectively. He was funded by
the China Scholarship Council (CSC) for a two-year PhD
study at the University of Illinois at Urbana-Champaign.
He was an assistant professor (June 2014 to August 2018)
in the State Key Lab of Computer Architecture, Institute of
Computing Technology, Chinese Academy of Sciences,
where the work presented here was carried out. Since
August 2018, he has been a postdoctoral researcher in the
Department of Computer Science, ETH Zurich. His
research interests focus on performance optimization for
parallel and distributed computing systems, including
parallel algorithms, parallel programming models, perfor-
mance models, and intelligent methods for performance
optimization.
Yunquan Zhang received his BS degree in computer sci-
ence and engineering from the Beijing Institute of Tech-
nology in 1995 and his PhD degree in computer software
and theory from the Chinese Academy of Sciences in
2000. He is a full professor and PhD advisor at the State
Key Lab of Computer Architecture, ICT, CAS. He is also
the director of the National Supercomputing Center at
Jinan and the general secretary of China's High Perfor-
mance Computing Expert Committee. He organizes and
publishes China's TOP100 List of High Performance
Computers, which tracks and reports the development of
HPC system technology and usage in China. His research
interests include high-performance parallel computing,
with a focus on parallel programming models, high-
performance numerical algorithms, and performance
modeling and evaluation for parallel programs.