
Special Issue Paper

The static parallel distribution algorithms for hybrid density-functional calculations in HONPAS package

Xinming Qin¹, Honghui Shang², Lei Xu², Wei Hu¹, Jinlong Yang¹, Shigang Li² and Yunquan Zhang²

¹ Hefei National Laboratory for Physical Sciences at Microscale, Department of Chemical Physics, and Synergetic Innovation Center of Quantum Information and Quantum Physics, University of Science and Technology of China, Hefei, Anhui, China
² State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Corresponding author: Honghui Shang, State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. Email: shanghui.ustc@gmail.com

The International Journal of High Performance Computing Applications 1–10. © The Author(s) 2019. Article reuse guidelines: sagepub.com/journals-permissions. DOI: 10.1177/1094342019845046. journals.sagepub.com/home/hpc

Abstract

Hybrid density-functional calculations are among the most commonly adopted electronic structure theories in computational chemistry and materials science because of their balance between accuracy and computational cost. Recently, we have developed a novel scheme called NAO2GTO to achieve linear scaling (Order-N) calculations for hybrid density functionals. In our scheme, the most time-consuming step is the calculation of the electron repulsion integrals (ERIs), so an even distribution of these ERIs in the parallel implementation is an issue of particular importance. Here, we present two static scalable distribution algorithms for the ERIs computation: in the first, the ERIs are distributed over shell pairs; in the second, over shell quartets. In both algorithms, the calculations of the ERIs are independent of one another, so the communication time is minimized. We report speedup results that demonstrate the performance of these static parallel distribution algorithms in the Hefei Order-N packages for ab initio simulations (HONPAS).

Keywords

Distributed algorithms, hybrid density-functional calculations, HONPAS package, electron repulsion integrals, parallel implementation

1. Introduction

The electronic structure calculations based on density functional theory (DFT) (Hohenberg and Kohn, 1964; Kohn and Sham, 1965; Parr and Yang, 1989) are the workhorse of computational chemistry and materials science. However, the widely used semi-local density functionals can underestimate band gaps because they include an unphysical self-interaction (Mori-Sánchez et al., 2008). A possible solution is to add nonlocal Hartree–Fock-type exchange (HFX) to semi-local density functionals to construct hybrid functionals (Becke, 1993; Delhalle and Calais, 1987; Frisch et al., 2009; Gell-Mann and Brueckner, 1957; Heyd et al., 2003; Heyd et al., 2006; Janesko et al., 2009; Krukau et al., 2006; Monkhorst, 1979; Paier et al., 2006; Paier et al., 2009; Stephens et al., 1994). However, the drawback of hybrid density functionals is that they are significantly more expensive than conventional DFT. The most time-consuming part of hybrid density-functional calculations is the construction of the HFX matrix, even with the appearance of fast linear scaling algorithms that overcome the bottlenecks encountered in conventional methods (Burant et al., 1996; Guidon et al., 2010; Merlot et al., 2014; Ochsenfeld et al., 1998; Polly et al., 2004; Schwegler and Challacombe, 1996; Schwegler et al., 1997; Sodt and Head-Gordon, 2008; Tymczak and Challacombe, 2005). As a result, hybrid density-functional calculations must make efficient use of parallel computing resources in order to reduce the execution time of the HFX matrix construction.

The implementation of hybrid density functionals for solid-state physics calculations is mostly based on plane waves (PW) (Gonze et al., 2002, 2016; Paier et al., 2006) or on the linear combination of atomic orbitals (LCAO) method (Dovesi et al., 2006; Frisch et al., 2009; Krukau et al., 2006). Atomic orbital basis sets are efficient for real-space formalisms, which have attracted considerable interest for DFT calculations because of their favorable scaling with respect to the number of atoms and their potential for massively parallel implementations for large-scale calculations (Blum et al., 2009; Delley, 1990; Enkovaara et al., 2010; Frisch et al., 2009; Havu et al., 2009; Mohr et al., 2014; Ren et al., 2012; Shang et al., 2011; Soler et al., 2002). Unlike in the plane wave method, when constructing the HFX matrix within the LCAO method we must first calculate the electron repulsion integrals (ERIs) over the atomic orbitals. Two types of atomic orbitals are currently in common use. The first is the Gaussian-type orbital (GTO), as adopted in Gaussian (Frisch et al., 2009) and CRYSTAL (Dovesi et al., 2006); its advantage is that the ERIs can be calculated analytically. The second is the numerical atomic orbital (NAO), which is adopted in SIESTA (Soler et al., 2002), DMOL (Delley, 1990), and OPENMX (Ozaki, 2003). The advantage of the NAO is its strict locality, which naturally leads to lower-order scaling of the computational time versus system size. In order to take advantage of both types of atomic orbitals, we have proposed a new scheme called NAO2GTO (Shang et al., 2011), in which GTOs are used for the analytical computation of the ERIs in a straightforward and efficient way, while NAOs are employed to set a strict cutoff for the atomic orbitals. After employing several ERI screening techniques, the construction of the HFX matrix can be very efficient and scale linearly (Qin et al., 2014; Shang et al., 2011).

Parallelization of the HFX matrix construction faces two major problems: load imbalance and high communication cost. The load imbalance arises from the irregularity of the independent tasks available in the computation, which is due to the screening procedure and to the different types of shell quartets distributed among the processes. The high communication cost comes from the interprocessor communication of the density and/or HFX matrices, which is associated with the data access pattern. NWChem (Valiev et al., 2010) and CP2K/Quickstep (VandeVondele et al., 2005) are among the most capable software packages in the field of high performance parallel quantum chemical computing, and both of them use GTOs to construct the HFX matrix. In NWChem, the parallelization of the HFX matrix construction is based on a static partitioning of work followed by a work stealing phase (Chow et al., 2015; Liu et al., 2014). The tasks are statically partitioned over the set of shell (or atom) quartets, and the work stealing phase then acts to polish the load balance. As a result, this parallel implementation gives very good parallel scalability of Hartree–Fock calculations (Liu et al., 2014). In CP2K/Quickstep, the HFX parallelization strategy is to replicate the global density and HFX matrices on each MPI process in order to reduce communication. A load balance optimization based on simulated annealing, together with a binning procedure to coarse-grain the load balancing problem, has been developed (Guidon et al., 2008). However, this approach may limit both the system size and, ultimately, the parallel scalability.

As the ERIs calculation is the most computationally demanding step in the NAO2GTO scheme, the development of new parallel algorithms for it is of particular importance. Previously, for codes using localized atomic orbitals, the parallelization of the ERIs was mainly implemented to treat finite, isolated systems (Alexeev et al., 2002; Chow et al., 2015; Liu et al., 2014; Schmidt et al., 1993); only a few literature reports exist on the treatment of periodic boundary conditions with such basis sets (Bush et al., 2011; Guidon et al., 2008), and in these the Order-N screening for the ERIs calculations has not been considered. The purpose of this work is to present static parallel distribution algorithms for the NAO2GTO scheme (Shang et al., 2011) with Order-N performance in the Hefei Order-N packages for ab initio simulations (HONPAS) code (Qin et al., 2014). In our approaches, the calculations of the ERIs are not replicated but are distributed over the CPU cores; as a result, both the memory and the CPU requirements of the ERIs calculation are parallelized. The efficiency and scalability of these algorithms are demonstrated by benchmark timings for a periodic solid with 64 silicon atoms in the unit cell.

The outline of this article is as follows: In Section 2, we begin with a description of the theory of hybrid functionals. In Section 3, we describe the detailed implementation of our parallel distribution. In Section 4, we present the benchmark results and the efficiency of our scheme. In Section 5, we summarize the whole paper and indicate future research directions.

2. Fundamental theoretical framework

Before addressing the parallel algorithms, we recall the basic equations used in this work. A spin-unpolarized notation is used throughout the text for the sake of simplicity, but a formal generalization to the collinear spin case is straightforward. In Kohn–Sham DFT, the total-energy functional is given as

$$E_{\mathrm{KS}} = T_s[n] + E_{\mathrm{ext}}[n] + E_{\mathrm{H}}[n] + E_{\mathrm{xc}}[n] + E_{\mathrm{nuc\text{-}nuc}} \quad (1)$$

Here, $n(\mathbf{r})$ is the electron density and $T_s$ is the kinetic energy of noninteracting electrons, while $E_{\mathrm{ext}}$ is the external energy stemming from the electron-nuclear attraction, $E_{\mathrm{H}}$ is the Hartree energy, $E_{\mathrm{xc}}$ is the exchange-correlation energy, and $E_{\mathrm{nuc\text{-}nuc}}$ is the nucleus–nucleus repulsion energy.

The ground state electron density $n_0(\mathbf{r})$ (and the associated ground state total energy) is obtained by variationally minimizing equation (1) under the constraint that the number of electrons $N_e$ is conserved. This yields the chemical potential $\mu = \delta E_{\mathrm{KS}}/\delta n$ of the electrons and the Kohn–Sham single particle equations

$$\hat{h}^{\mathrm{KS}} \psi_i = \left[ \hat{t}_s + v_{\mathrm{ext}}(\mathbf{r}) + v_{\mathrm{H}} + v_{\mathrm{xc}} \right] \psi_i = \epsilon_i \psi_i \quad (2)$$

for the Kohn–Sham Hamiltonian $\hat{h}^{\mathrm{KS}}$. In equation (2), $\hat{t}_s$ denotes the kinetic energy operator, $v_{\mathrm{ext}}$ the external potential, $v_{\mathrm{H}}$ the Hartree potential, and $v_{\mathrm{xc}}$ the exchange-correlation potential. Solving equation (2) yields the Kohn–Sham single particle states $\psi_i$ and their eigenenergies $\epsilon_i$. The single particle states determine the electron density via

$$n(\mathbf{r}) = \sum_i f_i \, |\psi_i|^2 \quad (3)$$

in which $f_i$ denotes the Fermi–Dirac distribution function.

To solve equation (2) in numerical implementations, the Kohn–Sham states are expanded in a finite basis set. For periodic systems, the crystalline orbital $\psi_i(\mathbf{k},\mathbf{r})$ normalized in all space is a linear combination of Bloch functions $\phi_\mu(\mathbf{k},\mathbf{r})$, defined in terms of atomic orbitals $\varphi^{\mathbf{R}}_\mu(\mathbf{r})$

$$\psi_i(\mathbf{k},\mathbf{r}) = \sum_\mu c_{\mu i}(\mathbf{k}) \, \phi_\mu(\mathbf{k},\mathbf{r}) \quad (4)$$

$$\phi_\mu(\mathbf{k},\mathbf{r}) = \frac{1}{\sqrt{N}} \sum_{\mathbf{R}} \varphi^{\mathbf{R}}_\mu(\mathbf{r}) \, e^{i\mathbf{k}\cdot(\mathbf{R}+\mathbf{r}_\mu)} \quad (5)$$

where the Greek letter $\mu$ is the index of atomic orbitals, $i$ is the suffix for different bands, $\mathbf{R}$ is the origin of a unit cell, and $N$ is the number of unit cells in the system. $\varphi^{\mathbf{R}}_\mu(\mathbf{r}) = \varphi_\mu(\mathbf{r}-\mathbf{R}-\mathbf{r}_\mu)$ is the $\mu$th atomic orbital, whose center is displaced from the origin of the unit cell at $\mathbf{R}$ by $\mathbf{r}_\mu$. $c_{\mu i}(\mathbf{k})$ is the wave function coefficient, which is obtained by solving the following equation

$$H(\mathbf{k}) \, c(\mathbf{k}) = E(\mathbf{k}) \, S(\mathbf{k}) \, c(\mathbf{k}) \quad (6)$$

$$[H(\mathbf{k})]_{\mu\nu} = \sum_{\mathbf{R}} H^{\mathbf{R}}_{\mu\nu} \, e^{i\mathbf{k}\cdot(\mathbf{R}+\mathbf{r}_\nu-\mathbf{r}_\mu)} \quad (7)$$

$$H^{\mathbf{R}}_{\mu\nu} = \langle \varphi^{\mathbf{0}}_\mu | \hat{H} | \varphi^{\mathbf{R}}_\nu \rangle \quad (8)$$

$$[S(\mathbf{k})]_{\mu\nu} = \sum_{\mathbf{R}} S^{\mathbf{R}}_{\mu\nu} \, e^{i\mathbf{k}\cdot(\mathbf{R}+\mathbf{r}_\nu-\mathbf{r}_\mu)} \quad (9)$$

$$S^{\mathbf{R}}_{\mu\nu} = \langle \varphi^{\mathbf{0}}_\mu | \varphi^{\mathbf{R}}_\nu \rangle \quad (10)$$

In equation (8), $H^{\mathbf{R}}_{\mu\nu}$ is a matrix element of the one-electron Hamiltonian operator $\hat{H}$ between the atomic orbital $\varphi_\mu$ located in the central unit cell $\mathbf{0}$ and $\varphi_\nu$ located in the unit cell $\mathbf{R}$.

It should be noted that the exchange-correlation potential $v_{\mathrm{xc}}$ is local and periodic in semi-local DFT, while in Hartree–Fock and hybrid functionals the HFX potential matrix element is defined as

$$V^{X,\mathbf{G}}_{\mu\lambda} = -\frac{1}{2} \sum_{\nu\sigma} \sum_{\mathbf{N},\mathbf{H}} P^{\mathbf{H}-\mathbf{N}}_{\nu\sigma} \left( \varphi^{\mathbf{0}}_\mu \varphi^{\mathbf{N}}_\nu \,\middle|\, \varphi^{\mathbf{G}}_\lambda \varphi^{\mathbf{H}}_\sigma \right) \quad (11)$$

where $\mathbf{G}$, $\mathbf{N}$, and $\mathbf{H}$ represent different unit cells. The density matrix element $P^{\mathbf{N}}_{\nu\sigma}$ is computed by integration over the Brillouin zone (BZ)

$$P^{\mathbf{N}}_{\nu\sigma} = \sum_j \int_{\mathrm{BZ}} c^{*}_{\nu j}(\mathbf{k}) \, c_{\sigma j}(\mathbf{k}) \, \theta\!\left(\epsilon_F - \epsilon_j(\mathbf{k})\right) e^{i\mathbf{k}\cdot\mathbf{N}} \, d\mathbf{k} \quad (12)$$

where $\theta$ is the step function, $\epsilon_F$ is the Fermi energy, and $\epsilon_j(\mathbf{k})$ is the $j$th eigenvalue at point $\mathbf{k}$.

In order to calculate the ERI appearing in equation (11)

$$\left( \varphi^{\mathbf{0}}_\mu \varphi^{\mathbf{N}}_\nu \,\middle|\, \varphi^{\mathbf{G}}_\lambda \varphi^{\mathbf{H}}_\sigma \right) = \iint \frac{\varphi^{\mathbf{0}}_\mu(\mathbf{r}) \, \varphi^{\mathbf{N}}_\nu(\mathbf{r}) \, \varphi^{\mathbf{G}}_\lambda(\mathbf{r}') \, \varphi^{\mathbf{H}}_\sigma(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|} \, d\mathbf{r} \, d\mathbf{r}' \quad (13)$$

we use the NAO2GTO scheme described in the following section.

Following the flowchart in Figure 1, the NAO2GTO scheme first fits the NAO with GTOs and then calculates the ERIs analytically with the fitted GTOs. A NAO is a product of a numerical radial function and a spherical harmonic

$$\varphi_{Ilmn}(\mathbf{r}) = \varphi_{Iln}(r) \, Y_{lm}(\hat{\mathbf{r}}) \quad (14)$$

The radial part of the NAO, $\varphi_{Iln}(r)$, is calculated by the following equation

$$\left[ -\frac{1}{2} \frac{1}{r} \frac{d^2}{dr^2} r + \frac{l(l+1)}{2r^2} + V(r) + V_{\mathrm{cut}} \right] \varphi_{Iln}(r) = \epsilon_l \, \varphi_{Iln}(r) \quad (15)$$

where $V(r)$ denotes the electrostatic potential for orbital $\varphi_{Iln}(r)$ and $V_{\mathrm{cut}}$ ensures a smooth decay of each radial function, which is strictly zero outside a confining radius $r_{\mathrm{cut}}$.

3. Methods

3.1. NAO2GTO scheme

In our NAO2GTO approach, the radial part of the NAO, $\varphi_{Iln}(r)$, is fitted by a sum of several GTOs, denoted as $\chi(r)$

$$\chi(r) \approx \sum_m D_m \, r^l \exp\!\left(-\alpha_m r^2\right) \quad (16)$$

The parameters $\alpha_m$ and $D_m$ are determined by minimizing the residual sum of squares of the difference

$$\sum_i \left[ \chi(r_i)/r_i^l - \varphi_{Iln}(r_i)/r_i^l \right]^2 \quad (17)$$

Figure 1. The flowchart of the NAO2GTO scheme in HONPAS. HONPAS: Hefei Order-N packages for ab initio simulations; NAO: numerical atomic orbital; ERI: electron repulsion integral; GTO: Gaussian-type orbital.

In practical solid-state calculations, a basis set that is too diffuse may cause convergence problems (Guidon et al., 2008); as a result, exponents smaller than 0.10 are usually not needed, and we impose the constraint $\alpha > 0.1$ during the minimal search. The flowchart is shown in Algorithm 1. First, we use a constrained genetic algorithm (Conn et al., 1991; Goldberg, 1989) to perform a global search for the initial guess, and then we perform a constrained local minimal search using the trust-region-reflective algorithm, which is a subspace trust-region method based on the interior-reflective Newton method described in Coleman and Li (1994, 1996). Each iteration involves the approximate solution of a large linear system using the method of preconditioned conjugate gradients. We perform $N$ ($N > 500$) global searches to make sure that a global minimum is found.
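To make the fitting step concrete, the following is a minimal Python sketch of the minimization of equation (17) under the constraint $\alpha > 0.1$; the constrained local search uses SciPy's trust-region-reflective least-squares solver ('trf'), which is the same class of method named above, while the genetic-algorithm global search is replaced for brevity by a random multistart. All names (fit_nao2gto, n_gto, and so on) are illustrative assumptions, not HONPAS routines.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_nao2gto(r, phi, l, n_gto=4, n_starts=500, seed=0):
    """Fit phi(r) ~ sum_m D_m r^l exp(-alpha_m r^2) on a radial grid r.

    The grid r should exclude r = 0 when l > 0, since both sides of
    equation (17) are divided by r^l.
    """
    rng = np.random.default_rng(seed)
    target = phi / r**l                              # right-hand side of eq. (17)
    best = None
    for _ in range(n_starts):                        # multistart global search
        a0 = 10.0 ** rng.uniform(-0.9, 2.0, n_gto)   # initial exponents > 0.1
        d0 = rng.uniform(-1.0, 1.0, n_gto)           # initial coefficients

        def residual(p):
            a, d = p[:n_gto], p[n_gto:]
            return np.exp(-np.outer(r**2, a)) @ d - target

        lower = np.concatenate([np.full(n_gto, 0.1),        # alpha > 0.1
                                np.full(n_gto, -np.inf)])   # D_m unconstrained
        sol = least_squares(residual, np.concatenate([a0, d0]),
                            method='trf', bounds=(lower, np.inf))
        if best is None or sol.cost < best.cost:
            best = sol                               # keep the lowest residual
    return best.x[:n_gto], best.x[n_gto:]            # alpha_m, D_m
```

The 'trf' method in scipy.optimize.least_squares is itself a trust-region-reflective algorithm, so the local refinement step mirrors the one described in the text.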

Using the above fitting parameters, we build all ERIs needed for the HFX. In our implementation, the full permutation symmetry of the ERIs has been considered for solid systems

$$
\begin{aligned}
(\varphi^{\mathbf{0}}_{\mu}\varphi^{\mathbf{H}}_{\nu}|\varphi^{\mathbf{G}}_{\lambda}\varphi^{\mathbf{N}}_{\sigma})
&=(\varphi^{\mathbf{0}}_{\mu}\varphi^{\mathbf{H}}_{\nu}|\varphi^{\mathbf{N}}_{\sigma}\varphi^{\mathbf{G}}_{\lambda})\\
&=(\varphi^{\mathbf{0}}_{\nu}\varphi^{-\mathbf{H}}_{\mu}|\varphi^{\mathbf{G}-\mathbf{H}}_{\lambda}\varphi^{\mathbf{N}-\mathbf{H}}_{\sigma})
=(\varphi^{\mathbf{0}}_{\nu}\varphi^{-\mathbf{H}}_{\mu}|\varphi^{\mathbf{N}-\mathbf{H}}_{\sigma}\varphi^{\mathbf{G}-\mathbf{H}}_{\lambda})\\
&=(\varphi^{\mathbf{0}}_{\lambda}\varphi^{\mathbf{N}-\mathbf{G}}_{\sigma}|\varphi^{-\mathbf{G}}_{\mu}\varphi^{\mathbf{H}-\mathbf{G}}_{\nu})
=(\varphi^{\mathbf{0}}_{\lambda}\varphi^{\mathbf{N}-\mathbf{G}}_{\sigma}|\varphi^{\mathbf{H}-\mathbf{G}}_{\nu}\varphi^{-\mathbf{G}}_{\mu})\\
&=(\varphi^{\mathbf{0}}_{\sigma}\varphi^{\mathbf{G}-\mathbf{N}}_{\lambda}|\varphi^{-\mathbf{N}}_{\mu}\varphi^{\mathbf{H}-\mathbf{N}}_{\nu})
=(\varphi^{\mathbf{0}}_{\sigma}\varphi^{\mathbf{G}-\mathbf{N}}_{\lambda}|\varphi^{\mathbf{H}-\mathbf{N}}_{\nu}\varphi^{-\mathbf{N}}_{\mu})
\end{aligned}
\quad (18)
$$

In this way, we save a factor of 8 in the number of integrals that have to be calculated.
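The bookkeeping behind this factor of 8 is easiest to see in the molecular (single-cell) limit of equation (18), where the relations reduce to the familiar index swaps. The hypothetical helper below maps any quartet of orbital indices to a canonical representative so that each of the up-to-eight equivalent ERIs is computed exactly once; the periodic case additionally shifts the cell indices as in equation (18).

```python
def canonical_quartet(i, j, k, l):
    """Return the canonical representative of the up-to-8 equivalent (ij|kl)."""
    if i < j:
        i, j = j, i                       # (ij|kl) = (ji|kl)
    if k < l:
        k, l = l, k                       # (ij|kl) = (ij|lk)
    # compound pair indices decide the bra/ket swap (ij|kl) = (kl|ij)
    if i * (i + 1) // 2 + j < k * (k + 1) // 2 + l:
        i, j, k, l = k, l, i, j
    return i, j, k, l

# e.g. canonical_quartet(0, 1, 2, 1) == canonical_quartet(1, 2, 1, 0) == (2, 1, 1, 0)
```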

When calculating the ERIs with GTOs, the atomic orbitals are grouped into shells according to their angular momentum, and the list that needs to be distributed in parallel is in fact the list of shell quartets. For a shell with angular momentum $l$, the number of atomic orbital basis functions is $2l+1$, so in a shell quartet integral $(IJ|KL)$ we calculate in total $(2l_I+1)(2l_J+1)(2l_K+1)(2l_L+1)$ atomic-orbital ERIs together. As a result, the computational expense depends strongly on the angular momenta of the shell quartet, and the challenge is to distribute these shell ERIs evenly not only in number but also in time weight.
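The size of the distribution chunks follows directly from the shell dimensions quoted above; a two-line check makes the time-weight skew explicit.

```python
# Number of atomic-orbital ERIs bundled in one shell quartet, per the text.
def eris_per_quartet(l_i, l_j, l_k, l_l):
    return (2*l_i + 1) * (2*l_j + 1) * (2*l_k + 1) * (2*l_l + 1)

print(eris_per_quartet(0, 0, 0, 0))   # (ss|ss) quartet: 1 ERI
print(eris_per_quartet(2, 2, 2, 2))   # (dd|dd) quartet: 625 ERIs
```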

In our NAO2GTO scheme, two shell pair lists (list-IJ and list-KL) are first preselected according to Schwarz screening (Häser and Ahlrichs, 1989)

$$|(\mu\nu|\lambda\sigma)| \le \sqrt{(\mu\nu|\mu\nu)\,(\lambda\sigma|\lambda\sigma)} \quad (19)$$

and only the shell list indexes with $(IJ|IJ) > \tau$ or $(KL|KL) > \tau$ (here $\tau$ is the drop tolerance) are stored. As shown in equation (11), the first index $I$ runs only within the unit cell, while the indexes $(J, K, L)$ run over the whole supercell, so list-IJ is smaller than list-KL. In the ERIs calculations, the loops then run over these two shell lists.
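A minimal sketch of this preselection step, assuming a callable pair_integral(P, Q) that returns the Schwarz integral $(PQ|PQ)$; the function and list names are illustrative assumptions and are not taken from HONPAS.

```python
import math

def build_pair_list(shells_p, shells_q, pair_integral, tol=1e-5):
    """Keep only the shell pairs whose Schwarz integral exceeds the tolerance."""
    pairs = []
    for P in shells_p:                 # e.g. I restricted to the unit cell
        for Q in shells_q:             # e.g. J running over the whole supercell
            v = pair_integral(P, Q)    # (PQ|PQ)
            if v > tol:
                # storing sqrt once lets the later bound of equation (19),
                # |(PQ|RS)| <= q_PQ * q_RS, be evaluated by one multiplication
                pairs.append((P, Q, math.sqrt(v)))
    return pairs

# list_IJ = build_pair_list(unit_cell_shells, supercell_shells, pair_integral)
# list_KL = build_pair_list(supercell_shells, supercell_shells, pair_integral)
```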

Then, before the calculation of every ERI, we use the Schwarz inequality (equation (19)) again to estimate a rigorous upper bound, so that only the ERIs with non-negligible contributions are calculated; we refer to this screening method as Schwarz screening. Exploiting the exponential decay of the charge distributions, the Schwarz screening reduces the total number of ERIs to be computed from $O(N^4)$ to $O(N^2)$. The Schwarz screening tolerance is set to $10^{-5}$ in the following calculations. In addition, NAO screening is also adopted, as the NAO is strictly truncated (Shang et al., 2011). The NAO screening is safe in the calculation of the short-range ERIs because in this case the HFX Hamiltonian matrix is sparse due to the screened Coulomb potential (Izmaylov et al., 2006). As a result, we store this HFX Hamiltonian with a sparse matrix data structure.

In practice, it should also be noted that, as the angular part of the NAOs is a spherical harmonic while the GTOs are Cartesian Gaussians, we need to make a transformation between Cartesian and spherical harmonic functions. The two differ in the number of atomic orbitals contained in shells whose angular momenta are larger than 1; for example, a d-shell has five spherical orbitals but six Cartesian orbitals. A detailed transformation method can be found in the study by Schlegel and Frisch (1995).
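The dimension mismatch the transformation has to bridge is given by the two standard counting formulas, $2l+1$ spherical against $(l+1)(l+2)/2$ Cartesian functions per shell:

```python
# Spherical vs Cartesian function counts per shell (standard formulas).
for l, name in zip(range(4), "spdf"):
    spherical = 2 * l + 1
    cartesian = (l + 1) * (l + 2) // 2
    print(f"{name}-shell: {spherical} spherical vs {cartesian} Cartesian")
# d-shell: 5 spherical vs 6 Cartesian, as noted above
```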

3.1.1. Parallel schemes. The ERIs have four indexes that can be parallelized. One possible parallel scheme is to distribute just one shell index; however, the number of values taken by a single shell index is too small to be spread over many CPU cores, so such a one-index distribution may cause serious load imbalance.

The other parallel scheme is to distribute the shell pairs (list-KL). This is a straightforward way to parallelize the ERIs, as in practice we loop over the two pair lists. However, although the shell pairs can be distributed evenly before the ERIs calculations, ERI screening is still needed during the ERIs calculations, which again causes load imbalance. The practical implementation of the described formalism closely follows the flowchart shown in Algorithm 2. After the building of list-KL is completed, we distribute it over the CPU cores, so that every core holds a list-KL-local. In the subsequent ERI screening and calculation, each core loops only over its list-KL-local. The advantage of this scheme is that it naturally bypasses any communication: every CPU core only traverses and computes its assigned list-KL-local. However, although list-KL is distributed evenly over the processors, the ERI screening takes place after the parallelization, so the number of shell ERIs that actually need to be calculated differs from processor to processor. This difference in surviving shell ERIs is what produces the load imbalance. A minimal sketch of the scheme follows.
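This sketch assumes an MPI-style rank/size pair plus screening and integral callbacks; the names are illustrative, and the round-robin slice stands in for whatever static partition of list-KL is used.

```python
def parallel_pair(list_ij, list_kl, rank, size, survives_screening, compute_eri):
    """Static parallel-pair distribution: each core owns a slice of list-KL."""
    list_kl_local = list_kl[rank::size]        # distribute list-KL up front
    for ij in list_ij:                         # full list-IJ on every core
        for kl in list_kl_local:
            if survives_screening(ij, kl):     # Schwarz + NAO screening, local only
                compute_eri(ij, kl)            # no communication at all
```

Because screening happens after the split, two equally long local lists can survive screening very differently, which is exactly the imbalance measured in Section 4.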

In order to achieve load balance, distributing the individual shell quartets $(IJ|KL)$ after the ERI screening process is a possible choice. Even though the computational time is nonuniform across shell types, this distribution can still yield an even time over the CPU cores because its distribution chunks are the smallest possible. The practical implementation of the shell-quartet algorithm is shown in Algorithm 3. Every CPU core traverses the global pair lists (list-IJ and list-KL) and performs the ERI screening to determine which ERIs need to be computed. A global counter then counts the ERIs to be computed, and this counter is distributed over the CPU cores to make sure that the number of calculated ERIs is evenly distributed. The disadvantage of this algorithm is that every processor must perform the whole ERI screening, whereas in the abovementioned parallel-pair algorithm each processor screens only its local list. This globally replicated ERI screening decreases the parallel efficiency. A minimal sketch follows.
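The corresponding sketch under the same assumptions as before; the only change is that the screening loop is replicated on every core and a global counter assigns the surviving quartets round-robin.

```python
def parallel_quartet(list_ij, list_kl, rank, size, survives_screening, compute_eri):
    """Static parallel-quartet distribution via a replicated global counter."""
    i = 0
    for ij in list_ij:                         # full lists on every core
        for kl in list_kl:
            if survives_screening(ij, kl):     # screening repeated by all cores
                if i % size == rank:           # counter picks the owner core
                    compute_eri(ij, kl)        # surviving quartets split evenly
                i += 1
```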

4. Results and discussions

In order to demonstrate the performance of the abovementioned two static parallel schemes, we use bulk silicon containing 64 atoms in the unit cell as a test case, as shown in Figure 2. Norm-conserving pseudopotentials generated with the Troullier–Martins scheme (Troullier and Martins, 1991), in the fully separable form developed by Kleinman and Bylander (1982), are used to represent the interaction between the core ions and the valence electrons. The screened hybrid functional HSE06 (Krukau et al., 2006) was used in the following calculations. Both a single-zeta (SZ) basis set containing s and p shells and a double-zeta plus polarization (DZP) basis set containing s, p, and d shells are considered. All calculations were carried out on the Tianhe-2 supercomputer located in the National Supercomputer Center in Guangzhou, China; the configuration of the machine is shown in Table 1. The Intel Math Kernel Library (version 10.0.3.020) is used in the calculations.

For the parallel-pair and parallel-quartet algorithms, which are fully parallelized and involve no communication, load imbalance is one of the factors that may affect the parallel efficiency. To examine the load balance, the timing

Algorithm 3. Flowchart of the parallel-quartet algorithm for ERIs. Here N refers to the total number of CPU cores and current-core refers to the index of the current processor. The description of Schwarz screening and NAO screening is in the text.

for list-IJ in shell-pair-list-IJ do
    for list-KL in shell-pair-list-KL do
        if Schwarz screening .and. NAO screening then
            i++
            if i mod N eq current-core then
                compute shell ERI (IJ|KL)
            end if
        end if
    end for
end for

Figure 2. The bulk silicon cell containing 64 atoms used as the benchmark system in this work. The table lists the number of shells and the number of NAO basis functions for each basis set.

          Basis set    Shells    NAOs
Si 64     SZ           128       256
Si 64     DZP          320       832

Algorithm 2. Flowchart of the parallel-pair algorithm for ERIs. Here shell-pair-list-KL-local denotes the slice of shell-pair-list-KL distributed to each CPU core at the beginning. The description of Schwarz screening and NAO screening is in the text.

get shell-pair-list-KL-local
for list-IJ in shell-pair-list-IJ do
    for list-KL in shell-pair-list-KL-local do
        if Schwarz screening .and. NAO screening then
            compute shell ERI (IJ|KL)
        end if
    end for
end for

Table 1. The parameters for each node of Tianhe-2.

Component           Value
CPU                 Intel(R) Xeon(R) CPU E5-2692
Frequency (GHz)     2.2
Cores               12

Algorithm 1. The NAO2GTO fitting scheme.

for iter = 1 to N do
    constrained genetic algorithm: get initial a_m^iter and D_m^iter
    constrained local minimal search to refine a_m^iter and D_m^iter
    err = Σ_i [ Σ_m D_m^iter exp(−a_m^iter r_i²) − φ_Iln(r_i)/r_i^l ]²
    if iter = 1 .or. err < best_err then
        best_err = err
        a_m = a_m^iter and D_m = D_m^iter
    end if
end for


at every core is shown in Figures 3 to 6. It is clearly seen that the parallel-pair algorithm (red line) is load imbalanced; for the SZ basis set, the time difference between cores is around 10% on 12 cores (Figure 3) and around 80% on 192 cores (Figure 4). For the DZP basis set, d shells are included, which causes more serious load imbalance; the time difference between cores is around 22% on 12 cores (Figure 5) and around 100% on

Figure 3. The load balance for the bulk silicon supercell with 64 atoms using the SZ basis set on 12 CPU cores (CPU time per core index for the parallel-pair and parallel-quartet algorithms). SZ: single-zeta.

Figure 4. The load balance for the bulk silicon supercell with 64 atoms using the SZ basis set on 192 CPU cores (CPU time per core index for the parallel-pair and parallel-quartet algorithms). SZ: single-zeta.

Figure 5. The load balance for the bulk silicon supercell with 64 atoms using the DZP basis set on 12 CPU cores (CPU time per core index for the parallel-pair and parallel-quartet algorithms). DZP: double-zeta plus polarization.

Figure 6. The load balance for the bulk silicon supercell with 64 atoms using the DZP basis set on 192 CPU cores (CPU time per core index for the parallel-pair and parallel-quartet algorithms). DZP: double-zeta plus polarization.

Table 2. The CPU time (in seconds) for the calculation of the ERIs of the Si 64 system with the SZ basis set using different parallel schemes.

Cores    Parallel-pair    Parallel-quartet
12       110.9            108.2
24       55.6             57.0
96       15.5             18.2
192      8.6              11.6

ERI: electron repulsion integral; SZ: single-zeta.

Figure 7. (Color online) Parallel speedups and efficiency for the ERIs calculation using different parallel schemes. Speedups were obtained on Tianhe-2 for the bulk silicon supercell with 64 atoms using the SZ basis set; the speedup is referenced to a run on 12 CPU cores. ERI: electron repulsion integral; SZ: single-zeta.

192 cores (Figure 6). On the other hand, the load balance of the parallel-quartet algorithm is quite good; the time difference between cores is within 1%. However, in this algorithm, the ERI screening is performed for all the ERIs by every CPU core, which takes a constant time even as the number of CPU cores increases; these replicated ERI screening calculations decrease the parallel efficiency. As shown in Figure 4, for the small basis set on a large number of CPU cores, the average CPU time of the parallel-quartet algorithm is around twice that of the parallel-pair algorithm.

Such global ERI screening in the parallel-quartet algorithm also contributes significantly to lowering the parallel speedup and efficiency for the SZ basis set, as shown in Table 2 and Figure 7. The parallel efficiency is only 58% at 192 CPU cores for the parallel-quartet algorithm, while it stays around 80% for the parallel-pair algorithm.

For the DZP basis set, the load imbalance caused by the d shells becomes another factor lowering the parallel efficiency, so in this case, as shown in Table 3 and Figure 8, the parallel speedup and efficiency are similar for both the parallel-pair and parallel-quartet algorithms, around 80% at 192 CPU cores.

Table 3. The CPU time (in seconds) for the calculation of the ERIs of the Si 64 system with the DZP basis set using different parallel schemes.

Cores    Parallel-pair    Parallel-quartet
12       1645.1           1572.9
24       904.0            806.1
96       251.3            225.9
192      129.6            128.0

ERI: electron repulsion integral; DZP: double-zeta plus polarization.

Figure 8. (Color online) Parallel speedups and efficiency for the ERIs calculation using different parallel schemes. Speedups were obtained on Tianhe-2 for the bulk silicon supercell with 64 atoms using the DZP basis set; the speedup is referenced to a run on 12 CPU cores. ERI: electron repulsion integral; DZP: double-zeta plus polarization.

5. Conclusions

In summary, we have presented two static parallel algorithms for the ERIs calculations in the NAO2GTO method and analyzed their load balance and parallel efficiency. On the basis of our results, the static distribution over ERI shell pairs produces a load imbalance that decreases the efficiency, limiting the number of CPU cores that can be utilized. On the other hand, the static distribution over ERI shell quartets yields very good load balance; however, because of the need for a global ERI screening calculation, its parallel efficiency is dramatically reduced for small basis sets. We have also tried a third static method that first creates a need-to-calculate ERI list, taking into account all the screening methods as well as the eightfold permutational symmetry, and then distributes the ERIs in this list over the processes. However, we find that the time to build the need-to-calculate list is even larger than the global ERI screening calculation. As a next step, we need to distribute the ERI screening calculation itself while keeping the load balance of the ERI calculation; a dynamic distribution could enable load balance with little loss of efficiency.

Acknowledgement

The authors thank the Tianhe-2 Supercomputer Center for computational resources.

Author Contributions

XQ and HS contributed equally to this work.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Key Research and Development Program of China (grant no. 2017YFB0202302), the Special Fund for Strategic Pilot Technology of the Chinese Academy of Sciences (grant no. XDC01040000), the National Natural Science Foundation of China (grant nos 61502450 and 21803066), Research Start-Up Grants (grant no. KY2340000094) from the University of Science and Technology of China, and the Chinese Academy of Sciences Pioneer Hundred Talents Program.

References

Alexeev Y, Kendall RA and Gordon MS (2002) The distributed data SCF. Computer Physics Communications 143(1): 69–82.

Becke AD (1993) Density-functional thermochemistry. III. The role of exact exchange. The Journal of Chemical Physics 98(7): 5648–5652.

Blum V, Gehrke R, Hanke F, et al. (2009) Ab initio molecular simulations with numeric atom-centered orbitals. Computer Physics Communications 180(11): 2175–2196.

Burant JC, Scuseria GE and Frisch MJ (1996) A linear scaling method for Hartree–Fock exchange calculations of large molecules. The Journal of Chemical Physics 105(19): 8969–8972.

Bush IJ, Tomic S, Searle BG, et al. (2011) Parallel implementation of the ab initio CRYSTAL program: electronic structure calculations for periodic systems. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 467(2131): 2112–2126.

Chow E, Liu X, Smelyanskiy M, et al. (2015) Parallel scalability of Hartree–Fock calculations. The Journal of Chemical Physics 142(10): 104103.

Coleman TF and Li Y (1994) On the convergence of interior-reflective Newton methods for nonlinear minimization subject to bounds. Mathematical Programming 67(1–3): 189–224.

Coleman TF and Li Y (1996) An interior trust region approach for nonlinear minimization subject to bounds. SIAM Journal on Optimization 6(2): 418–445.

Conn AR, Gould NIM and Toint P (1991) A globally convergent augmented Lagrangian algorithm for optimization with general constraints and simple bounds. SIAM Journal on Numerical Analysis 28(2): 545–572.

Delhalle J and Calais JL (1987) Direct-space analysis of the Hartree–Fock energy bands and density of states for metallic extended systems. Physical Review B 35(18): 9460–9466.

Delley B (1990) An all-electron numerical method for solving the local density functional for polyatomic molecules. The Journal of Chemical Physics 92(1): 508–517.

Dovesi R, Saunders VR, Roetti C, et al. (2006) CRYSTAL06 User's Manual. Torino: University of Torino.

Enkovaara J, Rostgaard C, Mortensen JJ, et al. (2010) Electronic structure calculations with GPAW: a real-space implementation of the projector augmented-wave method. Journal of Physics: Condensed Matter 22(25): 253202.

Frisch MJ, Trucks GW, Schlegel HB, et al. (2009) Gaussian 09, Revision B.01. Wallingford, CT: Gaussian.

Gell-Mann M and Brueckner KA (1957) Correlation energy of an electron gas at high density. Physical Review 106(2): 364–368.

Goldberg DE (1989) Genetic Algorithms in Search, Optimization and Machine Learning, 1st ed. Boston: Addison-Wesley Longman.

Gonze X, Beuken JM, Caracas R, et al. (2002) First-principles computation of material properties: the ABINIT software project. Computational Materials Science 25(3): 478–492.

Gonze X, Jollet F, Abreu Araujo F, et al. (2016) Recent developments in the ABINIT software package. Computer Physics Communications 205: 106–131.

Guidon M, Hutter J and VandeVondele J (2010) Auxiliary density matrix methods for Hartree–Fock exchange calculations. Journal of Chemical Theory and Computation 6(8): 2348–2364.

Guidon M, Schiffmann F, Hutter J, et al. (2008) Ab initio molecular dynamics using hybrid density functionals. The Journal of Chemical Physics 128(21): 214104.

Häser M and Ahlrichs R (1989) Improvements on the direct SCF method. Journal of Computational Chemistry 10(1): 104–111.

Havu V, Blum V, Havu P, et al. (2009) Efficient integration for all-electron electronic structure calculation using numeric basis functions. Journal of Computational Physics 228(22): 8367–8379.

Heyd J, Scuseria GE and Ernzerhof M (2003) Hybrid functionals based on a screened Coulomb potential. The Journal of Chemical Physics 118(18): 8207–8215.

Heyd J, Scuseria GE and Ernzerhof M (2006) Erratum: "Hybrid functionals based on a screened Coulomb potential" [J. Chem. Phys. 118, 8207 (2003)]. The Journal of Chemical Physics 124(21): 219906.

Hohenberg P and Kohn W (1964) Inhomogeneous electron gas. Physical Review 136(3B): B864–B871.

Izmaylov AF, Scuseria GE and Frisch MJ (2006) Efficient evaluation of short-range Hartree–Fock exchange in large molecules and periodic systems. The Journal of Chemical Physics 125(10): 104103.

Janesko BG, Henderson TM and Scuseria GE (2009) Screened hybrid density functionals for solid-state chemistry and physics. Physical Chemistry Chemical Physics 11(3): 443–454.

Kleinman L and Bylander DM (1982) Efficacious form for model pseudopotentials. Physical Review Letters 48(20): 1425–1428.

Kohn W and Sham LJ (1965) Self-consistent equations including exchange and correlation effects. Physical Review 140(4A): A1133–A1138.

Krukau AV, Vydrov OA, Izmaylov AF, et al. (2006) Influence of the exchange screening parameter on the performance of screened hybrid functionals. The Journal of Chemical Physics 125(22): 224106.

Liu X, Patel A and Chow E (2014) A new scalable parallel algorithm for Fock matrix construction. In: 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Phoenix, AZ, 19–23 May 2014.

Merlot P, Izsák R, Borgoo A, et al. (2014) Charge-constrained auxiliary-density-matrix methods for the Hartree–Fock exchange contribution. The Journal of Chemical Physics 141(9): 094104.

Mohr S, Ratcliff LE, Boulanger P, et al. (2014) Daubechies wavelets for linear scaling density functional theory. The Journal of Chemical Physics 140(20): 204110.

Monkhorst HJ (1979) Hartree–Fock density of states for extended systems. Physical Review B 20(4): 1504–1513.

Mori-Sánchez P, Cohen AJ and Yang W (2008) Localization and delocalization errors in density functional theory and implications for band-gap prediction. Physical Review Letters 100(14): 146401.

Ochsenfeld C, White CA and Head-Gordon M (1998) Linear and sublinear scaling formation of Hartree–Fock-type exchange matrices. The Journal of Chemical Physics 109(5): 1663–1669.

Ozaki T (2003) Variationally optimized atomic orbitals for large-scale electronic structures. Physical Review B 67(15): 155108.

Paier J, Diaconu CV, Scuseria GE, et al. (2009) Accurate Hartree–Fock energy of extended systems using large Gaussian basis sets. Physical Review B 80(17): 174114.

Paier J, Marsman M, Hummer K, et al. (2006) Screened hybrid density functionals applied to solids. The Journal of Chemical Physics 124(15): 154709.

Parr RG and Yang W (1989) Density Functional Theory of Atoms and Molecules. New York: Oxford University Press.

Polly R, Werner HJ, Manby FR, et al. (2004) Fast Hartree–Fock theory using local density fitting approximations. Molecular Physics 102(21–22): 2311–2321.

Qin X, Shang H, Xiang H, et al. (2014) HONPAS: a linear scaling open-source solution for large system simulations. International Journal of Quantum Chemistry 115(10): 647–655.

Ren X, Rinke P, Blum V, et al. (2012) Resolution-of-identity approach to Hartree–Fock, hybrid density functionals, RPA, MP2 and GW with numeric atom-centered orbital basis functions. New Journal of Physics 14(5): 053020.

Schlegel HB and Frisch MJ (1995) Transformation between Cartesian and pure spherical harmonic Gaussians. International Journal of Quantum Chemistry 54(2): 83–87.

Schmidt MW, Baldridge KK, Boatz JA, et al. (1993) General atomic and molecular electronic structure system. Journal of Computational Chemistry 14(11): 1347–1363.

Schwegler E and Challacombe M (1996) Linear scaling computation of the Hartree–Fock exchange matrix. The Journal of Chemical Physics 105(7): 2726–2734.

Schwegler E, Challacombe M and Head-Gordon M (1997) Linear scaling computation of the Fock matrix. II. Rigorous bounds on exchange integrals and incremental Fock build. The Journal of Chemical Physics 106(23): 9708–9717.

Shang H, Li Z and Yang J (2011) Implementation of screened hybrid density functional for periodic systems with numerical atomic orbitals: basis function fitting and integral screening. The Journal of Chemical Physics 135(3): 034110.

Sodt A and Head-Gordon M (2008) Hartree–Fock exchange computed using the atomic resolution of the identity approximation. The Journal of Chemical Physics 128(10): 104106.

Soler JM, Artacho E, Gale JD, et al. (2002) The SIESTA method for ab initio order-N materials simulation. Journal of Physics: Condensed Matter 14(11): 2745–2779.

Stephens PJ, Devlin FJ, Chabalowski CF, et al. (1994) Ab initio calculation of vibrational absorption and circular dichroism spectra using density functional force fields. The Journal of Physical Chemistry 98(45): 11623–11627.

Troullier N and Martins JL (1991) Efficient pseudopotentials for plane-wave calculations. Physical Review B 43(3): 1993–2006.

Tymczak CJ and Challacombe M (2005) Linear scaling computation of the Fock matrix. VII. Periodic density functional theory at the Γ point. The Journal of Chemical Physics 122(13): 134102.

Valiev M, Bylaska EJ, Govind N, et al. (2010) NWChem: a comprehensive and scalable open-source solution for large scale molecular simulations. Computer Physics Communications 181(9): 1477–1489.

VandeVondele J, Krack M, Mohamed F, et al. (2005) Quickstep: fast and accurate density functional calculations using a mixed Gaussian and plane waves approach. Computer Physics Communications 167(2): 103–128.

Author biographies

Xinming Qin is a PhD student in chemical physics at USTC under the supervision of Prof. Jinlong Yang. He received his BS degree in chemistry in July 2009 at the same university. His main research interests are developing and applying new algorithms for large-scale electronic structure calculations. He is the main developer of HONPAS.

Honghui Shang is an associate professor in computer science at the Institute of Computing Technology, Chinese Academy of Sciences, China. She received her BS degree in physics from the University of Science and Technology of China in 2006 and her PhD degree in physical chemistry from the same university in 2011. Between 2012 and 2018, she worked as a postdoctoral research assistant at the Fritz Haber Institute of the Max Planck Society, Berlin, Germany, where she was responsible for the Hybrid Inorganic/Organic Systems for Opto-Electronics (HIOS) project and the Novel Materials Discovery (NOMAD) project. Her main research interests are developing physical algorithms and numerical methods for first-principles calculations, as well as accelerating these applications on high performance computers. Currently, she is the main developer of HONPAS (leader of the hybrid density-functional part) and FHI-aims (leader of the density-functional perturbation theory part).

Lei Xu is currently working toward the BS degree in the Department of Physics at Sichuan University. His current research interest is parallel programming in the high performance computing domain.

Wei Hu is currently a professor in the division of theoretical and computational sciences at the Hefei National Laboratory for Physical Sciences at Microscale (HFNL) at USTC. He received his BS degree in chemistry from USTC in 2007 and his PhD degree in physical chemistry from the same university in 2013. From 2014 to 2018, he worked as a postdoctoral fellow in the Scalable Solvers Group of the Computational Research Division at Lawrence Berkeley National Laboratory (LBNL), Berkeley, United States. During his postdoctoral research, he developed a new massively parallel methodology, DGDFT (Discontinuous Galerkin Method for Density Functional Theory), for large-scale DFT calculations. His main research interests focus on the development, parallelization, and application of advanced and highly efficient DFT methods and codes for accurate first-principles modeling and simulation of nanomaterials.

Jinlong Yang is a full professor of chemistry and executive dean of the School of Chemistry and Material Sciences at USTC. He obtained his PhD degree in 1991 from USTC. He was awarded the Outstanding Youth Foundation of China in 2000 and was selected as a Changjiang Scholars Program Chair Professor in 2001 and as a fellow of the American Physical Society (APS) in 2011. He is the second awardee of the 2005 National Award for Natural Science (second prize) and the awardee (principal contributor) of the 2014 Outstanding Science and Technology Achievement Prize of the Chinese Academy of Sciences (CAS). His research mainly focuses on the development of first-principles methods and their application to clusters, nanostructures, solid materials, surfaces, and interfaces. He is the initiator and leader of HONPAS.

Shigang Li received his Bachelor's degree in computer science and technology and his PhD in computer architecture from the University of Science and Technology Beijing, China, in 2009 and 2014, respectively. He was funded by the CSC for a two-year PhD study at the University of Illinois at Urbana-Champaign. He was an assistant professor (from June 2014 to August 2018) in the State Key Lab of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences at the time of his contribution to this work. Since August 2018, he has been a postdoctoral researcher in the Department of Computer Science, ETH Zurich. His research interests focus on performance optimization for parallel and distributed computing systems, including parallel algorithms, parallel programming models, performance models, and intelligent methods for performance optimization.

Yunquan Zhang received his BS degree in computer science and engineering from the Beijing Institute of Technology in 1995. He received his PhD degree in computer software and theory from the Chinese Academy of Sciences in 2000. He is a full professor and PhD advisor at the State Key Lab of Computer Architecture, ICT, CAS. He is also appointed as the Director of the National Supercomputing Center at Jinan and the General Secretary of China's High Performance Computing Expert Committee. He organizes and distributes China's TOP100 List of High Performance Computers, which tracks and reports the development of HPC system technology and usage in China. His research interests include high performance parallel computing, with a focus on parallel programming models, high performance numerical algorithms, and performance modeling and evaluation for parallel programs.