Hybrid MPI and OpenMP parallel implementation of large-scale linear-response time-dependent density functional theory with plane-wave basis set

Lingyun Wan,a Xiaofeng Liu,a Jie Liu,a Wei Hu,*a and Jinlong Yang*a

a Hefei National Laboratory for Physical Sciences at the Microscale, Department of Chemical Physics, and Synergetic Innovation Center of Quantum Information and Quantum Physics, University of Science and Technology of China, Hefei, Anhui 230026, China

E-mail: whuustc@ustc.edu.cn (Wei Hu), jlyang@ustc.edu.cn (Jinlong Yang)

September 2020
Abstract.
High performance computing (HPC) is a powerful tool to accelerate Kohn-Sham density functional theory (KS-DFT) calculations on modern heterogeneous supercomputers. Here, we describe a massively parallel implementation of large-scale linear-response time-dependent density functional theory (LR-TDDFT) to compute the excitation energies and wavefunctions of solids with the plane-wave basis set under periodic boundary conditions in the Plane Wave Density Functional Theory (PWDFT) software package. We adopt a two-level parallelization strategy that combines the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) parallel programming to handle the matrix computation and data communication involved in constructing and diagonalizing the LR-TDDFT Hamiltonian matrix. Numerical results illustrate that the LR-TDDFT calculation can scale up to 24,576 processing cores on modern heterogeneous supercomputers to study the excited-state properties of bulk silicon systems containing thousands of atoms (4,096 atoms). We apply the LR-TDDFT calculation to study the photoinduced charge separation of a water molecule adsorbed on the rutile TiO2(110) surface from an excitonic perspective.

Keywords: time-dependent density functional theory, thousands of atoms, MPI and OpenMP, high performance computing
1. Introduction
Time-dependent density functional theory (TDDFT), based on the Runge-Gross theorem1, is a self-consistent theoretical framework to describe excited-state properties in molecules and solids. In general, there are two schemes for solving the time-dependent Schrödinger equation in the framework of TDDFT2,3. The first approach, referred to as real-time TDDFT (RT-TDDFT), solves the problem by integrating the propagator over time and real space, using molecular dynamics for the time evolution4. The second one, which is the most widely used and often simply referred to as TDDFT in the literature, solves the many-body quantum problem in the frequency domain via a Fourier transformation of the time-dependent linear response function, and obtains excitation energies and corresponding oscillator strengths from the poles and residues of the complex response function5,6. To distinguish this approach from the first one, it is referred to as linear-response TDDFT (LR-TDDFT).
The common way to obtain the excitation energies and wavefunctions in LR-TDDFT is to solve the linear-response Casida equation5. There are two main components when solving the Casida equation: one is to construct the LR-TDDFT Hamiltonian, with a high computational complexity of O(N_e^5), and the other is to diagonalize the LR-TDDFT Hamiltonian, with an ultrahigh computational complexity of O(N_e^6), where N_e is the number of electrons in the system. Therefore, the computational cost and memory usage of the LR-TDDFT calculation become prohibitively expensive as the system size increases2,3. Several low-scaling methods have been proposed to reduce the ultrahigh computational cost and memory usage of the TDDFT calculation with small localized basis sets7, such as Gaussian-type orbitals or numerical atomic orbitals. However, it remains challenging to simulate the excited-state properties of large-scale systems containing thousands of atoms with the LR-TDDFT calculation, especially for the plane-wave basis set under periodic boundary conditions.
In recent years, the rapid development of modern heterogeneous supercomputers has made high performance computing (HPC) a powerful tool to accelerate KS-DFT calculations for large-scale systems and to provide huge memory space. In particular, several highly efficient HPC KS-DFT software packages for ground-state electronic structure calculations with small localized basis sets have been developed, such as SIESTA8, CP2K9, CONQUEST10, FHI-aims11, BigDFT12, HONPAS13-15 and DGDFT16-18, which take advantage of the massive parallelism available on modern heterogeneous supercomputers. For example, large-scale KS-DFT calculations containing tens of thousands of atoms have been performed in CP2K9, CONQUEST10 and DGDFT17,18, which can scale up to hundreds of thousands of cores on the Cray, Edison, Cori and Sunway TaihuLight supercomputers.
Based on such large-scale DFT calculations, large-scale TDDFT calculations become feasible. In particular, large-scale TDDFT excited-state electronic structure calculations using Gaussian basis sets have been implemented in NWChem19 and QChem20. For example, NWChem21 has performed medium-scale LR-TDDFT excited-state calculations on the Au20Ne100 cluster (120 atoms and 1,840 basis functions), scaling up to 2,250 processing cores on the CINECA supercomputer. However, no comparable breakthrough has been achieved for the plane-wave basis set, owing to the ultrahigh computational cost and memory usage of the LR-TDDFT excited-state electronic structure calculation22,23.
In the present work, we describe a massively parallel implementation of LR-TDDFT to compute the excited-state electronic structures of solids with the plane-wave basis set under periodic boundary conditions in the PWDFT (Plane Wave Density Functional Theory)24-26 software package. We adopt a two-level parallelization strategy that combines the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) parallel programming to deal with the ultrahigh computational cost and memory usage of constructing and diagonalizing the LR-TDDFT Hamiltonian. We show that the LR-TDDFT calculation can scale up to 24,576 processing cores on modern heterogeneous supercomputers for simulating the excited-state properties of bulk silicon systems that contain thousands of atoms. We perform the LR-TDDFT calculation to study the excited states of a water molecule adsorbed on the rutile TiO2(110) surface from an excitonic perspective.
2. Methodology
In this section, we describe the theoretical algorithms and parallel implementation of the LR-TDDFT calculation with the plane-wave basis set under periodic boundary conditions in the PWDFT24-26 software package. We present the implementation of the LR-TDDFT calculation based on a two-level parallelization strategy that combines the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) parallel programming to deal with the ultrahigh computational cost and memory usage of constructing and diagonalizing the LR-TDDFT Hamiltonian matrix, resulting in HPC excited-state electronic structure calculations on modern heterogeneous supercomputers.
2.1. Linear response time-dependent density functional theory
The Casida equation in linear-response time-dependent density functional theory (LR-TDDFT) can be written as an eigenvalue problem of the form (the detailed derivation of the Casida equation is given in the Appendix)

    H X = X \Lambda    (1)

where X contains the expansion coefficients of the excitation wavefunctions in the KS orbitals and \Lambda contains the corresponding excitation energies. We call H the LR-TDDFT Hamiltonian because its form is similar to that of a Hamiltonian. H has the following block structure

    H = \begin{pmatrix} D + 2V_{Hxc} & 2W_{Hxc} \\ -2W_{Hxc}^{\dagger} & -D - 2V_{Hxc}^{\dagger} \end{pmatrix}    (2)
V_Hxc and W_Hxc represent the Hartree-exchange-correlation integrals defined as

    V_{Hxc}(i_v i_c, j_v j_c) = \int \psi_{i_v}^*(r)\, \psi_{i_c}(r)\, f_{Hxc}(r, r')\, \psi_{j_v}(r')\, \psi_{j_c}^*(r')\, dr\, dr'
    W_{Hxc}(i_v i_c, j_v j_c) = \int \psi_{i_v}^*(r)\, \psi_{i_c}(r)\, f_{Hxc}(r, r')\, \psi_{j_v}^*(r')\, \psi_{j_c}(r')\, dr\, dr'    (3)

Here, f_Hxc is called the Hartree-exchange-correlation kernel, and it is expressed in terms of V_H and V_xc after the self-consistent field process as

    f_{Hxc}(r, r') = f_H + f_{xc} = \frac{1}{|r - r'|} + \frac{\delta V_{xc}[\rho]}{\delta \rho(r')}    (4)
where \rho(r) = \sum_{i_v}^{N_e} |\psi_{i_v}(r)|^2 is the electron density and N_e is the number of electrons in the ground state.

In the LR-TDDFT Hamiltonian, D is an N_vc x N_vc (N_vc = N_v N_c) diagonal matrix with matrix elements

    D(i_v i_c, j_v j_c) = (\varepsilon_{i_c} - \varepsilon_{i_v})\, \delta_{i_v j_v}\, \delta_{i_c j_c}    (5)

where the orbital energies \varepsilon_{i_v} (i_v = 1, 2, ..., N_v) and \varepsilon_{i_c} (i_c = 1, 2, ..., N_c) are associated with the selected valence orbitals \psi_{i_v}(r) and conduction orbitals \psi_{i_c}(r) used in the LR-TDDFT calculation. These energies and orbitals are typically obtained from the KS-DFT calculation.
Under the Tamm-Dancoff approximation (TDA)3, W_Hxc can be neglected and H becomes a Hermitian matrix, rewritten as

    H = D + 2V_{Hxc}    (6)

In real space, we choose N_r grid points to sample the wavefunctions, so the Hartree-exchange-correlation integrals V_Hxc can be rewritten as a multiplication of the matrix f_{Hxc} \in \mathbb{C}^{N_r \times N_r} with the transposed Khatri-Rao product (also known as the face-splitting product)27 matrix P_{vc} = \{\rho_{vc}(r) := \psi_{i_v}^*(r)\, \psi_{i_c}(r)\} \in \mathbb{C}^{N_r \times N_{vc}} of the valence and conduction orbitals (\psi_{i_v}(r) and \psi_{i_c}(r)) on the real-space grid \{r_i\}_{i=1}^{N_r}:

    V_{Hxc} = P_{vc}^{\dagger}\, f_{Hxc}\, P_{vc} = P_{vc}^{\dagger}\, (f_H + f_{xc})\, P_{vc}    (7)
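As a concrete illustration of the face-splitting product that enters eq. (7), the following C++ sketch builds the N_r x N_vc matrix P_vc column by column from the selected valence and conduction orbitals on the real-space grid. This is a minimal serial sketch, not the PWDFT implementation; the container names (psi_v, psi_c, P_vc) are illustrative, and the orbitals are assumed to be real (Gamma-point calculation).

#include <vector>
#include <cstddef>

// Minimal sketch: build the transposed Khatri-Rao (face-splitting) product
// P_vc(r, iv*Nc + ic) = psi_v[iv](r) * psi_c[ic](r) on the real-space grid.
// Orbitals are assumed real (Gamma-point calculation); column-major storage.
std::vector<double> build_face_splitting_product(
    const std::vector<std::vector<double>>& psi_v,  // Nv orbitals, each of length Nr
    const std::vector<std::vector<double>>& psi_c)  // Nc orbitals, each of length Nr
{
    const std::size_t Nv = psi_v.size(), Nc = psi_c.size();
    const std::size_t Nr = psi_v.empty() ? 0 : psi_v[0].size();
    std::vector<double> P_vc(Nr * Nv * Nc);

    for (std::size_t iv = 0; iv < Nv; ++iv)
        for (std::size_t ic = 0; ic < Nc; ++ic) {
            double* col = &P_vc[(iv * Nc + ic) * Nr];    // one column rho_vc(r)
            for (std::size_t ir = 0; ir < Nr; ++ir)
                col[ir] = psi_v[iv][ir] * psi_c[ic][ir]; // pointwise product in real space
        }
    return P_vc;
}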
For simplicity, we only consider the local-density approximation (LDA) functional28 in the KS-DFT and LR-TDDFT calculations in this work. In that case, the LDA exchange-correlation kernel f_xc is diagonal in real space (\{r_i\}_{i=1}^{N_r}), so the exchange-correlation product state f_{xc} P_{vc} can be computed directly and efficiently in real space with a computational complexity of N_r N_v N_c ~ O(N_e^3). To apply the Hartree operator to the product state, f_H P_{vc}, we need to transform it into reciprocal space (\{G_i\}_{i=1}^{N_g}, where N_g is the number of grid points in reciprocal space), in which the Hartree operator f_H is diagonal. We denote it by \tilde{f}_H(G) = 4\pi/|G|^2 (notice that \tilde{f}_H(G = 0) = 0), so that f_H P_{vc} = \tilde{f}_H \tilde{P}_{vc} can be computed efficiently in reciprocal space, where \tilde{P}_{vc} = \{\tilde{\rho}_{vc}(G) := \psi_{i_v}^*(G)\, \psi_{i_c}(G)\} \in \mathbb{C}^{N_g \times N_{vc}}. The total number of grid points N_r in real space is determined by the kinetic energy cutoff E_cut through (N_r)_i = \sqrt{2E_{cut}}\, L_i/\pi, where L_i is the length of the supercell along the i-th (x, y and z) coordinate direction.
In this work, we use Fast Fourier Transforms (FFTs) to perform the transformation between real (\{r_i\}_{i=1}^{N_r}) and reciprocal (\{G_i\}_{i=1}^{N_g}) space, so the total number of reciprocal grid points N_g is equal to the number of real-space grid points N_r. In addition to performing N_v N_c ~ O(N_e^2) FFTs to transform P_{vc} = \{\rho_{vc}(r)\} into \tilde{P}_{vc} = \{\tilde{\rho}_{vc}(G)\}, applying the Hartree operator \tilde{f}_H to \tilde{P}_{vc} requires a computational complexity of N_g N_v N_c ~ O(N_e^3). After these steps, we apply inverse FFTs to the Hartree product matrices, add the two results, and multiply by the conjugate product matrix P_{vc}^{\dagger} to compute the Hartree-exchange-correlation integrals.
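To make the reciprocal-space step concrete, the sketch below applies the diagonal Hartree kernel \tilde{f}_H(G) = 4\pi/|G|^2 to a single column \rho_{vc} of P_vc on a cubic n^3 grid using FFTW. It is a simplified serial model of the procedure described above (cubic cell of side L in atomic units, one column at a time, hypothetical function name apply_hartree_to_column), not the distributed routine used in PWDFT.

#include <fftw3.h>

// Apply f_H(G) = 4*pi/|G|^2 to one column rho_vc(r) of P_vc on an n^3 grid
// of a cubic cell with side length L (atomic units). Result overwrites rho.
void apply_hartree_to_column(fftw_complex* rho, int n, double L) {
    const double PI = 3.14159265358979323846;
    fftw_plan fwd = fftw_plan_dft_3d(n, n, n, rho, rho, FFTW_FORWARD,  FFTW_ESTIMATE);
    fftw_plan bwd = fftw_plan_dft_3d(n, n, n, rho, rho, FFTW_BACKWARD, FFTW_ESTIMATE);

    fftw_execute(fwd);                         // rho_vc(r) -> rho_vc(G)

    const double twoPiOverL = 2.0 * PI / L;
    for (int i = 0; i < n; ++i)
      for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k) {
            // Map FFT indices to signed G components.
            double gx = twoPiOverL * (i <= n / 2 ? i : i - n);
            double gy = twoPiOverL * (j <= n / 2 ? j : j - n);
            double gz = twoPiOverL * (k <= n / 2 ? k : k - n);
            double g2 = gx * gx + gy * gy + gz * gz;
            double fH = (g2 > 1e-12) ? 4.0 * PI / g2 : 0.0;   // f_H(G = 0) = 0
            long idx = (static_cast<long>(i) * n + j) * n + k;
            rho[idx][0] *= fH;
            rho[idx][1] *= fH;
        }

    fftw_execute(bwd);                         // back to real space: (f_H rho_vc)(r)
    double scale = 1.0 / (static_cast<double>(n) * n * n);  // FFTW transforms are unnormalized
    for (long idx = 0; idx < static_cast<long>(n) * n * n; ++idx) {
        rho[idx][0] *= scale;
        rho[idx][1] *= scale;
    }
    fftw_destroy_plan(fwd);
    fftw_destroy_plan(bwd);
}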
Table 1: Computational cost and memory usage in the LR-TDDFT calculation. Notice that N_r ≈ 1,000 x N_o, and N_v ≈ N_c ≈ N_o ~ O(N_e) in the plane-wave basis set.

    LR-TDDFT                      Computational cost       Memory usage
    FFT                           N_v N_c N_r log N_r      N_r N_v N_c
    V_Hxc product (Gemm)          N_r N_v^2 N_c^2          N_v^2 N_c^2
    H diagonalization (Syevd)     N_v^3 N_c^3              N_v^2 N_c^2
After constructing the LR-TDDFT Hamiltonian matrix H, the next step is to diagonalize it explicitly, with an ultrahigh complexity of N_v^3 N_c^3 ~ O(N_e^6), to obtain the wavefunction expansion coefficient matrix X and the excitation energies \Lambda. The pseudocode of the LR-TDDFT calculation is shown in Algorithm 1.
Algorithm 1 The pseudocode for explicitly constructing and directly diagonalizing the LR-TDDFT Hamiltonian in the LR-TDDFT calculation in the plane-wave basis set.
Input: Ground-state energies \varepsilon_i and wavefunctions \psi_i(r).
Output: Excited-state energies \{\lambda_i\} and wavefunction expansion coefficients \{x_{ij}\}.
1: Initialize P_vc in real space and transform it into reciprocal space \tilde{P}_vc  ->  O(N_r N_v N_c + N_r log N_r N_v N_c)
2: Apply the exchange-correlation potential to P_vc in real space  ->  O(N_r N_v N_c)
3: Apply the Hartree potential to \tilde{P}_vc and transform back into real space to get f_H P_vc  ->  O(N_r log N_r N_v N_c + N_r N_v N_c)
4: Compute the Hartree-exchange-correlation integrals V_Hxc = P_vc^{\dagger}(f_H P_vc + f_xc P_vc) in real space  ->  O(N_r N_v^2 N_c^2)
5: Diagonalize the LR-TDDFT Hamiltonian H and obtain the excitation energies \{\lambda_i\} and the wavefunction expansion coefficients \{x_{ij}\}  ->  O(N_v^3 N_c^3)
It should be noted that for a large normalized plane-wave basis set, N_g is typically much larger than N_v or N_c. The number N_o of occupied orbitals is either N_e or N_e/2, depending on how the spin is counted. The number of conduction orbitals N_c included in the LR-TDDFT calculation is typically small, comparable to N_v (the precise number is a free parameter to be converged), whereas N_r and N_g are often much larger than 1,000 x N_e due to the high accuracy of the plane-wave basis set. In this work, we compute the LR-TDDFT Hamiltonian matrix in real space and set N_r = N_g in real and reciprocal space. Table 1 summarizes the computational cost and memory usage for explicitly constructing and directly diagonalizing the LR-TDDFT Hamiltonian in the LR-TDDFT calculation under the plane-wave basis set.
2.2. Parallel implementation
The implementation of the LR-TDDFT calculation is based on a two-level parallelization strategy that exploits MPI parallel programming to deal with the matrix computation and data communication, combined with the multi-thread parallelism of OpenMP parallel programming to share the memory usage in the LR-TDDFT calculation with the plane-wave basis set.
2.2.1. MPI parallelism
In this section, we demonstrate an efficient parallel implementation of three different types of data partition with the MPI parallelism for the LR-TDDFT calculation in the plane-wave basis set, as shown in figure 1.

Figure 1: Three different types of data partition with the MPI parallelism for the matrix computation and data communication used in the LR-TDDFT calculation. (a) 1D column block with the size of 1 x Np MPI processor grids (red box), (b) 1D row block with the size of Np x 1 MPI processor grids (blue box), and (c) 2D cyclic block with the size of I x J MPI processor grids (green box). Np is the number of MPI processes used in the LR-TDDFT calculation and I x J = Np.
When Np processors are used in the MPI parallelism, the wavefunctions, which are obtained from the ground-state DFT calculation, are stored in the 1D column block partition as shown in figure 1(a). In order to obtain the face-splitting product P_vc, we convert the wavefunctions to the 1D row partition as shown in figure 1(b). In the LR-TDDFT Hamiltonian matrix, the Hartree operator is diagonal in reciprocal space and the exchange-correlation operator is diagonal in real space, so we use Fast Fourier Transforms (FFTs) to transform P_vc in real space into \tilde{P}_vc in reciprocal space and then apply the Hartree operator. In order to apply the FFTs, we convert P_vc from the 1D row partition to the 1D column partition by MPI_Alltoall, as shown in figures 1(a) and (b). After obtaining the LR-TDDFT Hamiltonian, we diagonalize it to obtain the excitation energies and excitation wavefunctions. In the diagonalization process, the 2D block cyclic partition shown in figure 1(c) is the most efficient data distribution scheme for the diagonalization implemented in the ScaLAPACK software package. The conversion between the 1D row block partition and the 2D block cyclic partition is performed using the pdgemr2d subroutine (similar to MPI_Alltoall) in the ScaLAPACK library.
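The conversion between the 1D column block and 1D row block partitions is essentially a distributed transpose. The sketch below redistributes an N_r x N_col matrix from a column block layout (each rank owns N_col/N_p full columns) to a row block layout (each rank owns N_r/N_p full rows) with a single MPI_Alltoall, assuming both dimensions are divisible by N_p. The packing scheme and the function name column_block_to_row_block are illustrative; PWDFT's internal routines may differ in detail.

#include <mpi.h>
#include <vector>
#include <cstddef>

// Redistribute A (Nr x Ncol, column-major) from a 1D column block layout
// (each rank owns nc_loc = Ncol/Np full columns of length Nr) to a 1D row
// block layout (each rank owns nr_loc = Nr/Np full rows of all Ncol columns).
// Assumes Nr % Np == 0 and Ncol % Np == 0.
std::vector<double> column_block_to_row_block(const std::vector<double>& A_col,
                                              int Nr, int Ncol, MPI_Comm comm) {
    int Np = 0, rank = 0;
    MPI_Comm_size(comm, &Np);
    MPI_Comm_rank(comm, &rank);
    const int nc_loc = Ncol / Np;
    const int nr_loc = Nr / Np;
    const int chunk  = nr_loc * nc_loc;     // block exchanged between each pair of ranks

    // Pack: chunk q holds rows [q*nr_loc, (q+1)*nr_loc) of all local columns.
    std::vector<double> sendbuf(static_cast<std::size_t>(chunk) * Np);
    std::vector<double> recvbuf(static_cast<std::size_t>(chunk) * Np);
    for (int q = 0; q < Np; ++q)
        for (int j = 0; j < nc_loc; ++j)
            for (int i = 0; i < nr_loc; ++i)
                sendbuf[static_cast<std::size_t>(q) * chunk + j * nr_loc + i] =
                    A_col[static_cast<std::size_t>(j) * Nr + q * nr_loc + i];

    MPI_Alltoall(sendbuf.data(), chunk, MPI_DOUBLE,
                 recvbuf.data(), chunk, MPI_DOUBLE, comm);

    // Unpack: chunk q now holds this rank's rows of the columns owned by rank q.
    std::vector<double> A_row(static_cast<std::size_t>(nr_loc) * Ncol);
    for (int q = 0; q < Np; ++q)
        for (int j = 0; j < nc_loc; ++j)
            for (int i = 0; i < nr_loc; ++i)
                A_row[static_cast<std::size_t>(q * nc_loc + j) * nr_loc + i] =
                    recvbuf[static_cast<std::size_t>(q) * chunk + j * nr_loc + i];
    return A_row;
}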
Figure 2: Flowchart of the parallel implementation of the LR-TDDFT method in the PWDFT software package. There are three different types (red, blue and green boxes) of MPI data partition for the matrix computation and data communication used in the LR-TDDFT calculation.
More specifically, we describe the whole LR-TDDFT workflow and the various quantities in the LR-TDDFT method with their corresponding storage formats in figure 2. Starting from \psi_{i_v} and \psi_{i_c} distributed in the 1D column block partition, we first transform these wavefunctions into the 1D row block partition and apply the Hadamard product to get P_vc. In order to construct the LR-TDDFT Hamiltonian matrix, we send P_vc separately to two branches, as shown in figure 2. In the right branch, we convert P_vc into the 1D column block partition, transform it into reciprocal space with FFTs, and apply the Hartree operator to \tilde{P}_vc. After that, we use inverse FFTs to transform \tilde{f}_H \tilde{P}_vc from reciprocal space back to real space in order to save one matrix product, which is the most expensive part. In the left branch, f_xc can be calculated directly from the density; we apply the xc kernel to P_vc in real space to get f_xc P_vc and, at the same time, add f_H P_vc and f_xc P_vc to get f_Hxc P_vc. Then we multiply f_Hxc P_vc by P_vc^{\dagger}. Finally, we calculate the differences between the Kohn-Sham energy eigenvalues and add them to P_vc^{\dagger} f_Hxc P_vc to construct the LR-TDDFT Hamiltonian matrix.
To diagonalize the LR-TDDFT Hamiltonian, we form the Hamiltonian matrix in parallel within a 2D block cyclic distribution scheme and perform a parallel diagonalization (invoking the Syevd subroutine in the ScaLAPACK library) of the Hamiltonian on the 2D block cyclic grid. Finally, we redistribute the excitation eigenvectors with the pdgemr2d subroutine.
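In PWDFT the diagonalization is performed with ScaLAPACK's distributed Syevd on the 2D block cyclic grid. As a minimal serial analogue of this step, the sketch below diagonalizes a real symmetric TDA Hamiltonian of dimension N_vc with LAPACKE; the function name and the assumption that H has been gathered onto one process are ours.

#include <lapacke.h>
#include <vector>

// Serial analogue of the Syevd step: diagonalize the (real symmetric, TDA)
// LR-TDDFT Hamiltonian H of dimension Nvc = Nv*Nc stored column-major.
// On exit, H holds the eigenvectors (expansion coefficients X) column by
// column and omega holds the excitation energies in ascending order.
lapack_int diagonalize_tda_hamiltonian(std::vector<double>& H, int Nvc,
                                       std::vector<double>& omega) {
    omega.resize(Nvc);
    // 'V': compute eigenvalues and eigenvectors; 'U': upper triangle stored.
    return LAPACKE_dsyevd(LAPACK_COL_MAJOR, 'V', 'U', Nvc,
                          H.data(), Nvc, omega.data());
}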
2.2.2. Hybrid MPI and OpenMP parallelism
The hybrid MPI-OpenMP implementation can overcome memory issues and improve the scaling by exploiting the shared memory space within an individual node. This suitability arises from the fact that MPI and OpenMP differ significantly in the visibility of memory between cores29. In the MPI parallelism, each process has its own memory space, and the movement of data between processes is explicitly controlled by calls to the MPI application programming interface (API), such as MPI_Reduce and MPI_Alltoall. In contrast, OpenMP is a shared memory model, and all threads spawned by a given process have access to the same memory. The relevance of hybrid approaches is currently increasing, as trends in processor development focus on increasing shared-memory parallelism (more cores per node) rather than faster processors.
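A minimal sketch of this hybrid model, assuming one MPI rank per group of cores with only the master thread making MPI calls (hence MPI_THREAD_FUNNELED); the program below is illustrative and not part of PWDFT.

#include <mpi.h>
#include <omp.h>
#include <cstdio>

// Minimal hybrid MPI+OpenMP setup: one MPI rank per group of cores, with
// OpenMP threads sharing that rank's memory. MPI_THREAD_FUNNELED suffices
// when only the master thread issues MPI calls, as in the scheme described here.
int main(int argc, char** argv) {
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        // All threads of this rank see the same address space (no message passing).
        std::printf("MPI rank %d/%d, OpenMP thread %d/%d\n",
                    rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}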
Most of the memory usage and computing time in the LR-TDDFT calculation is spent on matrix operations, such as vector-vector, matrix-vector and matrix-matrix multiplications (Gemm), matrix diagonalization (Syevd), and fast Fourier transforms (FFT). All of these matrix operations can be implemented through the Basic Linear Algebra Subprograms (BLAS)30, Linear Algebra PACKage (LAPACK) and Fastest Fourier Transform in the West (FFTW)31 libraries. Fortunately, most of the subroutines in the BLAS, LAPACK and FFTW libraries can be effectively multi-thread accelerated by OpenMP parallelism. Furthermore, data communication becomes expensive for large systems; OpenMP can balance the load among threads automatically, and the shared memory reduces communication between MPI processes.
Figure 3: Two-level parallelization strategy that uses MPI parallel programming (the number of MPI processes is Np = 4) to deal with the matrix computation and data communication, combined with the multi-thread parallelism of OpenMP parallel programming (the number of OpenMP threads is Nt = 2) to share the memory usage in the LR-TDDFT calculation.

In the LR-TDDFT calculation, matrix operations and data communication often rely on BLAS, LAPACK and FFT calls and on for-loop statements in otherwise serially executed code. These parts are parallelized by adding OpenMP directives, as in the following example of OpenMP usage within the LR-TDDFT calculation:
#pragma omp parallel
{
  #pragma omp for schedule(dynamic, 1)
  for (int i = 0; i < numit; i++) {
    // Thread-parallel work for iteration i, e.g. applying the diagonal
    // f_xc kernel to one column of P_vc.
    ...
    #pragma omp critical
    {
      // Serially executed code, e.g. accumulation into a shared buffer.
      ...
    }
  }
}
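Beyond explicit loop-level directives, the threaded FFTW library mentioned above can also be engaged per MPI rank. The following is a minimal sketch, assuming FFTW3 was built with OpenMP support and linked against libfftw3_omp; the function name setup_threaded_fft and the in-place transform are illustrative only.

#include <fftw3.h>
#include <omp.h>

// Enable OpenMP-threaded FFTW inside each MPI rank (requires FFTW built with
// --enable-openmp and linked against libfftw3_omp).
void setup_threaded_fft(int n0, int n1, int n2, fftw_complex* buf) {
    fftw_init_threads();                            // initialize FFTW threading once
    fftw_plan_with_nthreads(omp_get_max_threads()); // use all OpenMP threads of this rank

    // Subsequent plans are executed with multiple threads transparently.
    // FFTW_ESTIMATE is used here so that the data in buf is not overwritten
    // during planning.
    fftw_plan fwd = fftw_plan_dft_3d(n0, n1, n2, buf, buf,
                                     FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(fwd);       // e.g. rho_vc(r) -> rho_vc(G) for one column of P_vc
    fftw_destroy_plan(fwd);
}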
3. Results and discussion
The LR-TDDFT formulation is implemented in the PWDFT (Plane Wave Density Functional Theory)24-26 software package, which is included in the DGDFT (Discontinuous Galerkin Density Functional Theory)16-18 software package. DGDFT is a massively parallel electronic structure software package for large-scale DFT calculations of tens of thousands of atoms, which includes a self-contained module called PWDFT for performing conventional standard plane-wave-based electronic structure calculations. We adopt the Hartwigsen-Goedecker-Hutter (HGH) norm-conserving pseudopotentials32 and the LDA functional28 to describe the electronic structures of the systems. All calculations are carried out on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC). Each node consists of two Intel processors with 32 processing cores in total and 64 gigabytes (GB) of memory.

Table 2: Computational parameters of the seven bulk silicon systems (Si64, Si216, Si512, Si1000, Si1728, Si2744 and Si4096) used in the LR-TDDFT calculation, including the supercell length L (Å), the number of grid points N_r in real space, the number N_e of electrons, and the numbers N_v and N_c of selected valence and conduction orbitals.

    Bulk silicon systems    L (Å)    N_r          N_e      N_v    N_c
    Si64                    10.86    216,000      256      128    128
    Si216                   16.29    681,472      864      128    128
    Si512                   21.72    1,643,032    2,048    128    128
    Si1000                  27.15    3,241,792    4,000    128    128
    Si1728                  32.58    5,451,776    6,912    128    128
    Si2744                  38.01    8,741,816    10,976   128    128
    Si4096                  43.44    12,812,904   16,384   128    128
3.1. Computational efficiency
In this section, we measure the computational efficiency, including the strong and weak scaling behavior, of the LR-TDDFT calculation on bulk silicon systems. We consider seven different bulk silicon systems containing 64, 216, 512, 1000, 1728, 2744 and 4096 silicon atoms, labeled Si64, Si216, Si512, Si1000, Si1728, Si2744 and Si4096, respectively. All the bulk systems are closed-shell systems, and the number of occupied orbitals is N_o = N_e/2 in the DFT calculation, where N_e is the number of electrons in the system. In the LR-TDDFT calculation, we choose the number of valence and conduction orbitals N_v = N_c = 128, and thus the dimension of the LR-TDDFT Hamiltonian is N_vc = 128 x 128 = 16,384. The kinetic energy cutoff is set to E_cut = 40.0 Ha for these seven bulk silicon systems.
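As a consistency check on the grid sizes listed in Table 2 (this arithmetic is ours, using the relation from section 2.1): for Si1000 with L = 27.15 Å ≈ 51.3 Bohr and E_cut = 40 Ha,

    (N_r)_i = \sqrt{2E_{cut}}\, L_i/\pi \approx \sqrt{80} \times 51.3/\pi \approx 146,

which, rounded up to an FFT-suitable grid of 148 points per direction, gives N_r = 148^3 = 3,241,792, in agreement with Table 2.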
For the MPI parallelism, we consider seven numbers of MPI processes, Np = 32, 64, 128, 256, 512, 1,024 and 2,048. For the OpenMP parallelism, we consider six numbers of OpenMP threads, Nt = 1, 2, 4, 8, 12 and 16. The maximum total number of processing cores used in the LR-TDDFT calculation is 2,048 x 16 = 32,768.
3.1.1. Strong scaling
The most important criterion for the LR-TDDFT calculation is the parallel strong scalability on HPC architectures, which indicates how the parallel performance changes for a given system as the number of processing cores increases. Figure 4 shows the change of the wallclock time of the LR-TDDFT calculation for the Si1000 system with respect to the number of MPI processes (Np = 32, 64, 128, 256, 512, 1,024 and 2,048) with three different numbers of OpenMP threads (Nt = 1, 4 and 12).

Figure 4: The parallel strong scaling performance of the LR-TDDFT calculation. (a) The change of the total time of the LR-TDDFT calculation for the Si1000 system with respect to the number of MPI processes (Np = 32, 64, 128, 256, 512, 1,024 and 2,048) with three different numbers of OpenMP threads (Nt = 1, 4 and 12), as well as (b) and (c) the corresponding time of four sub-parts (Gemm, FFT, Syevd and MPI_Alltoall) with two different numbers of OpenMP threads (Nt = 1 and 12).
Figure 4(a) shows the total time of the LR-TDDFT calculation when OpenMP is used in our program. OpenMP balances the load among threads and thus accelerates the total calculation, but because of speed differences and overheads among the threads, OpenMP cannot accelerate the calculation without limit. In our tests, the acceleration efficiency is nearly 30% with 4 threads but only 15% with 12 threads, so 4 OpenMP threads are sufficient. Figures 4(b) and (c) show the strong scaling behavior for 1 and 12 threads. The costly parts in LR-TDDFT are Gemm, FFT, Syevd and Alltoallv. In detail, when using 32 cores without OpenMP, the times for the four expensive parts are 1257.32 s for Gemm, 101.93 and 48.17 s for the FFT and for diagonalizing the LR-TDDFT Hamiltonian, and 137.63 s for Alltoallv, respectively. When using 2,048 cores, these times are reduced to 21.4, 3.21, 22.41 and 2.19 s, respectively. According to figures 4(b) and (c), it should be noticed that three of these parts, Gemm, FFT and Alltoallv, show high parallel efficiencies of up to 98.80%, 96.85% and 98.41%, respectively, when using 2,048 cores, whereas the parallel efficiency of the expensive part that diagonalizes the LR-TDDFT Hamiltonian matrix using the Syevd subroutine is only 53.48% when using 2,048 cores.
3.1.2. Weak scaling
Another important criterion for the LR-TDDFT calculation is the parallel weak scalability on HPC architectures, which indicates how the parallel performance behaves when the problem size grows at a fixed number of cores. Figure 5 shows the change of the wallclock time of the LR-TDDFT calculation with respect to the number of silicon atoms for three different numbers of MPI processes (Np = 512, 1,024 and 2,048) and a single OpenMP thread (Nt = 1).

Figure 5: The parallel weak scaling performance of the LR-TDDFT calculation. (a) The change of the total time of the LR-TDDFT calculation with respect to the number of silicon atoms for three different numbers of MPI processes (Np = 512, 1,024 and 2,048) and a single OpenMP thread (Nt = 1), as well as (b) and (c) the corresponding time of four sub-parts (Gemm, FFT, Syevd and MPI_Alltoall) for two different numbers of MPI processes (Np = 512 and 1,024).
We demonstrate the weak scaling performance of LR-TDDFT on 512 and 1,024 cores, and figure 5(a) shows the total time. The total time for the Si64 system is 20.54 and 19.07 s on 512 and 1,024 cores, respectively, and it grows to 431.19 and 233.46 s for Si4096 on 512 and 1,024 cores, respectively. From these results, we can see that the behavior is close to ideal scaling. In detail, the three expensive parts, Gemm, FFT and MPI_Alltoall, scale linearly with the number of atoms. The time to diagonalize the LR-TDDFT Hamiltonian matrix is insensitive to the system size because the chosen N_vc does not depend on the number of atoms in the system. Therefore, this part remains cheap even for large-scale systems.
3.2. Excited states of water molecule adsorption on TiO2
Titanium dioxide (TiO2), with desirable optoelectronic properties (ideal band gap, high carrier mobility and strong visible light absorption), is a promising photocatalyst for water splitting, which initiates the light-driven splitting of water into its elemental constituents33,34. The ground- and excited-state electronic structures of water molecules adsorbed on TiO2 have been extensively studied through TDDFT calculations to understand the photocatalytic water splitting process on TiO2 35-37. Here, we perform the LR-TDDFT calculation to study the excited states of a water molecule adsorbed on the rutile TiO2(110) surface from an excitonic perspective.
Figure 6: Ground and excited states of a single water molecule adsorbed on the rutile TiO2(110) surface. Ground-state projected density of states of TiO2 with H2O (a) chemisorption and (b) dissociative adsorption. The insets show the charge density difference; green and blue regions denote charge accumulation and depletion, respectively. Pink dashed lines denote the valence band maximum (VBM). Electron (green) and hole (blue) densities of the lowest three excited states for (c-e) chemisorption and (f-h) dissociative adsorption of H2O on TiO2.
We use a 2 x 4 TiO2(110) supercell containing three O-Ti-O layers to study the ground- and excited-state electronic structures of H2O adsorbed on the TiO2(110) surface, as shown in figure 6. H2O can be chemically adsorbed or dissociated on the TiO2(110) surface. In the chemisorption configuration, H2O is adsorbed on the surface by forming a bond between its O atom and the four-coordinated Ti atom, with a bond length of 2.10 Å. The charge density difference (CDD) also shows that H2O forms a bond with the four-coordinated Ti atom (figure 6(a)). The projected density of states (PDOS) of H2O and TiO2 indicates orbital hybridization between them. In the dissociated configuration, a hydrogen atom of H2O is captured by the nearest bridging oxygen atom, forming an O-H bond with a length of 1.00 Å, and also forms a hydrogen bond H···OH with a length of 1.81 Å. The remaining OH group also forms a bond with the four-coordinated Ti atom, with a length of 1.84 Å, shorter than in the chemisorption configuration. The CDD clearly shows the bonds formed between H or OH and the bridging oxygen atom or the four-coordinated Ti atom, and the corresponding PDOS of H2O extends into the valence region of TiO2, which indicates strong hybridization between them.
When a water molecule is adsorbed on the rutile TiO2(110) surface and a photoexcitation takes place in TiO2, the photogenerated excited state relaxes towards lower-energy excited states and finally reaches the lowest excited state. Figure 6(c-h) shows the electron and hole density populations of the three lowest-lying excited states for the chemisorption and dissociative adsorption configurations of H2O on TiO2. For both chemisorption and dissociative adsorption, most of the photoexcited electrons and holes are populated in TiO2, especially the hole states, which are not populated on H2O at all. Therefore, the photocatalytic activity is weak for a single water molecule adsorbed on the rutile TiO2(110) surface.

For the photoexcited electron states, a few electrons are trapped in H2O in both adsorption configurations, and more electrons are trapped in H2O in the case of dissociative adsorption compared to chemisorption. It should be noticed that only the lowest excited state can hold the photoexcited electrons trapped in H2O. Thus, after being initially promoted into a high-energy excited state Sn, the system relaxes through Sn -> ... -> S3 -> S2 -> S1, evolving from a localized state to a charge-transfer state in TiO2 and finally to a charge-transfer state separately localized on H2O and TiO2. We expect that when more water molecules are adsorbed on the rutile TiO2(110) surface, the photocatalytic activity will be enhanced in the photocatalytic water splitting process.
4. Conclusion
In summary, we describe a massively parallel implementation of large-scale linear-response time-dependent density functional theory (LR-TDDFT) to compute the excitation energies and wavefunctions of solids with the plane-wave basis set under periodic boundary conditions in the PWDFT (Plane Wave Density Functional Theory) software package. We adopt a two-level parallelization strategy that combines the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) parallel programming to handle the matrix computation and data communication involved in constructing and diagonalizing the LR-TDDFT Hamiltonian. Our work provides numerical evidence that the LR-TDDFT calculation can scale up to 24,576 processing cores on modern heterogeneous supercomputers for studying the excited-state properties of bulk silicon systems that contain thousands of atoms (4,096 atoms). Enabling accurate and efficient predictions of the response properties of large systems will potentially open up a broad range of new applications in materials science and photocatalysis in the near future. We also apply the LR-TDDFT calculation to study the photoinduced charge separation of a water molecule adsorbed on the rutile TiO2(110) surface from an excitonic perspective.
Acknowledgments
This work is partly supported by the National Natural Science Foundation of China (21688102, 21803066), by the Chinese Academy of Sciences Pioneer Hundred Talents Program (KJ2340000031), by the National Key Research and Development Program of China (2016YFA0200604), the Anhui Initiative in Quantum Information Technologies (AHY090400), the Strategic Priority Research Program of the Chinese Academy of Sciences (XDC01040100), the Fundamental Research Funds for the Central Universities (WK2340000091), the Supercomputer Application Project Trial Funding from Wuxi Jiangnan Institute of Computing Technology (BB2340000016), the Research Start-Up Grants (KY2340000094) and the Academic Leading Talents Training Program (KY2340000103) from the University of Science and Technology of China. The authors thank the National Energy Research Scientific Computing Center (NERSC), the Supercomputing Center of the Chinese Academy of Sciences, the Supercomputing Center of USTC, the National Supercomputing Centers in Wuxi and Tianjin, and the Shanghai and Guangzhou Supercomputing Centers for the computational resources.
Appendix
Linear-response time-dependent density functional theory is a very widely used method, which applies whenever one considers the response to a weak perturbation. In this section, we give a brief overview of the basic linear-response formalism7.

KS-DFT introduces an exact mapping of the interacting N-particle problem onto a suitable effective noninteracting system. In the noninteracting system, the ground and excited states can be given by Slater determinants constructed from KS orbitals. The response function of the KS system is given by

    \chi_0(r, r', \omega) = \lim_{\eta \to 0} \sum_{i_v, i_c} \left[ \frac{\psi_{i_v}^*(r)\, \psi_{i_c}(r)\, \psi_{i_c}^*(r')\, \psi_{i_v}(r')}{\omega - \omega_{i_v i_c} + i\eta} - \frac{\psi_{i_v}^*(r)\, \psi_{i_c}(r)\, \psi_{i_c}^*(r')\, \psi_{i_v}(r')}{\omega + \omega_{i_v i_c} - i\eta} \right]    (A.1)
where \psi are the KS orbitals and \omega_{i_v i_c} = \varepsilon_{i_c} - \varepsilon_{i_v} are the differences between KS eigenvalues. The effective potential acting on each particle is (except for hybrid functionals)

    V_{eff}[\rho](r, t) = V_{ext}[\rho](r, t) + \int d^3r' \frac{\rho(r', t)}{|r - r'|} + V_{xc}[\rho](r, t)    (A.2)

Here, the first term is the external potential, and the second and third terms are the Hartree and exchange-correlation potentials, respectively.
If we write V_{ext}(r, t) = V_{ext}^{0}[\rho](r) + V_{ext}'[\rho_1](r, t)\,\theta(t - t_0), the external perturbation is switched on at t_0. With this assumption, the first-order perturbation of the potential can be written as

    V_{eff}'[\rho_1](r, t) = V_{ext}'[\rho_1](r, t) + \int d^3r' \frac{\rho_1(r', t)}{|r - r'|} + V_{xc}'[\rho_1](r', t)    (A.3)

The last term is the exchange-correlation perturbation; using a Taylor expansion and keeping only the term linear in \rho_1,

    V_{xc}'[\rho_1](r, t) = \int dt' \int d^3r' \left. \frac{\delta V_{xc}[\rho](r, t)}{\delta \rho(r', t')} \right|_{\rho_0} \rho_1(r', t')    (A.4)
We define

    f_{xc}(r, r', t, t') = \left. \frac{\delta V_{xc}[\rho](r, t)}{\delta \rho(r', t')} \right|_{\rho_0}    (A.5)

which is the so-called time-dependent xc kernel. In the adiabatic case, we can approximate

    f_{xc}(r, r', t, t') = f_{xc}(r, r', t - t') = f_{xc}(r, r')\,\delta(t - t')    (A.6)
and the density response is given by

    \rho_1(r, t) = \int dr' \int dt' \, \chi_0(r, r', t, t')\, V_{eff}'[\rho_1](r', t')    (A.7)

To obtain the eigenmodes of the system, we take the limit V_{ext}' \to 0:

    V_{eff}'[\rho_1](r, t) = \int d^3r' \int dt' \left( \frac{1}{|r - r'|} + f_{xc}(r, r') \right) \delta(t - t')\, \rho_1(r', t')    (A.8)

This equation depends on t and t' only through t - t', so we can Fourier transform to frequency space. Defining

    f_{Hxc}(r, r') = \frac{1}{|r - r'|} + f_{xc}(r, r')    (A.9)

the density response becomes

    \rho_1(r, \omega) = \int dr' \, \chi_0(r, r', \omega) \int dr'' \, f_{Hxc}(r', r'')\, \rho_1(r'', \omega)    (A.10)
We then use some algebraic substitutions to obtain the final equation. Define

    g(r, \omega) = \int dr' \, f_{Hxc}(r, r')\, \rho_1(r', \omega)

so that

    g(r, \omega) = \int dr' \int dr'' \, f_{Hxc}(r, r')\, \chi_0(r', r'', \omega)\, g(r'', \omega)

Substituting \chi_0 from eq. (A.1) and considering only the poles in the upper half plane, we define

    H_{i_v i_c}(\omega) = \int dr' \, \psi_{i_v}^*(r')\, \psi_{i_c}(r')\, g(r', \omega)    (A.11)

and

    V_{j_v j_c, i_v i_c} = \int dr \int dr' \, \psi_{j_v}^*(r)\, \psi_{j_c}(r)\, f_{Hxc}(r, r')\, \psi_{i_v}^*(r')\, \psi_{i_c}(r')

    X_{i_v i_c} = \frac{H_{i_v i_c}}{\omega - \omega_{i_v i_c}}

Finally, the equation reads

    \omega X_{j_v j_c} = \sum_{i_v, i_c} \left( \omega_{i_v i_c}\, \delta_{j_v i_v}\, \delta_{j_c i_c} + V_{j_v j_c, i_v i_c} \right) X_{i_v i_c}    (A.12)

which is the Casida equation.
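For reference, eq. (A.12) can be restated compactly in matrix form; the restatement below is ours and uses the per-transition notation of this appendix.

\[
  H X = \omega X, \qquad
  H_{j_v j_c,\, i_v i_c} = \omega_{i_v i_c}\, \delta_{j_v i_v}\, \delta_{j_c i_c}
                          + V_{j_v j_c,\, i_v i_c},
\]

i.e. an eigenvalue problem of the same form as eq. (1) of the main text, with the diagonal part collecting the Kohn-Sham eigenvalue differences (the matrix D) and V the Hartree-exchange-correlation coupling. In the spin-restricted closed-shell case used in section 2.1, the spin summation supplies the factor of 2 in H = D + 2V_Hxc of eq. (6) under the Tamm-Dancoff approximation.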
References

[1] Runge E and Gross E K U 1984 Phys. Rev. Lett. 52 997
[2] Beck T L 2000 Rev. Mod. Phys. 72 1041
[3] Onida G, Reining L and Rubio A 2002 Rev. Mod. Phys. 74 601
[4] Yabana K and Bertsch G F 1996 Phys. Rev. B 54 4484
[5] Casida M E 1995 Recent Advances in Density Functional Methods (Part I) p 155
[6] Sternheimer R M 1954 Phys. Rev. 96 951
[7] Dreuw A and Head-Gordon M 2005 Chem. Rev. 105 4009–4037
[8] Soler J M, Artacho E, Gale J D et al. 2002 J. Phys.: Condens. Matter 14 2745
[9] VandeVondele J, Borstnik U and Hutter J 2012 J. Chem. Theory Comput. 8 3565–3573
[10] Gillan M J, Bowler D R, Torralba A S and Miyazaki T 2007 Comput. Phys. Commun. 177 14–18
[11] Marek A, Blum V, Johanni R, Havu V et al. 2014 J. Phys.: Condens. Matter 26 213201
[12] Genovese L, Neelov A, Goedecker S et al. 2008 J. Chem. Phys. 129 014109
[13] Qin X, Shang H, Xiang H, Li Z and Yang J 2015 Int. J. Quantum Chem. 115 647–655
[14] Qin X, Shang H, Xu L, Hu W, Yang J, Li S and Zhang Y 2019 Int. J. High Perform. Comput. Appl. 34 159–168
[15] Shang H, Xu L, Wu B, Qin X, Zhang Y and Yang J 2020 Comput. Phys. Commun. 254 107204
[16] Lin L, Lu J, Ying L and E W 2012 J. Comput. Phys. 231 2140–2154
[17] Hu W, Lin L and Yang C 2015 J. Chem. Phys. 143 124110
[18] Hu W, Qin X, Jiang Q, Chen J, An H, Jia W, Li F, Liu X, Chen D, Liu F, Zhao Y and Yang J 2020 Sci. Bull.
[19] Valiev M, Bylaska E J, Govind N et al. 2010 Comput. Phys. Commun. 181 1477–1489
[20] Shao Y, Gan Z, Epifanovsky E et al. 2015 Mol. Phys. 113 184–215
[21] Aprà E, Bylaska E J, de Jong W A et al. 2020 J. Chem. Phys. 152 184102
[22] Malcıoğlu O B, Gebauer R, Rocca D and Baroni S 2011 Comput. Phys. Commun. 182 1744–1754
[23] Ge X, Binnie S J, Rocca D, Gebauer R and Baroni S 2014 Comput. Phys. Commun. 185 2080–2089
[24] Hu W, Lin L, Banerjee A S, Vecharynski E and Yang C 2017 J. Chem. Theory Comput. 13 1188–1198
[25] Hu W, Lin L and Yang C 2017 J. Chem. Theory Comput. 13 5420–5431
[26] Hu W, Lin L and Yang C 2017 J. Chem. Theory Comput. 13 5458–5467
[27] Khatri C G and Rao C R 1968 Sankhya 30 167–180
[28] Goedecker S, Teter M and Hutter J 1996 Phys. Rev. B 54 1703
[29] Wilkinson K A, Hine N D M and Skylaris C K 2014 J. Chem. Theory Comput. 10 4782–4794
[30] Duff I, Heroux M and Pozo R 2002 ACM Trans. Math. Softw. 28 239–267
[31] Frigo M and Johnson S G 2005 Proceedings of the IEEE 93 216–231
[32] Hartwigsen C, Goedecker S and Hutter J 1998 Phys. Rev. B 58 3641
[33] Tan S, Feng H, Ji Y, Wang Y, Zhao J, Zhao A, Wang B, Luo Y, Yang J and Hou J 2012 J. Am. Chem. Soc. 134 9978–9985
[34] Onda K, Li B, Zhao J, Jordan K D, Yang J and Petek H 2005 Science 308 1154–1158
[35] Sun H, Zheng Q, Lu W and Zhao J 2019 J. Phys.: Condens. Matter 31 114004
[36] Migani A, Mowbray D J, Zhao J and Petek H 2015 J. Chem. Theory Comput. 11 239–251
[37] Sun H, Mowbray D J, Migani A, Zhao J, Petek H and Rubio A 2015 ACS Catal. 5 4242–4254