Hybrid MPI and OpenMP parallel
implementation of large-scale linear-response
time-dependent density functional theory with
plane-wave basis set
Lingyun Wan, Xiaofeng Liu, Jie Liu, Wei Hu and Jinlong Yang
Hefei National Laboratory for Physical Sciences at the Microscale, Department of Chemical Physics, and Synergetic Innovation Center of Quantum Information and Quantum Physics, University of Science and Technology of China, Hefei, Anhui 230026, China
E-mail: whuustc@ustc.edu.cn (Wei Hu), jlyang@ustc.edu.cn (Jinlong Yang)
September 2020
Abstract.
High performance computing (HPC) is a powerful tool to accelerate Kohn-Sham density functional theory (KS-DFT) calculations on modern heterogeneous supercomputers. Here, we describe a massively parallel implementation of large-scale linear-response time-dependent density functional theory (LR-TDDFT) to compute the excitation energies and wavefunctions of solids with the plane-wave basis set under the periodic boundary condition in the Plane Wave Density Functional Theory (PWDFT) software package. We adopt a two-level parallelization strategy that combines the Message Passing Interface (MPI) with Open Multi-Processing (OpenMP) parallel programming to deal with the matrix computation and data communication involved in constructing and diagonalizing the LR-TDDFT Hamiltonian matrix. Numerical results illustrate that the LR-TDDFT calculations can scale up to 24,576 processing cores on modern heterogeneous supercomputers to study the excited-state properties of bulk silicon systems containing thousands of atoms (4,096 atoms). We demonstrate that the LR-TDDFT calculations can be used to study the photoinduced charge separation of water molecules adsorbed on the rutile TiO2(110) surface from an excitonic perspective.
Keywords: time-dependent density functional theory, thousands of atoms, MPI and OpenMP,
high performance computing
1. Introduction
Time-dependent density functional theory (TDDFT), based on the Runge-Gross theorem [1], is a self-consistent theoretical framework to describe the excited-state properties of molecules and solids. In general, there are two schemes for solving the time-dependent Schrödinger equation in the framework of TDDFT [2,3]. The first approach, referred to as real-time TDDFT (RT-TDDFT), solves the problem by propagating the wavefunctions over time on a real-space grid, using molecular-dynamics-like time evolution [4]. The second one, which is the most widely used and often simply referred to as TDDFT in the literature, solves the many-body quantum problem in the frequency domain via a Fourier transformation of the time-dependent linear response function, and obtains excitation energies and the corresponding oscillator strengths from the poles and residues of the complex response function [5,6]. In order to distinguish this approach from the first one, it is referred to as linear-response TDDFT (LR-TDDFT).
The common way to obtain the excitation energies and wavefunctions in LR-TDDFT is to solve the linear-response Casida equation [5]. There are two main components when solving the Casida equation: one is to construct the LR-TDDFT Hamiltonian, with a high computational complexity of O(N_e^5), and the other is to diagonalize the LR-TDDFT Hamiltonian, with an ultrahigh computational complexity of O(N_e^6), where N_e is the number of electrons in the system. Therefore, the computational cost and memory usage of the LR-TDDFT calculation become prohibitively expensive as the system size increases [2,3]. Several low-scaling methods have been proposed to reduce the ultrahigh computational cost and memory usage of the TDDFT calculation with small localized basis sets [7], such as Gaussian-type orbitals or numerical atomic orbitals. However, it is still challenging to simulate the excited-state properties of large-scale systems containing thousands of atoms with the LR-TDDFT calculation, especially for the plane-wave basis set under the periodic boundary condition.
In recent years, the rapid development of modern heterogeneous supercomputers has made high performance computing (HPC) a powerful tool to accelerate KS-DFT calculations for large-scale systems and to provide huge memory space. In particular, several highly efficient HPC KS-DFT software packages for ground-state electronic structure calculations with small localized basis sets have been developed, such as SIESTA [8], CP2K [9], CONQUEST [10], FHI-aims [11], BigDFT [12], HONPAS [13-15] and DGDFT [16-18], which take advantage of the massive parallelism available on modern heterogeneous supercomputers. For example, large-scale KS-DFT calculations containing tens of thousands of atoms have been performed in CP2K [9], CONQUEST [10] and DGDFT [17,18], which can scale up to hundreds of thousands of cores on the Cray, Edison, Cori and Sunway TaihuLight supercomputers.
Based on such large-scale DFT calculations, large-scale TDDFT calculations become reachable. In particular, large-scale TDDFT excited-state electronic structure calculations using Gaussian basis sets have been implemented in NWChem [19] and Q-Chem [20]. For example, NWChem [21] has performed medium-scale LR-TDDFT excited-state calculations on the Au20Ne100 cluster (120 atoms and 1,840 basis functions), scaling up to 2,250 processing cores on the CINECA supercomputer. However, there has been no comparable breakthrough for the plane-wave basis set, owing to the ultrahigh computational cost and memory usage of the LR-TDDFT excited-state electronic structure calculation [22,23].
In the present work, we describe a massively parallel implementation of LR-TDDFT to compute the excited-state electronic structure of solids with the plane-wave basis set under the periodic boundary condition in the PWDFT (Plane Wave Density Functional Theory) [24-26] software package. We adopt a two-level parallelization strategy that combines the Message Passing Interface (MPI) with Open Multi-Processing (OpenMP) parallel programming to deal with the ultrahigh computational cost and memory usage of constructing and diagonalizing the LR-TDDFT Hamiltonian. We show that the LR-TDDFT calculation can scale up to 24,576 processing cores on modern heterogeneous supercomputers for simulating the excited-state properties of bulk silicon systems that contain thousands of atoms. We perform the LR-TDDFT calculation to study the excited states of water molecules adsorbed on the rutile TiO2(110) surface from an excitonic perspective.
2. Methodology
In this section, we describe the theoretical algorithms and parallel implementation of the LR-TDDFT calculation with the plane-wave basis set under the periodic boundary condition in the PWDFT [24-26] software package. The implementation is based on a two-level parallelization strategy that combines the Message Passing Interface (MPI) with Open Multi-Processing (OpenMP) parallel programming to deal with the ultrahigh computational cost and memory usage of constructing and diagonalizing the LR-TDDFT Hamiltonian matrix, enabling HPC excited-state electronic structure calculations on modern heterogeneous supercomputers.
2.1. Linear response time-dependent density functional theory
The Casida equation in linear-response time-dependent density functional theory (LR-TDDFT) can be written as an eigenvalue problem of the form (the detailed derivation of the Casida equation is given in the Appendix)

\[
HX = X\Lambda, \qquad (1)
\]

where X contains the expansion coefficients of the excitation wavefunctions in the KS orbitals and Λ contains the corresponding excitation energies. We call H the LR-TDDFT Hamiltonian because its form is similar to that of a Hamiltonian. H has the following block structure

\[
H = \begin{pmatrix} D + 2V_{Hxc} & 2W_{Hxc} \\ -2W_{Hxc}^{*} & -D - 2V_{Hxc}^{*} \end{pmatrix}. \qquad (2)
\]
V_Hxc and W_Hxc represent the Hartree-exchange-correlation integrals defined as

\[
V_{Hxc}(i_v i_c, j_v j_c) = \int \psi_{i_v}^{*}(\mathbf{r}) \psi_{i_c}(\mathbf{r}) f_{Hxc}(\mathbf{r}, \mathbf{r}') \psi_{j_v}(\mathbf{r}') \psi_{j_c}^{*}(\mathbf{r}') \, d\mathbf{r} \, d\mathbf{r}',
\]
\[
W_{Hxc}(i_v i_c, j_v j_c) = \int \psi_{i_v}^{*}(\mathbf{r}) \psi_{i_c}(\mathbf{r}) f_{Hxc}(\mathbf{r}, \mathbf{r}') \psi_{j_v}^{*}(\mathbf{r}') \psi_{j_c}(\mathbf{r}') \, d\mathbf{r} \, d\mathbf{r}'. \qquad (3)
\]

Here, f_Hxc is called the Hartree-exchange-correlation kernel, and it is expressed in terms of V_H and V_xc after the self-consistent field process as

\[
f_{Hxc}(\mathbf{r}, \mathbf{r}') = f_{H} + f_{xc} = \frac{1}{|\mathbf{r} - \mathbf{r}'|} + \frac{\delta V_{xc}[\rho]}{\delta \rho(\mathbf{r}')}, \qquad (4)
\]
where \rho(\mathbf{r}) = \sum_{i_v}^{N_e} |\psi_{i_v}(\mathbf{r})|^2 is the electron density and N_e is the number of electrons in the ground state.

In the LR-TDDFT Hamiltonian, D is an N_vc × N_vc (N_vc = N_v N_c) diagonal matrix with matrix elements

\[
D(i_v i_c, j_v j_c) = (\varepsilon_{i_c} - \varepsilon_{i_v}) \, \delta_{i_v j_v} \delta_{i_c j_c}, \qquad (5)
\]

where the orbital energies ε_{i_v} (i_v = 1, 2, ..., N_v) and ε_{i_c} (i_c = 1, 2, ..., N_c) are associated with the selected valence orbitals ψ_{i_v}(r) and conduction orbitals ψ_{i_c}(r), respectively, used in the LR-TDDFT calculation. These energies and orbitals are typically obtained from a KS-DFT calculation.
Under the Tamm-Dancoff approximation (TDA) [3], W_Hxc can be neglected and H becomes a Hermitian matrix, rewritten as

\[
H = D + 2V_{Hxc}. \qquad (6)
\]
In real space, we choose N_r grid points to sample the wavefunctions, so the Hartree-exchange-correlation integrals V_Hxc can be rewritten as a product of the matrix f_Hxc ∈ C^{N_r×N_r} with the transposed Khatri-Rao product (also known as the face-splitting product) [27] matrix P_vc = {ρ_vc(r) := ψ_{i_v}^*(r) ψ_{i_c}(r)} ∈ C^{N_r×N_vc} of the valence and conduction orbitals (ψ_{i_v}(r) and ψ_{i_c}(r)) on the real-space grid {r_i}_{i=1}^{N_r}:

\[
V_{Hxc} = P_{vc}^{\dagger} f_{Hxc} P_{vc} = P_{vc}^{\dagger} (f_{H} + f_{xc}) P_{vc}. \qquad (7)
\]
For simplicity, we only consider the local-density approximation (LDA) functional [28] in the KS-DFT and LR-TDDFT calculations in this work. In that case, the LDA exchange-correlation kernel f_xc is diagonal in real space ({r_i}_{i=1}^{N_r}), so the exchange-correlation product state f_xc P_vc can be computed directly and efficiently in real space with a computational complexity of N_r N_v N_c ∼ O(N_e^3). To apply the Hartree operator to the product state, f_H P_vc, we need to transform it into reciprocal space ({G_i}_{i=1}^{N_g}, where N_g is the number of grid points in reciprocal space), where the Hartree operator f_H is diagonal. We denote it by \tilde{f}_H(G) = 4π/|G|^2 (note that \tilde{f}_H(G = 0) = 0), so that f_H P_vc = \tilde{f}_H \tilde{P}_vc can be computed efficiently in reciprocal space, where \tilde{P}_vc = {\tilde{ρ}_vc(G)} ∈ C^{N_g×N_vc} is the reciprocal-space representation of P_vc. The total number of grid points N_r in real space is determined from the kinetic energy cutoff E_cut through (N_r)_i = \sqrt{2E_{cut}} L_i/π, where L_i is the length of the supercell along the i-th (x, y and z) coordinate direction.
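For example, plugging in the parameters used in section 3 (E_cut = 40 Ha and the Si64 supercell from Table 2), and assuming the per-dimension count is rounded up to an FFT-friendly size (our assumption):

\[
L = 10.86\ \text{Å} \approx 20.52\ \text{Bohr}, \qquad
(N_r)_i = \frac{\sqrt{2 \times 40}\,\times 20.52}{\pi} \approx 58.4 \;\rightarrow\; 60,
\qquad N_r = 60^3 = 216{,}000,
\]

which matches the N_r value listed for Si64 in Table 2.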
In this work, we utilize fast Fourier transforms (FFTs) to transform between real space ({r_i}_{i=1}^{N_r}) and reciprocal space ({G_i}_{i=1}^{N_g}), so the total number of reciprocal-space grid points N_g is equal to the number of real-space grid points N_r. In addition to performing N_v N_c ∼ O(N_e^2) FFTs to transform P_vc = {ρ_vc(r)} into \tilde{P}_vc = {\tilde{ρ}_vc(G)}, applying the Hartree operator \tilde{f}_H to \tilde{P}_vc requires a computational complexity of N_g N_v N_c ∼ O(N_e^3). After these steps, we apply inverse FFTs to the Hartree product matrices, add the two results, and multiply by the conjugate transpose of the product matrix, P_vc^{\dagger}, to compute the Hartree-exchange-correlation integrals.
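To make this data flow concrete, the following is a compact serial sketch (our illustration, not the PWDFT implementation) of the construction of V_Hxc in equation (7) using FFTW: form the columns of P_vc, apply f_xc in real space and the Hartree kernel in reciprocal space, and contract with P_vc^†. The variable names, the cubic-grid assumption and the omission of grid quadrature weights, k-points and parallel distribution are ours.

// Serial sketch of building V_Hxc (illustrative only; the distributed code never
// holds the full P_vc on a single process).
#include <algorithm>
#include <cmath>
#include <complex>
#include <vector>
#include <fftw3.h>

using cplx = std::complex<double>;

// n0,n1,n2: FFT grid dimensions (Nr = n0*n1*n2); psi_v, psi_c: orbitals on the
// real-space grid (column-major, Nr x Nv and Nr x Nc); gsq: |G|^2 on the
// reciprocal grid; fxc: LDA xc kernel on the real-space grid.
std::vector<cplx> build_VHxc(int n0, int n1, int n2, int Nv, int Nc,
                             const std::vector<cplx>& psi_v,
                             const std::vector<cplx>& psi_c,
                             const std::vector<double>& gsq,
                             const std::vector<double>& fxc) {
  const size_t Nr = (size_t)n0 * n1 * n2;
  const int Nvc = Nv * Nc;
  std::vector<cplx> P(Nr * Nvc), FP(Nr * Nvc), work(Nr);

  fftw_complex* w = reinterpret_cast<fftw_complex*>(work.data());
  fftw_plan fwd = fftw_plan_dft_3d(n0, n1, n2, w, w, FFTW_FORWARD,  FFTW_ESTIMATE);
  fftw_plan bwd = fftw_plan_dft_3d(n0, n1, n2, w, w, FFTW_BACKWARD, FFTW_ESTIMATE);

  for (int iv = 0; iv < Nv; ++iv) {
    for (int ic = 0; ic < Nc; ++ic) {
      const size_t col = (size_t)(iv * Nc + ic) * Nr;
      // One column of P_vc: rho_vc(r) = conj(psi_v(r)) * psi_c(r).
      for (size_t r = 0; r < Nr; ++r)
        P[col + r] = std::conj(psi_v[(size_t)iv * Nr + r]) * psi_c[(size_t)ic * Nr + r];

      // Hartree part: FFT to reciprocal space, multiply by 4*pi/|G|^2
      // (set to zero at G = 0), and transform back to real space.
      std::copy(P.begin() + col, P.begin() + col + Nr, work.begin());
      fftw_execute(fwd);
      for (size_t g = 0; g < Nr; ++g)
        work[g] *= (gsq[g] > 1e-12) ? 4.0 * M_PI / gsq[g] : 0.0;
      fftw_execute(bwd);

      // f_Hxc P_vc = f_H P_vc + f_xc P_vc (1/Nr restores the FFTW normalization).
      for (size_t r = 0; r < Nr; ++r)
        FP[col + r] = work[r] / (double)Nr + fxc[r] * P[col + r];
    }
  }
  fftw_destroy_plan(fwd);
  fftw_destroy_plan(bwd);

  // V_Hxc = P_vc^dagger (f_Hxc P_vc); in practice this is a single ZGEMM call.
  std::vector<cplx> V((size_t)Nvc * Nvc);
  for (int j = 0; j < Nvc; ++j)
    for (int i = 0; i < Nvc; ++i) {
      cplx s(0.0, 0.0);
      for (size_t r = 0; r < Nr; ++r)
        s += std::conj(P[(size_t)i * Nr + r]) * FP[(size_t)j * Nr + r];
      V[(size_t)j * Nvc + i] = s;  // column-major element (i, j)
    }
  return V;
}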
Table 1: Computational cost and memory usage in the LR-TDDFT calculation. Notice that N_r ≈ 1,000 × N_o and N_v ≈ N_c ≈ N_o ∼ O(N_e) in the plane-wave basis set.

LR-TDDFT step                Computational cost      Memory usage
FFT                          N_v N_c N_r log N_r     N_r N_v N_c
V_Hxc product (Gemm)         N_r N_v^2 N_c^2         N_v^2 N_c^2
H diagonalization (Syevd)    N_v^3 N_c^3             N_v^2 N_c^2
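For instance, the Gemm and Syevd rows follow directly from the matrix dimensions in equation (7): each of the N_vc × N_vc elements of V_Hxc is an inner product of length N_r, and the dense diagonalization of the N_vc × N_vc Hamiltonian scales cubically,

\[
\text{cost}(V_{Hxc}) \sim N_{vc}^2 \, N_r = N_r N_v^2 N_c^2, \qquad
\text{cost}(\text{diagonalization}) \sim N_{vc}^3 = N_v^3 N_c^3 .
\]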
After constructing the LR-TDDFT Hamiltonian matrix H, the next step is to diagonalize H explicitly, with an ultrahigh complexity of N_v^3 N_c^3 ∼ O(N_e^6), to obtain the matrix X of wavefunction expansion coefficients and the excitation energies Λ. The pseudocode of the LR-TDDFT calculation is shown in Algorithm 1.
Algorithm 1 The pseudocode for explicitly constructing and directly diagonalizing the LR-TDDFT Hamiltonian in the LR-TDDFT calculation in the plane-wave basis set.
Input: Ground-state energies ε_i and wavefunctions ψ_i(r).
Output: Excited-state energies {λ_i} and wavefunction expansion coefficients {x_ij}.
1: Initialize P_vc in real space and transform it into reciprocal space, \tilde{P}_vc → O(N_r N_v N_c + N_r log N_r N_v N_c)
2: Apply the exchange-correlation potential to P_vc in real space → O(N_r N_v N_c)
3: Apply the Hartree potential to \tilde{P}_vc and transform back into real space to get f_H P_vc → O(N_r log N_r N_v N_c + N_r N_v N_c)
4: Compute the Hartree-exchange-correlation integrals V_Hxc = P_vc^{\dagger} (f_H P_vc + f_xc P_vc) in real space → O(N_r N_v^2 N_c^2)
5: Diagonalize the LR-TDDFT Hamiltonian H to obtain the excitation energies {λ_i} and the wavefunction expansion coefficients {x_ij} → O(N_v^3 N_c^3)
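On a single process, step 5 reduces to one dense symmetric eigensolver call. Below is a minimal serial reference sketch using LAPACKE, assuming H is stored as a real symmetric N_vc × N_vc matrix (as in the Γ-point TDA case); the parallel implementation described in section 2.2 instead calls the ScaLAPACK Syevd routine on a distributed matrix.

// Minimal serial sketch of step 5 (our illustration): diagonalize a real
// symmetric LR-TDDFT Hamiltonian with LAPACKE.
#include <vector>
#include <lapacke.h>

// H: column-major Nvc x Nvc symmetric matrix; on return H holds the eigenvectors
// (the expansion coefficients X) and omega holds the excitation energies in
// ascending order. Returns the LAPACK info code (0 on success).
int diagonalize_H(std::vector<double>& H, std::vector<double>& omega, int Nvc) {
  omega.resize(Nvc);
  return LAPACKE_dsyevd(LAPACK_COL_MAJOR, 'V', 'U', Nvc, H.data(), Nvc, omega.data());
}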
It should be noted that for a large normalized plane-wave basis set, N_g is typically much larger than N_v or N_c. The number N_o of occupied orbitals is either N_e or N_e/2, depending on how the spin is counted. The number of conduction orbitals N_c included in the LR-TDDFT calculation is typically small, as is N_v (the precise number is a free parameter to be converged), whereas N_r and N_g are often much larger than 1,000 × N_e because of the high accuracy of the plane-wave basis set. In this work, we compute the LR-TDDFT Hamiltonian matrix in real space and set N_r = N_g for the real and reciprocal space grids. Table 1 summarizes the computational cost and memory usage for explicitly constructing and directly diagonalizing the LR-TDDFT Hamiltonian in the LR-TDDFT calculation with the plane-wave basis set.
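To give a sense of scale (our estimate, assuming double-precision storage): with N_v = N_c = 128 as used in section 3, the Hamiltonian dimension is N_vc = 16,384, so storing the dense H requires

\[
N_{vc}^2 \times 8\ \text{bytes} = 16{,}384^2 \times 8\ \text{bytes} \approx 2.1\ \text{GB}
\]

per copy (twice that in complex arithmetic), while the product matrices of size N_r × N_vc are far larger still. This is why these matrices are distributed across MPI processes rather than replicated, as described in the next section.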
2.2. Parallel implementation
The implementation of the LR-TDDFT calculation is based on a two-level parallelization strategy that exploits MPI parallel programming to deal with the matrix computation and data communication, combined with the multi-thread parallelism of OpenMP parallel programming to share the memory usage in the LR-TDDFT calculation with the plane-wave basis set.
2.2.1. MPI parallelism
In this section, we describe an efficient parallel implementation of three different types of data partition with MPI parallelism for the LR-TDDFT calculation in the plane-wave basis set, as shown in figure 1.
Figure 1: Three different types of data partition with MPI parallelism for the matrix computation and data communication used in the LR-TDDFT calculation. (a) 1D column block with a 1 × N_p MPI processor grid (red box), (b) 1D row block with an N_p × 1 MPI processor grid (blue box) and (c) 2D block cyclic with an I × J MPI processor grid (green box). N_p is the number of MPI processes used in the LR-TDDFT calculation and I × J = N_p.
When N_p processes are used in the MPI parallelism, the wavefunctions obtained from the ground-state DFT calculation are stored in the 1D column block partition shown in figure 1(a). In order to form the face-splitting product P_vc, we convert the wavefunctions to the 1D row partition shown in figure 1(b). In the LR-TDDFT Hamiltonian matrix, the Hartree operator is diagonal in reciprocal space and the exchange-correlation operator is diagonal in real space, so we use fast Fourier transforms (FFTs) to transform P_vc in real space into \tilde{P}_vc in reciprocal space and apply the Hartree operator there. In order to apply the FFTs, we convert P_vc from the 1D row partition to the 1D column partition with MPI_Alltoall, as shown in figures 1(a) and (b). After obtaining the LR-TDDFT Hamiltonian, we diagonalize it to obtain the excitation energies and excitation wavefunctions. In the diagonalization process, the 2D block cyclic partition shown in figure 1(c) is the most efficient data distribution scheme for the diagonalization implemented in the ScaLAPACK software package. The conversion between the 1D row block partition and the 2D block cyclic partition is performed with the pdgemr2d subroutine (similar to MPI_Alltoall) in the ScaLAPACK library.
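These 1D column to 1D row conversions (and their inverses) amount to an all-to-all exchange of sub-blocks between processes. A minimal sketch of this step is shown below (our illustration, assuming equal block sizes and using MPI_Alltoall; PWDFT's actual routine and the general uneven case handled with MPI_Alltoallv are more involved).

// Redistribute an Nr x Ncol matrix from a 1D column-block layout (each rank owns
// Ncol/P full-length columns) to a 1D row-block layout (each rank owns Nr/P full
// rows). Assumes Nr and Ncol are divisible by the number of processes P.
#include <complex>
#include <vector>
#include <mpi.h>

using cplx = std::complex<double>;

void col_block_to_row_block(const std::vector<cplx>& local_cols, // Nr x nc, column-major
                            std::vector<cplx>& local_rows,       // nr x Ncol, column-major
                            int Nr, int Ncol, MPI_Comm comm) {
  int P;
  MPI_Comm_size(comm, &P);
  const int nc = Ncol / P;   // local columns before redistribution
  const int nr = Nr / P;     // local rows after redistribution
  const int blk = nr * nc;   // elements exchanged between each pair of ranks

  // Pack: the block destined for rank p holds rows [p*nr, (p+1)*nr) of my nc columns.
  std::vector<cplx> sendbuf((size_t)blk * P), recvbuf((size_t)blk * P);
  for (int p = 0; p < P; ++p)
    for (int j = 0; j < nc; ++j)
      for (int i = 0; i < nr; ++i)
        sendbuf[(size_t)p * blk + (size_t)j * nr + i] =
            local_cols[(size_t)j * Nr + (size_t)p * nr + i];

  // std::complex<double> is layout-compatible with two doubles.
  MPI_Alltoall(sendbuf.data(), 2 * blk, MPI_DOUBLE,
               recvbuf.data(), 2 * blk, MPI_DOUBLE, comm);

  // Unpack: the block received from rank p holds my nr rows of columns [p*nc, (p+1)*nc).
  local_rows.resize((size_t)nr * Ncol);
  for (int p = 0; p < P; ++p)
    for (int j = 0; j < nc; ++j)
      for (int i = 0; i < nr; ++i)
        local_rows[(size_t)(p * nc + j) * nr + i] =
            recvbuf[(size_t)p * blk + (size_t)j * nr + i];
}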
Figure 2: The flowchart of the parallel implementation of the LR-TDDFT method in the PWDFT software package. There are three different types (red, blue and green boxes) of MPI data partition for the matrix computation and data communication used in the LR-TDDFT calculation.
More specifically, figure 2 shows the whole LR-TDDFT workflow and the storage formats of the various quantities involved. Starting from Ψ_iv and Ψ_ic distributed in the 1D column block partition, we first transform these wavefunctions into the 1D row block partition and apply the element-wise (Hadamard-type) product to get P_vc. In order to construct the LR-TDDFT Hamiltonian matrix, we send P_vc separately to the two branches shown in figure 2. In the right branch, we convert P_vc into the 1D column block partition and apply the Hartree operator to it; we then use inverse FFTs to transform \tilde{f}_H \tilde{P}_vc from reciprocal space back to real space, which avoids one matrix product that would otherwise be the most expensive part. In the left branch, f_xc is calculated directly from the density and the xc kernel is applied to P_vc in real space to get f_xc P_vc; at the same time, we add f_H P_vc and f_xc P_vc to get f_Hxc P_vc. We then multiply f_Hxc P_vc by P_vc^{\dagger}. Finally, we compute the differences between the Kohn-Sham energy eigenvalues and add them to P_vc^{\dagger} f_Hxc P_vc to construct the LR-TDDFT Hamiltonian matrix.
To diagonalize the LR-TDDFT Hamiltonian, we form the Hamiltonian matrix in parallel within a 2D block cyclic distribution scheme, and perform a parallel diagonalization (invoking the Syevd subroutine in the ScaLAPACK library) of the Hamiltonian on the 2D block cyclic grid. Finally, we redistribute the excitation eigenvectors with the pdgemr2d subroutine.
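A bare-bones sketch of this diagonalization step is given below (our illustration of the BLACS/ScaLAPACK calling sequence, assuming a real symmetric H; filling the local block-cyclic piece of H, error handling and the pdgemr2d redistribution of the eigenvectors are omitted, and the actual PWDFT wrappers differ).

// Parallel diagonalization on a 2D block-cyclic process grid with ScaLAPACK.
#include <algorithm>
#include <vector>
#include <mpi.h>

extern "C" {
  void Cblacs_get(int icontxt, int what, int* val);
  void Cblacs_gridinit(int* icontxt, const char* order, int nprow, int npcol);
  void Cblacs_gridinfo(int icontxt, int* nprow, int* npcol, int* myrow, int* mycol);
  int  numroc_(const int* n, const int* nb, const int* iproc, const int* isrcproc, const int* nprocs);
  void descinit_(int* desc, const int* m, const int* n, const int* mb, const int* nb,
                 const int* irsrc, const int* icsrc, const int* ictxt, const int* lld, int* info);
  void pdsyevd_(const char* jobz, const char* uplo, const int* n,
                double* a, const int* ia, const int* ja, const int* desca,
                double* w, double* z, const int* iz, const int* jz, const int* descz,
                double* work, const int* lwork, int* iwork, const int* liwork, int* info);
}

// n: global dimension (N_vc); nprow x npcol: process grid; nb: block size.
void diagonalize_block_cyclic(int n, int nprow, int npcol, int nb) {
  int ictxt, myrow, mycol, izero = 0, ione = 1, info;
  Cblacs_get(0, 0, &ictxt);
  Cblacs_gridinit(&ictxt, "Row-major", nprow, npcol);
  Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);

  // Local sizes and array descriptor shared by H and the eigenvector matrix Z.
  int mloc = numroc_(&n, &nb, &myrow, &izero, &nprow);
  int nloc = numroc_(&n, &nb, &mycol, &izero, &npcol);
  int lld = std::max(mloc, 1), desc[9];
  descinit_(desc, &n, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);

  std::vector<double> H((size_t)lld * std::max(nloc, 1)), Z(H.size()), w(n);
  // ... fill the local block-cyclic piece of H here ...

  // Workspace query (lwork = liwork = -1), then the actual solve.
  int lwork = -1, liwork = -1, itmp; double wtmp;
  pdsyevd_("V", "U", &n, H.data(), &ione, &ione, desc, w.data(),
           Z.data(), &ione, &ione, desc, &wtmp, &lwork, &itmp, &liwork, &info);
  lwork = (int)wtmp; liwork = itmp;
  std::vector<double> work(lwork); std::vector<int> iwork(liwork);
  pdsyevd_("V", "U", &n, H.data(), &ione, &ione, desc, w.data(),
           Z.data(), &ione, &ione, desc, work.data(), &lwork, iwork.data(), &liwork, &info);
  // w now holds the excitation energies and Z the distributed eigenvectors.
}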
2.2.2. Hybrid MPI and OpenMP parallelism
The hybrid MPI-OpenMP implementation can overcome memory limitations and improve the scaling by exploiting the shared memory available within an individual node. This suitability arises from the fact that MPI and OpenMP differ significantly in terms of the visibility of memory between cores [29]. In the MPI parallelism, each process has its own memory space, and the movement of data between processes is explicitly controlled using calls to the MPI application programming interface (API), such as MPI_Reduce and MPI_Alltoall. In contrast, OpenMP is a shared-memory model, and all threads spawned by a given process have access to the same memory. The relevance of hybrid approaches is currently increasing, as trends in processor development focus on increasing shared-memory parallelism (more cores per node) rather than faster processors.
Most of the memory usage and computation time in the LR-TDDFT calculation is spent in matrix operations, such as vector-vector, matrix-vector and matrix-matrix multiplications (Gemm), matrix diagonalization (Syevd) and fast Fourier transforms (FFT). All of these matrix operations can be implemented through the Basic Linear Algebra Subprograms (BLAS) [30], Linear Algebra PACKage (LAPACK) and Fastest Fourier Transform in the West (FFTW) [31] libraries. Fortunately, most of the subroutines in the BLAS, LAPACK and FFTW libraries can be effectively multi-thread accelerated with OpenMP parallelism. Furthermore, the data communication becomes expensive for large systems; OpenMP can balance the load across threads automatically, and the shared memory reduces the communication between MPI processes.
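For example, the threaded FFTW interface can be enabled once at startup so that every plan created afterwards uses the OpenMP threads of the MPI process. A minimal sketch (linking against fftw3_omp and the illustrative grid size are our assumptions):

// Enable multi-threaded FFTW inside each MPI process (link with -lfftw3_omp -lfftw3).
#include <complex>
#include <vector>
#include <omp.h>
#include <fftw3.h>

int main() {
  fftw_init_threads();                             // initialize the threaded FFTW interface
  fftw_plan_with_nthreads(omp_get_max_threads());  // subsequent plans use all OpenMP threads

  const int n = 64;                                // illustrative 64^3 grid
  std::vector<std::complex<double>> data((size_t)n * n * n);
  fftw_complex* d = reinterpret_cast<fftw_complex*>(data.data());
  fftw_plan plan = fftw_plan_dft_3d(n, n, n, d, d, FFTW_FORWARD, FFTW_ESTIMATE);

  fftw_execute(plan);                              // this transform runs multi-threaded

  fftw_destroy_plan(plan);
  fftw_cleanup_threads();
  return 0;
}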
In the LR-TDDFT calculation, matrix operations and data communication often rely on BLAS, LAPACK, FFT and for-loop statements in serially executed code. These parts are parallelized by adding OpenMP directives, as in the following example of OpenMP usage within the LR-TDDFT calculation:
Figure 3: Two-level parallelization strategy that uses MPI parallel programming (the number of MPI processes is N_p = 4) to deal with the matrix computation and data communication, combined with the multi-thread parallelism of OpenMP parallel programming (the number of OpenMP threads is N_t = 2) to share the memory usage in the LR-TDDFT calculation.
#pragma omp parallel
{
    #pragma omp for schedule(dynamic, 1)
    for (int i = 0; i < numit; i++) {
        /* thread-parallel work for iteration i */
        ...
        #pragma omp critical
        {
            /* serially executed code (e.g. accumulation into shared data) */
            ...
        }
    }
}
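As another illustrative instance of the same pattern, the element-wise product that forms one column of P_vc can be threaded with a simple parallel for loop (the function and variable names are ours):

// Threaded element-wise product forming one column of P_vc.
#include <complex>

void form_pvc_column(const std::complex<double>* psi_v,  // valence orbital on the grid
                     const std::complex<double>* psi_c,  // conduction orbital on the grid
                     std::complex<double>* rho_vc,       // output column of P_vc
                     long Nr) {
  #pragma omp parallel for schedule(static)
  for (long r = 0; r < Nr; ++r)
    rho_vc[r] = std::conj(psi_v[r]) * psi_c[r];
}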
3. Results and discussion
The LR-TDDFT formulation is implemented in the PWDFT (Plane Wave Density Functional Theory) [24-26] software package, which is included in the DGDFT (Discontinuous Galerkin Density Functional Theory) [16-18] software package. DGDFT is a massively parallel electronic structure software package for large-scale DFT calculations of tens of thousands of atoms, which includes a self-contained module called PWDFT for performing conventional standard plane-wave-based electronic structure calculations.
Table 2: Computational parameters of the seven bulk silicon systems (Si64, Si216, Si512, Si1000, Si1728, Si2744 and Si4096) used in the LR-TDDFT calculations, including the supercell length L (Å), the number N_r of real-space grid points, the number N_e of electrons, and the numbers N_v and N_c of selected valence and conduction orbitals.

Bulk silicon system    L (Å)    N_r           N_e      N_v    N_c
Si64                   10.86    216,000       256      128    128
Si216                  16.29    681,472       864      128    128
Si512                  21.72    1,643,032     2,048    128    128
Si1000                 27.15    3,241,792     4,000    128    128
Si1728                 32.58    5,451,776     6,912    128    128
Si2744                 38.01    8,741,816     10,976   128    128
Si4096                 43.44    12,812,904    16,384   128    128
We adopt the Hartwigsen-Goedecker-Hutter (HGH) norm-conserving pseudopotentials [32] and the LDA functional [28] to describe the electronic structures of the systems. All calculations are carried out on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC). Each node consists of two Intel processors with 32 processing cores in total and 64 gigabytes (GB) of memory.
3.1. Computational efficiency
In this section, we measure the computational efficiency, including the strong and weak scaling behavior, of the LR-TDDFT calculation on bulk silicon systems. We consider seven bulk silicon systems containing 64, 216, 512, 1000, 1728, 2744 and 4096 silicon atoms, labeled Si64, Si216, Si512, Si1000, Si1728, Si2744 and Si4096, respectively. All of these bulk systems are closed-shell systems, and the number of occupied orbitals is N_o = N_e/2 in the DFT calculation, where N_e is the number of electrons in the system. In the LR-TDDFT calculation, we choose the numbers of valence and conduction orbitals N_v = N_c = 128, so the dimension of the LR-TDDFT Hamiltonian is N_vc = 128 × 128 = 16,384. The kinetic energy cutoff is set to E_cut = 40.0 Ha for these seven bulk silicon systems.
For the MPI parallelism, we consider seven numbers of MPI processes, N_p = 32, 64, 128, 256, 512, 1,024 and 2,048. For the OpenMP parallelism, we consider six numbers of OpenMP threads, N_t = 1, 2, 4, 8, 12 and 16. The maximum total number of processing cores used in the LR-TDDFT calculation is 2,048 × 16 = 32,768.
3.1.1. Strong scaling
The most important criterion for the LR-TDDFT calculation is its parallel strong scalability on HPC architectures, which indicates how the parallel performance behaves for a given system when the number of processing cores increases. Figure 4 shows the wallclock time of the LR-TDDFT calculation for the Si1000 system with respect to the number of MPI processes (N_p = 32, 64, 128, 256, 512, 1,024 and 2,048) for three different numbers of OpenMP threads (N_t = 1, 4 and 12).
Figure 4: The parallel strong scaling performance of the LR-TDDFT calculation. (a) The total time of the LR-TDDFT calculation for the Si1000 system with respect to the number of MPI processes (N_p = 32, 64, 128, 256, 512, 1,024 and 2,048) for three different numbers of OpenMP threads (N_t = 1, 4 and 12), as well as (b) and (c) the corresponding time of four sub-parts (Gemm, FFT, Syevd and MPI_Alltoall) for two different numbers of OpenMP threads (N_t = 1 and 12).
Figure 4(a) shows the total time of the LR-TDDFT calculation when OpenMP is used in our program. OpenMP balances the load among threads to accelerate the program, so the shared threads accelerate the total calculation. However, because of differences in execution speed between threads, OpenMP cannot accelerate the calculation without limit. In our tests, the acceleration efficiency is nearly 30% with 4 threads but only 15% with 12 threads, so 4 OpenMP threads are sufficient. Figures 4(b) and (c) show the strong scaling behavior for 1 and 12 threads. The costly parts in LR-TDDFT are Gemm, FFT, Syevd and Alltoallv. In detail, when using 32 cores without OpenMP, the times for the four expensive parts are 1257.32 s for Gemm, 101.93 s for FFT, 48.17 s for diagonalizing the LR-TDDFT Hamiltonian, and 137.63 s for Alltoallv, respectively. When using 2,048 cores, these times are reduced to 21.4, 3.21, 22.41 and 2.19 s, respectively. According to figures 4(b) and (c), it should be noticed that three parts, Gemm, FFT and Alltoallv, show high parallel efficiencies of up to 98.80%, 96.85% and 98.41%, respectively, when using 2,048 cores, whereas the parallel efficiency of the expensive part that diagonalizes the LR-TDDFT Hamiltonian matrix with the Syevd subroutine is only 53.48% when using 2,048 cores.
3.1.2. Weak scaling
Another important criterion for the LR-TDDFT calculation is its parallel weak scalability on HPC architectures, which indicates how the parallel performance behaves when the problem size is scaled at a fixed number of cores. Figure 5 shows the wallclock time of the LR-TDDFT calculation with respect to the number of silicon atoms for three different numbers of MPI processes (N_p = 512, 1,024 and 2,048) and a single OpenMP thread (N_t = 1).
Figure 5: The parallel weak scaling performance of the LR-TDDFT calculation. (a) The total time of the LR-TDDFT calculation with respect to the number of silicon atoms for three different numbers of MPI processes (N_p = 512, 1,024 and 2,048) and a single OpenMP thread (N_t = 1), as well as (b) and (c) the corresponding time of four sub-parts (Gemm, FFT, Syevd and MPI_Alltoall) for two different numbers of MPI processes (N_p = 512 and 1,024).
We measure the weak scaling performance of LR-TDDFT on 512 and 1,024 cores, and figure 5(a) shows the total time. The total times for the Si64 system are 20.54 s and 19.07 s on 512 and 1,024 cores, respectively, and they increase to 431.19 s and 233.46 s for Si4096. From these results, we can see that the scaling is close to ideal. In detail, the three expensive parts, Gemm, FFT and MPI_Alltoall, scale linearly with the number of atoms. The time to diagonalize the LR-TDDFT Hamiltonian matrix is insensitive to the size of the system, because the chosen N_vc does not depend on the number of atoms in the system. Therefore, this part remains cheap even for large-scale systems.
3.2. Excited states of water molecule adsorption on TiO2
Titanium dioxide (TiO2), with desirable optoelectronic properties (an ideal band gap, high carrier mobility and strong visible light absorption), is a promising photocatalyst for water splitting, which initiates the light-driven splitting of water into its elemental constituents [33,34]. The ground- and excited-state electronic structures of water molecules adsorbed on TiO2 have been extensively studied through TDDFT calculations to understand the photocatalytic water splitting process on TiO2 [35-37]. Here, we perform the LR-TDDFT calculation to study the excited states of water molecule adsorption on the rutile TiO2(110) surface from an excitonic perspective.
Figure 6: Ground and excited states of a single water molecule adsorbed on the rutile TiO2(110) surface. Ground-state projected density of states of TiO2 with H2O in (a) the chemisorption and (b) the dissociative adsorption configuration. The insets show the charge density difference; green and blue regions denote charge accumulation and depletion. Pink dashed lines denote the valence band maximum (VBM). Electron (green) and hole (blue) densities of the lowest three excited states for (c-e) chemisorption and (f-h) dissociative adsorption of H2O on TiO2.
We use a 2 × 4 TiO2(110) supercell containing three O-Ti-O layers to study the ground- and excited-state electronic structures of H2O adsorbed on the TiO2(110) surface, as shown in figure 6. H2O can be chemically adsorbed or dissociated on the TiO2(110) surface. In the chemisorption configuration, H2O is adsorbed on the surface by forming a bond between its O atom and the four-coordinated Ti atom, with a bond length of 2.10 Å. The charge density difference (CDD) also shows that H2O forms a bond with the four-coordinated Ti atom (figure 6(a)). The projected density of states (PDOS) of H2O and TiO2 indicates orbital hybridization between them. In the dissociated configuration, a hydrogen atom of H2O is captured by the nearest bridging oxygen atom, forming an O-H bond with a length of 1.00 Å, and also forms a hydrogen bond H···OH with a bond length of 1.81 Å. The remaining OH also forms a bond with the four-coordinated Ti atom, with a bond length of 1.84 Å, shorter than that in the chemisorption configuration. The CDD clearly shows the bonds formed between H or OH and the bridging oxygen atom or the four-coordinated Ti atom, and the corresponding PDOS of H2O extends into the valence region of TiO2, which indicates strong hybridization between them.
When a water molecule is adsorbed on the rutile TiO2(110) surface and a photoexcitation takes place in TiO2, the photogenerated excited state relaxes towards lower-energy excited states and finally reaches the lowest excited state. Figures 6(c-h) show the electron and hole density populations of the three lowest-lying excited states for the chemisorption and dissociative adsorption configurations of H2O on TiO2. For both the chemisorption and dissociative adsorption, most of the photoexcited electrons and holes are populated in TiO2; in particular, the hole states are not populated on H2O at all. Therefore, the photocatalytic activity is weak for a single water molecule adsorbed on the rutile TiO2(110) surface.
For the photoexcited electron states, a few electrons are trapped on H2O in both adsorption configurations, and more electrons are trapped on H2O in the dissociative adsorption case than in the chemisorption case. It should be noticed that only the lowest excited state can hold the photoexcited electrons trapped on H2O. Thus, after being initially promoted into a high-energy excited state S_n, the system relaxes through S_n → ... → S_3 → S_2 → S_1, evolving from a localized state to a charge-transfer state in TiO2 and then to a charge-transfer state separately localized on H2O and TiO2. We expect that when more water molecules are adsorbed on the rutile TiO2(110) surface, the photocatalytic activity should be enhanced in the photocatalytic water splitting process.
4. Conclusion
In summary, we describe a massively parallel implementation of large-scale linear-response time-dependent density functional theory (LR-TDDFT) to compute the excitation energies and wavefunctions of solids with the plane-wave basis set under the periodic boundary condition in the PWDFT (Plane Wave Density Functional Theory) software package. We adopt a two-level parallelization strategy that combines the Message Passing Interface (MPI) with Open Multi-Processing (OpenMP) parallel programming to deal with the matrix computation and data communication of constructing and diagonalizing the LR-TDDFT Hamiltonian. Our work provides numerical evidence that the LR-TDDFT calculation can scale up to 24,576 processing cores on modern heterogeneous supercomputers for studying the excited-state properties of bulk silicon systems that contain thousands of atoms (4,096 atoms). Enabling accurate and efficient predictions of the response properties of large systems will potentially open up a broad range of new applications in materials science and photocatalysis in the near future. We also apply the LR-TDDFT calculation to study the photoinduced charge separation of water molecules adsorbed on the rutile TiO2(110) surface from an excitonic perspective.
Acknowledgments
This work is partly supported by the National Natural Science Foundation of China (21688102, 21803066), by the Chinese Academy of Sciences Pioneer Hundred Talents Program (KJ2340000031), by the National Key Research and Development Program of China (2016YFA0200604), the Anhui Initiative in Quantum Information Technologies (AHY090400), the Strategic Priority Research Program of the Chinese Academy of Sciences (XDC01040100), the Fundamental Research Funds for the Central Universities (WK2340000091), the Supercomputer Application Project Trail Funding from Wuxi Jiangnan Institute of Computing Technology (BB2340000016), the Research Start-Up Grants (KY2340000094) and the Academic Leading Talents Training Program (KY2340000103) from the University of Science and Technology of China. The authors thank the National Energy Research Scientific Computing Center (NERSC), the Supercomputing Center of the Chinese Academy of Sciences, the Supercomputing Center of USTC, the National Supercomputing Center in Wuxi, and the Tianjin, Shanghai, and Guangzhou Supercomputing Centers for the computational resources.
Appendix
Linear-response time-dependent density functional theory is a very widely used method, which applies whenever one considers the response to a weak perturbation. In this section, we give a brief overview of the basic linear-response formalism [7].
KS-DFT introduces an exact mapping of the interacting N-particle problem onto a suitable effective noninteracting system. In the noninteracting system, the ground and excited states can be given by Slater determinants constructed from KS orbitals. The response function of the KS system can be written in the form
\[
\chi_0(\mathbf{r}, \mathbf{r}', \omega) = \lim_{\eta \to 0^+} \sum_{i_v, i_c}
\left[ \frac{\psi_{i_v}^{*}(\mathbf{r}) \psi_{i_c}(\mathbf{r}) \psi_{i_c}^{*}(\mathbf{r}') \psi_{i_v}(\mathbf{r}')}{\omega - \omega_{i_v i_c} + i\eta}
- \frac{\psi_{i_v}^{*}(\mathbf{r}) \psi_{i_c}(\mathbf{r}) \psi_{i_c}^{*}(\mathbf{r}') \psi_{i_v}(\mathbf{r}')}{\omega + \omega_{i_v i_c} + i\eta} \right], \qquad (A.1)
\]

where ψ are the KS orbitals and ω_{i_v i_c} = ε_{i_c} - ε_{i_v} are the differences between KS eigenvalues. The effective potential acting on every particle is (except for hybrid functionals)

\[
V_{eff}[\rho](\mathbf{r}, t) = V_{ext}[\rho](\mathbf{r}, t) + \int d^3 r' \, \frac{\rho(\mathbf{r}', t)}{|\mathbf{r} - \mathbf{r}'|} + V_{xc}[\rho](\mathbf{r}, t), \qquad (A.2)
\]

where the first term is the external potential, and the second and third terms are the Hartree and exchange-correlation potentials.
If we write V_ext(r, t) = V_ext^0[ρ](r) + V'_ext[ρ_1](r, t) θ(t - t_0), the external perturbation is switched on at t_0. Under this assumption, the first-order perturbation of the potential can be written as

\[
V'_{eff}[\rho_1](\mathbf{r}, t) = V'_{ext}[\rho_1](\mathbf{r}, t) + \int d^3 r' \, \frac{\rho_1(\mathbf{r}', t)}{|\mathbf{r} - \mathbf{r}'|} + V'_{xc}[\rho_1](\mathbf{r}, t). \qquad (A.3)
\]

The last term is the exchange-correlation perturbation; using a Taylor expansion, we keep only the term linear in ρ_1:

\[
V'_{xc}[\rho_1](\mathbf{r}, t) = \int dt' \int d^3 r' \, \left. \frac{\delta V_{xc}[\rho](\mathbf{r}, t)}{\delta \rho(\mathbf{r}', t')} \right|_{\rho_0} \rho_1(\mathbf{r}', t'). \qquad (A.4)
\]

We define

\[
f_{xc}(\mathbf{r}, \mathbf{r}', t, t') = \left. \frac{\delta V_{xc}[\rho](\mathbf{r}, t)}{\delta \rho(\mathbf{r}', t')} \right|_{\rho_0}, \qquad (A.5)
\]

which is the so-called time-dependent xc kernel. In the adiabatic case, we can approximate

\[
f_{xc}(\mathbf{r}, \mathbf{r}', t, t') = f_{xc}(\mathbf{r}, \mathbf{r}', t - t') = f_{xc}(\mathbf{r}, \mathbf{r}') \, \delta(t - t'), \qquad (A.6)
\]

and the density response is given by

\[
\rho_1(\mathbf{r}, t) = \int d\mathbf{r}' \int dt' \, \chi_0(\mathbf{r}, \mathbf{r}', t, t') \, V'_{eff}[\rho_1](\mathbf{r}', t'). \qquad (A.7)
\]
If we want to obtain the eigenmodes of the system, we can take the limit V_ext → 0:

\[
V'_{eff}[\rho_1](\mathbf{r}, t) = \int d^3 r' \int dt' \, \left( \frac{1}{|\mathbf{r} - \mathbf{r}'|} + f_{xc}(\mathbf{r}, \mathbf{r}') \right) \delta(t - t') \, \rho_1(\mathbf{r}', t'). \qquad (A.8)
\]

This equation depends on t and t' only through t - t', so we can Fourier transform to frequency space. Defining

\[
f_{Hxc}(\mathbf{r}, \mathbf{r}') = \frac{1}{|\mathbf{r} - \mathbf{r}'|} + f_{xc}(\mathbf{r}, \mathbf{r}'), \qquad (A.9)
\]

the density response becomes

\[
\rho_1(\mathbf{r}, \omega) = \int d\mathbf{r}' \, \chi_0(\mathbf{r}, \mathbf{r}', \omega) \int d\mathbf{r}'' \, f_{Hxc}(\mathbf{r}', \mathbf{r}'') \, \rho_1(\mathbf{r}'', \omega). \qquad (A.10)
\]

We then use some algebraic substitutions to obtain the final equation. Defining

\[
g(\mathbf{r}, \omega) = \int d\mathbf{r}' \, f_{Hxc}(\mathbf{r}, \mathbf{r}') \, \rho_1(\mathbf{r}', \omega),
\]

we get

\[
g(\mathbf{r}, \omega) = \int d\mathbf{r}' \int d\mathbf{r}'' \, f_{Hxc}(\mathbf{r}, \mathbf{r}') \, \chi_0(\mathbf{r}', \mathbf{r}'', \omega) \, g(\mathbf{r}'', \omega).
\]
Substituting χ_0 from equation (A.1) and considering the poles in the upper half-plane, we define

\[
H_{i_v i_c}(\omega) = \int d\mathbf{r}' \, \psi_{i_v}^{*}(\mathbf{r}') \psi_{i_c}(\mathbf{r}') \, g(\mathbf{r}', \omega), \qquad (A.11)
\]

and

\[
V_{j_v j_c, i_v i_c} = \int d\mathbf{r} \int d\mathbf{r}' \, \psi_{j_v}^{*}(\mathbf{r}) \psi_{j_c}(\mathbf{r}) \, f_{Hxc}(\mathbf{r}, \mathbf{r}') \, \psi_{i_v}(\mathbf{r}') \psi_{i_c}^{*}(\mathbf{r}'),
\qquad
X_{i_v i_c} = \frac{H_{i_v i_c}}{\omega - \omega_{i_v i_c}}.
\]

Finally, the equation reads

\[
\omega X_{j_v j_c} = \sum_{i_v, i_c} \left( \omega_{i_v i_c} \, \delta_{j_v i_v} \delta_{j_c i_c} + V_{j_v j_c, i_v i_c} \right) X_{i_v i_c}, \qquad (A.12)
\]

which is the Casida equation.
References
[1] Runge E and Gross E K U 1984 Phys. Rev. Lett. 52 997
[2] Beck T L 2000 Rev. Mod. Phys. 72 1041
[3] Onida G, Reining L and Rubio A 2002 Rev. Mod. Phys. 74 601
[4] Yabana K and Bertsch G F 1996 Phys. Rev. B 54 4484
[5] Casida M E 1995 Recent Advances in Density Functional Methods (Part I) vol 1 p 155
[6] Sternheimer R M 1954 Phys. Rev. 96 951
[7] Dreuw A and Head-Gordon M 2005 Chem. Rev. 105 4009–4037
[8] Soler J M, Artacho E, Gale J D et al. 2002 J. Phys.: Condens. Matter 14 2745
[9] VandeVondele J, Borstnik U and Hutter J 2012 J. Chem. Theory Comput. 8 3565–3573
[10] Gillan M J, Bowler D R, Torralba A S and Miyazaki T 2007 Comput. Phys. Commun.
177 14–18
[11] Marek A, Blum V, Johanni R, Havu V et al. 2014 J. Phys.: Condens. Matter 26 213201
[12] Genovese L, Neelov A, Goedecker S et al. 2008 J. Chem. Phys. 129 014109
[13] Qin X, Shang H, Xiang H, Li Z and Yang J 2015 Int. J. Quantum Chem. 115 647–655
[14] Qin X, Shang H, Xu L, Hu W, Yang J, Li S and Zhang Y 2019 Int. J. High Perform.
Comput. Appl. 34 159–168
[15] Shang H, Xu L, Wu B, Qin X, Zhang Y and Yang J 2020 Comput. Phys. Commun. 254
107204
[16] Lin L, Lu J, Ying L and E W 2012 J. Comput. Phys. 231 2140–2154
[17] Hu W, Lin L and Yang C 2015 J. Chem. Phys. 143 124110
[18] Hu W, Qin X, Jiang Q, Chen J, An H, Jia W, Li F, Liu X, Chen D, Liu F, Zhao Y and Yang J 2020 Sci. Bull.
[19] Valiev M, Bylaska E J, Govind N et al. 2010 Comput. Phys. Commun. 181 1477–1489
[20] Shao Y, Gan Z, Epifanovsky E et al. 2015 Mol. Phys. 113 184–215
[21] Aprà E, Bylaska E J, de Jong W A et al. 2020 J. Chem. Phys. 152 184102
[22] Malcıoğlu O B, Gebauer R, Rocca D and Baroni S 2011 Comput. Phys. Commun. 182
1744–1754
[23] Ge X, Binnie S J, Rocca D, Gebauer R and Baroni S 2014 Comput. Phys. Commun. 185
2080–2089
[24] Hu W, Lin L, Banerjee A S, Vecharynski E and Yang C 2017 J. Chem. Theory Comput.
13 1188–1198
[25] Hu W, Lin L and Yang C 2017 J. Chem. Theory Comput. 13 5420–5431
[26] Hu W, Lin L and Yang C 2017 J. Chem. Theory Comput. 13 5458–5467
[27] Khatri C G and Rao C R 1968 Sankhya 30 167–180
[28] Goedecker S, Teter M and Hutter J 1996 Phys. Rev. B 54 1703
[29] Wilkinson K A, Hine N D M and Skylaris C K 2014 J. Chem. Theory Comput. 10 4782–4794
[30] Duff I, Heroux M and Pozo R 2002 ACM Trans. Math. Softw. 28 239–267
[31] Frigo M and Johnson S G 2005 Proceedings of the IEEE 93 216–231
[32] Hartwigsen C, Goedecker S and Hutter J 1998 Phys. Rev. B 58 3641
[33] Tan S, Feng H, Ji Y, Wang Y, Zhao J, Zhao A, Wang B, Luo Y, Yang J and Hou J 2012
J. Am. Chem. Soc. 134 9978–9985
[34] Onda K, Li B, Zhao J, Jordan K D, Yang J and Petek H 2005 Science 308 1154–1158
[35] Sun H, Zheng Q, Lu W and Zhao J 2019 J. Phys. Condens. Matter 31 114004
[36] Migani A, Mowbray D J, Zhao J and Petek H 2015 J. Chem. Theory Comput. 11 239–251
[37] Sun H, Mowbray D J, Migani A, Zhao J, Petek H and Rubio A 2015 ACS Catal. 5
4242–4254