First-principles calculations of electron states of a silicon nanowire with 100,000 atoms on the K computer.
ABSTRACT Real-space DFT (RSDFT) is a simulation technique most suitable for massively parallel architectures to perform first-principles electronic-structure calculations based on density functional theory. We here report unprecedented simulations on the electron states of silicon nanowires with up to 107,292 atoms carried out during the initial performance evaluation phase of the K computer being developed at RIKEN. The RSDFT code has been parallelized and optimized so as to make effective use of the various capabilities of the K computer. Simulation results for the self-consistent electron states of a silicon nanowire with 10,000 atoms were obtained in a run lasting about 24 hours and using 6,144 cores of the K computer. A 3.08 petaflops sustained performance was measured for one iteration of the SCF calculation in a 107,292-atom Si nanowire calculation using 442,368 cores, which is 43.63% of the peak performance of 7.07 petaflops.
 Cited In (10)

 "Parallel processing for RSDFT needs to consider whole process parallelism. As we mentioned, RSDFT needs to parallelize the orthogonalization routine and other routines, such as updating wave function using the CG method and updating potential fields, in addition to the eigensolver [2]. Although the number of processes may be too large for the eigensolver, it is necessary for the parallelization of other parts of RSDFT, and it is favorable to avoid any extra costs such as matrix redistribution. "

Article: A Communication Avoiding and Reducing Algorithm for Symmetric Eigenproblem for Very Small Matrices
ABSTRACT: In this paper, a parallel symmetric eigensolver with very small matrices in massively parallel processing is considered. We define very small matrices that fit the sizes of caches per node in a supercomputer. We assume that the sizes also fit the exascale computing requirements of current production runs of an application. To minimize communication time, we added several communication-avoiding and communication-reducing algorithms based on Message Passing Interface (MPI) nonblocking implementations. A performance evaluation with up to full nodes of the FX10 system indicates that (1) the MPI nonblocking implementation is 3x as efficient as the baseline implementation, (2) the hybrid MPI execution is 1.9x faster than the pure MPI execution, (3) our proposed solver is 2.3x and 22x faster than a ScaLAPACK routine with optimized blocking size and cyclic-cyclic distribution, respectively.
 "Even the simplest restricted Hartree-Fock (RHF) method scales approximately cubically with the system size. There are ongoing efforts to reduce the scaling of quantum-mechanical (QM) methods [29] [30] and parallelize them efficiently [31] [32] [33] [34] [35] [36] [37] [38]. See, for example, the linearly scaling method developed by Challacombe and Schwegler [39], and the adaptive multiresolution method developed by Harrison, et al. [40]. "
Conference Paper: Heuristic static load-balancing algorithm applied to the fragment molecular orbital method
ABSTRACT: In the era of petascale supercomputing, the importance of load balancing is crucial. Although dynamic load balancing is widespread, it is increasingly difficult to implement effectively with thousands of processors or more, prompting a second look at static load-balancing techniques even though the optimal allocation of tasks to processors is an NP-hard problem. We propose a heuristic static load-balancing algorithm, employing fitted benchmarking data, as an alternative to dynamic load balancing. The problem of allocating CPU cores to tasks is formulated as a mixed-integer nonlinear optimization problem, which is solved by using an optimization solver. On 163,840 cores of Blue Gene/P, we achieved a parallel efficiency of 80% for an execution of the fragment molecular orbital method applied to model protein-ligand complexes quantum-mechanically. The obtained allocation is shown to outperform dynamic load balancing by at least a factor of 2, thus motivating the use of this approach on other coarse-grained applications.
High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for; 01/2012
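The idea of allocating cores to tasks from fitted benchmarking data can be illustrated with a small sketch. This is not the mixed-integer nonlinear formulation or the solver used in the paper; it is a simple greedy stand-in. The timing model t(n) = a/n + b, the coefficient values, and the names `allocate_cores` and `predicted_makespan` are all assumptions made for illustration.

```python
import heapq


def allocate_cores(tasks, total_cores):
    """Greedy static allocation of cores to tasks.

    tasks maps a task name to (a, b), the coefficients of a hypothetical
    fitted timing model t(n) = a / n + b (an assumption, not the paper's model).
    Each task starts with one core; every remaining core goes to the task
    whose predicted run time is currently the largest.
    """
    if total_cores < len(tasks):
        raise ValueError("need at least one core per task")
    cores = {name: 1 for name in tasks}
    # heapq is a min-heap, so predicted times are stored negated.
    heap = [(-(a + b), name) for name, (a, b) in tasks.items()]
    heapq.heapify(heap)
    for _ in range(total_cores - len(tasks)):
        _, name = heapq.heappop(heap)
        cores[name] += 1
        a, b = tasks[name]
        heapq.heappush(heap, (-(a / cores[name] + b), name))
    return cores


def predicted_makespan(tasks, cores):
    """Predicted time of the slowest task under the same assumed model."""
    return max(a / cores[name] + b for name, (a, b) in tasks.items())


if __name__ == "__main__":
    # Made-up coefficients for three fragments of different cost.
    fragments = {"frag1": (100.0, 1.0), "frag2": (50.0, 1.0), "frag3": (10.0, 1.0)}
    plan = allocate_cores(fragments, 16)
    print(plan, predicted_makespan(fragments, plan))
```

Because the allocation is computed once before the run, no scheduling traffic is needed at run time, which is the property that makes static approaches attractive at very large core counts. The quality of the result depends entirely on how well the fitted model predicts real timings.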