First-principles calculations of electron states of a silicon nanowire with 100,000 atoms on the K computer.
ABSTRACT Real space DFT (RSDFT) is a simulation technique most suitable for massively-parallel architectures to perform first-principles electronic-structure calculations based on density functional theory. We here report unprecedented simulations on the electron states of silicon nanowires with up to 107,292 atoms carried out during the initial performance evaluation phase of the K computer being developed at RIKEN. The RSDFT code has been parallelized and optimized so as to make effective use of the various capabilities of the K computer. Simulation results for the self-consistent electron states of a silicon nanowire with 10,000 atoms were obtained in a run lasting about 24 hours and using 6,144 cores of the K computer. A 3.08 peta-flops sustained performance was measured for one iteration of the SCF calculation in a 107,292-atom Si nanowire calculation using 442,368 cores, which is 43.63% of the peak performance of 7.07 peta-flops.
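The efficiency figure in the abstract can be sanity-checked from the quoted numbers alone. The sketch below is only an arithmetic illustration, not part of the RSDFT code; the variable names are hypothetical. It uses the rounded values as published, so the recomputed ratio comes out at about 43.6%, slightly below the quoted 43.63%, presumably because the published inputs are rounded.

```python
# Figures quoted in the abstract (rounded as published).
sustained_pflops = 3.08   # sustained performance, one SCF iteration
peak_pflops = 7.07        # peak performance of the cores used
cores = 442_368           # cores used in the 107,292-atom run

# Fraction of peak achieved, from the rounded figures.
efficiency = sustained_pflops / peak_pflops

# Per-core rates in giga-flops (1 peta-flops = 1e6 giga-flops).
peak_per_core_gflops = peak_pflops * 1e6 / cores
sustained_per_core_gflops = sustained_pflops * 1e6 / cores

print(f"fraction of peak:   {efficiency:.2%}")
print(f"peak per core:      {peak_per_core_gflops:.2f} GFLOPS")
print(f"sustained per core: {sustained_per_core_gflops:.2f} GFLOPS")
```

The per-core values follow directly from dividing the quoted totals by the core count, so they inherit the same rounding.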
- Source available from: Takahiro Katagiri
- "Parallel processing for RSDFT needs to consider whole process parallelism. As we mentioned, RSDFT needs to parallelize the orthogonalization routine and other routines, such as updating wave function using the CG method and updating potential fields, in addition to the eigensolver. Although the number of processes may be too large for the eigensolver, it is necessary for the parallelization of other parts of RSDFT, and it is favorable to avoid any extra costs such as matrix re-distribution."
ABSTRACT: In this paper, a parallel symmetric eigensolver with very small matrices in massively parallel processing is considered. We define very small matrices that fit the sizes of caches per node in a supercomputer. We assume that the sizes also fit the exa-scale computing requirements of current production runs of an application. To minimize communication time, we added several communication avoiding and communication reducing algorithms based on Message Passing Interface (MPI) non-blocking implementations. A performance evaluation with up to full nodes of the FX10 system indicates that (1) the MPI non-blocking implementation is 3x as efficient as the baseline implementation, (2) the hybrid MPI execution is 1.9x faster than the pure MPI execution, (3) our proposed solver is 2.3x and 22x faster than a ScaLAPACK routine with optimized blocking size and cyclic-cyclic distribution, respectively.
- "Even the simplest restricted Hartree-Fock (RHF) method scales approximately cubically with the system size. There are ongoing efforts to reduce the scaling of quantum-mechanical (QM) methods and parallelize them efficiently. See, for example, the linearly scaling method developed by Challacombe and Schwegler, and the adaptive multiresolution method developed by Harrison, et al."
ABSTRACT: In the era of petascale supercomputing, the importance of load balancing is crucial. Although dynamic load balancing is widespread, it is increasingly difficult to implement effectively with thousands of processors or more, prompting a second look at static load-balancing techniques even though the optimal allocation of tasks to processors is an NP-hard problem. We propose a heuristic static load-balancing algorithm, employing fitted benchmarking data, as an alternative to dynamic load balancing. The problem of allocating CPU cores to tasks is formulated as a mixed-integer nonlinear optimization problem, which is solved by using an optimization solver. On 163,840 cores of Blue Gene/P, we achieved a parallel efficiency of 80% for an execution of the fragment molecular orbital method applied to model protein-ligand complexes quantum-mechanically. The obtained allocation is shown to outperform dynamic load balancing by at least a factor of 2, thus motivating the use of this approach on other coarse-grained applications.
High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for; 01/2012