Adaptable Particle-in-Cell algorithms for graphical processing units

Computer Physics Communications (Impact Factor: 3.11). 03/2011; 182(3):641-648. DOI: 10.1016/j.cpc.2010.11.009
Source: DBLP


Emerging computer architectures consist of an increasing number of shared-memory computing cores on a chip, often with vector (SIMD) co-processors. Future exascale high-performance systems will consist of a hierarchy of such nodes, which will require different algorithms at different levels. Since no one knows exactly how the future will evolve, we have begun development of an adaptable Particle-in-Cell (PIC) code whose parameters can match different hardware configurations. The data structures reflect three levels of parallelism: contiguous vectors, non-contiguous blocks of vectors which can share memory, and groups of blocks which do not. Particles are kept ordered at each time step, and the size of a sorting cell is an adjustable parameter. We have implemented a simple 2D electrostatic skeleton code whose inner loop (containing 6 subroutines) runs entirely on the NVIDIA Tesla C1060. We obtained speedups of about 16-25 compared to a 2.66 GHz Intel i7 (Nehalem), depending on the plasma temperature, with an asymptotic limit of 40 for a frozen plasma. We expect speedups of about 70 for a 2D electromagnetic code and about 100 for a 3D electromagnetic code, which have higher computational intensities (more flops per memory access).
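The particle ordering described above can be sketched as a counting sort by sorting-cell index. The following is an illustrative pure-Python 1D version (hypothetical names, not the authors' GPU implementation), showing how the sorting-cell size acts as a tunable parameter:

```python
# Illustrative sketch: reorder particles by sorting cell each time step.
# A counting sort preserves locality for the subsequent push/deposit loops.

def sort_particles(positions, cell_size, nx):
    """Reorder 1D particle positions by sorting-cell index.

    cell_size sets how coarse the sorting cells are relative to the
    nx grid cells; larger cells mean fewer particles cross cell
    boundaries per step, at the cost of coarser locality.
    """
    ncells = (nx + cell_size - 1) // cell_size
    cell_of = [min(int(x) // cell_size, ncells - 1) for x in positions]
    counts = [0] * ncells
    for c in cell_of:
        counts[c] += 1
    # prefix sum gives the start offset of each cell's slice in the output
    offsets = [0] * ncells
    for i in range(1, ncells):
        offsets[i] = offsets[i - 1] + counts[i - 1]
    out = [0.0] * len(positions)
    for x, c in zip(positions, cell_of):
        out[offsets[c]] = x
        offsets[c] += 1
    return out

particles = [7.3, 0.2, 3.9, 1.5, 6.1]
print(sort_particles(particles, cell_size=2, nx=8))
# → [0.2, 1.5, 3.9, 7.3, 6.1]  (grouped by cells [0,2), [2,4), [4,6), [6,8))
```

On the GPU the same idea is applied per thread block, but the counting-sort structure, and the cell size as a free parameter, carry over directly.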

    • "Kong et al. [12] developed a 2D3V fully relativistic electromagnetic code on an NVIDIA GeForce GTX 280 graphics card and achieved speedups of 81x and 27x over a single core of an Intel Core 2 Duo E7200 2.53 GHz CPU for cold-plasma runs and extremely relativistic plasma runs, respectively. Decyk and Singh [5] developed new parameterized PIC algorithms and data structures on an NVIDIA GeForce GTX 280 graphics card. They reported speedups of about 15-25 compared to an Intel Nehalem 2.66 GHz processor for a simple 2D electrostatic code. "
    ABSTRACT: Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Due to the limited memory of a single GPU, distributed multi-GPU systems need to be explored for large-scale MHD simulations. However, data transfer between GPUs bottlenecks the efficiency of simulations on such systems. In this paper we propose a novel GPU Direct-MPI hybrid approach to address this problem for overall performance enhancement. Our approach consists of two strategies: (1) we exploit GPU Direct 2.0 to speed up data transfers between multiple GPUs in a single node and reduce the total number of message passing interface (MPI) communications; (2) we design Compute Unified Device Architecture (CUDA) kernels, instead of using memory copy, to speed up the fragmented data exchange in the three-dimensional (3D) decomposition. 3D decomposition is usually avoided on distributed multi-GPU systems because of the low efficiency of its fragmented data exchange; our approach makes 3D decomposition practical on such systems, reducing the memory usage and computation time of each partition of the computational domain. Experimental results show twice the FLOPS compared to a common 2D-decomposition MPI-only implementation. The proposed approach has been developed into an efficient implementation for MHD simulations on distributed multi-GPU systems, called the MGPU-MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our single-GPU work (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements were conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for a problem with 1200³ grid points using 216 GPUs.
    Full-text · Article · Jul 2014 · Computer Physics Communications
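The 3D decomposition in the abstract above assigns each GPU a subvolume of the global grid. A minimal sketch of that partitioning (a hypothetical `subdomain` helper, assuming a uniform Cartesian split with the x coordinate varying fastest, as in a typical MPI Cartesian topology):

```python
def subdomain(rank, dims, grid):
    """Map an MPI-style rank to its 3D subdomain extents.

    dims: number of partitions along (x, y, z); grid: global points per axis.
    Returns a (start, stop) index pair per axis; integer division handles
    axes that do not split evenly.
    """
    # decode the rank into 3D partition coordinates, x fastest
    cx = rank % dims[0]
    cy = (rank // dims[0]) % dims[1]
    cz = rank // (dims[0] * dims[1])
    return tuple((c * g // d, (c + 1) * g // d)
                 for c, d, g in zip((cx, cy, cz), dims, grid))

# 216 GPUs as a 6x6x6 grid over 1200^3 points -> 200^3 points per GPU
print(subdomain(0, (6, 6, 6), (1200, 1200, 1200)))
# → ((0, 200), (0, 200), (0, 200))
```

Each rank would then exchange halo layers only with the neighbors adjacent to its extents, which is the fragmented data exchange the paper accelerates with custom CUDA kernels.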
    • "Both the collision-free and collision-resolving algorithms use the same field solvers. The 2D ES code solves Poisson's equation in Fourier space as described in [8]. At the time that paper was written, NVIDIA's real to complex (R2C) FFT was not optimized and performed poorly. "
    ABSTRACT: We have designed Particle-in-Cell algorithms for emerging architectures. These algorithms share a common approach, using fine-grained tiles, but differ in implementation depending on the architecture. On the GPU, there were two different implementations, one with atomic operations and one with no data collisions, using CUDA C and Fortran. Speedups up to about 50 compared to a single core of the Intel i7 processor have been achieved. There was also an implementation for traditional multi-core processors using OpenMP, which achieved high parallel efficiency. We believe this approach should also work for other emerging designs such as the Intel Phi coprocessor from the Intel MIC architecture.
    Full-text · Article · Mar 2014 · Computer Physics Communications
    • "Some GPU based Poisson solvers can be found in the literature. In the work of Decyk and Singh [6] the CUFFT library is used in the FFT based solution of the 2D Poisson's equation for periodic BCs. Rossinelli et al. [7] use the same technique for a 2D free BCs problem. "
    ABSTRACT: A 3-dimensional GPU Poisson solver is developed for all possible combinations of free and periodic boundary conditions (BCs) along the three directions. It is benchmarked for various grid sizes and different BCs and a significant performance gain is observed for problems including one or more free BCs. The GPU Poisson solver is also benchmarked against two different CPU implementations of the same method and a significant amount of acceleration of the computation is observed with the GPU version.
    Full-text · Article · Nov 2012 · Computer Physics Communications
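The FFT-based periodic Poisson solve referenced in the excerpt above divides each Fourier mode of the charge density by k². A minimal 1D NumPy sketch of the technique (an illustration only, not the CUFFT-based code):

```python
import numpy as np

def poisson_periodic(rho, L):
    """Solve phi'' = -rho with periodic BCs by dividing by k^2 in Fourier space.

    The k = 0 (mean) mode is set to zero, which fixes the free additive
    constant and assumes rho has zero mean (overall charge neutrality).
    """
    n = len(rho)
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=L / n)  # wavenumbers
    rho_k = np.fft.fft(rho)
    phi_k = np.zeros_like(rho_k)
    phi_k[1:] = rho_k[1:] / k[1:] ** 2
    return np.fft.ifft(phi_k).real

# rho = cos(x) on [0, 2*pi) gives phi = cos(x) exactly (single k = 1 mode)
x = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
phi = poisson_periodic(np.cos(x), 2.0 * np.pi)
print(np.max(np.abs(phi - np.cos(x))))  # error at machine-precision level
```

In 2D the same division is applied with k² = kx² + ky², which is where a real-to-complex (R2C) transform, as discussed in the excerpt above, halves the work.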