Adaptable Particle-in-Cell algorithms for graphical processing units

Computer Physics Communications (Impact Factor: 3.11). 03/2011; 182(3):641-648. DOI: 10.1016/j.cpc.2010.11.009
Source: DBLP


Emerging computer architectures consist of an increasing number of shared memory computing cores in a chip, often with vector (SIMD) co-processors. Future exascale high performance systems will consist of a hierarchy of such nodes, which will require different algorithms at different levels. Since no one knows exactly how the future will evolve, we have begun development of an adaptable Particle-in-Cell (PIC) code, whose parameters can match different hardware configurations. The data structures reflect three levels of parallelism, contiguous vectors and non-contiguous blocks of vectors, which can share memory, and groups of blocks which do not. Particles are kept ordered at each time step, and the size of a sorting cell is an adjustable parameter. We have implemented a simple 2D electrostatic skeleton code whose inner loop (containing 6 subroutines) runs entirely on the NVIDIA Tesla C1060. We obtained speedups of about 16-25 compared to a 2.66 GHz Intel i7 (Nehalem), depending on the plasma temperature, with an asymptotic limit of 40 for a frozen plasma. We expect speedups of about 70 for an 2D electromagnetic code and about 100 for a 3D electromagnetic code, which have higher computational intensities (more flops/memory access).

1 Follower
15 Reads
  • Source
    • "Both the collision-free and collision-resolving algorithms use the same field solvers. The 2D ES code solves Poisson's equation in Fourier space as described in [8]. At the time that paper was written, NVIDIA's real to complex (R2C) FFT was not optimized and performed poorly. "
    [Show abstract] [Hide abstract]
    ABSTRACT: We have designed Particle-in-Cell algorithms for emerging architectures. These algorithms share a common approach, using fine-grained tiles, but different implementations depending on the architecture. On the GPU, there were two different implementations, one with atomic operations and one with no data collisions, using CUDA C and Fortran. Speedups up to about 50 compared to a single core of the Intel i7 processor have been achieved. There was also an implementation for traditional multi-core processors using OpenMP which achieved high parallel efficiency. We believe that this approach should work for other emerging designs such as Intel Phi coprocessor from the Intel MIC architecture.
    Computer Physics Communications 03/2014; 185(3):708–719. DOI:10.1016/j.cpc.2013.10.013 · 3.11 Impact Factor
  • Source
    • "Some GPU based Poisson solvers can be found in the literature. In the work of Deck and Singh [6] the CUFFT library is used in the FFT based solution of the 2D Poisson's equation for periodic BCs. Rossinelli et al. [7] use the same technique for a 2D free BCs problem . "
    [Show abstract] [Hide abstract]
    ABSTRACT: A 3-dimensional GPU Poisson solver is developed for all possible combinations of free and periodic boundary conditions (BCs) along the three directions. It is benchmarked for various grid sizes and different BCs and a significant performance gain is observed for problems including one or more free BCs. The GPU Poisson solver is also benchmarked against two different CPU implementations of the same method and a significant amount of acceleration of the computation is observed with the GPU version.
    Computer Physics Communications 11/2012; 184(8). DOI:10.1016/j.cpc.2013.02.024 · 3.11 Impact Factor
  • Source
    • "Nevertheless, there have been fairly successful efforts in porting explicit PIC algorithms to new computing architectures such as graphics processing units (GPUs) [5] [6] [7] [8] [9]. In [5], a 2D electrostatic PIC algorithm is implemented on a GPU. A particle data structure is proposed to match the GPU architecture so that the algorithm can be adapted to different problem configurations. "
    [Show abstract] [Hide abstract]
    ABSTRACT: Recently, a fully implicit, energy- and charge-conserving particle-in-cell method has been proposed for multi-scale, full-f kinetic simulations [G. Chen, et al., J. Comput. Phys. 230,18 (2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver, capable of using very large timesteps without loss of numerical stability or accuracy. A fundamental feature of the method is the segregation of particle-orbit computations from the field solver, while remaining fully self-consistent. This paper describes a very efficient, mixed-precision hybrid CPU-GPU implementation of the implicit PIC algorithm exploiting this feature. The JFNK solver is kept on the CPU in double precision (DP), while the implicit, charge-conserving, and adaptive particle mover is implemented on a GPU (graphics processing unit) using CUDA in single-precision (SP). Performance-oriented optimizations are introduced with the aid of the roofline model. The implicit particle mover algorithm is shown to achieve up to 400 GOp/s on a Nvidia GeForce GTX580. This corresponds to 25% absolute GPU efficiency against the peak theoretical performance, and is about 300 times faster than an equivalent serial CPU (Intel Xeon X5460) execution. For the test case chosen, the mixed-precision hybrid CPU-GPU solver is shown to over-perform the DP CPU-only serial version by a factor of \sim 100, without apparent loss of robustness or accuracy in a challenging long-timescale ion acoustic wave simulation.
    Journal of Computational Physics 11/2011; 231(16). DOI:10.1016/ · 2.43 Impact Factor
Show more

Similar Publications


15 Reads
Available from