Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

Department of Applied Mathematics, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada
Computer Physics Communications (Impact Factor: 2.41). 12/2010; DOI: 10.1016/j.cpc.2010.07.049
Source: DBLP

ABSTRACT: Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance is compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with the GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured-grid-based explicit algorithms can be optimized for clusters with Cell and GPU accelerators.
It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.

Program summary
Program title: SWsolver
Catalogue identifier: AEGY_v1_0
Program summary URL: obtainable from CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPL v3
No. of lines in distributed program, including test data, etc.: 59 168
No. of bytes in distributed program, including test data, etc.: 453 409
Distribution format: tar.gz
Programming language: C, CUDA
Computer: Parallel computing clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
Operating system: Linux
Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell processors, and 1-32 NVIDIA GPUs.
RAM: Tested on problems requiring up to 4 GB per compute node.
Classification: 12
External routines: MPI, CUDA, IBM Cell SDK
Nature of problem: MPI-parallel simulation of the shallow water equations using a high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell processor, and NVIDIA GPU using CUDA.
Solution method: SWsolver provides three implementations of a high-resolution 2D shallow water equation solver on regular Cartesian grids: for CPU, Cell processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
Additional comments: Sub-program numdiff is used for the test run.

  • Source
    ABSTRACT: Two-dimensional (2D) models are increasingly used for inundation assessment in situations involving large domains of millions of computational elements and long time scales of several months. Practical applications often involve a compromise between spatial accuracy and computational efficiency: to achieve the necessary spatial resolution, rather fine meshes become necessary, requiring more data storage and very long computer times that may become comparable to the real simulated process. The use of conventional non-parallelized (CPU-based) 2D models makes simulations impractical in real project applications, and improving the performance of such complex models constitutes an important challenge not yet resolved. We present the newest developments of the RiverFLO-2D Plus model, based on a fourth-generation finite volume numerical scheme on flexible triangular meshes, which can run on highly efficient graphics processing units (GPUs). To reduce the computational load, we have implemented two strategies: OpenMP parallelization and GPU techniques. Since the number of wet elements changes during the simulation of transient inundation flows, dynamic task assignment to the processors, ensuring a balanced workload, has been included in the OpenMP implementation. Our strict method for controlling volume conservation (errors of order 10⁻¹⁴%) in the numerical modeling of wetting/drying fronts involves a correction step that is not fully local, which requires special handling to avoid degrading the model. The efficiency of the model is demonstrated by results showing that the proposed method reduces the computational time by more than a factor of 30 in comparison to equivalent CPU implementations. 
We also present performance tests using the latest GPU hardware, which show that the parallelization techniques implemented in RiverFLO-2D Plus can reduce the computational-load/hardware-investment ratio by a factor of 200-300, allowing 2D model end-users to obtain the performance of a supercomputing infrastructure at a much lower cost.
    HIC2014, 11th International Conference on Hydroinformatics; 08/2014
  • Source
    ABSTRACT: We present a numerical method, based on a three-dimensional finite volume wave-propagation algorithm, for solving the Vlasov equation in a full six-dimensional (three spatial coordinates, three velocity coordinates) case in length scales comparable to the size of the Earth’s magnetosphere. The method uses Strang splitting to separate propagation in spatial and velocity coordinates, and is second-order accurate in spatial and velocity spaces and in time. The method has been implemented on general-purpose graphics processing units for faster computations and has been parallelised using the message passing interface.
    Parallel Computing 08/2013; 39(8):306–318. DOI:10.1016/j.parco.2013.05.001 · 1.89 Impact Factor
  • Source
    ABSTRACT: We present a state-of-the-art shallow water simulator running on multiple GPUs. Our implementation is based on an explicit high-resolution finite volume scheme suitable for modeling dam breaks and flooding. We use row domain decomposition to enable multi-GPU computations, and perform traditional CUDA block decomposition within each GPU for further parallelism. Our implementation shows near-perfect weak and strong scaling, and enables simulation of domains consisting of up to 235 million cells at a rate of over 1.2 gigacells per second using four Fermi-generation GPUs. The code is thoroughly benchmarked on three different systems, ranging from high-performance to commodity-level hardware.
    Applied Parallel and Scientific Computing - 10th International Conference, PARA 2010, Reykjavík, Iceland, June 6-9, 2010, Revised Selected Papers, Part II; 01/2010

