Parallel Hyperbolic PDE Simulation on Clusters: Cell versus GPU

Department of Applied Mathematics, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada
Computer Physics Communications (Impact Factor: 3.11). 12/2010; 181(12):2164-2179. DOI: 10.1016/j.cpc.2010.07.049
Source: DBLP


Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation (PDE) simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides […]
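The coarse level of parallelism the abstract describes (one structured-grid block per MPI rank, explicit time stepping, and a ghost-cell exchange between neighbouring blocks each step) can be sketched with a toy problem. The sketch below is illustrative only and not the authors' code: a serial 1D upwind advection solver stands in for the per-node ranks, and the list copy labelled "halo exchange" is where real code would call `MPI_Sendrecv`.

```python
# Sketch of the coarse-grained parallel pattern for structured-grid
# explicit time integration: the grid is split into blocks (one per MPI
# rank), and each step begins with a halo (ghost-cell) exchange.
# Serial illustrative model; real code exchanges halos via MPI_Sendrecv.

def step_block(u, u_left_ghost, dt_dx_a):
    """One explicit upwind step for linear advection u_t + a u_x = 0 (a > 0).

    u_left_ghost is the rightmost cell of the left neighbour's block:
    the halo value MPI would deliver."""
    new = u[:]
    prev = u_left_ghost
    for i in range(len(u)):
        new[i] = u[i] - dt_dx_a * (u[i] - prev)
        prev = u[i]
    return new

def advect(blocks, nsteps, dt_dx_a):
    """Advance all blocks; periodic domain, one halo exchange per step."""
    for _ in range(nsteps):
        # "halo exchange": each block receives its left neighbour's last cell
        ghosts = [blocks[b - 1][-1] for b in range(len(blocks))]
        blocks = [step_block(u, g, dt_dx_a) for u, g in zip(blocks, ghosts)]
    return blocks

# With dt*a/dx = 1 the upwind scheme shifts the solution one cell per step.
blocks = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
blocks = advect(blocks, 3, 1.0)
print(blocks)  # the unit pulse has moved three cells to the right
```

The point of the pattern is that only one cell per block boundary crosses the network per step, so communication cost stays proportional to the surface of each block while compute scales with its volume.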

  • Source
    • "The use of multiple CPUs was recently reported in [2], [3] and [4], and the use of GPUs can be found in [5][6][7][8]. GPU technology offers the performance of smaller clusters at a much lower cost [9]. "
    ABSTRACT: Two-dimensional (2D) models are increasingly used for inundation assessment in situations involving large domains of millions of computational elements and long time scales of several months. Practical applications often involve a compromise between spatial accuracy and computational efficiency: achieving the necessary spatial resolution requires rather fine meshes, which demand more data storage and computer times that may become comparable to the duration of the real simulated process. Conventional non-parallelized (CPU-based) 2D models make such simulations impractical in real project applications, and improving the performance of these complex models constitutes an important challenge not yet resolved. We present the newest developments of the RiverFLO-2D Plus model, based on a fourth-generation finite volume numerical scheme on flexible triangular meshes that can run on highly efficient graphics processing units (GPUs). To reduce the computational load, we have implemented two strategies: OpenMP parallelization and GPU techniques. Because the number of wet elements changes during a transient inundation simulation, the OpenMP implementation includes dynamic task assignment to the processors to ensure a balanced work load. Our strict method to control volume conservation (errors of order 10^-14 %) in the numerical modeling of the wetting/drying fronts involves a correction step that is not fully local, which requires special handling to avoid degrading the model. The efficiency of the model is demonstrated by results showing that the proposed method reduces the computational time by more than a factor of 30 in comparison to equivalent CPU implementations.
    We present performance tests using the latest GPU hardware technology, showing that the parallelization techniques implemented in RiverFLO-2D Plus can reduce the computational-load/hardware-investment ratio by a factor of 200-300, allowing 2D model end-users to obtain the performance of a supercomputing infrastructure at a much lower cost.
    Full-text · Conference Paper · Aug 2014
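The load-balancing idea in the abstract above is that only wet cells do flux work, and the wet set changes every step, so a static partition of cells over threads goes stale; dynamic scheduling (OpenMP's `schedule(dynamic, chunk)`) hands out chunks of the current wet-cell list to whichever thread is idle. A hypothetical serial model of that chunk queue, not the RiverFLO-2D Plus code:

```python
# Serial sketch of OpenMP-style dynamic scheduling over wet cells only.
# Workers pull fixed-size chunks from a shared queue; here "first idle
# worker" is modelled as simple turn-taking. Names are illustrative.

from collections import deque

def dynamic_partition(wet_cells, num_workers, chunk):
    """Assign chunks of the current wet-cell list to workers on demand,
    mimicking schedule(dynamic, chunk)."""
    queue = deque(wet_cells[i:i + chunk] for i in range(0, len(wet_cells), chunk))
    work = [[] for _ in range(num_workers)]
    w = 0
    while queue:
        work[w].extend(queue.popleft())
        w = (w + 1) % num_workers  # next idle worker grabs the next chunk
    return work

# The wet set is rebuilt each time step as the flood front moves.
wet = [c for c in range(20) if c % 3 != 0]
parts = dynamic_partition(wet, num_workers=3, chunk=2)
print([len(p) for p in parts])  # near-equal work per worker
```

A small chunk size rebalances well when the wet front moves quickly, at the cost of more queue traffic; real OpenMP code tunes `chunk` accordingly.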
  • Source
    • "Development of general-purpose graphics processing units (GPUs) during the past ten years promises large increases in available computation power, which may help to bring global-scale Vlasov–Maxwell simulations within the domain of tractable problems. Several simulations have been ported to GPUs, typically obtaining an order-of-magnitude speedup per GPU card compared to running the same algorithm on a single CPU core [24] [25] [26] [27] [28] [29]. Reaching good performance on GPUs can be difficult for some algorithms, however, due to the extremely multi-threaded nature of the hardware. "
    ABSTRACT: We present a numerical method, based on a three-dimensional finite volume wave-propagation algorithm, for solving the Vlasov equation in the full six-dimensional (three spatial coordinates, three velocity coordinates) case at length scales comparable to the size of the Earth’s magnetosphere. The method uses Strang splitting to separate propagation in spatial and velocity coordinates, and is second-order accurate in spatial and velocity spaces and in time. The method has been implemented on general-purpose graphics processing units for faster computations and has been parallelised using the message passing interface.
    Full-text · Article · Aug 2013 · Parallel Computing
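Strang splitting, as used in the Vlasov abstract above, advances one operator a half step, the other a full step, then the first again a half step, which keeps the composite scheme second order in time. A minimal sketch on a toy scalar problem du/dt = (A + B)u, where each sub-flow can be applied exactly (the names are illustrative, not from the paper):

```python
# Strang splitting: half step of A, full step of B, half step of A.
# Toy scalar problem where both sub-flows are exact exponentials.

import math

def flow_A(u, dt, A):
    """Exact solution operator of du/dt = A u over time dt."""
    return u * math.exp(A * dt)

def flow_B(u, dt, B):
    """Exact solution operator of du/dt = B u over time dt."""
    return u * math.exp(B * dt)

def strang_step(u, dt, A, B):
    """One Strang-split step: A for dt/2, B for dt, A for dt/2."""
    u = flow_A(u, dt / 2, A)
    u = flow_B(u, dt, B)
    return flow_A(u, dt / 2, A)

# Scalars commute, so the splitting is exact here; for non-commuting
# operators (spatial vs. velocity transport) the error is O(dt^2).
u, dt, A, B = 1.0, 0.1, -1.0, -2.0
for _ in range(10):
    u = strang_step(u, dt, A, B)
print(u, math.exp((A + B) * 1.0))  # both values are approximately exp(-3)
```

In the Vlasov setting A and B are the spatial and velocity-space transport operators, which do not commute, so the symmetric ordering is what buys the second order of accuracy claimed in the abstract.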
  • Source
    • "Rostrup and De Sterck [21] further present detailed optimization and benchmarking of shallow water simulations on clusters of multi-core CPUs, Cell processors, and GPUs. Of the three, the GPUs offer the highest performance. "
    ABSTRACT: We present a state-of-the-art shallow water simulator running on multiple GPUs. Our implementation is based on an explicit high-resolution finite volume scheme suitable for modeling dam breaks and flooding. We use row domain decomposition to enable multi-GPU computations, and perform traditional CUDA block decomposition within each GPU for further parallelism. Our implementation shows near-perfect weak and strong scaling, and enables simulation of domains of up to 235 million cells at a rate of over 1.2 gigacells per second using four Fermi-generation GPUs. The code is thoroughly benchmarked on three different systems, both high-performance and commodity-level.
    Full-text · Conference Paper · Jun 2010
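The row domain decomposition in the multi-GPU abstract above splits the 2D grid into horizontal strips, one strip per GPU, and exchanges the rows along each strip boundary every time step. A serial sketch of the bookkeeping, under the assumption of a non-periodic domain; in real code each strip lives on one GPU and the boundary rows travel over PCIe or MPI:

```python
# Row domain decomposition of a 2D grid into near-equal strips, plus the
# per-step halo exchange of boundary rows between neighbouring strips.
# Serial illustrative sketch; names are not from the cited simulator.

def decompose_rows(grid, nparts):
    """Split a 2D grid (list of rows) into near-equal row strips."""
    n = len(grid)
    bounds = [n * p // nparts for p in range(nparts + 1)]
    return [grid[bounds[p]:bounds[p + 1]] for p in range(nparts)]

def exchange_halos(strips):
    """Each strip receives a copy of its neighbours' boundary rows
    (non-periodic: edge strips get only one halo row)."""
    padded = []
    for p, s in enumerate(strips):
        top = [strips[p - 1][-1]] if p > 0 else []
        bot = [strips[p + 1][0]] if p + 1 < len(strips) else []
        padded.append(top + s + bot)
    return padded

grid = [[float(r)] * 4 for r in range(10)]  # 10x4 grid, row r holds value r
strips = decompose_rows(grid, 4)
print([len(s) for s in strips])   # rows per strip
padded = exchange_halos(strips)
print([len(s) for s in padded])   # interior strips gain two halo rows
```

Row strips keep each halo a single contiguous row, so the device-to-device copy is one linear memory transfer per boundary, which is part of why this decomposition scales well across GPUs.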