Boosting the performance of computational fluid dynamics codes for interactive supercomputing.

Procedia CS 01/2010; 1:2055-2064. DOI: 10.1016/j.procs.2010.04.230
Source: DBLP

ABSTRACT An extreme form of pipelining of the Piecewise-Parabolic Method (PPM) gas dynamics code has been used to dramatically increase its performance on the new generation of multicore CPUs. Exploiting this technique, together with a full integration of the several data post-processing and visualization utilities associated with this code, has enabled numerical experiments in computational fluid dynamics to be performed interactively on a new, dedicated system in our lab, with immediate, user-controlled visualization of the resulting flows on the PowerWall display. The code restructuring required to achieve the necessary CPU performance boost, as well as the parallel computing methods and systems used to enable interactive flow simulation, are described. Requirements for these techniques to be applied to other codes are discussed, and our plans for tools that will help programmers exploit these techniques are briefly described. Examples showing the capability of the new system and software are given for applications in turbulence and stellar convection.

  • Source
    ABSTRACT: In the Laboratory for Computational Science & Engineering at the University of Minnesota, our team has used IBM Roadrunner triblades to build an interactive supercomputing system connected to our 12-panel PowerWall display. We have completely redesigned our multifluid hydrodynamics code to run well on this system with fully integrated, prompt fluid flow visualization. Our code is based upon the Piecewise-Parabolic Method (PPM) and the Piecewise-Parabolic Boltzmann scheme (PPB) for moment-conserving advection of the individual fluid mixing fractions [1]. This new code is specially designed for the Roadrunner hardware and has been extensively tested both in our lab at the University of Minnesota and at Los Alamos. Our approach represents an extreme design in which all computation is performed on the Cell processor accelerator, and the host Opteron is used only for I/O and networking service functions. This strategy avoids frequent data exchanges of small granularity between the host and the accelerator, and any processing stalls that the latency of such exchanges might cause. It also enables a simplified messaging strategy for transport of data between the accelerators. Finally, this approach enables fully asynchronous I/O at essentially no cost in accelerator processing time to the running application. Because our code makes no other use of the host's memory, we are able to place complete copies of the fundamental data for the computation on the host. While the accelerators carry the computation forward, the host can then simultaneously process its copy of the data into compressed formats for disk output or gradually stream its copy out to disk as a restart file. To consolidate such data streams so that only a small number of disk files are generated, we instantiate a small number of specialized MPI processes to receive the individual I/O streams from many hosts and to write them serially out to single disk files.
    Each such I/O process is assigned to between 32 and 512 worker processes, depending upon the characteristics of the computing system and the particular run underway. Typically, we find that the application computes between 50 and 500 time steps in the interval during which a single time level's fundamental data is dumped to disk in this fashion. Running interactively in our lab, we have one such I/O process streaming compressed data for flow visualization to the 12 workstations attached to our PowerWall display. These 12 workstations use their GPUs to construct volume-rendered images as the computation proceeds. The parameters of the flow visualization are under interactive user control through a graphical user interface (GUI) as the simulation runs. We are using this code in collaboration with colleagues at Los Alamos to study compressible turbulent mixing at accelerated multifluid interfaces. We are also using a separate version of the code to simulate entrainment of stably stratified hydrogen-rich fuel into the convection zone above the helium-burning shell in a giant star near the end of its life.
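The grouping of worker processes under a small number of I/O processes can be illustrated with a short sketch. The abstract does not say how workers are mapped to I/O processes, so the contiguous-block mapping below is an assumption, and the function names `assign_io_groups` and `io_process_count` are hypothetical. The sketch uses plain Python rather than MPI so that it runs anywhere.

```python
def io_process_count(n_workers: int, workers_per_io: int) -> int:
    """Number of I/O processes needed (ceiling division)."""
    if n_workers <= 0 or workers_per_io <= 0:
        raise ValueError("counts must be positive")
    return -(-n_workers // workers_per_io)


def assign_io_groups(n_workers: int, workers_per_io: int) -> list[int]:
    """For each worker rank, return the index of the I/O process serving it.

    Assumption: workers are assigned in contiguous blocks of ranks.
    The abstract only states that each I/O process serves between
    32 and 512 workers, not how the blocks are chosen.
    """
    if n_workers <= 0 or workers_per_io <= 0:
        raise ValueError("counts must be positive")
    return [rank // workers_per_io for rank in range(n_workers)]


if __name__ == "__main__":
    groups = assign_io_groups(n_workers=1024, workers_per_io=256)
    print(io_process_count(1024, 256), "I/O processes")
    print("worker 300 streams to I/O process", groups[300])
```

With a mapping of this kind, each I/O process would receive the streams of its own block of workers and write them out serially to a single file, which is the consolidation the abstract describes.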
  • Source
    ABSTRACT: The potential for GPUs and many-core CPUs to support high performance computation in the area of computational fluid dynamics (CFD) is explored quantitatively through the example of the PPM gas dynamics code with PPB multifluid volume fraction advection. This code has already been implemented on the IBM Cell processor and run at full scale on the Los Alamos Roadrunner machine. This implementation has involved a complete restructuring of the code that has been described in detail elsewhere. Here the lessons learned from that work are exploited to take advantage of today's latest generations of multi-core CPUs and many-core GPUs. The operations performed by this code are characterized in detail after being first decomposed into a series of individual code kernels to allow an implementation on GPUs. Careful implementations of this code for both CPUs and GPUs are then contrasted from a performance point of view. In addition, a single kernel that has many of the characteristics of the full application on CPUs has been built into a full, standalone, scalable parallel application. This single-kernel application shows the GPU at its best. In contrast, the full multifluid gas dynamics application brings into play computational requirements that highlight the essential differences in CPU and GPU designs today and the different programming strategies needed to achieve the best performance for applications of this type on the two devices. The single-kernel application code performs extremely well on both platforms. This application is not limited by main memory bandwidth on either device; instead, it is limited only by the computational capability of each. In this case, the GPU has the advantage, because it has more computational cores. The full multifluid gas dynamics code is, however, of necessity memory-bandwidth limited on the GPU, while it is still computational-capability limited on the CPU.
    We believe that these codes provide a useful context for quantifying the costs and benefits of design decisions for these powerful new computing devices. Suggestions for improvements in both devices and codes based upon this work are offered in our conclusions.
    Application Accelerators in High-Performance Computing (SAAHPC), 2011 Symposium on; 08/2011
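The distinction drawn in this abstract between a code limited by memory bandwidth and one limited by computational capability can be illustrated with the standard roofline model. This is not the authors' analysis: the model is substituted here as a common way to express the idea, and the device numbers in the example are hypothetical placeholders, not measurements from the paper.

```python
def attainable_gflops(peak_gflops: float,
                      bandwidth_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline estimate: performance is capped either by peak compute
    or by memory bandwidth times arithmetic intensity (flops per byte)."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


def limiting_resource(peak_gflops: float,
                      bandwidth_gbs: float,
                      flops_per_byte: float) -> str:
    """Name the resource that sets the roofline bound."""
    if bandwidth_gbs * flops_per_byte < peak_gflops:
        return "memory bandwidth"
    return "compute"


if __name__ == "__main__":
    # Hypothetical devices, chosen only to illustrate the two regimes.
    devices = {
        "gpu-like": {"peak_gflops": 1000.0, "bandwidth_gbs": 100.0},
        "cpu-like": {"peak_gflops": 100.0, "bandwidth_gbs": 25.0},
    }
    intensity = 8.0  # flops per byte moved from main memory (assumed)
    for name, d in devices.items():
        bound = attainable_gflops(d["peak_gflops"], d["bandwidth_gbs"], intensity)
        limit = limiting_resource(d["peak_gflops"], d["bandwidth_gbs"], intensity)
        print(f"{name}: {bound:.0f} GFLOP/s, limited by {limit}")
```

Under these assumed numbers the same kernel is bandwidth-bound on the device with the higher peak rate and compute-bound on the other, which is the qualitative pattern the abstract reports for the full multifluid code.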
