Boosting the performance of computational fluid dynamics codes for interactive supercomputing.

Procedia Computer Science 05/2010; 1:2055-2064. DOI: 10.1016/j.procs.2010.04.230
Source: DBLP

ABSTRACT An extreme form of pipelining of the Piecewise-Parabolic Method (PPM) gas dynamics code has been used to dramatically increase its performance on the new generation of multicore CPUs. Exploiting this technique, together with a full integration of the several data post-processing and visualization utilities associated with this code, has enabled numerical experiments in computational fluid dynamics to be performed interactively on a new, dedicated system in our lab, with immediate, user-controlled visualization of the resulting flows on the PowerWall display. The code restructuring required to achieve the necessary CPU performance boost, as well as the parallel computing methods and systems used to enable interactive flow simulation, are described. Requirements for these techniques to be applied to other codes are discussed, and our plans for tools that will assist programmers in exploiting these techniques are briefly described. Examples showing the capability of the new system and software are given for applications in turbulence and stellar convection.

  • Source
    ABSTRACT: In the Laboratory for Computational Science & Engineering at the University of Minnesota, our team has used IBM Roadrunner triblades to build an interactive supercomputing system connected to our 12-panel PowerWall display. We have completely redesigned our multifluid hydrodynamics code to run well on this system with fully integrated, prompt fluid flow visualization. Our code is based upon the Piecewise-Parabolic Method (PPM) and the Piecewise-Parabolic Boltzmann scheme (PPB) for moment-conserving advection of the individual fluid mixing fractions [1]. This new code is specially designed for the Roadrunner hardware and has been extensively tested both in our lab at the University of Minnesota and at Los Alamos. Our approach represents an extreme design in which all computation is performed on the Cell processor accelerator, and the host Opteron is used only for I/O and networking service functions. This strategy avoids frequent data exchanges of small granularity between the host and the accelerator, and any processing stalls that the latency of such exchanges might cause. It also enables a simplified messaging strategy for transport of data between the accelerators. Finally, this approach enables fully asynchronous I/O at essentially no cost in accelerator processing time to the running application. Because our code makes no other use of the host's memory, we are able to place complete copies of the fundamental data for the computation on the host. While the accelerators carry the computation forward, the host can then simultaneously process its copy of the data into compressed formats for disk output or gradually stream its copy out to disk as a restart file. To consolidate such data streams so that only a small number of disk files are generated, we instantiate a small number of specialized MPI processes to receive the individual I/O streams from many hosts and to write them serially out to single disk files.
Each such I/O process is assigned to between 32 and 512 worker processes, depending upon the characteristics of the computing system and the particular run underway. Typically, we find that the application computes between 50 and 500 time steps in the interval during which a single time level's fundamental data is dumped to disk in this fashion. Running interactively in our lab, we have one such I/O process streaming compressed data for flow visualization to the 12 workstations attached to our PowerWall display. These 12 workstations use their GPUs to construct volume-rendered images as the computation proceeds. The parameters of the flow visualization are under interactive user control through a graphical user interface (GUI) as the simulation runs. We are using this code in collaboration with colleagues at Los Alamos to study compressible turbulent mixing at accelerated multifluid interfaces. We are also using a separate version of the code to simulate entrainment of stably stratified hydrogen-rich fuel into the convection zone above the helium burning shell in a giant star near the end of its life.
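The consolidation of many worker streams into a few disk files, as described in the abstract above, can be sketched roughly as follows. This is a minimal illustration in plain Python without MPI, not the authors' implementation. The contiguous block mapping of workers to I/O processes, the function names, and the in-memory byte streams are all assumptions made for the example; only the range of 32 to 512 workers per I/O process comes from the abstract.

```python
from collections import defaultdict
from typing import Dict, List


def io_group_of(worker: int, workers_per_io: int) -> int:
    """Return the index of the I/O process serving a worker.

    Assumes a simple contiguous block mapping (hypothetical; the
    abstract does not say how workers are assigned).
    """
    if not 32 <= workers_per_io <= 512:
        raise ValueError("the abstract cites 32 to 512 workers per I/O process")
    return worker // workers_per_io


def build_groups(n_workers: int, workers_per_io: int) -> Dict[int, List[int]]:
    """Group worker indices under the I/O process that receives their streams."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for worker in range(n_workers):
        groups[io_group_of(worker, workers_per_io)].append(worker)
    return dict(groups)


def consolidate(streams: Dict[int, bytes]) -> bytes:
    """Concatenate per-worker streams serially, in worker order, into one
    buffer that stands in for a single disk file."""
    return b"".join(streams[worker] for worker in sorted(streams))


# Example: 1024 workers, 256 per I/O process, gives 4 output files.
groups = build_groups(1024, 256)
single_file = consolidate({1: b"second", 0: b"first-"})
```

With these illustrative settings, 1024 worker streams end up in 4 files rather than 1024, which is the effect the abstract describes.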
  • Source
    ABSTRACT: The combination of expert-tuned code expression and aggressive compiler optimizations is known to deliver the best achievable performance for modern multicore processors. The development and maintenance of these optimized code expressions is never trivial. Tedious and error-prone processes greatly decrease the code developer's willingness to adopt manually-tuned optimizations. In this paper, we describe a pre-compilation framework that will take a code expression with a much higher programmability and transform it into an optimized expression that would be much more difficult to produce manually. The user-directed, source-to-source transformations we implement rely heavily on the knowledge of the domain expert. The transformed output, in its optimized format, together with the optimizations provided by an available compiler is intended to deliver exceptionally high performance. Three computational fluid dynamics (CFD) applications are chosen to exemplify our strategy. The performance results show an average of 7.3× speedup over straightforward compilation of the input code expression to our pre-compilation framework. Performance of 7.1 Gflop/s/core on the Intel Nehalem processor, 30% of the peak performance, is achieved using our strategy. The framework is seen to be successful for the CFD domain, and we expect that this approach can be extended to cover more scientific computation domains.
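As a quick check on the figures quoted in this abstract, the arithmetic below derives the per-core peak implied by the two stated numbers. Only the 7.1 Gflop/s and 30% values come from the abstract. The function name is illustrative, and the derived peak is an inference from rounded figures, not a number the authors report.

```python
def implied_peak(achieved_gflops: float, fraction_of_peak: float) -> float:
    """Per-core peak rate implied by an achieved rate and its share of peak."""
    if not 0.0 < fraction_of_peak <= 1.0:
        raise ValueError("fraction_of_peak must be in (0, 1]")
    return achieved_gflops / fraction_of_peak


# Values quoted in the abstract: 7.1 Gflop/s per core at 30% of peak.
peak = implied_peak(7.1, 0.30)
print(f"implied peak: about {peak:.1f} Gflop/s per core")
```

Because both inputs are rounded, the result of roughly 23.7 Gflop/s per core should be read as an estimate only.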
