Boosting the performance of computational fluid dynamics codes for interactive supercomputing

Procedia Computer Science 05/2010; 1(1):2055-2064. DOI: 10.1016/j.procs.2010.04.230
Source: DBLP


An extreme form of pipelining of the Piecewise-Parabolic Method (PPM) gas dynamics code has been used to dramatically increase its performance on the new generation of multicore CPUs. Exploiting this technique, together with a full integration of the several data post-processing and visualization utilities associated with this code, has enabled numerical experiments in computational fluid dynamics to be performed interactively on a new, dedicated system in our lab, with immediate, user-controlled visualization of the resulting flows on the PowerWall display. The code restructuring required to achieve the necessary CPU performance boost, as well as the parallel computing methods and systems used to enable interactive flow simulation, are described. Requirements for applying these techniques to other codes are discussed, and our plans for tools that will assist programmers in exploiting these techniques are briefly described. Examples showing the capability of the new system and software are given for applications in turbulence and stellar convection.



    • "The Intel Westmere CPU is able to prefetch all the data needed to update a single grid briquette of 4³ cells for each of 2 threads running on each of its 6 CPU cores during only 43% of the time it takes that thread to update the previous grid briquette in a sequence along the direction of the 1-D pass. The massive pipelining of our fluid dynamics algorithm that makes this possible has been described in detail elsewhere [2] [3] [4] [5]. It results in an overall computational intensity of 33.5 flops per main device memory word read or written, which is sufficient to allow the CPU core to run without waiting on data transfers from or to its main device memory."
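The claim that a computational intensity of 33.5 flops per memory word hides all memory traffic can be checked with a back-of-the-envelope sketch. Only the 33.5 flops/word figure comes from the text; the per-core flop rate and memory bandwidth below are hypothetical placeholders chosen for illustration:

```python
# Back-of-the-envelope check: a core avoids stalling on memory when
# the time it spends computing on one word of memory traffic exceeds
# the time it takes to transfer that word.
FLOPS_PER_WORD = 33.5              # computational intensity (from the text)
PEAK_GFLOPS_PER_CORE = 10.0        # hypothetical sustained flop rate per core
GWORDS_PER_SEC_PER_CORE = 0.4      # hypothetical memory bandwidth per core

# Nanoseconds of compute generated per word of memory traffic.
compute_time_per_word = FLOPS_PER_WORD / PEAK_GFLOPS_PER_CORE   # 3.35 ns
# Nanoseconds needed to move one word to or from main memory.
fetch_time_per_word = 1.0 / GWORDS_PER_SEC_PER_CORE             # 2.5 ns

# With these (assumed) rates, prefetched data arrives before it is needed.
core_stalls_on_memory = fetch_time_per_word > compute_time_per_word
print(core_stalls_on_memory)
```

With these placeholder rates the fetch time per word (2.5 ns) is comfortably below the compute time per word (3.35 ns), so the core never waits on memory, which is the situation the citing authors describe.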
    • "Such codes consist of a large number of bursts of computation expressed as individual doubly or triply nested loops running over an entire computational domain for a network node. According to the transformations described in [2] [3] [4] [5] [6], our translator will in-line, align, fuse, and restructure working memory, under programmer directive control, to accelerate the code execution. These transformations involve, critically, restructuring of the data layout in node main memory, as specified by the programmer via values of Fortran code parameters and by directives to the translator tool. "
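The fusion step the translator performs can be illustrated with a hypothetical before/after pair. This is a minimal sketch in Python for clarity; the real translator operates on Fortran loop nests under programmer directives, not on Python functions:

```python
import numpy as np

# Before: two separate whole-domain passes, as in the "bursts of
# computation" the citing authors describe. The intermediate array
# tmp is written to and read back from memory.
def two_passes(a):
    tmp = np.empty_like(a)
    for i in range(a.size):        # pass 1 over the entire domain
        tmp[i] = 2.0 * a[i]
    out = np.empty_like(a)
    for i in range(a.size):        # pass 2 over the entire domain
        out[i] = tmp[i] + 1.0
    return out

# After: the two loops fused into one. The intermediate value lives
# in a register, so each element is read and written exactly once.
def fused_pass(a):
    out = np.empty_like(a)
    for i in range(a.size):        # single fused pass, no tmp array
        out[i] = 2.0 * a[i] + 1.0
    return out

a = np.arange(8, dtype=np.float64)
assert np.allclose(two_passes(a), fused_pass(a))   # identical results
```

The point of the transformation is not the arithmetic, which is unchanged, but the memory traffic: fusing eliminates a full-domain temporary, which is exactly the kind of restructuring of working memory the snippet attributes to the translator.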
    ABSTRACT: In the Laboratory for Computational Science & Engineering at the University of Minnesota, our team has used IBM Roadrunner triblades to build an interactive supercomputing system connected to our 12-panel PowerWall display. We have completely redesigned our multifluid hydrodynamics code to run well on this system with fully integrated, prompt fluid flow visualization. Our code is based upon the Piecewise-Parabolic Method (PPM) and the Piecewise-Parabolic Boltzmann scheme (PPB) for moment-conserving advection of the individual fluid mixing fractions [1]. This new code is specially designed for the Roadrunner hardware and has been extensively tested both in our lab at the University of Minnesota and also at Los Alamos. Our approach represents an extreme design in which all computation is performed on the Cell processor accelerator, and the host Opteron is used only for I/O and networking service functions. This strategy avoids frequent data exchanges of small granularity between the host and the accelerator, and any processing stalls that the latency of such exchanges might cause. It also enables a simplified messaging strategy for transport of data between the accelerators. Finally, this approach enables fully asynchronous I/O at essentially no cost in accelerator processing time to the running application. Because our code makes no other use of the host's memory, we are able to place complete copies of the fundamental data for the computation on the host. While the accelerators carry the computation forward, the host can then simultaneously process its copy of the data into compressed formats for disk output or gradually stream its copy out to disk as a restart file. To consolidate such data streams so that only a small number of disk files are generated, we instantiate a small number of specialized MPI processes to receive the individual I/O streams from many hosts and to write them serially out to single disk files.
Each such I/O process is assigned to between 32 and 512 worker processes, depending upon the characteristics of the computing system and the particular run underway. Typically, we find that the application computes between 50 and 500 time steps in the interval during which a single time level's fundamental data is dumped to disk in this fashion. Running interactively in our lab, we have one such I/O process streaming compressed data for flow visualization to the 12 workstations attached to our PowerWall display. These 12 workstations use their GPUs to construct volume-rendered images as the computation proceeds. The parameters of the flow visualization are under interactive user control through a graphical user interface (GUI) as the simulation runs. We are using this code in collaboration with colleagues at Los Alamos to study compressible turbulent mixing at accelerated multifluid interfaces. We are also using a separate version of the code to simulate entrainment of stably stratified hydrogen-rich fuel into the convection zone above the helium burning shell in a giant star near the end of its life.
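The stream-consolidation pattern described above, in which many workers funnel their output through one dedicated writer, can be sketched in miniature. The sketch below uses Python threads and an in-memory buffer purely for illustration; in the actual code the writer is a dedicated MPI process serving 32 to 512 workers and the sink is a single disk file:

```python
import io
import queue
import threading

# Hypothetical miniature of the I/O consolidation pattern: NUM_WORKERS
# "worker processes" each produce a data chunk; one "I/O process"
# drains them all and serializes them into a single output stream.
NUM_WORKERS = 4
q = queue.Queue()      # stands in for MPI messages from workers to the I/O rank
sink = io.BytesIO()    # stands in for the single consolidated disk file

def worker(rank):
    # Each worker sends its chunk of a time level's data, tagged by rank.
    payload = f"rank {rank} dump\n".encode()
    q.put((rank, payload))

def io_process():
    # The single writer collects every chunk, then writes them serially
    # in rank order so the on-disk layout is deterministic.
    chunks = [q.get() for _ in range(NUM_WORKERS)]
    for rank, payload in sorted(chunks):
        sink.write(payload)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
io_process()
print(sink.getvalue().decode())
```

Because only the writer touches the sink, the workers never contend for the file, and the number of files scales with the number of writers rather than the number of workers, which is the property the abstract emphasizes.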
    ABSTRACT: The combination of expert-tuned code expression and aggressive compiler optimizations is known to deliver the best achievable performance for modern multicore processors. The development and maintenance of these optimized code expressions is never trivial. Tedious and error-prone processes greatly decrease the code developer's willingness to adopt manually-tuned optimizations. In this paper, we describe a pre-compilation framework that will take a code expression with a much higher programmability and transform it into an optimized expression that would be much more difficult to produce manually. The user-directed, source-to-source transformations we implement rely heavily on the knowledge of the domain expert. The transformed output, in its optimized format, together with the optimizations provided by an available compiler is intended to deliver exceptionally high performance. Three computational fluid dynamics (CFD) applications are chosen to exemplify our strategy. The performance results show an average of 7.3× speedup over straightforward compilation of the input code expression to our pre-compilation framework. Performance of 7.1 Gflops/s/core on the Intel Nehalem processor, 30% of the peak performance, is achieved using our strategy. The framework is seen to be successful for the CFD domain, and we expect that this approach can be extended to cover more scientific computation domains.
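The shape of such a user-directed, source-to-source pre-compilation pass can be sketched very roughly. Everything below is invented for illustration (the `!inline` directive syntax, the `flux` function, and the `precompile` helper are not from the paper); the real framework handles far richer transformations on Fortran source:

```python
import re

# Hypothetical miniature of a directive-driven source-to-source pass:
# an "!inline" directive asks the pre-compiler to substitute a small
# function's body at its call sites before the real compiler runs.
SOURCE = """
!inline flux
y = flux(a, b)
"""

# Invented definition table: function name -> body with argument slots.
DEFINITIONS = {"flux": "0.5 * ({0} + {1})"}

def precompile(src):
    out = []
    to_inline = set()
    for line in src.strip().splitlines():
        m = re.match(r"!inline\s+(\w+)", line)
        if m:
            to_inline.add(m.group(1))   # record the directive, emit nothing
            continue
        for name in to_inline:
            call = re.search(rf"{name}\(([^)]*)\)", line)
            if call:
                args = [a.strip() for a in call.group(1).split(",")]
                # Replace the call with the function body, arguments
                # substituted textually, so the compiler sees straight-line code.
                line = line.replace(call.group(0), DEFINITIONS[name].format(*args))
        out.append(line)
    return "\n".join(out)

print(precompile(SOURCE))   # prints: y = 0.5 * (a + b)
```

The design point this illustrates is that the programmer's directives, not a general-purpose analysis, decide which transformations fire, which is how the abstract says domain-expert knowledge enters the framework.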