Scientific Programming

Published by IOS Press
Print ISSN: 1058-9244
A number of scientific applications are performance-limited by expressions that repeatedly call costly elementary functions. Lookup table (LUT) optimization accelerates the evaluation of such functions by reusing previously computed results. LUT methods can speed up applications that tolerate an approximation of function results, thereby achieving a high level of fuzzy reuse. One problem with LUT optimization is the difficulty of controlling the tradeoff between performance and accuracy. The current practice of manual LUT optimization adds programming effort by requiring extensive experimentation to make this tradeoff, and such hand tuning can obfuscate algorithms. In this paper we describe a methodology and tool implementation to improve the application of software LUT optimization. Our Mesa tool implements source-to-source transformations for C or C++ code to automate the tedious and error-prone aspects of LUT generation such as domain profiling, error analysis, and code generation. We evaluate Mesa with five scientific applications. Our results show a performance improvement of 3.0× and 6.9× for two molecular biology algorithms, 1.4× for a molecular dynamics program, 2.1× to 2.8× for a neural network application, and 4.6× for a hydrology calculation. We find that Mesa enables LUT optimization with more control over accuracy and less effort than manual approaches.
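The performance/accuracy tradeoff described in this abstract can be illustrated with a minimal sketch. It is written in Python rather than the C or C++ that Mesa transforms, and it is not Mesa's generated code: the function names, the domain [0, 1], and the table sizes are assumptions chosen for illustration. The table stores samples of `math.exp`, lookups return the nearest sample, and a dense probe estimates the worst-case error for two table sizes.

```python
import math

def build_lut(func, lo, hi, size):
    """Sample func at `size` evenly spaced points covering [lo, hi]."""
    step = (hi - lo) / (size - 1)
    return [func(lo + i * step) for i in range(size)], step

def lut_eval(table, step, lo, x):
    """Approximate func(x) by the nearest precomputed sample."""
    index = int(round((x - lo) / step))
    index = max(0, min(len(table) - 1, index))  # clamp out-of-domain inputs
    return table[index]

def max_abs_error(func, table, step, lo, hi, probes=10_000):
    """Estimate the worst-case error by probing the domain densely."""
    worst = 0.0
    for k in range(probes + 1):
        x = lo + (hi - lo) * k / probes
        worst = max(worst, abs(func(x) - lut_eval(table, step, lo, x)))
    return worst

# Hypothetical domain and table sizes, chosen only for the comparison.
LO, HI = 0.0, 1.0
small, small_step = build_lut(math.exp, LO, HI, 65)
large, large_step = build_lut(math.exp, LO, HI, 1025)
err_small = max_abs_error(math.exp, small, small_step, LO, HI)
err_large = max_abs_error(math.exp, large, large_step, LO, HI)
```

A larger table lowers the error at the cost of memory, which is the kind of tradeoff that otherwise has to be explored by hand.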
In this two-dimensional example of SAMR, a root grid has two sub-grids with one-half the mesh spacing and one sub-grid has an additional sub-sub-grid with even higher resolution. The tree structure on the left represents how these data are stored, while on the right we show the resulting composite solution. 
In these frames we show a zoom into the star forming region. Each panel shows a slice of the logarithm of the gas density magnified by a factor of ten relative to the previous frame, starting at the upper left. 
Radial profiles of mass-weighted spherical averages, taken about the densest point in the cloud, of various physical quantities at seven different output times. Panel A shows the evolution of the particle number density in cm$^{-3}$ as a function of radius at redshift 19 (solid line), nine Myr later (dotted lines with circles), 0.3 Myr later (dashed line), $3 \times 10^4$ years later (long dashed line), $3 \times 10^3$ years later (dot-dashed line), $1.5 \times 10^3$ years later (solid line) and finally 200 years later (dotted line with circles). The two lines between $10^{-2}$ and 200 pc give the DM mass density in GeV cm$^{-3}$ at z=19 and the final time, respectively. Panel B gives the enclosed gas mass as a function of radius. In C the mass fractions of atomic hydrogen and molecular hydrogen are shown. Panels D and E illustrate the temperature evolution and the mass-weighted radial velocity of the baryons, respectively. The bottom line with filled symbols in panel E shows the negative value of the local speed of sound at the final time. In all panels equal output times correspond to equal line styles. The upper x-axis in panel B gives the radius in astronomical units.
As an entry for the 2001 Gordon Bell Award in the "special" category, we describe our 3-d, hybrid, adaptive mesh refinement (AMR) code Enzo designed for high-resolution, multiphysics, cosmological structure formation simulations. Our parallel implementation places no limit on the depth or complexity of the adaptive grid hierarchy, allowing us to achieve unprecedented spatial and temporal dynamic range. We report on a simulation of primordial star formation which develops over 8000 subgrids at 34 levels of refinement to achieve a local refinement of a factor of $10^{12}$ in space and time. This allows us to resolve the properties of the first stars which form in the universe assuming standard physics and a standard cosmological model. Achieving extreme resolution requires the use of 128-bit extended precision arithmetic (EPA) to accurately specify the subgrid positions. We describe our EPA AMR implementation on the IBM SP2 Blue Horizon system at the San Diego Supercomputer Center.
A programming methodology based on tensor products has been used for designing and implementing block recursive algorithms for parallel and vector multiprocessors. A previous tensor product formulation of Strassen's matrix multiplication algorithm requires working arrays of size O(7<sup>n</sup>) for multiplying 2<sup>n</sup>×2<sup>n</sup> matrices. The authors present a modified tensor product formulation of Strassen's algorithm in which the size of working arrays can be reduced to O(4<sup>n</sup>). The modified formulation exhibits sufficient parallel and vector operations for efficient implementation. Performance results on the Cray Y-MP are presented.
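For readers unfamiliar with the underlying algorithm, the following is a minimal sketch of the textbook recursive form of Strassen's method for 2<sup>n</sup>×2<sup>n</sup> matrices, written in Python on nested lists. It is not the tensor product formulation discussed in the abstract and makes no attempt to limit working storage; the helper names are assumptions. Each level replaces eight block products with seven, so n levels of recursion perform 7<sup>n</sup> scalar multiplications.

```python
def mat_add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_sub(a, b):
    """Element-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def split(m):
    """Split a square matrix of even order into four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])

def strassen(a, b):
    """Multiply two 2**n x 2**n matrices using seven recursive products."""
    if len(a) == 1:
        return [[a[0][0] * b[0][0]]]
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    m1 = strassen(mat_add(a11, a22), mat_add(b11, b22))
    m2 = strassen(mat_add(a21, a22), b11)
    m3 = strassen(a11, mat_sub(b12, b22))
    m4 = strassen(a22, mat_sub(b21, b11))
    m5 = strassen(mat_add(a11, a12), b22)
    m6 = strassen(mat_sub(a21, a11), mat_add(b11, b12))
    m7 = strassen(mat_sub(a12, a22), mat_add(b21, b22))
    c11 = mat_add(mat_sub(mat_add(m1, m4), m5), m7)
    c12 = mat_add(m3, m5)
    c21 = mat_add(m2, m4)
    c22 = mat_add(mat_sub(mat_add(m1, m3), m2), m6)
    # Reassemble the four quadrants into one matrix.
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom

product = strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]])  # [[19, 22], [43, 50]]
```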
The application fields of bytecode virtual machines and VLIW processors overlap in the area of embedded and mobile systems, where the two technologies offer different benefits, namely high code portability, low power consumption and reduced hardware cost. Dynamic compilation makes it possible to bridge the gap between the two technologies, but special attention must be paid to software instruction scheduling, a must for the VLIW architectures. We have implemented JIST, a Virtual Machine and JIT compiler for Java Bytecode targeted to a VLIW processor. We show the impact of various optimizations on the performance of code compiled with JIST through the experimental study on a set of benchmark programs. We report significant speedups, and increments in the number of instructions issued per cycle up to 50% with respect to the non-scheduling version of the JIT compiler. Further optimizations are discussed.
Sending an application message
Prototypes for the Q-Kernel Entry Points
This paper presents an overview of PUMA (Performance-oriented, User-managed Messaging Architecture), a message passing kernel. Message passing in PUMA is based on portals. A portal is an opening in the address space of an application process. Once an application process has established a portal, other processes can write values into the portal using a simple send operation. Because messages are written directly into the address space of the receiving process, there is no need to buffer messages in the PUMA kernel and later copy them into the application's address space. PUMA consists of two components: the quintessential kernel (Q-Kernel) and the process control thread (PCT). While the PCT provides management decisions, the Q-Kernel controls access and implements the policies specified by the PCT.
The problems of misleading performance reporting and the evident lack of careful refereeing in the supercomputing field are discussed in detail. Included are some examples that have appeared in recently published scientific papers. Some guidelines for reporting performance are presented, the adoption of which would raise the level of professionalism and reduce the level of confusion in the field of supercomputing.
Sequential Nelder-Mead reflection
Concurrent Nelder-Mead reflection
First RSCS search direction-3D case
This paper describes a method of parallelisation of the popular Nelder-Mead simplex optimization algorithm that can lead to enhanced performance on parallel and distributed computing resources. A reducing set of simplex vertices is used to derive search directions generally closely aligned with the local gradient. When tested on a range of problems drawn from real-world applications in science and engineering, this reducing set concurrent simplex (RSCS) variant of the Nelder-Mead algorithm compared favourably with the original algorithm, and also with the inherently parallel multidirectional search algorithm (MDS). All algorithms were implemented and tested in a general-purpose, grid-enabled optimization toolset.
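As background for the captions above, here is a minimal sketch of the standard sequential Nelder-Mead reflection step: the worst vertex is reflected through the centroid of the remaining vertices. It does not reproduce the RSCS variant, whose search directions are not specified in this abstract. The function names, the test objective `sphere`, and the starting simplex are assumptions.

```python
def reflect_worst(simplex, f, alpha=1.0):
    """Reflect the worst vertex through the centroid of the other vertices.

    Returns the vertices sorted from best to worst and the reflected point.
    """
    ordered = sorted(simplex, key=f)
    worst = ordered[-1]
    rest = ordered[:-1]
    dim = len(worst)
    centroid = [sum(v[i] for v in rest) / len(rest) for i in range(dim)]
    reflected = [c + alpha * (c - w) for c, w in zip(centroid, worst)]
    return ordered, reflected

def sphere(x):
    """Hypothetical objective: sum of squares, minimised at the origin."""
    return sum(t * t for t in x)

simplex = [[1.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
ordered, reflected = reflect_worst(simplex, sphere)
# The worst vertex [1.0, 3.0] is reflected to [2.0, -1.0].
```

In the full algorithm the reflected point is then compared with the existing vertices to decide between accepting it, expanding, or contracting.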
Recent results from the Planck satellite [9] compared with light-cone output from 2HOT. We present our numerical simulation results in the same HEALPix [21] Mollweide projection of the celestial
An illustration of background subtraction, which greatly improves the performance of the treecode algorithm for nearly uniform mass distributions (such as large-volume cosmological simulations, especially at early times). The bodies inside cell a interact with the bodies and cells inside the gray shaded area as usual. Bodies inside cell a interact with all other cells (b, for example) after the background contribution of a uniform density cube is subtracted from the multipole expansion. Empty cell c (which would be ignored in the usual algorithm) must have its background contribution subtracted as well. The background contribution of the gray shaded area to the calculated force and potential of the bodies in a is removed analytically.
An illustration of the relevant distances used in the multipole expansion and error bound equations.
We report on improvements made over the past two decades to our adaptive treecode N-body method (HOT). A mathematical and computational approach to the cosmological N-body problem is described, with performance and scalability measured up to 256k ($2^{18}$) processors. We present error analysis and scientific application results from a series of more than ten 69 billion ($4096^3$) particle cosmological simulations, accounting for $4 \times 10^{20}$ floating point operations. These results include the first simulations using the new constraints on the standard model of cosmology from the Planck satellite. Our simulations set a new standard for accuracy and scientific throughput, while meeting or exceeding the computational efficiency of the latest generation of hybrid TreePM N-body methods.
A simple mesh and its Sieve representation.
Initial distributed triangular mesh. 
Redistributed triangular mesh.
The distributed triangular mesh. 
Partition Section, with circular partition points and rectangular Sieve point data.
We have developed a new programming framework, called Sieve, to support parallel numerical PDE algorithms operating over distributed meshes. We have also developed a reference implementation of Sieve in C++ as a library of generic algorithms operating on distributed containers conforming to the Sieve interface. Sieve makes instances of the incidence relation, or \emph{arrows}, the conceptual first-class objects represented in the containers. Further, generic algorithms acting on this arrow container are systematically used to provide natural geometric operations on the topology and also, through duality, on the data. Finally, coverings and duality are used to encode not only individual meshes, but all types of hierarchies underlying PDE data structures, including multigrid and mesh partitions. In order to demonstrate the usefulness of the framework, we show how the mesh partition data can be represented and manipulated using the same fundamental mechanisms used to represent meshes. We present the complete description of an algorithm to encode a mesh partition and then distribute a mesh, which is independent of the mesh dimension, element shape, or embedding. Moreover, data associated with the mesh can be similarly distributed with exactly the same algorithm. The use of a high level of abstraction within the Sieve leads to several benefits in terms of code reuse, simplicity, and extensibility. We discuss these benefits and compare our approach to other existing mesh libraries. Comment: 36 pages, 22 figures
In the area of network performance and discovery, network tomography focuses on reconstructing network properties using only end-to-end measurements at the application layer. One challenging problem in network tomography is reconstructing available bandwidth along all links during multiple source/multiple destination transmissions. The traditional measurement procedures used for bandwidth tomography are extremely time consuming. We propose a novel solution to this problem. Our method counts the fragments exchanged during a BitTorrent broadcast. While this measurement has a high level of randomness, it can be obtained very efficiently, and aggregated into a reliable metric. This data is then analyzed with state-of-the-art algorithms, which reliably reconstruct logical clusters of nodes inter-connected by high bandwidth, as well as bottlenecks between these logical clusters. Our experiments demonstrate that the proposed two-phase approach efficiently solves the presented problem for a number of settings on a complex grid infrastructure.
An architectural view of Chemora. Chemora consists of three major components: The Cactus-Carpet computational infrastructure, CaKernel programming abstractions, and the Kranc code generator. Chemora takes a physics model described in a high level Equation Description Language and produces highly optimized code suitable for parallel execution on heterogeneous systems.
Visualization of a binary black hole system 
Weak-scaling test for McLachlan code performed on the Cane and Datura clusters. (n)p((m)t) stands for n processes per node using m threads each. (no) x-split stands for (not) dividing domain along the x axis. Smaller numbers are better, and ideal weak scaling corresponds to a horizontal line. The benchmark scales well on these platforms.
Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient manner for complex applications, without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms.
Representation of a horizontal data domain decomposition. The thin lines demarcate model grid boxes. The thick lines indicate processor boundaries. In this case, the model data are divided among 4 processors. 
point operations (in billions), run-time, total performance, and per pe performance for a 3 hour run of the 576 × 
In the 1990's computer manufacturers are increasingly turning to the development of parallel processor machines to meet the high performance needs of their customers. Simultaneously, atmospheric scientists study weather and climate phenomena ranging from hurricanes to El Nino to global warming that require increasingly fine resolution models. Here, implementation of a parallel atmospheric general circulation model (GCM) which exploits the power of massively parallel machines is described. Using the horizontal data domain decomposition methodology, this FORTRAN 90 model is able to integrate a 0.6 deg. longitude by 0.5 deg. latitude problem at a rate of 19 Gigaflops on 512 processors of a Cray T3E 600; corresponding to 280 seconds of wall-clock time per simulated model day. At this resolution, the model has 64 times as many degrees of freedom and performs 400 times as many floating point operations per simulated day as the model it replaces.
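The horizontal data domain decomposition mentioned above can be sketched as follows. This is an illustrative Python fragment, not the FORTRAN 90 model's code: the grid size, the 2 × 2 processor layout, and the function names are assumptions. Each processor receives a contiguous block of longitude and latitude indices, with any remainder spread over the first blocks.

```python
def block_bounds(n, parts, rank):
    """Half-open index range [start, stop) of block `rank` when n points
    are split as evenly as possible into `parts` contiguous blocks."""
    base, extra = divmod(n, parts)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

def decompose(n_lon, n_lat, p_lon, p_lat):
    """Map each processor rank to its (longitude range, latitude range)."""
    layout = {}
    for j in range(p_lat):
        for i in range(p_lon):
            layout[j * p_lon + i] = (block_bounds(n_lon, p_lon, i),
                                     block_bounds(n_lat, p_lat, j))
    return layout

# Hypothetical 10 x 6 grid divided among 4 processors.
layout = decompose(n_lon=10, n_lat=6, p_lon=2, p_lat=2)
```

Every processor keeps all vertical levels for the grid columns it owns, so only the horizontal boundaries between blocks require communication.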
A template-based generic programming approach was presented in a previous paper that separates the development effort of programming a physical model from that of computing additional quantities, such as derivatives, needed for embedded analysis algorithms. In this paper, we describe the implementation details for using the template-based generic programming approach for simulation and analysis of partial differential equations (PDEs). We detail several of the hurdles that we have encountered, and some of the software infrastructure developed to overcome them. We end with a demonstration where we present shape optimization and uncertainty quantification results for a 3D PDE application.
An approach for incorporating embedded simulation and analysis capabilities in complex simulation codes through template-based generic programming is presented. This approach relies on templating and operator overloading within the C++ language to transform a given calculation into one that can compute a variety of additional quantities that are necessary for many state-of-the-art simulation and analysis algorithms. An approach for incorporating these ideas into complex simulation codes through general graph-based assembly is also presented. These ideas have been implemented within a set of packages in the Trilinos framework and are demonstrated on a simple problem from chemical engineering.
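The idea of transforming a calculation through operator overloading can be sketched in a few lines. The fragment below uses Python, where duck typing stands in for C++ templating, and it is not the Trilinos implementation: the `Dual` class and the polynomial `residual` are assumptions made for illustration. The same function body is evaluated once with a plain float and once with a type that also carries a derivative.

```python
class Dual:
    """A value together with its derivative with respect to one input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        """Treat plain numbers as constants with zero derivative."""
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual._lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __rsub__(self, other):
        return Dual._lift(other) - self

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule for the derivative part.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def residual(x):
    """Hypothetical model code, written once for any scalar-like type."""
    return 3.0 * x * x - 2.0 * x + 1.0

plain = residual(2.0)               # ordinary evaluation
seeded = residual(Dual(2.0, 1.0))   # value and d(residual)/dx together
```

Seeding the input with derivative 1.0 makes `seeded.deriv` the derivative of the residual at that point, without any change to `residual` itself.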
BSML integration with PSE execution environment. The BSML parser generator creates parsers that handle input ports of each component. Execution manager controls the execution of a model instance that consists of components, model instance data, and model instance metadata. Figure 1 partially defines one such instance.  
We describe a binding schema markup language (BSML) for describing data interchange between scientific codes. Such a facility is an important constituent of scientific problem solving environments (PSEs). BSML is designed to integrate with a PSE or application composition system that views model specification and execution as a problem of managing semistructured data. The data interchange problem is addressed by three techniques for processing semistructured data: validation, binding, and conversion. We present BSML and describe its application to a PSE for wireless communications system design.
Understanding the functioning of a neural system in terms of its underlying circuitry is an important problem in neuroscience. Recent developments in electrophysiology and imaging allow one to simultaneously record activities of hundreds of neurons. Inferring the underlying neuronal connectivity patterns from such multi-neuronal spike train data streams is a challenging statistical and computational problem. This task involves finding significant temporal patterns from vast amounts of symbolic time series data. In this paper we show that the frequent episode mining methods from the field of temporal data mining can be very useful in this context. In the frequent episode discovery framework, the data is viewed as a sequence of events, each of which is characterized by an event type and its time of occurrence and episodes are certain types of temporal patterns in such data. Here we show that, using the set of discovered frequent episodes from multi-neuronal data, one can infer different types of connectivity patterns in the neural system that generated it. For this purpose, we introduce the notion of mining for frequent episodes under certain temporal constraints; the structure of these temporal constraints is motivated by the application. We present algorithms for discovering serial and parallel episodes under these temporal constraints. Through extensive simulation studies we demonstrate that these methods are useful for unearthing patterns of neuronal network connectivity.
The goal of the research described is to develop flexible language constructs for writing large data parallel numerical programs for distributed memory (MIMD) multiprocessors. Previously, several models have been developed to support synchronization and communication. Models for global synchronization include SIMD (Single Instruction Multiple Data), SPMD (Single Program Multiple Data), and sequential programs annotated with data distribution statements. The two primary models for communication include implicit communication based on shared memory and explicit communication based on messages. None of these models by themselves seem sufficient to permit the natural and efficient expression of the variety of algorithms that occur in large scientific computations. An overview of a new language that combines many of these programming models in a clean manner is given. This is done in a modular fashion such that different models can be combined to support large programs. Within a module, the selection of a model depends on the algorithm and its efficiency requirements. An overview of the language and discussion of some of the critical implementation details is given.
Many of the issues in developing an efficient interface for communication on distributed memory machines are described and a portable interface is proposed. Although the hardware component of message latency is less than one microsecond on many distributed memory machines, the software latency associated with sending and receiving typed messages is on the order of 50 microseconds. The reason for this imbalance is that the software interface does not match the hardware. By changing the interface to match the hardware more closely, applications with fine grained communication can be put on these machines. Based on several tests that were run on the iPSC/860, an interface that will better match current distributed memory machines is proposed. The model used in the proposed interface consists of a computation processor and a communication processor on each node. Communication between these processors and other nodes in the system is done through a buffered network. Information that is transmitted is either data or procedures to be executed on the remote processor. The dual processor system is better suited to handling asynchronous communications efficiently than a single processor system. The ability to send either data or procedures gives flexibility in minimizing message latency, depending on the type of communication being performed. The tests performed and the proposed interface are described.
In this paper we present Menhir, a compiler for generating sequential or parallel code from the Matlab language. The compiler has been designed in the context of using Matlab as a specification language. One of the major features of Menhir is its retargetability, which allows it to generate parallel and sequential C or Fortran code. We present the compilation process and the target system description for Menhir. Preliminary performance results are given and compared with MCC, the MathWorks Matlab compiler.
We describe the use and implementation of a polyshift function PSHIFT for circular shifts and end-off shifts. Polyshift is useful in many scientific codes using regular grids, such as finite difference codes in several dimensions, multigrid codes, molecular dynamics computations, and in lattice gauge physics computations, such as Quantum Chromodynamics (QCD) calculations. Our implementation of the PSHIFT function on the Connection Machine systems CM-2 and CM-200 offers a speedup of up to a factor of 3-4 compared to CSHIFT when the local data motion within a node is small. The PSHIFT routine is included in the Connection Machine Scientific Software Library (CMSSL). 1 Introduction Efficient and minimal data motion is critical for high performance in most computer architectures. The polyshift function presented in this paper addresses this issue. The impact of the data motion on performance depends upon the memory architecture of the system. Memory systems have been slower t...
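The two kinds of shift named in this abstract can be sketched on a one-dimensional Python list. The sign convention follows Fortran's CSHIFT and EOSHIFT, where a positive shift moves elements toward lower indices. The `polyshift` helper is hypothetical: it only illustrates requesting several shifts of one array in a single call and does not reproduce the PSHIFT interface or its data-motion optimizations.

```python
def cshift(values, shift):
    """Circular shift: element i of the result is values[(i + shift) mod n]."""
    n = len(values)
    return [values[(i + shift) % n] for i in range(n)]

def eoshift(values, shift, boundary=0):
    """End-off shift: elements moved past either end are dropped and the
    vacated positions are filled with `boundary`."""
    n = len(values)
    return [values[i + shift] if 0 <= i + shift < n else boundary
            for i in range(n)]

def polyshift(values, shifts):
    """Hypothetical helper: several circular shifts of one array at once."""
    return [cshift(values, s) for s in shifts]

grid = [10, 20, 30, 40]
left, right = polyshift(grid, [1, -1])
# left  is [20, 30, 40, 10]
# right is [40, 10, 20, 30]
```

A finite difference stencil on a periodic grid typically needs both neighbours of every point, which is why requesting the shifts together is attractive.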
Detailed algorithms for all-to-all broadcast and reduction are given for arrays mapped by binary or binary-reflected Gray code encoding to the processing nodes of binary cube networks. Algorithms are also given for the local computation of the array indices for the communicated data, thereby reducing the demand for communications bandwidth. For the Connection Machine system CM-200, Hamiltonian cycle based all-to-all communication algorithms yield a performance that is a factor of two to ten higher than the performance offered by algorithms based on trees, butterfly networks, or the Connection Machine router. The peak data rate achieved for all-to-all broadcast on a 2048 node Connection Machine system CM-200 is 5.4 Gbytes/sec when no reordering is required. If the time for data reordering is included, then the effective peak data rate is reduced to 2.5 Gbytes/sec. 1 Introduction We consider two forms of all-to-all communication in multiprocessor, distributed memory architect...
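The link between binary-reflected Gray codes and Hamiltonian cycles in a binary cube can be sketched briefly. The Python fragment below is an illustration under simplifying assumptions, not the algorithms of the paper: it uses a single cycle, one item per node, and hypothetical function names. Consecutive Gray codes differ in exactly one bit, so visiting the nodes in Gray code order traverses cube edges only, and passing items around that ring delivers every item to every node in 2^d - 1 steps.

```python
def gray(i):
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)

def hamiltonian_cycle(dim):
    """Order of all 2**dim cube nodes in which consecutive nodes
    (including last to first) differ in exactly one address bit."""
    return [gray(i) for i in range(1 << dim)]

def all_to_all_broadcast(dim):
    """Simulate a ring-based all-to-all broadcast along the cycle.

    Returns a dict mapping each node to the set of items it has received.
    """
    cycle = hamiltonian_cycle(dim)
    n = len(cycle)
    received = {node: {node} for node in cycle}
    in_flight = {node: node for node in cycle}  # item each node forwards next
    for _ in range(n - 1):
        moved = {}
        for pos, node in enumerate(cycle):
            successor = cycle[(pos + 1) % n]
            moved[successor] = in_flight[node]
        for node, item in moved.items():
            received[node].add(item)
        in_flight = moved
    return received

cycle = hamiltonian_cycle(3)   # [0, 1, 3, 2, 6, 7, 5, 4]
received = all_to_all_broadcast(3)
```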
IFS RAPS 4.0 T213L31 Benchmark. Semi-Lagrangian. Thin line represents ideal scaling. 
IFS T106L19 12 hr forecast timings (secs) on SX-4/32M cluster.
MC2 Performance on SX-4M: Single (bottom) versus multi-node (top).
NEC SX-4/32 performance degradation due to crossbar. Top: Predicted Mflops/sec rate from R × P A assuming R = 900Mflops/sec. Bottom: Observed Mflops/sec on an SX-4/32 node.
The NEC SX-4M cluster and Fujitsu VPP700 supercomputers are both based on custom vector processors using low-power CMOS technology. Their basic architectures and programming models are however somewhat different. A multi-node SX-4M cluster contains up to 32 processors per shared memory node, with a maximum of 16 nodes connected via the proprietary NEC IXS fibre channel crossbar network. A hybrid combination of inter-node MPI message-passing with intra-node multitasking or threads is possible. The Fujitsu VPP700 is a fully distributed-memory vector machine with a scalable crossbar interconnect which also supports MPI. The parallel performance of the MC2 model for high-resolution mesoscale forecasting over large domains and of the IFS RAPS 4.0 benchmark are presented for several different machine configurations. These include an SX-4/32 with 8 GB main memory unit (MMU), an SX-4/32M cluster (SX-4/16, 8 GB MMU + SX-4/16, 4 GB MMU) and up to 80 PEs of the VPP700. 1 Computational Sc...
Multilayer tool infrastructure. 
Memory accesses of some pages on node 1 (SOR code). 
The detailed access character of Page 65 on node 2 (SOR code). 
Shared memory applications running transparently on top of NUMA architectures often face severe performance problems due to bad data locality and excessive remote memory accesses. Optimizations with respect to data locality are therefore necessary, but require a fundamental understanding of an application's memory access behavior. The information necessary for this cannot be obtained using simple code instrumentation due to the implicit nature of the communication handled by the NUMA hardware, the large amount of traffic produced at runtime, and the fine access granularity in shared memory codes. In this paper an approach to overcome these problems and thereby to enable an easy and efficient optimization process is presented. Based on a low-level hardware monitoring facility in coordination with a comprehensive visualization tool, it enables the generation of memory access histograms capable of showing all memory accesses across the complete address space of an application's working set. This information can be used to identify access hot spots, to understand the dynamic behavior of shared memory applications, and to optimize applications using an application specific data layout resulting in significant performance improvements.
shows the time course of a typical experiment in its exploitation period with the computation, communication and blocking phases. (data size = 10GB, communication speed = 10MB/s, #slaves=3, average chunk size =1MB, H=1, lbe=1, the length of the exploration period = 100s and the number of periods = 10)
We report on the improvements that can be achieved by applying machine learning techniques, in particular reinforcement learning, for the dynamic load balancing of parallel applications. The applications being considered here are coarse grain data intensive applications. Such applications put high pressure on the interconnect of the hardware. Synchronization and load balancing in complex, heterogeneous networks need fast, flexible, adaptive load balancing algorithms. Viewing a parallel application as a one-state coordination game in the framework of multi-agent reinforcement learning, and by using a recently introduced multi-agent exploration technique, we are able to improve upon the classic job farming approach. The improvements are achieved with limited computation and communication overhead.
Comparison of PE idle time for different scheduling algorithms (10-Queen).
One of the challenges in programming distributed memory parallel machines is deciding how to allocate work to processors. This problem is particularly important for computations with unpredictable dynamic behaviors or irregular structures. We present a scheme for dynamic scheduling of medium-grained processes that is useful in this context. The Adaptive Contracting Within Neighborhood (ACWN) is a dynamic, distributed, load-dependent, and scalable scheme. It deals with dynamic and unpredictable creation of processes, and adapts to different systems. The scheme is described and contrasted with two other schemes that have been proposed in this context, namely the randomized allocation and the gradient model. The performance of the three schemes on an Intel iPSC/2 hypercube is presented and analyzed. The experimental results show that even though the ACWN algorithm incurs somewhat larger overhead than the randomized allocation, it achieves better performance in most cases due to its adapti...
Ratio of gradient/function evaluation
The numerical methods employed in the solution of many scientific computing problems require the computation of derivatives of a function $f : \mathbb{R}^n \to \mathbb{R}^m$. Both the accuracy and the computational requirements of the derivative computation are usually of critical importance for the robustness and speed of the numerical solution. ADIFOR (Automatic Differentiation In FORtran) is a source transformation tool that accepts Fortran 77 code for the computation of a function and writes portable Fortran 77 code for the computation of the derivatives. In contrast to previous approaches, ADIFOR views automatic differentiation as a source transformation problem. ADIFOR employs the data analysis capabilities of the ParaScope Parallel Programming Environment, which enable us to handle arbitrary Fortran 77 codes and to exploit the computational context in the computation of derivatives. Experimental results show that ADIFOR can handle real-life codes and that ADIFOR-generated codes are competitive wit...
The abstract mathematical theory of partial differential equations (PDEs) is formulated in terms of manifolds, scalar fields, tensors, and the like, but these algebraic structures are hardly recognizable in actual PDE solvers. The general aim of the Sophus programming style is to bridge the gap between theory and practice in the domain of PDE solvers. Its main ingredients are a library of abstract datatypes corresponding to the algebraic structures used in the mathematical theory and an algebraic expression style similar to the expression style used in the mathematical theory. Because of its emphasis on abstract datatypes, Sophus is most naturally combined with object-oriented languages or other languages supporting abstract datatypes. The resulting source code patterns are beyond the scope of current compiler optimizations, but are sufficiently specific for a dedicated source-to-source optimizer. The limited, domain-specific, character of Sophus is the key to success here. Th...
Functions in the Abstract Data and Communication Layer 
Predefined Event Handlers in MESSIAHS 
The messiahs project is investigating mechanisms that support task placement in heterogeneous, distributed, autonomous systems. messiahs provides a substrate on which scheduling algorithms can be implemented. These mechanisms were designed to support diverse task placement and load balancing algorithms. As part of this work, we have constructed an interface layer to the underlying mechanisms. This includes the messiahs Interface Language (MIL) and a library of function calls for constructing distributed schedulers. This paper gives an overview of messiahs, describes two sample interface layers in detail, and gives example implementations of well-known algorithms from the literature built using these layers. 1 Introduction Recent initiatives in high-speed, heterogeneous computing have spurred renewed interest in large-scale distributed systems, and the desire for better utilization of existing resources has contributed to this movement. A typical departmental computing environment al...
The Computational Grid is a promising platform for the efficient execution of parameter sweep applications over large parameter spaces. To achieve performance on the Grid, such applications must be scheduled so that shared data files are strategically placed to maximize reuse, and so that the application execution can adapt to the deliverable performance potential of target heterogeneous, distributed and shared resources. Parameter sweep applications are an important class of applications and would greatly benefit from the development of Grid middleware that embeds a scheduler for performance and targets Grid resources transparently. In this paper we describe a user-level Grid middleware project, the AppLeS Parameter Sweep Template (APST), that uses application-level scheduling techniques [1] and various Grid technologies to allow the efficient deployment of parameter sweep applications over the Grid. We discuss...
Relationship between versions of TRED2
This paper critically examines current parallel programming practice and optimising compiler development. The general strategies employed by compiler and programmer to optimise a Fortran program are described, and then illustrated for a specific case by applying them to a well known scientific program, TRED2, using the KSR-1 as the target architecture. Extensive measurement is applied to the resulting versions of the program, which are compared with a version produced by a commercial optimising compiler, KAP. The compiler strategy significantly outperforms KAP, and does not fall far short of the performance achieved by the programmer. Following the experimental section each approach is critiqued by the other. Perceived flaws, advantages and common ground are outlined, with an eye to improving both schemes. Keywords: Distributed Shared Memory, Expert Programmer, Parallelising Compiler, Performance Analysis, Linear Algebra. 1 Introduction Obtaining high performance from parallel compu...
The Navy Operational Global Atmospheric Prediction System (NOGAPS) includes a state-of-the-art spectral forecast model similar to models run at several major operational numerical weather prediction (NWP) centers around the world. The model, developed by the Naval Research Laboratory (NRL) in Monterey, California, has run operationally at the Fleet Numerical Meteorological and Oceanographic Center (FNMOC) since 1982, and most recently is being run on a Cray C90 in a multi-tasked configuration. Typically the multi-tasked code runs on 6 or 12 processors with overall parallel efficiency of about 90%. The operational resolution is T159L24, but other operational and research applications run at significantly lower resolutions. A scalable NOGAPS forecast model has been developed by NRL in anticipation of a FNMOC C90 replacement in about 2001, as well as for current NOGAPS research requirements to run on DOD High-Performance Computing (HPC) scalable systems. The model is designed to run with message passing (MPI). Model desi...
The PHG of matrix-matrix-multiplication. Solid edges mean realized templates for vertical pattern recognition; dashed edges for horizontal pattern recognition along cross edges. Solid cycles mean templates for unblocking or eliminating semantically invariant conditionals: dashed cycles represent templates for loop rerolling or integration of initializers.
PARAMAT pattern recognition tool.
A Summary of the Patterns Included in the Current Version of the Basic PARAMAT Pattern Library
The overall structure of a distributed memory back-end for PARAMAT.
This article describes a knowledge-based system for automatic parallelization of a wide class of sequential numerical codes operating on vectors and dense matrices, and for execution on distributed memory message-passing multiprocessors. Its main feature is a fast and powerful pattern recognition tool that locally identifies frequently occurring computations and programming concepts in the source code. This tool also works for dusty deck codes that have been "encrypted" by former machine-specific code transformations. Successful pattern recognition guides sophisticated code transformations including local algorithm replacement such that the parallelized code need not emerge from the sequential program structure by just parallelizing the loops. It allows access to an expert's knowledge on useful parallel algorithms, available machine-specific library routines, and powerful program transformations. The partially restored program semantics also supports local array alignment, distribution, and redistribution, and allows for faster and more exact prediction of the performance of the parallelized target code than is usually possible.
Loop Patterns 
Techniques Applied to Each Code 
Automatic parallelization is usually believed to be less effective at exploiting implicit parallelism in sparse/irregular programs than in their dense/regular counterparts. However, not much is really known because there have been few research reports on this topic. In this work, we have studied the possibility of using an automatic parallelizing compiler to detect the parallelism in sparse/irregular programs. The study with a collection of sparse/irregular programs led us to some common loop patterns. Based on these patterns three new techniques were derived that produced good speedups when manually applied to our benchmark codes. More importantly, these parallelization methods can be implemented in a parallelizing compiler and can be applied automatically. 1 Introduction Sparse computations are implementations of linear algebra algorithms that store and operate on the nonzero array elements only. These algorithms are usually complex, use sophisticated data structures, a...
The schematic flowchart of the ADI solver in ARC3D.
Timings of ARC3D with varying number of thread groups for a given total of 64 threads.
Timings from the outer level parallelization of ARC3D.
In this paper we describe the extension of the CAPO parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report first results for several benchmark codes and one full application that have been parallelized using our system.
Highest attained performance
Asymptotic performance of CRS Matrix-vector product
Asymptotic performance of Symmetrically stored CRS ILU solve
Asymptotic performance of CRS ILU solve
Introduction The traditional performance measurement for computers on scientific applications has been the Linpack benchmark [2], which evaluates the efficiency with which a machine can solve a dense system of equations. Since this operation allows for considerable reuse of data, it is possible to show performance figures that are a sizeable percentage of peak performance, even for machines with a severe imbalance between memory and processor speed. In practice, sparse linear systems are equally important, and for these the question of data reuse is more complicated. Sparse systems can be solved by direct or iterative methods, and especially for iterative methods one can say that there is little or no reuse of data. Thus, such operations will have a performance bound by the slower of the processor and the memory, in practice: the memory. We aim to measure the performance of a representative sample of iterative techniques on any given machine; we are not interested in comparing, say, one precondit
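To make the data-reuse argument concrete, here is a minimal sketch of the matrix-vector product for a matrix held in compressed row storage (CRS), the kernel named in the captions above. The function name and the three-array layout (`values`, `col_idx`, `row_ptr`) are assumptions chosen for illustration, not the benchmark's actual code.

```python
def crs_matvec(values, col_idx, row_ptr, x):
    """Return y = A @ x for a sparse matrix A stored in CRS form.

    values  : nonzero entries of A, listed row by row
    col_idx : column index of each entry in `values`
    row_ptr : row i occupies positions row_ptr[i] .. row_ptr[i + 1] - 1
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Each matrix entry is read exactly once, and x is accessed
            # indirectly through col_idx, so there is little data reuse.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Usage: the 3x3 matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]] in CRS form.
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = crs_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

Every multiply-add in the inner loop needs one matrix value, one index and one vector element from memory, which is why the passage expects such kernels to be bound by memory speed rather than processor speed.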
this report: David Bailey (NASA Ames Research Center), Michael Berry (University of Tennessee), Jack Dongarra (University of Tennessee/Oak Ridge National Laboratory), Vladimir Getov (University of Southampton), Tom Haupt (Syracuse University), Tony Hey (University of Southampton), Roger Hockney (University of Southampton), and David Walker (Oak Ridge National Laboratory). The following PARKBENCH participants were instrumental in defining/promoting the effort, attending meetings, and providing helpful comments and suggestions: Ed Brocklehurst (National Physical Laboratory), Koushik Ghosh (Cray Research), Charles Grassl (Cray Research), Ed Kushner (Intel SSD), Brian LaRose (Hewlett Packard), Todd Letsche (University of Tennessee), David Mackay (Intel SSD), Joanne Martin (IBM), Ramesh Natarajan (IBM, Yorktown Heights), Bodo Parady (Sun Microsystems), Robert Pennington (Pittsburgh Supercomputing Center), Philip Tannenbaum (NEC), Pearl Wang (George Mason University/US Geological Survey), and Patrick Worley (Oak Ridge National Laboratory). Special thanks are also due to Jack Dongarra in his role of host at our meetings in Knoxville, and to Mike Berry who has served valiantly as secretary at our meetings and produced excellent minutes in difficult circumstances. This publication and the earlier report could not have been produced without the dedication of Roger Hockney, Mike Berry and Vladimir Getov, who devoted many hours to turning a collection of individual contributions into a coherent LaTeX document that was fit for publication.
A Projections timeline view of two timesteps before optimizing the multicast. 
A Projections timeline view of two timesteps after optimizing the multicast. 
We present an optimized parallelization scheme for molecular dynamics simulations of large biomolecular systems, implemented in the production-quality molecular dynamics program NAMD. With an object-based hybrid force and spatial decomposition scheme, and an aggressive measurement-based predictive load balancing framework, we have attained speeds and speedups that are much higher than any reported in literature so far. The paper first summarizes the broad methodology we are pursuing, and the basic parallelization scheme we used. It then describes the optimizations that were instrumental in increasing performance, and presents performance results on benchmark simulations. 1 Introduction Understanding the structure and function of biomolecules such as proteins and DNA is crucial to our ability to understand the mechanisms of diseases, drugs, and normal life processes. With the experimental determination of structures for an increasing set of proteins it has become possible to em...
shows that adding one more term to the approximation costs less when the loop
shows the convergence and time for rolled and unrolled loops, If the loop
The most time consuming part of an N-body simulation is computing the components of the accelerations of the particles. On most machines the slowest part of computing the acceleration is in evaluating r^(-3/2) which is especially true on machines that do the square root in software. This note shows how to cut the time for this part of the calculation by a factor of 3 or more using standard Fortran.
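The abstract does not say which technique the note uses, so the following is only a sketch of one common approach: a division-free Newton iteration for the inverse square root, cubed to give r^(-3/2). The function name, the exponent-based starting guess and the fixed iteration count are all assumptions made for this illustration, and the example is written in Python rather than the note's Fortran.

```python
import math


def inv_three_halves(r, iterations=8):
    """Approximate r**(-3/2) for r > 0 without calling a square root.

    Newton's method for y = r**(-1/2) uses the update
        y <- y * (1.5 - 0.5 * r * y * y),
    which needs only multiplications and subtractions.
    """
    if r <= 0.0:
        raise ValueError("r must be positive")
    # Starting guess from the binary exponent: r = m * 2**e with
    # 0.5 <= m < 1, so y0 = 2**(-(e // 2)) gives 0.5 <= r * y0**2 < 2,
    # which lies inside the convergence region r * y0**2 < 3.
    _, e = math.frexp(r)
    y = math.ldexp(1.0, -(e // 2))
    # A fixed trip count lets a compiler unroll the loop completely.
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * r * y * y)
    return y * y * y


# Usage: compare against the direct library evaluation.
approx = inv_three_halves(3.7)
exact = 3.7 ** -1.5
```

The iteration converges quadratically, so each extra pass roughly doubles the number of correct digits; the default of eight passes is a conservative choice for double precision given this starting guess.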
Comparative Performance of Four Versions of Premix. Times were obtained using an eight-processor Alliant FX/80 with the FX/FORTRAN parallelizing compiler.
Inverse Execution Times Versus Number of Processors. Times for original and optimized parallel versions of Premix were obtained on an Alliant FX/80. An ideal performance improvement line is included for comparison.
We used a description of a combustion simulation's mathematical and computational methods to develop a version for parallel execution. The result was a reasonable performance improvement on small numbers of processors. We applied several important programming techniques, which we describe, in optimizing the application. This work has implications for programming languages, compiler design, and software engineering.
Class specific optimizations are compiler optimizations specified by the class implementor to the compiler. They allow the compiler to take advantage of the semantics of the particular class so as to produce better code. Optimizations of interest include the strength reduction of class::array address calculations, elimination of large temporaries, and the placement of asynchronous send/recv calls so as to achieve computation/communication overlap. We will outline our progress towards the implementation of a C compiler capable of incorporating class specific optimizations. Introduction During the implementation of complex systems in C++, particularly numerical ones, the implementor typically encounters performance problems of varying difficulty. These difficulties usually relate to the lack of semantic understanding the C++ compiler has of the user defined classes. This problem was recently studied in [2] where the potential solution of class based optimizations was put forth. A class bas...
This graph compares SPH3D performance results for a Cray C-90, Intel Paragon, IBM SP2, and an Alpha workstation farm running PVM. Measurements were gathered as described in Table 3. The number of processors for a particular machine was chosen to provide performance roughly comparable to a single processor of a Cray C-90; processor numbers are given in parentheses.
LPARX is a software development tool for implementing dynamic, irregular scientific applications, such as multilevel finite difference methods and particle methods, on high performance MIMD parallel architectures. It supports coarse grain data parallelism and gives the application complete control over specifying arbitrary block decompositions. LPARX provides structural abstraction, representing data decompositions as first-class objects that can be manipulated and modified at run-time. LPARX, implemented as a C++ class library, is currently running on diverse MIMD platforms, including the Intel Paragon, Cray C-90, IBM SP2, and networks of workstations running under PVM. Software may be developed and debugged on a single processor workstation. 1 Introduction An outstanding problem in scientific computation is how to manage the complexity of converting mathematical descriptions of dynamic, irregular numerical algorithms into high performance applications software. Non-unifo...
shows a high-level overview of the framework. As shown in the figure, each numerical code consists of one or more Computations. Each Computation provides the data and the solution for a single problem. Within a computation is a Controller and a Computational Space. The Computational Space (or "Space") is where the actual computation takes place; the Controller is an object that starts (and continues) the computation by sending messages to the Space. The effect is to move the outer loop of a numeric computation into the Controller. Since Spaces contain the data being manipulated, they are responsible for managing their own input and output.
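The division of labour described in this caption can be sketched as follows. The class and method names (`Space.step`, `Space.residual`, `Controller.run`) and the toy one-dimensional relaxation are hypothetical and chosen only for illustration; the framework's real interfaces are not given in the text.

```python
class Space:
    """Hypothetical Computational Space: owns the data and does the work."""

    def __init__(self, values):
        # Assumes at least three grid points; the end points are held fixed.
        self.values = list(values)

    def step(self):
        # One relaxation sweep: replace every interior point by the
        # average of its two neighbours.
        old = self.values
        interior = [(old[i - 1] + old[i + 1]) / 2.0
                    for i in range(1, len(old) - 1)]
        self.values = [old[0]] + interior + [old[-1]]

    def residual(self):
        # Largest amount by which an interior point differs from the
        # average of its neighbours.
        v = self.values
        return max(abs(v[i] - (v[i - 1] + v[i + 1]) / 2.0)
                   for i in range(1, len(v) - 1))

    def dump(self):
        # Spaces manage their own output.
        return " ".join(f"{x:.4f}" for x in self.values)


class Controller:
    """Hypothetical Controller: holds the outer loop of the computation."""

    def __init__(self, space, max_steps, tol):
        self.space = space
        self.max_steps = max_steps
        self.tol = tol

    def run(self):
        # Start the computation and keep it going by sending `step`
        # messages to the Space until it converges or the budget runs out.
        steps = 0
        while steps < self.max_steps and self.space.residual() > self.tol:
            self.space.step()
            steps += 1
        return steps


# Usage: relax a bar with its ends fixed at 0 and 1.
space = Space([0.0, 0.0, 0.0, 0.0, 1.0])
steps_taken = Controller(space, max_steps=1000, tol=1e-6).run()
```

Because the stopping rule lives in the Controller, the same Space could be driven by a different policy, such as a fixed number of steps, without touching the numerical kernel.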
Key ownership relations  
An environment for framework-based programming  
Frameworks are reusable object-oriented designs for domain-specific programs. In our estimation, frameworks are the key to productivity and reuse. However, frameworks require increased support from the programming environment. A framework-based environment must include design aids and project browsers that can mediate between the user and the framework. A framework-based approach also places new requirements on conventional tools such as compilers. This paper explores the impact of object-oriented frameworks upon a programming environment, in the context of object-oriented finite element and finite difference codes. The role of tools such as design aids and project browsers is discussed, and the impact of a framework-based approach upon compilers is examined. Examples are drawn from our prototype C++-based environment. 1.0 Frameworks Object-oriented scientific programming aims to harness the power of object-oriented design and representation to the task of scientific computing. Th...
Time redistribution between M.O.M. subroutines varying the number of timesteps. Parallelization: the code has been passed through the Power Fortran Accelerator (pfa) to obtain a parallel version of the code.
Weather forecast limited area models, wave models and ocean models commonly run on vector machines or on MPP systems. Recently, shared memory multiprocessor systems with ccNUMA architecture (SMP-ccNUMA) have been shown to deliver very good performance on many applications. It is important to know that SMP-ccNUMA systems perform and scale well even for the above-mentioned models and that relatively little effort is needed to parallelize the codes on these systems, thanks to the availability of OpenMP as the standard shared memory paradigm. This paper deals with the implementation on an SGI Origin 2000 of a weather forecast model (LAMBO -- Limited Area Model Bologna, the NCEP ETA model adapted to the Italian territory), a wave model (WA.M. -- Wave Model, on the Mediterranean Sea and on the Adriatic Sea) and an ocean model (M.O.M. -- Modular Ocean Model, used with data assimilation). These three models were written for vector machines, so the paper describes the technique used to port a vector code to an SMP-ccNUMA architecture. The paper also covers the performance of these models on these systems.
Distributed applications are complex by nature, so it is essential that there be effective software development tools to aid in the construction of these programs. Commonplace "middleware" tools, however, often impose a tradeoff between programmer productivity and application performance. For instance, many CORBA IDL compilers generate code that is too slow for high-performance systems. More importantly, these compilers provide inadequate support for sophisticated patterns of communication. We believe that these problems can be overcome, thus making IDL compilers and similar middleware tools useful for a broader range of systems. To this end we have implemented Flick, a flexible and optimizing IDL compiler, and are using it to produce specialized high-performance code for complex distributed applications. Flick can produce specially "decomposed" stubs that encapsulate different aspects of communication in separate functions, thus providing application programmers with fine-...
In this paper, we present some compiler optimization techniques for explicit parallel programs using the OpenMP API. To enable optimizations across threads, we designed dataflow analysis techniques in which interaction between threads is effectively modeled. Structured description of parallelism and relaxed memory consistency in OpenMP make the analyses effective and efficient. We show algorithms for reaching definitions analysis, memory synchronization analysis, and cross-loop data dependence analysis for parallel loops. Our primary target is a compiler-directed software DSM system where aggressive compiler optimizations for the software-implemented coherence scheme are crucial to obtain good performance. We also show optimizations applicable to general OpenMP implementations, namely redundant barrier removal and privatization of dynamically allocated objects. We consider compiler optimizations to be beneficial for performance portability across various platforms and for non-expert programmers. Experimental results for the coherency optimization in a compiler-directed software DSM system show that aggressive compiler optimizations are quite effective for a shared-write intensive program because the coherence-induced communication volume in such a program is much larger than that for shared-read intensive programs.
Translation of GOTO statements.  
Translation strategies in the f2j project. By default, the FORTRAN program is translated to Java source code, but the -jas switch will direct f2java to generate Jasmin opcode instead. The name of the FORTRAN file is transformed into the class name, with appropriate capitalizations following established Java programming conventions. Thus, the LAPACK driver dgesv.f is translated to The -p option allows the user to specify a package name for the generated source code. For example, f2java -p org.netlib.blas file.f would generate a file named, as usual, with the following package specification: package org.netlib.blas;
The JLAPACK project provides the LAPACK numerical subroutines translated from their subset Fortran 77 source into class files, executable by the Java Virtual Machine (JVM) and suitable for use by Java programmers. This makes it possible for Java applications or applets, distributed on the World Wide Web (WWW) to use established legacy numerical code that was originally written in Fortran. The translation is accomplished using a special purpose Fortran-to-Java (source-to-source) compiler. The LAPACK API will be considerably simplified to take advantage of Java's object-oriented design. This report describes the research issues involved in the JLAPACK project, and its current implementation and status.
Finite element class hierarchy  
The conventional wisdom in the scientific computing community is that the best way to solve large-scale numerically-intensive scientific problems on today's parallel MIMD computers is to use Fortran or C programmed in a data-parallel style using low-level message-passing primitives. This approach inevitably leads to non-portable codes and extensive development time, and restricts parallel programming to the domain of the expert programmer. We believe that these problems are not inherent to parallel computing but are the result of the programming tools used. We will show that comparable performance can be achieved with little effort if better tools that present higher level abstractions are used. The vehicle for our demonstration is a 2D electromagnetic finite element scattering code we have implemented in Mentat, an object-oriented parallel processing system. We briefly describe the application, Mentat, the implementation, and present performance results for both a Mentat and a hand-cod...
Modern dialects of Fortran enjoy wide use and good support on high-performance computers as performance-oriented programming languages. By providing the ability to express nested data parallelism, modern Fortran dialects enable irregular computations to be incorporated into existing applications with minimal rewriting and without sacrificing performance within the regular portions of the application. Since performance of nested data-parallel computation is unpredictable and often poor using current compilers, we investigate threading and flattening, two source-to-source transformation techniques that can improve performance and performance stability. For experimental validation of these techniques, we explore nested data-parallel implementations of the sparse matrix-vector product and the Barnes-Hut n-body algorithm by hand-coding thread-based (using OpenMP directives) and flattening-based versions of these algorithms and evaluating their performance on an SGI Origin 2000 and an NEC SX-...
Our goal is to apply the software engineering advantages of object-oriented programming to the raw power of massively parallel architectures. To do this we have constructed a hierarchy of C++ classes to support the data-parallel paradigm. Feasibility studies and initial coding can be supported by any serial machine that has a C++ compiler. Parallel execution requires an extended Cfront, which understands the data-parallel classes and generates C* code. (C* is a data-parallel superset of ANSI C developed by Thinking Machines Corporation.) This approach provides potential portability across parallel architectures and leverages the existing compiler technology for translating data-parallel programs onto both SIMD and MIMD hardware. 1 Introduction The data-parallel programming model is based upon the simultaneous execution of the same operation across a set of data [9]. Most scientific and engineering problems, and many others, have data-parallel solutions. The model's single locu...
Network computing seeks to utilize the aggregate resources of many networked computers to solve a single problem. In so doing it is often possible to obtain supercomputer performance from an inexpensive local area network. The drawback is that network computing is complicated and error prone when done by hand, especially if the computers have different operating systems and data formats and are thus heterogeneous. HeNCE (Heterogeneous Network Computing Environment) is an integrated graphical environment for creating and running parallel programs over a heterogeneous collection of computers. It is built on a lower level package called PVM. The HeNCE philosophy of parallel programming is to have the programmer graphically specify the parallelism of a computation and to automate, as much as possible, the tasks of writing, compiling, executing, debugging, and tracing the network computation. Key to HeNCE is a graphical language based on directed graphs that describe the parallelism and dat...
Top-cited authors
Ewa Deelman
  • University of Southern California
Gurmeet Singh
Daniel S. Katz
  • University of Illinois, Urbana-Champaign
Yolanda Gil
  • University of Southern California
Mei-Hui Su
  • University of Southern California