
The spectral transform method is a standard numerical technique for solving partial differential equations on a sphere and is widely used in atmospheric circulation models. Recent research has identified several promising algorithms for implementing this method on massively parallel computers; however, no detailed comparison of the different algorithms has previously been attempted. In this paper, we describe these different parallel algorithms and report on computational experiments that we have conducted to evaluate their efficiency on parallel computers. The experiments used a testbed code that solves the nonlinear shallow water equations on a sphere; considerable care was taken to ensure that the experiments provide a fair comparison of the different algorithms and that the results are relevant to global models. We focus on hypercube- and mesh-connected multicomputers with cut-through routing, such as the Intel iPSC/860, DELTA, and Paragon, and the nCUBE/2, but we also indicate how the results extend to other parallel computer architectures. The results of this study are relevant not only to the spectral transform method but also to multidimensional fast Fourier transforms (FFTs) and other parallel transforms.


... For example, Figure 4 shows two variants for computing the 3D DFT. The first algorithm represents the so-called slab-pencil decomposition [19][20][21], where the 3D DFT is decomposed into a 2D DFT followed by a batch of 1D DFTs. The second algorithm represents the pencil-pencil-pencil decomposition [19][20][21], where the 3D DFT is decomposed into three batches of 1D DFTs, one batch applied along each of the three dimensions. ...

... The first algorithm represents the so-called slab-pencil decomposition [19][20][21], where the 3D DFT is decomposed into a 2D DFT followed by a batch of 1D DFTs. The second algorithm represents the pencil-pencil-pencil decomposition [19][20][21], where the 3D DFT is decomposed into three batches of 1D DFTs, one batch applied along each of the three dimensions. The slab-pencil decomposition views the input (output) column vectors x (y) as 2D matrices x̃ (ỹ) of size (n_0 n_1) × n_2. ...
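The two decompositions just described can be made concrete with a small sketch. A naive O(n^2) DFT stands in for an FFT, nested Python lists stand in for distributed arrays, and the tiny problem size is purely illustrative; the point is only that the slab-pencil grouping (a 2D DFT per slab, then pencils) and the pencil-pencil-pencil grouping (three batches of 1D DFTs) yield the same 3D DFT:

```python
import cmath

def dft1(x):
    """Naive O(n^2) 1D DFT; a simple stand-in for an FFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def transpose(m):
    return [list(r) for r in zip(*m)]

def dft2(plane):
    """2D DFT of one slab: 1D DFTs along rows, then along columns."""
    plane = [dft1(row) for row in plane]
    return transpose([dft1(col) for col in transpose(plane)])

def dft_axis0(a):
    """Batch of 1D DFTs along axis 0 (the 'pencils' of a 3D nested list)."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    out = [[[0j] * n2 for _ in range(n1)] for _ in range(n0)]
    for i1 in range(n1):
        for i2 in range(n2):
            col = dft1([a[i0][i1][i2] for i0 in range(n0)])
            for i0 in range(n0):
                out[i0][i1][i2] = col[i0]
    return out

def dft3_slab_pencil(a):
    """Slab-pencil: a 2D DFT on each axis-0 slab, then 1D DFTs along axis 0."""
    return dft_axis0([dft2(plane) for plane in a])

def dft3_pencil(a):
    """Pencil-pencil-pencil: three batches of 1D DFTs, one per axis."""
    a = [[dft1(row) for row in plane] for plane in a]            # axis 2
    a = [transpose([dft1(c) for c in transpose(p)]) for p in a]  # axis 1
    return dft_axis0(a)                                          # axis 0
```

In a distributed setting the difference between the two variants lies not in the arithmetic, which is identical, but in how many data redistributions (transposes) are needed between the batches.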

... Under this assumption, the p processors are connected via a hypercube topology and the communication requires log(p) stages, where each pair of processors exchanges n/(2p) data points. Equation 20 represents the lower bound of the all-to-all collective if a bucket algorithm is used. Given this implementation, the p processors are connected via a ring topology, where each processor has a left and a right neighbor. ...

Multi-dimensional discrete Fourier transforms (DFTs) are typically decomposed into multiple 1D transforms. Hence, parallel implementations of any multi-dimensional DFT focus on parallelizing within or across the 1D DFTs. Existing DFT packages exploit the inherent parallelism across the 1D DFTs and offer rigid frameworks that cannot be extended to incorporate both forms of parallelism and the various data layouts needed to enable some of the parallelism. However, in the era of exascale, where systems have thousands of nodes and intricate network topologies, flexibility and parallel efficiency are key aspects that all multi-dimensional DFT frameworks need in order to map and scale the computation appropriately. In this work, we present a flexible framework, built on the Redistribution Operations and Tensor Expressions (ROTE) framework, that facilitates the development of a family of parallel multi-dimensional DFT algorithms by 1) unifying the two parallelization schemes within a single framework, 2) exploiting the two parallelization schemes to different degrees, and 3) using different data layouts to distribute the data across the compute nodes. We demonstrate the need for a versatile framework, and thus for a family of parallel multi-dimensional DFT algorithms, on the K computer, where we show almost linear strong-scaling results for problem sizes of 1024^3 on 32k compute nodes.

... Communication cost model. All of our analysis makes use of a commonly used [16,31,8,3,14] communication cost model that is as useful as it is simple: each process is assumed to be able to simultaneously send and receive at most one message at a time, and, when the message consists of n units of data (e.g., double-precision floating-point numbers), the time to transmit such a message between any two processes is α + βn [19,2]. The α term represents the time required to send an arbitrarily small message and is commonly referred to as the message latency, whereas 1/β represents the number of units of data that can be transmitted per unit of time once the message has been initiated. ...
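The α + βn model lends itself to quick back-of-the-envelope estimates. The sketch below applies it to a pairwise-exchange all-to-all schedule; the schedule is a textbook one used only for illustration, and the machine constants are assumptions, not measured values:

```python
def msg_time(n, alpha, beta):
    """Modeled time to transmit one message of n data units: alpha + beta*n."""
    return alpha + beta * n

def alltoall_pairwise(n_local, p, alpha, beta):
    """Pairwise-exchange all-to-all: p - 1 steps, each exchanging the
    n_local / p units destined for one partner. A textbook schedule used
    here only to show how the latency and bandwidth terms trade off."""
    return (p - 1) * msg_time(n_local / p, alpha, beta)

# Assumed, purely illustrative machine constants: 1 us latency,
# 1 ns per double of transmission time, 2^20 doubles per process.
alpha, beta, n_local = 1e-6, 1e-9, float(2 ** 20)
t_small = alltoall_pairwise(n_local, 8, alpha, beta)
t_large = alltoall_pairwise(n_local, 1024, alpha, beta)
```

At small p the bandwidth term βn dominates; as p grows, the (p - 1)α latency term takes over, which is exactly the trade-off the cited analyses quantify.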

... While this may seem overly restrictive, a large class of important transforms falls into this category, most notably: the Fourier transform, where Φ(x, y) = 2πx · y, backprojection [13], hyperbolic Radon transforms [20], and Egorov operators, which then provide a means of efficiently applying Fourier Integral Operators [7]. Due to the extremely special (and equally delicate) structure of Fourier transforms, a number of highly-efficient parallel algorithms already exist for both uniform [16,26,12] and non-uniform [27] Fourier transforms, and so we will instead concentrate on more sophisticated kernels. We note that the high-level communication pattern and costs of the parallel 1D FFT mentioned in [16] are closely related to those of our parallel 1D butterfly algorithm. ...

... Due to the extremely special (and equally delicate) structure of Fourier transforms, a number of highly-efficient parallel algorithms already exist for both uniform [16,26,12] and non-uniform [27] Fourier transforms, and so we will instead concentrate on more sophisticated kernels. We note that the high-level communication pattern and costs of the parallel 1D FFT mentioned in [16] are closely related to those of our parallel 1D butterfly algorithm. Algorithm 2.3 was instantiated in the new DistButterfly library using blackbox, user-defined phase functions, and the low-rank approximations and translation operators introduced in [7]. ...

The butterfly algorithm is a fast algorithm which approximately evaluates a
discrete analogue of the integral transform \int K(x,y) g(y) dy at large
numbers of target points when the kernel, K(x,y), is approximately low-rank
when restricted to subdomains satisfying a certain simple geometric condition.
In d dimensions with O(N^d) source and target points, when each appropriate
submatrix of K is approximately rank-r, the running time of the algorithm is at
most O(r^2 N^d log N). A parallelization of the butterfly algorithm is
introduced which, assuming a message latency of \alpha and per-process inverse
bandwidth of \beta, executes in at most O(r^2 N^d/p log N + (\beta r N^d/p +
\alpha)log p) time using p processes. This parallel algorithm was then
instantiated in the form of the open-source DistButterfly library for the
special case where K(x,y)=exp(i \Phi(x,y)), where \Phi(x,y) is a black-box,
sufficiently smooth, real-valued phase function. Experiments on Blue Gene/Q
demonstrate impressive strong-scaling results for important classes of phase
functions. Using quasi-uniform sources, hyperbolic Radon transforms and an
analogue of a 3D generalized Radon transform were observed to strong-scale
from 1 node/16 cores up to 1024 nodes/16,384 cores with greater
than 90% and 82% efficiency, respectively. These experiments at least partially
support the theoretical argument that, given p=O(N^d) processes, the
running-time of the parallel algorithm is O((r^2 + \beta r + \alpha)log N).

... The model is based on the global hydrostatic primitive equations on a sphere and uses the spectral transform method [11,12] in the horizontal directions with a Gaussian grid, and the finite-difference method in the vertical direction with sigma coordinates. It predicts such variables as horizontal winds, temperatures, ground surface pressure, specific humidity, and cloud water. ...

... Many algorithms for parallelizing the spectral transform method have been proposed and examined on various computers [12,13,14]. Here, the Legendre transform (LT), which makes up the core of the spectral transform, is outlined following Foster [12]. ...

A spectral atmospheric general circulation model called AFES (AGCM for the Earth Simulator) was developed and optimized for the architecture of the Earth Simulator (ES). The ES is a massively parallel vector supercomputer consisting of 640 processor nodes interconnected by a single-stage crossbar network, with a total peak performance of 40.96 Tflops. A high sustained performance was achieved for a high-resolution simulation (T1279L96) with AFES by utilizing the full 640-node configuration of the ES. The resulting computing efficiency is 64.9% of the peak performance, well surpassing that of conventional weather/climate applications, which achieve just 25-50% efficiency even on vector parallel computers. This remarkable performance proves the effectiveness of the ES as a viable means for practical applications.
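The quoted figures are mutually consistent, as a quick arithmetic check shows. The per-node numbers below are the published ES configuration (8 vector processors of 8 Gflops peak per node); the sustained value is derived here from the stated 64.9% efficiency, not taken directly from the abstract:

```python
# Published Earth Simulator configuration (assumed here for the check):
nodes = 640
procs_per_node = 8          # vector arithmetic processors per node
gflops_per_proc = 8.0       # peak Gflops per processor

peak_tflops = nodes * procs_per_node * gflops_per_proc / 1000.0   # 40.96 Tflops
sustained_tflops = 0.649 * peak_tflops                            # about 26.6 Tflops
```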

... casting systems (Barros et al. 1995). Previous investigations of these methods have focused on the parallel aspects of either shared memory vector implementations (Gärtel et al. 1995), or on the details of distributed memory implementations on MPPs (Foster and Worley 1994;Dent et al. 1995;Hammond et al. 1995). Perhaps because of the communications requirements of transposition-based message passing implementations, and the poor capabilities of commodity interconnect fabrics until quite recently, little attention has been paid to studying the spherical harmonic transform method on commodity clusters. ...

... In order to compute the Legendre polynomial coefficients on the fly using the recurrence relation (15), the paired longitudinal wavenumbers are then distributed across the processing elements (PEs) in contiguous blocks of wavenumbers. We emphasize that, while most previous work has focused on more highly parallel 2D decompositions (Foster and Worley 1994), such fine-grain decompositions require expensive low-latency, high-bandwidth networks, like the Cray T3E. Our strategy here is different. ...

The practical question of whether the classical spectral transform method, widely used in atmospheric modeling, can be efficiently implemented on inexpensive commodity clusters is addressed. Typically, such clusters have limited cache and memory sizes. To demonstrate that these limitations can be overcome, the authors have built a spherical general circulation model dynamical core, called BOB ("Built on Beowulf"), which can solve either the shallow water equations or the atmospheric primitive equations in pressure coordinates. That BOB is targeted for computing at high resolution on modestly sized and priced commodity clusters is reflected in four areas of its design. First, the associated Legendre polynomials (ALPs) are computed "on the fly" using a stable and accurate recursion relation. Second, an identity is employed that eliminates the storage of the derivatives of the ALPs. Both of these algorithmic choices reduce the memory footprint and memory bandwidth requirements of the spectral transform. Third, a cache-blocked and unrolled Legendre transform achieves a high performance level that resists deterioration as resolution is increased. Finally, the parallel implementation of BOB is transposition-based, employing load-balanced, one-dimensional decompositions in both latitude and wavenumber. A number of standard tests are used to compare BOB's performance to two well-known codes: the Parallel Spectral Transform Shallow Water Model (PSTSWM) and the dynamical core of NCAR's Community Climate Model CCM3. Compared to PSTSWM, BOB shows better timing results, particularly at the higher resolutions where cache effects become important. BOB also shows better performance in its comparison with CCM3's dynamical core. With 16 processors, at a triangular spectral truncation of T85, it is roughly five times faster when computing the solution to the standard Held-Suarez test case, which involves 18 levels in the vertical. BOB also shows a significantly smaller memory footprint in these comparison tests.
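The kind of on-the-fly recursion referred to above can be sketched as follows. This is the standard unnormalized three-term recurrence in degree, with the Condon-Shortley phase; it is an illustration of the technique, not BOB's actual implementation, which uses a normalized variant for better stability at high degree:

```python
import math

def assoc_legendre(l, m, x):
    """Unnormalized associated Legendre function P_l^m(x), computed on the
    fly (no stored tables) via the recurrence
        (l - m) P_l^m = x (2l - 1) P_{l-1}^m - (l + m - 1) P_{l-2}^m.
    Includes the Condon-Shortley phase (-1)^m. Requires 0 <= m <= l."""
    # Seed: P_m^m(x) = (-1)^m (2m - 1)!! (1 - x^2)^{m/2}
    pmm = 1.0
    somx2 = math.sqrt((1.0 - x) * (1.0 + x))
    for i in range(1, m + 1):
        pmm *= -(2 * i - 1) * somx2
    if l == m:
        return pmm
    pmmp1 = x * (2 * m + 1) * pmm          # P_{m+1}^m
    if l == m + 1:
        return pmmp1
    for ll in range(m + 2, l + 1):         # march upward in degree
        pll = (x * (2 * ll - 1) * pmmp1 - (ll + m - 1) * pmm) / (ll - m)
        pmm, pmmp1 = pmmp1, pll
    return pmmp1
```

Because only the two previous degrees are kept, the memory cost is O(1) per (m, x) pair, which is precisely what eliminates the large precomputed ALP tables.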

... The horizontal resolution typically used for climate simulations in the U.S. research community is T85 with 26 vertical levels, which requires a 256 x 128 horizontal grid [Worley and Drake 2005]. The parallel algorithm used for these high resolution studies and benchmarking was given in [Foster and Worley 1997]. The FFT algorithm used was given in [Temperton 1983], a Fortran code specifically designed for vector computation of multiple (blocked) fast Fourier transforms. ...

... A performance model of the parallel spectral transform can be developed to estimate the time for a multi-level calculation. The computational operation counts and communication cost estimates are based on a model in [Foster and Worley 1997] for a one-dimensional decomposition, modified by Rich Loft (NCAR) to reflect a simple transpose between the FFT and Legendre transform phases, including vertical levels. The time for the FFT, the Legendre transform, and the communication overhead are estimated using machine-dependent rate constants a, b, d, and e. ...
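A schematic version of such a cost model might look like the sketch below. The functional forms and the constants a, b, d, e are purely illustrative assumptions, not the values from Foster and Worley (1997) or from Loft's modification:

```python
import math

def step_time(n, levels, p, a, b, d, e):
    """Rough per-timestep estimate for a spectral model at truncation ~n on p
    processes. a: FFT rate, b: Legendre-transform rate, d: per-message latency,
    e: inverse bandwidth. All functional forms here are illustrative guesses."""
    t_fft = a * levels * n * n * math.log2(n) / p      # batched FFTs
    t_lt = b * levels * n ** 3 / p                     # Legendre transform
    # Transpose between the FFT and LT phases: (p - 1) messages per process,
    # each carrying a 1/p^2 share of the level-resolved grid.
    t_comm = (p - 1) * (d + e * levels * n * n / (p * p))
    return t_fft + t_lt + t_comm
```

With the compute terms scaling as 1/p and the latency term growing with p, such a model reproduces the usual pattern of diminishing returns as the process count increases.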

A collection of MATLAB classes for computing and using spherical harmonic transforms is presented. Methods of these classes compute differential operators on the sphere and are used to solve simple partial differential equations in a spherical geometry. The spectral synthesis and analysis algorithms using fast Fourier transforms and Legendre transforms with the associated Legendre functions are presented in detail. A set of methods associated with a spectral field class provides spectral approximation to the differential operators ∇·, ∇×, ∇, and ∇² in spherical geometry. Laplace inversion and Helmholtz equation solvers are also methods of this class. The use of the class and methods in MATLAB is demonstrated by the solution of the barotropic vorticity equation on the sphere. A survey of alternative algorithms is given, and implementations for parallel high performance computers are discussed in the context of global climate and weather models.
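Laplace inversion, one of the methods mentioned, is especially simple in spherical-harmonic space, because each harmonic Y_l^m is an eigenfunction of the Laplacian with eigenvalue -l(l+1)/a². A minimal Python sketch follows; the dict-of-coefficients layout and the function name are illustrative assumptions, not the MATLAB classes' actual API:

```python
def invert_laplacian(spec, radius=1.0):
    """Solve del^2 psi = zeta in spherical-harmonic space: divide each (l, m)
    coefficient of zeta by the Laplacian eigenvalue -l(l+1)/radius^2.
    The l = 0 mode lies in the null space of the Laplacian and is set to 0."""
    out = {}
    for (l, m), c in spec.items():
        out[(l, m)] = 0.0 if l == 0 else -radius ** 2 * c / (l * (l + 1))
    return out
```

Helmholtz solves work the same way, with the eigenvalue shifted by the Helmholtz constant, which is why spectral models favor these operators.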

... In both estimates and simulations, the transposition strategy appears no less efficient on realistic massively parallel computers than the best alternative static domain decomposition based parallelization strategy. The geometric idea of transposition is illustrated in the pictures in (III), and results of comparison benchmarks are reported in Foster and Worley (1994). In that reference, the authors also develop an elaborate communication strategy for domain decomposition that attains the same asymptotic efficiency as the transposition strategy. ...

... The transposition strategy is the parallelization strategy that in the asymptotic limit has the smallest data volume to communicate of all parallelization strategies for any implicit time stepping scheme, for spectral and grid point models alike, as explained in subsection 5.3.1 above. Since the current research, a thorough analysis and a careful parameterized implementation have been made of the two principal families of parallelization strategies for global atmospheric models, transposition versus static domain decomposition, and, while finding various parallel computers on which each of the numerous versions and combinations of the strategies belonging to each family proves to be the most efficient, the authors find the transposition strategy to be a robust choice on virtually all current computers (Foster and Worley (1994); Worley, Foster and Toonen (1994)). Model benchmarking tests initiated in (III) were expanded to a full two-dimensional version of the IFS model in Gärtel et al. (1994-1), see also Gärtel et al. (1994-2), and eventually to the operational version of IFS (Barros et al. (1994)). ...

Dissertation, Lappeenranta University of Technology (Lappeenrannan teknillinen korkeakoulu).

... In this work, we use the transpose algorithm, which has performed better for large sizes in earlier work [8]. Gupta and Kumar [11], and Foster and Worley [9] review both methods. ...

We have implemented fast Fourier transforms for one-, two-, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements (PEs) with fast local memory and equally fast nearest-neighbor interconnections. Our wafer-scale FFT (wsFFT) parallelizes an $n^3$ problem with up to $n^2$ PEs. At that point, a PE processes only a single vector of the 3D domain (known as a pencil) per superstep, where each of the three supersteps performs the FFT along one of the three axes of the input array. Between supersteps, wsFFT redistributes (transposes) the data to bring all elements of each one-dimensional pencil being transformed into the memory of a single PE. Each redistribution causes an all-to-all communication along one of the mesh dimensions. Given the level of parallelism, the size of the messages transmitted between pairs of PEs can be as small as a single word. In theory, a mesh is not ideal for all-to-all communication due to its limited bisection bandwidth. However, the mesh interconnecting PEs on the WSE lies entirely on-wafer and achieves nearly peak bandwidth even with tiny messages. This high efficiency on fine-grain communication allows wsFFT to achieve unprecedented levels of parallelism and performance. We analyse computation and communication time in detail, as well as weak and strong scaling, using both FP16 and FP32 precision. With 32-bit arithmetic on the CS-2, we achieve 959 microseconds for the 3D FFT of a $512^3$ complex input array using a 512x512 subgrid of the on-wafer PEs. This is the largest-ever parallelization for this problem size and the first implementation that breaks the millisecond barrier.

... The inverse Fourier transform can easily be computed using the forward FFT engine by adding a 1/N scaling factor and conjugating the imaginary part. At a global level, with a 2D data domain decomposition, the X transform can proceed independently on each processing node, because data along the X dimension of the grid resides entirely in the local host memory and each node has its own assigned portion of the array. When data is non-local, that is, divided across processor boundaries, the most efficient approach ([57], [18]) is to reorganize the data array by a global transposition. This is called the transpose method, in opposition to the distributed method, where the 1D transform is performed in parallel with data exchange occurring at ...

In the field of High Performance Computing, communications among processes represent a typical bottleneck for massively parallel scientific applications. The object of this research is the development of a network interface card with specific offloading capabilities that could help large-scale simulations in terms of communication latency and scalability with the number of computing elements. In particular, this work deals with the development of a double-precision floating-point complex arithmetic unit with a parallel-pipelined architecture, in order to implement a massively parallel computing system tailored for the three-dimensional Fast Fourier Transform.

... This can be done in parallel using sequential 1D FFTs. See [7] for alternative parallel methods such as the binary exchange method and a comparison between these methods. ...

We present a parallel algorithm for the fast Fourier transform (FFT) in higher dimensions. This algorithm generalizes the cyclic-to-cyclic one-dimensional parallel algorithm to a cyclic-to-cyclic multidimensional parallel algorithm while retaining the property of needing only a single all-to-all communication step. This holds under the constraint that we use at most $\sqrt{N}$ processors for an FFT on an array with a total of $N$ elements, irrespective of the dimension $d$ or the shape of the array. The only assumption we make is that $N$ is sufficiently composite. Our algorithm starts and ends in the same distribution. We present our multidimensional implementation FFTU, which utilizes the sequential FFTW program for its local FFTs and can handle any dimension $d$. We obtain experimental results for $d\leq 5$ using MPI on up to 4096 cores of the supercomputer Snellius, comparing FFTU with the parallel FFTW program and with PFFT. These results show that FFTU is competitive with the state of the art and that it allows the use of a larger number of processors, while keeping communication limited to a single all-to-all operation. For arrays of size $1024^3$ and $64^5$, FFTU achieves speedups of a factor of 149 and 176, respectively, on 4096 processors.

... Handling the collective communications underlying distributed-memory FFT computation can be achieved using different approaches (refer to [77,78] for more information). The most effective strategy, already in use in many high performance FFT libraries, is the so-called "transpose transform". ...

The complexity of the physical mechanisms involved in ultra-high intensity laser-plasma interaction requires the use of particularly heavy PIC simulations. At the heart of these computational codes, high-order pseudo-spectral Maxwell solvers have many advantages in terms of numerical accuracy. This numerical approach comes, however, with an expensive computational cost. Indeed, existing parallelization methods for pseudo-spectral solvers are only scalable to a few tens of thousands of cores, or induce an important memory footprint, which also hinders the scaling of the method at large scales. In this thesis, we developed a novel, arbitrarily scalable parallelization strategy for pseudo-spectral Maxwell's equations solvers which combines the advantages of existing parallelization techniques. This method proved to be more scalable than previously proposed approaches, while ensuring a significant drop in the total memory use. By capitalizing on this computational work, we conducted an extensive numerical and theoretical study in the field of high order harmonic generation on solid targets. In this context, when an ultra-intense (I>10¹⁶W.cm⁻²) ultra-short (a few tens of femtoseconds) laser pulse irradiates a solid target, a reflective overdense plasma mirror is formed at the target-vacuum interface. The subsequent nonlinear reflection of the laser pulse is accompanied by the emission of coherent high order laser harmonics, in the form of attosecond X-UV light pulses (1 attosecond = 10⁻¹⁸ s). For relativistic laser intensities (I>10¹⁹ W.cm⁻²), the plasma surface is curved under the laser radiation pressure, and the plasma mirror acts as a focusing optic for the radiated harmonic beam. In this thesis, we investigated feasible ways of producing isolated attosecond light pulses from relativistic plasma-mirror harmonics with the so-called attosecond lighthouse effect.
This effect relies on introducing a wavefront rotation on the driving laser pulse in order to send attosecond pulses emitted during different laser optical cycles along different directions. In the case of high order harmonics generated in the relativistic regime, the plasma mirror curvature significantly increases the attosecond pulse divergence and prevents their separation with the attosecond lighthouse scheme. To address this, we developed two harmonic divergence reduction techniques, based on tailoring the laser pulse phase or amplitude profiles in order to significantly inhibit the plasma mirror focusing effect and allow for a clear separation of attosecond light pulses by reducing the harmonic beam divergence. Furthermore, we developed an analytical model to predict optimal interaction conditions favoring attosecond pulse separation. This model was fully validated with 2D and 3D PIC simulations over a broad range of laser and plasma parameters. In the end, we show that under realistic laser and plasma conditions, it is possible to produce isolated attosecond pulses from Doppler harmonics.

... Several approaches have been considered to efficiently parallelise spectral transforms between physical and spectral space (e.g. Foster & Worley 1997). ...

We present a new pseudo-spectral open-source code nicknamed pizza. It is dedicated to the study of rapidly-rotating Boussinesq convection under the 2-D spherical quasi-geostrophic approximation, a physical hypothesis that is appropriate for modelling the turbulent convection that develops in planetary interiors. The code uses a Fourier decomposition in the azimuthal direction and supports both a Chebyshev collocation method and a sparse Chebyshev integration formulation in the cylindrically-radial direction. It supports several temporal discretisation schemes encompassing multi-step time steppers as well as diagonally-implicit Runge-Kutta schemes. The code has been tested and validated by comparing weakly-nonlinear convection with the eigenmodes from a linear solver. The comparison of the two radial discretisation schemes has revealed the superiority of the Chebyshev integration method over the classical collocation approach both in terms of memory requirements and operation counts. The good parallelisation efficiency enables the computation of large problem sizes with $\mathcal{O}(10^4\times 10^4)$ grid points using several thousand ranks. This allows the computation of numerical models in the turbulent regime of quasi-geostrophic convection, characterised by large Reynolds numbers $Re$ and yet small Rossby numbers $Ro$. A preliminary result obtained for a strongly supercritical numerical model with a small Ekman number of $10^{-9}$ and a Prandtl number of unity yields $Re\simeq 10^5$ and $Ro \simeq 10^{-4}$. pizza is hence an efficient tool for studying spherical quasi-geostrophic convection in a parameter regime inaccessible to current global 3-D spherical shell models.

... An introduction and theoretical comparison can be found in [8]. In this paper, we restrict ourselves to transpose algorithms that need much less data to be exchanged [9] and have direct support in many software libraries, e.g. FFTW [5]. ...

The 3D fast Fourier transform (FFT) is at the heart of many simulation methods. Although efficient parallelisation of the FFT has been studied deeply over the last few decades, many researchers have focused only on either pure message passing (MPI) or shared memory (OpenMP) implementations. Unfortunately, pure MPI approaches cannot exploit the shared memory within a cluster node, and OpenMP cannot scale over multiple nodes.
This paper proposes a 2D hybrid decomposition of the 3D FFT where the domain is decomposed over the first axis by means of MPI and over the second axis by means of OpenMP. The performance of the proposed method is thoroughly compared with the state-of-the-art libraries (FFTW, PFFT, P3DFFT) on three supercomputer systems with up to 16k cores. The experimental results show that the hybrid implementation offers 10-20% higher performance and better scaling, especially for high core counts.
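The row/column split of such a hybrid decomposition amounts to computing contiguous index blocks per rank and per thread. The small illustration below is pure Python; the block-partition helper and the 8-rank by 4-thread layout are assumptions chosen for the example, not the paper's configuration:

```python
def block_range(n, parts, rank):
    """Contiguous, load-balanced block of [0, n) owned by `rank` out of
    `parts`: the first n % parts blocks receive one extra element."""
    base, extra = divmod(n, parts)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

# A 512^3 domain: axis 0 split across 8 MPI ranks, axis 1 across 4 OpenMP
# threads per rank; axis 2 stays local, so 1D FFTs along it need no
# communication.
n = 512
layout = {(r, t): (block_range(n, 8, r), block_range(n, 4, t))
          for r in range(8) for t in range(4)}
```

Keeping one axis fully local is what lets the first batch of 1D FFTs run without any message passing; the other two axes are brought local in turn by transposes.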

... The CORAL set of benchmark codes [18], intended for HPC vendors, includes a set of "Skeleton Benchmarks," but this term is used to refer to benchmarks that each focus on a specific platform characteristic, unlike our application skeletons. Additional examples are from Kerbyson et al. [19], who used simplified versions of parallel MPI applications to study Blue Gene systems, and Worley et al. [20][21], who studied a parallel spectral transform shallow water model by implementing the real spectral transform in what is otherwise a synthetic code that replicates a range of different communication structures as found in different parallelizations of climate models. Similarly, Prophesy [22] is an infrastructure that helps in performance modeling of applications on parallel and distributed systems through a relational database that allows for the recording of performance data, system features, and application details. ...

Computer scientists who work on tools and systems to support eScience applications (a variety of parallel and distributed applications) usually use actual applications to prove that their systems will benefit science and engineering (e.g., improve application performance). Accessing and building the applications and necessary data sets can be difficult because of policy or technical issues, and it can be difficult to modify the characteristics of the applications to understand corner cases in the system design. In this paper, we present the Application Skeleton, a simple yet powerful tool to build synthetic applications that represent real applications, with runtime and I/O close to those of the real applications. This allows computer scientists to focus on the system they are building; they can work with the simpler skeleton applications and be sure that their work will also be applicable to the real applications. In addition, skeleton applications support simple, reproducible system experiments, since they are represented by a compact set of parameters.

... The spectral code is parallelised using a so-called 2-D decomposition (Foster and Worley, 1997; Kanamitsu et al., 2005). In a 2-D decomposition, two of the three dimensions are divided across the processors, so the processors form rows and columns, with the columns divided across one dimension and the rows across the other. ...

The IGCM4 (Intermediate Global Circulation Model version 4) is a global
spectral primitive equation climate model whose predecessors have been used
extensively in areas such as climate research, process modelling and
atmospheric dynamics. The IGCM4's niche and utility lie in its speed and
flexibility allied with the complexity of a primitive equation climate model.
Moist processes such as clouds, evaporation, atmospheric radiation and soil
moisture are simulated in the model, though in a simplified manner compared
to state-of-the-art global circulation models (GCMs). IGCM4 is a parallelised model, enabling
both very long integrations to be conducted and the effects of higher
resolutions to be explored. It has also undergone changes such as alterations
to the cloud and surface processes and the addition of gravity wave drag.
These changes have resulted in a significant improvement to the IGCM's
representation of the mean climate as well as its representation of
stratospheric processes such as sudden stratospheric warmings. The IGCM4's
physical changes and climatology are described in this paper.

... The model is similar to, though not as detailed as, the ones in Ayala and Wang [2013], Kerbyson and Barker [2011], and Kerbyson et al. [2013]. Since different FFT algorithms are used at runtime on each machine, since the slab decomposition is used for small numbers of processes and the pencil decomposition for more than 512 processes, and since there are different algorithms for performing an all-to-all exchange, the best of which depends on the size of the problem being solved, the computer being used, and the number and location of the processors being used on that computer (see Foster and Worley [1997]), a more detailed model is not developed. For small p and fixed N, the runtime decreases close to linearly, and once p is large enough the runtime starts to increase again due to communication costs. ...
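That qualitative shape, runtime falling with p and then rising once communication dominates, can be reproduced with a toy model. Every constant below is an assumption chosen only to make the trade-off visible, not a fitted value:

```python
import math

def model_runtime(n, p, c_comp=1e-9, alpha=1e-5):
    """Toy runtime for an n^3 FFT-based solver on p processes: compute scales
    as n^3 log2(n) / p, while all-to-all latency grows roughly with p."""
    return c_comp * n ** 3 * math.log2(n) / p + alpha * (p - 1)

times = {p: model_runtime(512, p) for p in (1, 16, 256, 4096, 65536)}
```

In this model, the minimum sits near p where the two terms balance; beyond it, adding processes only adds latency, which is the behavior the passage describes.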

The cubic Klein-Gordon equation is a simple but non-trivial partial
differential equation whose numerical solution has the main building blocks
required for the solution of many other partial differential equations. In this
study, the library 2DECOMP&FFT is used in a Fourier spectral scheme to solve
the Klein-Gordon equation and strong scaling of the code is examined on
thirteen different machines for a problem size of 512^3. The results are useful
in assessing likely performance of other parallel fast Fourier transform based
programs for solving partial differential equations. The problem is chosen to
be large enough to solve on a workstation, yet also of interest to solve
quickly on a supercomputer, in particular for parametric studies. Unlike other
high performance computing benchmarks, for this problem size, the time to
solution will not be improved by simply building a bigger supercomputer.
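
As a sketch of the main building blocks such a Fourier spectral scheme uses, here is one explicit leapfrog step for a 1-D cubic Klein-Gordon equation, u_tt = u_xx - u + u^3, evaluated pseudo-spectrally. The 1-D reduction, the sign convention of the nonlinear term, and all parameters are illustrative; the paper's code is 3-D and built on 2DECOMP&FFT.

```python
import numpy as np

def kg_step(u, u_old, dt, k):
    """One leapfrog step of u_tt = u_xx - u + u^3: the derivative is taken
    in Fourier space, the nonlinear term pointwise in physical space."""
    u_xx = np.fft.ifft(-(k**2) * np.fft.fft(u)).real
    return 2*u - u_old + dt**2 * (u_xx - u + u**3)

n, L = 256, 2*np.pi
x = np.linspace(0, L, n, endpoint=False)
k = 2*np.pi*np.fft.fftfreq(n, d=L/n)        # angular wavenumbers
u0 = 0.1*np.cos(x)
u1 = kg_step(u0, u0, 1e-3, k)               # first step, zero initial velocity
```

In a distributed 3-D implementation the FFT calls would be replaced by the library's transpose-based parallel transforms, but the time-stepping logic is unchanged.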

... FFTW [15]. In addition, several FFT algorithms have been proposed for distributed machines (for example see [14,36,37]). For computing 3D FFTs, the key challenge is in dividing the data across the processes. ...
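
The serial analogue of dividing the data is a 3-D FFT computed as three batches of 1-D FFTs; in the distributed case, an all-to-all transpose precedes each sweep so that the transform axis becomes local to each process. A single-process sketch:

```python
import numpy as np

def fft3_pencil(a):
    """3-D FFT as three batches of 1-D FFTs (pencil-pencil-pencil order).
    On a distributed machine, a transpose would precede the second and
    third sweeps so that the transform axis is local."""
    a = np.fft.fft(a, axis=0)
    a = np.fft.fft(a, axis=1)
    a = np.fft.fft(a, axis=2)
    return a

x = np.random.default_rng(0).standard_normal((8, 8, 8))
y = fft3_pencil(x)
```

The axis-by-axis result agrees with the library's direct 3-D transform, which is what makes the decomposition a pure data-layout question.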

We discuss the fast solution of the Poisson problem on a unit cube. We
benchmark the performance of the most scalable methods for the Poisson problem:
the Fast Fourier Transform (FFT), the Fast Multipole Method (FMM), the
geometric multigrid (GMG) and algebraic multigrid (AMG). The GMG and FMM are
novel parallel schemes using high-order approximation for Poisson problems
developed in our group. The FFT code is from P3DFFT library and AMG code from
ML Trilinos library. We examine and report results for weak scaling, strong
scaling, and time to solution for uniform and highly refined grids. We present
results on the Stampede system at the Texas Advanced Computing Center and on
the Titan system at the Oak Ridge National Laboratory. In our largest test
case, we solved a problem with 600 billion unknowns on 229,379 cores of Titan.
Overall, all methods scale quite well to these problem sizes. We have tested
all of the methods with different source distributions. Our results show that
FFT is the method of choice for smooth source functions that can be resolved
with a uniform mesh. However, it loses its performance in the presence of
highly localized features in the source function. FMM and GMG considerably
outperform FFT for those cases.
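
For periodic boundary conditions, the FFT approach benchmarked above reduces to dividing Fourier coefficients by |k|^2. A minimal single-process sketch (the grid size and manufactured solution are illustrative; the P3DFFT code used in the paper is distributed, not serial):

```python
import numpy as np

def poisson_fft(f, L=1.0):
    """Solve -laplacian(u) = f with periodic BCs on [0, L)^3 by dividing
    Fourier coefficients by |k|^2; the zero mode is set to 0 (zero-mean u)."""
    n = f.shape[0]
    k = 2*np.pi*np.fft.fftfreq(n, d=L/n)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0                     # avoid division by zero at the mean mode
    u_hat = np.fft.fftn(f) / k2
    u_hat[0, 0, 0] = 0.0
    return np.fft.ifftn(u_hat).real

n = 32
x = np.linspace(0, 1, n, endpoint=False)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
u_exact = np.sin(2*np.pi*X) * np.sin(2*np.pi*Y)   # zero-mean manufactured solution
f = 2 * (2*np.pi)**2 * u_exact                    # f = -laplacian(u_exact)
u = poisson_fft(f)
```

For a smooth right-hand side resolved by the grid, the spectral solve is exact to rounding error, which is why FFT wins for smooth sources.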

... The original model is based on the three-dimensional global hydrostatic primitive equations. The spectral transform method [12] is applied to discretize in the horizontal direction and a finite-difference method in the vertical direction with the use of sigma coordinates. AFES predicts variables such as horizontal winds, temperatures, ground-level pressure, specific humidity, and cloud water at grid points covering the entire globe. ...

Two major developments in the infrastructure of computational science and engineering research in Japan are reviewed. Both of these developments, resulting from the recent construction of a high-speed backbone network and a huge vector-parallel computer, will surely change the landscape of computational science and engineering research. The first is the ITBL (Information-Technology Based Laboratory) project, in which R&D is conducted to realize a virtual research environment over the network. Here, basic software tools for distributed environments have been developed to solve science and engineering problems. The second is the Earth Simulator project. In this project, a huge SMP-cluster vector-parallel system was developed, which will undoubtedly have a great impact on numerical simulations in areas such as climate modeling. Furthermore, activities in large-scale numerical simulations, which are carried out in various application fields and have a potential for further integration of the above systems, are presented.

... Performance models for a specific given application domain, which present performance bounds for implicit CFD codes, have also been considered [15]. The efficiency of the spectral transform method on parallel computers has been evaluated by Foster [9]. Kerbyson et al. provide an analytical model for the application SAGE [17]. ...

Exascale systems are predicted to have approximately one billion cores,
assuming Gigahertz cores. Limitations on affordable network topologies for
distributed memory systems of such massive scale bring new challenges to the
current parallel programming model. Currently, there are many efforts to
evaluate the hardware and software bottlenecks of exascale designs. There is
therefore an urgent need to model application performance and to understand
what changes need to be made to ensure extrapolated scalability. The fast
multipole method (FMM) was originally developed for accelerating N-body
problems in astrophysics and molecular dynamics, but has recently been extended
to a wider range of problems, including preconditioners for sparse linear
solvers. Its high arithmetic intensity, combined with its linear complexity and
asynchronous communication patterns, makes it a promising algorithm for exascale
systems. In this paper, we discuss the challenges for FMM on current parallel
computers and future exascale architectures, with a focus on inter-node
communication. We develop a performance model that considers the communication
patterns of the FMM, and observe a good match between our model and the actual
communication time, when latency, bandwidth, network topology, and multi-core
penalties are all taken into account. To our knowledge, this is the first
formal characterization of inter-node communication in FMM, which validates the
model against actual measurements of communication time.
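
The latency and bandwidth terms such a model rests on can be written down in a few lines with the standard postal (alpha-beta) model; the constants and the binomial-tree collective below are generic placeholders, not the FMM-specific model of the paper.

```python
import math

def msg_time(nbytes, alpha=1e-6, beta=1e-10):
    """Postal model for one point-to-point message: latency + size/bandwidth.
    The alpha and beta values are illustrative, not measurements."""
    return alpha + nbytes * beta

def tree_broadcast_time(p, nbytes, alpha=1e-6, beta=1e-10):
    """Binomial-tree broadcast: ceil(log2 p) rounds of one message each."""
    return math.ceil(math.log2(p)) * msg_time(nbytes, alpha, beta)

t = tree_broadcast_time(1024, 1 << 20)   # broadcast 1 MiB to 1024 processes
```

A full model would add topology and multi-core contention penalties on top of these two terms, which is precisely where the paper refines the estimate.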

... As mentioned earlier, in a spectral model a full dimension in one of the two horizontal directions (East-West [X] and North-South [Y] directions in Fig. 1) is needed for the FFT computation. A similar method has been introduced in several previous studies (Oikawa, 2001; Juang et al., 2001; Foster and Worley, 1997; Barros et al., 1995; Skalin and Biorge, 1997). First, rows along the X and Y directions in the spectral space are divided depending on the number of working processors while the vertical column has a full dimension (upper left corner cube in Fig. 1). ...

... In the horizontal direction we use a one-dimensional decomposition. This is in contrast to earlier work on two-dimensional decompositions by Foster and Worley (1994). Although the 2D decomposition allows for a finer-grained decomposition, it also results in considerably more network traffic, requiring expensive low-latency, high-bandwidth networks. ...

... In the first part of this section, we discussed machine performance without reference to the algorithms being used on different problem size/machine type/machine size configurations. Yet there is considerable variability in the performance of different algorithms (Foster and Worley 1994; Worley and Foster 1994), and average performance would have been considerably worse if we had restricted ourselves to a single algorithm. Factors that can affect performance include the choice of FFT algorithm, LT algorithm, aspect ratio, the protocols used for data transfer, and memory requirements. ...

Massively parallel processing (MPP) computer systems use high-speed interconnection networks to link hundreds or thousands of RISC microprocessors. With each microprocessor having a peak performance of 100 Mflops/sec or more, there is at least the possibility of achieving very high performance. However, the question of exactly how to achieve this performance remains unanswered. MPP systems and vector multiprocessors require very different coding styles. Different MPP systems have widely varying architectures and performance characteristics. For most problems, a range of different parallel algorithms is possible, again with varying performance characteristics. In this paper, we provide a detailed, fair evaluation of MPP performance for a weather and climate modeling application. Using a specially designed spectral transform code, we study performance on three different MPP systems: Intel Paragon, IBM SP2, and Cray T3D. We take great care to control for performance differences due to var...

... [31] The transform between physical and Fourier/Chebyshev space is a global operation which requires communication among the processors. Different approaches have been proposed for the parallelization of such transforms [Foster and Worley, 1997]. Here, we use a transpose-based algorithm. ...

Numerical simulations of the process of convection and magnetic field generation in planetary cores still fail to reach geophysically realistic control parameter values. Future progress in this field depends crucially on efficient numerical algorithms which are able to take advantage of the newest generation of parallel computers. Desirable features of simulation algorithms include (1) spectral accuracy, (2) an operation count per time step that is small and roughly proportional to the number of grid points, (3) memory requirements that scale linear with resolution, (4) an implicit treatment of all linear terms including the Coriolis force, (5) the ability to treat all kinds of common boundary conditions, and (6) reasonable efficiency on massively parallel machines with tens of thousands of processors. So far, algorithms for fully self-consistent dynamo simulations in spherical shells do not achieve all these criteria simultaneously, resulting in strong restrictions on the possible resolutions. In this paper, we demonstrate that local dynamo models in which the process of convection and magnetic field generation is only simulated for a small part of a planetary core in Cartesian geometry can achieve the above goal. We propose an algorithm that fulfills the first five of the above criteria and demonstrate that a model implementation of our method on an IBM Blue Gene/L system scales impressively well for up to O(10^4) processors. This allows for numerical simulations at rather extreme parameter values.

... This transpose method is named 2-Dimensional decomposition, because one of the dimensions is fixed but the other two are distributed. It has been studied by many authors (e.g., Foster and Worley 1997;Barros et al. 1995;Skalin and Bjorge 1997), and has been widely used in many global spectral models. Since the RSM code structure is very similar to the GSM which uses the transpose method for parallelization, the same method was adopted for RSM parallelization. ...

... [Table: Example PSTSWM Input Files — a) Problem, b) Algorithm, c) Measurements] ... sums of spectral harmonics and inverse real FFTs. Each of these steps is presented in mathematical detail in [34]. ...

Thesis (Ph.D.)--Illinois Institute of Technology, 2005. Includes bibliographical references (leaves 239-247).

... What distinguishes our work from prior research is our framework for providing useful, accurate performance modeling and performance understanding that is tractable for a wide variety of machines and applications. Previous work either developed very detailed models for performance [4-8], concentrated on tool development [9, 10], was very specific to a given application domain [11-13], or focused on integrating compilation with scalability analysis [14]. Additionally, previous work by Worley [15] evaluated specific machines via benchmarking. ...

This paper presents a performance modeling methodology that is faster than traditional cycle-accurate simulation, more sophisticated than performance estimation based on system peak-performance metrics, and is shown to be effective on a class of High Performance Computing PETSc kernels. The method yields insight into the factors that affect performance and scalability on parallel computers.

... The 2-D decomposition data transposition strategy utilizes 2-D model data structures, meaning a single dimension of the data structure in memory for each processor. This method has been applied successfully in several parallel atmospheric models (Foster and Worley 1997; Barros et al. 1995; Skalin and Bjorge 1997), including the Global Spectral Model. The 2-D decomposition can be used with up to the product of the two smallest dimensions across all directions and all spaces, except with any prime number of processors (Juang and Kanamitsu 2001). ...

The Regional Spectral Model (RSM) is a nested primitive equation spectral model used by U.S. operational centers and international research communities to perform weather forecasts and climate prediction on a regional scale. In this paper, we present the development of an efficient parallel RSM with a message passing paradigm. Our model employs robust and efficient 1-D and 2-D decomposition strategies and incorporates promising parallel algorithms to deal with complicated perturbation architecture and ensure portability using hybrid MPI and openMP. We also achieve bit reproducibility when our parallel RSM is compared to the sequential code. Performance tests were performed on an IBM SP, Compaq, NEC SX-6 and the Earth Simulator and our results show good scalability at over a thousand processors.

... These numerical difficulties usually lead to the use of the spectral method as in [2]. But the global transposition of data on a network of processors needed in the spectral method makes them difficult to parallelize [5]. ...

We present a nonoverlapping domain decomposition method with local Fourier basis applied to a model problem in liquid flames. The introduction of domain decomposition techniques in this paper is for numerical and parallel efficiency purposes when one requires a large number of grid points to catch complex structures. We obtain then a high-order accurate domain decomposition method that allows us to generalize our previous work on the use of local Fourier basis to solve combustion problems with nonperiodic boundary conditions (M. Garbey and D. Tromeur-Dervout, J. Comput. Phys. 145, 316 (1998)). Local Fourier basis methodology fully uses the superposition principle to split the sought solution into a numerically computed part and an analytically computed part. Our present methodology generalizes the Israeli et al. (1993, J. Sci. Comput. 8, 135) method, which applies domain decomposition with local Fourier basis to the Helmholtz problem. In the present work, several new difficulties occur. First, the problem is unsteady and nonlinear, which makes the periodic extension delicate to construct in terms of stability and accuracy. Second, we use a streamfunction biharmonic formulation of the incompressible Navier–Stokes equation in two space dimensions: The application of domain decomposition with local Fourier basis to a fourth-order operator is more difficult to achieve than for a second-order operator. A systematic investigation of the influence of the method's parameters on the accuracy is done. A detailed parallel MIMD implementation is given. We give an a priori estimate that allows the relaxation of the communication between processors for the interface problem treatment. Results on nonquasi-planar complex frontal polymerization illustrate the capability of the method.

... In this method, one dimension is kept fixed while the other two dimensions are decomposed, and the spectral method can be applied in parallel in the fixed dimension. After this, the system is transposed before applying the algorithm in another direction (13). The Vlasov solver uses periodic boundary conditions in configuration space, where a pseudo-spectral method is employed to calculate derivatives accurately. ...

We present a parallelized algorithm for solving the time-dependent Vlasov–Maxwell system of equations in the four-dimensional phase space (two spatial and velocity dimensions). One Vlasov equation is solved for each particle species, from which charge and current densities are calculated for the Maxwell equations. The parallelization is divided into two different layers. For the first layer, each plasma species is given its own processor group. On the second layer, the distribution function is domain decomposed on its dedicated resources. By separating the communication and calculation steps, we have met the design criteria of good speedup and simplicity in the implementation.

... The slave tasks contain four distinct phases: data receiving, CPU-bound computing, I/O-bound computing and data sending (to the master). The second one, namely B, is the PSTSWM (Parallel Spectral Transform Shallow Water Model) [16]. The PSTSWM is a message-passing benchmark code that solves the nonlinear shallow water equations on a rotating sphere using the spectral transform method. ...

A new approach is presented for acquiring knowledge of parallel applications' resource usage and for searching for similarity in workload traces. The main goal is to improve decision making in distributed-system scheduling software, towards better usage of system resources. Resource usage patterns are defined through runtime measurements and a self-organizing neural network architecture, yielding a useful model for classifying parallel applications. By means of an instance-based algorithm, a second model is produced that searches for similarity in workload traces, aiming to predict attributes of a newly submitted parallel application such as run time or memory usage. These models allow effortless knowledge updating at the occurrence of new information. The paper describes these models as well as the results obtained applying them to both synthetic and real application traces.

... Instead, a dynamics-physics coupler (dp_coupling) would be used to move data between data structures representing the dynamics state and the physics state. In previous work (Drake et al., 1995, 1999; Foster et al., 1996; Foster and Worley, 1997), significant effort has been expended to determine data structures and domain decompositions that work well with both the dynamics and the physics, in order to minimize memory requirements, to avoid the cost of buffer copies, and/or to avoid the cost of interprocess communication when execution moves between the dynamics and the physics during each time step of the algorithm. With the decision to decouple physics and dynamics data structures, a global design was no longer necessarily advantageous. ...

Community models for global climate research, such as the Community Atmospheric Model, must perform well on a variety of computing systems. Supporting diverse research interests, these computationally demanding models must be efficient for a range of problem sizes and processor counts. In this paper we describe the data structures and associated infrastructure developed for the physical parameterizations that allow the Community Atmospheric Model to be tuned for vector or non-vector systems, to provide load balancing while minimizing communication overhead, and to exploit the optimal mix of distributed Message Passing Interface (MPI) processes and shared OpenMP threads.

... In the distributed memory architecture, the remapping procedure implements a global all-to-all exchange of data blocks with size (Nx, Ny/P, Nz/P), where Nx (Ny, Nz) is the number of grid points along the ix (iy, iz) direction, and P is the total number of distributed processors. The global exchange performs essentially a pairwise block exchange where the local 3-D arrays on each processor are viewed as 1-D array of blocks (Bokhari 1991;Foster and Worley 1997). This exchange involves all-toall communication. ...
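
A single-process simulation of that pairwise block exchange, with each local array split into P blocks along one axis (a 2-D toy analogue of the (Nx, Ny/P, Nz/P) blocks described above; the function and variable names are illustrative):

```python
import numpy as np

def alltoall_blocks(locals_):
    """Simulate the all-to-all behind a distributed remap: each local array
    is split into P blocks along axis 0, and block p of process q ends up
    as block q of process p."""
    P = len(locals_)
    blocks = [np.array_split(a, P, axis=0) for a in locals_]
    return [np.concatenate([blocks[q][p] for q in range(P)], axis=0)
            for p in range(P)]

# 4 "processes", each holding an (8, 4) slab of a 32 x 4 global array.
P = 4
global_a = np.arange(32 * 4).reshape(32, 4)
locals_ = list(np.array_split(global_a, P, axis=0))
exchanged = alltoall_blocks(locals_)
```

Applying the exchange twice returns the original distribution, since the pairwise block permutation is its own inverse.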

Century-long global climate simulations at high resolutions generate large amounts of data on a parallel architecture. Currently, the community atmosphere model (CAM), the atmospheric component of the NCAR community climate system model (CCSM), uses sequential I/O, which causes a serious bottleneck for these simulations. We describe the parallel I/O development of CAM in this paper. The parallel I/O combines a novel remapping of 3-D arrays with the parallel netCDF library as the I/O interface. Because of the parallel decomposition, CAM history variables are stored in the disk file in a different index order than the one in CPU-resident memory, so an index reshuffle is done on the fly. Our strategy is first to remap 3-D arrays from their native decomposition to z-decomposition on a distributed architecture, and from there write the data out to disk. Because the z-decomposition is consistent with the last array dimension, the data transfer can occur at maximum block sizes and therefore achieve maximum I/O bandwidth. We also incorporate the recently developed parallel netCDF library from Argonne/Northwestern as the collective I/O interface, which resolves a long-standing issue because the netCDF data format is extensively used in climate system models. Benchmark tests are performed on several platforms using different resolutions. We test the performance of our new parallel I/O on five platforms (SP3, SP4, SP5, Cray X1E, BlueGene/L) with up to 1024 processors. More than four realistic model resolutions are examined, e.g. EUL T85 (~1.4°), FV-B (2° × 2.5°), FV-C (1° × 1.25°), and FV-D (0.5° × 0.625°). For a standard single history output of a CAM 3.1 FV-D resolution run (multiple 2-D and 3-D arrays with a total size of 4.1 GB), our parallel I/O speeds up by a factor of 14 on the IBM SP3 compared with the existing I/O; on the IBM SP5, we achieve a factor of 9 speedup.
The estimated time for a typical century-long simulation at FV D-resolution on the IBM SP5 shows that the I/O time can be reduced from more than 8 days (wall clock) to less than 1 day for daily output. This parallel I/O is also implemented on the IBM BlueGene/L and the results are shown, whereas the existing sequential I/O fails due to a memory usage limitation.

Numerical weather prediction (NWP) is one of the first applications of scientific computing and remains an insatiable consumer of high-performance computing today. In the face of a half-century’s exponential and sometimes disruptive growth in HPC capability, major weather services around the world continuously develop and incorporate new meteorological research into large expensive operational forecasting software suites. At the heart is the weather model itself: a computational fluid dynamics core on a spherical domain with physics. The mapping of a planar grid to the geometry of the earth’s atmosphere presents one of a number of challenges for stability, accuracy, and computational efficiency that lead to development of the major numerical formulations and grid systems in use today: grid point, globally spectral, and finite/spectral element. Significant challenges await in the exascale era: large memory working sets, overall low computational intensity, load imbalance, and a fundamental lack of weak scalability in the face of critical real-time forecasting speed requirements. This chapter provides a history of weather and climate modeling on high-performance computing systems, a discussion of each of the major types of model dynamics formulations, grids, and model physics, and directions going forward on emerging HPC architectures.

The IGCM4 (Intermediate Global Circulation Model version 4) is a global
spectral primitive equation climate model whose predecessors have been used
extensively in fields such as climate dynamics, process modelling, and
atmospheric dynamics. The IGCM4's niche and utility lie in its parallel
spectral dynamics and fast radiation scheme. Moist processes
such as clouds, evaporation, and soil moisture are simulated in the model,
though in a simplified manner compared to state-of-the-art GCMs. The latest
version has been parallelised, which has led to massive speed-up and enabled
much higher resolution runs than would be possible on one processor. It has
also undergone changes such as alterations to the cloud and surface
processes, and the addition of gravity wave drag. These changes have
resulted in a significant improvement to the IGCM's representation of the
mean climate as well as its representation of stratospheric processes such
as sudden stratospheric warmings. The IGCM4's physical changes and
climatology are described in this paper.

The computation on the polar regions plays a crucial role in the design of global numerical weather prediction (NWP) models, in two respects: the particular treatment of polar regions in the model's dynamic framework, and the load-balancing problem caused by the parallel data partitioning strategies. The latter has become the bottleneck of massive parallelization of NWP models. To address this problem, a novel spherical data partitioning algorithm based on the weighted-equal-area approach is proposed. The weight describes the computational distribution across the entire sphere. The new algorithm takes the number of collars and the weight function as its parameters and performs the spherical partitioning as follows: the north and the south polar regions are each partitioned into a single subdomain; the remaining sphere surface is partitioned into a number of collars along the latitude; and finally each collar is partitioned into subdomains along the longitude. This method yields two polar caps plus a number of collars whose partition counts increase towards the equator. After a theoretical analysis of the quality of the partitions produced by the algorithm, we take the PSTSWM, a spectral shallow water model based on the spherical harmonic transform technique, as our test-bed to validate our method. The preliminary results indicate that the algorithm achieves good parallel load balance and shows promise for application within the global atmospheric model of GRAPES.
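
An unweighted toy version of this cap-and-collar construction fits in a few lines: two polar caps of one subdomain each, then collar boundaries placed so that each collar's area is proportional to the number of longitude sectors it carries (the function name and inputs are illustrative; the actual algorithm additionally weights by computational load).

```python
import math

def collar_boundaries(parts_per_collar):
    """Colatitude boundaries (radians, 0 = north pole) for an equal-area
    cap-and-collar partition: two one-subdomain polar caps, then collars
    whose areas are proportional to their subdomain counts."""
    total = 2 + sum(parts_per_collar)
    cos_cap = 1 - 2/total                 # cap area = 4*pi/total
    bounds = [0.0, math.acos(cos_cap)]
    cum = 0
    for parts in parts_per_collar:
        cum += parts
        c = cos_cap * (1 - 2*cum/sum(parts_per_collar))
        bounds.append(math.acos(max(-1.0, min(1.0, c))))
    bounds.append(math.pi)
    return bounds

b = collar_boundaries([4, 6, 6, 4])   # 22 subdomains in total
```

Because the sphere's area element is uniform in cos(colatitude), spacing the boundaries linearly in the cosine gives every subdomain the same area, 4*pi/total.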

The development of a fully parallelized regional spectral model (RSM) is described. The vertical layer bands are transposed into latitude bands using MPI routines. The computation of the linear dynamics, such as the semi-implicit step and the time filter, is performed in spectral space. The new RSM-MPI was found to be twice as fast as the old RSM-MPI and can handle dimensions about a factor of 5 larger.

An atmospheric general circulation model (AGCM) for climate studies was developed for the Earth Simulator (ES). The model, called AFES, is based on the CCSR/NIES AGCM and is a global three-dimensional hydrostatic model using the spectral transform method. AFES is optimized for the architecture of the ES. We achieved high sustained performance by executing AFES at T1279L96 resolution on the ES: 26.58 Tflops for the main time-step loop using all 5120 processors (640 nodes) of the ES, corresponding to 64.9% of the theoretical peak performance of 40.96 Tflops. The T1279 resolution, equivalent to about 10 km grid intervals at the equator, is very close to the highest resolution at which the hydrostatic approximation is valid. To the best of our knowledge, no other model simulation of the global atmosphere has ever been performed with such super-high resolution. Currently, such a simulation is possible only on the ES with AFES. In this paper we describe the optimization methods, computational performance, and results of the test runs.

Fourier and related transforms are a family of algorithms widely employed in diverse areas of computational science, notoriously difficult to scale on high-performance parallel computers with a large number of processing elements (cores). This paper introduces a popular software package called P3DFFT which implements fast Fourier transforms (FFTs) in three dimensions in a highly efficient and scalable way. It overcomes a well-known scalability bottleneck of three-dimensional (3D) FFT implementations by using two-dimensional domain decomposition. Designed for portable performance, P3DFFT achieves excellent timings for a number of systems and problem sizes. On a Cray XT5 system P3DFFT attains 45% efficiency in weak scaling from 128 to 65,536 computational cores. Library features include Fourier and Chebyshev transforms, Fortran and C interfaces, in- and out-of-place transforms, uneven data grids, and single and double precision. P3DFFT is available as open source at http://code.google.com/p/p3dfft/. This paper discusses P3DFFT implementation and performance in a way that helps guide the user in making optimal choices for parameters of their runs.

Fast, accurate computation of geophysical fluid dynamics is often very challenging. This is due to the complexity of the PDEs themselves and their initial and boundary conditions. There are several practical advantages to using a relatively new numerical method, the spectral-element method (SEM), over standard methods. SEM combines spectral-method high accuracy with the geometric flexibility and computational efficiency of finite-element methods. This paper is intended to augment the few descriptions of SEM that aim at audiences besides numerical-methods specialists. Advantages of SEM with regard to flexibility, accuracy, and efficient parallel performance are explained, including sufficient details that readers may estimate the benefit of applying SEM to their own computations. The spectral element atmosphere model (SEAM) is an application of SEM to solving the spherical shallow-water or primitive equations. SEAM simulated decaying Jovian atmospheric shallow-water turbulence up to resolution T1067, producing jets and vortices consistent with Rhines theory. SEAM validates the Held-Suarez primitive equations test case and exhibits excellent parallel performance. At T171L20, SEAM scales up to 292 million floating-point operations per second (Mflops) per processor (29% of supercomputer peak) on 32 Compaq ES40 processors (93% efficiency over using 1 processor), allocating 49 spectral elements/processor. At T533L20, SEAM scales up to 130 billion floating-point operations per second (Gflops) (8% of peak) and 9 wall clock minutes per model day on 1024 IBM POWER3 processors (48% efficiency over 16 processors), allocating 17 spectral elements per processor. Local element-mesh refinement with 300% stretching enables conformally embedding T480 within T53 resolution, inside a region containing 73% of the forcing but 6% of the area. Thereby the authors virtually reproduced a uniform-mesh T363 shallow-water computation, at 94% lower cost.

The coupling of a semi-Lagrangian treatment of horizontal advection with a semi-implicit treatment of gravitational oscillations permits longer timesteps than those allowed by a semi-implicit Eulerian scheme. The timestep is then limited by the stability of the explicit treatment of vertical advection. To remove this stability constraint, we propose the use of the semi-Lagrangian advection scheme for both horizontal and vertical advection. This is done in the context of the Canadian regional finite-element weather forecast model that includes a parameterization of the most relevant sub-grid scale processes. It is shown that the three-dimensional semi-Lagrangian scheme produces stable, accurate integrations using timesteps that far exceed the stability limit for the Eulerian model.
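
The essence of the scheme, stability at Courant numbers above one, can be seen in a 1-D sketch: trace each grid point back along its characteristic and interpolate there (periodic advection with linear interpolation here; the model itself is 3-D, finite-element, and semi-implicit).

```python
import numpy as np

def semi_lagrangian_step(u, c, dt, dx):
    """One semi-Lagrangian step for u_t + c u_x = 0 on a periodic 1-D grid:
    the departure point of each node is found exactly, and u is linearly
    interpolated there. Stable even when c*dt/dx > 1."""
    n = u.size
    x_dep = (np.arange(n) - c*dt/dx) % n          # departure points (index units)
    i0 = np.floor(x_dep).astype(int) % n
    w = x_dep - np.floor(x_dep)
    return (1 - w)*u[i0] + w*u[(i0 + 1) % n]

n = 200
x = np.linspace(0, 1, n, endpoint=False)
u = np.exp(-200*(x - 0.5)**2)
dx = 1.0/n
u_new = semi_lagrangian_step(u, c=1.0, dt=3*dx, dx=dx)   # Courant number 3
```

With the Courant number an integer, the departure points land on grid nodes and the step reduces to an exact shift, which makes the stability at CFL = 3 easy to verify.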

The HIRLAM (high resolution limited area modelling) limited-area atmospheric model was originally developed and optimized for shared memory vector-based computers, and has been used for operational weather forecasting on such machines for several years. This paper describes the algorithms applied to obtain a highly parallel implementation of the model, suitable for distributed memory machines. The performance results presented indicate that the parallelization effort has been successful, and the Norwegian Meteorological Institute will run the parallel version in production on a Cray T3E.

We evaluate dynamic data remapping on cluster-of-SMP architectures under OpenMP, MPI, and hybrid paradigms. The traditional method of multi-dimensional array transposition needs an auxiliary array of the same size and a copy-back stage. We recently developed an in-place method using vacancy tracking cycles. The vacancy tracking algorithm outperforms the traditional 2-array method, as demonstrated by extensive comparisons. Performance of multi-threaded parallelism using OpenMP is first tested with different scheduling methods and different numbers of threads. Both methods are then parallelized using several parallel paradigms. At node level, pure OpenMP outperforms pure MPI by a factor of 2.76 for the vacancy tracking method. Across the entire cluster of SMP nodes, by carefully choosing thread numbers, the hybrid MPI/OpenMP implementation outperforms pure MPI by a factor of 3.79 for the traditional method and 4.44 for the vacancy tracking method, demonstrating the validity of the parallel paradigm of mixing MPI with OpenMP.
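
The vacancy-tracking idea above can be sketched in a few lines of serial Python (function name and the bookkeeping array are ours; the cited work's actual implementation detects cycle leaders analytically and parallelizes over cycles):

```python
def transpose_inplace(a, rows, cols):
    # In-place transpose of a rows x cols matrix stored row-major in the
    # flat list `a`.  The element at flat index k belongs at index
    # (k*rows) % (n-1) of the transposed layout, so we follow these
    # permutation cycles, moving a single "vacancy" around each cycle
    # instead of allocating a second full-size array.
    n = rows * cols
    if n <= 2:
        return a
    done = bytearray(n)              # simple bookkeeping for this sketch
    for start in range(1, n - 1):    # indices 0 and n-1 are fixed points
        if done[start]:
            continue
        vac = start                  # lift out a[start], leaving a vacancy
        saved = a[start]
        while True:
            src = (vac * cols) % (n - 1)   # element that belongs at `vac`
            if src == start:
                a[vac] = saved       # cycle closed; drop the saved element in
                done[vac] = 1
                break
            a[vac] = a[src]          # fill the vacancy; it moves to `src`
            done[vac] = 1
            vac = src
    return a
```

The trade-off the abstract measures is exactly this: no auxiliary array and no copy-back pass, at the cost of the non-sequential memory access pattern of cycle following.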

Run time variability of parallel application codes continues to be a significant challenge in clusters. We are studying run time variability at the communication level from the perspective of the application, focusing on the network. To gain insight into this problem, our earlier work developed a tool to emulate parallel applications and in particular their communication. This framework, called parallel application communication emulation (PACE), has produced interesting insights regarding network performance in NOW clusters. A parallel application run time sensitivity evaluation (PARSE) function has been added to the PACE framework to study the run time effects of controlled network performance degradation. This paper introduces PARSE and presents experimental results from tests conducted on several widely used parallel benchmarks and application code fragments. The results suggest that parallel applications can be classified in terms of their sensitivity to network performance variation.

Reconfigurable computing offers the promise of performing computations in hardware to increase performance and efficiency while retaining much of the flexibility of a software solution. Recently, the capacities of reconfigurable computing devices, like field programmable gate arrays, have risen to levels that make it possible to execute 64-bit floating-point operations. SRC Computers has designed the SRC-6 MAPstation to blend the benefits of commodity processors with the benefits of reconfigurable computing. In this paper, we describe our effort to accelerate the performance of several scientific applications on the SRC-6. We describe our methodology, analysis, and results. Our early evaluation demonstrates that the SRC-6 provides a unique software stack that is applicable to many scientific solutions, and our experiments reveal the performance benefits of the system.

This book explains how to use the bulk synchronous parallel (BSP) model to design and implement parallel algorithms in the areas of scientific computing and big data. Furthermore, it presents a hybrid BSP approach towards new hardware developments such as hierarchical architectures with both shared and distributed memory. The book provides a full treatment of core problems in scientific computing and big data, starting from a high-level problem description, via a sequential solution algorithm to a parallel solution algorithm and an actual parallel program written in the communication library BSPlib. Numerical experiments are presented for parallel programs on modern parallel computers ranging from desktop computers to massively parallel supercomputers. The introductory chapter of the book gives a complete overview of BSPlib, so that readers are able to write their own parallel programs at an early stage. Furthermore, it treats BSP benchmarking and parallel sorting by regular sampling. The next three chapters treat basic numerical linear algebra problems such as linear system solving by LU decomposition, sparse matrix-vector multiplication (SpMV), and the fast Fourier transform (FFT). The final chapter explores parallel algorithms for big data problems such as graph matching. The book is accompanied by a software package BSPedupack, freely available online from the author’s homepage, which contains all programs of the book and a set of test programs.

A collection of MATLAB classes for computing and using spherical harmonic transforms is presented. Methods of these classes compute differential operators on the sphere and are used to solve simple partial differential equations in a spherical geometry. The spectral synthesis and analysis algorithms using fast Fourier transforms and Legendre transforms with the associated Legendre functions are presented in detail. A set of methods associated with a spectral field class provides spectral approximation to the differential operators ∇·, ∇×, ∇, and ∇² in spherical geometry. Laplace inversion and Helmholtz equation solvers are also methods for this class. The use of the class and methods in MATLAB is demonstrated by the solution of the barotropic vorticity equation on the sphere. A survey of alternative algorithms is given and implementations for parallel high performance computers are discussed in the context of global climate and weather models. Categories and Subject Descriptors: F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems: Computation of transforms; G.4 [Mathematical Software]; G.1.8 [Numerical Analysis]: Partial Differential Equations; J.2 [Physical Sciences and Engineering]: Earth and atmosphere science. General Terms: Algorithms, Theory
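
The Laplace inversion mentioned above is diagonal in spectral space, because spherical harmonics are eigenfunctions of the Laplacian: ∇²Y_n^m = -n(n+1)/a² Y_n^m. A minimal sketch (the list-of-lists coefficient layout and function name are our own, not the library's):

```python
def invert_laplacian(zeta_nm, radius=1.0):
    # Solve lap(psi) = zeta in spectral space on a sphere of radius `a`.
    # zeta_nm[n][m] holds the coefficient of Y_n^m for 0 <= m <= n.
    # Since lap Y_n^m = -n(n+1)/a^2 * Y_n^m, inversion is a diagonal
    # divide; the n = 0 (global mean) mode is arbitrary and set to zero.
    psi = []
    for n, row in enumerate(zeta_nm):
        if n == 0:
            psi.append([0.0 * c for c in row])
        else:
            eig = -n * (n + 1) / radius**2
            psi.append([c / eig for c in row])
    return psi
```

A Helmholtz solve (∇² - k²)psi = zeta follows the same pattern with eigenvalue -n(n+1)/a² - k², which is why these solvers come essentially for free in a spectral model.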

The spectral transform method used in climate and weather models is known to be computationally intensive. Typically accounting for more than 90% of the execution time of a serial model, it is poised to benefit from computational parallelism. Since dimensionally global transforms impact parallel performance, it is important to establish the realizable parallel efficiency of the spectral transform. To this end, this paper quantitatively characterizes the parallel characteristics of the spectral transform within an atmospheric modeling context. It comprehensively characterizes and catalogs a baseline of operations required for the spectral transform. While previous investigations of the spectral transform method have offered highly idealized analyses that are abstract and simplified in terms of orders of computational magnitude, this research provides a detailed model of the computational complexity of the spectral transform, validated by empirical results. From this validated quantitative analysis, an operational closed-form expression characterizes spectral transform performance in terms of general processor parameters and atmospheric data dimensions. These generalized statements of the computational requirements for the spectral transform can serve as a basis for exploiting parallelism.

The best approach to parallelize multidimensional FFT algorithms has long been under debate. Distributed transposes are widely used, but they also vary in communication policies and hence performance. In this work we analyze the impact of different redistribution strategies on the performance of parallel FFT, on various machine architectures. We found that some redistribution strategies were consistently superior, while some others were unexpectedly inferior. An in-depth investigation into the reasons for this behavior is included in this work. Copyright © 2001 John Wiley & Sons, Ltd.
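
The transpose-based strategy under discussion can be shown in serial form with NumPy: every FFT stage touches only the locally held dimension, and the transpose is the single redistribution step (a sketch; in a distributed code the transpose becomes an all-to-all exchange, which is where the policies compared above differ):

```python
import numpy as np

def fft2_via_transpose(a):
    # 2-D FFT computed as: 1-D FFTs along rows, a global transpose,
    # then 1-D FFTs along rows again.  On a distributed-memory machine
    # each row-FFT stage is purely local; the transpose is the only
    # communication step.
    b = np.fft.fft(a, axis=1)   # local FFTs over the in-processor dimension
    b = b.T.copy()              # the "distributed transpose" (redistribution)
    b = np.fft.fft(b, axis=1)   # local FFTs over the formerly remote dimension
    return b.T                  # logical transpose back to natural ordering
```

The algorithmic work is identical across redistribution strategies; only how the transpose's messages are scheduled changes, which is exactly the performance question the abstract investigates.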

The problem of understanding how Earth's magnetic field is generated is one of the foremost challenges in modern science. It is believed to be generated by a dynamo process, where the complex motions of an electrically conducting fluid provide the inductive action to sustain the field against the effects of dissipation. Current dynamo simulations, based on the numerical approximation to the governing equations of magnetohydrodynamics, cannot reach the very rapid rotation rates and low viscosities (i.e. low Ekman number) of Earth due to limitations in available computing power. Using a pseudospectral method, the most widely-used method for simulating the geodynamo, computational requirements needed to run simulations in an 'Earth-like' parameter regime are explored theoretically by approximating operation counts, memory requirements, and communication costs in the asymptotic limit of large problem size. Theoretical scalings are tested using numerical calculations. For asymptotically large problems the spherical transform is shown to be the limiting step within the pseudospectral method; memory requirements and communication costs

This study measures the effects of changes in message latency and bandwidth for production-level codes on a current generation tightly coupled MPP, the Intel Paragon. Messages are sent multiple times to study the application sensitivity to variations in bandwidth and latency. This method preserves the effects of contention on the interconnection network. Two applications are studied: PCTH, a shock physics code developed at Sandia National Laboratories; and PSTSWM, a spectral shallow water code developed at Oak Ridge National Laboratory and Argonne National Laboratory. These codes are significant in that PCTH is a ‘full physics’ application code in production use, while PSTSWM serves as a parallel algorithm test bed and benchmark for production codes used in atmospheric modeling. They are also significant in that the message-passing behavior differs significantly between the two codes, each representing an important class of scientific message-passing applications. © 1998 John Wiley & Sons, Ltd.

A composite mesh finite-difference method using overlapping stereographic coordinate systems is compared to transform methods based on scalar and vector spherical harmonics. The methods are compared in terms of total computer time, memory requirements, and execution rates for relative accuracy requirements of two and four digits in a five-day forecast. The computational requirements of the three methods were well within an order of magnitude of one another. In most of the cases that are examined, the time step was limited by accuracy rather than stability. This problem can be overcome by the use of a higher order time integration scheme, but at the expense of an increase in the memory requirements. -from Authors

This report presents the details of the governing equations, physical parameterizations, and numerical algorithms defining the version of the NCAR Community Climate Model designated CCM3. The material provides an overview of the major model components, and the way in which they interact as the numerical integration proceeds. As before, it is our objective that this model provide NCAR and the university research community with a reliable, well documented atmospheric general circulation model. This version of the CCM incorporates significant improvements to the physics package, new capabilities such as the incorporation of a slab ocean component, and a number of enhancements to the implementation (e.g., the ability to integrate the model on parallel distributed-memory computational platforms). We believe that collectively these improvements provide the research community with a significantly improved atmospheric modeling capability.

Conventional algorithms for computing large one-dimensional fast Fourier transforms (FFTs), even those algorithms recently developed for vector and parallel computers, are largely unsuitable for systems with external or hierarchical memory. The principal reason for this is the fact that most FFT algorithms require at least m complete passes through the data set to compute a 2^m-point FFT. This paper describes some advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory. These algorithms (1) require as few as two passes through the external data set, (2) employ strictly unit stride, long vector transfers between main memory and external storage, (3) require only a modest amount of scratch space in main memory, and (4) are well suited for vector and parallel computation.
Performance figures are included for implementations of some of these algorithms on Cray supercomputers. Of interest is the fact that a main memory version outperforms the current Cray library FFT routines on the CRAY-2, the CRAY X-MP, and the CRAY Y-MP systems. Using all eight processors on the CRAY Y-MP, this main memory routine runs at nearly two gigaflops.
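
The few-pass structure rests on the classic four-step factorization of an N = R·C point FFT: R-point FFTs, a twiddle scaling, C-point FFTs, and a transpose. A serial NumPy sketch of that factorization (names ours; the out-of-core versions add blocking and staged transfers between passes):

```python
import numpy as np

def four_step_fft(x, R, C):
    # FFT of length N = R*C via the four-step factorization.
    N = R * C
    M = x.reshape(R, C)                        # M[a, b] = x[C*a + b]
    T = np.fft.fft(M, axis=0)                  # step 1: R-point FFTs down columns
    c = np.arange(R)[:, None]
    b = np.arange(C)[None, :]
    T = T * np.exp(-2j * np.pi * b * c / N)    # step 2: twiddle factors w^{bc}
    U = np.fft.fft(T, axis=1)                  # step 3: C-point FFTs along rows
    return U.T.reshape(N)                      # step 4: transpose to natural order
```

Each FFT stage makes one pass over the data with long unit-stride accesses along one axis, which is why this factorization suits external and hierarchical memories.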

In a multiprocessor with distributed storage the data structures have a significant impact on the communication complexity. In this paper we present a few algorithms for performing matrix transposition on a Boolean n-cube. One algorithm performs the transpose in a time proportional to the lower bound both with respect to communication start-ups and to element transfer times. We present algorithms for transposing a matrix embedded in the cube by a binary encoding, a binary-reflected Gray code encoding of rows and columns, or combinations of these two encodings. The transposition of a matrix when several matrix elements are identified to a node by consecutive or cyclic partitioning is also considered and lower bound algorithms given. Experimental data are provided for the Intel iPSC and the Connection Machine.
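
The communication pattern behind such distributed transposes is pairwise block exchange: nodes (i, j) and (j, i) swap their blocks, and each block is then transposed locally. A toy serial sketch with the block grid as nested lists (all names ours; the cited algorithms additionally schedule the exchanges to match the cube topology):

```python
def block_transpose(blocks):
    # Transpose a matrix stored as a p x p grid of rectangular blocks:
    # node pair (i, j) and (j, i) exchange blocks, then each received
    # block is transposed locally.  Diagonal blocks transpose in place.
    p = len(blocks)
    for i in range(p):
        for j in range(i, p):
            bi, bj = blocks[i][j], blocks[j][i]
            blocks[i][j] = [list(r) for r in zip(*bj)]  # local transpose
            blocks[j][i] = [list(r) for r in zip(*bi)]
    return blocks
```

On a hypercube the p*(p-1)/2 exchanges are scheduled so that in each step every node sends along one cube dimension, which is how the lower-bound start-up counts in the abstract are achieved.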

The development and use of three-dimensional computer models of the earth's climate are discussed. The processes and interactions of the atmosphere, oceans, and sea ice are examined. The basic theory of climate simulation which includes the fundamental equations, models, and numerical techniques for simulating the atmosphere, oceans, and sea ice is described. Simulated wind, temperature, precipitation, ocean current, and sea ice distribution data are presented and compared to observational data. The responses of the climate to various environmental changes, such as variations in solar output or increases in atmospheric carbon dioxide, are modeled. Future developments in climate modeling are considered. Information is also provided on the derivation of the energy equation, the finite difference barotropic forecast model, the spectral transform technique, and the finite difference shallow water wave equation model.

Implementations of climate models on scalable parallel computer systems can suffer from load imbalances because of temporal and spatial variations in the amount of computation required for physical parameterizations such as solar radiation and convective adjustment. We have developed specialized techniques for correcting such imbalances. These techniques are incorporated in a general-purpose, programmable load-balancing library that allows the mapping of computation to processors to be specified as a series of maps generated by a programmer-supplied load-balancing module. The communication required to move from one map to another is performed automatically by the library, without programmer intervention. In this paper, we describe the load-balancing problem and the techniques that we have developed to solve it. We also describe specific load-balancing algorithms that we have developed for PCCM2, a scalable parallel implementation of the Community Climate Model, and present experimental...

This paper is a brief overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses a standard, portable message-passing library, the code can be easily ported to other multiprocessors supporting a message-passing programming paradigm, or run on machines distributed across a network. The parallelization strategy used is to decompose the problem domain into geographical patches and assign each processor to do the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using...

In this paper, we review the various parallel algorithms used in PCCM2 and the work done to arrive at a validated model. 2. THE NCAR CCM. Over the past decade, the NCAR Climate and Global Dynamics Division has provided a comprehensive, three-dimensional global atmospheric model to university and NCAR scientists for use in the analysis and understanding of global climate. Because of its widespread use, the model was designated a Community Climate Model (CCM). The most recent version of the CCM, CCM2, was released to the research community in October 1992 (Hack et al. 1992; Bath, Rosinski, and Olson 1992). This incorporates improved physical representations of a wide range of key climate processes, including clouds, radiation, moist convection, the planetary boundary layer, and transport.

Choice of an appropriate strategy for balancing load in climate models running on parallel processors depends on the nature and size of inherent imbalances. Physics routines of the NCAR Community Climate Model were instrumented to produce per-cell load data for each time step, revealing load imbalance resulting from surface type, polar night, weather patterns, and the earth's terminator. Hourly, daily, and annual cycles in processor performance over the model grid were also uncovered. Data from CCM1 suggested a number of static processor allocation strategies.

The communication performance of the i860-based Intel DELTA mesh supercomputer is compared with the Intel iPSC/860 hypercube and the Ncube 6400 hypercube. Single and multiple hop communication bandwidth and latencies are measured. Concurrent communication speeds and speed under network load are also measured. File I/O performance of the mesh-attached Concurrent File System is measured.

A one-level, global, spectral model using the primitive equations is formulated in terms of a concise form of the prognostic equations for vorticity and divergence. The model integration incorporates a grid transform technique to evaluate nonlinear terms; the computational efficiency of the model is found to be far superior to that of an equivalent model based on the traditional interaction coefficients. The transform model, in integrations of 116 days, satisfies principles of conservation of energy, angular momentum, and square potential vorticity to a high degree.

This book explains how many major scientific algorithms can be used on large parallel machines. Based on five years of research on hypercubes, the book concentrates on practically motivated model problems, that serve to illustrate generic algorithmic and decomposition techniques. The authors include results for hypercube-class concurrent computers with up to 128 nodes, and the principles behind the extrapolation to much larger systems are described.

The spectral transform method is a standard numerical technique used to solve partial differential equations on the sphere in global climate modeling. In particular, it is used in CCM1 and CCM2, the Community Climate Models developed at the National Center for Atmospheric Research. This paper describes initial experiences in parallelizing a program that uses the spectral transform method to solve the non-linear shallow water equations on the sphere, showing that an efficient implementation is possible on the Intel iPSC/860. The use of PICL, a portable instrumented communication library, and Paragraph, a performance visualization tool, in tuning the implementation is also described.
The Legendre transform and the Fourier transform comprise the computational kernel of the spectral transform method. This paper is a case study of parallelizing the Legendre transform. For many problem sizes and numbers of processors, the spectral transform method can be parallelized efficiently by parallelizing only the Legendre transform.
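
The Legendre-transform kernel pairs a quadrature-based analysis with a synthesis on Gauss-Legendre latitudes. A scalar sketch using ordinary (not associated) Legendre polynomials, with function and variable names of our own choosing:

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, legvander

def legendre_transform_pair(nspec, nlat):
    # Build synthesis and analysis matrices for a discrete Legendre
    # transform on a Gauss-Legendre grid.  Real spectral models use the
    # *associated* Legendre functions P_n^m, one set per Fourier
    # wavenumber m; this scalar version shows the structure.
    mu, w = leggauss(nlat)                 # quadrature nodes and weights
    P = legvander(mu, nspec - 1)           # P[j, n] = P_n(mu_j)
    norm = np.sqrt((2 * np.arange(nspec) + 1) / 2.0)
    Pbar = P * norm                        # orthonormal \bar P_n(mu_j)
    synth = Pbar                           # f_j   = sum_n a_n \bar P_n(mu_j)
    anal = (Pbar * w[:, None]).T           # a_n   = sum_j w_j \bar P_n(mu_j) f_j
    return synth, anal
```

Gauss-Legendre quadrature with nlat points is exact for polynomials of degree up to 2*nlat - 1, so with nspec <= nlat the analysis exactly inverts the synthesis; parallelizing only this matrix-product stage is what the paragraph above refers to.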

A suite of seven test cases is proposed for the evaluation of numerical methods intended for the solution of the shallow water equations in spherical geometry. The shallow water equations exhibit the major difficulties associated with the horizontal dynamical aspects of atmospheric modeling on the spherical earth. These cases are designed for use in the evaluation of numerical methods proposed for climate modeling and to identify the potential trade-offs which must always be made in numerical modeling. Before a proposed scheme is applied to a full baroclinic atmospheric model it must perform well on these problems in comparison with other currently accepted numerical methods. The cases are presented in order of complexity. They consist of advection across the poles, steady state geostrophically balanced flow of both global and local scales, forced nonlinear advection of an isolated low, zonal flow impinging on an isolated mountain, Rossby-Haurwitz waves, and observed atmospheric states. One of the cases is also identified as a computer performance/algorithm efficiency benchmark for assessing the performance of algorithms adapted to massively parallel computers.

GMD and ECMWF (European Centre for Medium-Range Weather Forecasts) joined forces some months ago in order to parallelize ECMWF's production code for medium-range weather forecasts, the Integrated Forecasting System (IFS). Meanwhile, the first milestone of this cooperation has been reached: The 2D model of the IFS, which already contains all relevant data structures and algorithmic components of the corresponding 3D models, has been parallelized and run successfully on quite a large variety of different parallel machines. Performance measurements confirm the expected parallel efficiencies of up to 80% and more. This paper discusses the parallelization strategy employed and gives a survey of the performance results obtained on the parallel systems.

We describe the design of a parallel global atmospheric circulation model, PCCM2. This parallel model is functionally equivalent to the National Center for Atmospheric Research's Community Climate Model, CCM2, but is structured to exploit distributed memory multi-computers. PCCM2 incorporates parallel spectral transform, semi-Lagrangian transport, and load balancing algorithms. We present detailed performance results on the IBM SP2 and Intel Paragon. These results provide insights into the scalability of the individual parallel algorithms and of the parallel model as a whole.

One issue which is central in developing a general purpose FFT subroutine on a distributed memory parallel machine is the data distribution. It is possible that different users would like to use the FFT routine with different data distributions. Thus there is a need to design FFT schemes on distributed memory parallel machines which can support a variety of data distributions. In this paper we present an FFT implementation on a distributed memory parallel machine which works for a number of data distributions commonly encountered in scientific applications. We have also addressed the problem of rearranging the data after computing the FFT. We have evaluated the performance of our implementation on a distributed memory parallel machine, the Intel iPSC/860.

In a hypercube multiprocessor with distributed memory, messages have a street address and an apartment number, i.e., a hypercube node address and a local memory address. Here we describe an optimal algorithm for performing the communication described by exchanging the bits of the node address with those of the local address. These exchanges occur typically in both matrix transposition and bit reversal for the fast Fourier transform.
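
The bit reversal mentioned above is, in serial form, just a permutation on the binary representation of the index (helper name ours; on the hypercube the high-order bits live in the node address, turning the permutation into communication):

```python
def bit_reverse_permute(x):
    # Reorder x (length 2^d) so the element at index i lands at the
    # index whose d-bit binary representation is reversed -- the data
    # movement needed to produce an ordered FFT result.
    n = len(x)
    d = n.bit_length() - 1
    out = [None] * n
    for i in range(n):
        r = int(format(i, '0{}b'.format(d))[::-1], 2)
        out[r] = x[i]
    return out
```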

Several multiprocessor FFTs are developed in this paper for both vector multiprocessors with shared memory and the hypercube. Two FFTs for vector multiprocessors are given that compute an ordered transform and have a stride of one except for a single ‘link’ step. Since multiple FFTs provide additional options for both vectorization and distribution, we show that a single FFT can be performed in terms of two multiple FFTs, and we develop algorithms that minimize interprocessor communication. On a hypercube of dimension d the unordered FFT requires d + 1 parallel transmissions. The ordered FFT requires from 1.5d + 2 to 2d + 1 parallel transmissions depending on the length of the sequence. It is also shown that a class of orderings called index-digit permutations, which includes matrix transposition, the perfect shuffle, and digit reversal, can be performed with less than or equal to 1.5d parallel transmissions.

The performance of the Intel iPSC/860 hypercube and the Ncube 6400 hypercube are compared with earlier hypercubes from Intel and Ncube. Computation and communication performance for a number of low-level benchmarks are presented for the Intel iPSC/1, iPSC/2, and iPSC/860 and for the Ncube 3200 and 6400. File I/O performance of the iPSC/860 and Ncube 6400 are compared.

Analytic and empirical studies are presented that allow the parallel performance, and hence the scalability, of the spectral transform method to be quantified on different parallel computer architectures. Both the shallow-water equations and complete GCMs are considered. Results indicate that for the shallow-water equations, parallel efficiency is generally poor because of high communication requirements. It is predicted that for complete global climate models, the parallel efficiency will be significantly better; nevertheless, projected teraflop computers will have difficulty achieving acceptable throughput necessary for long-term regional climate studies. -from Authors

A modified version of the Fast Fourier Transform is developed and described. This version is well adapted for use in a special-purpose computer designed for the purpose. It is shown that only three operators are needed. One operator replaces successive pairs of data points by their sums and differences. The second operator performs a fixed permutation which is an ideal shuffle of the data. The third operator permits the multiplication of a selected subset of the data by a common complex multiplier.
If, as seems reasonable, the slowest operation is the complex multiplications required, then, for reasonably sized data sets (e.g. 512 complex numbers), parallelization by the method developed should allow an increase of speed over the serial use of the Fast Fourier Transform by about two orders of magnitude.
It is suggested that a machine to realize the speed improvement indicated is quite feasible.
The analysis is based on the use of the Kronecker product of matrices. It is suggested that this form is of general use in the development and classification of various modifications and extensions of the algorithm.

The original Cooley-Tukey FFT was published in 1965 and presented for sequences with length N equal to a power of two. However, in the same paper they noted that their algorithm could be generalized to composite N in which the length of the sequence was a product of small primes. In 1967, Bergland presented an algorithm for composite N, and variants of his mixed radix FFT are currently in wide use. In 1968, Bluestein presented an FFT for arbitrary N including large primes. However, for composite N, Bluestein's FFT was not competitive with Bergland's FFT. Since it is usually possible to select a composite N, Bluestein's FFT did not receive much attention. Nevertheless, because of its minimal communication requirements, the Bluestein FFT may be the algorithm of choice on multiprocessors, particularly those with the hypercube architecture. In contrast to the mixed radix FFT, the communication pattern of the Bluestein FFT maps quite well onto the hypercube. With P = 2^d processors, an ordered Bluestein FFT requires 2d communication cycles with packet length N/(2P), which is comparable to the requirements of a power-of-two FFT. For fine-grain computations, the Bluestein FFT requires 20 log2 N computational cycles. Although this is double that required for a mixed radix FFT, the Bluestein FFT may nevertheless be preferred because of its lower communication costs. For most values of N it is also shown to be superior to another alternative, namely parallel matrix multiplication.
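
Bluestein's trick rewrites nk = (n^2 + k^2 - (k - n)^2)/2, turning the arbitrary-length DFT into a single chirp-modulated convolution that can be evaluated with power-of-two FFTs. A serial NumPy sketch of that idea (ours, not the paper's hypercube formulation):

```python
import numpy as np

def bluestein_fft(x):
    # DFT of arbitrary length n via one circular convolution of
    # power-of-two length m >= 2n - 1 (so no aliasing of the linear
    # convolution occurs on outputs 0..n-1).
    x = np.asarray(x, dtype=complex)
    n = len(x)
    k = np.arange(n)
    chirp = np.exp(-1j * np.pi * k * k / n)    # e^{-i pi k^2 / n}
    m = 1 << (2 * n - 1).bit_length()          # next power of two >= 2n - 1
    b = np.zeros(m, dtype=complex)
    b[:n] = np.conj(chirp)                     # chirp kernel, lags 0..n-1
    b[m - n + 1:] = np.conj(chirp[:0:-1])      # wrapped negative lags
    conv = np.fft.ifft(np.fft.fft(x * chirp, m) * np.fft.fft(b))
    return chirp * conv[:n]
```

The two power-of-two FFTs are what give the algorithm its regular, hypercube-friendly communication pattern for any N, including primes.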

We outline a unified approach for building a library of collective communication operations that performs well on a cross-section of problems encountered in real applications. The target architecture is a two-dimensional mesh with worm-hole routing, but the techniques also apply to higher dimensional meshes and hypercubes. We stress a general approach, addressing the need for implementations that perform well for various sized vectors and grid dimensions, including non-power-of-two grids. This requires the development of general techniques for building hybrid algorithms. Finally, the approach also supports collective communication within a group of nodes, which is required by many scalable algorithms. Results from the Intel Paragon system are included.

Fairness is an important issue when benchmarking parallel computers using application codes. The best parallel algorithm on one platform may not be the best on another. While it is not feasible to re-evaluate parallel algorithms and reimplement large codes whenever new machines become available, it is possible to embed algorithmic options into codes that allow them to be “tuned” for a particular machine without requiring code modifications. We describe a code in which such an approach was taken. PSTSWM was developed for evaluating parallel algorithms for the spectral transform method in atmospheric circulation models. Many levels of runtime-selectable algorithmic options are supported. We discuss these options and our evaluation methodology. We also provide empirical results from a number of parallel machines, indicating the importance of tuning for each platform before making a comparison.

Two complete exchange algorithms for meshes are given. The modified quadrant exchange algorithm is based on the quadrant exchange algorithm and is well suited for square meshes with a power-of-two number of rows and columns. The store-and-forward complete exchange algorithm is suitable for meshes of arbitrary size. A pipelined broadcast algorithm for meshes is also presented. This new algorithm, called the double hop broadcast, can broadcast long messages at slightly lower cost than the edge-disjoint fence algorithm because it uses routing trees of lower height. This shows that there is still room for improvement in the design of pipelined broadcast algorithms for meshes.

The scalability of the parallel fast Fourier transform (FFT) algorithm on mesh- and hypercube-connected multicomputers is analyzed. The hypercube architecture provides linearly increasing performance for the FFT algorithm with an increasing number of processors and a moderately increasing problem size. However, there is a limit on the efficiency, which is determined by the communication bandwidth of the hypercube channels. Efficiencies higher than this limit can be obtained only if the problem size is increased very rapidly. Technology-dependent features, such as the communication bandwidth, determine the upper bound on the overall performance that can be obtained from a P-processor system. The upper bound can be moved up by either improving the communication-related parameters linearly or increasing the problem size exponentially. The scalability analysis shows that the FFT algorithm cannot make efficient use of large-scale mesh architectures. The addition of such features as cut-through routing and multicasting does not improve the overall scalability on this architecture

The authors present the scalability analysis of a parallel fast Fourier transform (FFT) algorithm on mesh- and hypercube-connected multicomputers using the isoefficiency metric. The isoefficiency function of an algorithm-architecture combination is defined as the rate at which the problem size should grow with the number of processors to maintain a fixed efficiency. It is shown that it is more cost-effective to implement the FFT algorithm on a hypercube rather than a mesh despite the fact that large-scale meshes are cheaper to construct than large hypercubes. Although the scope of this work is limited to the Cooley-Tukey FFT algorithm on a few classes of architectures, the methodology can be used to study the performance of various FFT algorithms on a variety of architectures, such as SIMD hypercube and mesh architectures and shared-memory architectures.
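The isoefficiency idea can be made concrete with a toy cost model. The sketch below assumes (our simplification, not the paper's exact model) unit time per butterfly operation and a per-word transfer cost `t_w` for the binary-exchange FFT on a hypercube, ignoring startup latency; it then searches numerically for the problem size needed to hold a target efficiency as the processor count grows.

```python
import math

# Toy cost model for a binary-exchange FFT of size n on a P-processor
# hypercube. Assumptions (ours, for illustration): one time unit per
# butterfly, per-word transfer cost t_w, no startup latency.

def efficiency(n, P, t_w=2.0):
    t_serial = n * math.log2(n)
    t_parallel = (n * math.log2(n)) / P + t_w * (n / P) * math.log2(P)
    return t_serial / (P * t_parallel)

def isoefficiency_point(P, target=0.8, t_w=2.0):
    """Smallest power-of-two n achieving the target efficiency on P processors."""
    n = P
    while efficiency(n, P, t_w) < target:
        n *= 2
    return n

# Under this model E = log2(n) / (log2(n) + t_w * log2(P)), so holding
# E = 0.8 with t_w = 2 requires log2(n) >= 8 * log2(P), i.e. n ~ P**8:
# the problem size must grow extremely fast, as the analysis above notes.
for P in (16, 64, 256):
    print(P, isoefficiency_point(P))
```

Even this crude model reproduces the paper's qualitative conclusion: maintaining fixed efficiency forces the problem size to grow polynomially in P with a large exponent set by the communication parameters.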

Given a vector of N elements, the perfect shuffle of this vector is a permutation of the elements that is identical to a perfect shuffle of a deck of cards. Elements of the first half of the vector are interlaced with elements of the second half in the perfect shuffle of the vector. We indicate by a series of examples that the perfect shuffle is an important interconnection pattern for a parallel processor. The examples include the fast Fourier transform (FFT), polynomial evaluation, sorting, and matrix transposition. For the FFT and sorting, the rate of growth of computational steps for algorithms that use the perfect shuffle is the least known today, and is somewhat better than the best rate that is known for versions of these algorithms that use the interconnection scheme used in the ILLIAC IV. Copyright © 1971 by The Institute of Electrical and Electronics Engineers, Inc.
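The permutation described above is easy to state concretely. A minimal sketch, interlacing the two halves of a vector exactly as a riffle shuffle interlaces two halves of a deck:

```python
def perfect_shuffle(v):
    """Interlace the first half of v with the second half,
    like a perfect riffle shuffle of a deck of cards."""
    n = len(v) // 2
    out = []
    for a, b in zip(v[:n], v[n:]):
        out.extend([a, b])
    return out

# Index view: for N a power of two, the element at index i moves to
# index (2*i) mod (N - 1) for i < N - 1, and index N - 1 stays put --
# a left rotation of the index bits, which is why log2(N) repeated
# shuffles restore the original order.
print(perfect_shuffle([0, 1, 2, 3, 4, 5, 6, 7]))
# -> [0, 4, 1, 5, 2, 6, 3, 7]
```

The bit-rotation view of the index mapping is what connects the shuffle to the FFT: each shuffle stage aligns the operand pairs needed by one rank of butterflies.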

This paper investigates the suitability of the spectral transform method for parallel implementation. The spectral transform method is a natural candidate for general circulation models designed to run on large-scale parallel computers due to the large number of existing serial and moderately parallel implementations. We present analytic and empirical studies that allow us to quantify the parallel performance, and hence the scalability, of the spectral transform method on different parallel computer architectures. We consider both the shallow-water equations and complete GCMs. Our results indicate that for the shallow-water equations parallel efficiency is generally poor because of high communication requirements. We predict that for complete global climate models, the parallel efficiency will be significantly better; nevertheless, projected Teraflop computers will have difficulty achieving acceptable throughput necessary for long-term regional climate studies.

The spectral transform method is a standard numerical technique for solving partial differential equations on the sphere and is widely used in global climate modeling. In this paper, we outline different approaches to parallelizing the method and describe experiments that we are conducting to evaluate the efficiency of these approaches on parallel computers. The experiments are conducted using a testbed code that solves the nonlinear shallow water equations on a sphere, but are designed to permit evaluation in the context of a global model. They allow us to evaluate the relative merits of the approaches as a function of problem size and number of processors. The results of this study are guiding ongoing work on PCCM2, a parallel implementation of the Community Climate Model developed at the National Center for Atmospheric Research.

Implementation of the NCAR CCM2 on the Connection Machine

- R D Loft
- R K Sato

R. D. Loft and R. K. Sato, Implementation of the NCAR CCM2 on the Connection
Machine, in Parallel Supercomputing in Atmospheric Science: Proceedings of the Fifth
ECMWF Workshop on Use of Parallel Processors in Meteorology, G.-R. Hoffman and
T. Kauranne, eds., World Scientific Publishing Co. Pte. Ltd., Singapore, 1993, pp. 371-393.

On the parallelization of global spectral Eulerian shallow-water models

- S Barros
- T Kauranne

S. Barros and T. Kauranne, On the parallelization of global spectral Eulerian shallow-water
models, in Parallel Supercomputing in Atmospheric Science: Proceedings of the Fifth
ECMWF Workshop on Use of Parallel Processors in Meteorology, G.-R. Hoffman and
T. Kauranne, eds., World Scientific Publishing Co. Pte. Ltd., Singapore, 1993, pp. 36-43.

The ECMWF model on the Cray Y-MP8

- D Dent

D. Dent, The ECMWF model on the Cray Y-MP8, in The Dawn of Massively Parallel
Processing in Meteorology, G.-R. Hoffman and D. K. Maretis, eds., Springer-Verlag, Berlin,
1990.

- G C Fox
- M A Johnson
- G A Lyzenga
- S W Otto
- J K Salmon
- D W Walker

G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W.
Walker, Solving Problems on Concurrent Processors, vol. 1, Prentice-Hall, Englewood
Cliffs, NJ, 1988.

Scalability estimates of parallel spectral atmospheric models

- T Kauranne
- S Barros

T. Kauranne and S. Barros, Scalability estimates of parallel spectral atmospheric models, in Parallel Supercomputing in Atmospheric Science: Proceedings of the Fifth ECMWF
Workshop on Use of Parallel Processors in Meteorology, G.-R. Hoffman and T. Kauranne,
eds., World Scientific Publishing Co. Pte. Ltd., Singapore, 1993, pp. 312-328.

An efficient, one-level, primitive-equation spectral model, Monthly Weather Review

- W Bourke

W. BOURKE, An efficient, one-level, primitive-equation spectral model, Monthly Weather Review, 102 (1972), pp. 687-701.