Article

Making a Supercomputer Do What You Want: High-Level Tools for Parallel Programming

Abstract

The use of high-level tools for parallel programming, which let users carry out large and complex computations with familiar desktop software, is discussed. Low-level parallel programming gives the user maximum control and, with careful tuning, the best performance. Task-specific libraries such as PLAPACK and ScaLAPACK are used for dense linear algebra. Because a particular application often requires data to be distributed in a format different from the one a library expects, intermediate code must be written to translate and redistribute the data. Desktop simulation tools such as Mathematica and Matlab have been adapted to run on high-performance computers. These tools are built on top of low-level communication, task-specific, and application-specific libraries and provide high-level operators for complex computations. High-level tools also make it easy to experiment with different solvers and automatically take care of data redistribution in parallel operations.


... The Buffon-Laplace needle algorithm is a simple Monte Carlo algorithm for the approximate calculation of π. It has been used in previous articles in this magazine ([5],[6]) to illustrate the use of parallel software development tools. The algorithm is based on the estimation of the probability, P(l, a, b), that a needle of length l thrown on a two-dimensional grid with cells of length a in one direction and b in the other direction will intersect at least one line ...
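As an illustration, a minimal serial Python sketch of the Buffon-Laplace estimator follows; the cited articles parallelize the trial loop, which is not shown here, and the needle length and cell sizes are illustrative values.

```python
import numpy as np

def buffon_laplace_pi(n_trials, l=0.8, a=1.0, b=1.0, seed=0):
    """Estimate pi by dropping a needle of length l (< a, b) onto an a-by-b grid."""
    rng = np.random.default_rng(seed)
    # Uniform needle centre within one grid cell and a uniform orientation.
    x = rng.uniform(0.0, a, n_trials)
    y = rng.uniform(0.0, b, n_trials)
    theta = rng.uniform(0.0, np.pi, n_trials)
    # The needle crosses a grid line if either half-projection sticks out of the cell.
    dx = 0.5 * l * np.abs(np.cos(theta))
    dy = 0.5 * l * np.abs(np.sin(theta))
    p_hat = ((x < dx) | (x > a - dx) | (y < dy) | (y > b - dy)).mean()
    # Invert P(l, a, b) = (2*l*(a + b) - l**2) / (pi * a * b) to recover pi.
    return (2.0 * l * (a + b) - l ** 2) / (p_hat * a * b)

print(buffon_laplace_pi(1_000_000))   # slowly converges toward 3.1415...
```

In a parallel version, each process runs an independent batch of trials and the hit counts are combined with a single reduction.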
Article
The advent of affordable parallel computers such as Beowulf PC clusters and, more recently, multicore PCs has been highly beneficial for a large number of scientists and smaller institutions that might not otherwise have access to substantial computing facilities. However, there has not been analogous progress in the development and dissemination of parallel software: scientists need the expertise to develop parallel codes and must invest a significant amount of time in the development of tools even for the most common data-analysis tasks. The authors describe the Beowulf analysis symbolic interface (BASIN), a multiuser parallel data analysis and visualization framework. BASIN aims to provide scientists with a suite of parallel libraries for astrophysical data analysis, along with general tools for data distribution and parallel operations on distributed data, that let them easily develop new parallel libraries for their specific tasks.
... The ScaLAPACK library involves complex distributed data structures that our proposed interfaces, PyScaLAPACK, hide from the final user, making them easier to use. Other projects with similar purposes are, e.g., Parallel Matlab [6], NetSolve [7], or Star-P [8,9]. Some of the results in this paper have been presented in preliminary form in [10]. ...
Article
In many high-performance engineering and scientific applications there is a need to use parallel software libraries. Researchers behind these applications find it difficult to understand the interfaces to these libraries because they carry arguments related to the parallel environment and performance in addition to arguments related to the problem at hand. In this paper we introduce the use of high-level user interfaces for ScaLAPACK. Concretely, a Python-based interface to ScaLAPACK is proposed. Numerical experiments comparing traditional programming practices with our proposed approach are presented. These experiments evaluate not only the performance of the Python interfaces but also how much more user-friendly they are compared to the original calls, and show that PyScaLAPACK does not hinder the performance delivered by ScaLAPACK. Finally, an example of a real scientific application code, whose functionality can be prototyped or extended with the use of PyScaLAPACK, is presented.
Preprint
Full-text available
Existing power modelling research focuses on the model rather than on the process for developing models. This work develops an automated power modelling process that can be deployed on different processors to build power models with high accuracy. For this, (i) an automated hardware performance counter selection method that selects the counters best correlated to power on both ARM and Intel processors, (ii) a noise filter based on clustering that can reduce the mean error in power models, and (iii) a two-stage power model that surmounts the challenges in using existing power models across multiple architectures are proposed and developed. The key results are: (i) the automated hardware performance counter selection method achieves comparable selection to the manual method reported in the literature, (ii) the noise filter reduces the mean error in power models by up to 55%, and (iii) the two-stage power model can predict dynamic power with less than 8% error on both ARM and Intel processors, an improvement over classic models.
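The counter selection step lends itself to a compact sketch. The Python fragment below ranks counters by absolute Pearson correlation with measured power, which is a simplification of the automated method in the preprint; the counter names and traces are hypothetical.

```python
import numpy as np

def select_counters(samples, power, k=3):
    """Rank hardware counters by absolute Pearson correlation with measured power.

    samples: dict mapping counter name -> 1-D array of readings per interval
    power:   1-D array of measured power for the same intervals
    """
    scores = {}
    for name, values in samples.items():
        # Correlation is undefined for constant counters; treat them as uninformative.
        scores[name] = 0.0 if np.std(values) == 0 else abs(np.corrcoef(values, power)[0, 1])
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical counter traces (names are illustrative, not a real PMU layout).
rng = np.random.default_rng(1)
power = rng.uniform(5.0, 20.0, 200)
samples = {
    "instructions": power * 1e7 + rng.normal(0, 1e6, 200),
    "cache_misses": rng.uniform(0, 1e5, 200),
    "bus_cycles": power * 3e6 + rng.normal(0, 5e5, 200),
}
print(select_counters(samples, power, k=2))
```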
Conference Paper
Full-text available
This paper reports partial results of a project to determine in which situations the Python language offers advantages over traditional languages in the development of High Performance Applications.
Conference Paper
The high performance MATLAB user now has more choices than ever. Interactive Supercomputing's Star-P embraces this new world where, as an example, a MATLAB user who never wants to leave MATLAB might sit next to a C++ programmer at the office and both surf the Internet for the latest high speed FFT written in yet another language. The MATLAB of the past now becomes one browser into a bigger computational world. HPC users need this bigger world. Other "browsers" can be imagined. The open Star-P platform gives users options never before available to programmers who have traditionally enjoyed living exclusively inside a MATLAB environment.
Conference Paper
Full-text available
We will present a cost-effective and flexible realization of High Performance Computing (HPC) clustering and its potential in solving computationally intensive problems in computer vision. The featured software foundation to support the parallel programming is the GNU Parallel Knoppix package with Message Passing Interface (MPI) based Octave, Python and C interface capabilities. The implementation is especially of interest in applications where the main objective is to reuse the existing hardware infrastructure and to maintain the overall budget cost. We will present the benchmark results and compare and contrast the performances of Octave and MATLAB.
Conference Paper
Based on the Windows system, a fast parallel MATLAB algorithm is used to calculate the RCS pattern of a perfect electric conductor (PEC) with the method of moments (MoM). The computation of the impedance matrix and of its inverse is divided among the parallel computers, and each computer completes its own tasks while the necessary data are exchanged through a shared space built on top of the local area network. All the numerical examples show that the new technique is faster and can handle electrically large objects that cannot be computed on a single personal computer.
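The row-partitioned matrix fill can be sketched with mpi4py rather than the paper's MATLAB/LAN file-sharing mechanism; z_entry is a placeholder for the actual MoM interaction integral, and the problem size and excitation are illustrative.

```python
import numpy as np
from mpi4py import MPI

def z_entry(m, n):
    # Placeholder for the MoM interaction integral between basis functions m and n;
    # a real code would evaluate Green's-function integrals here.
    return 1.0 / (1.0 + abs(m - n))

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
N = 400                                    # number of basis functions (illustrative)
rows = np.array_split(np.arange(N), size)[rank]

# Each process fills only its block of rows of the impedance matrix.
Z_local = np.array([[z_entry(m, n) for n in range(N)] for m in rows])

# Gather the row blocks on rank 0 and solve for the surface currents there.
Z_blocks = comm.gather(Z_local, root=0)
if rank == 0:
    Z = np.vstack(Z_blocks)
    v = np.ones(N)                         # excitation vector (illustrative)
    currents = np.linalg.solve(Z, v)
    print(currents[:5])
```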
Article
Full-text available
The combination of the Python language and the bulk synchronous parallel computing model makes developing and testing parallel programs a much more pleasurable experience.
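A BSP program alternates local computation, communication, and a barrier. The sketch below expresses one superstep with mpi4py rather than the Python BSP package discussed in the article.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Superstep 1: purely local computation.
local_value = (rank + 1) ** 2

# Communication phase: every process sends its value to every other process.
incoming = comm.alltoall([local_value] * size)

# The barrier ends the superstep; all messages are now delivered.
comm.Barrier()

# Superstep 2: use the received values.
print(f"rank {rank}: sum of all values = {sum(incoming)}")
```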
Article
Full-text available
This survey article reviews the history and current importance of Krylov subspace iteration algorithms.
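As a concrete example of a Krylov subspace iteration, here is a minimal conjugate gradient solver in Python; each step enlarges the Krylov subspace span{b, Ab, A²b, ...} by one vector. The test matrix is illustrative.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A with the CG iteration."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Small SPD test problem.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))   # approximately [0.0909, 0.6364]
```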
Article
Full-text available
Cilk (pronounced "silk") is a C-based runtime system for multithreaded parallel programming. In this paper, we document the efficiency of the Cilk work-stealing scheduler, both empirically and analytically. We show that on real and synthetic applications, the "work" and "critical path" of a Cilk computation can be used to accurately model performance. Consequently, a Cilk programmer can focus on reducing the work and critical path of his computation, insulated from load balancing and other runtime scheduling issues. We also prove that for the class of "fully strict" (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal. The Cilk runtime system currently runs on the Connection Machine CM5 MPP, the Intel Paragon MPP, the Silicon Graphics Power Challenge SMP, and the MIT Phish network of workstations. Applications written in Cilk include protein folding, graphic rendering, backtrack search, and the *Socrates chess program.
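The work/critical-path model in the abstract can be summarized in a few lines: with work T_1 and critical path T_inf, a work-stealing schedule runs in roughly T_1/P + T_inf time on P processors. The numbers below are illustrative, not measurements from the paper.

```python
def predicted_runtime(work, span, p):
    """Two-parameter Cilk-style model: T_P is roughly T_1/P + T_inf."""
    return work / p + span

def average_parallelism(work, span):
    """T_1 / T_inf: beyond this many processors, extra CPUs give little speedup."""
    return work / span

# Illustrative work and critical path in seconds.
work, span = 1000.0, 5.0
print("average parallelism:", average_parallelism(work, span))
for p in (1, 8, 64, 512):
    t_p = predicted_runtime(work, span, p)
    print(f"P={p:4d}  predicted time={t_p:8.2f}  speedup={work / t_p:6.1f}")
```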
Article
Full-text available
The true costs of high performance computing are currently dominated by software. Addressing these costs requires shifting to high productivity languages such as Matlab. MatlabMPI is a Matlab implementation of the Message Passing Interface (MPI) standard and allows any Matlab program to exploit multiple processors. MatlabMPI currently implements the basic six functions that are the core of the MPI point-to-point communications standard. The key technical innovation of MatlabMPI is that it implements the widely used MPI "look and feel" on top of standard Matlab file I/O, resulting in an extremely compact (~250 lines of code) and "pure" implementation which runs anywhere Matlab runs, and on any heterogeneous combination of computers. The performance has been tested on both shared and distributed memory parallel computers (e.g. Sun, SGI, HP, IBM and Linux). MatlabMPI can match the bandwidth of C based MPI at large message sizes. A test image filtering application using MatlabMPI achieved a speedup of ~300 using 304 CPUs and ~15% of the theoretical peak (450 Gigaflops) on an IBM SP2 at the Maui High Performance Computing Center. In addition, this entire parallel benchmark application was implemented in 70 software-lines-of-code (SLOC) yielding 0.85 Gigaflop/SLOC or 4.4 CPUs/SLOC, which are the highest values of these software price performance metrics ever achieved for any application. The MatlabMPI software will be available for download.
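The "MPI on top of file I/O" idea can be sketched in a few lines. The Python fragment below is not MatlabMPI itself, only the same pattern: the sender writes a data file and then an empty flag file, and the receiver polls for the flag before reading, so the data file is always complete when it is read. The directory name is illustrative.

```python
import os
import pickle
import time

def send(msg, dest, tag, comm_dir="/tmp/commdir"):
    """Write the message to a data file, then create a flag file to signal 'sent'."""
    os.makedirs(comm_dir, exist_ok=True)
    data = os.path.join(comm_dir, f"to_{dest}_tag_{tag}.pkl")
    with open(data, "wb") as f:
        pickle.dump(msg, f)
    open(data + ".lock", "w").close()   # flag appears only after the data is complete

def recv(dest, tag, comm_dir="/tmp/commdir", poll=0.05):
    """Spin until the flag file exists, then read and remove the data file."""
    data = os.path.join(comm_dir, f"to_{dest}_tag_{tag}.pkl")
    while not os.path.exists(data + ".lock"):
        time.sleep(poll)
    with open(data, "rb") as f:
        msg = pickle.load(f)
    os.remove(data)
    os.remove(data + ".lock")
    return msg

# Same-process demo; in practice sender and receiver are separate jobs on a shared file system.
send({"array": list(range(5))}, dest=1, tag=0)
print(recv(dest=1, tag=0))
```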
Article
Following the initial release of LAPACK and the emerging importance of distributed memory computing, work began on adapting LAPACK to distributed-memory architectures. Since porting software efficiently from one distributed-memory architecture to another is a challenging task, this work is an effort to establish standards for library development in the varied world of distributed-memory computing. ScaLAPACK is an acronym for Scalable Linear Algebra PACKage, or Scalable LAPACK. As in LAPACK, the ScaLAPACK routines are based on block-partitioned algorithms in order to minimize the frequency of data movement between different levels of the memory hierarchy. (For distributed-memory machines, the memory hierarchy includes the off-processor memory of other processors, in addition to the hierarchy of registers, cache, and local memory on each processor.) The fundamental building block of the ScaLAPACK library is a distributed-memory version of the Level 1, 2, and 3 BLAS, called the PBLAS (Parallel BLAS). The PBLAS are in turn built on the BLAS for computation on single nodes and on a set of Basic Linear Algebra Communication Subprograms (BLACS) for communication tasks that arise frequently in parallel linear algebra computations. For optimal performance, it is necessary, first, that the BLAS be implemented efficiently on the target machine, and second, that an efficient version of the BLACS be available. Versions of the BLACS exist for both MPI and PVM, as well as versions for the Intel series (NX), IBM SP series (MPL), and Thinking Machines CM-5 (CMMD). A vendor-optimized version of the BLACS is available for the Cray T3 series. Thus, ScaLAPACK is portable on any computer or network of computers that supports MPI or PVM (as well as the aforementioned native message-passing protocols). Most of the ScaLAPACK code is written in standard Fortran 77; the PBLAS and the BLACS are written in C, but with Fortran 77 interfaces. The first ScaLAPACK software was written in 1989–1990, and the appearance of the code has undergone many changes since then in our pursuit to resemble and enable code reuse from LAPACK. The first public release (version 1.0) of ScaLAPACK occurred on February 28, 1995, and subsequent releases occurred in 1996.
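ScaLAPACK's data layout is the two-dimensional block-cyclic distribution over a process grid. Below is a small sketch of the ownership and local-index arithmetic, assuming the first block lives on process (0, 0) (the library also allows a source-process offset); the sizes are illustrative.

```python
def owner(gi, gj, mb, nb, nprow, npcol):
    """Process-grid coordinates owning global entry (gi, gj) under a 2-D block-cyclic
    distribution with mb-by-nb blocks, first block on process (0, 0)."""
    return (gi // mb) % nprow, (gj // nb) % npcol

def local_row(gi, mb, nprow):
    """Row index of global row gi inside the owning process's local array."""
    return (gi // (mb * nprow)) * mb + gi % mb

# 1000 x 1000 matrix, 64 x 64 blocks, 2 x 3 process grid (illustrative sizes).
print(owner(130, 700, 64, 64, 2, 3))   # -> (0, 1)
print(local_row(130, 64, 2))           # -> 66
```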
Article
The thesis of this extended abstract is simple. High productivity comes from high level infrastructures. To measure this, we introduce a methodology that goes beyond the tradition of timing software in serial and tuned parallel modes. We perform a classroom productivity study involving 29 students who have written a homework exercise in a low level language (MPI message passing) and a high level language (Star-P with MATLAB client). Our conclusions indicate what perhaps should be of little surprise: 1) the high level language is always far easier on the students than the low level language. 2) The early versions of the high level language perform inadequately compared to the tuned low level language, but later versions substantially catch up. Asymptotically, the analogy must hold that message passing is to high level language parallel programming as assembler is to high level environments such as MATLAB, Mathematica, Maple, or even Python. We follow the Kepner method [6] that correctly realizes that traditional speedup numbers without some discussion of the human cost of reaching these numbers can fail to reflect the true human productivity cost of high performance computing. Traditional data compares low level message passing with serial computation. With the benefit of a high level language system in place, in our case Star-P running with MATLAB client, and with the benefit of a large data pool: 29 students, each running the same code ten times on three evolutions of the same platform, we can methodically demonstrate the productivity gains. To date we are not aware of any high level system as extensive and interoperable as Star-P, nor are we aware of an experiment of this kind performed with this volume of data.
Conference Paper
In this work we compare some of the freely available parallel Toolboxes for MATLAB, which differ in purpose and implementation details: while DP-Toolbox and MultiMATLAB offer a higher-level parallel environment, the goals of PVMTB and MPITB, developed by us [7], are to closely adhere to the PVM system and MPI standard, respectively. DP-Toolbox is also based on PVM, and MultiMATLAB on MPI. These Toolboxes allow the user to build a parallel application under the rapid-prototyping MATLAB environment. The differences between them are illustrated by means of a performance test and a simple case study frequently found in the literature. Thus, depending on the preferred message-passing software and the performance requirements of the application, the user can either choose a higher-level Toolbox and benefit from easier coding, or directly interface the message-passing routines and benefit from greater control and performance. Topics: Problem Solving Environments, Parallel and Distributed Computing, Cluster and Grid Computing.
Book
[Figure: data objects are ALIGNed to a user-declared Cartesian mesh of abstract processors, which is DISTRIBUTEd onto the physical processors by an implementation-dependent directive.] The underlying assumptions are that an operation on two or more data objects is likely to be carried out much faster if they all reside in the same processor, and that it may be possible to carry out many such operations concurrently if they can be performed on different processors.
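The two simplest HPF-style one-dimensional mappings, BLOCK and CYCLIC, can be written down directly; the array length and processor count below are illustrative.

```python
import math

def block_owner(i, n, p):
    """BLOCK distribution: contiguous chunks of ceil(n/p) elements per processor."""
    return i // math.ceil(n / p)

def cyclic_owner(i, p):
    """CYCLIC distribution: element i goes to processor i mod p."""
    return i % p

n, p = 10, 4
print([block_owner(i, n, p) for i in range(n)])   # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
print([cyclic_owner(i, p) for i in range(n)])     # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
```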
Article
The true costs of high performance computing are currently dominated by software. Addressing these costs requires shifting to high productivity languages such as Matlab. The development of MatlabMPI (www.ll.mit.edu/MatlabMPI) was an important first step that has brought parallel messaging capabilities to the Matlab environment, and is now widely used in the community. The ultimate goal is to move beyond basic messaging (and its inherent programming complexity) towards higher level parallel data structures and functions. The pMatlab Parallel Toolbox provides these capabilities, and allows any Matlab user to parallelize their program by simply changing a few characters in their program. The performance has been tested on both shared and distributed memory parallel computers (e.g. Sun, SGI, HP, IBM, Linux and MacOSX) on a variety of applications.
Article
This chapter provides an introduction to parallel and distributed systems and their benefits in performance, resource sharing, extendibility, reliability, and cost-effectiveness. It outlines parallel and distributed computing approaches and paradigms, and discusses the opportunities and challenges of high performance parallel and distributed computing. Finally, it presents a three-tiered distributed system design framework to highlight architectural issues, services and candidate technologies for implementing parallel/distributed computing systems and applications.
Article
In this paper, we present the main algorithmic features in the software package SuperLU_DIST, a distributed-memory sparse direct solver for large sets of linear equations. We give in detail our parallelization strategies, with focus on scalability issues, and demonstrate the parallel performance and scalability on current machines. The solver is based on sparse Gaussian elimination, with an innovative static pivoting strategy proposed earlier by the authors. The main advantage of static pivoting over classical partial pivoting is that it permits a priori determination of data structures and communication pattern for sparse Gaussian elimination, which makes it more scalable on distributed memory machines. Based on this a priori knowledge, we designed highly parallel and scalable algorithms for both LU decomposition and triangular solve and we show that they are suitable for large-scale distributed memory machines.
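SuperLU_DIST itself is a C/MPI library, but its serial sibling SuperLU backs SciPy's sparse LU routine, which is enough to show the factor-once, solve-many interface of a sparse direct solver; the test matrix is illustrative.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

# Small sparse test system (tridiagonal, 1-D Poisson-like).
n = 5
A = csc_matrix(np.diag([2.0] * n)
               + np.diag([-1.0] * (n - 1), 1)
               + np.diag([-1.0] * (n - 1), -1))

lu = splu(A)               # sparse LU factorization (serial SuperLU under the hood)
b = np.ones(n)
x = lu.solve(b)            # triangular solves reuse the factorization
print(x)
print(np.allclose(A @ x, b))
```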
Article
ARPACK is a package of Fortran 77 subroutines which implement the Implicitly Restarted Arnoldi Method used for solving large sparse eigenvalue problems. A parallel implementation of ARPACK is presented which is portable across a wide range of distributed memory platforms and requires minimal changes to the serial code. The communication layers used for message passing are the Basic Linear Algebra Communication Subprograms (BLACS) developed for the ScaLAPACK project and the Message Passing Interface (MPI). One objective for the development and maintenance of a parallel version of the ARPACK [3] package was to construct a parallelization strategy whose implementation required as few changes as possible to the current serial version. The basis for this requirement was not only to maintain a level of numerical and algorithmic consistency between the parallel and serial implementations, but also to investigate the possibility of maintaining the parallel and serial libraries ...
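SciPy wraps the serial ARPACK routines, so the calling pattern of the implicitly restarted Arnoldi/Lanczos method can be shown without the parallel BLACS/MPI layer; the matrix below is a small illustrative example.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh

# Symmetric tridiagonal test matrix (discrete 1-D Laplacian).
n = 200
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = diags([off, main, off], [-1, 0, 1])

# Six largest-magnitude eigenvalues via ARPACK's implicitly restarted Lanczos.
vals, vecs = eigsh(A, k=6, which="LM")
print(np.sort(vals))
```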
Article
this paper focuses on empirical measurements of execution time on the CM5 for our *Socrates chess application. Figure 1 shows the outcome of many experiments of running *Socrates on a variety of chess positions. The figure plots the speedup T_1/T_P for each run against the machine size P for that run. In order to compare the outcomes for runs with different parameters, we have normalized the data by dividing the plotted values by the average parallelism T_1/T_inf. Thus, the horizontal position of each datum is P/(T_1/T_inf), and the vertical position of each datum is (T_1/T_P)/(T_1/T_inf).
J. Kepner and S. Ahalt, "MatlabMPI," J. Parallel and Distributed Computing, vol. 64, no. 8, 2004, pp. 997-1005.
A. Edelman et al., "Interactive Supercomputing's Star-P platform: Parallel MATLAB and MPI Homework Classroom Study on High Level Language Productivity," to appear in Proc. 10th High Performance Embedded Computing Workshop (HPEC 2006), MIT Lincoln Lab., 2006.
M.A. Heroux and J.M. Willenbring, Trilinos Users Guide, tech. report SAND2003-2952, Sandia Nat'l Labs., 2003; http://software.sandia.gov/ trilinos/TrilinosUserGuide.pdf.
N. Stefansson, K. Luetkemeyer, and R. Comer, "Accelerating a Geospatial Application Using MATLAB Distributed Computing Tools," The MathWorks News & Notes, Jan. 2006; www.mathworks.com/company/
S. Balay et al., PETSc Users Manual, tech. report ANL-95/11-revision 2.1.5, Argonne Nat'l Lab., 2004; www.mcs.anl.gov/petsc/petsc-as/snap shots/petsc-current/docs/manual.pdf.
J. Fernandez et al., "Performance of Message Passing MATLAB Toolboxes," 5th Int'l Conf. High-Performance Computing for Computational Science, LNCS vol. 2565, Springer-Verlag, 2003, pp. 228-241.