Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation

September 2004
Lecture Notes in Computer Science 3241:97-104

DOI:10.1007/978-3-540-30218-6_19

Conference: Recent Advances in Parallel Virtual Machine and Message Passing Interface, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 19-22, 2004, Proceedings

Edgar Gabriel¹, Graham E. Fagg¹, George Bosilca¹, Thara Angskun¹, Jack J. Dongarra¹, Jeffrey M. Squyres², Vishal Sahay², Prabhanjan Kambadur², Brian Barrett², Andrew Lumsdaine², Ralph H. Castain³, David J. Daniel³, Richard L. Graham³, Timothy S. Woodall³

¹ Innovative Computing Laboratory, University of Tennessee, {egabriel, fagg, bosilca, anskun, dongarra}@cs.utk.edu
² Open System Laboratory, Indiana University, {jsquyres, vsahay, pkambadu, brbarret, lums}@osl.iu.edu
³ Advanced Computing Laboratory, Los Alamos National Lab, {rhc, ddd, rlgraham, twoodall}@lanl.gov

Abstract. A large number of MPI implementations are currently available, each of which emphasizes different aspects of high-performance computing or is intended to solve a specific research problem. The result is a myriad of incompatible MPI implementations, all of which require separate installation, and the combination of which presents significant logistical challenges for end users. Building upon prior research, and influenced by experience gained from the code bases of the LAM/MPI, LA-MPI, and FT-MPI projects, Open MPI is an all-new, production-quality MPI-2 implementation that is fundamentally centered around component concepts. Open MPI provides a unique combination of novel features previously unavailable in an open-source, production-quality implementation of MPI. Its component architecture both provides a stable platform for third-party research and enables the run-time composition of independent software add-ons. This paper presents a high-level overview of the goals, design, and implementation of Open MPI.

1 Introduction

The evolution of parallel computer architectures has recently created new trends and challenges for both parallel application developers and end users. Systems comprising tens of thousands of processors are available today; hundred-thousand-processor systems are expected within the next few years. Monolithic high-performance computers are steadily being replaced by clusters of PCs and workstations because of their more attractive price/performance ratio. However, such clusters provide a less integrated environment and therefore have different (and often inferior) I/O behavior than the previous architectures. Grid and metacomputing efforts yield a further increase in the number of processors available to parallel applications, as well as an increase in the physical distances between computational elements.

These trends lead to new challenges for MPI implementations. An MPI application utilizing thousands of processors faces many scalability issues that can dramatically impact its overall performance. Such issues include (but are not limited to): process control, resource exhaustion, latency awareness and management, fault tolerance, and optimized collective operations for common communication patterns.

Network-layer transmission errors, which have been considered highly improbable for moderate-sized clusters, cannot be ignored when dealing with large-scale computations [4]. Additionally, the probability that a parallel application will encounter a process failure during its run increases with the number of processors that it uses. If the application is to survive a process failure without having to restart from the beginning, either it must regularly write checkpoint files (and restart from the last consistent checkpoint [1, 8]) or the application itself must be able to adaptively handle process failures at run-time [3]. All of these issues are current, relevant research topics. Indeed, some have been addressed at various levels by different projects. However, no MPI implementation is currently capable of addressing all of them comprehensively.

This directly implies that a new MPI implementation is necessary: one that is capable of providing a framework to address important issues in emerging networks and architectures. Building upon prior research, and influenced by experience gained from the code bases of the LAM/MPI [9], LA-MPI [4], and FT-MPI [3] projects, Open MPI is an all-new, production-quality MPI-2 implementation. Open MPI provides a unique combination of novel features previously unavailable in an open-source, production-quality implementation of MPI. Its component architecture both provides a stable platform for cutting-edge third-party research and enables the run-time composition of independent software add-ons.

1.1 Goals of Open MPI

While all participating institutions have significant experience in implementing MPI, Open MPI represents more than a simple merger of LAM/MPI, LA-MPI, and FT-MPI. Although influenced by previous code bases, Open MPI is an all-new implementation of the Message Passing Interface. Focusing on production-quality performance, the software implements the full MPI-1.2 [6] and MPI-2 [7] specifications and fully supports concurrent, multi-threaded applications (i.e., MPI_THREAD_MULTIPLE).

Fig. 1. Three main functional areas of Open MPI: the MCA, its component frameworks, and the modules in each framework. (Diagram labels: MCA, meta framework; component frameworks; Module A, Module B, ..., Module N.)

To efficiently support a wide range of parallel machines, high-performance “drivers” for all established interconnects are currently being developed. These include TCP/IP, shared memory, Myrinet, Quadrics, and Infiniband. Support for more devices will likely be added based on user, market, and research requirements. For network transmission errors, Open MPI provides optional features for checking data integrity. By utilizing message fragmentation and striping over multiple (potentially heterogeneous) network devices, Open MPI is capable of both maximizing the achievable bandwidth to applications and providing the ability to dynamically handle the loss of network devices when nodes are equipped with multiple network interfaces. Thus, the handling of network failovers is completely transparent to the application.
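As a minimal sketch of fragmentation and striping with failover, the following Python model assigns fragments to devices round-robin. The round-robin policy, the function names `stripe` and `reassemble`, and the device labels are assumptions made for illustration; the paper does not describe Open MPI's actual scheduling policy.

```python
def stripe(message, devices, frag_size):
    """Split a message into fragments and assign them round-robin to devices.

    Returns a list of (device, offset, fragment) triples. This is a
    hypothetical model, not Open MPI's scheduling code.
    """
    if not devices:
        raise RuntimeError("no network device left to carry the message")
    plan = []
    for i, offset in enumerate(range(0, len(message), frag_size)):
        fragment = message[offset:offset + frag_size]
        plan.append((devices[i % len(devices)], offset, fragment))
    return plan


def reassemble(plan):
    """Rebuild the message from its fragments, whatever order they arrive in."""
    return b"".join(frag for _, _, frag in sorted(plan, key=lambda p: p[1]))


msg = bytes(range(10))
both = stripe(msg, ["tcp0", "ib0"], frag_size=4)
# If "ib0" is lost, the same message is striped over the surviving device only.
degraded = stripe(msg, ["tcp0"], frag_size=4)
```

Because the caller only ever sees the reassembled message, the loss of one device changes the fragment plan but not the result, which is the sense in which failover is transparent to the application.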

The run-time environment of Open MPI will provide basic services to start and manage parallel applications in interactive and non-interactive environments. Where possible, existing run-time environments will be leveraged to provide the necessary services; a portable run-time environment based on user-level daemons will be used where such services are not already available.

2 The Architecture of Open MPI

The Open MPI design is centered around the MPI Component Architecture (MCA). While component programming is widely used in industry, it is only recently gaining acceptance in the high-performance computing community [2, 9].

As shown in Fig. 1, Open MPI comprises three main functional areas:

– MCA: The backbone component architecture that provides management services for all other layers;
– Component frameworks: Each major functional area in Open MPI has a corresponding back-end component framework, which manages modules;
– Modules: Self-contained software units that export well-defined interfaces and that can be deployed and composed with other modules at run-time.

The MCA manages the component frameworks and provides services to them, such as the ability to accept run-time parameters from higher-level abstractions (e.g., mpirun) and pass them down through the component framework to individual modules. The MCA also finds components at build-time and invokes their corresponding hooks for configuration, building, and installation.
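The parameter-passing service can be pictured with a small Python model. This is only a sketch: the class `ParamRegistry`, the flat `<framework>_<module>_<name>` key convention, and the parameter names are hypothetical and are not taken from the paper.

```python
class ParamRegistry:
    """Toy stand-in for the MCA's run-time parameter service (hypothetical API)."""

    def __init__(self, raw):
        # raw: flat {"<framework>_<module>_<name>": value} settings, e.g. as
        # collected by a launcher from its command line.
        self._raw = dict(raw)

    def lookup(self, framework, module, name, default=None):
        """Return the value addressed to one module's parameter, or the default."""
        return self._raw.get(f"{framework}_{module}_{name}", default)


registry = ParamRegistry({"coll_basic_crossover": "4", "ptl_tcp_frag_size": "65536"})
# A module asks only for its own parameters and supplies its own default.
frag_size = int(registry.lookup("ptl", "tcp", "frag_size", default="32768"))
```

The point of the design is that the launcher never needs to know which modules exist; each module pulls the values addressed to it.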

Each component framework is dedicated to a single task, such as providing parallel job control or performing MPI collective operations. Upon demand, a framework will discover, load, use, and unload modules. Each framework has different policies and usage scenarios; some will only use one module at a time while others will use all available modules simultaneously.

Modules are self-contained software units that can configure, build, and install themselves. Modules adhere to the interface prescribed by the component framework that they belong to, and provide requested services to higher-level tiers and other parts of MPI.

The following is a partial list of component frameworks in Open MPI (MPI functionality is described; run-time environment support components are not covered in this paper):

– Point-to-point Transport Layer (PTL): a PTL module corresponds to a particular network protocol and device. Mainly responsible for the “wire protocols” of moving bytes between MPI processes, PTL modules have no knowledge of MPI semantics. Multiple PTL modules can be used in a single process, allowing the use of multiple (potentially heterogeneous) networks. PTL modules supporting TCP/IP, shared memory, Quadrics elan4, Infiniband, and Myrinet will be available in the first Open MPI release.
– Point-to-point Management Layer (PML): the primary function of the PML is to provide message fragmentation, scheduling, and re-assembly services between the MPI layer and all available PTL modules. More details on the PML and the PTL modules can be found in [11].
– Collective Communication (COLL): the back-end of MPI collective operations, supporting both intra- and intercommunicator functionality. Two collective modules are planned at the current stage: a basic module implementing linear and logarithmic algorithms and a module using hierarchical algorithms similar to the ones used in the MagPIe project [5].
– Process Topology (TOPO): Cartesian and graph mapping functionality for intracommunicators. Cluster-based and Grid-based computing may benefit from topology-aware communicators, allowing the MPI implementation to optimize communications based on locality.
– Reduction Operations: the back-end functions for MPI’s intrinsic reduction operations (e.g., MPI_SUM). Modules can exploit specialized instruction sets for optimized performance on target platforms.
– Parallel I/O: I/O modules implement parallel file and device access. Many MPI implementations use ROMIO [10], but other packages may be adapted for native use (e.g., cluster- and parallel-based filesystems).
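To make the Cartesian mapping mentioned for the TOPO framework concrete, the sketch below converts between a rank and its grid coordinates using the row-major ordering that the MPI standard prescribes for Cartesian topologies. The function names are illustrative and this is not code from any TOPO module.

```python
def cart_coords(rank, dims):
    """Row-major rank -> coordinates in a grid with the given extents."""
    coords = []
    for extent in reversed(dims):
        coords.append(rank % extent)   # the last dimension varies fastest
        rank //= extent
    return tuple(reversed(coords))


def cart_rank(coords, dims):
    """Inverse mapping: coordinates -> row-major rank."""
    rank = 0
    for c, extent in zip(coords, dims):
        rank = rank * extent + c
    return rank
```

A topology-aware module could use such a mapping to decide which ranks are neighbors before choosing how to route traffic between them.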

The wide variety of framework types allows third-party developers to use Open MPI as a research platform, a deployment vehicle for commercial products, or even a comparison mechanism for different algorithms and techniques.

The component architecture in Open MPI offers several advantages for end users and library developers. First, it enables the use of multiple components within a single MPI process. For example, a process can use several network device drivers (PTL modules) simultaneously. Second, it provides a convenient way to use third-party software, supporting both source code and binary distributions. Third, it provides a fine-grained, run-time, user-controlled component selection mechanism.

2.1 Module Lifecycle

Although every framework is different, the COLL framework provides an illustrative example of the usage and lifecycle of a module in an MPI process:

1. During MPI_INIT, the COLL framework finds all available modules. Modules may have been statically linked into the MPI library or be shared library modules located in well-known locations.
2. All COLL modules are queried to see if they want to run in the process. Modules may choose not to run; for example, an Infiniband-based module may choose not to run if there are no Infiniband NICs available. A list is made of all modules that choose to run: the list of “available” modules.
3. As each communicator is created (including MPI_COMM_WORLD and MPI_COMM_SELF), each available module is queried to see if it wants to be used on the new communicator. Modules may decline to be used; e.g., a shared memory module will only allow itself to be used if all processes in the communicator are on the same physical node. The highest-priority module that accepted is selected to be used for that communicator.
4. Once a module has been selected, it is initialized. The module typically allocates any resources and potentially pre-computes information that will be used when collective operations are invoked.
5. When an MPI collective function is invoked on that communicator, the module’s corresponding back-end function is invoked to perform the operation.
6. The final phase in the COLL module’s lifecycle occurs when that communicator is destroyed. This typically entails freeing resources and any pre-computed information associated with the communicator being destroyed.
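Steps 1 to 3 of this lifecycle can be sketched in Python as follows. The class `CollModule`, the numeric priorities, and the module names are hypothetical; the sketch models only the query and selection logic, not initialization or teardown.

```python
class CollModule:
    """Hypothetical COLL module; names and priorities are illustrative."""

    def __init__(self, name, priority, wants_process, accepts_comm):
        self.name = name
        self.priority = priority
        self.wants_process = wants_process   # step 2: () -> bool
        self.accepts_comm = accepts_comm     # step 3: (comm) -> bool


def open_framework(found):
    """Steps 1-2: keep only the modules that agree to run in this process."""
    return [m for m in found if m.wants_process()]


def select_for_comm(available, comm):
    """Step 3: pick the highest-priority module that accepts this communicator."""
    willing = [m for m in available if m.accepts_comm(comm)]
    if not willing:
        raise RuntimeError("no COLL module accepted the communicator")
    return max(willing, key=lambda m: m.priority)


def same_node(comm):
    # A communicator is modelled as the list of node names hosting its processes.
    return len(set(comm)) == 1


found = [
    CollModule("basic", 10, lambda: True, lambda comm: True),
    CollModule("shmem", 50, lambda: True, same_node),
    CollModule("ib", 90, lambda: False, lambda comm: True),  # no NIC present
]
available = open_framework(found)
```

With these inputs, a communicator whose processes all share one node selects the shared memory module, while one spanning two nodes falls back to the basic module.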

3 Implementation Details

Two aspects of Open MPI’s design are discussed: its object-oriented approach and the mechanisms for module management.

3.1 Object-Oriented Approach

Open MPI is implemented using a simple C-language object-oriented system with single inheritance and reference-counting-based memory management using a retain/release model. An “object” consists of a structure and a singly-instantiated “class” descriptor. The first element of the structure must be a pointer to the parent class’s structure.

Macros are used to effect C++-like semantics (e.g., new, construct, destruct, delete). Experience with various C++-based software projects, and the associated compilation problems on some platforms, encouraged us to take this approach instead of using C++ directly.

Upon construction, an object’s reference count is set to one. When the object is retained, its reference count is incremented; when it is released, its reference count is decremented. When the reference count reaches zero, the class’s destructor (and those of its parent classes) are run and the memory is freed.
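The retain/release rules can be modelled in a few lines of Python. This is a behavioral sketch only: Open MPI's system is built from C structures and macros, and the class names and the `log` argument below are invented for illustration.

```python
class Object:
    """Model of a retain/release object (illustrative, not Open MPI's C macros)."""

    def __init__(self, log):
        self.refcount = 1            # construction sets the count to one
        self._log = log

    def destruct(self):
        self._log.append("Object")

    def retain(self):
        self.refcount += 1
        return self

    def release(self):
        if self.refcount <= 0:
            raise ValueError("object already destroyed")
        self.refcount -= 1
        if self.refcount == 0:
            # Run destructors from the most derived class up to the root.
            for cls in type(self).__mro__:
                if "destruct" in vars(cls):
                    vars(cls)["destruct"](self)
            # In C the object's memory would be freed at this point.


class Communicator(Object):
    def destruct(self):
        self._log.append("Communicator")
```

Each `retain` must be balanced by a `release`; the destructor chain runs exactly once, when the last holder lets go.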

3.2 Module Discovery and Management

Open MPI offers three different mechanisms for adding a module to the MPI library (and therefore to user applications):

– During the configuration of Open MPI, a script traverses the build tree and generates a list of modules found. These modules will be configured, compiled, and linked statically into the MPI library.
– Similarly, modules discovered during configuration can also be compiled as shared libraries that are installed and then re-discovered at run-time.
– Third-party library developers who do not want to provide the source code of their modules can configure and compile their modules independently of Open MPI and distribute the resulting shared library in binary form. Users can install this module into the appropriate directory where Open MPI can discover it at run-time.

At run-time, Open MPI first “discovers” all modules that were statically linked into the MPI library. It then searches several directories (e.g., $HOME/ompi/, ${INSTALLDIR}/lib/ompi/, etc.) to find available modules, and sorts them by framework type. To simplify run-time discovery, shared library modules have a specific file naming scheme indicating both their MCA component framework type and their module name.
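A naming scheme of this kind lets discovery sort modules by framework without opening the files. The sketch below assumes a pattern of the form `mca_<framework>_<module>.so`; the paper does not spell out the scheme, so the pattern and the file names are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed pattern for illustration: mca_<framework>_<module>.so
_NAME = re.compile(r"^mca_([a-z0-9]+)_([a-z0-9_]+)\.so$")


def sort_by_framework(filenames):
    """Group shared-library module names by framework type."""
    found = defaultdict(list)
    for filename in filenames:
        match = _NAME.match(filename)
        if match:                    # ignore files that do not follow the scheme
            found[match.group(1)].append(match.group(2))
    return {framework: sorted(mods) for framework, mods in found.items()}


listing = ["mca_ptl_tcp.so", "mca_ptl_sm.so", "mca_coll_basic.so", "README"]
by_framework = sort_by_framework(listing)
```

Here `listing` stands in for the contents of one searched directory; files that do not match the pattern are simply skipped.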

Modules are identified by their name and version number. This enables the MCA to manage different versions of the same component, ensuring that the modules used in one MPI process are the same, in both name and version number, as the modules used in a peer MPI process. Given this flexibility, Open MPI provides multiple mechanisms both to choose a given module and to pass run-time parameters to modules: command line arguments to mpirun, environment variables, text files, and MPI attributes (e.g., on communicators).
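One way to picture the name-and-version consistency check between peers is the following sketch. The dictionary layout, the version strings, and the function name are hypothetical; how Open MPI actually exchanges and compares this information is not described here.

```python
def mismatched_modules(local, peer):
    """Return the modules whose presence or version differs between two processes.

    local and peer map (framework, module) -> version string.
    """
    keys = set(local) | set(peer)
    return sorted(k for k in keys if local.get(k) != peer.get(k))


local = {("ptl", "tcp"): "1.0", ("coll", "basic"): "1.0"}
peer = {("ptl", "tcp"): "1.1", ("coll", "basic"): "1.0"}
conflicts = mismatched_modules(local, peer)
```

An empty result means both processes agree on every module in name and version; any entry in `conflicts` identifies a module that should not be used between this pair.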

4 Performance Results

A performance comparison of Open MPI’s point-to-point methodology to other public MPI libraries can be found in [11]. As a sample of Open MPI’s performance in this paper, a snapshot of the development code was used to run the Pallas benchmarks (v2.2.1) for MPI_Bcast and MPI_Alltoall. The algorithms used for these functions in Open MPI’s basic COLL module were derived from their corresponding implementations in LAM/MPI v6.5.9, a monolithic MPI implementation (i.e., not based on components). The collective operations are based on standard linear/logarithmic algorithms using MPI’s point-to-point message passing for data movement. Although Open MPI’s code is not yet complete, measuring its performance against the same algorithms in a monolithic architecture provides a basic comparison to ensure that the design and implementation are sound.
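The difference between a linear and a logarithmic broadcast can be illustrated by listing who sends to whom in each round. The sketch below uses a textbook binomial tree for the logarithmic case; it is not the LAM/MPI or Open MPI code, and the function names are invented for illustration.

```python
def linear_bcast(n, root=0):
    """Linear broadcast: the root sends to every other rank, one send per round."""
    return [[(root, r)] for r in range(n) if r != root]


def binomial_bcast(n, root=0):
    """Per-round (sender, receiver) pairs for a binomial-tree broadcast.

    In each round, every rank that already holds the data sends it to the
    rank `have` positions further on (relative to the root), so the number
    of holders doubles per round.
    """
    rounds, have = [], 1
    while have < n:
        sends = [((root + v) % n, (root + v + have) % n)
                 for v in range(have) if v + have < n]
        rounds.append(sends)
        have *= 2
    return rounds
```

For 8 processes the linear schedule needs 7 rounds and the binomial one needs 3, which is why logarithmic algorithms matter most for latency-bound short messages.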

The performance measurements were executed on a cluster of 2.4 GHz dual-processor Intel Xeon machines connected via Fast Ethernet. The results shown in Fig. 2 indicate that, for large message sizes, the performance of the collective operations using the Open MPI approach is identical to that of its LAM/MPI counterpart. For short messages, there is currently a slight overhead for Open MPI compared to LAM/MPI. This is due to point-to-point latency optimizations in LAM/MPI not yet included in Open MPI; these optimizations will be included in the release of Open MPI. The graph shows, however, that the design and overall approach are sound and simply need optimization.

Fig. 2. Performance comparison for MPI_BCAST and MPI_ALLTOALL operations in Open MPI and in LAM/MPI v6.5.9. (Plot with logarithmic axes: minimum execution time, 100 to 100000, versus message length in bytes, 1 to 1e+06. Series: LAM/MPI 6.5.9 BCAST, Open MPI BCAST, LAM/MPI 6.5.9 ALLTOALL, Open MPI ALLTOALL.)

5 Summary

Open MPI is a new implementation of the MPI standard. It provides functionality that has not previously been available in any single, production-quality MPI implementation, including support for all of MPI-2, multiple concurrent user threads, and multiple options for handling process and network failures. The Open MPI group is furthermore working on establishing a proper legal framework that enables third-party developers to contribute source code to the project.

The first full release of Open MPI is planned for the 2004 Supercomputing Conference. An initial beta release supporting most of the described functionality and an initial subset of network device drivers (tcp, shmem, and a loopback device) is planned for release mid-2004. See http://www.open-mpi.org/.

Acknowledgments

This work was supported by a grant from the Lilly Endowment, National Science Foundation grants 0116050, EIA-0202048, EIA-9972889, and ANI-0330620, and Department of Energy Contract DE-FG02-02ER25536. Los Alamos National Laboratory is operated by the University of California for the National Nuclear Security Administration of the United States Department of Energy under contract W-7405-ENG-36. Project support was provided through ASCI/PSE and the Los Alamos Computer Science Institute, and the Center for Information Technology Research (CITR) of the University of Tennessee.

References

1. G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov. MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In SC’2002 Conference CD, Baltimore, MD, 2002. IEEE/ACM SIGARCH. pap298, LRI.
2. D. E. Bernholdt et al. A component architecture for high-performance scientific computing. Intl. J. High-Performance Computing Applications, 2004.
3. G. E. Fagg, E. Gabriel, Z. Chen, T. Angskun, G. Bosilca, A. Bukovski, and J. J. Dongarra. Fault tolerant communication library and applications for high performance. In Los Alamos Computer Science Institute Symposium, Santa Fe, NM, October 27-29, 2003.
4. R. L. Graham, S.-E. Choi, D. J. Daniel, N. N. Desai, R. G. Minnich, C. E. Rasmussen, L. D. Risinger, and M. W. Sukalski. A network-failure-tolerant message-passing system for terascale clusters. International Journal of Parallel Programming, 31(4):285–303, August 2003.
5. T. Kielmann, R. F. H. Hofman, H. E. Bal, A. Plaat, and R. A. F. Bhoedjang. MagPIe: MPI’s collective communication operations for clustered wide area systems. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’99), 34(8):131–140, May 1999.
6. Message Passing Interface Forum. MPI: A Message Passing Interface Standard, June 1995. http://www.mpi-forum.org.
7. Message Passing Interface Forum. MPI-2: Extensions to the Message Passing Interface, July 1997. http://www.mpi-forum.org.
8. S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman. The LAM/MPI checkpoint/restart framework: System-initiated checkpointing. International Journal of High Performance Computing Applications, to appear, 2004.
9. J. M. Squyres and A. Lumsdaine. A Component Architecture for LAM/MPI. In Proceedings, 10th European PVM/MPI Users’ Group Meeting, number 2840 in Lecture Notes in Computer Science, Venice, Italy, Sept. 2003. Springer.
10. R. Thakur, W. Gropp, and E. Lusk. Data sieving and collective I/O in ROMIO. In Proceedings of the 7th Symposium on the Frontiers of Massively Parallel Computation, pages 182–189. IEEE Computer Society Press, Feb. 1999.
11. T. S. Woodall, R. L. Graham, R. H. Castain, D. J. Daniel, M. W. Sukalski, G. E. Fagg, E. Gabriel, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, and A. Lumsdaine. TEG: A high-performance, scalable, multi-network point-to-point communications methodology. In Proceedings, 11th European PVM/MPI Users’ Group Meeting, Budapest, Hungary, September 2004.

Configurable Algorithms for All-to-All Collectives

Conference Paper

May 2024

MPI/OpenMP-Based Parallel Solver for Imprint Forming Simulation

Article

Full-text available

Apr 2024
Comput Model Eng Sci

In this research, we present the pure open multi-processing (OpenMP), pure message passing interface (MPI), and hybrid MPI/OpenMP parallel solvers within the dynamic explicit central difference algorithm for the coining process to address the challenge of capturing fine relief features of approximately 50 microns. Achieving such precision demands the utilization of at least 7 million tetrahedron elements, surpassing the capabilities of traditional serial programs previously developed. To mitigate data races when calculating internal forces, intermediate arrays are introduced within the OpenMP directive. This helps ensure proper synchronization and avoid conflicts during parallel execution. Additionally, in the MPI implementation, the coins are partitioned into the desired number of regions. This division allows for efficient distribution of computational tasks across multiple processes. Numerical simulation examples are conducted to compare the three solvers with serial programs, evaluating correctness, acceleration ratio, and parallel efficiency. The results reveal a relative error of approximately 0.3% in forming force among the parallel and serial solvers, while the predicted insufficient material zones align with experimental observations. Additionally, speedup ratio and parallel efficiency are assessed for the coining process simulation. The pure MPI parallel solver achieves a maximum acceleration of 9.5 on a single computer (utilizing 12 cores) and the hybrid solver exhibits a speedup ratio of 136 in a cluster (using 6 compute nodes and 12 cores per compute node), showing the strong scalability of the hybrid MPI/OpenMP programming model. This approach effectively meets the simulation requirements for commemorative coins with intricate relief patterns.

Toward Realistic Modelling, Imaging and Inversion Testing

Technical Report

Full-text available

Nov 2023

Research in academia often suffers from a limitation in the number of data sets employed for testing, resulting in a lack of feedback diversity that is crucial for comprehensive analysis. Applied research necessitates engagement with a broad spectrum of datasets, which significantly enriches research and development projects. Obtaining authentic datasets for publication in academia is not only challenging but also involves time-consuming preprocessing, making the pursuit of testing diversity a formidable task. Consequently , numerous tests are conducted on modelled data, often generated using similar algorithms employed in inversion processes, thereby giving rise to the "inverse crime sce-nario". Predominance of synthetic data testing in academia also comes as a consequence of the substantial difference in computational resources with industrial environments. Software developed in academia often lacks the capability in handling intensive computations with large seismic files with irregular acquisitions, capabilities that are required to work with real data sets used in industry. The consequence is a large gap between toy examples used in academia and realistic examples required for industrial use. This report details the implementation advancements made in our seismic libraries, showcasing tests aimed at enhancing the reliability of results in diverse environments, including large models, salt environments, topography settings, and physical models. Furthermore , we elucidate the disparities between inverse crime scenarios and realistic situations. The immediate ramifications of these advancements include the ability to circumvent the inverse crime problem and conduct tests in a variety of environments. Moreover, we anticipate that this research will foster increased collaboration with industry and deepen our understanding of the practical capabilities of novel techniques developed at CREWES.

Optimizing Application Performance with BlueField: Accelerating Large-Message Blocking and Nonblocking Collective Operations

Conference Paper

May 2024

TCCL: Discovering Better Communication Paths for PCIe GPU Clusters

Conference Paper

Apr 2024

Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM

Conference Paper

Apr 2024

PDAssess: A Privacy-preserving Free-speech based Parkinson's Disease Daily Assessment System

Conference Paper

Apr 2024

A high-performance computing applied to composition reservoir simulation using distributed memory and 3D hybrid unstructured grids

Article

Apr 2024
J BRAZ SOC MECH SCI

Reservoir models have become more complex over time, and accurate models also require refined grids for obtaining accurate results. The memory required for the simulation of reservoir models with refined grids containing dozens of millions of grid blocks is a hard task to be accomplished in small workstations. Parallel processing in reservoir simulation has gained repute over the past decades as a solution to tackle this challenging problem. This work shows how the distributed memory parallelization is applied to an in-house compositional simulator called UTCOMP in conjunction with an implicit pressure explicit composition approach using unstructured grids and the element-based finite volume method. To achieve high-performance computing, it was developed an in-house library named automatic distributed mesh database that employs open-source libraries, like Zoltan and ParMETIS, to manage the distributed grid information and Petsc as the parallel solver. The simulator can handle four different 3D element types: hexahedrons, tetrahedrons, prisms, and pyramids. Several case scenarios with grids ranging from 200 thousand to 26 million nodes using up to 512 processes were successfully simulated. Results showed excellent speedup, with values very close to the ideal speedup for most simulated cases.

PaReDiSo: A reaction-diffusion solver coupled with OpenMPI and CVODE

Article

Apr 2024
COMPUT PHYS COMMUN

Nested Dissection Based Parallel Transient Power Grid Analysis on Public Cloud Virtual Machines

Conference Paper

Jan 2024

A network-failure-tolerant message-passing system for terascale clusters

Conference Paper

Full-text available

Jan 2002

MagPIe

Article

Full-text available

Aug 1999

Writing parallel applications for computational grids is a challenging task. To achieve good performance, algorithms designed for local area networks must be adapted to the differences in link speeds. An important class of algorithms are collective operations, such as broadcast and reduce. We have developed M AG PI E , a library of collective communication operations optimized for wide area systems. M AG PI E 's algorithms send the minimal amount of data over the slow wide area links, and only incur a single wide area latency. Using our system, existing MPI applications can be run unmodified on geographically distributed systems. On moderate cluster sizes, using a wide area latency of 10 milliseconds and a bandwidth of 1 MByte/s, M AG PI E executes operations up to 10 times faster than MPICH, a widely used MPI implementation; application kernels improve by up to a factor of 4. Due to the structure of our algorithms, M AG PI E 's advantage increases for higher wide area latencies.

A Network-Failure-Tolerant Message-Passing System for Terascale Clusters

Article

Full-text available

Aug 2003

The Los Alamos Message Passing Interface (LA-MPI) is an end-to-end network-failure-tolerant message-passing system designed for terascale clusters. LA-MPI is a standard-compliant implementation of MPI designed to tolerate network-related failures including I/O bus errors, network card errors, and wire-transmission errors. This paper details the distinguishing features of LA-MPI, including support for concurrent use of multiple types of network interface, and reliable message transmission utilizing multiple network paths and routes between a given source and destination. In addition, performance measurements on production-grade platforms are presented.

TEG: A High-Performance, Scalable, Multi-network Point-to-Point Communications Methodology

Conference Paper

Full-text available

Sep 2004
Lect Notes Comput Sci

TEG is a new component-based methodology for point-to-point mes- saging. Developed as part of the Open MPI project, TEG provides a configurable fault-tolerant capability for high-performance messaging that utilizes multi-network interfaces where available. Initial performance comparisons with other MPI im- plementations show comparable ping-pong latencies, but with bandwidths up to 30% higher.

Lecture Notes in Computer Science

Conference Paper

Jan 2003

To better manage the ever-increasing complexity of LAM/MPI, we have created a lightweight component architecture for it that is specifically designed for high-performance message passing. This paper describes the basic design of the component architecture, as well as some of the particular component instances that constitute the latest release of LAM/MPI. Performance comparisons against the previous, monolithic version of LAM/MPI show no performance impact due to the new architecture; in fact, the newest version is slightly faster. The modular and extensible nature of this implementation is intended to make it significantly easier to add new functionality and to conduct new research using LAM/MPI as a development platform.

The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing

Article

Jan 2005
INT J HIGH PERFORM C

As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI.

MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes

Conference Paper

Dec 2002

Global Computing platforms, large scale clusters and future TeraGRID systems gather thousands of nodes for computing parallel scientific applications. At this scale, node failures or disconnections are frequent events. This volatility reduces the MTBF of the whole system to the range of hours or minutes. We present MPICH-V, an automatic volatility-tolerant MPI environment based on uncoordinated checkpoint/roll-back and distributed message logging. The MPICH-V architecture relies on Channel Memories, Checkpoint Servers and theoretically proven protocols to execute existing or new SPMD and Master-Worker MPI applications on volatile nodes. To evaluate its capabilities, we run MPICH-V within a framework for which the number of nodes, Channel Memories and Checkpoint Servers can be completely configured, as well as the node volatility. We present a detailed performance evaluation of every component of MPICH-V and its global performance for non-trivial parallel applications. Experimental results demonstrate good scalability and high tolerance to node volatility.

Data sieving and collective I/O in ROMIO

Conference Paper

Mar 1999

The I/O access patterns of parallel programs often consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access a noncontiguous data set with a single I/O function call. This feature provides MPI-IO implementations an opportunity to optimize data access. We describe how our MPI-IO implementation, ROMIO, delivers high performance in the presence of noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how one can implement these optimizations portably on multiple machines and file systems, control their memory requirements, and also achieve high performance. We demonstrate the performance and portability with performance results for three applications, namely an astrophysics application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC), on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.

Fault Tolerant Communication Library and Applications for HPC

Article

High-Performance Parallel and Distributed Scientific Computing with the Common Component Architecture

Conference Paper

Jun 2004
Lect Notes Comput Sci

David E. Bernholdt

In the scientific computing community, parallel and, increasingly, distributed computing are both important paradigms for the development of large-scale simulation software. The ability to bridge seamlessly between these two paradigms is a valuable characteristic for programming models in this general domain. The Common Component Architecture (CCA) is a software component model specially designed for the needs of the scientific community, including support for both high-performance parallel and distributed computing. The CCA provides a means for software developers to manage the complexity of large-scale scientific simulations and to move toward a plug-and-play environment for high-performance computing. In the scientific computing context, component models also promote collaboration using independently developed software, thereby allowing particular individuals or groups to focus on the aspects of greatest interest to them. The CCA is being applied within an increasing range of disciplines, including combustion research, mesoscale storm prediction, global climate simulation, and computational chemistry, as well as connecting to instruments and sensors. In this talk, I will introduce the basic concepts behind component-based software engineering in general, and the Common Component Architecture in particular. I will emphasize the mechanisms by which the CCA provides for both high-performance parallel computing and distributed computing, and how it integrates with several popular distributed computing environments. Finally, I will offer examples of several applications using the CCA in parallel and distributed contexts.
