Open MPI: Goals, Concept, and Design of a
Next Generation MPI Implementation
Edgar Gabriel1, Graham E. Fagg1, George Bosilca1, Thara Angskun1,
Jack J. Dongarra1, Jeffrey M. Squyres2, Vishal Sahay2,
Prabhanjan Kambadur2, Brian Barrett2, Andrew Lumsdaine2,
Ralph H. Castain3, David J. Daniel3, Richard L. Graham3,
Timothy S. Woodall3
1Innovative Computing Laboratory, University of Tennessee,
{egabriel, fagg, bosilca, angskun, dongarra}@cs.utk.edu
2Open System Laboratory, Indiana University
{jsquyres, vsahay, pkambadu, brbarret, lums}@osl.iu.edu
3Advanced Computing Laboratory, Los Alamos National Lab
{rhc, ddd, rlgraham, twoodall}@lanl.gov
Abstract. A large number of MPI implementations are currently available, each of which emphasizes different aspects of high-performance computing or is intended to solve a specific research problem. The result is a myriad of incompatible MPI implementations, all of which require separate installation, and the combination of which presents significant logistical challenges for end users. Building upon prior research, and influenced by experience gained from the code bases of the LAM/MPI, LA-MPI, and FT-MPI projects, Open MPI is an all-new, production-quality MPI-2 implementation that is fundamentally centered around component concepts. Open MPI provides a unique combination of novel features previously unavailable in an open-source, production-quality implementation of MPI. Its component architecture provides both a stable platform for third-party research and a mechanism for the run-time composition of independent software add-ons. This paper presents a high-level overview of the goals, design, and implementation of Open MPI.
1 Introduction
The evolution of parallel computer architectures has recently created new trends
and challenges for both parallel application developers and end users. Systems
comprised of tens of thousands of processors are available today; hundred-thousand
processor systems are expected within the next few years. Monolithic high-
performance computers are steadily being replaced by clusters of PCs and work-
stations because of their more attractive price/performance ratio. However, such
clusters provide a less integrated environment and therefore exhibit different (and often inferior) I/O behavior from that of the architectures they replace. Grid and metacom-
puting efforts yield a further increase in the number of processors available to
parallel applications, as well as an increase in the physical distances between
computational elements.
These trends lead to new challenges for MPI implementations. An MPI ap-
plication utilizing thousands of processors faces many scalability issues that can
dramatically impact the overall performance of any parallel application. Such
issues include (but are not limited to): process control, resource exhaustion,
latency awareness and management, fault tolerance, and optimized collective
operations for common communication patterns.
Network layer transmission errors—which have been considered highly im-
probable for moderate-sized clusters—cannot be ignored when dealing with large-
scale computations [4]. Additionally, the probability that a parallel application
will encounter a process failure during its run increases with the number of pro-
cessors that it uses. If the application is to survive a process failure without
having to restart from the beginning, it either must regularly write checkpoint
files (and restart the application from the last consistent checkpoint [1, 8]) or the
application itself must be able to adaptively handle process failures during run-
time [3]. All of these issues are current, relevant research topics. Indeed, some
have been addressed at various levels by different projects. However, no MPI
implementation is currently capable of addressing all of them comprehensively.
This directly implies that a new MPI implementation is necessary: one that
is capable of providing a framework to address important issues in emerging
networks and architectures. Building upon prior research, and influenced by ex-
perience gained from the code bases of the LAM/MPI [9], LA-MPI [4], and
FT-MPI [3] projects, Open MPI is an all-new, production-quality MPI-2 imple-
mentation. Open MPI provides a unique combination of novel features previously
unavailable in an open-source, production-quality implementation of MPI. Its
component architecture provides both a stable platform for cutting-edge third-party research and a mechanism for the run-time composition of independent software add-ons.
1.1 Goals of Open MPI
While all participating institutions have significant experience in implementing
MPI, Open MPI represents more than a simple merger of LAM/MPI, LA-MPI
and FT-MPI. Although influenced by previous code bases, Open MPI is an all-
new implementation of the Message Passing Interface. Focusing on production-
quality performance, the software implements the full MPI-1.2 [6] and MPI-2 [7]
specifications and fully supports concurrent, multi-threaded applications (i.e.,
MPI_THREAD_MULTIPLE).
To efficiently support a wide range of parallel machines, high performance
“drivers” for all established interconnects are currently being developed. These
include TCP/IP, shared memory, Myrinet, Quadrics, and Infiniband. Support
for more devices will likely be added based on user, market, and research re-
quirements. For network transmission errors, Open MPI provides optional fea-
tures for checking data integrity. By utilizing message fragmentation and striping
framework
Component
framework
Component
framework
Component
framework
Component
...
Module A Module B Module N
meta framework
Component framework
MCA
Fig. 1. Three main functional areas of Open MPI: the MCA, its component frame-
works, and the modules in each framework.
over multiple (potentially heterogeneous) network devices, Open MPI is capa-
ble of both maximizing the achievable bandwidth to applications and provid-
ing the ability to dynamically handle the loss of network devices when nodes
are equipped with multiple network interfaces. Thus, the handling of network
failovers is completely transparent to the application.
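To make the fragmentation-and-striping scheme described above concrete, the following is a minimal sketch, in C, of scheduling a message across several network devices in proportion to their bandwidth. All type and function names here are hypothetical illustrations, not Open MPI's actual PML/PTL interfaces.

#include <stddef.h>

/* Hypothetical descriptor for one network device ("PTL"). */
struct ptl {
    const char *name;       /* e.g., "tcp0", "myrinet0"          */
    double      bandwidth;  /* relative weight used for striping */
};

/* Hypothetical fragment handed to a device driver for transmission. */
struct fragment {
    struct ptl *ptl;        /* device that will transmit this piece   */
    size_t      offset;     /* offset of the piece in the user buffer */
    size_t      length;     /* number of bytes in this piece          */
};

/* Split a message of msg_len bytes across nptl devices, giving each
 * device a share proportional to its bandwidth.  Returns the number
 * of fragments written into frags (at most nptl). */
static int stripe_message(size_t msg_len, struct ptl *ptls, int nptl,
                          struct fragment *frags)
{
    double total_bw = 0.0;
    for (int i = 0; i < nptl; ++i)
        total_bw += ptls[i].bandwidth;

    size_t offset = 0;
    int nfrag = 0;
    for (int i = 0; i < nptl && offset < msg_len; ++i) {
        size_t share = (i == nptl - 1)
            ? msg_len - offset   /* remainder goes to the last device */
            : (size_t)(msg_len * (ptls[i].bandwidth / total_bw));
        if (share == 0)
            continue;
        frags[nfrag].ptl    = &ptls[i];
        frags[nfrag].offset = offset;
        frags[nfrag].length = share;
        offset += share;
        ++nfrag;
    }
    return nfrag;
}

If a device fails, the same scheduler can simply be re-run over the surviving devices for any unacknowledged fragments, which is the essence of the transparent failover described above.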
The runtime environment of Open MPI will provide basic services to start and
manage parallel applications in interactive and non-interactive environments.
Where possible, existing run-time environments will be leveraged to provide the
necessary services; a portable run-time environment based on user-level daemons
will be used where such services are not already available.
2 The Architecture of Open MPI
The Open MPI design is centered around the MPI Component Architecture
(MCA). While component programming is widely used in industry, it has only recently begun to gain acceptance in the high-performance computing community [2, 9].
As shown in Fig. 1, Open MPI is comprised of three main functional areas:
MCA: The backbone component architecture that provides management ser-
vices for all other layers;
Component frameworks: Each major functional area in Open MPI has a
corresponding back-end component framework, which manages modules;
Modules: Self-contained software units that export well-defined interfaces
that can be deployed and composed with other modules at run-time.
The MCA manages the component frameworks and provides services to them,
such as the ability to accept run-time parameters from higher-level abstractions
(e.g., mpirun) and pass them down through the component framework to indi-
vidual modules. The MCA also finds components at build-time and invokes their
corresponding hooks for configuration, building, and installation.
Each component framework is dedicated to a single task, such as providing
parallel job control or performing MPI collective operations. Upon demand, a
framework will discover, load, use, and unload modules. Each framework has
different policies and usage scenarios; some will only use one module at a time
while others will use all available modules simultaneously.
Modules are self-contained software units that can configure, build, and in-
stall themselves. Modules adhere to the interface prescribed by the component
framework that they belong to, and provide requested services to higher-level
tiers and other parts of MPI.
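To illustrate what such a framework/module contract can look like in C, the sketch below defines a module as a structure of metadata plus function pointers that the framework calls. The names and fields are simplified, hypothetical stand-ins, not Open MPI's actual headers.

/* Metadata the MCA needs in order to find, version, and select a module. */
struct mca_module_info {
    const char *framework;    /* e.g., "coll", "ptl"   */
    const char *name;         /* e.g., "basic", "tcp"  */
    int         major, minor; /* module version number */
};

/* Hooks every module exports; the owning framework invokes them. */
struct mca_module {
    struct mca_module_info info;

    /* Called once at startup: may this module run in the process? */
    int (*open)(void);

    /* Called when the framework needs the module's services;
     * fills in a priority, or declines by returning non-zero.    */
    int (*query)(void *context, int *priority);

    /* Called at shutdown to release any resources. */
    int (*close)(void);
};

A module would then export one well-known symbol of this type, which the MCA locates either at build time (for statically linked modules) or at run time (for dynamically loaded ones).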
The following is a partial list of component frameworks in Open MPI (MPI
functionality is described; run-time environment support components are not
covered in this paper):
Point-to-point Transport Layer (PTL): a PTL module corresponds to a par-
ticular network protocol and device. Mainly responsible for the “wire proto-
cols” of moving bytes between MPI processes, PTL modules have no knowl-
edge of MPI semantics. Multiple PTL modules can be used in a single pro-
cess, allowing the use of multiple (potentially heterogeneous) networks. PTL
modules supporting TCP/IP, shared memory, Quadrics elan4, Infiniband
and Myrinet will be available in the first Open MPI release.
Point-to-point Management Layer (PML): the primary function of the PML
is to provide message fragmentation, scheduling, and re-assembly service
between the MPI layer and all available PTL modules. More details on the PML and PTL modules can be found in [11].
Collective Communication (COLL): the back-end of MPI collective oper-
ations, supporting both intra- and intercommunicator functionality. Two
collective modules are planned at the current stage: a basic module imple-
menting linear and logarithmic algorithms and a module using hierarchical
algorithms similar to the ones used in the MagPIe project [5].
Process Topology (TOPO): Cartesian and graph mapping functionality for
intracommunicators. Cluster-based and Grid-based computing may benefit
from topology-aware communicators, allowing the MPI implementation to optimize communications based on locality.
Reduction Operations: the back-end functions for MPI’s intrinsic reduction
operations (e.g., MPI_SUM). Modules can exploit specialized instruction sets
for optimized performance on target platforms.
Parallel I/O: I/O modules implement parallel file and device access. Many
MPI implementations use ROMIO [10], but other packages may be adapted
for native use (e.g., cluster- and parallel-based filesystems).
The wide variety of framework types allows third party developers to use
Open MPI as a research platform, a deployment vehicle for commercial products,
or even a comparison mechanism for different algorithms and techniques.
The component architecture in Open MPI offers several advantages for end-
users and library developers. First, it enables the usage of multiple components
within a single MPI process. For example, a process can use several network
device drivers (PTL modules) simultaneously. Second, it provides a convenient means of using third-party software, supporting both source code and binary distributions. Third, it provides a fine-grained, run-time, user-controlled component selection mechanism.
2.1 Module Lifecycle
Although every framework is different, the COLL framework provides an illus-
trative example of the usage and lifecycle of a module in an MPI process:
1. During MPI_INIT, the COLL framework finds all available modules. Modules
may have been statically linked into the MPI library or be shared library
modules located in well-known locations.
2. All COLL modules are queried to see if they want to run in the process. Modules may choose not to run; for example, an Infiniband-based module may choose not to run if there are no Infiniband NICs available. A list is made of all modules that choose to run; this is the list of "available" modules.
3. As each communicator is created (including MPI_COMM_WORLD and MPI_COMM_SELF), each available module is queried to see if it wants to be used on the new communicator. Modules may decline to be used; e.g., a shared memory module will only allow itself to be used if all processes in the communicator are on the same physical node. The highest-priority module that accepts is selected to be used for that communicator.
4. Once a module has been selected, it is initialized. The module typically
allocates any resources and potentially pre-computes information that will
be used when collective operations are invoked.
5. When an MPI collective function is invoked on that communicator, the mod-
ule’s corresponding back-end function is invoked to perform the operation.
6. The final phase in the COLL module’s lifecycle occurs when that commu-
nicator is destroyed. This typically entails freeing resources and any pre-
computed information associated with the communicator being destroyed.
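Steps 2 through 4 of this lifecycle amount to a priority-based selection loop. The sketch below, using hypothetical types and names modeled on the description above rather than on Open MPI's actual source, shows how a new communicator might pick its COLL module:

/* Hypothetical COLL module interface for this example. */
struct coll_module {
    const char *name;
    /* Returns 0 and sets a priority if the module is willing to handle
     * this communicator; returns non-zero to decline.                  */
    int (*comm_query)(void *comm, int *priority);
    /* Allocates per-communicator resources once the module is selected. */
    int (*comm_init)(void *comm);
};

/* Select the highest-priority willing module for a new communicator. */
static struct coll_module *
coll_select(void *comm, struct coll_module **available, int n)
{
    struct coll_module *best = NULL;
    int best_priority = -1;

    for (int i = 0; i < n; ++i) {
        int priority = -1;
        /* e.g., a shared memory module declines unless all processes
         * share a node; an Infiniband module declines without NICs.  */
        if (available[i]->comm_query(comm, &priority) != 0)
            continue;
        if (priority > best_priority) {
            best_priority = priority;
            best = available[i];
        }
    }
    if (best != NULL)
        best->comm_init(comm);   /* step 4: one-time initialization */
    return best;
}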
3 Implementation Details
Two aspects of Open MPI’s design are discussed: its object-oriented approach
and the mechanisms for module management.
3.1 Object-Oriented Approach
Open MPI is implemented using a simple C-language object-oriented system
with single inheritance and reference counting-based memory management us-
ing a retain/release model. An “object” consists of a structure and a singly-
instantiated “class” descriptor. The first element of the structure must be a
pointer to the parent class’s structure.
Macros are used to effect C++-like semantics (e.g., new, construct, destruct, delete). Experience with various C++-based software projects and the associated compilation problems on some platforms encouraged us to take this approach instead of using C++ directly.
Upon construction, an object’s reference count is set to one. When the object
is retained, its reference count is incremented; when it is released, its reference count is decremented. When the reference count reaches zero, the destructors of the class and its parent classes are run and the memory is freed.
3.2 Module Discovery and Management
Open MPI offers three different mechanisms for adding a module to the MPI
library (and therefore to user applications):
During the configuration of Open MPI, a script traverses the build tree
and generates a list of modules found. These modules will be configured,
compiled, and linked statically into the MPI library.
Similarly, modules discovered during configuration can also be compiled as
shared libraries that are installed and then re-discovered at run-time.
Third party library developers who do not want to provide the source code
of their modules can configure and compile their modules independently of
Open MPI and distribute the resulting shared library in binary form. Users
can install this module into the appropriate directory where Open MPI can
discover it at run-time.
At run-time, Open MPI first “discovers” all modules that were statically
linked into the MPI library. It then searches several directories (e.g., $HOME/ompi/,
${INSTALLDIR}/lib/ompi/, etc.) to find available modules, and sorts them by
framework type. To simplify run-time discovery, shared library modules have a
specific file naming scheme indicating both their MCA component framework
type and their module name.
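A hedged sketch of that discovery step is shown below. It assumes, purely for illustration, a file naming pattern of the form mca_<framework>_<module>.so and a single well-known descriptor symbol per module; the actual naming scheme and symbols used by Open MPI may differ.

#include <dirent.h>
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/* Scan one directory for shared-library modules belonging to a given
 * framework (e.g., "coll"), assuming names like "mca_coll_basic.so". */
static void discover_modules(const char *dir, const char *framework)
{
    char prefix[64];
    snprintf(prefix, sizeof(prefix), "mca_%s_", framework);

    DIR *d = opendir(dir);
    if (d == NULL)
        return;

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strncmp(e->d_name, prefix, strlen(prefix)) != 0)
            continue;
        if (strstr(e->d_name, ".so") == NULL)
            continue;

        char path[1024];
        snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);

        /* Load the module and look up its (hypothetical) descriptor. */
        void *handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
        if (handle == NULL)
            continue;
        if (dlsym(handle, "mca_module_descriptor") != NULL)
            printf("found %s module: %s\n", framework, e->d_name);
        else
            dlclose(handle);     /* not a valid module; unload it */
    }
    closedir(d);
}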
Modules are identified by their name and version number. This enables the
MCA to manage different versions of the same component, ensuring that the
modules used in one MPI process are the same (in both name and version number) as the modules used in a peer MPI process. Given this flexibility, Open
MPI provides multiple mechanisms both to choose a given module and to pass
run-time parameters to modules: command line arguments to mpirun, environ-
ment variables, text files, and MPI attributes (e.g., on communicators).
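As an illustration of this precedence of parameter sources, the helper below resolves a module parameter by consulting, in order, a table filled from mpirun command-line arguments, an environment variable, and a compiled-in default. The function, the table, and the environment-variable naming shown are hypothetical, not Open MPI's actual MCA parameter API.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table filled from mpirun command-line arguments. */
struct param { const char *key; const char *value; };

static const char *lookup_param(const struct param *cmdline, int n,
                                const char *framework, const char *module,
                                const char *name, const char *fallback)
{
    char key[128], envname[160];
    snprintf(key, sizeof(key), "%s_%s_%s", framework, module, name);

    /* 1. Parameters passed down from the mpirun command line win. */
    for (int i = 0; i < n; ++i)
        if (strcmp(cmdline[i].key, key) == 0)
            return cmdline[i].value;

    /* 2. Otherwise consult an environment variable (prefix illustrative). */
    snprintf(envname, sizeof(envname), "OMPI_MCA_%s", key);
    const char *env = getenv(envname);
    if (env != NULL)
        return env;

    /* 3. Finally fall back to the module's compiled-in default. */
    return fallback;
}

Values coming from text files or MPI attributes would be consulted at similar points in the same chain.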
4 Performance Results
A performance comparison of Open MPI's point-to-point methodology with other public MPI libraries can be found in [11]. As a sample of Open MPI's performance, a snapshot of the development code was used to run the Pallas benchmarks (v2.2.1) for MPI_Bcast and MPI_Alltoall. The algorithms used for these functions in Open MPI's basic COLL module were derived from their corresponding implementations in LAM/MPI v6.5.9, a monolithic MPI implementation (i.e., not based on components). The collective operations are based on standard linear/logarithmic algorithms using MPI's point-to-point message passing for data movement. Although Open MPI's code is not yet complete, measuring its performance against the same algorithms in a monolithic architecture provides a basic comparison to ensure that the design and implementation are sound.
The performance measurements were executed on a cluster of dual-processor 2.4 GHz Intel Xeon machines connected via Fast Ethernet. The results shown in Fig. 2 indicate that, for large message sizes, the performance of the collective operations in Open MPI is identical to that of their LAM/MPI counterparts. For short messages, Open MPI currently shows a slight overhead compared to LAM/MPI. This is due to point-to-point latency optimizations in LAM/MPI that are not yet included in Open MPI; these optimizations will be included in the Open MPI release. The graph shows, however, that the design and overall approach are sound and simply need optimization.
Fig. 2. Performance comparison for MPI_BCAST and MPI_ALLTOALL operations in Open MPI and in LAM/MPI v6.5.9 (minimum execution time versus message length).
5 Summary
Open MPI is a new implementation of the MPI standard. It provides function-
ality that has not previously been available in any single, production-quality
MPI implementation, including support for all of MPI-2, multiple concurrent
user threads, and multiple options for handling process and network failures.
The Open MPI group is furthermore working on establishing a proper legal framework that enables third-party developers to contribute source code to the project.
The first full release of Open MPI is planned for the 2004 Supercomputing Conference. An initial beta release supporting most of the described functionality and an initial subset of network device drivers (tcp, shmem, and a loopback device) is planned for mid-2004. More information is available at http://www.open-mpi.org/.
Acknowledgments
This work was supported by a grant from the Lilly Endowment, National Sci-
ence Foundation grants 0116050, EIA-0202048, EIA-9972889, and ANI-0330620,
and Department of Energy Contract DE-FG02-02ER25536. Los Alamos National
Laboratory is operated by the University of California for the National Nuclear
Security Administration of the United States Department of Energy under con-
tract W-7405-ENG-36. Project support was provided through ASCI/PSE and
the Los Alamos Computer Science Institute, and the Center for Information
Technology Research (CITR) of the University of Tennessee.
References
1. G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault,
P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov. MPICH-V:
Toward a scalable fault tolerant MPI for volatile nodes. In SC'2002 Conference CD, Baltimore, MD, 2002. IEEE/ACM SIGARCH.
2. D. E. Bernholdt et al. A component architecture for high-performance scientific
computing. Intl. J. High-Performance Computing Applications, 2004.
3. G. E. Fagg, E. Gabriel, Z. Chen, T. Angskun, G. Bosilca, A. Bukovski, and J. J.
Dongarra. Fault tolerant communication library and applications for high performance. In Los Alamos Computer Science Institute Symposium, Santa Fe, NM, October 27-29, 2003.
4. R. L. Graham, S.-E. Choi, D. J. Daniel, N. N. Desai, R. G. Minnich, C. E. Ras-
mussen, L. D. Risinger, and M. W. Sukalski. A network-failure-tolerant message-
passing system for terascale clusters. International Journal of Parallel Program-
ming, 31(4):285–303, August 2003.
5. T. Kielmann, R. F. H. Hofman, H. E. Bal, A. Plaat, and R. A. F. Bhoedjang. Mag-
PIe: MPI’s collective communication operations for clustered wide area systems.
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
(PPoPP’99), 34(8):131–140, May 1999.
6. Message Passing Interface Forum. MPI: A Message Passing Interface Standard,
June 1995. http://www.mpi-forum.org.
7. Message Passing Interface Forum. MPI-2: Extensions to the Message Passing In-
terface, July 1997. http://www.mpi-forum.org.
8. Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Andrew Lumsdaine, Jason
Duell, Paul Hargrove, and Eric Roman. The LAM/MPI checkpoint/restart frame-
work: System-initiated checkpointing. International Journal of High Performance
Computing Applications, To appear, 2004.
9. Jeffrey M. Squyres and Andrew Lumsdaine. A Component Architecture for
LAM/MPI. In Proceedings, 10th European PVM/MPI Users’ Group Meeting, num-
ber 2840 in Lecture Notes in Computer Science, Venice, Italy, Sept. 2003. Springer.
10. Rajeev Thakur, William Gropp, and Ewing Lusk. Data sieving and collective I/O
in ROMIO. In Proceedings of the 7th Symposium on the Frontiers of Massively
Parallel Computation, pages 182–189. IEEE Computer Society Press, Feb 1999.
11. T.S. Woodall, R.L. Graham, R.H. Castain, D.J. Daniel, M.W. Sukalski, G.E. Fagg,
E. Gabriel, G. Bosilca, T. Angskun, J.J. Dongarra, J.M. Squyres, V. Sahay, P. Kam-
badur, B. Barrett, and A. Lumsdaine. TEG: A high-performance, scalable, multi-
network point-to-point communications methodology. In Proceedings, 11th Euro-
pean PVM/MPI Users’ Group Meeting, Budapest, Hungary, September 2004.
In the scientific computing community, parallel and, increasingly, distributed computing are both important paradigms for the development of large-scale simulation software. The ability to bridge seamlessly between these two paradigms is a valuable characteristic for programming models in this general domain. The Common Component Architecture (CCA) is a software component model specially designed for the needs of the scientific community, including support for both high-performance parallel and distributed computing. The Common Component Architecture (CCA) provides a means for software developers to manage the complexity of large-scale scientific simulations and to move toward a plug-and-play environment for high-performance computing. In the scientific computing context, component models also promote collaboration using independently developed software, thereby allowing particular individuals or groups to focus on the aspects of greatest interest to them. The CCA is being applied within an increasing range of disciplines, including combustion research, mesoscale storm prediction, global climate simulation, and computational chemistry, as well as connecting to instruments and sensors. In this talk, I will introduce the basic concepts behind component-based software engineering in general, and the common component architecture in particular. I will emphasize the mechanisms by which the CCA provides for both high-performance parallel computing and distributed computing, and how it integrates with several popular distributed computing environments. Finally, I will offer examples of several applications using the CCA in parallel and distributed contexts.