Architecture of the Component Collective Messaging Interface.
ABSTRACT Different programming paradigms utilize a variety of collective communication operations, often with different semantics. We present the component collective messaging interface (CCMI), which supports asynchronous non-blocking collectives and is extensible to different programming paradigms and architectures. CCMI is built from components written in the C++ programming language, making it reusable and extensible. Collective algorithms are expressed as topological schedules, which are carried out by executors. Portability across architectures is provided by the multisend data-movement component. CCMI also includes a programming-language adaptor used to implement APIs with different semantics for different paradigms. We study the effectiveness of CCMI on 16K nodes of the Blue Gene/P machine, evaluating its performance for the barrier, broadcast, and allreduce collective operations and for several application benchmarks. We also present the performance of the barrier collective on the Abe InfiniBand cluster.
Article: Two algorithms for barrier synchronization
ABSTRACT: The authors describe two new algorithms for implementing barrier synchronization on a shared-memory multicomputer. Both algorithms are based on a method due to Brooks. They first improve Brooks' algorithm by introducing double buffering. Their dissemination algorithm replaces Brooks' communication pattern with an information-dissemination algorithm described by Han and Finkel. Their tournament algorithm uses a different communication pattern and generally requires fewer total instructions. The resulting algorithms improve on Brooks' original barrier by a factor of two when the number of processes is a power of two. When the number of processes is not a power of two, the improvement is even greater, because absent processes need not be simulated. These algorithms share with Brooks' barrier the limitation that each of the n processes meeting at the barrier must be assigned an identifier i such that 0 ≤ i < n. International Journal of Parallel Programming 01/1988; 17:1-17.
Conference Proceeding: Implementation and performance analysis of non-blocking collective operations for MPI.
ABSTRACT: Collective operations and non-blocking point-to-point operations have always been part of MPI. Although non-blocking collective operations are an obvious extension to MPI, there have been no comprehensive studies of this functionality. In this paper we present LibNBC, a portable high-performance library for implementing non-blocking collective MPI communication operations. LibNBC provides non-blocking versions of all MPI collective operations, is layered on top of MPI-1, and is portable to nearly all parallel architectures. To measure the performance characteristics of our implementation, we also present a microbenchmark for measuring both latency and overlap of computation and communication. Experimental results demonstrate that the blocking performance of the collective operations in our library is comparable to that of collective operations in other high-performance MPI implementations. Our library introduces a very low overhead between the application and the underlying MPI and thus, in conjunction with the potential to overlap communication with computation, offers the potential for optimizing real-world applications. Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing, SC 2007, November 10-16, 2007, Reno, Nevada, USA; 01/2007
Conference Proceeding: Design of High Performance MVAPICH2: MPI2 over InfiniBand
ABSTRACT: MPICH2 provides a layered architecture for implementing MPI-2. In this paper, we provide a new design for implementing MPI-2 over InfiniBand by extending the MPICH2 ADI3 layer. Our new design aims to achieve high performance by providing a multi-communication-method framework that can utilize appropriate communication channels/devices to attain optimal performance without compromising scalability and portability. We also present a performance comparison of the new design with our previous design based on the MPICH2 RDMA channel. We show significant performance improvements in micro-benchmarks and the NAS Parallel Benchmarks. Cluster Computing and the Grid, 2006. CCGRID 06. Sixth IEEE International Symposium on; 06/2006