Architecture of the Component Collective Messaging Interface.

International Journal of High Performance Computing Applications 24:16–33, 02/2010. DOI: 10.1177/1094342009359011
Source: DBLP


Different programming paradigms use a variety of collective communication operations, often with different semantics. We present the Component Collective Messaging Interface (CCMI), which supports asynchronous non-blocking collectives and is extensible to different programming paradigms and architectures. CCMI is built from reusable, extensible components written in C++. Collective algorithms are expressed as topological schedules and driven by executors; portability across architectures is provided by the multisend data-movement component, and a programming-language adaptor implements APIs with the semantics required by each paradigm. We study the effectiveness of CCMI on 16K nodes of the Blue Gene/P machine and evaluate its performance for the barrier, broadcast, and allreduce collective operations and for several application benchmarks. We also present the performance of the barrier collective on the Abe InfiniBand cluster.
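
As a rough illustration of the component decomposition described in the abstract (topological schedules, executors, and the multisend data mover), the C++ sketch below uses hypothetical interface and class names; it is not the actual CCMI API, only a minimal sketch of how such pieces might fit together.

```cpp
// Illustrative sketch only: hypothetical interfaces mirroring the component
// roles named in the abstract (schedule, executor, multisend). These are
// NOT the real CCMI classes.
#include <cstddef>
#include <vector>

// A schedule enumerates, per phase, which ranks to send to or receive from
// for a given collective topology (e.g. binomial tree, torus rectangle).
struct Schedule {
    virtual ~Schedule() = default;
    virtual int numPhases() const = 0;
    virtual void dstRanks(int phase, std::vector<int>& out) const = 0;
};

// The multisend component hides architecture-specific data movement;
// one call can deposit data at many destinations.
struct Multisend {
    virtual ~Multisend() = default;
    virtual void send(const std::vector<int>& dsts,
                      const void* buf, std::size_t bytes) = 0;
};

// An executor walks the phases of a schedule and issues multisends.
class BroadcastExecutor {
public:
    BroadcastExecutor(Schedule& s, Multisend& m) : sched_(s), msend_(m) {}
    void start(const void* buf, std::size_t bytes) {
        std::vector<int> dsts;
        for (int p = 0; p < sched_.numPhases(); ++p) {
            sched_.dstRanks(p, dsts);
            if (!dsts.empty()) msend_.send(dsts, buf, bytes);
        }
    }
private:
    Schedule& sched_;
    Multisend& msend_;
};
```

In this sketch only the executor touches both the topology-specific schedule and the architecture-specific multisend, which is what would let each part be swapped independently.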

    • "Many-to-many in the context of the UPC programming paradigm is presented in [12]. The DCMF active message library for Blue Gene/P has been presented in [11] and the Component Collective Messaging Interface (CCMI) for optimized MPI collective communication operations is presented in [17]. CCMI algorithms and MPI collectives are built on top of the DCMF multisend interface. "
    ABSTRACT: We explore the multisend interface as a data-mover interface to optimize applications with neighborhood collective communication operations. One limitation of the current MPI 2.1 standard is that the vector collective calls require counts and displacements (zero and nonzero bytes) to be specified for all the processors in the communicator. Further, all the collective calls in MPI 2.1 are blocking and do not permit overlap of communication with computation. We present the record-replay persistent optimization to the multisend interface, which minimizes the processor overhead of initiating the collective. We present four case studies with the multisend API on Blue Gene/P: (i) 3D-FFT, (ii) the 4D nearest-neighbor exchange used in Quantum Chromodynamics, (iii) NAMD, and (iv) the neural network simulator NEURON. Performance results show 1.9× speedup with 32³ 3D-FFTs, 1.9× speedup for the 4D nearest-neighbor exchange on the 2⁴ problem, 1.6× speedup in NAMD, and almost 3× speedup in NEURON with 256K cells and 1k connections/cell.
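
A minimal sketch of the record-replay idea described in the abstract above, assuming a hypothetical persistent-exchange object; the names and signatures are invented for illustration and are not the DCMF or multisend API.

```cpp
// Hypothetical sketch of "record once, replay many times" for a persistent
// neighborhood exchange; not the actual multisend/DCMF interface.
#include <cstddef>
#include <utility>
#include <vector>

struct Neighbor { int rank; std::size_t bytes; const void* buf; };

class PersistentExchange {
public:
    // Record phase: capture the neighbor list and message layout once,
    // so later starts avoid per-call setup overhead on the compute core.
    void record(std::vector<Neighbor> neighbors) {
        plan_ = std::move(neighbors);
    }
    // Replay phase: each timestep simply re-issues the pre-built plan.
    void start() {
        for (const Neighbor& n : plan_) post_send(n.rank, n.buf, n.bytes);
    }
private:
    void post_send(int /*rank*/, const void* /*buf*/, std::size_t /*bytes*/) {
        // would hand off to the architecture-specific data mover here
    }
    std::vector<Neighbor> plan_;
};
```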
    • "Bruck [6] also studies multiporting in the context of all-to-all collectives . In recent work, Kumar et al. [13] [14] incorporate a number of these optimizations in collective interfaces that exist at different layers of the messaging stack. They show that performance saturates at different number of links, depending on the message protocol used. "
    ABSTRACT: Collective communication operations can dominate the cost of large-scale parallel algorithms. Image compositing in parallel scientific visualization is a reduction operation where this is the case. We present a new algorithm called Radix-k that in many cases performs better than existing compositing algorithms. It does so through a set of configurable parameters, the radices, that determine the number of communication partners in each message round. The algorithm embodies and unifies binary swap and direct-send, two of the best-known compositing methods, and enables numerous other configurations through appropriate choices of radices. While the algorithm is not tied to a particular computing architecture or network topology, the selection of radices allows Radix-k to take advantage of new supercomputer interconnect features such as multiporting. We show scalability across image size and system size, including both power-of-two and non-power-of-two process counts.
    Proceedings of the ACM/IEEE Conference on High Performance Computing, SC 2009, November 14-20, 2009, Portland, Oregon, USA; 01/2009
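
To make the role of the radices in the Radix-k algorithm described above concrete, here is a small illustrative sketch (hypothetical, not the authors' code) that factors a process count into per-round radices; the product of the radices equals the number of processes, and choosing all radices equal to 2 corresponds to binary swap.

```cpp
// Illustrative only: factor the number of processes into per-round radices,
// e.g. 12 processes -> radices {4, 3}: 4 partners in round 1, 3 in round 2.
#include <iostream>
#include <vector>

std::vector<int> chooseRadices(int nprocs, int target) {
    std::vector<int> radices;
    int remaining = nprocs;
    // Greedily peel off factors no larger than `target` (a tuning knob).
    for (int f = target; f >= 2 && remaining > 1; ) {
        if (remaining % f == 0) { radices.push_back(f); remaining /= f; }
        else --f;
    }
    if (remaining > 1) radices.push_back(remaining); // leftover prime factor
    return radices;
}

int main() {
    for (int r : chooseRadices(12, 4)) std::cout << r << ' ';  // prints: 4 3
    std::cout << '\n';
}
```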
    • "Theorem 1 shows that we can define a language for universal group communication operations that is based on two primitives and a strict ordering among operations. Schemes implemented in LibNBC [6] and IBM's Collective Component Messaging Interface [7] already rely on such a strict ordering. "
    ABSTRACT: The implementation and optimization of collective communication operations is an important field of active research. Such operations directly influence application performance and need to map the communication requirements in an optimal way to steadily changing network architectures. In this work, we define an abstract domain-specific language to express arbitrary group communication operations. We show the universality of this language and how all existing collective operations can be implemented with it. By design, it readily lends itself to blocking and nonblocking execution, as well as to off-loaded execution of complex group communication operations. We also define several offline and online optimizations (compiler transformations and scheduling decisions, respectively) to improve the overall performance of the operation. Performance results show that the overhead to express current collective operations is negligible in comparison to the potential gains in a highly optimized implementation.
    ICPP 2009, International Conference on Parallel Processing, Vienna, Austria, 22-25 September 2009; 01/2009
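
A minimal sketch of the idea in the abstract above, that a group operation can be expressed with two primitives (send and receive) plus a strict ordering among them; the representation below is hypothetical and only illustrates the concept, not the paper's actual DSL.

```cpp
// Hypothetical sketch of a group-operation schedule built from two primitives
// (send, recv) with explicit ordering dependencies; an illustration of the
// idea, not the DSL defined in the cited paper.
#include <cstddef>
#include <vector>

enum class Prim { Send, Recv };

struct Op {
    Prim kind;
    int  peer;              // partner rank
    std::size_t bytes;      // message size
    std::vector<int> deps;  // indices of ops that must complete first
};

// A 4-process binomial broadcast from rank 0, as seen by rank 2:
// first receive from rank 0, then (strictly after that) send to rank 3.
std::vector<Op> broadcastAtRank2() {
    return {
        { Prim::Recv, /*peer=*/0, /*bytes=*/1024, /*deps=*/{} },
        { Prim::Send, /*peer=*/3, /*bytes=*/1024, /*deps=*/{0} },
    };
}
```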