Conference Paper

Instrumentation and Analysis of MPI Queue Times on the SeaStar High-Performance Network

Scalable Systems Software Dept., Sandia National Laboratories, Albuquerque, NM
DOI: 10.1109/ICCCN.2008.ECP.116 Conference: Proceedings of the 17th International Conference on Computer Communications and Networks (ICCCN '08), 2008
Source: DBLP

ABSTRACT: Understanding the communication behavior and network resource usage of parallel applications is critical to achieving high performance and scalability on systems with tens of thousands of network endpoints. The need for better understanding is not only driven by the desire to identify potential performance optimization opportunities for current networks, but is also a necessity for designing next-generation networking hardware. In this paper, we describe our approach to instrumenting the SeaStar interconnect on the Cray XT series of massively parallel processing machines to gather low-level network timing data. This data provides a new perspective on performance evaluation, both in terms of evaluating the resource usage patterns of applications as well as evaluating different implementation strategies in the network protocol stack.

  • ABSTRACT: Understanding the message passing behavior and network resource usage of distributed-memory message-passing parallel applications is critical to achieving high performance and scalability. While much research has focused on how applications use critical compute-related resources, relatively little attention has been devoted to characterizing the usage of network resources, specifically those needed by the network interface. This paper discusses the importance of understanding network interface resource usage requirements for parallel applications and describes an initial attempt to gather network resource usage data for several real-world codes. The results show widely varying usage patterns between processes in the same parallel job and indicate that resource requirements can change dramatically as process counts increase and input data changes. This suggests that general network resource management strategies may not be widely applicable, and that adaptive strategies or more fine-grained controls may be necessary for environments where network interface resources are severely constrained.
    Parallel Processing, 2005. ICPP 2005. International Conference on; 07/2005
  • ABSTRACT: With the heavy reliance of modern scientific applications upon the MPI Standard, it has become critical for the implementation of MPI to be as capable and as fast as possible. This has led some of the fastest modern networks to introduce the capability to offload aspects of MPI processing to an embedded processor on the network interface. With this important capability have come significant performance implications. Most notably, the time to process long queues of posted receives or unexpected messages is substantially longer on embedded processors. This paper presents an associative list matching structure to accelerate the processing of moderate-length queues in MPI. Simulations are used to compare the performance of an embedded processor augmented with this capability to a baseline implementation. The proposed enhancement significantly reduces latency for moderate-length queues while adding virtually no overhead for extremely short queues.
    Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International; 05/2005
  • ABSTRACT: The latency and throughput of MPI messages are critically important to a range of parallel scientific applications. In many modern networks, both of these performance characteristics are largely driven by the performance of a processor on the network interface. Because of the semantics of MPI, this embedded processor is forced to traverse a linked list of posted receives each time a message is received. As this list grows long, the latency of message reception grows and the throughput of MPI messages decreases. This paper presents a novel hardware feature to handle list management functions on a network interface. By moving functions such as list insertion, list traversal, and list deletion to the hardware unit, latencies are decreased by up to 20% in the zero-length queue case, with dramatic improvements in the presence of long queues. Similarly, throughput is increased by up to 10% in the zero-length queue case and by nearly 100% in the presence of queues of 30 messages.
    2005 IEEE International Conference on Cluster Computing (CLUSTER 2005), September 26 - 30, 2005, Boston, Massachusetts, USA; 01/2005
