Conference Paper

A low latency, loss tolerant architecture and protocol for wide area group communication

Dept. of Comput. Sci., Johns Hopkins Univ., Baltimore, MD
DOI: 10.1109/ICDSN.2000.857557
Conference: Dependable Systems and Networks, 2000. DSN 2000. Proceedings International Conference on
Source: IEEE Xplore

ABSTRACT Group communication systems are proven tools upon which to build fault-tolerant systems. As the demands for fault-tolerance increase and more applications require reliable distributed computing over wide area networks, wide area group communication systems are becoming very useful. However, building a wide area group communication system is a challenge. This paper presents the design of the transport protocols of the Spread wide area group communication system. We focus on two aspects of the system: first, the value of using overlay networks for application level group communication services; second, the requirements and design of effective low latency link protocols used to construct wide area group communication. We support our claims with the results of live experiments conducted over the Internet.

  • ABSTRACT: Although IP and its overlying protocols, such as TCP and UDP, are ubiquitous, they were originally designed for point-to-point connections between computers in reasonably fixed locations. They are less suited to mobile networks and broadcast communications. In this paper, we present an alternative to IP that is based on a publish-subscribe approach. The approach that we present combines an application publish-subscribe programming model with a content delivery network, which provides several advantages in certain communication environments, including quality of service based on application level needs; efficient support for reliable broadcast; support for disadvantaged, intermittent, and limited communications; and more efficient reliability and fault tolerance. The paper presents our approach, based on a streamlined Data Distribution Service and simplified Content Delivery Network, a motivating example in which the publish-subscribe based distribution and network provide advantages, and a contrast to TCP/IP in the example context.
    ICC; 01/2012
  • ABSTRACT: The use of virtualization in HPC clusters can provide rich software environments, application isolation and efficient workload management mechanisms, but system-level virtualization introduces a software layer on the computing nodes that reduces performance and inhibits the direct use of hardware devices. We present an unobtrusive user-level platform to execute virtual machines inside batch jobs that does not handicap the computing cluster's ability to execute the most demanding applications. A per-user platform uses a static mode in which the VMs run entirely within the resources of a single batch job and a dynamic mode in which the VMs navigate at runtime between the continuously allocated jobs' node time-slots. In the dynamic mode, fault-tolerant system agents are integrated using group communication to control the system, to execute user commands and to implement user-defined scheduling policies. In our tests, compute-intensive applications suffered negligible performance overhead compared to the native configuration, but the user-mode network overlay introduced a significant penalty on the more taxing networked applications.
  • ABSTRACT: Information Management (IM) services provide a powerful capability for military operations, enabling managed information exchange based on the characteristics of the information that is needed and the information that is available, rather than on explicit knowledge of the information consumers, producers, and repositories. To be usable in tactical environments and mission critical operations, IM services need to be resilient to faults and failures, which can be due to many factors, including design or implementation flaws, misconfiguration, corruption, hardware or infrastructure failure, resource intermittency or contention, or hostile actions. This paper presents a reference model for representing the performance and fault tolerance requirements of IM services in tactical operations. A Joint Close Air Support operation is described using this representation, and the viability of canonical fault tolerance techniques is examined for a given deployment.