Long Wang

University of Illinois, Urbana-Champaign, Urbana, IL, USA

Publications (2) · 1.28 Total impact

  • Article: Reliability MicroKernel: Providing Application-Aware Reliability in the OS
    ABSTRACT: This paper describes the Reliability MicroKernel (RMK) framework, a loadable kernel module (or a device driver) for providing application-aware reliability and dynamically configuring reliability mechanisms. Characteristics of application/system execution are exploited transparently through application-aware reliability techniques to achieve low-latency detection and low-overhead checkpointing. The RMK prototype is implemented in both Linux and Windows, and it supports detection of application/OS failures and transparent application checkpointing. Experimental results show that system hang detection and application hang detection, which exploit characteristics of application and system behavior, can achieve high coverage (100% observed in our experiments) with a low false positive rate. Moreover, the performance overhead of RMK and its detection/checkpointing mechanisms is small: 0.6% for application hang detection and 0.1% for transparent application checkpointing in the experiments.
    IEEE Transactions on Reliability, 01/2008 · 1.28 Impact Factor
  • Conference Proceeding: Group communication protocols under errors
    ABSTRACT: Group communication protocols constitute a basic building block for highly dependable distributed applications. Designing and correctly implementing a group communication system (GCS) is a difficult task. While many theoretical algorithms have been formalized and proved for correctness, only few research projects have experimentally assessed the dependability of GCS implementations under complex error scenarios. This paper describes a thorough error-injection experimental campaign conducted on Ensemble, a popular GCS. By employing synthetic benchmark applications, we stress selected components of the GCS - the group membership service, the FIFO-ordered reliable multicast - under various error models, including errors in the memory (text and heap segments) and in the network messages. The data show that about 5-6% of the failures are due to an error escaping Ensemble's error-containment mechanism and manifesting as a fail silence violation. This constitutes an impediment to achieving high dependability, the natural objective of GCSs. Our results are derived for a particular system (Ensemble), and more investigation involving other GCSs is required to generalize the conclusions. Nevertheless, through an accurate analysis of the failure causes and the error propagation patterns, this paper offers insights into the design and the implementation of robust GCSs.
    Reliable Distributed Systems, 2003. Proceedings. 22nd International Symposium on; 11/2003

Institutions

  • 2008
    • University of Illinois, Urbana-Champaign
      • Coordinated Science Laboratory
      Urbana, IL, USA