Conference Proceeding

The performance of coordinated and independent checkpointing

Dept. de Engenharia Inf., Coimbra Univ.
05/1999; DOI: 10.1109/IPPS.1999.760487; ISBN: 0-7695-0143-5; pp. 280-284. In proceedings of: Parallel and Distributed Processing, 1999. 13th International and 10th Symposium on Parallel and Distributed Processing (1999 IPPS/SPDP), Proceedings.
Source: IEEE Xplore

ABSTRACT Checkpointing is an effective technique for tolerating failures in distributed and parallel applications. The algorithms in the literature fall into two main classes: coordinated and independent checkpointing. This paper presents an experimental study that compares the performance of these two classes of algorithms. The main conclusion of our study is that coordinated checkpointing is more efficient than independent checkpointing, and that none of the arguments against the performance of coordinated algorithms was confirmed in practice.
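The abstract does not describe either class of algorithm in detail. As a rough illustration of why the two classes can behave differently at recovery time, the sketch below computes a consistent recovery line by generic rollback propagation. It is not the method evaluated in the paper: the names `Message` and `recovery_line`, and the interval-numbering model (an event in interval `i` happens after a process's checkpoint `i` and before its checkpoint `i + 1`, with checkpoint 0 being the initial state), are assumptions made for this example.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    """Hypothetical record of one message, for illustration only."""
    sender: str
    send_interval: int   # sent after the sender's checkpoint with this index
    receiver: str
    recv_interval: int   # received after the receiver's checkpoint with this index


def recovery_line(latest: Dict[str, int], messages: List[Message]) -> Dict[str, int]:
    """Roll processes back until no orphan message remains.

    A message is an orphan when its receipt is part of the receiver's
    restored state but its send has been undone on the sender's side.
    """
    line = dict(latest)
    changed = True
    while changed:
        changed = False
        for m in messages:
            send_undone = m.send_interval >= line[m.sender]
            recv_kept = m.recv_interval < line[m.receiver]
            if send_undone and recv_kept:
                # Undo the receipt by restoring an earlier checkpoint.
                line[m.receiver] = m.recv_interval
                changed = True
    return line


# Independent checkpoints: every checkpoint is crossed by a message,
# so the rollback cascades to the initial state (the domino effect).
domino = [
    Message("Q", 0, "P", 0),
    Message("P", 1, "Q", 0),
    Message("Q", 1, "P", 1),
    Message("P", 2, "Q", 1),
]
independent = recovery_line({"P": 2, "Q": 2}, domino)

# Coordinated checkpoints: no message crosses the latest checkpoints,
# so the latest checkpoints already form a consistent line.
aligned = [Message("P", 1, "Q", 1), Message("Q", 0, "P", 0)]
coordinated = recovery_line({"P": 2, "Q": 2}, aligned)
```

In this toy model the independent run falls back to checkpoint 0 on both processes, while the coordinated run keeps checkpoint 2 on both. The sketch says nothing about run-time overhead, which is what the experimental study actually measures.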

  • Source
    Article: System structure for software fault tolerance
    ABSTRACT: The paper presents, and discusses the rationale behind, a method for structuring complex computing systems by the use of what we term “recovery blocks”, “conversations” and “fault-tolerant interfaces”. The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.
    ACM SIGPLAN Notices 08/2003; 10(6):437-449. · 0.09 Impact Factor
  • Source
    Conference Proceeding: Independent checkpointing and concurrent rollback for recovery in distributed systems-an optimistic approach
    ABSTRACT: A checkpoint algorithm is presented that benefits from the research in concurrency control, commit, and site recovery algorithms in transaction processing. In the authors' approach a number of checkpointing processes, a number of rollback processes, and computations on operational processes can proceed concurrently while tolerating the failure of an arbitrary number of processes. Each process takes checkpoints independently. During recovery after a failure, a process invokes a two-phase rollback algorithm. It collects information about relevant message exchanges in the system in the first phase and uses it in the second phase to determine both the set of processes that must roll back and the set of checkpoints up to which rollback must occur. Concurrent rollbacks are completed in the order of the priorities of the recovering processes. The proposed solution is optimistic in the sense that it does well if failures are infrequent by minimizing overhead during normal processing.
    Reliable Distributed Systems, 1988. Proceedings., Seventh Symposium on; 11/1988
  • Source
    Article: State Restoration In Distributed Systems
    ABSTRACT: This paper concerns an important aspect of the problem of designing fault-tolerant distributed computing systems. The concepts involved in "backward error recovery", i.e. restoring a system, or some part of a system, to a previous state which it is hoped or believed preceded the occurrence of any existing errors are formalised, and generalised so as to apply to concurrent, e.g. distributed, systems. Since in distributed systems there may exist a great deal of independence between activities, the system can be restored to a state that could have existed rather than to a state that actually existed.
    08/2003;

Keywords

Checkpointing
coordinated checkpointing
effective technique
efficient
existing algorithms
experimental study
independent checkpointing
parallel applications

L.M. Silva