Conference Proceeding
The performance of coordinated and independent checkpointing
Dept. de Engenharia Inf., Coimbra Univ.
05/1999
DOI:10.1109/IPPS.1999.760487
ISBN: 0-7695-0143-5; pp. 280-284
In proceedings of: Parallel and Distributed Processing, 1999. 13th International and 10th Symposium on Parallel and Distributed Processing, 1999. 1999 IPPS/SPDP. Proceedings
Source: IEEE Xplore
Article: System structure for software fault tolerance
ABSTRACT: The paper presents, and discusses the rationale behind, a method for structuring complex computing systems by the use of what we term “recovery blocks”, “conversations” and “fault-tolerant interfaces”. The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.
ACM SIGPLAN Notices 08/2003; 10(6):437-449. · 0.09 Impact Factor
Conference Proceeding: Independent checkpointing and concurrent rollback for recovery in distributed systems-an optimistic approach
ABSTRACT: A checkpoint algorithm is presented that benefits from the research in concurrency control, commit, and site recovery algorithms in transaction processing. In the authors' approach, a number of checkpointing processes, a number of rollback processes, and computations on operational processes can proceed concurrently while tolerating the failure of an arbitrary number of processes. Each process takes checkpoints independently. During recovery after a failure, a process invokes a two-phase rollback algorithm. It collects information about relevant message exchanges in the system in the first phase and uses it in the second phase to determine both the set of processes that must roll back and the set of checkpoints up to which rollback must occur. Concurrent rollbacks are completed in the order of the priorities of the recovering processes. The proposed solution is optimistic in the sense that it does well if failures are infrequent by minimizing overhead during normal processing.
Reliable Distributed Systems, 1988. Proceedings., Seventh Symposium on; 11/1988
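The abstract above describes using recorded message exchanges to decide which processes must roll back and to which checkpoints. The following is only a minimal, centralized sketch of that general idea (rollback propagation to a consistent recovery line), not the authors' two-phase concurrent algorithm. The function name `recovery_line`, the message tuple layout, and the interval numbering are all assumptions made for illustration: interval i of a process is the execution that follows its checkpoint i, and rolling back to checkpoint c undoes every interval numbered c or higher.

```python
def recovery_line(latest, messages, failed):
    """Compute, per process, the checkpoint it must restart from.

    latest   -- dict: process -> index of its most recent checkpoint
    messages -- list of (sender, send_interval, receiver, recv_interval)
    failed   -- iterable of processes that crashed

    Returns a dict: process -> checkpoint index. A value of latest[p] + 1
    means process p keeps its current state and does not roll back.
    (Hypothetical representation, chosen for this sketch only.)
    """
    # Non-failed processes start with nothing undone.
    line = {p: k + 1 for p, k in latest.items()}
    # A failed process loses its volatile state and restarts
    # from its most recent checkpoint.
    for p in failed:
        line[p] = latest[p]

    changed = True
    while changed:
        changed = False
        for sender, s, receiver, r in messages:
            send_undone = s >= line[sender]
            recv_kept = r < line[receiver]
            # A kept receive whose send was undone is an orphan message:
            # the receiver must roll back far enough to undo the receive.
            if send_undone and recv_kept:
                line[receiver] = r
                changed = True
    return line


# Each process checkpointed twice after its initial checkpoint 0.
latest = {"P": 2, "Q": 2}

# One message sent by P after its last checkpoint drags Q back with it.
single = [("P", 2, "Q", 2)]
print(recovery_line(latest, single, failed=["P"]))

# Interleaved messages cause the rollback to cascade (domino effect).
domino = [("Q", 0, "P", 0), ("P", 1, "Q", 0), ("Q", 1, "P", 1), ("P", 2, "Q", 1)]
print(recovery_line(latest, domino, failed=["P"]))
```

In the second example each rollback orphans a message received earlier by the other process, so both processes end up restarting from checkpoint 0. This cascading is the known risk of taking checkpoints independently.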
Article: State Restoration In Distributed Systems
ABSTRACT: This paper concerns an important aspect of the problem of designing fault-tolerant distributed computing systems. The concepts involved in "backward error recovery", i.e. restoring a system, or some part of a system, to a previous state which it is hoped or believed preceded the occurrence of any existing errors, are formalised and generalised so as to apply to concurrent, e.g. distributed, systems. Since in distributed systems there may exist a great deal of independence between activities, the system can be restored to a state that could have existed rather than to a state that actually existed.
08/2003