Checkpointing in CosMiC: a User-level Process Migration Environment
ABSTRACT The CosMiC system is a user-level process migration environment. Process migration is defined as the mechanism to checkpoint the state of an unfinished process, transfer the state from one machine to another, and resume process execution on the new machine. The main purposes of process migration are (1) to utilize CPU power and balance load across all machines in an environment; and (2) to provide fault tolerance by migrating a process from a failed machine to another machine. CosMiC provides an extensible architecture that allows an application to choose its own checkpointing mechanism. It is equipped with four checkpoint libraries, namely libckp, libfcp, libft, and libst, which provide different strategies for saving and restoring state. Libckp is a transparent checkpoint library that checkpoints the entire process state. It requires minimal user involvement and no modifications to the source code. Libfcp is a file checkpoint library that saves and restores file contents. Libft is a critical ...
- Source: available from psu.edu
Conference Proceeding: Accent: A Communication Oriented Network Operating System Kernel.
ABSTRACT: Accent is a communication oriented operating system kernel being built at Carnegie-Mellon University to support the distributed personal computing project, Spice, and the development of a fault-tolerant distributed sensor network (DSN). Accent is built around a single, powerful abstraction of communication between processes, with all kernel functions, such as device access and virtual memory management, accessible through messages and distributable throughout a network. In this paper, specific attention is given to system-supplied facilities that support transparent network access and fault-tolerant behavior. Many of these facilities are already being provided under a modified version of VAX/UNIX. The Accent system itself is currently being implemented on the Three Rivers Corp. PERQ. 01/1981
- Softw., Pract. Exper. 01/1985; 15:725-737.
- ABSTRACT: We have been developing high-level checkpoint and restart methods for Dome (Distributed Object Migration Environment), a C++ library of data-parallel objects that are automatically distributed using PVM. There are several levels of programming abstraction at which fault tolerance mechanisms can be designed: high-level, where the checkpoint and restart are built into our C++ objects, but the program structure is severely constrained; high-level with preprocessing, where a preprocessor inserts extra C++ statements into the code to facilitate checkpoint and restart; and low-level, where an interrupt periodically causes a memory image to be written out. Because we consider portability (both of our libraries and of the checkpoints they produce) to be an important goal, we focus on the higher-level checkpointing methods. In addition, we describe an implementation of high-level checkpointing, demonstrate it on multiple architectures, and show that it is efficient enough to provide good expected run times with low overhead, even in the case of frequent failures. 09/2001