Checkpointing in CosMiC: a User-level Process Migration Environment

Bell Labs., Lucent Technol., Murray Hill, NJ
02/1999; DOI: 10.1109/PRFTS.1997.640146
Source: CiteSeer

ABSTRACT The CosMiC system is a user-level process migration environment. Process migration is defined as the mechanism to checkpoint the state of an unfinished process, transfer the state from one machine to another, and resume process execution on the new machine. The main purposes of process migration are (1) to utilize the CPU power and balance the load across all machines in an environment, and (2) to provide fault tolerance by migrating a process from a failed machine to another machine. CosMiC provides an extensible architecture that allows an application to choose its own checkpointing mechanism. It is equipped with four checkpoint libraries, namely libckp, libfcp, libft and libst, which provide different strategies for saving and restoring state. Libckp is a transparent checkpoint library that checkpoints the entire process state. It requires minimal user involvement and no modifications to the source code. Libfcp is a file checkpoint library that saves and restores file contents. Libft is a critical ...
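
The checkpoint, transfer and resume cycle described in the abstract can be illustrated with a minimal Python sketch. It is not CosMiC's implementation and uses none of its libraries: it saves only an application-chosen state dictionary rather than the entire process state that libckp captures, and a local file copy stands in for the transfer between machines. All function names (`checkpoint`, `transfer`, `resume`, `run`, `migrate_demo`) and file names are hypothetical.

```python
import os
import pickle
import shutil
import tempfile


def checkpoint(state, path):
    """Save application state atomically: write a temp file, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to stable storage before renaming
    os.replace(tmp, path)     # readers see either the old or the new checkpoint


def transfer(src, dst):
    """Assumed stand-in for shipping the checkpoint to another machine."""
    shutil.copyfile(src, dst)


def resume(path):
    """Load the saved state so execution can continue from it."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run(state, steps, stop_after=None):
    """Hypothetical workload: sum the integers below `steps`, optionally stopping early."""
    done = 0
    while state["i"] < steps:
        if stop_after is not None and done >= stop_after:
            break  # simulate the point at which the process is migrated
        state["total"] += state["i"]
        state["i"] += 1
        done += 1
    return state


def migrate_demo(workdir):
    """Run part of the job, checkpoint, 'move' the state, and finish elsewhere."""
    src = os.path.join(workdir, "host_a.ckpt")
    dst = os.path.join(workdir, "host_b.ckpt")
    state = run({"i": 0, "total": 0}, steps=10, stop_after=4)
    checkpoint(state, src)
    transfer(src, dst)
    restored = resume(dst)
    return run(restored, steps=10)
```

Writing to a temporary file and renaming it is a common way to ensure that a failure during checkpointing never leaves a half-written checkpoint in place of the previous good one.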

  • Source
    ABSTRACT: A checkpoint and recovery facility periodically saves the process state to stable storage so that, after a failure, a user program can be restored to the state of its most recent checkpoint. Such a facility is especially important for long-running processes, because it prevents a failure from destroying the intermediate results they have generated. In this paper we present Kckpt, a checkpoint and recovery facility in the UnixWare kernel, and compare Kckpt with Libckpt, a user-level checkpoint library. Using Kckpt, UnixWare can provide totally user-transparent checkpointing as well as user-directed checkpointing. In totally user-transparent checkpointing, no modification of the source code is needed to take a checkpoint. The checkpoint overheads of Kckpt are significantly lower than those of the previously developed user-level checkpoint library. Keywords: Checkpoint, Recovery, Totally User-transparent Checkpoint, Forked Checkpoint, UnixWare 1 Introduction Checkpoint and Reco...
  • Source
    ABSTRACT: In the past twenty years, there has been a wealth of theoretical research on minimizing the expected running time of a program in the presence of failures by employing checkpointing and rollback recovery. In the same time period, there has been little experimental research to corroborate these results. We study three separate projects that monitor failure in workstation networks. Our goals are twofold. The first is to see how these results correlate with the theoretical results, and the second is to assess their impact on strategies for checkpointing long-running computations on workstations and networks of workstations. A significant result of our work is that, although the base assumptions of the theoretical research do not hold, many of the results are still applicable.
    Fault-Tolerant Computing, 1998. Digest of Papers. Twenty-Eighth Annual International Symposium on; 07/1998
  • Source
    ABSTRACT: Not Available
    IEEE Transactions on Reliability 01/2000; 48(4):315-324. · 2.29 Impact Factor