Checkpointing in CosMiC: a User-level Process Migration Environment

Bell Labs., Lucent Technol., Murray Hill, NJ
02/1999; DOI: 10.1109/PRFTS.1997.640146
Source: CiteSeer

ABSTRACT The CosMiC system is a user-level process migration environment. Process migration is defined as the mechanism to checkpoint the state of an unfinished process, transfer the state from one machine to another, and resume process execution on the new machine. The main purposes of process migration are (1) to utilize the CPU power and balance load on all machines in an environment; (2) to provide fault tolerance by migrating a process from a failed machine to another machine. CosMiC provides an extensible architecture to allow an application to choose its own checkpointing mechanism. It is equipped with four checkpoint libraries, namely, libckp, libfcp, libft and libst. They provide different strategies for state saving and restoring. Libckp is a transparent checkpoint library that checkpoints the entire process state. It requires minimum user involvement and no modifications to the source code. Libfcp is a file checkpoint library that saves and restores file contents. Libft is a critical ...
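
The checkpoint, transfer, and resume cycle defined in the abstract can be illustrated with a minimal application-level sketch. This is not CosMiC's mechanism or any of its libraries: the function names (`run`, `checkpoint`, `resume`) are hypothetical, the state is a plain dictionary serialized with Python's `pickle`, and the "transfer" between machines is only simulated by handing the serialized bytes to another call.

```python
import pickle


def run(state, steps):
    """Advance a running sum by `steps` iterations, starting from `state`."""
    for _ in range(steps):
        state["total"] += state["i"]
        state["i"] += 1
    return state


def checkpoint(state):
    """Serialize the application state to bytes (the checkpoint)."""
    return pickle.dumps(state)


def resume(blob, steps):
    """Rebuild the state from a checkpoint and continue execution."""
    return run(pickle.loads(blob), steps)


# "Machine A" (assumed) runs the first half of the job and checkpoints it.
partial = run({"i": 0, "total": 0}, 5)
blob = checkpoint(partial)  # a real system would send these bytes over the network

# "Machine B" (assumed) resumes from the transferred bytes.
migrated = resume(blob, 5)

# Reference run with no migration, for comparison.
uninterrupted = run({"i": 0, "total": 0}, 10)
```

The migrated run and the uninterrupted run end in the same state, which is the property a migration mechanism has to preserve. A transparent library such as the libckp described above would capture the whole process state rather than a hand-picked dictionary.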

  • Source
    • "A large body of literature considers checkpointing and replaying the execution of processes, as means for intrusion detection, debugging, process migration, and fault tolerance [5] [15] [16] [23] [26] [34] [28] [10] [25] [41] [6] [27] [13] [47] [48]. However, none of them examine the data lifetime implications of checkpointing or replaying the execution."
    ABSTRACT: Virtual Machine (VM) checkpointing enables a user to capture a snapshot of a running VM on persistent storage. VM checkpoints can be used to roll back the VM to a previous "good" state in order to recover from a VM crash or to undo a previous VM activity. Although VM checkpointing eases systems administration and improves usability, it can also increase the risks of exposing sensitive information. This is because the checkpoint may store VM's physical memory pages that contain confidential information such as clear text passwords, credit card numbers, patients' health records, tax returns, etc. This paper presents the design and implementation of SPARC, a security and privacy aware checkpointing mechanism. SPARC enables users to selectively exclude processes and terminal applications that contain sensitive data from being checkpointed. Selective exclusion is performed by the hypervisor by sanitizing memory pages in the checkpoint file that belong to the excluded applications. We describe the design challenges in effectively tracking and excluding process-specific memory contents from the checkpoint file in a VM running the commodity Linux operating system. Our preliminary results show that SPARC imposes only 1% - 5.3% of overhead if most pages are dirty before checkpointing is performed.
  • Source
    • "When checkpointing, it replaces the original file with its replica by means of the atomic rename operation. Libfcp[2] deploys an "in-place update with undo logs" scheme, and the file is rolled back according to the undo logs on recovery. Libra[18] combines a "copy-on-change" strategy and undo log to record the parts that are really changed in order to reduce the log size."
    ABSTRACT: Checkpoint and Restart (CPR) is becoming critical to large scale parallel computers, whose Mean Time Between Failures (MTBF) may be much shorter than the execution times of the applications. The CPR mechanism should be able to store and recover the states of virtual memory, communication and files for the applications in a consistent way. However, many CPR tools ignore file states, which may cause errors for applications with file operations on recovery. Some CPR tools adopt library-based approaches or kernel-level file systems to deal with file states, but they only support limited types of file operations which are not sufficient for some applications. Moreover, many library-based approaches are not transparent to user applications because they wrap file APIs. Kernel-level file systems are difficult to deploy in production systems due to unnecessary overhead they may introduce to applications that do not need CPR. In this paper we propose a user-level file system, CprFS, to address these problems. As a file system, CprFS can guarantee transparency to user applications, and is convenient to support arbitrary file operations. It can be deployed on applications' demand to avoid intervention with other applications. Experimental results show that CprFS introduces acceptable overhead and has little impact on checkpointing systems.
    Proceedings of the 22nd Annual International Conference on Supercomputing, ICS 2008, Island of Kos, Greece, June 7-12, 2008; 01/2008
  • Source
    • "There are two common machine-independent data formats, XDR (External Data Representation) and UCF (Universal Checkpoint Format). For example, XDR was used in PVM and CosMic [20], [16] and UCF was used in Porch [17]. ASCII data representation has also been utilized [18]. "
    ABSTRACT: Heterogeneous computing environments, where computers may have different instruction set architectures, data representations, and operating systems, complicate checkpointing and recovery of processes. This paper describes an approach to recovery and an implementation, PREACHES, that provides portable checkpointing of single-process applications in heterogeneous systems using checkpoint propagation. The checkpoint propagation mechanism creates machine-dependent checkpoints for different architectures in the heterogeneous environment. A process is restored on a specific machine with the checkpoint that is appropriate for the architecture. An implementation of PREACHES has been evaluated on a heterogeneous network of workstations, including Sun, HP, and Pentium machines. The experimental results show that PREACHES achieves efficient checkpointing and rapid recovery.
    IEEE Transactions on Computers 03/2003; 52(2):126-138. DOI:10.1109/TC.2003.1176981 · 1.66 Impact Factor
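
The CprFS excerpt above describes Libfcp as using in-place updates with undo logs, with the file rolled back from the logs on recovery. The following is a rough sketch of that general idea only, not Libfcp's actual interface: the class `UndoLoggedFile` and its methods are hypothetical, and the undo log is kept in memory for brevity, whereas a log meant to survive a crash would have to be written to stable storage.

```python
import os
import tempfile


class UndoLoggedFile:
    """Hypothetical helper: in-place writes guarded by an in-memory undo log."""

    def __init__(self, path):
        self.path = path
        self.undo_log = []  # (offset, old_bytes, old_size) since the last checkpoint

    def checkpoint(self):
        """Mark the current file contents as the recovery point."""
        self.undo_log.clear()

    def write_at(self, offset, data):
        """Overwrite bytes in place, first logging what they replace."""
        old_size = os.path.getsize(self.path)
        with open(self.path, "r+b") as f:
            f.seek(offset)
            old = f.read(len(data))
            self.undo_log.append((offset, old, old_size))
            f.seek(offset)
            f.write(data)

    def rollback(self):
        """Undo writes in reverse order, restoring the checkpointed contents."""
        with open(self.path, "r+b") as f:
            while self.undo_log:
                offset, old, old_size = self.undo_log.pop()
                f.seek(offset)
                f.write(old)
                f.truncate(old_size)


def demo():
    """Write past a checkpoint, then roll back; returns both file images."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:
            f.write(b"hello world")
        logged = UndoLoggedFile(path)
        logged.checkpoint()
        logged.write_at(6, b"there, friend")
        with open(path, "rb") as f:
            modified = f.read()
        logged.rollback()
        with open(path, "rb") as f:
            restored = f.read()
        return modified, restored
    finally:
        os.remove(path)
```

Calling `demo()` returns the file contents after the write and after the rollback. Logging only the overwritten bytes keeps the log proportional to the amount of data changed, which is the trade-off against the replica-plus-atomic-rename approach mentioned in the same excerpt.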