Article

The Evolution of Condor Checkpointing

Abstract

this paper was successfully used by Condor for several years. By eliminating assembly language in favor of "standard" Unix facilities such as core dumps and setjmp/longjmp, it greatly eased the difficulty of porting Condor to new platforms. However, the mechanism was found wanting for a variety of reasons, some of which were anticipated in the original paper, and evolved into the algorithm described below. This note describes Version 6 of Condor, released in early 1998. A more detailed description of Condor checkpointing as of Version 5 has been published elsewhere [1].

... It contains support for incremental checkpoints, in which only pages that have been modified since the last checkpoint are saved. Condor [25, 26] is another system that provides checkpointing services for single process jobs on a number of Unix platforms. The CRAK (Checkpoint/Restart As a Kernel module) project [45] provides a kernel implementation of checkpoint/restart for Linux. ...
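The idea of an incremental checkpoint in the snippet above can be illustrated with a toy model. This is a minimal sketch, not the mechanism of any system named here: the `PagedMemory` class, the `restore` helper, and the 4096-byte page size are all hypothetical names and values chosen for illustration, and dirty pages are tracked in software rather than through page protection.

```python
PAGE_SIZE = 4096  # assumed page size, chosen only for illustration


class PagedMemory:
    """Toy address space that records which pages were written since the last checkpoint."""

    def __init__(self, num_pages):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        # Before the first checkpoint every page counts as modified.
        self.dirty = set(range(num_pages))

    def write(self, addr, data):
        """Copy data into memory byte by byte and mark each touched page dirty."""
        for offset, byte in enumerate(data):
            page, index = divmod(addr + offset, PAGE_SIZE)
            self.pages[page][index] = byte
            self.dirty.add(page)

    def checkpoint(self):
        """Return a snapshot of only the pages modified since the previous checkpoint."""
        saved = {page: bytes(self.pages[page]) for page in sorted(self.dirty)}
        self.dirty.clear()
        return saved


def restore(num_pages, checkpoints):
    """Rebuild memory by replaying a full checkpoint followed by its increments, in order."""
    memory = PagedMemory(num_pages)
    for saved in checkpoints:
        for page, content in saved.items():
            memory.pages[page][:] = content
    return memory


mem = PagedMemory(4)
full = mem.checkpoint()            # first checkpoint saves all four pages
mem.write(PAGE_SIZE + 10, b"hi")   # modifies page 1 only
incremental = mem.checkpoint()     # second checkpoint saves just that page
restored = restore(4, [full, incremental])
```

The trade-off the sketch makes visible is that each increment is small, but a restart has to replay the whole chain of checkpoints.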
Article
As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI.
... • Library level: There are libraries for checkpointing, such as Libckpt [92] and Condor libraries [93]. To use this kind of checkpointing, developers have to recompile the source code by including the checkpointing library in their program. ...
Thesis
Resource sharing environments enable sharing and aggregation of resources across several resource providers. InterGrid provides an architecture for resource sharing based on virtual machine technology between Grids. Resource providers in InterGrid serve their local requests as well as external requests assigned by InterGrid. However, resource providers would like to ensure that the requirements of their local requests are not delayed because of running external requests. This scenario leads to contention for resources between the external and local requests. In this dissertation, preemption mechanism is considered to resolve the contention, while side-effects of this mechanism are taken into account. Particularly, the number of preempted external requests, their waiting time, and imposed overhead of preemption are considered. Therefore, this dissertation investigates and categorises mechanisms for management of resource contention in the existing systems. Then, it presents a contention management scheme that includes two main strategies. The first strategy avoids the contentious situation by establishing contention-awareness in the scheduling policies. The second strategy, handles contention side-effects while considering long waiting time and energy consumption issues. These strategies are proposed within different architectural elements of the InterGrid platform. In this dissertation, first feasibility of the preemption mechanism to resolve resource contention is presented, then overhead time imposed for performing various preemption scenarios are modelled, and different policies to minimise the side-effects of resource contention are proposed. To avoid resource contention, a scheduling policy is proposed in gateway (meta-scheduling) level, that proactively disseminates external requests on resource providers. Also, a dispatch policy is proposed to decrease the likelihood of resource contention for more valuable external users.
To prevent long waiting time for external requests, an admission control policy is proposed to limit the number of accepted external requests when there is a surge in demand. Then, a contention-aware energy management policy is proposed to adapt energy consumption of resource providers to user demand. This policy is for situation that resource providers operate at low utilisation and it considers long waiting time for external requests. Performance evaluations of the strategies are achieved using discrete-event simulation. This dissertation also realises the proposed scheme in InterGrid.
... Libckpt [105] is an open source library for transparent checkpointing of Unix processes. Condor [87,88,89,90,135] is another system that provides checkpointing services for single process jobs on a number of Unix platforms. The CRAK (Checkpoint/Restart As a Kernel module) project [152,153] provides a kernel implementation of checkpoint/restart for Linux. ...
... Libckpt [104] is an open source library for transparent checkpointing of Unix processes. Condor [86,87,88,89,134] is another system that provides checkpointing services for single process jobs on a number of Unix platforms. The CRAK (Checkpoint/Restart As a Kernel module) project [151,152] provides a kernel implementation of checkpoint/restart for Linux. ...
... A traditional and widely used error recovery mechanism is to reboot the system, with repair applied during the reboot if necessary to bring the system back up successfully [30]. Mechanisms such as fast reboots [51] and checkpointing [41,42] can improve the performance of the basic reboot process. ...
Article
We present a new technique, failure-oblivious computing, that enables servers to execute through memory errors without memory corruption. Our safe compiler for C inserts checks that dynamically detect invalid memory accesses. Instead of terminating or throwing an exception, the generated code simply discards invalid writes and manufactures values to return for invalid reads, enabling the server to continue its normal execution path. We have applied failure-oblivious computing to a set of widely-used servers from the Linux-based open-source computing environment. Our results show that our techniques 1) make these servers invulnerable to known security attacks that exploit memory errors, and 2) enable the servers to continue to operate successfully to service legitimate requests and satisfy the needs of their users even after attacks trigger their memory errors. We observed several reasons for this successful continued execution. When the memory errors occur in irrelevant computations, failure-oblivious computing enables the server to execute through the memory errors to continue on to execute the relevant computation. Even when the memory errors occur in relevant computations, failure-oblivious computing converts requests that trigger unanticipated and dangerous execution paths into anticipated invalid inputs, which the error-handling logic in the server rejects. Because servers tend to have small error propagation distances (localized errors in the computation for one request tend to have little or no effect on the computations for subsequent requests), redirecting reads that would otherwise cause addressing errors and discarding writes that would otherwise corrupt critical data structures (such as the call stack) localizes the effect of the memory errors, prevents addressing exceptions from terminating the computation, and enables the server to continue on to successfully process subsequent requests. 
The overall result is a substantial extension of the range of requests that the server can successfully process.
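The behaviour described in the abstract above (discard invalid writes, manufacture values for invalid reads) can be sketched at the level of a single buffer. This is an illustrative model only, not the authors' compiler: the `FailureObliviousBuffer` class is hypothetical, and the cycle of manufactured values (0, 1, 2, ...) is an arbitrary assumption made for the example.

```python
class FailureObliviousBuffer:
    """Toy fixed-size buffer that keeps running through out-of-bounds accesses."""

    def __init__(self, size):
        self._data = [0] * size
        self._manufactured = 0  # how many values have been invented so far

    def _in_bounds(self, index):
        return 0 <= index < len(self._data)

    def write(self, index, value):
        """Store the value if the index is valid; otherwise discard the write."""
        if self._in_bounds(index):
            self._data[index] = value
        # An invalid write is dropped, so neighbouring state is never corrupted.

    def read(self, index):
        """Return the stored value, or a manufactured one for an invalid index."""
        if self._in_bounds(index):
            return self._data[index]
        # Assumption: cycle through 0, 1, 2 so that repeated invalid reads vary.
        value = self._manufactured % 3
        self._manufactured += 1
        return value


buf = FailureObliviousBuffer(4)
buf.write(2, 7)        # valid write
buf.write(10, 99)      # invalid write, silently discarded
valid = buf.read(2)
first_invalid = buf.read(10)
second_invalid = buf.read(10)
```

Note that the discarded value 99 is never seen again; only in-bounds data survives, which is the property that keeps an error in one request from affecting later ones.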
... It contains support for incremental checkpoints, in which only pages that have been modified since the last checkpoint are saved. Condor [21, 22] is another system that provides checkpointing services for single process jobs on a number of Unix platforms. The CRAK (Checkpoint/Restart As a Kernel module) project [40] provides a kernel implementation of checkpoint/restart for Linux. ...
Article
As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI.
... Now ChaRM64 for PVM has been implemented in Linux. Besides checkpointing the IA-64 process, some other key technologies, including process ID mapping, functions wrapping and renaming, exit/rejoin mechanism and signal/message notification, are employed [6]. ...
Conference Paper
We design and implement a high availability parallel run-time system, ChaRM64, a Checkpoint-based Rollback Recovery and Migration system for parallel running programs on a cluster of IA-64 computers. At first, we discuss our solution of a user-level, single process checkpoint/recovery library running on IA-64 systems. Based on this library, ChaRM64 is realized, which implements a user-transparent, coordinated checkpointing and rollback recovery (CRR) mechanism, quasi-asynchronous migration and the dynamic reconfiguration function. Owing to the above techniques and efficient error detection, ChaRM64 can handle cluster node crashes and hardware transient faults in an IA-64 cluster. Now ChaRM64 for PVM has been implemented in Linux and the MPI version is under construction. As we know, there are few similar projects accomplished for IA-64 architecture.
... That work can be classified into two categories: nondistributed and distributed. The most prominent examples for non-distributed solutions include BLCR [8], Condor [13], libCkpt [19] and OpenVZ [2]. These approaches focus on single node and either save the local state of a process or encapsulate the processes within a container (e.g. ...
Article
The EU-funded XtreemOS project implements an open-source grid operating system based on Linux. In order to provide fault tolerance and migration for grid applications, it integrates a distributed grid-checkpointing service called XtreemGCP. This service is designed to support different checkpointing protocols and to address the underlying grid-node checkpointers (e.g. BLCR, LinuxSSI, OpenVZ, etc.) in a transparent manner through a uniform interface. In this paper, we present the integration of an independent checkpointing and rollback-recovery protocol into the XtreemGCP. The solution we propose is not checkpointer bound and thus can be transparently used on top of any grid-node checkpointer. To evaluate the prototype we run it within a heterogeneous environment composed of single-PC nodes and a Single System Image (SSI) cluster. The experimental results demonstrate the capability of the XtreemGCP service to integrate different checkpointing protocols and independently checkpoint a distributed application within a heterogeneous grid environment. Moreover, the performance evaluation also shows that our solution outperforms the existing coordinated checkpointing protocol in terms of scalability.
... The traditional error recovery mechanism is to reboot the system, with repair applied during the reboot if necessary to bring the system back up successfully [35]. Mechanisms such as fast reboots [50], checkpointing [44, 45] , and partial system restarts [28] can improve the performance of the reboot process. Hardware redundancy is the standard solution for increased availability. ...
Article
We present a new technique, failure-oblivious computing, that enables programs to continue to execute through memory errors without memory corruption. Our safe compiler for C inserts checks that dynamically detect invalid memory accesses. Instead of terminating the execution or throwing an exception, the generated code simply discards invalid writes and manufactures values to return for invalid reads, enabling the program to continue its normal execution. We have applied failure-oblivious computing to a set of widely-used programs that are part of the Linux-based open source interactive computing environment. Our results show that our techniques 1) make these programs invulnerable to known security attacks that exploit memory errors, and 2) enable the programs to continue to operate successfully to service legitimate requests and satisfy the needs of their users even after attacks trigger their memory errors.
... The traditional error recovery mechanism is to reboot the system, with repair applied during the reboot if necessary to bring the system back up successfully [21]. Mechanisms such as fast reboots [34], checkpointing [26], [27], and partial system restarts [17] can improve the performance of the reboot process. Hardware redundancy is the standard solution for increased availability. ...
Article
Memory errors are a common cause of incorrect software execution and security vulnerabilities. We have developed two new techniques that help software continue to execute successfully through memory errors: failure-oblivious computing and boundless memory blocks. The foundation of both techniques is a compiler that generates code that checks accesses via pointers to detect out of bounds accesses. Instead of terminating or throwing an exception, the generated code takes another action that keeps the program executing without memory corruption. Failure-oblivious code simply discards invalid writes and manufactures values to return for invalid reads, enabling the program to continue its normal execution path. Code that implements boundless memory blocks stores invalid writes away in a hash table to return as the values for corresponding out of bounds reads. The net effect is to (conceptually) give each allocated memory block unbounded size and to eliminate out of bounds accesses as a programming error. We have implemented both techniques and acquired several widely used open source servers (Apache, Sendmail, Pine, Mutt, and Midnight Commander). With standard compilers, all of these servers are vulnerable to buffer overflow attacks as documented at security tracking web sites. Both failure-oblivious computing and boundless memory blocks eliminate these security vulnerabilities (as well as other memory errors). Our results show that our compiler enables the servers to execute successfully through buffer overflow attacks to continue to correctly service user requests without security vulnerabilities. Singapore-MIT Alliance (SMA)
... Libckpt [105] is an open source library for transparent checkpointing of Unix processes. Condor [87,88,89,90,135] is another system that provides checkpointing services for single process jobs on a number of Unix platforms. The CRAK (Checkpoint/Restart As a Kernel module) project [152,153] provides a kernel implementation of checkpoint/restart for Linux. ...
Article
Thesis (Ph. D.)--University of Notre Dame, 2003. Thesis directed by Andrew Lumsdaine for the Department of Computer Science and Engineering. "April 2004." Includes bibliographical references (leaves 301-312). Electronic reproduction.
... The traditional error recovery mechanism is to reboot the system, with repair applied during the reboot if necessary to bring the system back up successfully [12] . Mechanisms such as fast reboots [31], checkpointing [17, 18], and partial system restarts [5] can improve the performance of the reboot process. Hardware redundancy is the standard solution for increased availability. ...
Article
We present a new technique, boundless memory blocks, that automatically eliminates buffer overflow errors, enabling programs to continue to execute through memory errors without memory corruption. Buffer overflow vulnerabilities are caused by programming errors that allow an attacker to cause the program to write beyond the bounds of an allocated memory block to corrupt other data structures. The standard way to exploit a buffer overflow vulnerability involves a request that is too large for the buffer intended to hold it. The buffer overflow error causes the program to write part of the request beyond the bounds of the buffer, corrupting the address space of the program and causing the program to execute injected code contained in the request. Our boundless memory blocks compiler inserts checks that dynamically detect all out of bounds accesses. When it detects an out of bounds write, it stores the value away in a hash. Our compiler can then return the stored value as the result of an out of bounds read to that address. In the case of uninitialized addresses, our compiler simply returns a predefined value. We have acquired several widely used open source applications (Apache, Sendmail, Pine, Mutt, and Midnight Commander). With standard compilers, all of these applications are vulnerable to buffer overflow attacks as documented at security tracking web sites. Instead, our compiler enables the applications to execute successfully through buffer overflow attacks to continue to correctly service user requests without security vulnerabilities. We have also found that only one application contains uninitialized reads, which means that in most cases, the net effect of our compiler is to (conceptually) give each allocated memory block unbounded size and to eliminate out of bounds accesses as a programming error.
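The storage scheme in the abstract above can be modelled in a few lines. This is a hedged sketch rather than the authors' compiler-generated code: the `BoundlessBlock` class and its `DEFAULT` value of 0 are hypothetical, and a Python dictionary stands in for the hash table that holds out-of-bounds writes.

```python
class BoundlessBlock:
    """Toy memory block that behaves as if it had unbounded size."""

    DEFAULT = 0  # assumed predefined value returned for uninitialized addresses

    def __init__(self, size):
        self._data = [self.DEFAULT] * size
        self._overflow = {}  # out-of-bounds index -> value written there

    def _in_bounds(self, index):
        return 0 <= index < len(self._data)

    def write(self, index, value):
        """Store in-bounds writes in the block and out-of-bounds writes in the table."""
        if self._in_bounds(index):
            self._data[index] = value
        else:
            self._overflow[index] = value

    def read(self, index):
        """Return the value last written at this index, wherever it was stored."""
        if self._in_bounds(index):
            return self._data[index]
        return self._overflow.get(index, self.DEFAULT)


block = BoundlessBlock(4)
block.write(1, 11)       # ordinary in-bounds write
block.write(9, 42)       # out-of-bounds write, kept in the side table
in_bounds = block.read(1)
stored_overflow = block.read(9)
never_written = block.read(100)
```

Unlike a scheme that discards invalid writes, the overflowed value is returned by a later read of the same index, so a request that is too large for its buffer can still be read back intact.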
... It contains support for incremental checkpoints, in which only pages that have been modified since the last checkpoint are saved. Condor [9] is another system that provides checkpointing services for single-process-jobs on a number of UNIX platforms. The CRAK (Checkpoint/Restart As a Kernel module) project [10] provides a kernel implementation of checkpoint/restart for Linux. ...
Conference Paper
As the clusters continue to grow in size and popularity, issues of fault tolerance and reliability turn into limiting factors on application scalability and system availability. To address these issues, we design and implement a high availability parallel run-time system - ChaRM64 for MPI, a checkpoint-based rollback recovery and migration system for MPI programs on a cluster of IA-64 computers. Our approach integrates MPICH with a user-level, single process checkpoint/recovery library for IA-64 Linux, and modifies P4 libraries to implement a coordinated checkpointing and rollback recovery (CRR) and migration mechanism for parallel applications. In addition, the CRR of file operations is supported. Testing shows negligible performance overhead introduced by the CRR mechanism in our implementation.
... The traditional error recovery mechanism is to reboot the system, with repair applied during the reboot if necessary to bring the system back up successfully [23]. Mechanisms such as fast reboots [39], checkpointing [30, 31] , and partial system restarts [17] can improve the performance of the reboot process. Hardware redundancy is the standard solution for increased availability. ...
Conference Paper
Buffer overflow vulnerabilities are caused by programming errors that allow an attacker to cause the program to write beyond the bounds of an allocated memory block to corrupt other data structures. The standard way to exploit a buffer overflow vulnerability involves a request that is too large for the buffer intended to hold it. The buffer overflow error causes the program to write part of the request beyond the bounds of the buffer, corrupting the address space of the program and causing the program to execute injected code contained in the request. We have implemented a compiler that inserts dynamic checks into the generated code to detect all out of bounds memory accesses. When it detects an out of bounds write, it stores the value away in a hash table to return as the value for corresponding out of bounds reads. The net effect is to (conceptually) give each allocated memory block unbounded size and to eliminate out of bounds accesses as a programming error. We have acquired several widely used open source servers (Apache, Sendmail, Pine, Mutt, and Midnight Commander). With standard compilers, all of these servers are vulnerable to buffer overflow attacks as documented at security tracking Web sites. Our compiler eliminates these security vulnerabilities (as well as other memory errors). Our results show that our compiler enables the servers to execute successfully through buffer overflow attacks to continue to correctly service user requests without security vulnerabilities.
Conference Paper
Virtual machine, which typically consists of a guest operating system (OS) and its serial applications, can be checkpointed, migrated to another cluster node, and restarted later to its previous saved state. However, to date, it is nontrivial to provide checkpoint-restart mechanisms with the same level of transparency for distributed applications running on a cluster of virtual machines. To address this particular issue, we have created the Virtual Cluster CheckPointing (VCCP) system, a novel system for transparent coordinated checkpoint-restart of virtual machines and its distributed application on commodity clusters. In this paper, we detail the design and implementation of the VCCP system. Our VCCP prototype extends the open source QEMU system with kqemu module by implementing hypervisor-based Coordinated Checkpoint-Restart protocols. To verify and validate our prototype, we measured its performance using the NAS parallel benchmark. Our experimental results indicate that VCCP generates less than 1% of additional execution overhead for non-communication intensive parallel applications. Furthermore, our correctness analysis shows that VCCP does not cause message loss or reordering, which is a necessary property to ensure correctness of checkpoint-restart mechanism. Finally, we believe that VCCP is a promising checkpoint-restart alternative for legacy applications that have implemented traditional process-level checkpoint-restart.
Conference Paper
The EU-funded XtreemOS project implements a grid operating system transparently exploiting resources of virtual organizations through the standard POSIX interface. Grid checkpointing and restart requires to save and restore jobs executing in a distributed heterogeneous grid environment. The latter may spawn millions of grid nodes (PCs, clusters, and mobile devices) using different system-specific checkpointers saving and restoring application and kernel data structures for processes executing on a grid node. In this paper we shortly describe the XtreemOS grid checkpointing architecture and how we bridge the gap between the abstract grid and the system-specific checkpointers. Then we discuss how we keep track of processes and how different process grouping techniques are managed to ensure that all processes of a job and any further dependent ones can be checkpointed and restarted. Finally, we present how Linux control groups can be used to address resource isolation issues during the restart.
Conference Paper
We present ClearView, a system for automatically patching errors in deployed software. ClearView works on stripped Windows x86 binaries without any need for source code, debugging information, or other external information, and without human intervention. ClearView (1) observes normal executions to learn invariants that characterize the application’s normal behavior, (2) uses error detectors to monitor the execution to detect failures, (3) identifies violations of learned invariants that occur during failed executions, (4) generates candidate repair patches that enforce selected invariants by changing the state or the flow of control to make the invariant true, and (5) observes the continued execution of patched applications to select the most successful patch. ClearView is designed to correct errors in software with high availability requirements. Aspects of ClearView that make it particularly
Article
As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. We integrated a user-level checkpointing and rollback recovery (CRR) library into LAM/MPI, a high performance implementation of the Message Passing Interface (MPI), to improve its availability. Compared with the current CRR implementation of LAM/MPI, our work supports file checkpointing and is more portable, running on more platforms including IA32 and IA64 Linux. In addition, testing shows that the CRR mechanism of our implementation introduces less than 15% performance overhead.
Conference Paper
The EU-funded XtreemOS project implements a grid operating system (OS) transparently exploiting distributed resources through the SAGA and POSIX interfaces. XtreemOS uses an integrated grid checkpointing service (XtreemGCP) for implementing migration and fault tolerance. Checkpointing and restarting applications in a grid requires saving and restoring applications in a distributed heterogeneous environment. The latter may spawn millions of grid nodes using different system-specific checkpointers saving and restoring application and kernel data structures on a grid node. In this paper we present the architecture of the XtreemGCP service integrating existing checkpointing solutions. Our architecture is open to support different checkpointing strategies that can be adapted according to evolving failure situations or changing application requirements. We propose to bridge the gap between grid semantics and system-specific checkpointers by introducing a common kernel checkpointer API that allows using different checkpointers in a uniform way. Furthermore, we discuss other grid related checkpointing issues including resource conflicts during restart, security, and checkpoint file management. Although this paper presents a solution within the XtreemOS context it can be applied to any other grid middleware or distributed OS, too.
Article
this reporting is to be turned on (1) or turned off (0) for subsequent calls. A value of (2) will cause the program to exit after printing the error message (not implemented in 3.2). The default is reporting turned on. PvmOutputTid: For this option val is the stdout device for children. All the standard output from the calling task and any tasks it spawns will be redirected to the specified device. Val is the tid of a PVM task or pvmd. The default val of 0 redirects stdout to master host, which writes to the log file /tmp/pvml.<uid>
Article
This report is the PVM version 2.3 users' guide. It contains an overview of PVM and how it is installed and used. Example programs in C and Fortran are included. PVM stands for Parallel Virtual Machine. It is a software package that allows the utilization of a heterogeneous network of parallel and serial computers as a single computational resource. PVM consists of two parts: a daemon process that any user can install on a machine, and a user library that contains routines for initiating process on other machines, for communicating between processes, and synchronizing process. 1 refs., 3 figs., 3 tabs.