Kemari: virtual machine synchronization for fault tolerance

Article · January 2008
    • The required logging bandwidth is very large, since all incoming network packets must be sent on that channel, making it a potential bottleneck. In checkpointing replication [Cully et al. 2008] (Remus, extending Xen) [Kihara and Moriai 2008] (Kemari, extending KVM), the primary VM's changed memory data are periodically sent to the passive VM (Figure 6.a) to update its memory, keeping it consistent with the active VM. During the checkpointing phase, the active VM waits for the passive VM to acknowledge the received data before continuing its operation. A minimal sketch of this checkpointing loop follows this entry.
    ABSTRACT: We study virtual machine live migration (LM) and disaster recovery (DR) from a networking perspective, considering long-distance networks, for example, between data centers. These networks are usually constrained by limited available bandwidth, increased latency and congestion, or a high cost of use when dedicated network resources are used, while their exact characteristics cannot be controlled. LM and DR present several challenges due to the large amounts of data that need to be transferred over long-distance networks, which increase with the number of migrated or protected resources. In this context, our work presents the way LM and DR are currently being performed and their operation in long-distance networking environments, discussing related issues and bottlenecks and surveying other works. We also present the way networks are evolving today and the new technologies and protocols (e.g., software-defined networking, or SDN, and flexible optical networks) that can be used to boost the efficiency of LM and DR over long distances. Traffic redirection in a long-distance environment is also an important part of the whole equation, since it directly affects the transparency of LM and DR. Related works and solutions from both academia and industry are presented.
    Full-text · Article · Jul 2016
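The checkpointing loop described in the context above can be illustrated with a short sketch. It is a minimal, illustrative model only: the names (DirtyTracker, Channel, EPOCH_MS) and the 25 ms interval are assumptions made for this sketch, not details taken from Remus or Kemari.

```python
# Minimal sketch (illustrative names and values only) of one checkpointing
# epoch: the primary runs for an epoch, is paused while its dirty pages are
# drained and sent, and resumes only after the backup acknowledges them.
import time

EPOCH_MS = 25  # assumed checkpoint interval, purely illustrative


class DirtyTracker:
    """Stands in for the hypervisor's dirty-page log on the primary."""

    def __init__(self):
        self.dirty = {}  # page number -> page contents

    def write(self, page_no, data):
        self.dirty[page_no] = data

    def drain(self):
        pages, self.dirty = self.dirty, {}
        return pages


class Channel:
    """Stands in for the replication link to the passive (backup) VM."""

    def __init__(self, backup_memory):
        self.backup_memory = backup_memory

    def send_and_wait_ack(self, pages):
        # In a real system this is a network round trip; here the backup
        # applies the pages and the call returning models the acknowledgement.
        self.backup_memory.update(pages)
        return True


def run_epochs(tracker, channel, epochs=1):
    for _ in range(epochs):
        time.sleep(EPOCH_MS / 1000)       # primary executes for one epoch
        pages = tracker.drain()           # primary paused: collect dirty pages
        channel.send_and_wait_ack(pages)  # block until the backup acknowledges
        # only after the ack may the primary resume and release its outputs


if __name__ == "__main__":
    backup = {}
    tracker, channel = DirtyTracker(), Channel(backup)
    tracker.write(0x10, b"guest page contents")
    run_epochs(tracker, channel)
    print(backup)  # the backup's memory now mirrors the primary's dirty pages
```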
    • In contrast, our model focuses on a primary-backup scheme for VM replication that does not require execution on all replica VMs. Kemari [11] is another approach that uses both lock-stepping and continuous checkpointing. It synchronizes the primary and secondary VMs just before the primary VM has to send an event to devices, such as storage and networks; a minimal sketch of this sync-before-output behaviour follows this entry.
    ABSTRACT: Applications are increasingly being deployed in the cloud due to benefits stemming from economy of scale, scalability, flexibility and utility-based pricing model. Although most cloud-based applications have hitherto been enterprise-style, there is an emerging need for hosting real-time streaming applications in the cloud that demand both high availability and low latency. Contemporary cloud computing research has seldom focused on solutions that provide both high availability and real-time assurance to these applications in a way that also optimizes resource consumption in data centers, which is a key consideration for cloud providers. This paper makes three contributions to address this dual challenge. First, it describes an architecture for a fault-tolerant framework that can be used to automatically deploy replicas of virtual machines in data centers in a way that optimizes resources while assuring availability and responsiveness. Second, it describes the design of a pluggable framework within the fault-tolerant architecture that enables plugging in different placement algorithms for VM replica deployment. Third, it illustrates the design of a framework for real-time dissemination of resource utilization information using a real-time publish/subscribe framework, which is required by the replica selection and placement framework. Experimental results using a case study that involves a specific replica placement algorithm are presented to evaluate the effectiveness of our architecture.
    Article · Oct 2014
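The sync-before-output behaviour attributed to Kemari in the context above can be sketched as follows. The class and method names (SyncedVM, send_to_secondary, emit_device_event) are assumptions made for illustration, not Kemari's actual interfaces; the sketch only models the idea of pushing pending state to the secondary before any device event becomes externally visible.

```python
# Illustrative sketch (assumed names, not Kemari's real interface) of
# synchronizing with the secondary VM just before an event reaches a device,
# so that a failover can never expose state the secondary has not seen.
class SyncedVM:
    def __init__(self, send_to_secondary, device):
        self.send_to_secondary = send_to_secondary  # callable: state delta -> ack
        self.device = device                        # callable: device event sink
        self.pending_delta = {}

    def modify_state(self, key, value):
        # Primary state diverges from the secondary until the next sync point.
        self.pending_delta[key] = value

    def emit_device_event(self, event):
        # Synchronize *before* the event becomes visible outside the VM.
        if self.pending_delta:
            self.send_to_secondary(dict(self.pending_delta))
            self.pending_delta.clear()
        self.device(event)


if __name__ == "__main__":
    secondary_state = {}
    vm = SyncedVM(secondary_state.update, print)
    vm.modify_state("disk_block_42", b"new data")
    vm.emit_device_event("write disk_block_42")  # secondary is updated first
    print(secondary_state)
```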
    • Kemari (Tamura et al., 2008) is a cluster system that tries to keep VMs transparently running in the event of hardware failures. Kemari uses a primary-backup approach, so any storage or network event that changes the state of the primary VM must be synchronized to the backup VM.
    ABSTRACT: Disaster recovery is a persistent problem in IT platforms. This problem is even more crucial in cloud computing, because Cloud Service Providers (CSPs) have to provide services to their customers even if the data center is down due to a disaster. In the past few years, researchers have shown interest in disaster recovery using cloud computing, and a considerable amount of literature has been published in this area. However, to the best of our knowledge, there is a lack of a precise survey offering a detailed analysis of cloud-based disaster recovery. To fill this gap, this paper provides an extensive survey of disaster recovery concepts and research in cloud environments. We present different taxonomies of disaster recovery mechanisms, the main challenges and proposed solutions. We also describe cloud-based disaster recovery platforms and identify open issues related to disaster recovery.
    Full-text · Article · Sep 2014
    • Virtual machine hot standby is one realization of the "Primary-Backup" mechanism. There are three ways to implement virtual machine hot standby on Xen: (1) continuous storage with a shared hard disk, event-driven, triggering synchronization of the two virtual machines on file reads and writes [3]. This method completes synchronization by suspending the primary virtual machine (VM) for a millisecond-level interval. A sketch of the double shadow page table workflow described in the abstract below follows this entry.
    ABSTRACT: A Double Shadow Page Tables method based on the Most Recently Used (MRU) algorithm is presented to address the output delay that lowers availability in virtual machine hot standby. This method uses a checkpointing mechanism; the workflow of the primary virtual machine is divided into two stages: a running stage and a synchronization stage. One of the two shadow page tables acts as the main table and the other as the alternate table. During the running stage of the primary virtual machine, the main table is used to record the operation of the virtual machine, and the primary virtual machine then uses the Most Recently Used algorithm to select some pages and copy them into the alternate table. During the synchronization stage, the previous main table drives the synchronization of the two virtual machines, and the previous alternate table becomes the main table. Experimental results show that this method has clear advantages over the original Remus method.
    Article · Jul 2013
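The double shadow page table workflow summarized in the abstract above can be sketched at a high level. The class and method names (DoubleShadowTables, record_write, precopy_mru, synchronize) and the pre-copy budget are illustrative assumptions; the sketch models only the role swap between the main and alternate tables and the MRU pre-copy, not real shadow paging.

```python
# High-level, illustrative sketch (assumed names) of the double shadow page
# table workflow: writes land in the "main" table during the running stage,
# MRU pages are pre-copied into the "alternate" table, and at the
# synchronization stage the tables swap roles.
from collections import OrderedDict


class DoubleShadowTables:
    def __init__(self, precopy_budget=2):
        self.main = OrderedDict()   # page -> data; insertion order tracks recency
        self.alternate = {}
        self.precopy_budget = precopy_budget

    def record_write(self, page, data):
        # Running stage: the main table records the primary VM's writes.
        self.main.pop(page, None)
        self.main[page] = data      # most recently used pages sit at the end

    def precopy_mru(self):
        # Still in the running stage: copy the MRU pages into the alternate
        # table so less work remains when the VM is paused to synchronize.
        for page, data in list(self.main.items())[-self.precopy_budget:]:
            self.alternate[page] = data

    def synchronize(self, send_to_backup):
        # Synchronization stage: the previous main table drives the sync,
        # and the previous alternate table becomes the new main table.
        send_to_backup(dict(self.main))
        self.main, self.alternate = OrderedDict(self.alternate), {}


if __name__ == "__main__":
    backup = {}
    tables = DoubleShadowTables()
    for page in range(4):
        tables.record_write(page, f"data{page}")
    tables.precopy_mru()
    tables.synchronize(backup.update)
    print(sorted(backup))  # all recorded pages have been shipped to the backup
```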
    • The aspect of leveraging virtualization to improve the availability and reliability of the system has been used in the past. In [8] and [9], the authors presented a mechanism to continuously synchronize the memory state of a node to backup nodes using checkpointing. When the primary node fails, the backup node resumes execution immediately, giving the end user the appearance of uninterrupted service. A minimal sketch of this failover behaviour follows this entry.
    ABSTRACT: The increasing popularity of Cloud computing as an attractive alternative to classic information processing systems has increased the importance of its correct and continuous operation, even in the presence of faulty components. In this paper, we introduce an innovative, system-level, modular perspective on creating and managing fault tolerance in Clouds. We propose a comprehensive high-level approach to shielding application developers and users from the implementation details of the fault tolerance techniques by means of a dedicated service layer. In particular, the service layer allows the user to specify and apply the desired level of fault tolerance, and does not require knowledge about the fault tolerance techniques that are available in the envisioned Cloud and their implementations.
    Full-text · Article · Jun 2013
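The failover behaviour referenced above, a backup that applies checkpoints and takes over when the primary stops responding, can be sketched as follows. The names and the 0.5 s heartbeat timeout are assumptions made for illustration; real systems detect failures through hypervisor- or network-level mechanisms.

```python
# Small, illustrative sketch of failover: the backup keeps applying
# checkpoints and, when heartbeats stop arriving, promotes itself and
# resumes from the last complete checkpoint. The 0.5 s timeout and all
# names here are assumptions, not taken from the cited papers.
import time

HEARTBEAT_TIMEOUT_S = 0.5  # assumed failure-detection threshold


class BackupNode:
    def __init__(self):
        self.memory = {}
        self.last_heartbeat = time.monotonic()
        self.active = False

    def apply_checkpoint(self, pages):
        # Each checkpoint doubles as a heartbeat from the primary.
        self.memory.update(pages)
        self.last_heartbeat = time.monotonic()

    def monitor(self):
        if time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT_S:
            # Primary presumed failed: take over from the last consistent state.
            self.active = True


if __name__ == "__main__":
    backup = BackupNode()
    backup.apply_checkpoint({0x1: b"replicated state"})
    time.sleep(0.6)  # simulate the primary going silent
    backup.monitor()
    print("backup active:", backup.active)
```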
    • Thus, virtual machine migration has emerged as a promising technique to be utilized by resource management algorithms, since it can cope with resource allocation problems in virtualized environments. The checkpoint mechanism works by asynchronously replicating/checkpointing the primary VM's memory and disk to a backup one at a very high frequency [10-12]. Administrators use the previously mentioned tools to adequately manage the computing environment. A minimal sketch of this asynchronous replication with buffered output follows this entry.
    ABSTRACT: Cloud computing is increasingly being adopted in different scenarios, such as social networking, business applications, and scientific experiments. Relying on virtualization technology, the construction of these computing environments targets improvements in the infrastructure, such as power efficiency and the fulfillment of users' SLA specifications. The methodology usually applied is packing all the virtual machines onto the proper physical servers. However, failure occurrences in these networked computing systems can have a substantial negative impact on system performance, deviating the system from our initial objectives. In this work, we propose adapted algorithms to dynamically map virtual machines to physical hosts in order to improve cloud infrastructure power efficiency, with low impact on users' required performance. Our decision-making algorithms leverage proactive fault-tolerance techniques to deal with system failures, allied with virtual machine technology to share node resources in an accurate and controlled manner. The results indicate that our algorithms perform better in targeting power efficiency and SLA fulfillment in the face of cloud infrastructure failures.
    Full-text · Conference Paper · Mar 2013 · IEEE Systems Journal
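The asynchronous, high-frequency checkpointing mentioned above can be modelled with a small sketch in which the primary enqueues checkpoints without blocking and buffers externally visible output until the backup acknowledges the checkpoint that output depends on. The AsyncReplicator class and its methods are illustrative assumptions, not an interface from the cited works.

```python
# Illustrative sketch (assumed class and method names) of asynchronous,
# high-frequency checkpointing: the primary enqueues memory/disk deltas and
# keeps running, while externally visible output stays buffered until the
# backup acknowledges the checkpoint that output depends on.
from collections import deque


class AsyncReplicator:
    def __init__(self, apply_on_backup):
        self.pending = deque()               # (memory_delta, disk_delta, outputs)
        self.outputs_since_checkpoint = []
        self.apply_on_backup = apply_on_backup

    def queue_output(self, packet):
        # Output produced during the current epoch; held back for now.
        self.outputs_since_checkpoint.append(packet)

    def checkpoint(self, memory_delta, disk_delta):
        # Non-blocking for the primary: enqueue the deltas and keep executing.
        self.pending.append(
            (memory_delta, disk_delta, self.outputs_since_checkpoint)
        )
        self.outputs_since_checkpoint = []

    def backup_ack(self):
        # Model the backup applying the oldest checkpoint and acknowledging it;
        # only then may the outputs buffered under that checkpoint be released.
        memory_delta, disk_delta, outputs = self.pending.popleft()
        self.apply_on_backup(memory_delta, disk_delta)
        return outputs


if __name__ == "__main__":
    backup_memory, backup_disk = {}, {}
    replicator = AsyncReplicator(
        lambda mem, disk: (backup_memory.update(mem), backup_disk.update(disk))
    )
    replicator.queue_output("tcp packet #1")
    replicator.checkpoint({0x2: b"page"}, {7: b"disk block"})
    print(replicator.backup_ack())  # -> ['tcp packet #1'], now safe to release
```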
    Conference Paper · June 2013
    Cloud computing is fast emerging as a popular choice for a variety of business needs. Providing adequate fault tolerance guarantees to diverse applications is an important challenge. Fault tolerance needs vary from one application to another. Fault tolerance consumes resources. In this paper, we propose fault tolerance to be added as a service, termed here as FTaaS, which can provide both...
    Conference Paper · May 2010
    The security and reliability issues that reside in VM live migration are a critical factor for its acceptance by the IT industry. We propose to leverage Intel vPro and TPM to improve security in virtual machine live migration. A role-based mechanism with remote attestation is introduced, under which the VM migration is controlled by specific policies that are protected in sealed storage. To provide...
    Article · March 2012
    Fault tolerance, reliability and resilience in Cloud Computing are of paramount importance to ensure continuous operation and correct results, even in the presence of a given maximum amount of faulty components. Most existing research and implementations focus on architecture-specific solutions to introduce fault tolerance. This implies that users must tailor their applications by taking into...
    Article · March 2012
    Virtual machine (VM) replication has been recognized as an inexpensive way of providing high availability on commodity hardware. Unfortunately, its impact on system performance is far from negligible, and strategies have been proposed to mitigate this problem. In this paper we take a look at VM replication from a different perspective: the choice of a hypervisor. Namely, the differences...