Romulus: A Disaster Tolerant System Based on Kernel Virtual Machines
Caraman, Mihai Claudiu; Moraru, Sorin Aurel; Dan, Stefan et al.
1. INTRODUCTION
Disaster tolerance (DT) represents the capability of a computing
environment and data to withstand a disaster (such as loss of power,
loss of communication components, fire or natural catastrophe) and to
continue to operate, or to return to operation in a relatively short
period of time. DT is achieved by building a distributed system in which
redundant elements are physically separated at distances ranging from
adjacent buildings to thousands of miles (HP, 2006).
The advance of virtualization technology creates new opportunities
to control and manipulate the operating systems running inside
virtual machines (VM). Current techniques allow hypervisors to
live-migrate a VM between two hosts while preserving clients'
connections in a transparent manner (Clark et al., 2005). Our intent
was to leverage the hypervisor's live migration capabilities in order
to implement DT systems.
2. CURRENT WORK
High availability (HA) is the assurance that the computing
environment and data are available to those who need them, to the
degree they are needed. From a technical perspective, HA represents
the capability of a system to continue to provide service in the event
of the failure of one or more components, or during planned downtime
(HP, 2006).
Newly released commercial products (Vmware, 2009) leverage the
hypervisor's live or storage migration capabilities to offer
different kinds of availability and recovery capabilities:
* High availability (HA) is a solution that monitors and restarts
virtual machines
* Disaster recovery (DR) is a solution that implies discontinuity
in operation and even loss of information
* Fault tolerance (FT) is a solution that provides continuous
availability; it depends on shared storage (SAN/NAS) and is not
suitable for disasters
None of the enumerated solutions provides full disaster tolerance
capabilities. The Remus paper (Cully et al., 2008) presents a new
approach to DT systems, combining live VM replication with
synchronized live replication of storage. This solution provides
transparent fail-over for clients by preserving existing
connections.
3. OUR CONTRIBUTION
Although Remus introduces an innovative approach, it does not
provide a specific algorithm for implementing DT systems. Our first goal
was to provide an algorithm with accurate specifications for
implementing a DT system based on virtual machines.
3.1 VII-stage DT algorithm
We propose a DT algorithm that runs between a primary host
executing a VM and a backup host. The algorithm consists of the
following VII stages:
I. Disk replication & Network protection
* On primary host, buffer network egress traffic
* On primary host, apply local disk writes
* From primary host, replicate disk writes
* On backup host, buffer the replicated disk writes
II. VM checkpoint
* On primary host, postpone the checkpoint until the previous
replication synchronization request has been received and processed
* On primary host, suspend VM
* On primary host, copy VM state (mem/cpu) to a buffer
* On primary host, resume VM
* From primary host, request checkpoint synchronization
III. Checkpoint synchronization
* On primary host, create a new network traffic buffer
* On backup host, create a new disk buffer
IV. Additional disk replication & Network protection
* On backup host, replicated disk writes go to the new disk buffer
* On primary host, egress traffic goes to the new network buffer
V. VM replication
* From primary host, replicate the VM buffer on a dedicated thread
* On backup host, buffer the received VM state
* From the backup host, on VM replication completion, request
replication synchronization
VI. Replication synchronization
* On primary host, release the network buffer
* On primary host, the new network buffer becomes current
* On backup host, flush the VM and disk buffers
* On backup host, the new disk buffer becomes current
VII. Failure detection and fail-over
* On backup host, continuously monitor primary host
* On backup host, if a failure of the primary host is detected, wait
for any in-flight stage VI to finish, then fail over
These VII stages are repeated until fail-over takes place, as
sketched below. Because it requires low inter-site latency, the
VII-stage algorithm is best suited for Extended Distance or
Metropolitan Clusters (HP, 2006).
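To make the hand-off discipline concrete, the following minimal Python
sketch models one iteration of stages I-VI over in-memory queues. The
class and field names are illustrative assumptions, not the Romulus
implementation; stage VII and the concurrent, networked execution are
omitted.

    from collections import deque

    class Primary:
        def __init__(self):
            self.net_buf = deque()    # egress frames held since stage I
            self.released = []        # frames actually put on the wire

    class Backup:
        def __init__(self):
            self.disk = []            # committed disk state
            self.disk_buf = deque()   # replicated writes awaiting stage VI
            self.vm_state = None      # last consistent checkpoint

    def iteration(p, b, writes, egress):
        # I. Buffer egress on the primary; buffer replicated writes on the backup.
        p.net_buf.extend(egress)
        b.disk_buf.extend(writes)
        # II. VM checkpoint, modelled as an opaque snapshot value.
        snapshot = {"mem": object(), "cpu": object()}
        # III. Open fresh buffers so post-checkpoint traffic stays separate.
        old_net, p.net_buf = p.net_buf, deque()
        old_disk, b.disk_buf = b.disk_buf, deque()
        # IV. From here on, new writes and egress land in the new buffers.
        # V. Replicate the snapshot; the backup only stages it for now.
        staged = snapshot
        # VI. Commit VM state and disk together, then release the held egress.
        b.vm_state = staged
        b.disk.extend(old_disk)
        p.released.extend(old_net)

    if __name__ == "__main__":
        p, b = Primary(), Backup()
        iteration(p, b, writes=["w1", "w2"], egress=["pkt1"])
        print(b.disk, p.released)     # ['w1', 'w2'] ['pkt1']

The essential ordering is in stage VI: the backup commits the VM state
and the buffered disk writes together, and only then does the primary
release the egress traffic held since stage I.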
3.2 Romulus DT system
We introduce Romulus, a novel DT system implementation that takes
advantage of the VII-stage algorithm. Romulus is based on the
open-source kvm project (Redhat, 2008). Kvm works in conjunction with
qemu-kvm, a machine emulator based on dynamic translation. Romulus DT
support consists of a series of improvements and extensions applied to
the qemu emulator. The live migration algorithm was extended to
accommodate continuous high-frequency replication. A new proxy driver
was introduced in the block driver model to accommodate disk
replication. Finally, a simplified model for handling network egress
traffic, reusing the proxy driver concept, was evaluated
experimentally.
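To illustrate the proxy driver concept, the sketch below wraps a
backing block device, applies each write locally and forwards a copy to
a replication channel. The class and method names are assumptions made
for illustration; the actual Romulus driver is an extension of qemu's
block driver model written in C.

    import io

    class ProxyBlockDriver:
        def __init__(self, backing, channel):
            self.backing = backing   # the real block device (file-like)
            self.channel = channel   # e.g. a queue feeding the backup host

        def write(self, offset, data):
            # Apply the write locally first (stage I) ...
            self.backing.seek(offset)
            self.backing.write(data)
            # ... then ship (offset, data) to the backup, which buffers it.
            self.channel.append((offset, bytes(data)))

    if __name__ == "__main__":
        disk, replicated = io.BytesIO(b"\x00" * 64), []
        drv = ProxyBlockDriver(disk, replicated)
        drv.write(16, b"hello")
        assert disk.getvalue()[16:21] == b"hello"
        assert replicated == [(16, b"hello")]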
3.3 DT on the fly, DT fail-over and DT API
We name 'DT on the fly' the capability of a system to
activate DT on a running VM. Romulus implements this capability by
introducing a full disk replication algorithm. The goal of this
algorithm is to replicate the existing disk state together with the
new disk writes. This is achieved by performing an optimized
replication of disk blocks combined with dirty tracking, as sketched
below. The VM's DT capability becomes effective as soon as the disk is
fully replicated.
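A minimal sketch of full disk replication with dirty tracking follows;
the block size, class and method names are assumptions, and the toy
ignores complications such as data buffered in driver state.

    class TrackedDisk:
        def __init__(self, size, block_size=4096):
            self.data = bytearray(size)
            self.block_size = block_size
            # Initially every block is 'dirty', i.e. not yet replicated.
            self.dirty = set(range((size + block_size - 1) // block_size))

        def write(self, offset, payload):
            # Apply the write and re-mark the touched blocks as dirty,
            # so they will be sent (again) by the background pass.
            self.data[offset:offset + len(payload)] = payload
            first = offset // self.block_size
            last = (offset + len(payload) - 1) // self.block_size
            self.dirty.update(range(first, last + 1))

        def replicate(self, send):
            # Background pass: drain the dirty set; concurrent writes simply
            # re-insert block numbers. DT becomes effective once it is empty.
            while self.dirty:
                idx = self.dirty.pop()
                lo = idx * self.block_size
                send(idx, bytes(self.data[lo:lo + self.block_size]))

    if __name__ == "__main__":
        d = TrackedDisk(4 * 4096)
        d.write(100, b"abc")
        sent = []
        d.replicate(lambda idx, blk: sent.append(idx))
        assert not d.dirty and sorted(sent) == [0, 1, 2, 3]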
We name 'DT fail-over' the capability of a system to
automatically activate DT support on a VM that has just failed over
during a disaster event. This capability depends on the 'DT on the
fly' capability. Romulus implements 'DT fail-over' with
the possibility of failing over to a second backup host.
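A toy illustration of this chaining follows; all names are assumed for
the example, and the real mechanism would involve the full disk
replication of 'DT on the fly' rather than a stub.

    def activate_dt_on_the_fly(primary, backup):
        # Stub standing in for the full disk replication of 'DT on the fly'.
        print("replicating disk of %s towards %s" % (primary, backup))

    def dt_failover(backups):
        # The first backup takes over as the new primary; if a second backup
        # exists, DT is re-armed on the now-running VM.
        new_primary, remaining = backups[0], backups[1:]
        if remaining:
            activate_dt_on_the_fly(new_primary, remaining[0])
        return new_primary

    if __name__ == "__main__":
        print(dt_failover(["backup-1", "backup-2"]))  # backup-1 becomes primary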
In order to control the DT capabilities and to allow integration
into Infrastructure as a Service (IaaS) clouds, Romulus proposes a DT
API. The implementation consists of a series of qemu-kvm command line
parameters and monitor command extensions. These allow the initial
setting of DT parameters and options, as well as dynamic control and
inspection of a VM's DT behaviour: checkpoint frequency, heart beat
time-out, and 'DT on the fly' and 'DT fail-over' activation.
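Since the paper does not list the concrete parameter names, the sketch
below only suggests the shape of such an API; every '-dt-*' option is
an invented placeholder, not a real qemu-kvm flag.

    def dt_command_line(image, backup_host, checkpoint_ms=40, heartbeat_ms=500):
        # Hypothetical qemu-kvm invocation; every '-dt-*' flag below is a
        # placeholder illustrating the API surface, not a real parameter.
        return [
            "qemu-kvm", "-hda", image,
            "-dt-backup", backup_host,                 # replication target
            "-dt-checkpoint-ms", str(checkpoint_ms),   # checkpoint frequency
            "-dt-heartbeat-ms", str(heartbeat_ms),     # heart beat time-out
            "-dt-on-the-fly",                          # activate DT on a running VM
        ]

    if __name__ == "__main__":
        print(" ".join(dt_command_line("guest.img", "backup.example.org")))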
3.4 Remus's fundamental flaw
In order to understand Remus's subtle details we analysed
the reference implementation provided by its authors. We identified
three main processing blocks, which control Remus's execution
flow, summarized below:
* Continuous VM replication on primary host:
protect process
suspend domain
copy VM mem/cpu to buffer
disk replication write 'flush' (RI)
resume domain
replicate VM buffer (RII)
disk replication read 'done' (RV)
net buffer send 'queue release' (RVI)
* Disk handling of 'flush' message on primary host:
tapdisk process
block-client write 'creq' (RIII)
* Disk handling of 'creq' message on backup host:
tapdisk process
block-server flush disk buffer
write 'done' (RIV)
Analyzing this flow we identified a fundamental flaw that breaks
the tolerance capability. The 'flush' message is issued on the disk
path (RI) before the VM buffer is replicated (RII), so the backup can
commit the new checkpoint's disk writes (RIV) while the VM state
transfer is still in flight. If the VM buffer replication fails at
stage (RII), the backup host will fail over to the previous VM state,
and the system ends up with an inconsistency between the VM state and
the disk.
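The toy example below reproduces the problematic ordering: the backup
commits the disk buffer on the 'creq' path (RIV) regardless of whether
the matching VM state transfer (RII) succeeds, so a failure at RII
pairs a new disk with an old VM state. All names are illustrative.

    def remus_like_checkpoint(backup, disk_buf, vm_state, rii_fails):
        # RIV: the backup flushes the disk buffer as soon as 'creq' arrives ...
        backup["disk"].extend(disk_buf)
        # RII: ... but the VM state transfer may still fail afterwards.
        if rii_fails:
            return backup    # fail-over lands on the old VM state, new disk
        backup["vm"] = vm_state
        return backup

    if __name__ == "__main__":
        backup = {"disk": [], "vm": "checkpoint-0"}
        remus_like_checkpoint(backup, ["w1"], "checkpoint-1", rii_fails=True)
        # Inconsistent: the disk holds checkpoint-1 writes, the VM is at 0.
        assert backup["disk"] == ["w1"] and backup["vm"] == "checkpoint-0"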
We also identified a degradation of the overall system performance.
It is caused by the late release of the network egress traffic (RVI),
which takes place only after the current disk buffer has been flushed
on the backup host (RIV).
3.5 Correctness verification and performance rules
We introduce a new correctness verification rule intended to assure
the quality of the VII-stage algorithm implementations. Any failure,
induced on the primary host after VM checkpoint (IV) or during VM
replication (V), should result in the backup host handling consistently
the disk and replication, without applying disk writes or VM state that
belongs to the new checkpoint.
We also specify a new rule for performance measurement. The network
latency is measured since the creation of the new egress buffer on stage
III and until its release on the stage VII, during the next iteration.
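Stated in code, the measurement rule could look like the sketch below:
each egress buffer is stamped when created in stage III, and the
latency is read when that same buffer is released in stage VI of the
following iteration. The class name and timer choice are illustrative
assumptions.

    import time

    class EgressBuffer:
        def __init__(self):
            self.created = time.monotonic()   # stamped in stage III
            self.frames = []

        def release(self):
            # Called in stage VI of the next iteration, when the buffer
            # finally leaves the primary; returns its latency contribution.
            return time.monotonic() - self.created

    if __name__ == "__main__":
        buf = EgressBuffer()        # created in stage III of iteration n
        buf.frames.append("pkt")    # stage IV traffic accumulates here
        time.sleep(0.05)            # stands in for stages V-VI and iteration n+1
        print("egress latency: %.3f s" % buf.release())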
3.6 Comparative results
As a base for comparing Romulus and the VII-stage algorithm we
choose Remus and its reference implementation.
We presented an improved correctness verification rule that
identifies the fundamental flaw of the Remus implementation. This flaw
is not present in the VII-stage algorithm, where the disk is flushed
only after the VM and disk buffers are fully replicated on the backup
host. We presented a performance measurement rule that identifies the
system performance degradation induced by the Remus implementation.
The VII-stage algorithm improves the overall system performance by
reducing the network traffic latency. This is achieved by releasing
the egress buffer after the VM replication, without waiting for the VM
or disk buffers to be flushed on the backup host.
Romulus adopted a distinct strategy by selecting as its foundation
a native kernel virtual machine with full virtualization support. This
yields a series of advantages: support for non-paravirtualized
operating systems, which is mandatory for closed-source OSs; increased
performance due to hardware-assisted virtualization; and ease of
installation and maintenance.
With Remus, the user has to replicate the initial disk
image in advance. This prevents the user from activating DT on a
running VM. There is no way to overcome this limitation from outside
the DT system, since the disk image is continuously changing and some
data can be buffered in the drivers' internal state. Romulus removes
this limitation by introducing the 'DT on the fly' capability.
4. FUTURE RESEARCH
There are several research areas for DT systems based on
virtual machines. One area that should be investigated is ISP
deployment models for Border Gateway Protocol and Virtual IP
Addresses, used to achieve fast, transparent IP fail-over. Another
area for enhancing the system is fail-over detection techniques and
fencing mechanisms. Our plan is to continue this research with the
integration of DT systems into Infrastructure as a Service cloud
computing solutions.
5. CONCLUSIONS
In this paper we have provided an in-depth analysis of DT
system implementations based on virtual machines, which revealed a
fundamental flaw and other limitations. To overcome these issues we
have presented a VII-stage DT algorithm with accurate specifications
for each stage. We have introduced Romulus, a disaster tolerant (DT)
system implementation based on a native kernel virtual machine with
full virtualization support. Romulus presents a series of novel
features such as DT on the fly, DT fail-over and a DT API for
Infrastructure as a Service (IaaS) integration.
6. REFERENCES
Cully, B.; Lefebvre, G.; Meyer, D.; Feeley, M.; Hutchinson, N. &
Warfield, A. (2008). Remus: High Availability via Asynchronous Virtual
Machine Replication, Proceedings of the 5th USENIX Symposium on
Networked Systems Design and Implementation
Cully, B. (2009). http://dsg.cs.ubc.ca/remus, UBC, 2009-06-01
Clark, C.; Fraser, K.; Hand, S.; Hansen, J. G.; Jul, E.; Limpach, C.;
Pratt, I. & Warfield, A. (2005). Live migration of virtual machines,
Proceedings of the 2nd USENIX Symposium on Networked Systems Design
and Implementation
*** (2007) http://www.hp.com, Delivering high availability and
disaster tolerance in a multi-operating-system environment, HP,
2009-08-01
*** (2008) http://www.linux-kvm.org, Redhat, 2009-06-01
*** (2009) http://www.vmware.com/products/fault-tolerance, VMware
Fault Tolerance, Vmware, 2009-08-01