Romulus: A Disaster Tolerant System Based on Kernel Virtual Machines
Caraman, Mihai Claudiu; Moraru, Sorin Aurel; Dan, Stefan et al.
1. INTRODUCTION
Disaster tolerance (DT) represents the capability of a computing
environment and data to withstand a disaster (such as loss of power,
loss of communication components, fire or natural catastrophe) and to
continue to operate, or to return to operation in a relatively short
period of time. DT is achieved by building a distributed system in which
redundant elements are physically separated at distances ranging from
adjacent buildings to thousands of miles (HP, 2006).
The advance of virtualization technology creates new opportunities
to control and manipulate the operating systems running inside
virtual machines (VM). Current techniques allow hypervisors to
live-migrate a VM between two hosts while preserving clients'
connections in a transparent manner (Clark et al., 2005). Our intent
was to leverage the hypervisor's live migration capabilities in order
to implement DT systems.
2. CURRENT WORK
High availability (HA) is the assurance that the computing
environment and data are available to those who need them, to the
degree they are needed. From a technical perspective, HA represents
the capability of a system to continue to provide service in the event
of the failure of one or more components, or during planned downtime
(HP, 2006).
Newly released commercial products (Vmware, 2009) leverage the
hypervisor's live or storage migration capabilities to offer
different kinds of availability and recovery capabilities:
* High availability (HA) is a solution that monitors and restarts
virtual machines
* Disaster recovery (DR) is a solution that implies discontinuity
in operation and even loss of information
* Fault tolerance (FT) is a solution that provides continuous
availability; it depends on shared storage (SAN/NAS) and is not
suitable for disasters
None of the enumerated solutions provides full disaster tolerance
capabilities. The Remus paper (Cully et al., 2008) presents a new
approach to DT systems, combining live VM replication with
synchronized live replication of storage. This solution provides
transparent fail-over for clients by preserving existing
connections.
3. OUR CONTRIBUTION
Although Remus introduces an innovative approach, it does not
provide a specific algorithm for implementing DT systems. Our first goal
was to provide an algorithm with accurate specifications for
implementing a DT system based on virtual machines.
3.1 VII-stage DT algorithm
We propose a DT algorithm that runs between a primary host
executing a VM and a backup host. The algorithm consists of the
following VII stages:
I. Disk replication & Network protection
* On primary host, buffer network egress traffic
* On primary host, apply local disk writes
* From primary host, replicate disk writes
* On backup host, buffer the replicated disk writes
II. VM checkpoint
* On primary host, postpone the checkpoint until the previous
replication synchronization request has been received and processed
* On primary host, suspend VM
* On primary host, copy VM state (mem/cpu) to a buffer
* On primary host, resume VM
* From primary host, request checkpoint synchronization
III. Checkpoint synchronization
* On primary host, create a new network traffic buffer
* On backup host, create a new disk buffer
IV. Additional disk replication & Network protection
* On backup host, replicated disk writes go to the new disk buffer
* On primary host, egress traffic goes to the new network buffer
V. VM replication
* From primary host, replicate the VM buffer on a dedicated thread
* On backup host, buffer the received VM state
* From the backup host, on VM replication completion, request
replication synchronization
VI. Replication synchronization
* On primary host, release the network buffer
* On primary host, the new network buffer becomes current
* On backup host, flush the VM and disk buffers
* On backup host, the new disk buffer becomes current
VII. Failure detection and fail-over
* On backup host, continuously monitor primary host
* On backup host, if a failure of the primary host is detected, wait
for any in-flight stage VI to finish, then fail over
These VII stages are repeated until fail-over takes place, as
sketched below. Because it requires low inter-site latency, the
VII-stage algorithm is best suited for Extended Distance or
Metropolitan Clusters (HP, 2006).
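To make the hand-off discipline concrete, the following minimal Python
sketch models one iteration of stages I-VI over in-memory queues. The
class and field names are illustrative assumptions, not the Romulus
implementation; stage VII and the concurrent, networked execution are
omitted.

    from collections import deque

    class Primary:
        def __init__(self):
            self.net_buf = deque()    # egress frames held since stage I
            self.released = []        # frames actually put on the wire

    class Backup:
        def __init__(self):
            self.disk = []            # committed disk state
            self.disk_buf = deque()   # replicated writes awaiting stage VI
            self.vm_state = None      # last consistent checkpoint

    def iteration(p, b, writes, egress):
        # I. Buffer egress on the primary; buffer replicated writes on the backup.
        p.net_buf.extend(egress)
        b.disk_buf.extend(writes)
        # II. VM checkpoint, modelled as an opaque snapshot value.
        snapshot = {"mem": object(), "cpu": object()}
        # III. Open fresh buffers so post-checkpoint traffic stays separate.
        old_net, p.net_buf = p.net_buf, deque()
        old_disk, b.disk_buf = b.disk_buf, deque()
        # IV. From here on, new writes and egress land in the new buffers.
        # V. Replicate the snapshot; the backup only stages it for now.
        staged = snapshot
        # VI. Commit VM state and disk together, then release the held egress.
        b.vm_state = staged
        b.disk.extend(old_disk)
        p.released.extend(old_net)

    if __name__ == "__main__":
        p, b = Primary(), Backup()
        iteration(p, b, writes=["w1", "w2"], egress=["pkt1"])
        print(b.disk, p.released)     # ['w1', 'w2'] ['pkt1']

The essential ordering is in stage VI: the backup commits the VM state
and the buffered disk writes together, and only then does the primary
release the egress traffic held since stage I.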
3.2 Romulus DT system
We introduce Romulus, a novel DT system implementation that takes
advantage of the VII-stage algorithm. Romulus is based on the
open-source kvm project (Redhat, 2008). Kvm works in conjunction with
qemu-kvm, a machine emulator based on dynamic translation. Romulus DT
support consists of a series of improvements and extensions applied to
the qemu emulator. The live migration algorithm was extended to
accommodate continuous high-frequency replication. A new proxy driver
was introduced in the block driver model to accommodate disk
replication. Finally, a simplified model for handling network egress
traffic, reusing the proxy driver concept, was evaluated
experimentally.
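To illustrate the proxy driver concept, the sketch below wraps a
backing block device, applies each write locally and forwards a copy to
a replication channel. The class and method names are assumptions made
for illustration; the actual Romulus driver is an extension of qemu's
block driver model written in C.

    import io

    class ProxyBlockDriver:
        def __init__(self, backing, channel):
            self.backing = backing   # the real block device (file-like)
            self.channel = channel   # e.g. a queue feeding the backup host

        def write(self, offset, data):
            # Apply the write locally first (stage I) ...
            self.backing.seek(offset)
            self.backing.write(data)
            # ... then ship (offset, data) to the backup, which buffers it.
            self.channel.append((offset, bytes(data)))

    if __name__ == "__main__":
        disk, replicated = io.BytesIO(b"\x00" * 64), []
        drv = ProxyBlockDriver(disk, replicated)
        drv.write(16, b"hello")
        assert disk.getvalue()[16:21] == b"hello"
        assert replicated == [(16, b"hello")]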
3.3 DT on the fly, DT fail-over and DT API
We name 'DT on the fly' the capability of a system to
activate DT on a running VM. Romulus implements this capability by
introducing a full disk replication algorithm. The goal of this
algorithm is to replicate the existing disk state together with the
new disk writes. This is achieved by performing an optimized
replication of disk blocks combined with dirty tracking, as sketched
below. The VM's DT capability becomes effective as soon as the disk is
fully replicated.
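A minimal sketch of full disk replication with dirty tracking follows;
the block size, class and method names are assumptions, and the toy
ignores complications such as data buffered in driver state.

    class TrackedDisk:
        def __init__(self, size, block_size=4096):
            self.data = bytearray(size)
            self.block_size = block_size
            # Initially every block is 'dirty', i.e. not yet replicated.
            self.dirty = set(range((size + block_size - 1) // block_size))

        def write(self, offset, payload):
            # Apply the write and re-mark the touched blocks as dirty,
            # so they will be sent (again) by the background pass.
            self.data[offset:offset + len(payload)] = payload
            first = offset // self.block_size
            last = (offset + len(payload) - 1) // self.block_size
            self.dirty.update(range(first, last + 1))

        def replicate(self, send):
            # Background pass: drain the dirty set; concurrent writes simply
            # re-insert block numbers. DT becomes effective once it is empty.
            while self.dirty:
                idx = self.dirty.pop()
                lo = idx * self.block_size
                send(idx, bytes(self.data[lo:lo + self.block_size]))

    if __name__ == "__main__":
        d = TrackedDisk(4 * 4096)
        d.write(100, b"abc")
        sent = []
        d.replicate(lambda idx, blk: sent.append(idx))
        assert not d.dirty and sorted(sent) == [0, 1, 2, 3]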
We name 'DT fail-over' the capability of a system to
automatically activate DT support on a VM that has just failed over
during a disaster event. This capability depends on the 'DT on the
fly' capability. Romulus implements 'DT fail-over' with
the possibility of failing over to a second backup host.
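A toy illustration of this chaining follows; all names are assumed for
the example, and the real mechanism would involve the full disk
replication of 'DT on the fly' rather than a stub.

    def activate_dt_on_the_fly(primary, backup):
        # Stub standing in for the full disk replication of 'DT on the fly'.
        print("replicating disk of %s towards %s" % (primary, backup))

    def dt_failover(backups):
        # The first backup takes over as the new primary; if a second backup
        # exists, DT is re-armed on the now-running VM.
        new_primary, remaining = backups[0], backups[1:]
        if remaining:
            activate_dt_on_the_fly(new_primary, remaining[0])
        return new_primary

    if __name__ == "__main__":
        print(dt_failover(["backup-1", "backup-2"]))  # backup-1 becomes primary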
In order to control the DT capabilities and to allow integration
into Infrastructure as a Service (IaaS) clouds, Romulus proposes a DT
API. The implementation consists of a series of qemu-kvm command line
parameters and monitor command extensions. These allow the initial
setting of DT parameters and options, as well as dynamic control and
inspection of a VM's DT behaviour: checkpoint frequency, heart beat
time-out, and 'DT on the fly' and 'DT fail-over' activation.
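Since the paper does not list the concrete parameter names, the sketch
below only suggests the shape of such an API; every '-dt-*' option is
an invented placeholder, not a real qemu-kvm flag.

    def dt_command_line(image, backup_host, checkpoint_ms=40, heartbeat_ms=500):
        # Hypothetical qemu-kvm invocation; every '-dt-*' flag below is a
        # placeholder illustrating the API surface, not a real parameter.
        return [
            "qemu-kvm", "-hda", image,
            "-dt-backup", backup_host,                 # replication target
            "-dt-checkpoint-ms", str(checkpoint_ms),   # checkpoint frequency
            "-dt-heartbeat-ms", str(heartbeat_ms),     # heart beat time-out
            "-dt-on-the-fly",                          # activate DT on a running VM
        ]

    if __name__ == "__main__":
        print(" ".join(dt_command_line("guest.img", "backup.example.org")))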
3.4 Remus's fundamental flaw
In order to understand Remus's subtle details we analysed
the reference implementation provided by its authors. We identified
three main processing blocks, which control Remus's execution
flow, summarized below:
* Continuous VM replication on primary host:
protect process
suspend domain
copy VM mem/cpu to buffer
disk replication write 'flush' (RI)
resume domain
replicate VM buffer (RII)
disk replication read 'done' (RV)
net buffer send 'queue release' (RVI)
* Disk handling of 'flush' message on primary host:
tapdisk process
block-client write 'creq' (RIII)
* Disk handling of 'creq' message on backup host:
tapdisk process
block-server flush disk buffer
write 'done' (RIV)
Analyzing this flow we identified a fundamental flaw that breaks
the tolerance capability. The 'flush' message is issued on the disk
path (RI) before the VM buffer is replicated (RII), so the backup can
commit the new checkpoint's disk writes (RIV) while the VM state
transfer is still in flight. If the VM buffer replication fails at
stage (RII), the backup host will fail over to the previous VM state,
and the system ends up with an inconsistency between the VM state and
the disk.
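The toy example below reproduces the problematic ordering: the backup
commits the disk buffer on the 'creq' path (RIV) regardless of whether
the matching VM state transfer (RII) succeeds, so a failure at RII
pairs a new disk with an old VM state. All names are illustrative.

    def remus_like_checkpoint(backup, disk_buf, vm_state, rii_fails):
        # RIV: the backup flushes the disk buffer as soon as 'creq' arrives ...
        backup["disk"].extend(disk_buf)
        # RII: ... but the VM state transfer may still fail afterwards.
        if rii_fails:
            return backup    # fail-over lands on the old VM state, new disk
        backup["vm"] = vm_state
        return backup

    if __name__ == "__main__":
        backup = {"disk": [], "vm": "checkpoint-0"}
        remus_like_checkpoint(backup, ["w1"], "checkpoint-1", rii_fails=True)
        # Inconsistent: the disk holds checkpoint-1 writes, the VM is at 0.
        assert backup["disk"] == ["w1"] and backup["vm"] == "checkpoint-0"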
We also identified a degradation of the overall system performance.
It is caused by the late release of the network egress traffic (RVI),
which takes place only after the current disk buffer has been flushed
on the backup host (RIV).
3.5 Correctness verification and performance rules
We introduce a new correctness verification rule intended to assure
the quality of the VII-stage algorithm implementations. Any failure,
induced on the primary host after VM checkpoint (IV) or during VM
replication (V), should result in the backup host handling consistently
the disk and replication, without applying disk writes or VM state that
belongs to the new checkpoint.
We also specify a new rule for performance measurement. The network
latency is measured since the creation of the new egress buffer on stage
III and until its release on the stage VII, during the next iteration.
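Stated in code, the measurement rule could look like the sketch below:
each egress buffer is stamped when created in stage III, and the
latency is read when that same buffer is released in stage VI of the
following iteration. The class name and timer choice are illustrative
assumptions.

    import time

    class EgressBuffer:
        def __init__(self):
            self.created = time.monotonic()   # stamped in stage III
            self.frames = []

        def release(self):
            # Called in stage VI of the next iteration, when the buffer
            # finally leaves the primary; returns its latency contribution.
            return time.monotonic() - self.created

    if __name__ == "__main__":
        buf = EgressBuffer()        # created in stage III of iteration n
        buf.frames.append("pkt")    # stage IV traffic accumulates here
        time.sleep(0.05)            # stands in for stages V-VI and iteration n+1
        print("egress latency: %.3f s" % buf.release())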
3.6 Comparative results
As a base for comparing Romulus and the VII-stage algorithm we
choose Remus and its reference implementation.
We presented an improved correctness verification rule that
identifies the fundamental flaw of the Remus implementation. This flaw
is not present in the VII-stage algorithm, where the disk is flushed
only after the VM and disk buffers are fully replicated on the backup
host. We presented a performance measurement rule that identifies the
system performance degradation induced by the Remus implementation.
The VII-stage algorithm improves the overall system performance by
reducing the network traffic latency. This is achieved by releasing
the egress buffer after the VM replication, without waiting for the VM
or disk buffers to be flushed on the backup host.
Romulus adopted a distinct strategy by selecting as its foundation
a native kernel virtual machine with full virtualization support. This
yields a series of advantages: support for non-paravirtualized
operating systems, which is mandatory for closed-source OSs; increased
performance due to hardware-assisted virtualization; and ease of
installation and maintenance.
With Remus, the user has to replicate the initial disk
image in advance. This prevents the user from activating DT on a
running VM. There is no way to overcome this limitation from outside
the DT system, since the disk image is continuously changing and some
data can be buffered in the drivers' internal state. Romulus removes
this limitation by introducing the 'DT on the fly' capability.
4. FUTURE RESEARCH
There are several research areas for DT systems based on
virtual machines. One area that should be investigated is ISP
deployment models for Border Gateway Protocol and Virtual IP
Addresses, used to achieve fast, transparent IP fail-over. Another
area for enhancing the system is fail-over detection techniques and
fencing mechanisms. Our plan is to continue this research with the
integration of DT systems into Infrastructure as a Service cloud
computing solutions.
5. CONCLUSIONS
In this paper we have provided an in-depth analysis of DT
system implementations based on virtual machines, which revealed a
fundamental flaw and other limitations. To overcome these issues we
have presented a VII-stage DT algorithm with accurate specifications
for each stage. We have introduced Romulus, a disaster tolerant (DT)
system implementation based on a native kernel virtual machine with
full virtualization support. Romulus presents a series of novel
features such as DT on the fly, DT fail-over and a DT API for
Infrastructure as a Service (IaaS) integration.
6. REFERENCES
Cully, B.; Lefebvre, G.; Meyer, D.; Feeley, M.; Hutchinson, N. &
Warfield, A. (2008). Remus: High Availability via Asynchronous Virtual
Machine Replication, Proceedings of the 5th USENIX Symposium on
Networked Systems Design and Implementation
Cully, B. (2009). http://dsg.cs.ubc.ca/remus, UBC, 2009-06-01
Clark, C.; Fraser, K.; Hand, S.; Hansen, J. G.; Jul, E.; Limpach, C.;
Pratt, I. & Warfield, A. (2005). Live migration of virtual machines,
Proceedings of the 2nd USENIX Symposium on Networked Systems Design
and Implementation
*** (2007) http://www.hp.com, Delivering high availability and
disaster tolerance in a multi-operating-system environment, HP,
2009-08-01
*** (2008) http://www.linux-kvm.org, Redhat, 2009-06-01
*** (2009) http://www.vmware.com/products/fault-tolerance, VMware
Fault Tolerance, Vmware, 2009-08-01