Live Migration with Pass-through Device for Linux VM
Edwin Zhai, Gregory D. Cummings, and Yaozu Dong
Intel Corp.
{edwin.zhai, gregory.d.cummings, eddie.dong}@intel.com
Abstract

Open source Linux virtualization, such as Xen and KVM, has made great progress recently, and has been a hot topic in the Linux world for years. With virtualization support, the hypervisor de-privileges operating systems as guest operating systems and shares physical resources, such as memory and network devices, among guests. For device virtualization, several mechanisms have been introduced to improve performance. Paravirtualized (PV) drivers avoid excessive guest/hypervisor switching and thus achieve better performance; Xen's split virtual network interface driver (VNIF) is one example. Unlike the software optimization in a PV driver, an IOMMU, such as Intel® Virtualization Technology for Directed I/O (VT-d), enables passing physical devices directly through to guests to take advantage of hardware DMA remapping, thus reducing hypervisor intervention and achieving high bandwidth.

Physically assigned devices impose challenges on live migration, which is one of the most important virtualization features in server consolidation. This paper shows how we solve this issue using virtual hotplug technology together with the Linux bonding driver, and is organized as follows: we start with device virtualization and the challenges of live migration, followed by the design and implementation of the virtual hotplug based solution. The network connectivity issue is then addressed using the bonding driver for live migration with a directly assigned NIC device. Finally, we present the current status, future work, and alternative solutions.
1 Introduction to Virtualization

Virtualization has become a hot topic in the Linux world recently, as various open source virtualization solutions based on Linux have been released. With virtualization, the hypervisor supports running multiple operating systems simultaneously on one physical machine by presenting a virtual platform to each guest operating system. There are two different approaches a hypervisor can take to present the virtual platform: full virtualization and paravirtualization. With full virtualization, the guest platform presented consists entirely of existing components, such as a PIIX chipset, an IDE controller/disk, a SCSI controller/disk, and even an old Pentium® II processor, all of which a modern OS can already support without any modification. Paravirtualization presents the guest OS with a synthetic platform, with components that may never have existed in the real world, and which thus cannot run an unmodified commercial OS directly. Instead, paravirtualization requires modifications to the guest OS or driver source code to match the synthetic platform, which is usually designed, using knowledge of the underlying hypervisor, to avoid excessive context switches between guest and hypervisor and thereby achieve better performance.
2 Device Virtualization

Most hardware today doesn't support virtualization, so device virtualization has had to rely on pure software techniques. Software-based virtualization shares physical resources between different guests by intercepting guest accesses to device resources: either trapping I/O commands from a native device driver running in the guest and providing emulation (an emulated device), or servicing hypercalls from the guest's front-end paravirtualized drivers in a split device model (a PV device). Both sharing solutions require hypervisor intervention, which causes additional overhead and limits performance.
To reduce this overhead, a pass-through mechanism was introduced in Xen and KVM (work in progress) to allow assignment of a physical PCI device to a specific guest, so that the guest can directly access the physical resource without hypervisor intervention [8]. The pass-through mechanism introduces an additional requirement on DMA engines: a DMA transaction requires the host physical address, but a guest can only provide the guest physical address. So a method must be provided to convert guest physical addresses to host physical addresses, both for correctness in a non-identically-mapped guest and for secure isolation among guests. Hardware IOMMU technologies, such as Intel® Virtualization Technology for Directed I/O (VT-d) [7], are designed to perform this conversion. They do so by remapping DMA addresses provided by the guest to host physical addresses in hardware, via a VT-d table indexed by a device requestor ID, i.e., Bus/Device/Function as defined in the PCI specification. Pass-through devices have close to native throughput while maintaining low CPU usage.
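The requestor-ID-indexed translation can be pictured with a small model. The SimpleIOMMU class below is a hypothetical flat-table sketch of the DMA remapping idea; real VT-d hardware walks root/context entries and multi-level page tables rather than a Python dict, and all names here are illustrative.

```python
# Simplified model of IOMMU DMA remapping: a per-device translation
# table, indexed by the PCI requestor ID (bus/device/function), maps
# guest-physical page numbers to host-physical page numbers.
# Illustrative sketch only, not the real VT-d table format.

PAGE_SHIFT = 12
PAGE_MASK = (1 << PAGE_SHIFT) - 1

def requestor_id(bus, dev, fn):
    """Pack bus/device/function into the 16-bit PCI requestor ID."""
    return (bus << 8) | (dev << 3) | fn

class SimpleIOMMU:
    def __init__(self):
        # {requestor_id: {guest_pfn: host_pfn}}
        self.tables = {}

    def map_page(self, rid, guest_pfn, host_pfn):
        self.tables.setdefault(rid, {})[guest_pfn] = host_pfn

    def translate(self, rid, guest_addr):
        """Translate a DMA address issued by device `rid`, or raise
        to emulate a DMA remapping fault (isolation between guests)."""
        table = self.tables.get(rid)
        if table is None:
            raise PermissionError("device not assigned to any guest")
        guest_pfn = guest_addr >> PAGE_SHIFT
        if guest_pfn not in table:
            raise PermissionError("DMA outside the guest's memory")
        return (table[guest_pfn] << PAGE_SHIFT) | (guest_addr & PAGE_MASK)
```

A device whose requestor ID has no table, or a DMA address outside the guest's mapped pages, faults instead of reaching memory; this is the isolation property mentioned above.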
Single Root I/O Virtualization (SR-IOV), based on the PCI-SIG I/O Virtualization specifications, is another emerging hardware virtualization technology; it specifies how a single device can be shared between multiple guests via a hardware mechanism. A single SR-IOV device can have multiple virtual functions (VFs). Each VF has its own requestor ID and resources, which allows the VF to be assigned to a specific guest. The guest can then directly access the physical resource without hypervisor intervention, and the VF-specific requestor ID allows the hardware IOMMU to convert guest physical addresses to host physical addresses.
Of all the devices that are virtualized, network devices are among the most critical in data centers. With traditional LAN solutions and storage solutions such as iSCSI and FCoE converging onto the network, network device virtualization is becoming increasingly important. In this paper, we use network devices as a case study.
3 Live Migration

Relocating a virtual machine from one physical host to another with a very small service down-time, such as 100 ms [6], is one major benefit of virtualization. Data centers can use the VM relocation feature, i.e., live migration, to dynamically balance load across hosting platforms for better throughput. It can also be used to dynamically consolidate services onto fewer hosting platforms for better power savings, or to allow maintenance of a physical platform: each physical platform has a limited life cycle, while a VM can run far longer than the machine it started on. Live migration, like its sibling features VM save and VM restore, is achieved by copying VM state, including memory, virtual devices, and processor state, from one place to another. The virtual platform on which the migrated VM runs must be the same as the one where it previously ran, and it must provide the capability to save and restore all internal state, which depends on how the devices are virtualized.
The guest memory subsystem, which makes up part of the guest platform, is kept identical when the VM relocates: the target VM is assigned the same amount of memory with the same layout. The live migration manager copies memory contents from the source VM to the target using an incremental approach, to reduce the service outage time [5], given that the memory a guest owns may range from tens of megabytes to tens of gigabytes (and even more in the future), which takes a relatively long time to transmit even over ten-gigabit Ethernet.
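The incremental approach is the classic pre-copy loop of [5]: send everything once, then keep re-sending only the pages dirtied during the previous round, until the remaining dirty set is small enough to copy during the brief pause. The sketch below models that loop only; the threshold and round limit are invented parameters, and `get_dirty` stands in for the hypervisor's dirty-page tracking.

```python
# Illustrative sketch of pre-copy live migration memory transfer.
# get_dirty(sent) -> set of pages dirtied while `sent` was copied.

def precopy_rounds(all_pages, get_dirty, threshold=8, max_rounds=30):
    """Return (page sets sent per round, final dirty set to copy
    while the guest is paused in the stop-and-copy phase)."""
    rounds = []
    to_send = set(all_pages)          # round 0: all guest memory
    for _ in range(max_rounds):
        rounds.append(to_send)
        dirty = get_dirty(to_send)    # pages touched during this round
        if len(dirty) <= threshold:   # small enough: stop and copy
            return rounds, dirty
        to_send = dirty               # next round re-sends dirty pages
    return rounds, to_send            # not converging: force the stop
```

A write-heavy guest that keeps dirtying pages faster than they can be sent never converges, which is why the loop is bounded by `max_rounds`.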
The processor type and features the guest sees usually need to be the same across a VM migration, but exceptions can be made if all the features the source VM uses also exist on the target host's processor, or if the hypervisor can emulate those features that do not exist on the target side. For example, live migration can require the same CPUID on the host side, or simply hide the differences by presenting the guest with a common subset of the physical features. MSRs are more complicated unless the host platforms are identical; fortunately, today's presented guest platform is quite simple and doesn't use model-specific MSRs. The whole CPU context saved at the final step of live migration is usually on the order of tens of kilobytes, which means only several milliseconds of service outage.
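Hiding host differences behind a common feature subset amounts to a bitwise intersection of the hosts' CPUID feature words. The sketch below illustrates just that policy; the feature-bit names are made up for readability and do not reproduce a real CPUID leaf layout.

```python
# Illustrative feature-bit names (not real CPUID bit positions).
SSE3   = 1 << 0
VMX    = 1 << 5
SSSE3  = 1 << 9
SSE4_1 = 1 << 19

def migration_safe_features(host_feature_words):
    """AND together the feature words of every host in the migration
    pool: the guest is only shown features that all hosts have."""
    mask = ~0
    for word in host_feature_words:
        mask &= word
    return mask

def can_migrate(guest_features, target_host_features):
    """A guest may move if every feature it was given also exists on
    the target host's processor."""
    return (guest_features & ~target_host_features) == 0
```

A guest started with the pooled mask can therefore land on any host in the pool, at the cost of never seeing features unique to one machine.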
On the device side, cloning source device instances to the target VM during live migration is much more complicated. If the source VM includes only software-emulated devices or paravirtualized devices, an identical platform device can be maintained by generating exactly the same configuration for the target VM at startup, and the device state can easily be maintained since the hypervisor knows all of its internal state. Such devices are called migration friendly devices. But for guests that have pass-through devices or SR-IOV Virtual Functions on the source VM side, things are totally different.
3.1 Issues of Live Migration with a Pass-through Device

Although a guest with a pass-through device can achieve almost native performance, maintaining an identical platform device after migration may be impossible: the target VM may not have the same hardware. Furthermore, even if the target guest does have an identical platform device, cloning the device instance to the target VM is still almost impossible, because some device internal state may not be readable, and some may still be in flight at migration time, unknown to a hypervisor without device-specific knowledge. Even without those unknown states, knowing how to write the internal state back on the relocated VM is another big problem without device-specific knowledge in the hypervisor. Finally, some devices have unique information that can't be migrated, such as a MAC address. These devices are migration unfriendly.

To address pass-through device migration, either the hypervisor needs device-specific knowledge to help with migration, or the guest needs to perform the device-specific operations. In this paper, we ask for guest support, proposing a guest hotplug based solution that requests the guest's cooperation to unplug all migration-unfriendly devices before relocation happens, so that the platform devices and device states are identical after migration. But hot unplugging an Ethernet card may lead to a network service outage, usually on the order of several seconds. The Linux bonding driver, originally developed for aggregating multiple network interfaces, is used here to maintain connectivity.
4 Solution

This section describes a simple and generic solution to the problem of live migration with a pass-through device, and illustrates how to address the two key issues: saving and restoring device state, and keeping network connectivity for a NIC device.
4.1 Stop Pass-through Device

As described in the previous section, unlike emulated devices, most physical devices can't be paused to save and restore their hardware state, so a consistent device state across live migration is impossible. The only choice is to stop the guest from using physical devices before live migration.

How to do it? One easy way is to have the end user stop everything that uses the pass-through device, including applications, services, and drivers, and then restore them on the target machine after the hypervisor allocates a new device. This method works, but it's not generic, as different Linux distributions require different operations. Moreover, a lot of user intervention is needed inside the Linux guest.

Another generic option is ACPI [1] S3 (suspend-to-RAM), in which the operating system freezes all processes and suspends all I/O devices, then goes into a sleep state with all context lost except system memory. But this is overkill, because the whole platform is affected rather than just the target device, and the service outage time is intolerable. PCI hotplug is perfect in this case, because:

- Unlike ACPI S3, it is a device-level, fine-grained mechanism.

- It's generic, because the 2.6 kernel supports various PCI hotplug mechanisms.

- It needs no heavy user intervention, because PCI hotplug can be triggered by hardware.
The solution using PCI hotplug looks like the following:

1. Before live migration, on the source host, the control panel triggers a virtual PCI hot-removal event against the pass-through device in the guest.

2. The Linux guest responds to the hot-removal event, and stops using the pass-through device after unloading the driver.

3. Without any pass-through device, Linux can safely be live migrated to the target platform.

4. After live migration, on the target host, a virtual PCI hot-add event is triggered against a new pass-through device.

5. The Linux guest loads the proper driver and starts using the new pass-through device. Because the guest initializes a new device that has nothing to do with the old one, the limitations described in Section 3.1 no longer apply.
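The five steps above can be sketched as a small migration driver. Host and Guest here are hypothetical stand-ins for the control panel and the guest's hotplug handling, not a real Xen or KVM API; they only record the order of operations.

```python
# Toy orchestration of the hot-remove / migrate / hot-add sequence.

class Log:
    def __init__(self):
        self.events = []

class Host:
    def __init__(self, name, log):
        self.name, self.log = name, log
    def inject_hot_remove(self, dev):
        self.log.events.append((self.name, "hot-remove", dev))
    def live_migrate_to(self, other):
        self.log.events.append((self.name, "migrate-to", other.name))
    def inject_hot_add(self, dev):
        self.log.events.append((self.name, "hot-add", dev))

class Guest:
    def __init__(self, log):
        self.log = log
    def release_device(self, dev):    # driver unload after hot removal
        self.log.events.append(("guest", "released", dev))
    def bring_up_device(self, dev):   # driver load after hot add
        self.log.events.append(("guest", "ready", dev))

def migrate_with_passthrough(src, dst, guest, old_dev, new_dev):
    src.inject_hot_remove(old_dev)    # step 1: virtual hot-removal
    guest.release_device(old_dev)     # step 2: guest stops using it
    src.live_migrate_to(dst)          # step 3: migrate without it
    dst.inject_hot_add(new_dev)       # step 4: virtual hot-add
    guest.bring_up_device(new_dev)    # step 5: fresh device, fresh state
```

The key property is the ordering: the guest must have released the device before step 3, and no state from `old_dev` is carried to `new_dev`.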
4.2 Keeping Network Connectivity

The most popular usage model for a pass-through device is assigning a NIC to a VM for high network throughput. Unfortunately, PCI NIC hotplug during live migration breaks network connectivity, which leads to an unpleasant user experience. To address this, the Linux guest should automatically switch to a virtual NIC after hot removal of the physical NIC, and then migrate with the virtual NIC. Thanks to the powerful and versatile Linux network stack, the Ethernet bonding driver [3] already supports this.
The Linux bonding driver provides a mechanism for enslaving multiple network interfaces into a single, logical "bonded" interface with one MAC address. The behavior of the bonded interface depends on its mode. For instance, the bonding driver can detect link failure and reroute network traffic around a failed link in a manner transparent to the application; this is active-backup mode. It can also aggregate network traffic across all working links to achieve higher throughput, which is referred to as trunking [4].

Active-backup mode can be used for the automatic switch. In this mode, only one slave in the bond is active, while another acts as a backup. The backup slave becomes active if, and only if, the active slave fails. Additionally, one slave can be designated as primary; it will always be the active slave while it is available, and only when the primary is offline will secondary devices be used. This is very useful when bonding a pass-through device, as the physical NIC is preferred over virtual devices for performance reasons.
Enabling the bonding driver in Linux is very simple: the end user just needs to reconfigure the network before using a pass-through device. The whole configuration in the Linux guest is shown in Figure 1, where a new bond is created to aggregate two slaves: the physical NIC as primary, and a virtual NIC as secondary. Under normal conditions the bond relies on the physical NIC, and it takes the following actions in response to hotplug events during live migration:

- When hot removal happens, the virtual NIC becomes active and takes over the inbound/outbound traffic, without breaking the network inside the Linux guest.

- With this virtual NIC, the Linux guest is migrated to the target machine.

- When hot add completes on the target machine, the new physical NIC takes over as the active slave with high throughput.

In this process, no user intervention is required for the switch, because the bonding driver handles everything.
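The active-backup selection rule just described can be modeled in a few lines. This sketches the policy only, namely that the primary is preferred whenever its link is up and any live slave serves as fallback, not the bonding driver's actual implementation; the interface names are illustrative.

```python
# Policy model of active-backup slave selection with a primary.

def pick_active(slaves, link_up, primary):
    """slaves: ordered interface names; link_up: name -> bool.
    Returns the interface that should carry traffic, or None."""
    if primary in slaves and link_up.get(primary, False):
        return primary                  # primary always wins when up
    for name in slaves:
        if link_up.get(name, False):
            return name                 # fall back to any live slave
    return None                         # no connectivity at all
```

Walking this through the migration: hot removal takes the primary's link down, so the virtual NIC is picked; hot add on the target brings the primary back, so traffic returns to the physical NIC automatically.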
4.3 PCI Hotplug Implementation

PCI hotplug plays an important role in live migration with a pass-through device. It should be implemented in the device model, according to a hardware PCI hotplug spec. Currently, the device models of the most popular Linux virtualization solutions, such as Xen and KVM, are derived from QEMU. Unfortunately, QEMU did not support virtual PCI hotplug when this solution was developed, so we implemented a virtual PCI hotplug device model from scratch.
4.3.1 Choosing a Hotplug Spec

The PCI spec doesn't define a standard hotplug mechanism. There are three existing categories of PCI hotplug mechanisms:

- ACPI hotplug: A mechanism similar to ACPI dock hot insertion/ejection, where ACPI control methods work with an ACPI GPE to service the hotplug.

- SHPC [2] (Standard Hot-Plug Controller): The PCI-SIG spec defining a complicated controller to handle PCI hotplug.

- Vendor-specific: Other vendor-specific standards, such as Compaq's and IBM's, which use their own server hardware for PCI hotplug.

Linux 2.6 supports all of the above hotplug standards, which lets us select a simple, open, and efficient one. SHPC is a really complicated device, so it's hard to implement. Vendor-specific controllers are not well supported in other OSes. ACPI hotplug is best suited to being emulated in the device model, because the interface exposed to OSPM is very simple and well defined.
Figure 1: Live Migration with Pass-through Device
4.3.2 Virtual ACPI Hotplug

Making an ACPI hotplug controller in the device model is rather like designing a hardware platform to support ACPI hotplug, but using software emulation. Virtual ACPI hotplug needs several parts of the device model to coordinate in a sequence similar to the native one. For system event notification, ACPI introduces the GPE (General Purpose Event) register, a bitmap in which each bit can be wired to different value-added event hardware, depending on the design.

The virtual ACPI hotplug sequence is described in Figure 2. When the end user issues the hot-removal command for the pass-through device, analogous to pushing the eject button, the hotplug controller updates its status, then asserts the GPE bit and raises an SCI (System Control Interrupt). Upon receiving the SCI, the ACPI driver in the Linux guest clears the GPE bit, queries the hotplug controller about which specific device to eject, and then notifies the Linux guest. In turn, the Linux guest shuts down the device and unloads the driver. Finally, the ACPI driver executes the related control methods: _EJ0 to power off the PCI device, and _STA to verify the success of the ejection. Hot add follows a similar process, except that it doesn't call _EJ0.
From the process shown above, it's obvious that the following components are needed:

- GPE: A GPE device model, with one bit wired to the hotplug controller, described to the guest via the FADT (Fixed ACPI Description Table).

- PCI hotplug controller: A controller is needed to respond to the user's hotplug action and maintain the status of the PCI slots. ACPI abstracts a well-defined interface, so we can implement the internal logic in a simplified style, such as stealing some reserved I/O ports for status registers.

- Hotplug control methods: ACPI control methods for hotplug, such as _EJ0 and _STA, should be added to the guest ACPI tables. These methods interact with the hotplug controller for device ejection and status checks.
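The controller's role in this sequence can be modeled as a small state machine. The register layout, bit positions, and method names below are invented for illustration; a real device model would follow the ACPI spec's GPE block and control-method semantics and expose these through emulated I/O ports and AML.

```python
# Toy model of the virtual ACPI hotplug flow: the device model sets a
# GPE bit and raises an SCI; the guest's ACPI driver clears the bit,
# asks which slot to eject, then runs the _EJ0/_STA analogs.

GPE_HOTPLUG_BIT = 1 << 2          # which GPE bit the controller owns

class HotplugController:
    def __init__(self):
        self.gpe = 0              # general-purpose event bitmap
        self.sci_pending = False  # system control interrupt line
        self.pending_eject = None # slot awaiting ejection
        self.slot_powered = {}    # slot -> powered state

    def plug(self, slot):
        self.slot_powered[slot] = True

    # --- device model side: user pressed the virtual eject button ---
    def request_eject(self, slot):
        self.pending_eject = slot
        self.gpe |= GPE_HOTPLUG_BIT
        self.sci_pending = True         # raise SCI toward the guest

    # --- guest ACPI driver side ---
    def ack_sci(self):
        self.gpe &= ~GPE_HOTPLUG_BIT    # guest clears the GPE bit
        self.sci_pending = False
        return self.pending_eject       # query: which slot to eject?

    def ej0(self, slot):                # _EJ0 analog: power off slot
        self.slot_powered[slot] = False
        self.pending_eject = None

    def sta(self, slot):                # _STA analog: status check
        return 0xF if self.slot_powered.get(slot) else 0x0
```

Hot add would run the same GPE/SCI notification path but finish with a driver load instead of `ej0`, matching the sequence described above.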
5 Status and Future Work

Right now, hotplug with a pass-through device works well on Xen. With this and the bonding driver, Linux guests can successfully perform live migration. Besides live migration, pass-through device hotplug has other useful usage models, such as dynamically switching physical devices between different VMs.
Figure 2: ACPI Hotplug Sequence

There is some work and investigation that needs to be done in the future:

- High-level management tools: Currently, hotplug of a pass-through device is kept separate from the generic live migration logic for a clean design, so the end user is required to issue hotplug commands manually before and after live migration. In the future, these actions should be pushed into high-level management tools, such as a friendly GUI or scripts, in order to function without user intervention.
- Virtual S3: The Linux bonding driver works perfectly for a NIC, but bonding other directly assigned devices, such as graphics cards, is not as useful. Since Linux has good support for ACPI S3, we could instead use virtual S3 to suspend all devices before live migration and wake them up afterwards. Some drawbacks of virtual S3 need more consideration:

  - All other devices, besides the pass-through devices, go through this cycle too, which takes more time than virtual hotplug.

  - With S3, the OS is in a sleep state, so a long down-time of the running service is unavoidable.

  - S3 assumes that the OS will wake up on the same platform, so the same type of pass-through device must exist in the target machine.

  - S3 support in the guest may not be complete and robust.

  Although virtual S3 for pass-through device live migration has its own limitations, it is still useful in environments where virtual hotplug doesn't work, for instance, hot removal of a pass-through display card, which is likely to cause a guest crash.

- Other guests: Linux supports ACPI hotplug and has a powerful bonding driver, but other guest OSes may not be lucky enough to have such a framework. We are in the process of extending support to other guests.
6 Conclusion

Direct VM access to a physical device achieves close to native performance, but breaks VM live migration. Our virtual ACPI hotplug device model allows a VM to hot remove the pass-through device before relocation and hot add another one after relocation, thus making pass-through devices coexist with VM relocation. By integrating the Linux bonding driver into the relocation process, we enable continuous network connectivity for directly assigned NIC devices, the most popular pass-through device usage model.
References

[1] Hewlett-Packard, Intel, Microsoft, Phoenix, and Toshiba. "Advanced Configuration and Power Interface Specification," Revision 3.0b, 2006. http://www.acpi.info

[2] "PCI Standard Hot-Plug Controller and Subsystem Specification," Revision 1.0, June 2001. http://www.pcisig.info

[3] T. Davis, W. Tarreau, C. Gavrilov, and C.N. Tindel. "Linux Ethernet Bonding Driver," Linux Howto Documentation, April 2006.

[4] M. John. "High Available Networking," Linux Journal, January 2006.

[5] C. Clark, K. Fraser, S. Hand, J.G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. "Live Migration of Virtual Machines," in Proceedings of the 2nd Symposium on Networked Systems Design and Implementation, 2005.

[6] I. Pratt, K. Fraser, S. Hand, C. Limpach, A. Warfield, D. Magenheimer, J. Nakajima, and A. Mallick. "Xen 3.0 and the Art of Virtualization," in Proceedings of the Linux Symposium (OLS), Ottawa, Ontario, Canada, 2005.

[7] "Intel Virtualization Technology for Directed I/O Architecture Specification," 2006. ftp://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf

[8] M. Ben-Yehuda, J. Mason, O. Krieger, J. Xenidis, L.V. Doorn, A. Mallick, and J. Nakajima. "Utilizing IOMMUs for Virtualization in Linux and Xen," in Proceedings of the Linux Symposium (OLS), Ottawa, Ontario, Canada, 2006.
Intel may make changes to specifications, product descriptions, and plans at any time, without notice.

Intel and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries (regions).

*Other names and brands may be claimed as the property of others.

Copyright © 2008, Intel Corporation. Redistribution rights are granted per submission guidelines; all other rights reserved.
Proceedings of the
Linux Symposium
Volume Two
July 23rd–26th, 2008
Ottawa, Ontario
Canada
Conference Organizers
Andrew J. Hutton, Steamballoon, Inc., Linux Symposium,
Thin Lines Mountaineering
C. Craig Ross, Linux Symposium
Review Committee
Andrew J. Hutton, Steamballoon, Inc., Linux Symposium,
Thin Lines Mountaineering
Dirk Hohndel, Intel
Gerrit Huizenga, IBM
Dave Jones, Red Hat, Inc.
Matthew Wilson, rPath
C. Craig Ross, Linux Symposium
Proceedings Formatting Team
John W. Lockhart, Red Hat, Inc.
Gurhan Ozen, Red Hat, Inc.
Eugene Teo, Red Hat, Inc.
Kyle McMartin, Red Hat, Inc.
Jake Edge, LWN.net
Robyn Bergeron
Dave Boutcher, IBM
Mats Wichmann, Intel
Authors retain copyright to all submitted papers, but have granted unlimited redistribution rights
to all as a condition of submission.
... Nowadays, virtualization is not supported in the most of the hardware, so device virtualization could only rely on pure software technology [30]. Physical resources are shared using software based virtualization between different guests, by preventing access to device resources from guests. ...
Article
Full-text available
In a cloud computing the live migration of virtual machines shows a process of moving a running virtual machine from source physical machine to the destination, considering the CPU, memory, network, and storage states. Various performance metrics are tackled such as, downtime, total migration time, performance degradation, and amount of migrated data, which are affected when a virtual machine is migrated. This paper presents an overview and understanding of virtual machine live migration techniques, of the different works in literature that consider this issue, which might impact the work of professionals and researchers to further explore the challenges and provide optimal solutions.
... If we look solely at I/O performance, self-virtualizing I/O devices [39] are in conflict with commonly-used live migration [50], do not easily scale with the number of VMs [26], and prevent commonly-used interposition techniques [43]. SVt triggers VM traps in a more efficient way when accessing devices, therefore supporting these use cases while reducing their overheads. ...
Conference Paper
IaaS datacenters offer virtual machines (VMs) to their clients, who in turn sometimes deploy their own virtualized environments, thereby running a VM inside a VM. This is known as nested virtualization. VMs are intrinsically slower than bare-metal execution, as they often trap into their hypervisor to perform tasks like operating virtual I/O devices. Each VM trap requires loading and storing dozens of registers to switch between the VM and hypervisor contexts, thereby incurring costly runtime overheads. Nested virtualization further magnifies these overheads, as every VM trap in a traditional virtualized environment triggers at least twice as many traps. We propose to leverage the replicated thread execution resources in simultaneous multithreaded (SMT) cores to alleviate the overheads of VM traps in nested virtualization. Our proposed architecture introduces a simple mechanism to colocate different VMs and hypervisors on separate hardware threads of a core, and replaces the costly context switches of VM traps with simple thread stall and resume events. More concretely, as each thread in an SMT core has its own register set, trapping between VMs and hypervisors does not involve costly context switches, but simply requires the core to fetch instructions from a different hardware thread. Furthermore, our inter-thread communication mechanism allows a hypervisor to directly access and manipulate the registers of its subordinate VMs, given that they both share the same in-core physical register file. A model of our architecture shows up to 2.3× and 2.6× better I/O latency and bandwidth, respectively. We also show a software-only prototype of the system using existing SMT architectures, with up to 1.3× and 1.5× better I/O latency and bandwidth, respectively, and 1.2--2.2× speedups on various real-world applications.
Article
Virtualized Network I/O (VNIO) plays a key role in providing the network connectivity to cloud services, as it delivers packets for Virtual Machines (VMs). Existing para-virtualized solutions accelerate the virtual Switch (vSwitch) data transfer via memory-sharing mechanism, that unfortunately impairs the memory isolation barrier among VMs. In this paper, we categorize existing para-virtualized solutions into two types: VM to vSwitch (V2S) and vSwitch to VM (S2V), according to the memory-sharing strategy. We then analyze their individual VM isolation issues, that is, a malicious VM may access other ones’ data by exploiting the shared memory. To solve this issue, we propose a new S2H memory sharing scheme, which shares the I/O memory from vSwitch to Hypervisor. The S2H scheme can guarantee both VM isolation and network performance as the hypervisor acts as a “setter” between VM and vSwitch for packet delivery. To show that S2H can be implemented easily and efficiently, we implement the prototype based on the de-facto para-virtualization standard vHost-User solution. Extensive experimental results show that S2H not only guarantees the isolation but also holds the comparable throughput with the same CPU cores configured, when comparing with the native vHost-User solution.
Article
The fast access to data and high parallel processing in high-performance computing instigates an urgent demand on the improvement of the NVMe storage within modern data centers. However, the former NVMe virtualization's unsatisfactory performance demonstrates that NVMe devices are often underutilized within cloud platforms. An NVMe virtualization mechanism with high performance and device sharing has captured researchers and developers' attention. This paper introduces MDev-NVMe, a new virtualization solution for NVMe storage device with (1) full NVMe storage virtualization for VMs running native NVMe driver, (2) a mediated pass-through mechanism for NVMe management, and (3) adaptive configuration of active polling optimization to simultaneously achieve high throughput, low latency performance, and substantial device scalability. We practically implement the MDev-NVMe as a Linux kernel module. This paper subsequently evaluates MDev-NVMe with Intel OPTANE and P3600 SSD by comparing several mainstream NVMe virtualization mechanisms using application-level I/O benchmarks. MDev-NVMe with active polling can demonstrate a 142% improvement over native (interrupt-driven) throughput and over 2.5 × the Virtio throughput with only 70% native average latency and 31% Virtio average latency. Finally, the advantages of MDev-NVMe and the importance of adaptive polling are discussed, offering evidence that MDev-NVMe is a superior virtualization choice for cloud storage.
Conference Paper
High availability is the most important and challenging problem for cloud providers. However, the virtual machine monitor (VMM), a crucial component of the cloud infrastructure, has to be frequently updated and restarted to add security patches and new features, undermining high availability. There are two existing live update methods to improve cloud availability: kernel live patching and Virtual Machine (VM) live migration. However, both have serious drawbacks that impair their usefulness in large cloud infrastructures: kernel live patching cannot handle complex changes (e.g., changes to persistent data structures), and VM live migration may incur unacceptably long delays when migrating millions of VMs across the whole cloud, for example to deploy urgent security patches. In this paper, we propose a new method, VMM live upgrade, that can promptly upgrade the whole VMM (KVM & QEMU) without interrupting customer VMs. Timely upgrade of the VMM is essential to the cloud because it is both the main attack surface of malicious VMs and the component that integrates new features. We have built a VMM live upgrade system called Orthus. Orthus features three key techniques: dual KVM, VM grafting, and device handover. Together, they enable the cloud provider to load an upgraded KVM instance while the original one is running and "cut-and-paste" the VM to this new instance. In addition, Orthus can seamlessly hand over pass-through devices to the new KVM instance without losing any ongoing (DMA) operations. Our evaluation shows that Orthus can reduce the total migration time and downtime by more than 99% and 90%, respectively. We have deployed Orthus in one of the largest cloud infrastructures for a long time; it has become the most effective and indispensable tool in our daily maintenance of hundreds of thousands of servers and millions of VMs.
Article
Barely acceptable block I/O performance prevents virtualization from being widely used in the high-performance computing field. Although the virtio paravirtual framework brings great I/O performance improvements, there is a sharp performance degradation when accessing high-performance NAND-flash-based devices from a virtual machine, due to their data-parallel design. The primary cause is the lack of block I/O parallelism in hypervisors such as KVM and Xen. In this paper, we propose a novel design of the block I/O layer for virtualization, named VBMq. VBMq is based on the virtio paravirtual I/O model and aims to solve the block I/O parallelism issue in virtualization. It uses multiple dedicated I/O threads to handle I/O requests in parallel. Meanwhile, a polling mechanism alleviates the overhead caused by frequent context switches as the VM notifies its hypervisor and vice versa. Each dedicated I/O thread is assigned to a non-overlapping core to avoid unnecessary scheduling, and CPU affinity is configured to optimize I/O completion for each request. The CPU affinity setting is very helpful in reducing the CPU cache miss rate and increasing CPU efficiency. The prototype system is based on the Linux 4.1 kernel and QEMU 2.3.1. Our measurements show that the proposed method scales gracefully in multi-core environments and provides performance up to 39.6x better than the baseline, approaching bare-metal performance.
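The thread-per-core idea behind VBMq can be sketched with a toy example. The worker logic and names below are illustrative, not the paper's code; it is Linux-only, since it relies on `sched_setaffinity` (with pid 0, which binds the calling thread):

```python
import os
import threading

# Number of cores this process may run on.
NCPUS = len(os.sched_getaffinity(0))

def io_worker(core, requests, out):
    # Pin this dedicated I/O thread to a single core, as VBMq assigns
    # each I/O thread to a non-overlapping core to avoid scheduler
    # interference and reduce cache misses.
    os.sched_setaffinity(0, {core % NCPUS})
    for req in requests:
        out.append((core % NCPUS, req))  # stand-in for completing one I/O request

completed = []
workers = [threading.Thread(target=io_worker, args=(i, range(4), completed))
           for i in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

With two workers each handling four requests, `completed` ends up with eight entries; in a real backend each worker would instead drain its own virtio queue.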
Article
Single-Root I/O Virtualization (SR-IOV) is a specification that allows a single PCI Express (PCIe) device (physical function, or PF) to be used as multiple PCIe devices (virtual functions, or VFs). In a virtualization system, each VF can be directly assigned to a virtual machine (VM) in pass-through mode to significantly improve network performance. However, VF pass-through mode is not compatible with live migration, an essential capability that enables many advanced virtualization features such as high availability and resource provisioning. To solve this problem, we design SRVM, which provides hypervisor support to ensure the VF device can be correctly used by the migrated VM and its applications. SRVM is implemented in the hypervisor without modifications to guest operating systems or guest VM drivers. SRVM does not increase VM downtime; it only costs limited resources (an extra CPU core, and only during the live-migration pre-copy phase), and there is no significant runtime overhead in VM network performance.
Article
We present the design, implementation, and evaluation of post-copy based live migration for virtual machines (VMs) across a Gigabit LAN. Post-copy migration defers the transfer of a VM's memory contents until after its processor state has been sent to the target host. This deferral is in contrast to the traditional pre-copy approach, which first copies the memory state over multiple iterations followed by a final transfer of the processor state. The post-copy strategy can provide a "win-win" by reducing total migration time while maintaining the liveness of the VM during migration. We compare post-copy extensively against the traditional pre-copy approach on the Xen Hypervisor. Using a range of VM workloads we show that post-copy improves several metrics including pages transferred, total migration time, and network overhead. We facilitate the use of post-copy with adaptive prepaging techniques to minimize the number of page faults across the network. We propose different prepaging strategies and quantitatively compare their effectiveness in reducing network-bound page faults. Finally, we eliminate the transfer of free memory pages in both pre-copy and post-copy through a dynamic self-ballooning (DSB) mechanism. DSB periodically reclaims free pages from a VM and significantly speeds up migration with negligible performance impact on VM workload.
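The pre-copy vs. post-copy difference in pages transferred can be illustrated with a back-of-the-envelope model. The functions and parameter values below are illustrative simplifications, not the paper's actual cost model:

```python
def precopy_pages_sent(total_pages, dirty_per_round, rounds):
    # Pre-copy: one full pass over memory, then each iteration resends
    # the pages the still-running VM dirtied, plus a final
    # stop-and-copy of the last dirty set.
    return total_pages + dirty_per_round * rounds + dirty_per_round

def postcopy_pages_sent(total_pages):
    # Post-copy: processor state moves first; every memory page is then
    # sent at most once, either on a remote page fault or via prepaging.
    return total_pages

# A write-heavy guest (1 GiB of 4 KiB pages, 32 MiB dirtied per round)
# makes pre-copy resend many pages; post-copy never resends any.
pre = precopy_pages_sent(total_pages=262144, dirty_per_round=8192, rounds=10)
post = postcopy_pages_sent(total_pages=262144)
```

Under this model `post` is always at most `pre`, and the gap grows with the dirty rate, which is why the paper reports fewer pages transferred for post-copy on write-intensive workloads.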
"Advanced Configuration and Power Interface Specification," Revision 3.0b, Hewlett-Packard, Intel, Microsoft, Phoenix, Toshiba, 2006. http://www.acpi.info
"Linux Ethernet Bonding Driver," T. Davis, W. Tarreau, C. Gavrilov, C.N. Tindel, Linux Howto Documentation, April 2006.
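The bonding driver cited above is what the paper relies on to keep network connectivity while the pass-through NIC is hot-unplugged for migration. A minimal active-backup setup might look like the following sketch; the interface names (bond0, eth0 for the paravirtual NIC, eth1 for the assigned NIC) are illustrative:

```shell
# Load the bonding driver in active-backup mode with MII link
# monitoring every 100 ms (mode and miimon are standard module options).
modprobe bonding mode=active-backup miimon=100

# Bring up the bond and enslave both NICs: the directly assigned NIC
# as the preferred slave, the paravirtual NIC as the backup that takes
# over when the assigned device is hot-removed before migration.
ip link set bond0 up
ifenslave bond0 eth1 eth0

# Prefer the pass-through NIC while it is present.
echo eth1 > /sys/class/net/bond0/bonding/active_slave
```

When the assigned NIC is virtually hot-unplugged, the bonding driver fails over to eth0, so guest connectivity survives the migration window.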
Proceedings of the Linux Symposium, Volume Two, July 23rd–26th, 2008, Ottawa, Ontario, Canada.
Conference Organizers: Andrew J. Hutton, Steamballoon, Inc., Linux Symposium, Thin Lines Mountaineering; Dirk Hohndel, Intel; Gerrit Huizenga, IBM; Dave Jones, Red Hat, Inc.
Copyright (c) 2008, Intel Corporation. Redistribution rights are granted per submission guidelines; all other rights reserved.
"Intel Virtualization Technology for Directed I/O Architecture Specification," Intel Corporation, 2006. ftp://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf
"Utilizing IOMMUs for Virtualization in Linux and Xen," M. Ben-Yehuda, J. Mason, O. Krieger, J. Xenidis, L.V. Doorn, A. Mallick, and J. Nakajima, In Proceedings of the Linux Symposium (OLS), Ottawa, Ontario, Canada, 2006.
"High Available Networking," M. John, Linux Journal, January 2006.
"Live Migration of Virtual Machines," C. Clark, K. Fraser, S. Hand, J.G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, In Proceedings of the 2nd Symposium on Networked Systems Design and Implementation (NSDI), 2005.