Georgia Institute of Technology
Paravirtual devices are common in virtualized environments, providing improved virtual device performance compared to emulated physical devices. For virtualization to make inroads in High Performance Computing and other areas that require high bandwidth and low latency, high-performance transports such as InfiniBand, the Internet Wide Area RDMA Protocol (iWARP), and RDMA over Converged Ethernet (RoCE) must be virtualized.
We developed a paravirtual interface called Virtual RDMA (vRDMA) that provides an RDMA-like interface for VMware ESXi guests. vRDMA uses the Virtual Machine Communication Interface (VMCI) virtual device to interact with ESXi. The vRDMA interface is designed to support snapshots and VMware vMotion® so the state of the virtual machine can be easily isolated and transferred. This paper describes our vRDMA design and its components, and outlines the current state of work and challenges faced while developing this device.
Categories and Subject Descriptors
C.2.5 [Computer-Communication Networks]: Local and Wide-Area Networks- High-speed;
C.4 [Performance of Systems]: Modeling techniques;
D.4.4 [Operating Systems]: Communication Management- Network Communication;
General Terms
Algorithms, Design, High Performance Computing, Management
Keywords
InfiniBand, Linux, RDMA, Subnet Management, Virtualization, virtual machine
Paravirtualized devices are common in virtualized environments [1-3] because they provide better performance than emulated devices. With the increased importance of newer high-performance fabrics such as InfiniBand, iWARP, and RoCE for Big Data, High Performance Computing, financial trading systems, and so on, there is a need [4-6] to support such technologies in a virtualized environment. These devices support zero-copy, operating system-bypass and CPU offload [7-9] for data transfer, providing low latency and high throughput to applications. It is also true, however, that applications running in virtualized environments benefit from features such as vMotion (virtual machine live migration), resource management, and virtual machine fault tolerance. For applications to continue to benefit from the full value of virtualization while also making use of RDMA, the paravirtual interface must be designed to support these virtualization features.
Currently there are several ways to provide RDMA support in virtual machines. The first option, called passthrough (or VM DirectPath I/O on ESXi), allows virtual machines to directly control RDMA devices. Passthrough can also be used in conjunction with single root I/O virtualization (SR-IOV) to support the sharing of a single hardware device between multiple virtual machines by passing through a Virtual Function (VF) to each virtual machine. This method, however, restricts the ability to use virtual machine live migration or to perform any resource management. A second option is to use a software-level driver, called SoftRoCE, to convert RDMA Verbs operations into socket operations across an Ethernet device. This technique, however, suffers from performance penalties and may not be a viable option for some applications.
With that in mind, we developed a paravirtual device driver for RDMA-capable fabrics, called Virtual RDMA (vRDMA). It allows multiple guests to access the RDMA device using a Verbs API, an industry-standard interface. A set of these Verbs was implemented to expose an RDMA-capable guest device (vRDMA) to applications. The applications can use the vRDMA guest driver to communicate with the underlying physical device. This paper describes our design and implementation of the vRDMA guest driver using the VMCI virtual device. It also discusses the various components of vRDMA and how they work in different levels of the virtualization stack. The remainder of the paper describes how RDMA works, the vRDMA architecture and interaction with VMCI, and vRDMA components. Finally, the current status of vRDMA and future work are described.
The Remote Direct Memory Access (RDMA) technique allows devices to read/write directly to an application’s memory without interacting with the CPU or operating system, enabling higher throughput and lower latencies. As shown on the right in Figure 1, the application can directly program the network device to perform DMA to and from application memory. Essentially, network processing is pushed onto the device, which is responsible for performing all protocol operations. As a result, RDMA devices historically have been extremely popular for High Performance Computing (HPC) applications [6, 7]. More recently, many clustered enterprise applications, such as databases, file systems and emerging Big Data application frameworks such as Apache Hadoop, have demonstrated performance benefits using RDMA [6, 11, 12].
Figure 1. Comparing RDMA and Sockets
While data transfer operations can be performed directly by the application as described above, control operations such as allocation of network resources on the device need to be executed by the device driver in the operating system for each application. This allows the device to multiplex between various applications using these resources. After the control path is established, the application can directly interact with the device, programming it to perform DMA operations to other hosts, a capability often called OS-bypass. RDMA also is said to support zero-copy since the device directly reads/writes from/to application memory and there is no buffering of data in the operating system. This offloading of capabilities onto the device, coupled with direct user-level access to the hardware, largely explains why such devices offer superior performance. The next section describes our paravirtualized RDMA device, called Virtual RDMA (vRDMA).
3. vRDMA over VMCI Architecture
Figure 2 illustrates the vRDMA prototype. The architecture is similar to any virtual device, with a driver component at the guest level and another at the hypervisor level that is responsible for communicating with the physical device. In the case of our new device, we include a modified version of the OpenFabrics RDMA stack within the hypervisor that implements the core Verbs required for RDMA devices. Using this stack allows us to be agnostic with respect to RDMA transport, enabling the vRDMA device to support InfiniBand (IB), iWARP, and RoCE.
Figure 2. vRDMA over VMCI Architecture²
We expose RDMA capabilities to the guest using VMCI, a virtual PCI device that supports a point-to-point bidirectional transport based on a pair of memory-mapped queues and datagrams serving as asynchronous notifications. Using VMCI, we construct communication endpoints between each guest and an endpoint in the hypervisor called the vRDMA VMCI endpoint, as shown in Figure 2. All guests connect with the hypervisor endpoint when the vRDMA driver is loaded in the guests.
To use RDMA in a virtual environment, guest operating systems use the standard OpenFabrics Enterprise Distribution (OFED) RDMA stack, along with our guest kernel vRDMA driver and a user-level library (libvrdma). These additional components could be distributed using VMware Tools, which already contain the VMCI guest virtual device driver. The OFED stack, the device driver, and library provide an implementation of the industry-standard Verbs API for each device.
Our guest kernel driver communicates with the vRDMA VMCI endpoint using VMCI datagrams that encapsulate Verbs and their associated data structures. For example, a ‘Register Memory’ Verb datagram contains the memory region size and guest physical addresses associated with the application.
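The encapsulation described above can be sketched as follows. This is a minimal illustration only: the opcode value, the field order, and the function names are assumptions made for the example and do not reflect the prototype's actual datagram layout.

```python
import struct

# Hypothetical wire layout (not the actual vRDMA format):
#   header : verb opcode (u32), page count (u32), region size in bytes (u64)
#   payload: one u64 guest physical page address per page
REGISTER_MEMORY = 1  # assumed opcode value, for illustration only
HEADER = struct.Struct("<IIQ")


def pack_register_memory(region_size, guest_page_addrs):
    """Serialize a 'Register Memory' request into a datagram payload."""
    header = HEADER.pack(REGISTER_MEMORY, len(guest_page_addrs), region_size)
    pages = struct.pack(f"<{len(guest_page_addrs)}Q", *guest_page_addrs)
    return header + pages


def unpack_register_memory(datagram):
    """Decode the payload on the endpoint side."""
    opcode, page_count, region_size = HEADER.unpack_from(datagram, 0)
    if opcode != REGISTER_MEMORY:
        raise ValueError(f"unexpected verb opcode {opcode}")
    addrs = struct.unpack_from(f"<{page_count}Q", datagram, HEADER.size)
    return region_size, list(addrs)


# Round trip: two 4 KiB pages backing an 8 KiB memory region.
payload = pack_register_memory(8192, [0x10000, 0x2A000])
decoded = unpack_register_memory(payload)
```

A fixed header followed by a variable-length address list keeps the decoder simple: the endpoint reads the page count first and then knows exactly how many addresses to expect.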
When the VMKernel VMCI Endpoint receives the datagram from the guest, it handles the Verb command in one of two ways, depending on whether the destination virtual machine is on the same physical machine (intra-host) or a different machine (inter-host).
The vRDMA endpoint can determine if two virtual machines are on the same host by examining the QP numbers and LIDs assigned to the virtual machines. Once it has determined the virtual machines are on the same host, it emulates the actual RDMA operation. For example, when a virtual machine issues an RDMA Write operation, it specifies the source address, destination address, and data size. When the endpoint receives the RDMA Write operation in the Post_Send Verb, it performs a memory copy into the address associated with the destination virtual machine. We can further extend the emulation to the creation of queue pairs (QPs), CQs, MRs (see  for a glossary of terms) and other resources such that we can handle all Verbs calls in the endpoint. This method is described in more detail in Section 4.4.
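As a rough sketch of the intra-host emulation, the fragment below models each virtual machine's registered memory as a byte buffer and performs an RDMA Write as a bounds-checked copy. The `GuestMemory` class and `emulate_rdma_write` function are hypothetical names introduced for this example; the real endpoint operates on guest physical memory rather than Python buffers.

```python
class GuestMemory:
    """Stand-in for one virtual machine's registered memory region."""

    def __init__(self, size):
        self.buf = bytearray(size)


def emulate_rdma_write(src, src_addr, dst, dst_addr, length):
    """Emulate an RDMA Write between two guests on the same host."""
    if src_addr < 0 or dst_addr < 0 or length < 0:
        raise ValueError("negative address or length")
    if src_addr + length > len(src.buf) or dst_addr + length > len(dst.buf):
        raise ValueError("transfer falls outside a registered region")
    # The whole operation reduces to a memory copy; no device is involved.
    dst.buf[dst_addr:dst_addr + length] = src.buf[src_addr:src_addr + length]


vm_a = GuestMemory(64)
vm_b = GuestMemory(64)
vm_a.buf[0:5] = b"hello"
emulate_rdma_write(vm_a, 0, vm_b, 16, 5)
```

The bounds check stands in for the validation a device would perform against the registered memory region before touching memory.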
In the inter-host case, the vRDMA endpoint interacts with the ESXi RDMA stack as shown in Figure 2. When a datagram is received from a virtual machine, it checks to see if the Verb corresponds to a creation request for communication resources, such as Queue Pairs. These requests are forwarded to the ESXi RDMA stack, which returns values after interacting with the device. The endpoint returns results to the virtual machine using a VMCI datagram. When the virtual machine sends an RDMA operation command, such as RDMA Read, RDMA Write, RDMA Send, or RDMA Receive, the endpoint directly forwards the Verbs call to the RDMA stack since it already knows it is an inter-host operation.
4. Components of vRDMA
This section describes the main components of the vRDMA paravirtual device and their interactions.
4.1 libvrdma
The libvrdma component is a user-space library that applications use indirectly when linking to the libibverbs library. Applications (or middleware such as MPI) that use RDMA link to the device-agnostic libibverbs library, which implements the industry-standard Verbs API. The libibverbs library in turn links to libvrdma, which enables the application to use the vRDMA device.
Verbs are forwarded by libibverbs to libvrdma, which in turn forwards the Verbs to the main RDMA module present in the guest kernel using the Linux /dev file system. This functionality is required to be compatible with the OFED stack, which communicates with all underlying device drivers in this way. Figure 3 illustrates the RDMA application executing a ‘Query_Device’ Verb.
Figure 3. Guest vRDMA driver and libvrdma
4.2 Guest Kernel vRDMA Driver
The OFED stack provides an implementation of the kernel-level Verbs API called the ib_core framework. The framework allows device drivers to register themselves using callbacks for each Verb. Each device driver must provide implementations of a mandatory list of Verbs (Table 1). Therefore, our vRDMA guest kernel driver must implement this list of Verbs to register successfully with the OFED stack. Underneath these Verbs calls we communicate with the vRDMA endpoint to ensure valid responses for each Verb. Next, using the Query_Device Verb as an example, we describe how the Guest kernel driver handles Verbs.
Table 1: Verbs supported in the vRDMA prototype.
As shown in Figure 3, the Query_Device Verb is executed by the application. It is passed to the ib_core framework, which calls the Query_Device function in our vRDMA kernel module using the registered callback. The vRDMA guest kernel module packetizes the Verbs call into a buffer to be sent using a VMCI datagram. This datagram is sent to the vRDMA VMKernel VMCI Endpoint that handles all requests from virtual machines and issues responses.
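The registration and dispatch pattern described in this section can be modeled as below. This is a toy Python model, not the ib_core API: the class `VerbsCore`, the verb names in `MANDATORY_VERBS`, and the `sent` list standing in for the VMCI datagram channel are all assumptions for illustration. The actual mandatory list is the one given in Table 1.

```python
# Assumed subset of mandatory Verbs, for illustration only.
MANDATORY_VERBS = {"query_device", "reg_mr", "modify_qp", "post_send"}


class VerbsCore:
    """Toy model of a framework that dispatches Verbs to a driver."""

    def __init__(self):
        self.drivers = {}

    def register_device(self, name, callbacks):
        # Registration fails unless every mandatory Verb has a callback.
        missing = MANDATORY_VERBS - set(callbacks)
        if missing:
            raise ValueError(f"driver {name} lacks verbs: {sorted(missing)}")
        self.drivers[name] = dict(callbacks)

    def call(self, name, verb, *args):
        return self.drivers[name][verb](*args)


sent = []  # stands in for the VMCI datagram channel to the endpoint


def forward(verb):
    """Build a callback that packages the call and 'sends' it."""
    def callback(*args):
        sent.append((verb, args))
        return {"verb": verb, "status": 0}
    return callback


core = VerbsCore()
core.register_device("vrdma0", {v: forward(v) for v in MANDATORY_VERBS})
reply = core.call("vrdma0", "query_device")
```

Because every callback has the same shape (package the call, send it, return the response), a driver of this kind can generate its callbacks from a single forwarding helper.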
4.3 ESXi RDMA stack
The ESXi RDMA stack is an implementation of the OFED stack that resides in VMKernel and contains device drivers for RDMA devices. In our prototype, the ESXi RDMA stack mediates access to RDMA hardware on behalf of our vRDMA device in the guest. Other hypervisor services, such as vMotion and Fault Tolerance, could use this stack to access the RDMA device.
4.4 VMKernel vRDMA VMCI Endpoint
The main role of the VMKernel vRDMA VMCI endpoint is to receive requests from virtual machines and send appropriate responses back by interacting with the ESXi RDMA stack. Most requests from virtual machines are in the form of Verbs calls. Responses depend on the location of the destination virtual machine. As mentioned in Section 3, the vRDMA endpoint handles the Verb command in two ways depending on whether the destination virtual machine is on the same host. To decide whether the virtual machine is on the same host, the endpoint consults a list of communication resources used by the virtual machine: QP numbers, CQ numbers, MR entries, and LIDs. These identify the communication taking place and, therefore, the virtual machines.
For example, the endpoint can identify the destination virtual machine when the source virtual machine issues a Modify QP Verb, which includes the destination QP number and LID. It matches these with its list of virtual machine communication resources, connecting the two virtual machines in the endpoint when it finds a match. Once connected, the Modify QP/CQ Verb is not forwarded to the RDMA stack. Instead, the endpoint returns values to the virtual machine, stating the Verb completed successfully. When the source virtual machine subsequently executes an RDMA data transfer operation, the vRDMA endpoint performs a memory copy based on the type of operation. For example, when a virtual machine issues an RDMA Write, it specifies the source address, destination address, and data size. The endpoint performs the memory copy when it receives the Verb. Figure 4 shows the vRDMA architecture when virtual machines reside on the same host.
When virtual machines are determined to be on different hosts based on a previous Modify QP Verb call, the vRDMA endpoint forwards all subsequent Verbs calls to the RDMA stack in the VMKernel. The RDMA stack and the physical device are responsible for completing each Verb call. Once a Verb call completes, the endpoint accepts the return values from the RDMA stack and sends them back to the source virtual machine using VMCI datagrams.
Figure 4. vRDMA architecture for virtual machines on the same host
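The intra-host versus inter-host decision made at Modify QP time can be sketched as follows. The `Endpoint` class, its method names, and the returned strings are hypothetical; the sketch only illustrates matching a destination (LID, QP number) pair against the endpoint's list of local resources and routing later operations accordingly.

```python
class Endpoint:
    """Toy vRDMA endpoint that routes Verbs per queue pair."""

    def __init__(self):
        self.local = {}      # (lid, qp_num) -> VM name, for guests on this host
        self.peer_of = {}    # source (lid, qp_num) -> local destination key
        self.forwarded = []  # Verbs handed to the (simulated) RDMA stack

    def create_qp(self, vm, lid, qp_num):
        self.local[(lid, qp_num)] = vm

    def modify_qp(self, src, dest_lid, dest_qp_num):
        dest = (dest_lid, dest_qp_num)
        if dest in self.local:
            # Destination is a guest on this host: connect inside the endpoint.
            self.peer_of[src] = dest
            return "handled-in-endpoint"
        self.forwarded.append(("modify_qp", src, dest))
        return "forwarded"

    def post_send(self, src, payload):
        if src in self.peer_of:
            # Intra-host: the data transfer becomes a memory copy.
            return ("memcpy", self.local[self.peer_of[src]], payload)
        self.forwarded.append(("post_send", src, payload))
        return ("forwarded", None, payload)


ep = Endpoint()
ep.create_qp("vm1", lid=5, qp_num=100)
ep.create_qp("vm2", lid=6, qp_num=200)
same_host = ep.modify_qp((5, 100), dest_lid=6, dest_qp_num=200)
other_host = ep.modify_qp((6, 200), dest_lid=9, dest_qp_num=300)
```

Recording the decision once, when the queue pair is connected, means data transfer operations need only a dictionary lookup rather than a fresh search of the resource list.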
Figure 5 shows the Query_Device Verb being received by the vRDMA endpoint. As an optimization, the values for such Verbs can be cached in the endpoint. Therefore, the first Verb call is forwarded to the RDMA stack, which sends a Management Datagram (MAD) to the device to retrieve the device attributes. Additional Query_Device Verb calls from the virtual machine can be returned by the endpoint using the cached values, reducing the number of MADs sent to the device. This can be extended to other Verbs calls. A further advantage of emulating Verbs calls in this way is that RDMA device attributes can be provided without a physical RDMA device being present.
Figure 5. VMKernel vRDMA Endpoint and ESXi RDMA stack
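The Query_Device caching optimization can be illustrated with a short sketch. The class name, the attribute dictionary, and the `query_stack` callable (which stands in for the path through the RDMA stack to the device) are assumptions for the example.

```python
class CachingEndpoint:
    """Answers repeated Query_Device calls from a cached copy."""

    def __init__(self, query_stack):
        self._query_stack = query_stack  # callable that reaches the device
        self._attrs = None

    def query_device(self):
        if self._attrs is None:
            # Only the first call travels down to the device.
            self._attrs = self._query_stack()
        # Return a copy so callers cannot alter the cached attributes.
        return dict(self._attrs)


device_queries = []


def query_stack():
    """Simulated device query; the attribute values are made up."""
    device_queries.append("MAD")
    return {"max_qp": 1024, "max_cq": 512}


endpoint = CachingEndpoint(query_stack)
results = [endpoint.query_device() for _ in range(3)]
```

Here three Query_Device calls from the guest result in a single simulated MAD, since the second and third calls are served from the cache.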
Emulating these Verbs calls enables us to store the actual device state, containing the QP, CQ, and MR structures within the vRDMA endpoint, which is extremely useful in allowing RDMA-based applications to run on machines without RDMA devices and without modifying the applications to use another transport. This is an important attribute of our vRDMA solution.
5. Current Status and Future Directions
The vRDMA prototype is feature complete, with some testing and debugging remaining. We expect half-roundtrip vRDMA latencies to be about 5μs, lower than the SoftRoCE over vmxnet3 option, but higher than what one can achieve in the bare-metal, passthrough, or SR-IOV VF cases. We intend to measure and report latencies and bandwidths for our prototype when testing is completed.
5.1 Supporting RDMA without RDMA Devices
The Verbs API is an abstraction for the actual functionality of the RDMA device, and device drivers provide their own implementation of these verbs to register with ib_core. It is possible for the device driver to emulate Verbs by returning the expected response to the ib_core framework without interacting with the device. In our prototype, the guest vRDMA driver acts as the RDMA device driver and the vRDMA endpoint acts as the RDMA device, emulating Verbs calls. This layering and abstraction enables the vRDMA endpoint to use any network device and support Verbs-based applications without a physical RDMA device being present.
5.2 Support for Checkpoints and vMotion
One of the main advantages of a paravirtualized interface is the ability to support snapshots and vMotion. Because the state of the vRDMA device is fully contained in guest physical memory and VMCI device state, features such as checkpoints, suspend/resume, and vMotion can be enabled. Additional work will be required to tear down and rebuild underlying RDMA resources (QPs and MRs) during vMotion operations. This is work we are interested in exploring.
5.3 Subnet Management
One of the bigger challenges is to integrate paravirtual RDMA interfaces with subnet management. Consider the InfiniBand case in which the Subnet Manager (SM) assigns Local IDs (LIDs) to IB ports and Global IDs (GIDs) to HCAs. One way to maintain addressability is to let ESXi query the IB SM for a list of unique LIDs and GIDs assignable to the virtual machines. In a large cluster with multiple virtual machines per host, the 16-bit LID range limits the number of virtual machines with unique LIDs. We might need to modify the SM to provide more LIDs in the subnet with virtual machines. Another alternative is to extend VMware® vCenter™ to be the “subnet manager” for virtual RDMA devices, and assign unique LIDs and GIDs within the vCenter cluster.
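To make the LID constraint concrete, the following back-of-the-envelope calculation assumes the unicast LID range defined by the InfiniBand Architecture Specification (0x0001 to 0xBFFF) and a simplified model in which every physical host port and every virtual machine consumes exactly one LID. Switch LIDs and multi-LID port configurations are ignored, and the cluster sizes are hypothetical.

```python
# Unicast LID range in an InfiniBand subnet; the rest of the 16-bit
# space is reserved for multicast and special values.
UNICAST_LID_MIN = 0x0001
UNICAST_LID_MAX = 0xBFFF
UNICAST_LIDS = UNICAST_LID_MAX - UNICAST_LID_MIN + 1


def lids_needed(hosts, vms_per_host):
    """One LID per physical host port plus one per virtual machine."""
    return hosts * (1 + vms_per_host)


def fits(hosts, vms_per_host):
    """True if the simplified cluster fits in one subnet's unicast range."""
    return lids_needed(hosts, vms_per_host) <= UNICAST_LIDS


small_cluster_ok = fits(1024, 32)   # 33,792 LIDs needed
large_cluster_ok = fits(2048, 32)   # 67,584 LIDs needed
```

Under these assumptions a 1,024-host cluster with 32 virtual machines per host still fits, while doubling the host count exceeds the available unicast LIDs.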
6. Related Work
While virtualization is very popular in enterprises, it has not made significant inroads with the HPC community. This can be attributed to the lack of support for high-performance interconnects and perceived performance overhead due to virtualization. There has been progress toward providing access to high-performance networks such as InfiniBand [4, 16] to virtual machines. With our prototype, we do not expect to meet the latencies as shown in . We can, however, offer all the virtualization benefits at significantly lower latencies than alternative approaches based on traditional Ethernet network interface cards (NICs).
While there has been work to provide the features of virtualization [17, 18] to virtual machines, these approaches have not been widely adopted. As a result, the virtual machine monitor (VMM)-bypass approach still comes at the cost of some of the more powerful features of virtualization, such as snapshots, live migration, and resource management.
This paper describes our prototype of a paravirtual RDMA device, which provides guests with the ability to use an RDMA device while benefiting from virtualization features such as checkpointing and vMotion. vRDMA consists of three components:
- libvrdma, a user-space library that is compatible with the Verbs API
- Guest kernel driver, a Linux-based module to support the kernel-space Verbs API
- A VMkernel vRDMA Endpoint that communicates with the Guest kernel driver using VMCI Datagrams
A modified RDMA Stack in the VMKernel is used so the vRDMA endpoint can interact with the physical device to execute the Verbs calls sent by the guest. An optimized implementation of the vRDMA device was explained, in which the data between virtual machines on the same host is copied without device involvement. With this prototype, we expect half round trip latencies of approximately 5μs since our datapath passes through the vRDMA endpoint and is longer than that of the bare-metal case.
We would like to thank Andy King for his insight into our prototype and for lots of help in improving our understanding of VMCI and vmware-tools. We owe a deep debt of gratitude to Josh Simons, who has been a vital help in this project and for his invaluable comments on the paper.
[1] The VMware ESX Server. Available from: http://www.vmware.com/products/esx/
[2] Barham, P., et al. Xen and the Art of Virtualization. In Proceedings of SOSP, 2003.
[3] Microsoft Hyper-V Architecture. Available from: http://msdn.microsoft.com/en-us/library/cc768520.aspx
[4] Liu, J., et al. High Performance VMM-Bypass I/O in Virtual Machines. In Proceedings of USENIX Annual Technical Conference, 2006.
[5] Ranadive, A., et al. Performance Implications of Virtualizing Multicore Cluster Machines. In Proceedings of HPCVirtualization Workshop, 2008.
[6] Simons, J. and J. Buell, Virtualizing High Performance Computing. SIGOPS Oper. Syst. Rev., 2010.
[7] Liu, J., J. Wu, and D.K. Panda, High Performance RDMA-Based MPI Implementation over InfiniBand. International Journal of Parallel Programming, 2004.
[8] Liu, J. Evaluating Standard-Based Self-Virtualizing Devices: A Performance Study on 10 GbE NICs with SR-IOV Support. In Proceedings of International Parallel and Distributed Processing Symposium, 2010.
[9] Dong, Y., Z. Yu, and G. Rose. SR-IOV Networking in Xen: Architecture, Design and Implementation. In Proceedings of Workshop on I/O Virtualization, 2008.
[10] SystemFabricWorks. SoftRoCE. Available from: http://www.systemfabricworks.com/downloads/roce
[11] Sayantan Sur, H.W., Jian Huang, Xiangyong Ouyang and Dhabaleswar K. Panda. Can High-Performance Interconnects Benefit Hadoop Distributed File System? In Proceedings of Workshop on Micro Architectural Support for Virtualization, Data Center Computing, and Clouds, 2010.
[12] Mellanox Technologies. Mellanox Unstructured Data Accelerator (UDA). 2011; Available from: http://www.mellanox.com/pdf/applications/SB_Hadoop.pdf
[13] VMware, Inc. VMCI API; Available from: http://pubs.vmware.com/vmci-sdk/
[14] Wickus Nienaber, X.Y., Zhenhai Duan, LID Assignment In InfiniBand Networks. IEEE Transactions on Parallel and Distributed Systems, 2009.
[15] InfiniBand Trade Association. InfiniBand Architecture Specification, Release 1.2.
[16] Huang, W., J. Liu, and D.K. Panda. A Case for High Performance Computing with Virtual Machines. In Proceedings of International Conference on Supercomputing, 2006.
[17] Huang, W., et al. High Performance Virtual Machine Migration with RDMA over Modern Interconnects. In Proceedings of IEEE Cluster, 2007.
[18] Huang, W., et al. Virtual Machine Aware Communication Libraries for High Performance Computing. In Proceedings of Supercomputing, 2007.
[19] VMware KB 1001805, “Choosing a network adapter for your virtual machine”: http://kb.vmware.com/kb/1001805
¹ Adit was an intern at VMware when working on this project.
² The “RDMA Device” shown here and in other figures refers to a software abstraction in the I/O stack in the VMKernel.