As cloud-computing deployments scale to thousands of hosts, host-management facilities have become critical cloud architecture components. Without scalable management facilities, machine configuration, maintenance, and health monitoring become intractable tasks that put both the practical scalability and the stability of the cloud at risk. The Common Information Model (CIM) architecture combined with host-inventory services enables an administrator to issue management operations from a single station to one or many hosts and to have those operations serviced by vendor-specific CIM provider software on each host. Further, this model enables programmatic control of a federation of hosts. But although the CIM architecture enables developers to formally define the host-to-client interfaces that their CIM provider supports, it provides no guidance for defining the provider-to-kernel interface required to service those provider operations. On existing operating systems, including the VMware vSphere® platform, developers must rely heavily on low-level interfaces for provider-to-kernel communication. The practical consequence of this has been that developers replicate effort implementing common transport tasks, such as parameter marshalling and event delivery. Further, the effective programming interface that developers build atop low-level interfaces lacks any inherent support for versioning, which complicates ongoing compatibility between user components (such as CIM providers) and kernel components (such as drivers).
This article introduces a new high-level API-definition and callback-invocation mechanism that enables developers to formalize the bidirectional user/kernel interface that they want to build and support over time. This mechanism, first delivered with vSphere 5.5, provides high-level callback and parameter delivery infrastructure between kernel- and user-space endpoints, implements versioning support for developer-defined APIs, and supports both asynchronous and synchronous semantics for callback invocation. The infrastructure provided by this mechanism requires a formalized definition of the management operations to be invoked, yet it enables heterogeneous implementation of those operations in a manner similar to the concept of inheritance. The requirement for high-level interface definitions combined with support for heterogeneity enables cloud-scale interoperability and standardization. In addition to the released variant of this management infrastructure, similar infrastructure was implemented as a prototype for FreeBSD, thus demonstrating the potential to enable developers to define cross-platform user/kernel APIs. This article furthermore proposes extensions to the existing mechanism, including virtualized execution of management software off-host and tooling support to automate extraction of the required high-level API definition from developer software.
General Terms: management, design, human factors, standardization, languages.
Keywords: virtualization, management, device drivers, software development, remote procedure calls
1. Introduction

The tremendous success of virtualization technologies in recent years is owed largely to the ability of hypervisor developers to translate that technology into real-world business value. Server consolidation and migration technologies provide for scalable performance, reduced power consumption, and enhanced reliability on commodity x86
server hardware. For IT departments, these benefits translate to cost savings. Specifically, many operations that were possible but prohibitively complex or costly to perform in the physical world (such as manually rebalancing workloads on physical servers to optimize power consumption, redeploying server software onto new hardware, or snapshotting server state for backups) could be directed by automated software or mouse clicks via virtualization technology. Thus, a key reason virtualization is successful is that it is a large-scale manageability-enablement technology.
As hypervisor developers push virtualization technology toward cloud scale, it is critical to ensure that management capabilities continue to scale with the underlying technology. Cloud service providers often compete on cost, offering compute and storage units on a virtualized substrate for a certain rate. But to the extent that managing hundreds or thousands of physical hosts becomes an intensive or cumbersome task, inefficient management scalability represents an added cost. More directly for virtualization developers, efficient management scalability is critical to users who are considering deploying a virtualized substrate on hundreds or thousands of hosts, but who must determine staffing and complexity overhead for management.
In anticipation of this challenge, VMware has developed large-scale host-manageability technologies. VMware has iteratively developed and improved its vSphere APIs for host and data-center management since version 3.5, enabling automated host and virtual machine (VM) management for standardized operations. To enable pluggable,
vendor-specific manageability operations and monitoring, VMware® ESX® 3.5 also added support for industry-standard CIM-based management. vSphere 5.0 later introduced a pluggable command-line interface (ESXCLI) to support vendor-defined host management, thus allowing CLI-based management of vendor-specific operations. To address common server-health monitoring and automated response capabilities, VMware and its partners maintain a server certification program. A critical mission
of this program is to ensure that VMware-compatible server platforms export their basic health and configuration status through the local VMware® ESXi™ system and out to VMware® vCenter Server™ or a vSphere Client.
CIM and ESXCLI support established mechanisms for third-party vendors to export management operations off-host. However, these mechanisms define and formalize only the client-to-host interface. As on other operating system platforms, individual ESXCLI and CIM plug-ins have had to rely on low-level interfaces (such as character devices or procfs) to talk to the underlying kernel modules they interact with. Defining the software contract between user-space components and kernel space required every partner organization using the vSphere platform to implement its own solutions for translating common high-level tasks to the low-level primitives supported by user/kernel transports. This included implementing manual data marshalling across the user/kernel boundary for their various shared data types, reinterpretation of data types when the components used different data widths (such as what happens with 32-bit CIM providers and 64-bit kernel modules on the ESXi platform), and implementation of asynchronous event-delivery mechanisms from kernel modules. Though these common operations were also required on other operating systems, VMware and its partners observed that some of the low-level differences between the vSphere platform and other platforms (such as the mismatched data sizes of user and kernel software and semantic differences in how its scheduler implements kernel-to-user notifications) sometimes exposed subtle assumptions in partner software. These assumptions could lead to software malfunctions on the vSphere platform or on other platforms if those assumptions were ever invalidated. Though these issues could always be worked around, they highlighted the brittleness of using such low-level interfaces and underscored the replicated developer effort required for several partner organizations to all implement the same basic operations just to invoke methods across the user/kernel boundary.
The low-level nature of the user/kernel interfaces available on existing operating systems, including the vSphere platform, also leads to some software-engineering challenges. Partner organizations developing management solutions must tightly couple the development of any kernel module with the management software that communicates with that kernel module. From a practical perspective, this puts pressure on the various development teams as they try to coordinate their feature development with certification testing.
Any change to a driver requires restarting certification testing. On the vSphere platform, CIM providers are also subject to prerelease testing. Thus, changes to manageability interfaces can have far-reaching practical effects. The low-level nature of the user/kernel interface that these components use often means that changing one component triggers a change in the other.
To further enhance the vSphere platform’s on-host manageability, VMware developed a new user/kernel manageability architecture. VMware delivered this new architecture as part of its VMkernel APIs (VMKAPIs) in vSphere 5.5. VMware designed this management architecture specifically so that third-party developers can define a high-level software contract between kernel modules and management software. This management architecture is coordinated with other new APIs in vSphere 5.5 focused on supporting device drivers via VMKAPI.
This new architecture inherently supports versioning of developer-defined interfaces. Versioned interfaces enable freedom from the lock-step development practices that are sometimes required with lower-level interfaces. Versioning also provides the necessary infrastructure for maintaining backward compatibility of manageability interfaces.
Unlike low-level character-device, file-system, or socket-like interfaces, this new architecture provides a remote procedure call (RPC) abstraction for invoking callbacks across the user/kernel boundary. The architecture automatically marshals callback parameters and delivers them across the user/kernel boundary to the target software. In contrast to other RPC systems, however, the new VMware architecture is fundamentally bidirectional. Kernel software can invoke callbacks in user space, and user software
can invoke callbacks in kernel space.
Additionally, this new architecture supports a form of inheritance through dynamic kernel instances. Managed kernel instances all have the property of supporting the same high-level manageability contract that a developer has defined. Using the VMware management framework, a kernel module can register a new kernel instance that implements some or all of a defined management contract in instance-specific ways. User-space software can discover these instances and subsequently choose to invoke a callback on just that instance. Analogously, a kernel-space caller can advertise its instance identifier to user space when invoking a user-space callback.
Perhaps most important of all, however, is that establishing a high-level boundary with high-level semantics enables virtualization of the execution context of management software. Though software that directly interacts with a kernel module usually executes in the user-space context of that kernel module, the new VMware management architecture does not make assumptions about where the management software runs. Its RPC-like abstraction simply requires registration and callback-delivery services, but those services can be delivered off-host as well.
The remainder of this article describes existing manageability frameworks and gives an overview of the new VMware on-host manageability framework. Section 2 provides some background about previous host- and device-management architectures.
Section 3 describes the new VMware high-level management framework, including its implementation on vSphere and a description of a prototype for FreeBSD. Section 4 discusses extensions to the new VMware management architecture that could further benefit developers, including management-software virtualization. Section 5 discusses previous work, and Section 6 provides some concluding remarks.
2. Background

The vSphere host environment, much like traditional operating systems, supports a set of standard interfaces for exporting developer-defined management operations off-host. Those interfaces include the network-facing CIM and ESXCLI interfaces and a set of user/kernel interfaces that the CIM and ESXCLI interfaces communicate with. The new VMware manageability framework serves as a replacement for those user/kernel interfaces on vSphere 5.5 but could, as described in Section 4, also be virtualized beyond the host boundary to subsume networked CIM client/provider operations. This section describes the existing interfaces for exposing vendor-specific management
operations and describes the interfaces used for interacting with kernel modules. This section also outlines the challenges these interfaces pose to third-party developers.
2.1 CIM Architecture
To support additional configuration and monitoring of I/O adapters (storage adapters, network interface cards), kernel module software (network filters, storage plug-ins), and vendor-specific platform monitoring not supported by baseline server-monitoring infrastructure, the vSphere platform implements support for CIM. CIM is a programming mechanism for defining managed objects, operations that can be performed on those objects, and monitoring of events that can be generated by those objects. CIM is supported on vSphere hosts and on Windows 2000 and later, and it is available for Linux.
Figure 1 depicts the CIM architecture as implemented on the vSphere platform. In the CIM architecture, provider software (sometimes referred to as a CIM plug-in) runs on the managed host and registers itself with the CIM broker. During this registration, the CIM broker becomes aware of the object types and operations that the provider supports. Remote management software intending to perform CIM operations connects to the host’s CIM broker process and issues a request to the broker. As depicted in Figure 1, the remote management software must speak the CIM protocol (and thus it is a CIM client to the CIM provider on the host). CIM client software can be a vendor-specific standalone piece of software, a vSphere Client plug-in, or a standard client if the provider is implementing a standards-compliant service.
After the broker receives a request from a CIM client, the broker marshals incoming parameters and passes the request to the one or more CIM providers registered on the host that support the requested operation. CIM providers run in user space. Therefore, if a CIM provider intends to configure or manage a kernel entity (such as setting a parameter in a device driver or registering for notification of particular events from a driver), the CIM provider must implement that operation itself across the user/kernel boundary.
In addition to CIM, the vSphere platform also supports device configuration from ESXCLI plug-ins running on the host. This article does not discuss the ESXCLI architecture, but for purposes of configuration of devices, the architecture is similar to that of a CIM provider. Like a CIM provider, an ESXCLI plug-in is a user-space piece of software that can be tasked with carrying out a remotely invoked configuration operation on a device. When ESXCLI plug-ins must configure hardware, they must implement those operations themselves by communicating with an entity in the kernel to carry out the configuration.
2.2 User/Kernel Communications on the vSphere Platform Using Character Devices
On the vSphere platform, the primary supported method for partners to communicate from user space to kernel space is via character devices. Character devices appear as file nodes in the file system and support traditional file operations. Kernel modules exporting character-device interfaces register a series of file-operations handlers. When a user-space application performs a file operation (open, read, write, poll, ioctl, or close) on a file corresponding to that node in the file system, the associated handler that has been registered with that character-device interface is triggered.
Thus, partners that want to implement a particular high-level operation on a driver or device, such as fetching diagnostic statistic data, must do so atop the low-level character-device semantics. To fetch diagnostic statistic data, user-space software would likely issue an integer-coded ioctl request and a pointer to a user-space buffer to a character device. The kernel- and user-space components would have to agree on the format for the statistic data. Upon servicing the ioctl request, the device driver would need to decode the request number and use kernel-to-user data-copying APIs.
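As a concrete illustration, the shared agreement described above might look like the following header fragment, which both the driver and the user-space tool must compile against. All names here (`mydrv_stats`, `MYDRV_IOC_GET_STATS`) are hypothetical, not taken from any actual driver:

```c
/* Hypothetical shared header: the driver and the user-space
 * component must agree on these exact definitions. */
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Fixed-width fields so a 32-bit provider and a 64-bit kernel
 * agree on the layout (24 bytes on both sides). */
struct mydrv_stats {
    uint64_t rx_frames;
    uint64_t tx_frames;
    uint32_t link_errors;
    uint32_t reserved;
};

/* Integer-coded request number: type 'M', command 0x01, with the
 * data flowing kernel-to-user ("read" direction). */
#define MYDRV_IOC_GET_STATS _IOR('M', 0x01, struct mydrv_stats)
```

The user-space side would then issue `ioctl(fd, MYDRV_IOC_GET_STATS, &stats)`, and the driver's handler would decode the request number and fill the buffer with the kernel's copy-to-user facility. Every vendor ends up writing some variant of this boilerplate.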
A much more complicated, yet common, use case is that of supporting asynchronous event delivery from a kernel module to user space. Typically this is required to implement device monitoring, such as when a CIM provider should raise an alert during a failure condition. To support event delivery, the character device must implement a poll() handler and indicate when a qualifying event (usually preconfigured via a previous configuration request to the character device) has happened. On the user-space side, the application must use poll() or select() to sleep on its character-device file until that file reports having an event (typically by reporting that the file is readable). For CIM providers, this means that the provider must implement thread semantics, because the CIM provider must also be available to service incoming configuration requests from the CIM broker while its monitoring thread is blocked waiting for events to be reported from the character device. When the character device does indicate that an event has happened, the monitoring thread must call in to the character device (typically via a read() or ioctl()) to inspect which event has happened to trigger the awakening.
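The user-space half of this event-monitoring pattern can be sketched as follows. This is a generic POSIX sketch rather than vSphere-specific code, and the demo substitutes a pipe for the character device so the loop is runnable anywhere:

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Monitoring-thread body: sleep until the "device" fd reports an
 * event, then call back in to read the event code the driver
 * queued (here a plain 32-bit identifier). */
static int32_t wait_for_event(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int32_t code = -1;

    if (poll(&pfd, 1, -1) == 1 && (pfd.revents & POLLIN))
        (void)read(fd, &code, sizeof(code));
    return code;
}

/* Self-contained demo: a pipe stands in for the character device.
 * The write side plays the driver marking an event pending. */
static int32_t demo(void)
{
    int fds[2];
    int32_t evt = 42;   /* "driver" side: event 42 becomes pending */

    pipe(fds);
    write(fds[1], &evt, sizeof(evt));
    return wait_for_event(fds[0]);  /* "user" side: wakes and decodes */
}
```

In a real CIM provider this loop runs on a dedicated thread, while other threads remain available to service broker requests.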
2.3 Alternatives to Character Devices on vSphere
Though character devices have previously been the most widely available interface for third-party use, VMware also implements several other interfaces for interacting across the user/kernel boundary. These interfaces bear varying levels of similarity to user/kernel interfaces present on other operating systems. Furthermore, these interfaces might or might not have the property of binary compatibility; thus they might not be stable interfaces that are appropriate for third-party use. Still, the other VMware user/kernel interfaces merit consideration as to whether they might be a readily accessible, suitable improvement over character devices or other contemporary technologies.
For example, the vSphere platform implements the VMkernel sysinfo interface (VSI) for accessing parameters exported from the kernel. Unlike low-level character-device interfaces, VSI handlers can copy typed data objects across the user/kernel boundary. VSI also supports a variant of versioning, though it does not support multiple versions simultaneously; a user-space application accessing VSI must be of the exact same version as the kernel’s VSI data structures. VSI supports strongly typed data types and implements a namespace corresponding to various components in the system that user-space software might want to interact with at runtime, and thus it is similar in that regard to sysfs-type systems. VSI is not a binary-compatible interface suitable for third-party consumption, however.
Additionally, VMware currently implements support for a /proc file system. Unlike with VSI, third-party partners can leverage the /proc file system. But like character-device interfaces, the handlers are low-level read and write operations. Unlike character devices, neither VSI nor the /proc file system supports the notion of session (file) semantics for operations, nor do they support asynchronous event delivery (through polling or otherwise) to user space.
Finally, VMware also implements vmklink sockets, which provide socket semantics between a user-space and kernel-space endpoint. Vmklink sockets are similar to other user-to-kernel socket abstractions, such as netlink sockets on the Linux operating system. Though vmklink sockets are available as part of the VMKAPI set, they exist as a subset of those APIs that are not binary-compatible. This makes them a poor choice for developers to use in their kernel software. Regardless, they provide another point of comparison in user/kernel interface technology. The socket semantics of vmklink sockets provides the capability to build event-notification mechanisms and data transport between user space and the kernel. However, the constructs are low-level in a manner similar to character devices. As with traditional sockets, software using vmklink sockets must implement datagram encoding and decoding on both sides of the socket to send configuration or event messages.
2.4 Challenges for Developers
Though the character-device interface supports the requisite functionality for partners implementing configuration and monitoring software, it has some significant drawbacks. Primarily, the low-level interface supported by character devices leads many developers to implement functionally equivalent or similar infrastructure code in their kernel modules and applications. To call a function in a kernel module with some given parameters, every character device ends up implementing basic request-decoding and parameter marshalling.
Parameter marshalling is something that many developers get wrong at some point, sometimes in subtle ways. For example, they might not correctly support copying between 32-bit user-space applications (such as the CIM environment on vSphere) and the 64-bit VMkernel, or they might not correctly configure compilation of their user-space application and their driver to agree on the layout of a data structure.
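The two failure modes above can be made concrete with a pair of hypothetical parameter structures. The first layout silently differs between a 32-bit provider and a 64-bit kernel; the second uses fixed-width types and explicit padding so both sides see an identical layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fragile: 'long' is 4 bytes in a 32-bit user-space build but
 * 8 bytes in a 64-bit kernel build, and the compiler-inserted
 * padding after 'tag' differs as well. */
struct fragile_params {
    char tag;
    long value;   /* 32-bit build: offset 4; 64-bit build: offset 8 */
};

/* Robust: fixed-width types and explicit padding give every build
 * the same 16-byte layout, so no reinterpretation is needed. */
struct portable_params {
    uint8_t  tag;
    uint8_t  pad[7];   /* padding made explicit, not compiler-chosen */
    uint64_t value;
};
```

Agreeing on layouts like the second one is exactly the kind of repeated, error-prone work the high-level infrastructure is meant to absorb.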
Similarly, developers tend to stumble when correctly implementing event delivery from the kernel to their user-space application. The low-level poll semantics requires that developers build carefully constructed, correctly synchronized state machines to toggle their poll handler between the state of being “readable” (to enable the user-space application to wake up) and having been read (after the event has been delivered from the kernel and read by user space).
Additionally, character-device interfaces have no notion of versioning. A CIM provider from a particular vendor might need to support multiple drivers. This happens when vendors must support different generations of hardware. But the driver itself can evolve over time to add new features, such as when firmware is upgraded and new debugging facilities become available. A CIM provider simply inspecting a character-device node has no way of knowing which high-level operations built atop the low-level interfaces are actually supported. Some developers can and do implement their own basic versioning system (such as having a known ioctl command in which the driver reports back a number corresponding to the services it can provide). But these systems are cumbersome and can lead to subtle breakages when a data-structure definition changes slightly and the CIM provider and a new driver version no longer agree.
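A minimal sketch of such a home-grown versioning convention, with entirely hypothetical macros and a typical ad-hoc compatibility rule, might look like this:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical version word reported back by a well-known
 * "get version" ioctl: major in the high 16 bits, minor low. */
#define VERS(major, minor) (((uint32_t)(major) << 16) | (uint16_t)(minor))

/* Typical ad-hoc rule: majors must match exactly, and the driver
 * must be at least as new as the provider requires.  Every vendor
 * reinvents some variant of this check by hand. */
static int compatible(uint32_t driver, uint32_t provider_needs)
{
    if ((driver >> 16) != (provider_needs >> 16))
        return 0;
    return (driver & 0xffff) >= (provider_needs & 0xffff);
}
```

The check itself is trivial; the breakage comes when a data structure behind one of the versioned operations changes without the version word changing with it.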
The consequence of this is that driver and management software usually are very tightly integrated to ensure that these somewhat cumbersome pieces fit correctly.
3. vSphere High-Level Management Infrastructure
To address the challenges associated with existing user/kernel communication mechanisms, VMware designed and implemented a new high-level infrastructure for invoking callbacks and sending events across the user/kernel boundary. This mechanism has two main components: infrastructure for registering and connecting high-level user/kernel interface definitions, and infrastructure for invoking callbacks (including parameter marshalling) to compatible instances of the same interface definition. VMware has also designed and implemented optional tooling that generates the necessary high-level interface definition from existing application and kernel-module C code. Regardless of whether the high-level interface definition is manually specified by a developer or is automatically generated, the underlying infrastructure provides the requisite support for connectivity and versioning.
Software using the new VMware management infrastructure must supply a high-level interface definition to a VMware API. This high-level definition effectively describes an API that both the kernel-side and user-side components support. Thus, this definition represents the software contract that these components agree to and that the intermediary VMware infrastructure will facilitate. The API being defined by the developer, as described by the high-level definition, includes
- A name.
- A version (major/minor/update/patch).
- The list of callbacks that can be invoked.
- Where each callback resides (user or kernel space).
- Whether the callback is synchronous or asynchronous. (Synchronous callbacks to user space from the kernel are currently not supported.)
- How many parameters each callback has.
- Whether a parameter is an input, output, or input/output parameter. (Asynchronous callbacks only support input parameters.)
- The size of each parameter.
When automatic tooling is used, these parameters to the API can be generated automatically and emitted to a header file that is used by both the user-space application and the kernel module.
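One plausible C rendering of such an interface definition is sketched below. The structure and field names are illustrative only and do not correspond to the actual VMKAPI data structures:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rendering of the high-level interface definition
 * enumerated above; all names are invented for illustration. */
enum cb_location { CB_KERNEL, CB_USER };
enum cb_mode     { CB_SYNC, CB_ASYNC };
enum param_dir   { PARAM_IN, PARAM_OUT, PARAM_INOUT };

struct param_desc {
    enum param_dir dir;    /* async callbacks: PARAM_IN only */
    size_t         size;   /* bytes marshalled across the boundary */
};

struct callback_desc {
    uint32_t          id;
    enum cb_location  location;  /* where the handler runs */
    enum cb_mode      mode;      /* sync-to-user is not supported */
    uint32_t          nparams;
    struct param_desc params[8];
};

struct mgmt_api_desc {
    const char          *name;
    uint16_t             major, minor, update, patch;
    uint32_t             ncallbacks;
    struct callback_desc callbacks[16];
};

/* Example: a one-callback API description for an asynchronous,
 * kernel-to-user event carrying one 64-byte input parameter. */
static struct mgmt_api_desc example_desc(void)
{
    struct mgmt_api_desc d = {
        .name = "vendorX.nicMgmt", .major = 1, .minor = 0,
        .ncallbacks = 1,
        .callbacks = {{ .id = 0, .location = CB_USER, .mode = CB_ASYNC,
                        .nparams = 1, .params = {{ PARAM_IN, 64 }} }},
    };
    return d;
}
```

With automatic tooling, a table like this would be emitted into the shared header; without it, a developer fills it in by hand.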
Figure 2 depicts a CIM provider communicating to a kernel module using the high-level VMware management infrastructure. As is shown, the CIM provider communicates through a user-space management library, which in turn communicates with a kernel-space API implementation. The library and kernel facilities combine to provide a transport between user-space software and kernel modules. Through these facilities, a module and a CIM provider can communicate as though they are directly invoking a callback. The transport copies parameters, handles asynchronous event delivery, and ensures that the user-space and kernel-space components are using a compatible version of the API.
During kernel-module initialization, a module that wants to communicate with a compatible user-space component must register all of its management descriptions with the kernel-facing management API. Each management description enumerates what is effectively a separate management API that a component (either kernel module or user-space software) is compatible with.
Multiple management descriptions of different types (or of the same type but of different compatibility levels) can be supported simultaneously. For example, a module can register and service both a standard configuration and event-reporting API (which could be defined by a standards body or by VMware) and a vendor-specific API. As another example, a single driver could support two versions of the same API (perhaps to add support for a new CIM provider that might use a new version of the API). Supporting separate versions of an API simply requires registering the corresponding descriptions with the VMkernel management API, as depicted in Figure 2. Registration of each API gives back an opaque handle. To invoke a user-space callback (which is equivalent to sending an asynchronous event with parameter data), code in the kernel module invokes the VMkernel management API, passing as parameters the management handle that was registered, the callback ID to be invoked, the instance identifier from which the callback originates (if applicable), and the parameters to the callback, which must be pointers. When the module is unloaded, the module must invoke the transport API with each registered handle to deregister those API instances from the transport.
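The kernel-side life cycle just described can be sketched as follows. The `mgmt_*` functions are stand-ins for the real VMkernel management API (whose actual names and signatures differ); they are stubbed out here so the register/invoke/deregister sequence is runnable:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stubbed stand-ins for the kernel-facing management API. */
typedef struct { int registered; } mgmt_handle_t;
#define MGMT_NO_INSTANCE 0u

static int registered_count;   /* tracks outstanding registrations */

/* Register a management description; gives back an opaque handle. */
static int mgmt_register(const char *name, uint16_t major, uint16_t minor,
                         mgmt_handle_t *out)
{
    (void)name; (void)major; (void)minor;
    out->registered = 1;
    registered_count++;
    return 0;
}

/* Invoke a user-space callback by handle, callback ID, and
 * originating instance; callback parameters must be pointers. */
static int mgmt_invoke_user(mgmt_handle_t *h, uint32_t cb_id,
                            uint32_t instance, void **params)
{
    (void)cb_id; (void)instance; (void)params;
    return h->registered ? 0 : -1;
}

static void mgmt_unregister(mgmt_handle_t *h)
{
    h->registered = 0;
    registered_count--;
}

/* Module life cycle: register at init, invoke callbacks while
 * loaded, deregister every handle at unload. */
static int module_demo(void)
{
    mgmt_handle_t h;
    uint64_t event_data = 7;
    void *params[] = { &event_data };
    int rc;

    if (mgmt_register("vendorX.nicMgmt", 1, 0, &h) != 0)
        return -1;
    rc = mgmt_invoke_user(&h, /*cb_id=*/0, MGMT_NO_INSTANCE, params);
    mgmt_unregister(&h);
    return rc;
}
```

A module supporting two versions of the same API would simply run this sequence once per registered description, keeping one handle per version.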
The life cycle of a user-space component is nearly identical. A user-space application that wants to communicate with a kernel module registers its management description with the user side of the transport, as implemented in the VMware management library. If the registration process is successful, an opaque handle is passed back. Note that registration can fail if no matching kernel endpoint supporting the same type and version of the management description is available. Assuming registration succeeds, the registered handle is then used when callbacks in the kernel are invoked. As in the kernel-caller case, parameters to the callbacks must be pointers.
When the application is closing (or if it crashes), the kernel side of the transport detects the closure and cleans up any remaining state in the kernel.
3.1 Instances and Callback Invocation
As depicted in Figure 2, kernel modules can register instances that can be targets of a management operation. A management API description registered with the high-level management service can contain operations that are global in nature, such as getting module-wide statistics or debugging information. However, the API description can also contain operations that normally pertain to a specific instance that the module controls, such as getting or setting a policy for an individual I/O controller that a module manages.
Further, callbacks sent to user space can originate from a specific instance. The high-level management description effectively outlines the semantics of the management operations that user-space and kernel-space software agree must be supported, but instances can implement those semantics in different ways. Further, it is desirable to be able to enumerate instances of managed elements and to be able to address a specific instance when invoking a callback.

The high-level VMware management infrastructure explicitly supports per-instance semantics in three ways: registration (including support for a form of inheritance), discovery, and callback routing. First, the infrastructure supports dynamic registration (and unregistration) of instances at runtime. When a module creates an instance (such as a logical device) that has instance-specific management methods associated with it, the module registers an instance identifier with the management handle that refers to its management definition. When registering a new instance identifier, a module can optionally supply new, replacement callback functions for some or all of the callbacks that exist in the associated management definition. These replacement callbacks effectively override the default, registered callbacks when that specific callback identifier is invoked for this newly registered instance identifier. Thus, supplying new callbacks enables a form of inheritance from the base, default definition of the callback methods. Note, however, that replacement callbacks must take the same number of parameters, with the same sizes, as the callbacks they replace. The infrastructure also supports dynamic unregistration of instances at runtime, which might be required if the underlying unit being managed (a logical device, for example) is removed or destroyed.
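The override-and-dispatch behavior can be modeled with a small table-based sketch (hypothetical types and names, not the VMKAPI implementation): instance-specific callbacks shadow the defaults, and instances that supply no override inherit the base behavior:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CBS     4
#define MAX_INST    4
#define NO_INSTANCE 0u

typedef int (*mgmt_cb_t)(uint32_t instance, void **params);

/* Default callbacks from the base management definition, plus
 * per-instance override slots (NULL = inherit the default). */
static mgmt_cb_t defaults[MAX_CBS];
static mgmt_cb_t overrides[MAX_INST][MAX_CBS];

static void register_instance(uint32_t inst, uint32_t cb_id, mgmt_cb_t fn)
{
    /* The override must keep the same parameter count and sizes. */
    overrides[inst][cb_id] = fn;
}

/* Routing: an instance-specific override wins; otherwise the call
 * falls through to the default, which still sees the instance id. */
static int dispatch(uint32_t inst, uint32_t cb_id, void **params)
{
    mgmt_cb_t fn = (inst != NO_INSTANCE && overrides[inst][cb_id])
                       ? overrides[inst][cb_id]
                       : defaults[cb_id];
    return fn(inst, params);
}

static int base_get_policy(uint32_t inst, void **p)
{ (void)p; return 100 + (int)inst; }

static int hba2_get_policy(uint32_t inst, void **p)
{ (void)p; return 200 + (int)inst; }

static int demo(void)
{
    defaults[0] = base_get_policy;
    register_instance(2, 0, hba2_get_policy);  /* instance 2 overrides */
    /* Instance 1 inherits the default; instance 2 gets its override. */
    return dispatch(1, 0, NULL) * 1000 + dispatch(2, 0, NULL);
}
```

The sketch shows both halves of the per-instance story: the default callback still receives the instance identifier as a parameter, and an override, when registered, replaces it for that instance only.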
Second, the vSphere high-level management infrastructure supports instances by providing a discovery mechanism to user space. User-space software that uses a management definition that is compatible with the kernel module’s definition can, through the VMware management library, query for existing instances. This query provides the unique identifier for each instance and a friendly display name. The display name is the name that was provided by the kernel module when it registered the associated instance.
The instance identifiers reported by the instance-discovery query can then be used with the third portion of instance support: instance routing for callback invocation. Every time a developer invokes a callback, one parameter of the invocation is an instance identifier. When a callback is invoked from the kernel, the instance identifier refers to the instance that is generating the callback. When a callback is invoked from user space, the instance identifier refers to the instance to which the callback should be addressed. Instances are optional; if a callback does not refer to an instance, the instance parameter can be specified as a reserved NO_INSTANCE value. When unique instance identifiers are used in callback invocation, the management infrastructure routes the callback request accordingly. When a kernel callback is invoked from user space to a specific instance, the instance identifier is carried along and presented as a parameter to the target callback. Thus, the target callback can use this parameter to look up the instance-specific information that the callback is intended to service. Further, if a kernel module has registered an instance-specific, overriding callback method for that specific callback, the overriding callback method is invoked instead of the default callback method. When invoking a user callback from kernel space from a specific instance, the caller (i.e., the kernel module) provides its instance identifier as the parameter to callback invocation. This parameter gets carried along to user space by the management infrastructure and is supplied to the target callback as a parameter. So in this case as well, the target is made aware of which instance this callback invocation is associated with.
Note that this management infrastructure supports N-to-1 communication between N user-space applications and 1 kernel instance. (When no instance identifier is provided, the default target callback still serves as a single endpoint.) However, the reverse is also true: The management infrastructure provides 1-to-N communication when communicating from the kernel to user space. Specifically, there can be N user-space applications simultaneously connected to the kernel that all use the same high-level management definition (albeit with unique implementations for target callbacks in their applications), but only one kernel module can have that same definition. Thus, when a kernel module sends a callback request, it is effectively broadcast to every application that is currently connected. This enables a kernel module to send notifications to applications that have, by virtue of having connected using the high-level management definition, declared their interest in notifications from kernel space about the components they are managing.
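The 1-to-N broadcast semantics reduce, conceptually, to iterating over the set of currently connected applications. A minimal sketch, with hypothetical names (the real infrastructure tracks connections per character-device open, not in a simple array):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CONNECTIONS 16

typedef void (*event_cb)(uint32_t instance_id, const void *payload);

/* Each connected user-space application supplies its own implementation
 * of the target callback for the shared management definition. */
static event_cb connections[MAX_CONNECTIONS];
static size_t num_connections;

static int mgmt_connect(event_cb cb)
{
    if (num_connections == MAX_CONNECTIONS)
        return -1;                       /* connection table full */
    connections[num_connections++] = cb;
    return 0;
}

/* Kernel-originated callback request: delivered to every application
 * currently connected with this management definition. */
static void mgmt_broadcast(uint32_t instance_id, const void *payload)
{
    for (size_t i = 0; i < num_connections; i++)
        connections[i](instance_id, payload);
}

/* Two demo listeners standing in for two connected applications. */
static int hits_a, hits_b;
static void listener_a(uint32_t id, const void *p) { (void)id; (void)p; hits_a++; }
static void listener_b(uint32_t id, const void *p) { (void)id; (void)p; hits_b++; }
```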
As discussed in section 2.2, the fundamental operations required for user/kernel communication, including asynchronous event delivery, can be implemented on top of character devices. The problem with relying on raw character-device interfaces is that they lack many of the rich semantics developers might want to use, and they offer no inherent versioning. Starting from the observation that character devices are sufficient as a communication channel, the proposed management mechanism (its user-space library and kernel-space interfaces) has been implemented using character devices. However, the underlying implementation is not exposed to developer software that uses the transport. The same semantics could be built atop other user/kernel interfaces, including VSI, vmklink sockets, or abstractions present in other operating systems.
During this development, VMware chose a character device–based implementation to simplify porting this infrastructure to other OS environments.
In the kernel-side implementation of the management infrastructure, the infrastructure registers one character device node per management description. Thus, a single module or driver registering three management descriptions (or three versions of similar descriptions) would cause three character device nodes to be created. Additionally, the infrastructure maintains a control interface for servicing discovery of currently registered management descriptions and any associated instances. When a user-space application attempts to register a management description for subsequent use in receiving or delivering callbacks, the user-side implementation of the infrastructure uses this control interface to find a compatible management description. The infrastructure implementation defaults to considering an exact major/minor/update/patch match of a management description to be compatible, but if only the major number matches, the management infrastructure still considers this a match. If two descriptions have the same major number but different minor numbers, the highest minor number is chosen. The update and patch numbers of the version are used as subsequent tiebreakers.
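The matching policy just described can be expressed compactly. The sketch below is an interpretation of that policy with illustrative names (the real library's types are not public): only candidates sharing the major number are compatible, and among those the highest minor wins, with update and patch as subsequent tiebreakers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mgmt_version {
    uint16_t major, minor, update, patch;
};

/* Order two same-major versions: minor first, then update, then patch. */
static int version_cmp(const struct mgmt_version *a,
                       const struct mgmt_version *b)
{
    if (a->minor  != b->minor)  return a->minor  < b->minor  ? -1 : 1;
    if (a->update != b->update) return a->update < b->update ? -1 : 1;
    if (a->patch  != b->patch)  return a->patch  < b->patch  ? -1 : 1;
    return 0;
}

/* Return the index of the best compatible candidate, or -1 if none.
 * A differing major number is incompatible outright. */
static int pick_description(const struct mgmt_version *wanted,
                            const struct mgmt_version *cands, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (cands[i].major != wanted->major)
            continue;   /* major mismatch: not a candidate */
        if (best < 0 || version_cmp(&cands[i], &cands[best]) > 0)
            best = (int)i;
    }
    return best;
}
```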
Assuming that a compatible match is available, the user-side implementation of the infrastructure keeps the file descriptor to the corresponding character device open, creates a pthread dedicated to listening for incoming events from the kernel (using select()), and returns an opaque handle that, internally, contains this metadata. To invoke a callback in kernel space from user space, the transport creates temporary storage to hold the parameters (according to their specified sizes, as defined in the description that was registered) and either does a write() operation on the corresponding character device or performs an ioctl(). For asynchronous callback invocation, write() is used. (Before the write() can happen, a callback descriptor header must be built in user space that encodes the callback identifier to be invoked and the metadata about the callback, including parameter count and sizes that will follow the header). For synchronous callback invocation, ioctl() is used, in which case the callback descriptor is encoded in an ioctl request. This enables output and input/output parameters to be copied back by the user side of the transport from the kernel after the callback has been dispatched and executed inside the kernel. On the kernel side, the write() and ioctl() handlers for the underlying character device are implemented to receive and process callback descriptors, unpack data, and invoke callbacks.
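The callback descriptor built before the write() can be pictured as a fixed header followed by the parameter payloads. The on-the-wire format used by the VMware library is not public, so the field names and layout below are assumptions; the sketch only shows the general shape of packing a header plus parameters into one buffer for a single write().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PARAMS 4

/* Hypothetical callback descriptor header. */
struct cb_descriptor {
    uint32_t callback_id;               /* which kernel callback to invoke */
    uint32_t instance_id;               /* target instance identifier      */
    uint32_t param_count;               /* parameter blobs after header    */
    uint32_t param_sizes[MAX_PARAMS];   /* size of each blob, in order     */
};

/* Pack header plus parameters into one contiguous buffer, ready for a
 * single write() on the management description's character device. */
static size_t pack_request(uint8_t *buf, const struct cb_descriptor *hdr,
                           const void *const params[])
{
    size_t off = sizeof(*hdr);
    memcpy(buf, hdr, sizeof(*hdr));
    for (uint32_t i = 0; i < hdr->param_count; i++) {
        memcpy(buf + off, params[i], hdr->param_sizes[i]);
        off += hdr->param_sizes[i];
    }
    return off;   /* total byte count to hand to write() */
}
```

The kernel-side write() handler would perform the inverse: read the header, then unpack each parameter according to the sizes recorded in the registered description.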
To invoke a callback in user space from kernel space, the kernel-side implementation of the management infrastructure first allocates temporary storage for a callback descriptor and the parameters to be sent. Because limited memory is available for such buffering, this limits the maximum parameter size that can be sent with an event up to user space. After the descriptor is built and parameters are copied into the transport, the kernel side sends a wakeup request on all open files associated with the character device being used for the kernel side of the management description. This wakes all sleeping user-side threads that are waiting for events (or, if they are not sleeping, causes them to reevaluate the poll function). The user side of the infrastructure then inspects the poll state of the character device and finds that it is readable (corresponding to an event being available). The user side reads the callback request, processes it, reads out the parameters corresponding to the callback, and finally dispatches the user-space callback, effectively delivering the event inside the listening thread.
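The user-side listening loop reduces to a select()-then-read() pattern on the character device. The minimal simulation below uses a pipe to stand in for the device (the "kernel" end writes a callback identifier, which makes the read end poll readable and wakes the listener); everything else about the real dispatch path is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/select.h>
#include <unistd.h>

/* Block until the device becomes readable, then read the callback
 * identifier posted by the kernel side. Returns 0 on error. */
static uint32_t wait_for_event(int dev_fd)
{
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(dev_fd, &rfds);

    /* Sleep until an event is available on the device. */
    if (select(dev_fd + 1, &rfds, NULL, NULL, NULL) < 0)
        return 0;

    uint32_t callback_id = 0;
    if (read(dev_fd, &callback_id, sizeof(callback_id)) !=
        (ssize_t)sizeof(callback_id))
        return 0;
    return callback_id;   /* the caller would now dispatch this callback */
}
```

In the real infrastructure this loop runs inside the dedicated pthread created at registration time, and the read is followed by reading out the event's parameters before dispatch.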
Because character devices and pthreads are standard constructs in most existing operating systems, it is possible to implement this management infrastructure on other operating systems as well. In addition to the version of the infrastructure delivered with vSphere 5.5, VMware developed an internal prototype of the infrastructure for FreeBSD. Thus, it is possible to deliver the compatibility, versioning, and high-level semantic benefits of the VMware high-level management infrastructure to other platforms.
As previously established, the high-level VMware management infrastructure requires both kernel and user components to register a fairly exhaustive and detailed description of the management operations (i.e., callbacks) that can be invoked. The parameters of this description, however, are determinable by a compiler if the parameter types are statically analyzable. In the common case, callback functions do take determinable data types as parameters, such as pointers to structures of known type. Further, the infrastructure requires parameter types to be based on fixed-length types, such as VMKAPI’s fixed-length integer types.
Given these requirements of the management framework, a compiler can determine the description parameters that must be supplied to the management framework. The compiler can determine whether a parameter type satisfies the requirements of the framework (i.e., that the parameter is composed of fixed-length types and contains no variable-length fields or extra padding). Further, the compiler can determine the size of a parameter, which must be provided as part of the management description provided to the framework.
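A parameter type meeting these requirements looks like the sketch below: only fixed-width integer types, with a compile-time check that the struct contains no compiler-inserted padding, so its size is deterministic across builds. The struct itself is a hypothetical example, not an actual VMKAPI type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback payload built only from fixed-length types. */
struct nic_stats_param {
    uint64_t rx_packets;
    uint64_t tx_packets;
    uint32_t link_errors;
    uint32_t mtu;
};

/* If the sum of member sizes equals the struct size, the compiler inserted
 * no padding, so the size recorded in the management description is
 * unambiguous. This mirrors the check the tooling's compiler pass performs. */
_Static_assert(sizeof(struct nic_stats_param) ==
               2 * sizeof(uint64_t) + 2 * sizeof(uint32_t),
               "nic_stats_param must not contain padding");
```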
VMware has leveraged these properties of the management framework to develop additional tooling that can automatically create a detailed management description from a small description of the name, vendor, version, and a list of callback function names that are to be used with the management framework. This tooling is depicted in Figure 3. In Figure 3, the steps executed by the tooling are depicted inside diamond-shaped boxes, and source code either provided by the developer or generated by the tooling is depicted in rectangular boxes.
As depicted in Figure 3, the tooling starts in step 1 by preprocessing and parsing developer-supplied callback implementation code and its data-type definitions. The tooling uses the results of this parsing and analysis in step 2 to generate intermediate source files. These source files contain a fully described API signature data structure and a set of macros. The API signature includes the compiler-derived size of each type to be used as payload to a callback function. The macros defined in the generation step are usable by developer-supplied source code for invoking callback functions. In step 3, the tooling creates a build dependency from the developer-provided source onto these intermediate, automatically generated files. Finally, in step 4, the tooling invokes the compiler to compile the intermediate source alongside the developer-provided implementation code.
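What the generated intermediate source might contain can be sketched as follows. The real generated code and macro names are not public, so everything here, including the set_mtu example callback and the invoke() dispatcher, is hypothetical: a signature table whose sizes were derived via sizeof() during generation, plus a macro that developer code uses to invoke a callback by name.

```c
#include <assert.h>
#include <stdint.h>

/* Developer-defined payload type (parsed by the tooling in step 1). */
struct set_mtu_param { uint32_t mtu; };

/* --- Hypothetical generated intermediate source (step 2) --- */

struct cb_signature {
    const char *name;
    uint32_t id;
    uint32_t param_size;   /* compiler-derived via sizeof() */
};

static const struct cb_signature api_signature[] = {
    { "set_mtu", 0, sizeof(struct set_mtu_param) },
};

/* Hypothetical dispatcher the generated macro routes into. */
static int invoke(uint32_t id, const void *param, uint32_t size);

/* Generated invocation macro for developer code. */
#define MGMT_INVOKE_SET_MTU(p) \
    invoke(api_signature[0].id, (p), api_signature[0].param_size)

/* --- Stub dispatcher so the flow is observable --- */
static uint32_t last_id, last_size;
static int invoke(uint32_t id, const void *param, uint32_t size)
{
    (void)param;
    last_id = id;
    last_size = size;
    return 0;
}
```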
This tooling can dramatically reduce the amount of descriptive code that a developer would otherwise need to supply manually. A management description capable of handling tens of callbacks can lead to hundreds of lines of descriptive code in its API signature. Manually supplying that code is an unnecessary burden to developers. Further, automatic generation ensures that a change in the signature of a callback function cannot be overlooked and left unpropagated to the management description.
The new VMware high-level on-host management architecture is fundamentally an RPC-like system. But because it requires a high-level enumeration of those RPCs to function and because it supports a form of inheritance for managed items (i.e., instances), it also enables interoperability. As long as the user-space component and kernel-space component register with an equivalent description (something that the infrastructure verifies), the two components need not be tightly coupled in development.
Further, it is possible for vendors to leverage a third-party–defined management API description (which is effectively the contract of callbacks that must be supported and events that can be delivered) rather than define their own. If a vendor were to implement a predefined API in its driver or module, and if the rest of the management stack for that predefined API existed (i.e., a CIM provider and a CIM client), then the vendor would only have to implement its driver and the driver’s management callbacks to ensure plug-and-play manageability in the vSphere environment. This is similar to the case in which a vendor might create a standards-compliant CIM provider and rely on a third-party (standards-compliant) CIM client rather than develop its own.
Relying on such standards and using off-the-shelf or prepackaged management software greatly reduces the development burden for vendors, shortens time-to-market, and provides VMware customers with a more consistent manageability experience, even in the presence of heterogeneous cloud deployments. Achieving a more consistent manageability experience would help to enable more scalable management and to reduce management costs for customers and cloud providers. Of course, the pluggable nature of the VMware high-level management infrastructure enables vendor-specific extensions alongside any standard operations.
VMware has made efforts toward effectively predefining network and storage driver instrumentation. Network drivers can implement a standard set of manageability callbacks or handlers, such as for changing the maximum transmission unit (MTU) size or enabling or disabling computational offload assist features. This effectively sets a standard, extensible mechanism for implementing defined configuration operations in a driver. Previously, such a mechanism relied on an ioctl handler; unfortunately, because it is ioctl-based, it does not support notification to user space. Thus, defining a functional superset of ioctl parameterization and event delivery (e.g., to report if some configurable number of Layer 2 errors have occurred) using the new high-level management infrastructure enables VMware to expand the set of standard manageability operations. This expansion could include some operations—such as statistics reporting—that can currently require vendor-specific management software to propagate up through the stack. For storage, VMware is actively developing the IODM (I/O device manager) technology, which is a set of instrumentation throughout the driver and storage stack. IODM collects statistics and events on storage paths and devices in the system and then aggregates them for reading and dumping from user space. Historically, this reading and dumping has happened via character-device nodes. However, IODM on vSphere 5.5 leverages the new VMware high-level management architecture. Leveraging this architecture reduced the developer effort required to support asynchronous event delivery to monitoring software.
The idea of establishing a fixed interface for manageability and event delivery that vendors adhere to and then plumbing that interface up through a generic management stack is critical to the current success of the VMware server-certification program. Server certification at VMware works by having server vendors adhere to Intelligent Platform Management Interface (IPMI) hardware interfaces that VMware supports for manageability. VMware then provides a kernel module that interacts with the vendor’s IPMI hardware interface, a CIM provider, and CIM client functionality inside vSphere Client. This architecture is depicted in Figure 4. By adhering to this standard and working with partners, VMware enabled a consistent, compatible mechanism for managing the core server platform for vSphere hosts. As is depicted in Figure 4, partners conforming to these standard IPMI interfaces do not have to supply their own CIM provider or vSphere Client plug-in for manageability; they are able to leverage preexisting software from VMware for a seamless management interface to users.
In addition to enhancing its management capabilities, VMware is examining ways to virtualize the execution environment for management software using the new high-level management architecture. Like other RPC-like systems, this new management framework does not inherently assume that the endpoints of the connected RPCs are on the same host. The abstraction provided simply requires that a callback can be executed across some boundary after an initial negotiation for connection has happened, through a VMware library.
One possibility is for management software to move outside of the ESXi host environment (such as where CIM providers execute) and instead execute in a remote operating system or a virtualized appliance environment. In such an environment, third-party software would have the benefit of additional third-party libraries or execution environments. Further, such an abstraction might eliminate the need for developers to separately model their management operations via CIM. Instead of exporting device or module manageability operations off-host through a CIM provider, the kernel module could use the VMware management framework and VMware-provided network encapsulation to directly export management off-host.
Figure 5 depicts such a potential implementation. In this implementation, the high-level management framework introduced in this article is encapsulated over a network transport. Management software communicates to an encapsulation layer, which is effectively a generic shim (including support for credential authentication) to the underlying management framework on a remote host. This shim would support the same RPC semantics, versioning support, and instance support as the core framework, because none of those features assumes local software execution. Thus, it is possible to isolate management software in an execution context outside of the local ESXi host environment. This could be very beneficial to third-party developers, because the ESXi host environment is much more resource-constrained than traditional operating systems. There is very limited third-party library support, for example. The architecture depicted in Figure 5 would enable third-party developers to deliver feature-rich software alongside other middleware stacks, such as a Java runtime environment or database service, that would not be appropriate in the ESXi environment.
Note that Figure 5 does not assume anything specific about the remote management software application. Such an application could be a third-party piece of software, but it could also be a management plug-in (such as a vSphere Client plug-in) to another application. As depicted in Figure 1, vSphere Client plug-ins commonly communicate to vendor-specific management operations on a host via CIM operations. In a potential implementation of the architecture in Figure 5, the remote application could be a vCenter Client plug-in and invoke the remote kernel module’s callbacks directly. Further, it is possible to enhance the tooling described in section 3.3 to automatically generate stubs for simple vCenter Client plug-in operations. Specifically, simple get or set operations would be compiler-producible, enabling a simple mechanism for a vCenter Client plug-in to call all the way into a remote kernel module and perform management operations on it. Similar extensions to support kernel-to-client communications are possible. Supporting such optional automatic generation would serve to eliminate some of the developer effort required today to create vCenter Client plug-ins.
5. Related Work
The VMware high-level management infrastructure presented here is fundamentally an RPC-like mechanism for issuing callbacks into and out of a kernel module. Though the mechanism demonstrated here connects user-space management software and kernel-space modules on a single host, the system still resembles a distributed RPC system, albeit with some notable differences. Previous research has examined different architectures for user/kernel interprocess communication, and RPC-based distributed systems have been researched extensively.
Wang et al. give an overview of available user/kernel interfaces on the Linux operating system, including sysfs, procfs, relayfs, and netlink sockets [12]. These mechanisms vary in efficiency and in the hierarchy of objects or files that are presented to user space. Relay is notably optimized for efficient data delivery [13]. The low-level file or socket abstractions, however, are similar to those present on other operating systems. As [12] describes, these interfaces vary in their capability to perform bidirectional communication with user-space software, and its authors find netlink sockets to be the most versatile available interface. However, none of these interfaces supports the versioning or high-level-callback abstractions of the management infrastructure described in this article.
Several researchers have also examined the issue of relocating or repartitioning the traditional execution domain of kernel components. Parichha and Gonsalves built a universal serial bus (USB) encapsulation layer over Internet protocol (IP) that enables driver software to run on a remote server while the actual hardware is connected to a remote thin client [8]. Though the encapsulated standard is low-level USB semantics, the idea of virtualizing the interface to move the execution from on-host to off-host is similar to the concept explored in section 4 for relocating on-host management software. Schüpbach et al. propose a repartitioning of device drivers into hardware-access routines and constraint-based algorithmic definitions [9]. Butt et al. propose dividing device drivers into user-space and kernel-space components and use a cross-boundary RPC to connect them [2]. Law and McCann introduced the componentized, microkernel-like Go! operating system, whose components were isolated by a high-performance RPC system [6]. In these works, such repartitioning could be used to move the execution domain of certain components off-host as well, albeit at varying degrees of efficiency loss. The VMware high-level management infrastructure differs from these in its inherent modeling of and support for instances, its support for versioning, and its mechanisms to support interoperability through publishing a developer-defined management description.
Researchers have also examined RPC-based systems more generically, focusing on the properties that the RPC infrastructure itself supports. Bershad et al. introduce the Lightweight Remote Procedure Call mechanism for optimizing the performance of RPCs between endpoints on the same machine but within a single operating system [1]. Chen et al. introduce RPC implementation optimizations for inter-VM workloads [3]. Oey et al. examine mechanisms to deliver higher-level, more robust RPC mechanisms and compare the overhead versus lower-level constructs [7]. In all of these cases, researchers shared with VMware the motivation to deliver high-level RPC semantics that reduce engineering costs for developers. The VMware high-level management framework differs from these implementations in its support for bidirectional method definitions, its built-in heterogeneous instance modeling, and its encapsulation of a set of methods as a single, versioned unit that the RPC service itself can broker.
However, Dave et al. created an RPC implementation that does encapsulate the set of methods that a specific RPC service supports [4]. This work introduces the abstraction of dynamic RPC proxy objects, which is similar in concept to the encapsulation of a management description in a single definition. Further, [4] proposes support for versioning of proxy objects, which is similar to the VMware high-level management framework’s support for versioning of management definitions. The model for objects being brokered, however, is significantly different from the model used here for managed kernel objects. Notably, the proxy-object RPC model does not support bidirectional method invocation. A proxy object advertises the methods that can be invoked, but it does not simultaneously advertise the methods that must be invokable by it. This is fundamental to the notion of establishing a programmer contract for management software and the items it manages. And finally, the proxy-object RPC model does not model or support instances of heterogeneous objects that share the same semantics but have different implementations. As described in section 3.1, this is fundamental to supporting device-driver and pseudo-device management.
6. Conclusion
As data center deployments increase in scale, cloud-infrastructure developers must ensure that such deployments are manageable in a scalable, automatable fashion. Management of physical infrastructure often requires management and monitoring of vendor-specific components, including I/O devices and pseudo-devices (such as storage and network filters). Thus, cloud-infrastructure providers must enable third-party vendors to deliver large-scale manageability solutions, or else customers simply cannot stand up, modify, or monitor their cloud infrastructure without an intractable level of human involvement.
Following on previous technologies that enable pluggable, vendor-specific manageability, this article introduces a new high-level management framework that presents an RPC abstraction for invoking callbacks across the user/kernel boundary on an ESXi host. With its support for bidirectional communication, this management framework presents a simplified, developer-friendly interface for building management software and servicing management requests in kernel modules. The new VMware management framework also builds in support for managed instances, which enables a kernel module to dynamically register new items (such as logical devices) that can service management operations. Because each instance can have its own implementation of a particular management operation, this framework enables third parties to build management interfaces that can be serviced with heterogeneous, implementation-specific methods. Finally, this new management infrastructure builds in support for versioning. Through this versioning support, third-party developers can build management software that can simultaneously support previous versions of drivers or kernel modules, and kernel modules can simultaneously support multiple versions of management software. This helps break the lock-step development cycle that some organizations experience as they develop both management software and the item that it manages (such as a kernel module). In addition to presenting this new management framework, this article notes that the framework enables new methods of management-software partitioning. Specifically, the management framework creates an effective virtualization layer, enabling management software to execute outside of the host environment where the corresponding kernel module it manages is located. This offers developers access to additional libraries, because the ESXi host environment is fundamentally resource-constrained and designed primarily to run VMs.
Finally, this article notes that further layering and tooling built atop the VMware high-level management framework might further reduce developer effort in the future, enabling developers to eliminate certain portions of the management software stack that they currently must implement.
References
1. Brian Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy, “Lightweight Remote Procedure Call,” ACM Transactions on Computer Systems, vol. 8, no. 1, pp. 37–55, February 1990.
2. S. Butt, V. Ganapathy, M. M. Swift, and Chih-Cheng Chang, “Protecting Commodity Operating System Kernels from Vulnerable Device Drivers,” Proceedings of the Computer Security Applications Conference, pp. 301–310, 2009.
3. Hao Chen, Lin Shi, Jianhua Sun, Kenli Li, and Ligang He, “A Fast RPC System for Virtual Machines,” IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 7, pp. 1267–1276, July 2013.
4. A. Dave, M. Sefika, and R. H. Campbell, “Proxies, application interfaces, and distributed systems,” Proceedings of the Second International Workshop on Object Orientation in Operating Systems, Dourdan, pp. 212–220.
5. William Lam. (2012, February) VMware vSphere Blog.
6. G. Law and J. McCann, “A New Protection Model for Component-based Operating Systems,” Proceedings of the Performance, Computing, and Communications Conference, pp. 537–543, 2000.
7. M. Oey, K. Langendoen, and H. E. Bal, “Comparing kernel-space and user-space communication protocols on Amoeba,” Proceedings of the 15th International Conference on Distributed Computing Systems, pp. 238–245, 1995.
8. Barun Kumar Parichha and T. A. Gonsalves, “Remote Device Support in Thin Client Network,” Proceedings of the Third Annual ACM Bangalore Conference (COMPUTE ’10), pp. 21:1–21:4, 2010.
9. Adrian Schüpbach, Andrew Baumann, Timothy Roscoe, and Simon Peter, “A Declarative Language Approach to Device Configuration,” ACM Transactions on Computer Systems, vol. 30, no. 1, pp. 5:1–5:35, February 2012.
10. VMware. (2013, September) CIM SMASH/Server Management API.
11. VMware. (2013, September) VMware vSphere Web Services SDK Documentation.
12. Bei Wang, Bo Wang, and Qingqing Xiong, “The comparison of communication methods between user and Kernel space in embedded Linux,” International Conference on Computational Problem-Solving, 2010.
13. Tom Zanussi, Karim Yaghmour, Robert Wisniewski, Richard Moore, and Michel Dagenais, “relayfs: An Efficient Unified Approach for Transmitting Data from Kernel to User Space,” Proceedings of the Ottawa Linux Symposium, pp. 494–506, 2003.