VProbes: Deep Observability Into the ESXi Hypervisor

Martim Carbone
VMware, Hypervisor Team
mcarbone@vmware.com

Alok Kataria
VMware, Hypervisor Team
akataria@vmware.com

Radu Rugina
VMware, Hypervisor Team
rrugina@vmware.com

Vivek Thampi
VMware, Hypervisor Team
vithampi@vmware.com

Abstract

This paper presents VProbes, a flexible dynamic instrumentation system aimed at providing deep observability into the VMware® ESXi™ hypervisor and the virtual machines (VMs) running on it. The system uses a high-level scripting language with which users define system events (called probes) and the associated actions at those events, typically to collect specific pieces of data. VProbes users can quickly collect customized pieces of data at arbitrary points in the system, using just a handful of lines of VProbes scripting. VProbes is a safe system, has no overhead when not in use, is scalable, allows instrumentation of ESXi on the fly, and can be used in all ESXi build types. The VProbes system is specifically designed to observe the main layers of the VMware software stack: the guest OS, the Virtual Machine Monitor (VMM), and the ESXi kernel. The tool makes it easy to collect various pieces of data at each layer, correlate events across these layers, or trace events from the guest all the way down to the hardware devices.

VProbes has been successfully used internally at VMware by code developers, the performance team, and by our customer support organizations. In certain cases, the support teams estimated that they saved several weeks of staff time when troubleshooting difficult issues. VProbes has also been used as the underlying data collection and system instrumentation mechanism in the Dynamic Driver Verifier (DDV) tool that the VMware ecosystem uses for developing ESXi drivers.

1. Introduction

As software systems get increasingly complex, debugging, monitoring, or understanding system behavior becomes increasingly difficult. The problem is even more challenging in a virtualization environment, where the hypervisor adds more layers to the software stack. Understanding the interactions among the different layers and identifying the component that causes a certain issue—such as a VM hang, high disk latency, or slow network—requires tools that quickly gather low-level data from a running system.

VProbes is an internal tool developed at VMware to help this process by collecting system data that provides detailed observability into all the layers of the ESXi software stack, from the hypervisor to the guest running inside the VM. VProbes uses dynamic instrumentation and is similar in spirit to other tools developed for instrumenting operating system kernels and applications, such as DTrace [4] or SystemTap [12]. VProbes generalizes probing of a single operating system to probing across the multiple layers of a virtualization stack. To use VProbes, users write simple scripts in a C-like language. The script defines a set of events to be observed and the actions corresponding to those events. To troubleshoot an issue, users can start asking high-level questions about the system, then iteratively refine their queries using more detailed VProbes scripts, forming hypotheses and collecting specific data to verify their conjectures. VProbes can be used to help answer questions such as why an application’s performance degrades when running in a VM compared to running on bare metal; which VMs or services are using the most I/O bandwidth, CPU cycles, or memory; or how application or guest OS behavior affects the hypervisor and the virtual infrastructure. VProbes has a number of features that make it appealing to a wide range of users:

  • Dynamic – It uses dynamic instrumentation to probe a running system on the fly, without having to recompile or reboot ESXi. This is a key feature that distinguishes it from common practices of adding print statements in the source code. Avoiding rebuilds and reboots can significantly improve developer productivity and speed up the turnaround for VMware support teams when they troubleshoot customer issues.
  • Flexible – It exposes a scripting-language interface to facilitate the process of asking arbitrary queries and collecting customized pieces of data. Together with the dynamic aspect, VProbes enables developers and support engineers to iterate through different hypotheses, gradually zooming in on the answer they are looking for.
  • Available everywhere – It is available in all build types, including release builds (but currently available only for internal use, e.g., by the VMware support teams). Because release builds trade off certain pieces of debug information, such as logging or code assertions, in favor of performance, VProbes can provide invaluable visibility into systems running release builds of ESXi.
  • End-to-end – It allows inspection of the whole virtualization stack, from the guest to the hypervisor, tracking events or making correlations across layers.
  • Safe – It ensures that a script will not cause the system to hang or crash due to an infinite loop or a bad pointer.
  • Free – When not in use, VProbes causes no overhead or penalty in the running system. When a VProbes script is unloaded, the system goes back to its original state, with zero overhead.

The rest of the paper is organized as follows. Section 2 describes the VProbes programming system. Section 3 illustrates a number of common VProbes usage patterns. Section 4 discusses real-world uses of VProbes. Section 5 presents an experimental evaluation. Section 6 discusses related work. Section 7 concludes the paper and discusses future directions.

2. VProbes Programming System

The VProbes programming system is a framework expressly designed for observing ESXi in a production environment. It consists of a high-level language, a supporting runtime, and a collection of instrumentation engines.

The VProbes language, called Emmett, has been tailored from the ground up for programmatically observing and collecting system information, and for processing and reporting that information on the fly. Users write scripts consisting of one or more probes describing events of interest, and blocks of code that are executed when they happen.

There are three instrumentation domains in VProbes, which represent major components of ESXi: VMK (VMkernel, the ESXi operating system kernel), VMM (the Virtual Machine Monitor, which creates a virtual execution environment with one or more virtual CPUs, for running guest operating systems), and VMX (the Virtual Machine Extension, a user-level companion process to the VMM).

Figure 1 shows a graphical representation of these domains. Each of these components has a VProbes engine built into it, providing low-level binary instrumentation capabilities and runtime support for executing scripts. User scripts are first compiled into a safe, type-checked intermediate form and dispatched to intended target domains.

Figure 1. VProbes Instrumentation Domains in ESXi.


In addition to the three domains described above, VProbes also has a GUEST instrumentation domain representing the guest operating system running in the VM. Instrumentation of guest software is implemented in the VMM and VMX.

2.1 Concurrency
ESXi components are multithreaded software systems. Events of interest can occur on any thread of execution, and potentially simultaneously as they are executed on different physical CPUs. Consequently, probes can fire concurrently.

VProbes has been designed to deal with this inherent concurrency in the system as efficiently as possible. The language runtime is innately thread-aware and safe. Basic integer and string types provide per-thread semantics. Integer variables can also be explicitly declared as shared between threads with guarantees of atomicity and sequential consistency. More-complex shared data structures—such as bags, which are used to store integer key-value pairs, and aggregates, which are used to build histograms—are designed not only to be thread-safe, but also to provide wait-free guarantees.

Output generated by probes, such as the output of a printf statement, is queued in per-thread or per-CPU buffers. It is eventually serialized in global time order and made available to the user when the system has the resources to do so.
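As a minimal sketch of these semantics, the following script counts probe fires both per thread and host-wide. The probed kernel function name is hypothetical, and we assume here that the perhost qualifier (which appears in later scripts in this paper) is the mechanism for declaring a shared integer:

```
int localFires;          /* per-thread by default */
perhost int totalFires;  /* assumed shared declaration */

VMK:ENTER:SomeKernelFunction {
  localFires++;   /* updates this thread's copy only */
  totalFires++;   /* atomic update, visible to all threads */
}
```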

2.2 Safety
Safety is central to the design of VProbes. Users cannot inadvertently impede the normal functioning of a system, modify system state, or perform illegal operations that can corrupt or crash the system. VProbes scripts are compiled to an intermediate form with static type checking. This intermediate form is further translated into machine code with strict runtime bounds checking.

Execution of probes is time bounded, which ensures that probes cannot cause the system to stall, guarding against inadvertent uses of looping constructs that continue forever. VProbes also does accounting of overhead incurred due to probe execution and performs rate-limiting if necessary.

Low-level systems such as ESXi have transitional phases during which the state of the CPU might not be reliable—for example, the small window of time as a system call transitions from the VMM to the VMkernel, or during world switch, as CPU state is being saved and restored. VProbes has been hardened for these scenarios, either by ensuring that we make reliable assumptions about the system, or by disallowing probe execution in these small windows. To increase confidence in VProbes safety, our verification team designed a random test generator [3] that injects random, machine-generated scripts into ESXi. This tool has uncovered a number of corner cases in critical parts of the kernel and enabled us to harden the VProbes system to safely handle those cases.

2.3 The Emmett Language
Emmett is the high-level, domain-specific language (DSL) for interacting with the VProbes instrumentation framework. Its syntax and semantics are heavily borrowed from the C programming language, which allows for a lower barrier to entry for users, who we expect are familiar with low-level system software written in C, such as ESXi. In the following subsections, we provide an overview of the language.

2.3.1 Probes
A probe in the Emmett language consists of a name that resolves to a system event or instrumentation point, called a probe point. An associated block of code is executed when the system “hits” that probe point, an event also known as a probe fire. A probe name is a series of colon-separated tokens that resolve to a probe point, starting with the domain (one of GUEST, VMM, VMX, or VMK), followed by the type of the probe (ENTER, EXIT, PROFILE, etc.), and a location identifier (function name, raw linear address, etc.). A probe can also receive arguments. In the VMK domain, for instance, an ENTER probe receives the arguments to the invoked function. Probes can be broadly categorized into

  • Dynamic probes, which fire at arbitrarily selected control points identified by function entry, exit, or instruction offsets, or at instructions that perform data reads and writes to specific linear addresses in the target system. Dynamic probes provide arguments relevant to the probe type, such as the arguments to the function call in entry probes, value being returned by the function in exit probes, and exception frames in offset probes.
  • Static probes, which fire at predefined points in the system. They are placed by developers using special markers in the source code. These probe points usually represent a system’s architectural points of interest, making them observable without details of underlying implementation. Static probes also provide contextually relevant arguments as predetermined by the system designer.
  • Periodic probes, which fire at a time-periodic rate. The period can be fixed or programmable. They are useful for profiling and performing periodic reporting of data.

2.3.2 Data Types
Emmett provides basic integer and string types for variables, and a standard set of arithmetic operators and built-in functions for manipulating them. Variable values are persistent across probe fires and, by default, instantiated on a per-thread basis. For dynamic storage and lookup of integer key-value pairs, the language also provides bags, which are a highly efficient, shared, lockless data structures with fast insertion, lookup, and removal times. Emmett also provides native support for building histograms in the form of aggregates. Aggregate variables can be used to collect integer value samples distributed into buckets identified by a combination of integer and string keys. A built-in function called printa can customize and print the aggregate. Adding samples to an aggregate variable is fast and wait-free.
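A brief sketch of the two shared types follows; the probed function and its signature are hypothetical:

```
bag  lastSize;     /* integer key-value store */
aggr sizes[1][0];  /* histogram keyed by one integer */

VMK:ENTER:SomeWriteFunction(int fd, uint32 size) {
  lastSize[curworld()->id] <- size;  /* remember latest size per world */
  sizes[fd] <- size;                 /* add a sample to the histogram */
}

VMK:VMKUnload {
  printa(sizes);  /* print the per-fd size distribution */
}
```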

2.3.3 Compound Statements
The Emmett language includes the following compound statements:

  • if-else branching
  • for, while, and do-while generic loops
  • foreach loops for iterating over bags
  • try-catch construct for handling exceptions

Unlike C, Emmett has no unstructured control-flow statements (goto).
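For instance, a foreach loop can walk the entries accumulated in a bag. The probed function below is hypothetical, and the exact iteration syntax shown is an assumption:

```
bag lastFire;

VMK:ENTER:SomeKernelFunction {
  lastFire[curworld()->id] <- TSC;  /* timestamp per world */
}

VMK:VMKUnload {
  foreach (wid in lastFire) {  /* assumed iteration form */
    printf("world %d last fired at %d\n",
        wid, lastFire[wid]);
  }
}
```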

2.3.4 Target Inspection
VProbes has built-in support for accessing target system memory, which is often crucial in analyzing the state of the system. One primitive way of accessing raw system memory is to use the built-in getvmw() function, which takes a linear address and returns the 8-byte value at that location. For more-structured access to memory, Emmett also provides special support for pointers and composite types (such as structures, unions, and arrays), which can only be used for target memory inspection. Emmett natively supports pointers to standard C types (int*, char*, etc.), which can be assigned linear addresses in the target domain and dereferenced using the * operator. For access to more complex objects in memory, users can define and use composite types. For probes in the VMkernel domain only, Emmett also supports a special $module.typename notation, which imports data types from the target system, obviating the need to define them in user scripts. This greatly improves usability by making access to complex objects in memory seamless.

In addition to accessing memory, the Emmett language includes a variety of built-in variables and functions that provide access to the current call stack, virtual and physical CPU register state, the time stamp counter (TSC), and so on.
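The raw and structured forms of memory access can be contrasted in a short sketch; the probed function and its argument are hypothetical:

```
VMK:ENTER:SomeKernelFunction(uint64 objAddr) {
  int raw;
  int *p;

  /* raw access: read the 8-byte value at the linear address */
  raw = getvmw(objAddr);

  /* structured access: treat the same linear address as a
     pointer and dereference it */
  p = objAddr;
  printf("raw = %x, typed = %x, at TSC %d\n",
      raw, *p, TSC);
}
```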

3. Observability Patterns

The generality of VProbes as a multilayer observability tool makes it useful in a wide variety of use cases, including debugging, macro or micro performance analysis, and general system introspection. Using real-world examples, this section builds upon the basics described in section 2 and illustrates some of the most common and useful VProbes observability patterns that are being used at VMware.

3.1 Tracing
One of the simplest and most common observability patterns is tracing the execution of a function. This usually involves printing a message to the screen or a log file at every invocation of the function, or a subset of those. This pattern is commonly used to generate a chronological log of a certain type of event, which can then be used for performance analysis or debugging purposes.

The following script is an example of tracing with VProbes:

VMK:ENTER:user.Linux32_Write(
    int fd,
    int buf,
    uint32 size) {
  if ((string)curworld()->name ==
      "sshd") {
    printf("%s: size = %u bytes\n",
        PROBENAME, size);
  }
}

Placement of a dynamic probe at the start of ESXi kernel function user.Linux32_Write() causes the probe body to be executed every time the function is called. The example shows how Emmett can be used to conditionally print trace entries, helping reduce the output volume. In this case, an if statement is used so that only invocations associated with the sshd world are printed:


VMK:ENTER:user.Linux32_Write: size = 132 bytes
VMK:ENTER:user.Linux32_Write: size = 148 bytes
VMK:ENTER:user.Linux32_Write: size = 100 bytes
VMK:ENTER:user.Linux32_Write: size = 100 bytes
...

Contextual information about the call, such as the size argument, is also printed. This pattern can be applied to any function at any layer of the software stack. The following example shows how a GUEST probe can be used to trace the system calls being invoked inside a VM:

GUEST:ENTER:system_call {
  printf("System call %x (CR3 %x)\n",
      RAX, CR3);
}

This script is intended for a Linux guest OS, where the system call number is passed in the rax register. When loading this script, the user must provide the Linux kernel symbol map on the command line. This enables the vprobe command to resolve the system_call symbol to its address and then instrument that address. The example generates as output a trace of system calls identified by their corresponding number, along with the present contents of the CR3 register, which can be used to identify the invoking process.

3.2 Counting
For certain use cases and types of events, the level of detail and volume of data generated by tracing can be excessive and cumbersome. Often, a statistical summary of events is more useful in finding answers. Using integer data types, events can be counted and printed periodically or at script unload. Beyond simple counting, histograms are a powerful statistical tool for summarizing data.
VProbes provides built-in support for generating histograms in the form of aggregates. The following script shows a typical use of aggregates:

aggr exits[1];
VMM:HV_Exit {
  exits[VMCS_EXITCODE]++;
}
VMM:VMMUnload {
  printa(exits);
}

This example instruments HV_Exit, a static probe in the VMM domain that fires every time that a VM running in hardware-virtualization mode exits to the VMM. The exits variable is used to aggregate samples and build a histogram representing the number and frequency of VM exits, using the exit type (VMCS_EXITCODE) as key. VMMUnload is a probe that fires when a script is unloaded from the VMM. In this case, it is used to print the histogram of VM exits:

intKey0  count  pct%
   0x10    222  0.2%
    0x0    222  0.2%
    0x1   3187  3.5%
    0x7   3302  3.7%
   0x1e  13003 14.6%
    0xc  20131 22.6%
   0x30  48623 54.8%

The intKey0 column represents the VM exit code, and the count and pct% columns are self-explanatory. The output above tells us that more than half of the VM exits for the probed VM are related to EPT violations (exit reason 0x30). This information can be useful, for instance, for characterizing a workload running inside a VM.

3.3 Latency Measurements
A common observability pattern in performance analysis is to measure the latency of events, by calculating the time elapsed between the start and the end of the event being measured. This pattern translates into a VProbes script consisting of two probes representing the start and end points of the event, and using a global monotonic timestamp counter (TSC) to measure the time elapsed between those two probe fires.

The following script implements this pattern for measuring the latencies of hardware-virtualization exits:

int exittsc;
aggr lats[1];
VMM:HV_Exit {
  exittsc = TSC;
}
VMM:HV_Resume {
  int lat;
  if (exittsc > 0) {
    lat = TSC - exittsc;
    lats[VMCS_EXITCODE] <- lat;
  }
}
VMM:VMMUnload {
  printa(lats);
}

Static probes HV_Exit and HV_Resume fire at the start and end of a VM exit, respectively. TSC is a built-in global that returns the current value of the physical CPU’s TSC. The difference between the two measurements is the number of CPU cycles elapsed between the start and end of the VM exit. The latency sample is put in an aggregate, which is later printed as a histogram showing a distribution of VM exits, by the type of exit:

intKey0      avg   count     min       max     pct%
   0x10     4053     183    1494     28314     0.0%
    0x0    26797     215    3576   1765422     0.0%
    0x7     2914    2357    1632     38532     0.0%
    0x1    21297     866    1620    487578     0.0%
   0x1e    12397   10718    2574   2804826     0.6%
   0x30    25954   49394    1656   8600862     5.9%
    0xc    15625   16469   11574  25295526    93.2%

These results tell us that 93.2% of the VMM exit handling time is spent on an HLT instruction executed by the guest operating system (exit reason 0xc).

3.4 Profiling
VProbes makes it easy to build a system-wide profiler through periodic sampling of system state. Periodic probes are especially useful in this case, given their ability to periodically execute a body of code and collect information from the system.

This example illustrates VProbes’ multi-domain probing capabilities by jointly profiling different layers of the ESXi stack:

perhost aggr ticks[0][1];
VMM:TIMER:1msec {
  string backtrace;
  gueststack(backtrace, 3);
  ticks[backtrace]++;
}
VMK:PROFILE:1msec {
  string backtrace;
  vmwstack(backtrace, 3);
  ticks[backtrace]++;
}
VMK:VMKUnload {
  logaggr(ticks, 1);
}

This script combines one periodic probe in the VMM layer with another in the VMK layer, both with the same period. At each probe fire, a new sample containing the current GUEST/VMK stack trace is added to global aggregate ticks.

When the script is loaded against the VMkernel and all VMs running on the system, it produces a histogram that gives a global view of the host’s execution profile, both across different layers and within individual layers.
Below is a sample output generated by profiling an ESXi system with one VM over a period of 10 seconds. It shows that 73.7% of the samples came from a single code location in the guest:

...
vmkernel.Power_HaltPCPU+0x285
vmkernel.CpuSchedIdleLoopInt+0x61b    484  3.1%
vmkernel.CpuSchedTryBusyWait+0x2c6

[0x828547d3]
[0x8281a187]                         1090 73.7%
[0x82b5e099]


3.5 Inspection of System State

The examples so far show simple cases of system state introspection. The first example accesses an argument passed to the Linux32_Write() function. The second and third read global hardware state: the VMCS exit code and the TSC. But sometimes the information required is not as easy to retrieve and requires parsing through data structure hierarchies in memory.

VProbes supports special classes of types—such as pointers, structs, and unions—to inspect target memory. The following script places a dynamic probe on the entry point of the VMkernel function responsible for processing a list of incoming network packets, and traverses that linked list at each invocation. At each node, it aggregates the source and destination IP addresses corresponding to the packet:

aggr rxconn[2][0];
VMK:ENTER:Net_AcceptRxList(
    void *dev,
    $vmkernel.PktList *list) {
  $vmkernel.PktHandle *pkt;
  pkt = list->csList.slist.head;
  while (pkt != 0) {
    $vmkernel.vmk_EthHdr *eh;
    $vmkernel.vmk_IPv4Hdr *ih;
    eh = pkt->frameVA;
    if (eh->type == 8) { // IPv4
      ih = &pkt->frameVA[14];
      rxconn[ih->saddr, ih->daddr]++;
    }
    pkt = pkt->pktLinks.next;
  }
}
VMK:VMKUnload {
  printa(rxconn);
}

The example illustrates the C-like syntax for dereferencing pointers and accessing struct fields, as well as a special $module.typename syntax used to reference types defined in the target domain. These types are automatically imported and do not need to be declared. VProbes guarantees that all operations involving target memory are safe by performing access checks beforehand.

4. Use Cases

In this section we look at a real-world use case in which VProbes was successfully used to debug a customer issue. We also describe several tools developed at VMware using VProbes.

4.1 Debugging Lock Contention Issues

A VMware support team received an initial report of a customer seeing sporadic ESXi hangs while performing file system operations. Initial analysis showed that the problem was caused by lock contention involving global semaphores in the file system layer. Existing tools did not provide enough information about the locks to pinpoint the source of contention.

The support team wrote a VProbes script to gather statistics about these global semaphores, which are acquired when file system operations are performed. When writing the VProbes script, they first identified the probe points of interest:

  • Lock request – Fires when semaphore acquire is requested
  • Lock acquire – Fires when semaphore is acquired
  • Lock release – Fires when semaphore release is requested

After the probe points were identified, they used the TSC built-in and information about the current world (the equivalent of a thread in the ESXi kernel) at each point to build histograms of the average wait time and hold time for the semaphore, categorized per world.

Here is the VProbes script that was used:

perhost uint64 fsLockAddr;
bag   semRequestTime;
bag   semAcquireTime;
aggr  resultWait[1][1];
aggr  resultHold[1][1];
/* Get the address of the global lock */
VMK:VMKLoad {
  fsLockAddr =
    sym2addr("vmkernel.fsLock");
}
/* Lock acquire request */
VMK:ENTER:SemaphoreLockInt(uint64 sem) {
  if (sem == fsLockAddr) {
    int wid;
    wid = curworld()->id;
    semRequestTime[wid] <- TSC;
  }
}
/* Lock acquired */
VMK:OFFSET:SemaphoreLockInt:L363
                (ExcFrame *frame) {
  if (frame->rbx == fsLockAddr) {
    string wname;
    int waitTime, wid;
    wid = curworld()->id;
    waitTime =
      ((TSC - semRequestTime[wid])
      * 1000000) / TSC_HZ;
    semAcquireTime[wid] <- TSC;
    wname = (string)curworld()->name;
    resultWait[wname, wid] <- waitTime;
  }
}
/* Lock release */
VMK:ENTER:Semaphore_Unlock(uint64 sem) {
  if (sem == fsLockAddr) {
    int wid, heldTime;
    string wname;
    wid = curworld()->id;
    heldTime =
      ((TSC - semAcquireTime[wid])
      * 1000000) / TSC_HZ;
    wname = (string)curworld()->name;
    resultHold[wname, wid] <- heldTime;
  }
}
/* Print statistics */
VMK:VMKUnload {
  printf("Lock wait statistics:\n\n");
  printa(resultWait,
      "%1$s %0$d %count$d "
      "%min$d %max$d %avg$d");
  printf("Lock hold statistics:\n\n");
  printa(resultHold,
      "%1$s %0$d %count$d "
      "%min$d %max$d %avg$d");
}

Sample output of the script:

Lock wait statistics:
fsLock   helper50-3   1000014344    1  7  7  7
fsLock   helper50-4   1000014345    1  8  8  8
fsLock   FS3ResMgr    1000014340    2 10 10 10
Lock hold statistics:
fsLock   helper50-3    1000014344   1  8  8  8
fsLock   helper50-4    1000014345   1  8  8  8
fsLock   FS3ResMgr     1000014340   2 21 28 24

This information helped isolate the world that was heavily using the fsLock semaphore. After the world was identified, the support team refined the script to collect call chain information using the vmwstack built-in, to figure out the code that needed to be optimized.

4.2 Dynamic Driver Verifier
The Dynamic Driver Verifier (DDV) [5] is a tool intended to help VMware developers and partners working on ESXi device drivers uncover bugs and accelerate debugging after they are uncovered. It closes coverage gaps by performing comprehensive checks and code-path coverage at the level of function calls, intercepting calls from a Driver-under-Test (DuT) to the VMkernel and checking them for erroneous patterns such as use-before-initialization, double-free, boundary violations, and resource leaks. DDV also modifies the VMkernel's responses to the calls, to artificially induce memory allocation failures.

DDV uses VProbes to dynamically intercept function calls related to the DuT internals, core kernel internal calls, and even guest OS calls, providing additional contextual information to analyze a bug. Internally, DDV has been successfully used to track down more than 35 driver bugs. These included incorrect error handling paths, missing error checks, memory leaks, and failures to release resources.

4.3 Packet Tracing Tool

PktTrace [14] is an internal network packet tracing tool aimed at examining the networking behavior in a software-defined data center. PktTrace is primarily meant as a verification tool for NetX, a network extensibility solution that enables VMware partners to implement services such as intrusion detection or data compression using service VMs. NetX enables the VMs to implement those services by intercepting and manipulating network traffic according to a set of partner-defined filter rules. The goal of PktTrace is to verify that the system implementation obeys the network filter rules.

PktTrace uses VProbes along with pktCapture, a specialized packet tracing tool, to gain detailed visibility into the system and track individual packets through the data path, from the ESXi kernel, through the service VM, and to the destination VM, reporting violations of the network rules. The flexibility of VProbes enables PktTrace to intercept any point along this path. In addition, PktTrace can use the installed probes to compute and report latencies for different segments of the path.

4.4 Dynamic Data Tracker

The Dynamic Data Tracker [9] is an internal debugging tool for tracking dynamically allocated data structures in ESXi. Unlike hardware watchpoints, which are limited in number and in object size, DDT proposes a software watchpoint solution to enable tracking of unbounded numbers of large kernel structures.

The tool consists of two phases. In the first phase, DDT identifies the instructions that access the structure in question by allocating it on a protected page. Accesses to the data structure are intercepted via page faults, and the fault handler records the instruction that accesses the data structure. The second phase of DDT uses VProbes to instrument all the instructions identified in the first phase. DDT uses a default VProbes script that invokes the printf built-in to trace each access to the data structure. The flexibility of VProbes scripting enables users to further augment this script and perform customized actions at each memory access, such as printing backtraces or inspecting other parts of the system state.

5. Experiments

This section provides an evaluation of the VProbes runtime impact. Our experience, illustrated by the numbers in this section, is that there is no performance overhead of VProbes at low probe frequencies, typically up to a few thousand probes per second. At high probe frequencies, the overhead is highly dependent on

  • The probe rate
  • The probe type (static or dynamic)
  • The amount of work done in each probe

Our experiments were conducted on a server with two quad-core Intel Xeon processors, 12GB of RAM, and one NUMA node per core, running a recent build of ESXi. We used a four-vCPU, 4GB, 64-bit Windows Server 2008 VM running an IOmeter workload. The IOmeter application had two workers and a single outstanding I/O per disk target. To measure the impact of VProbes, we installed probes on the critical I/O path of the storage stack and observed the throughput in MB/sec as reported by IOmeter. The script computes latencies via aggregation, using an approach similar to the example in section 3.3.

5.1 Low-Frequency Probes
In the first experiment we ran IOmeter with sequential writes and different block sizes of 4KB, 16KB, and 32KB. This workload generated about 115 I/Os per second in all these configurations. By installing multiple probes on each I/O, we were able to control the probe fire frequency in these experiments. We used both static and dynamic probes, and varied the probe frequency from 100 to 1,000 probes per second. In all cases, no performance degradation was noticed in the throughput reported by IOmeter.

5.2 High-Frequency Probes
In a second experiment, we ran IOmeter with sequential reads and a block size of 4KB. This workload generates about 18,000 I/Os per second.

We installed 1, 2, 3, or 4 probes in the I/O path, resulting in probe fire frequencies ranging from 18,000 to 72,000 probes per second. We first ran the experiment with static probes, then repeated it with dynamic probes. Figure 2 shows the results. For static probes, the decrease in IOmeter throughput ranged from 3.08% to 3.8% relative to the case with no probes installed. For dynamic probes, the decrease ranged from 3% to 4.5%. We also enabled all probes in the script, 4 static and 4 dynamic probes per I/O, for a total of 144,000 probes per second. The performance drop in that case was 6.2%.

Figure 2. VProbes Overhead When Running at High Probe Frequencies, with Overhead Measured as I/O Throughput Degradation.


These experiments demonstrate that in this latency measurement use case, VProbes has no runtime impact at low event frequencies, and a relatively small impact even as the probe fire frequency increases.

6. Related Work

This section discusses several related performance monitoring and troubleshooting tools, as well as several generic dynamic instrumentation frameworks.

6.1 Scripting-Based Tools
DTrace [4] for Solaris and Mac OS X, and SystemTap [12] for Linux, are popular dynamic instrumentation systems for debugging, troubleshooting, and performance analysis. Both systems provide scripting languages to control how the system is instrumented and what data is collected. They support dynamic function-boundary probes, as well as static probes tailored to their particular operating systems. VProbes is similar in spirit to DTrace and SystemTap but is specifically designed for the virtualization environment of ESXi. It is designed to observe the three main layers of the ESXi virtualization stack—the guest, the VMM, and the ESXi kernel—enabling it to track events or make correlations across these layers.

6.2 Specialized Tools
Several other tools have been developed for troubleshooting, debugging, and monitoring. LTTng [6] for Linux and ETW [15] for Windows are general-purpose tracing tools designed for fast collection of system traces with low runtime overhead. Both include a number of associated tools for manipulating the collected traces. Profiling tools such as perf [20], OProfile [19], and VTune [22] use hardware event sampling to gather performance data. Other profiling tools have been designed to intercept more specific events: strace [21] collects and reports system call information; ltrace [2] monitors dynamic library calls; and GNU gprof [17] provides call-graph profiling.
All of the above are specialized tools designed to extract specific pieces of information, namely traces and profiles. In contrast, VProbes is an open-ended tool that is not limited to particular collection points, pieces of data, or data formats.

6.3 Dynamic Instrumentation
Enabling dynamic instrumentation via binary translation (BT) is a common technique used in systems such as Dyninst [10], Pin [8], Valgrind [11], and Dynamo [1]. These systems provide frameworks and APIs for instrumenting user-space applications, and have typically been used to build program-analysis tools. More recently, binary instrumentation has been applied to kernel space as well [7]. Kprobes [18], KernInst [13], DTrace [4], and GDB [16] provide APIs or implement instrumentation frameworks that dynamically patch code with instructions that redirect control to instrumentation routines. The dynamic instrumentation in VProbes is similar—we patch the code with instructions that trigger debug exceptions.

7. Future Work and Conclusions

This paper described VProbes, an internal troubleshooting and monitoring tool that uses dynamic instrumentation for the ESXi hypervisor and VMs running on it. VProbes exposes a scripting language interface that makes it easy to query the system for specific pieces of data, including event traces, histograms, latencies, backtraces, and many others. One of the unique features of VProbes is the ability to correlate events among the guest, the VMM, and the ESXi kernel. The tool has been successfully used by the VMware support teams, as well as in the Dynamic Driver Verifier partner development tool.

One possible direction of future work is the integration of various hardware debugging capabilities with VProbes. For example, VProbes could be extended to expose the rich set of Performance Monitoring Counter (PMC) events as new probes in the system. New hardware debugging capabilities such as Intel’s Precise Event Based Sampling (PEBS) or Processor Trace (PT) could also be exposed in VProbes, using the scripting language as a vehicle for easy setup and custom programming of the start and end points of the data collection.
Another possible direction of work is the extension of VProbes with support for user-space probing. Currently, the VMX process is the only user process that can be probed. Extending VProbes to all user processes could provide better observability into other components of the vSphere infrastructure, for example into the host agent (hostd) process.

Currently VProbes targets a single ESXi host. Future work could extend the system to allow the writing of scripts that probe multiple hosts in a cluster. Challenges include aggregating data from different hosts, ordering in time events from different hosts, and probing VMs while they are migrated from one host to another. Finally, VProbes can be used as a building block for other tools, such as specialized command-line debugging tools, or higher-level UI tools. More generally, VProbes can provide a flexible data source to new or existing analytics tools, such as VMware® vCenter™ Operations Manager™.

Acknowledgments

We would like to thank all past members of the VProbes and VMM teams at VMware who contributed to this work with code, discussions, or code reviews. In particular, we’d like to thank the past members of the VProbes team: Keith Adams, Eli Collins, Robert Benson, Alex Mirgorodskiy, Matthias Hausner, and Ricardo Gonzalez. We also want to thank the current and past members of the VProbes and monitor verification teams, including Janet Bridgewater, Hemant Joshi, Tung Vo, Jon DuSaint, and Lauren Gao. Finally, we’d like to thank Bo Chen and the DDV team, as well as Chinmay Albal and the CPD organization, for their feedback and feature requests that greatly improved VProbes.

References

1. Bala, V., Duesterwald, E., and Banerjia, S. Dynamo: a transparent dynamic optimization system. In Proceedings of the 2000 ACM SIGPLAN Conference on Programming Language Design and Implementation (2000), pp. 1–12.

2. Branco, R. R. Ltrace internals. In Ottawa Linux Symposium (2007).

3. Bridgewater, J., and Leisy, P. Improving software robustness using pseudorandom test generation. In VMware Technical Journal, Winter 2013 Edition (2013).

4. Cantrill, B., Shapiro, M. W., and Leventhal, A. H. Dynamic instrumentation of production systems. In USENIX Annual Technical Conference, General Track (2004), pp. 15–28.

5. Chen, B. A runtime driver verification system using VProbes. In VMware Technical Journal, Summer 2014 Edition (2014).

6. Desnoyers, M. and Dagenais, M. The LTTng tracer: A low impact performance and behavior monitor for GNU/Linux. In Ottawa Linux Symposium (2006).

7. Feiner, P., Brown, A. D., and Goel, A. Comprehensive kernel instrumentation via dynamic binary translation. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (2012), pp. 135–146.

8. Luk, C.-K., Cohn, R. S., Muth, R., Patil, H., Klauser, A., Lowney, P. G., Wallace, S., Reddi, V. J., and Hazelwood, K. M. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation (2005), pp. 190–200.

9. Ma, M.-K., and Ravenna, N. The dynamic data tracker. In VMware Technical Journal, Summer 2014 Edition (2014).

10. Miller, B. P. and Bernat, A. R. Anywhere, any time binary instrumentation. In ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (Szeged, Hungary, 2011).

11. Nethercote, N., and Seward, J. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation (2007), pp. 89–100.

12. Prasad, V., Cohen, W., Eigler, F., Hunt, M., Keniston, J., and Chen, B. Locating system problems using dynamic instrumentation. In Ottawa Linux Symposium (2005), pp. 49–64.

13. Tamches, A. and Miller, B. P. Fine-grained dynamic instrumentation of commodity operating system kernels. In Proceedings of the Third USENIX Symposium on Operating Systems Design and Implementation (1999), pp. 117–130.

14. Zou, H., Mahajan, A. and Pandya, S. PktTrace: A packet lifecycle tracking tool for network services in a software-defined datacenter. In VMware Technical Journal, Summer 2014 Edition (2014).

15. ETW. http://msdn.microsoft.com/en-us/library/ms751538

16. GDB. http://www.gnu.org/software/gdb

17. gprof. https://sourceware.org/binutils/docs/gprof

18. Kprobes. https://www.kernel.org/doc/Documentation/kprobes.txt

19. OProfile. http://oprofile.sourceforge.net/news

20. perf. https://perf.wiki.kernel.org

21. strace. http://sourceforge.net/projects/strace

22. VTune. http://software.intel.com/en-us/intel-vtune-amplifier-xe