Methodology for Performance Analysis of VMware vSphere under Tier-1 Applications

Jeffrey Buell
VMware performance
jbuell@vmware.com

Daniel Hecht
VMware Hypervisor group
dhecht@vmware.com

Jin Heo
VMware performance
heoj@vmware.com

Kalyan Saladi
VMware performance
ksaladi@vmware.com

H. Reza Taheri
VMware performance
rtaheri@vmware.com

Abstract

With the recent advances in virtualization technology, both in the hypervisor software and in the processor architectures, one can state that VMware vSphere® runs virtualized applications at near-native performance levels. This is certainly true against a baseline of the early days of virtualization, when reaching even half the performance of native systems was a distant goal. However, this near-native performance has made the task of root causing any remaining performance differences more difficult. In this paper, we will present a methodology for a more detailed examination of the effects of virtualization on performance, and present sample results.

We will look at the performance on vSphere 5.1 of an OLTP workload, a typical Hadoop application, and low latency applications. We will analyze the performance and highlight the areas that cause an application to run more slowly in a virtualized server. The pioneers of the early days of virtualization invented a battery of tools to study and optimize performance. We will show that as the gap between virtual and native performance has closed, these traditional tools are no longer adequate for detailed investigations. One of our novel contributions is combining the traditional tools with hardware monitoring facilities to see how the processor execution profile changes on a virtualized server.

We will show that the increase in translation lookaside buffer (TLB) miss handling costs due to the hardware-assisted memory management unit (MMU) is the largest contributor to the performance gap between native and virtual servers. The TLB miss ratio also rises on virtual servers, further increasing the miss processing costs. Many of the other performance differences from native (e.g., additional data cache misses) are also due to the heavier TLB miss processing behavior of virtual servers. Depending on the application, ¼ to ½ of the performance difference is due to the time spent in the hypervisor kernel. This is expected, as all networking and storage I/O has to get processed twice: once in the virtual device in the guest, and once in the hypervisor kernel.

The hypervisor virtual machine monitor (VMM) is responsible for only 1-3% of the overall time. In other words, the VMM, which was responsible for much of the virtualization overhead of the early hypervisors, is now a small contributor to virtualization overhead. The results point to new areas, such as TLBs and address translation, to work on in order to further close the gap between virtual and native performance.

Categories and Subject Descriptors: C.4 [Computer Systems Organization]: Performance of Systems – Design studies, Measurement techniques, Performance attributes.

General Terms: Measurement, Performance, Design, Experimentation.

Keywords: Tier-1 Applications, database performance, hardware counters.

1. Introduction

Over the past 10 years, the performance of applications running on vSphere has gone from as low as 20% of native in the very early days of ESX¹ to what is commonly called near-native performance, a term that has come to imply an overhead of less than 20%. Many tools and methods were invented along the way to measure and tune performance in virtual environments [1, 2]. But with overheads at such low levels, drilling down into the source of performance differences between native and virtual has become very difficult.

The genesis of this work was a project to study the performance of vSphere running Tier-1 Applications, in particular a relational database running an Online Transaction Processing (OLTP) workload, which is commonly considered a very heavy workload. Along the way, we discovered that existing tools could not account for the difference between native and virtual performance. At this point, we turned to hardware event monitoring tools, and combined them with software profiling tools to drill down into the performance overheads. Later measurements with Hadoop and latency-sensitive applications showed that the same methodology and tools can be used for investigating performance of those applications, and furthermore, the sources of virtualization performance overhead are similar for all these workloads. This paper describes the tools and the methodology for such measurements.

1.1 vSphere hypervisor

VMware vSphere contains various components, including the virtual machine monitor (VMM) and the VMkernel. The VMM implements the virtual hardware that runs a virtual machine. The virtual hardware comprises the virtual CPUs, timers, and devices. A virtual CPU includes a Memory Management Unit (MMU), which can be implemented using EPT on modern Intel CPUs. The VMkernel is the kernel of vSphere, responsible for managing the resources of the physical system and providing services to the VMMs and user worlds. VMkernel is designed to efficiently run virtual machine workloads and provide strong isolation between VMs [2].

2. Experiments

2.1 Benchmarks

2.1.1 OLTP

The workload used in our OLTP experiments is based on the TPC-C benchmark [9]. We will refer to this workload as the OLTP benchmark. The OLTP benchmark is a non-comparable implementation of the TPC-C business model; our results are not TPC-C compliant, and not comparable to official TPC-C results.

TPC-C is a database benchmark with many small transactions. Of the five transaction types, three update the database; the other two, which occur with relatively low frequency, are read-only.

This workload fully utilizes the CPU, yet is sensitive to storage and networking latency. The I/O load is quite heavy and consists of small access sizes (2K-16K). The disk I/O accesses consist of random reads and writes with a 2:1 ratio in favor of reads. In terms of the impact on the system, this benchmark spends considerable execution time in the operating system kernel context, which is harder to virtualize than user-mode code. Specifically, how well a hypervisor virtualizes the hardware interrupt processing, I/O handling, context switching, and scheduler portions of the guest operating system code is critical to the performance of this benchmark. The workload is also very sensitive to the processor cache and translation lookaside buffer (TLB) hit ratios.

This benchmark is the biggest hammer we have to beat on vSphere and observe its performance.

2.1.2 Hadoop

Big Data workloads are an important new class of applications for VMware. While standard benchmarks are lacking, the canonical test is TeraSort running on Hadoop. It requires very high network and storage throughput, and also saturates the CPU during the most important phase of the test. Unlike the OLTP benchmark, the disk I/O consists of large sequential reads and writes and does not stress either the guest OS or VMkernel nearly as much. While TeraSort is not particularly sensitive to network latency, it is sensitive to the CPU overhead generated.

The tests reported here were based on running Cloudera’s distribution of Apache Hadoop, CDH 4.1.1 with MRv1. We will focus on the performance of the “map/shuffle” phase of TeraSort. This phase saturates the CPU and exposes the virtualization overhead to the greatest degree.

TeraSort performance can be greatly improved by partitioning the servers into multiple smaller VMs. This is due to improved hardware utilization, not to reduced virtualization overhead. Therefore, similar to the OLTP workload, we report on the “apples-to-apples” configuration of a single VM per server configured with all the processors and memory.

2.1.3 Latency-Sensitive Applications

Latency-sensitive applications include distributed in-memory data management, stock trading, and streaming media. These applications typically have a strict requirement on transaction response time, since violating the requirement may lead to loss of revenue. There has been growing interest in virtualizing such applications. We specifically study the response-time overhead of running request-response workloads in vSphere. In this type of workload (hereafter referred to as RR workloads), the client sends a request to the server, which replies with a response after processing the request. This kind of workload is very commonly seen: Ping, HTTP, and most RPC workloads all fall into this category. This paper primarily focuses on configurations in a lightly loaded system where only one request-response transaction is allowed to run at a time. Any overhead induced by the additional work performed by the virtualization layer then appears directly as extra transaction latency.

2.2 Benchmark Configurations

2.2.1 OLTP workload Configuration

Our experiments started with a 4-socket, 8-core Intel Nehalem-EX, and we eventually upgraded to 10-core Intel Westmere-EX processors. But to be able to compare with the original 8-way experiments on earlier vSphere releases, we configured only 2 cores per socket in the BIOS to have a total of 8 cores on the system. We did this to separate the basic overheads from any SMP overheads. We did take runs with 32 and 64 processors (32 cores, HT enabled) to study the performance at higher levels, which showed similar virtual-native ratios to the 8-way measurements. Our configuration consisted of:

  • 4-socket Dell R910 server
  • 10-core Intel Xeon E7-4870 (Westmere-EX) processors
    • For the 8-way runs, 2 of the 8 cores on each socket were enabled at the BIOS level
  • 4 x 6.4 GT/s QPI links; 1066 MHz DDR3 memory clock
  • 512GB memory (32x16GB, 1066MHz, Quad Ranked RDIMMs)
  • SMT (hyper-threading) disabled in BIOS
  • Turbo mode enabled in BIOS for most experiments
  • Storage Hardware:
    • Two EMC CLARiiON CX4-960 arrays for data
      • 16 LUNs spread over 62 EFD flash drives of 73GB and 200GB
    • One CX3-40 array for log
      • 30 15K RPM disk drives
  • Software Versions:
    • Various builds of vSphere 4.0, 4.1, 5.0, and 5.1
    • Networking and storage I/O drivers were configured for maximum interrupt coalescing to minimize their CPU usage
    • Operating system (guest and native): original experiments started out with RHEL 5.2. Along the way, we ran tests on RHEL 5.3, RHEL 5.6, RHEL 6.0, and RHEL 6.1. The results reported here are with RHEL 6.1 unless otherwise noted
    • We always used large pages for the Oracle SGA by setting vm.nr_hugepages to match the SGA size (a sizing sketch follows this list)
    • DBMS: trial version of Oracle 11g R1 (11.1.0.6.0)
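The hugepage sizing mentioned in the list above can be illustrated with a short sketch. It assumes the default 2 MiB huge page size on x86_64 Linux and uses a hypothetical SGA size rather than the one from our runs; the small slack allowance is our own convention for illustration, not an Oracle requirement.

```python
import math

HUGE_PAGE_BYTES = 2 * 1024 * 1024   # default x86_64 huge page size (2 MiB)

def nr_hugepages_for_sga(sga_bytes: int, slack_pages: int = 16) -> int:
    """Return a vm.nr_hugepages value large enough to back the Oracle SGA.

    A few pages of slack are added so the allocation does not fail if the
    shared-memory segments are rounded up slightly.
    """
    return math.ceil(sga_bytes / HUGE_PAGE_BYTES) + slack_pages

# Hypothetical 400 GiB SGA (illustrative only):
print(nr_hugepages_for_sga(400 * 1024**3))   # value to set as vm.nr_hugepages
```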

2.2.2 Hadoop Workload Configuration

Cluster of 24 HP DL380G7 servers:

  • Two 3.6 GHz 4-core Intel Xeon X5687 (Westmere-EP) processors with hyper-threading enabled
  • 72GB 1333MHz memory, 68GB used for the VM
  • 16 local 15K RPM SAS drives, 12 used for Hadoop data
  • Broadcom 10 GbE NICs connected together in a flat topology through an Arista 7050S switch
  • Software: vSphere 5.1, RHEL 6.1, CDH 4.1.1 with MRv1
  • vSphere network driver bnx2x, upgraded for best performance

2.2.3 Latency-sensitive workload configuration

The test bed consists of one server machine for serving RR (request-response) workload requests and one client machine that generates RR requests.

  • The server machine is configured with dual-socket, quad-core 3.47GHz Intel Xeon X5690 processors and 64GB of RAM, while the client machine is configured with dual-socket, quad-core 3.40GHz Intel Xeon X5690 processors and 48GB of RAM.
  • Both machines are equipped with a 10GbE Broadcom NIC. Hyper-threading is not used.
  • Two different configurations are used. A native configuration runs native RHEL6 Linux on both the client and server machines, while a VM configuration runs vSphere on the server machine and native RHEL6 on the client machine. vSphere hosts a VM that runs the same version of RHEL6 Linux. For both the native and VM configurations, only one CPU (one VCPU for the VM configuration) is configured to be used, to remove the impact of parallel processing and discard any multi-CPU related overhead.
  • The client machine is used to generate RR workload requests, and the server machine is used to serve the requests and send replies back to the client in both configurations. VMXNET3 is used for virtual NICs in the VM configuration. A 10 Gigabit Ethernet switch is used to interconnect the two machines.

2.3 Tools

Obviously, tools are critical in a study like this:

  • We collected data from the usual Linux performance tools, such as mpstat, iostat, sar, numastat, and vmstat.
  • With the later RHEL 6.1 experiments, we used the Linux perf(1) [5] facility to profile the native/guest application.
  • Naturally, in a study of virtualization overhead, vSphere tools are the most widely used:
    • Esxtop is commonly the first tool vSphere performance analysts turn to.
    • kstats is an internal tool that profiles the VMM. It is not a complete profile graph of the VMM or the guest application. If one thinks of the VMM as a collection of events or services, kstats shows how much time was spent in each service, including unmodified guest user/kernel.
    • vmkstats, which profiles the VMkernel routines.
    • Various other vSphere tools: memstats, net-stats, schedsnapshot, etc.
  • We relied heavily on Oracle statspack (also known as AWR) stats for the OLTP workload.
  • We collected and analyzed Unisphere performance logs from the arrays to study any possible bottlenecks in the storage subsystem for the OLTP workload.
  • Hadoop maintains extensive statistics on its own execution. These are used to analyze how well-balanced the cluster is and the execution times of various tasks and phases.

2.3.1 Hardware counters

The tools above are commonly used by vSphere performance analysts inside and outside VMware to study the performance of vSphere applications, and we made extensive use of them. But one of the key takeaways of this study is that to really understand the sources of virtualization overhead, these tools are not enough. We needed to go one level deeper and augment the traditional tools with tools that reveal the hardware execution profile.

Recent processors have expanded categories of hardware events that software can monitor [7]. But using these counters requires:

  1. A tool to collect data
  2. A methodology for choosing from among the thousands of available events, and for combining the event counts to derive statistics for events that are not directly monitored (an example derivation is sketched below)
  3. Meaningful interpretation of the results

All of the above typically require a close working relationship with microprocessor vendors, which we relied on heavily to collect and analyze the data in this study.
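As an illustration of item 2 above, the sketch below shows the kind of derived statistics we build from raw event counts. The event names and example counts are placeholders; the derivations themselves (CPI as cycles per instruction, path length as instructions per transaction) are the ones used in the tables later in this paper.

```python
def derived_stats(raw: dict, transactions: float) -> dict:
    """Combine raw PMU event counts into the derived metrics used in this paper.

    `raw` maps event names to counts accumulated over the measurement interval;
    `transactions` is the number of benchmark transactions completed in that
    interval.
    """
    cycles = raw["unhalted_core_cycles"]
    instrs = raw["instructions_retired"]
    return {
        "CPI": cycles / instrs,                  # cycles per instruction
        "path_length": instrs / transactions,    # instructions per transaction
        "cycles_per_tran": cycles / transactions,
        # Statistics that are not monitored directly can be derived, e.g. the
        # fraction of all cycles spent walking page tables on TLB misses:
        "tlb_walk_fraction": raw["dtlb_miss_walk_cycles"] / cycles,
    }

# Illustrative counts only (not measured values):
example = {
    "unhalted_core_cycles": 3.0e12,
    "instructions_retired": 1.4e12,
    "dtlb_miss_walk_cycles": 2.1e11,
}
print(derived_stats(example, transactions=1.0e6))
```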

Processor architecture has become increasingly complex, especially with the advent of multiple cores residing in a NUMA node, typically with a shared last level cache. Analyzing application execution on these processors is a demanding task and necessitates examining how the application is interacting with the hardware. Processors come equipped with hardware performance monitoring units (PMU) that enable efficient monitoring of hardware-level activity without significant performance overheads imposed on the application/OS. While the implementation details of a PMU can vary from processor to processor, two common types of PMU are Core and Uncore (see 2.3.1.1). Each processor core has its own PMU, and in the case of hyper-threading, each logical core appears to have a dedicated core PMU all by itself.

A typical Core PMU consists of one or more counters that are capable of counting events occurring in different hardware components, such as the CPU, cache, or memory. The performance counters can be controlled individually or as a group; the controls offered usually include enable, disable, and overflow detection (via generation of an interrupt), and support different modes (user/kernel). Each performance counter can monitor one event at a time. Some processors restrict the types of events that can be monitored from a given counter. The counters and their control registers can be accessed through special registers (MSRs) and/or through the PCI bus. We will look at the details of the PMUs supported by the two main x86 processor vendors, Intel and AMD.

2.3.1.1 Intel

Intel x86 processors have a Core PMU that has a number (currently 3) of fixed-purpose counters, each 48 bits wide. Each fixed counter can count only one architectural performance event, thus simplifying the configuration. In addition to the fixed counters, the Core PMU also supports 4-8 general-purpose counters that are capable of counting any activity occurring in the core. Starting with the Sandy Bridge microarchitecture, each core is presented with 8 counters when hyper-threading is disabled; otherwise, each core is limited to 4. Each Core PMU has a set of control registers to assist with programming the performance counters. The PMU also has EventSelect registers corresponding to each counter, which allow specification of the exact event that should be counted. Intel also offers, as part of the Core PMU, a set of complex events that monitor transactions originating in the “core” and satisfied by an “off-core” component, such as DRAM or the LLC on a different NUMA node. An example of an off-core event is remote-DRAM accesses from this core, which gives an idea of all the memory requests that were responded to by a remote memory bank in a multi-socket (NUMA) system. The diagram in Figure 1 illustrates the Core PMU on Intel x86 processors.

Figure 1. Intel x86 PMU

To monitor activity outside of the “Core,” Intel provides the Uncore PMU. The Uncore PMU is independent of any particular core and tracks inter-core communication as well as communication between the cores and the uncore. An example of an Uncore event is the memory bandwidth of the interconnect (QPI). By their nature, Uncore events cannot be attributed to any particular core or to an application running on it.

2.3.1.2 AMD

AMD processors have a Core PMU for monitoring core-specific activity. Each Core PMU supports six 48-bit performance counters that track activity in the core. Analogous to the Uncore PMU in Intel, AMD has a NB PMU (Northbridge PMU), which supports four 48-bit counters per node. Northbridge provides interfaces to system memory, local cores, system IO devices, and to other processors. Using NB PMU, it is possible to monitor activity across cores and from cores to other components in a node.

2.3.1.3 Hardware profiling tools

In order to fully utilize the PMU, powerful tools are required to hide the complexity of programming the counters, to work around the limited number of counter resources, and to digest the data produced. Several tools make use of the PMU, such as Intel’s VTune and the open-source tools Perf and OProfile, among others. Tools such as VTune, OProfile, and other call-stack profilers make use of the overflow detection facility in the PMU and collect call stacks. Certain other tools, such as Perf and Intel’s EMon, accumulate event counts for a given duration and can do system-wide event collection (a usage sketch follows Figure 2). Figure 2 shows the general architecture of hardware profiling tools.

Figure 2. Architecture of hardware profiling tools
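As a simple example of the counting style of collection just described, the following sketch shells out to Linux perf for a system-wide count of a few generic events over a fixed interval. The event list is illustrative; for the model-specific events used in this study one would pass the raw encodings documented by the processor vendor.

```python
import subprocess

def perf_stat_systemwide(events, seconds=10):
    """Run `perf stat -a` for a fixed duration and return its text output.

    perf writes its counter summary to stderr; `-x,` requests CSV-style
    output that is easy to post-process.
    """
    cmd = [
        "perf", "stat", "-a", "-x", ",",
        "-e", ",".join(events),
        "--", "sleep", str(seconds),
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stderr

if __name__ == "__main__":
    # Generic symbolic events; vendor-specific raw encodings can be passed
    # the same way.
    print(perf_stat_systemwide(["cycles", "instructions", "dTLB-load-misses"]))
```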

2.3.1.4 Hardware profiling on a virtualized server

For vSphere-level monitoring, VMware provides a tool named vmkperf that allows one to program the PMU and examine activity generated by the guests, the guest kernels, and the vSphere kernel. vmkperf is limited by the number of underlying counters available and presents raw data to the user. Intel offers a tool named EMon that works with Linux and vSphere. It is capable of working with hundreds of events by time-multiplexing the counter resources, produces summary statistics, and allows for sophisticated post-processing of the event data. We made extensive use of EMon and Perf to examine all the layers present when running applications inside virtual machines.

On vSphere we used a version of EMon called esx-emon that collected hardware counts at the vSphere level (i.e., for the whole system), including the guest, the vSphere VMM, and VMkernel. We also collected data inside the guest using native OS tools by exploiting a feature called vPMC, described next. 

2.3.2 vPMC

Beginning with virtual machines that have vSphere 5.1 and later compatibility, the core performance counters can be accessed from within a virtual machine by using the Virtual Performance Monitor Counter (vPMC) feature [12]. This feature allows software running inside the virtual machine to configure and access the performance counters in the same way as when running directly on a physical CPU. This allows existing unmodified profiling tools to be leveraged to analyze the performance of software running inside a virtual machine.

The virtual performance counters monitor the same processor core events as the underlying physical CPU’s PMCs. However, the performance counters of the physical CPU are not directly passed through to the software running inside the virtual machine because of a few differences between virtual CPUs and physical CPUs [4]. The first difference is that virtual CPUs time-share the physical CPUs. That is, a virtual CPU can be descheduled from a physical CPU by saving the state of the virtual CPU so that execution can resume later. The physical CPU can then be used to run a different virtual CPU. When descheduled, a virtual CPU can also be migrated from one physical CPU to another physical CPU. The second difference is that some guest instructions running on a virtual CPU are intercepted by the VMM and retired using software emulation, rather than executing directly on the physical CPU. For example, when executed inside a virtual machine, the CPUID instruction will trap into the VMM. The VMM will then decode and emulate the guest CPUID instruction and increment the virtual instruction pointer, retiring the CPUID instruction using emulation. The VMM will then return control back to the guest at the next instruction.

The virtual PMCs are paused and their state is saved when their virtual CPU is descheduled, and the virtual PMCs’ state is restored and unpaused when their virtual CPU is rescheduled. This allows the physical PMCs to be time-shared between the virtual CPUs that are time-sharing a physical CPU. Also, this prevents performance monitoring events that should be associated with another virtual CPU from being accounted to a descheduled VCPU. Finally, it allows the virtual PMCs’ state to be migrated when a virtual CPU is migrated to another physical CPU.

The behavior of the virtual performance counters across VMM intercepts is configurable. In all cases, when the guest instructions are executing directly, the virtual PMCs will use the physical PMCs to count the performance monitoring events that are incurred by the guest instructions.

The first option, called guest mode, causes the virtual performance counters to pause whenever the VMM intercepts guest execution. When guest execution resumes, the virtual performance counters are unpaused and continue counting performance events using the underlying physical PMCs. Events incurred while executing in the VMM or VMkernel on behalf of this virtual CPU are not counted by the virtual PMCs. A profiler running inside the guest will show overheads incurred only while directly executing guest instructions. For example, when the guest executes CPUID, this guest instruction will appear to execute nearly for free because the counters are paused while executing in the VMM.

The second option is called vcpu mode. In this mode, virtual performance counters continue to count even across an intercept into the VMM. A profiler running inside the guest will show all the events incurred, even while executing the VMM and VMkernel on behalf of the virtual CPU. That is, the profiler will be able to accurately account the time that is required to execute a guest CPUID instruction. The profiler will also show that many instructions were retired when the guest executes a single CPUID instruction, because all the VMM instructions that were executed in order to emulate the CPUID will be accounted.

The third option is a mixture of the first two options and is called hybrid mode. In this mode, any virtual performance counter that is configured to count an at-retirement event (such as instructions retired or branches retired) behaves as guest mode. All other events, including the cycle counters, behave as vcpu mode. When this configuration is used, a profiler running inside the guest will show an accurate count of guest instructions executed and will also provide a picture of the overheads incurred due to intercepts into the VMM.

We typically collected results in guest mode, which excludes the execution time in the VMM and VMkernel. The comparison between the data collected inside the guest in this mode and the data collected by esx-emon allowed us to measure the impact of VMM and VMkernel execution on system-level performance.
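That comparison reduces to a simple subtraction, sketched below with kernel-mode cycles-per-transaction values of the same magnitude as those in Table 2; the numbers illustrate the bookkeeping, and the sketch is not a measurement tool.

```python
def hypervisor_share(esx_emon_cycles_per_tran: float,
                     guest_mode_cycles_per_tran: float) -> dict:
    """Split system-wide cycles/transaction into guest and hypervisor parts.

    esx-emon sees guest + VMM + VMkernel cycles; a guest-mode vPMC collection
    sees only cycles spent directly executing guest code. The difference is
    attributable to the VMM and VMkernel.
    """
    hypervisor = esx_emon_cycles_per_tran - guest_mode_cycles_per_tran
    return {
        "guest": guest_mode_cycles_per_tran,
        "vmm_plus_vmkernel": hypervisor,
        "hypervisor_fraction": hypervisor / esx_emon_cycles_per_tran,
    }

# Kernel-mode cycles/transaction of roughly the magnitude seen in Table 2:
print(hypervisor_share(esx_emon_cycles_per_tran=697e3,
                       guest_mode_cycles_per_tran=456e3))
```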

3. Results

The OLTP results (with RHEL 5.6 and vSphere 5.0) in Table 1 were the original impetus behind this investigation.

PERFORMANCE METRIC       NATIVE   VSPHERE 4.1   VSPHERE 5.0
Users                    125      160           160
Throughput (Trans/sec)   10.6K    8.4K          8.4K
Ratio to native          1        80%           79%
IOPS                     68.6K    54.0K         53.8K
Idle left                1.4%     2.6%          1.8%

Table 1. vSphere 5.0 results

We repeated the experiments on different hardware platforms, and several different Linux releases, and saw similar results. For these experiments, the server was configured for repeatability of runs and for stability of collected data, for example, by pinning each virtual CPU to a single physical CPU to have a near-constant execution pattern on each physical CPU and consistent hardware event counts throughout the run. This had a negative impact on performance in the virtual environment and made the native-virtual performance gap wider than necessary, but made the analysis easier. The native-virtual performance ratio did not change much when we increased the number of cores and vCPUs to 32 as depicted in Figure 3. In other words, SMP scaling is very good.

Figure 3. Scaling to 32-way

We ran in excess of 10,000 experiments on several different systems to collect and analyze data. All of the tools described in Section 2.3 were used for these tests, and results were collected and archived.

4. Hardware Support for Page Translation

We will provide a brief description of hardware support for page translation because it figures prominently in the results below. Memory translation in the context of virtual machines was historically expensive because the VMM was burdened with keeping shadow page tables updated. These shadow page tables maintained the mapping from Guest Physical Addresses to Host Physical Addresses; the mapping from Guest Logical to Guest Physical addresses is handled by the guest OS. Intel and AMD introduced specialized hardware-level page tables to support virtualization. This infrastructure is known as Extended Page Tables (EPT) in Intel CPUs and Nested Page Tables (NPT) in AMD processors.

The benefit of a hardware-assisted MMU comes from avoiding frequent calls into the VMM to update shadow page tables, thereby improving performance in general. Despite the performance gains obtained by switching from a software MMU to an EPT/NPT-based one, however, there is still considerable overhead inherently associated with EPT due to the multi-level walk upon a TLB miss, as depicted in Figure 4 (a simplified count of the memory references involved follows the figure). Walking each level of the guest page table structure incurs an entire page-table walk of the EPT/NPT structure, considerably increasing the CPU cycles spent on TLB misses. Figure 4 shows the translation flow for small (4K) pages. There is one less level for large (2MB) pages in each direction (independently for guest and host pages).

Figure 4. Hardware support for 4K page translations
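A back-of-the-envelope count of the memory references behind Figure 4 makes the cost concrete. Assuming a 4-level guest page table and a 4-level EPT, each guest page-table pointer (including the root) and the final guest physical address must itself be translated through the EPT; this is where the commonly cited figure of 24 references per 4K-page TLB miss comes from. This is a simplified model, not a measured value.

```python
def tlb_miss_references(guest_levels: int = 4, ept_levels: int = 4) -> int:
    """Worst-case memory references to resolve one TLB miss under EPT/NPT.

    Each of the guest_levels page-table reads, plus the final guest physical
    address, needs a nested walk of ept_levels references, and the guest
    page-table entries themselves must also be read:
        (guest_levels + 1) * ept_levels + guest_levels
    """
    return (guest_levels + 1) * ept_levels + guest_levels

print(tlb_miss_references())        # 24 references for 4K pages on both levels
print(tlb_miss_references(3, 3))    # 15 when 2MB pages remove one level on each side
print(4)                            # a native 4-level walk needs only 4 references
```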

5. Analysis

5.1 OLTP workload

The common perception is that the virtualization overhead is mostly due to going through the I/O stacks twice: the guest performs I/O using the virtual drivers, but the same networking packet or disk access gets issued by the hypervisor, incurring a second driver overhead. To see if this was indeed the main source of the overhead, we ran the Oracle client processes inside the guest VM, eliminating networking. We also drastically reduced the database size and increased the SGA size to cache the database, lowering the disk I/O from around 45K IOPS to 10K IOPS. We still saw a 15% overhead compared to native in this configuration, which was geared towards collecting stable data from various tools (the gap would have been smaller had we strived for optimum performance rather than repeatable, stable statistics). With the following profile, we would have expected much better performance:

  • The guest spends 86-90% in user mode. So the performance gap was not due to a large, unexpected rise in the time in the guest kernel.
  • Kstats says 96-97% of time was in DIRECT_EXEC_64 and DIRECT_EXEC_KERNEL_64. This is time in direct execution of guest code, including privileged code. In other words, there wasn’t a large growth in the execution code path length due to virtualization.
  • Vmkstats says 95-96% of time was in VMM and guest. So the time in VMkernel was very small.

So where do the cycles go?

The results in Table 2 show that:

  • Our cycles/tran went up by ~20% (the arithmetic is sketched after Table 2)
  • 10% of that was in user-mode processing
    • User-mode path length was relatively constant, but CPI (Cycles Per Instruction) grew by 11-12%
  • 2.5% was growth in kernel-mode processing in the guest
    • There was also a slight increase in kernel-mode path length inside the guest, from 123K instr/tran to 135K
    • Kernel CPI was worse as well
  • 8% was kernel-mode processing in the VMkernel and VMM
  • We had ~3% idle in the virtual case, even with minimal I/O. (We later changed a configuration parameter and removed this extra idle in the virtual case, but were not able to repeat all the experiments with this new, more optimal configuration.)

HARDWARE COUNTER       MODE     NATIVE   ESX-EMON   GUEST EMON
Thread Util            Kernel   13%      18%        12%
                       User     87%      78%        78%
Unhalted core cycles   Kernel   380K     697K       456K
                       User     2618K    2977K      2914K
Thread CPI             Kernel   3.08     3.38       3.34
                       User     2.00     2.23       2.24
Path length            Kernel   123K     206K       136K
                       User     1306K    1335K      1301K

Table 2. High level hardware stats
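The ~20% growth cited in the first bullet above follows directly from Table 2: cycles per transaction in each mode is the product of CPI and path length. The snippet below simply redoes that arithmetic with the Table 2 values; it lands in the low twenties of percent, consistent with the rounded figure quoted in the text.

```python
# (CPI, path length in instructions per transaction) for each mode, from Table 2
native  = {"kernel": (3.08, 123e3), "user": (2.00, 1306e3)}
virtual = {"kernel": (3.38, 206e3), "user": (2.23, 1335e3)}   # esx-emon view

def cycles_per_tran(profile):
    # cycles/transaction = CPI x path length, summed over kernel and user modes
    return sum(cpi * path for cpi, path in profile.values())

n, v = cycles_per_tran(native), cycles_per_tran(virtual)
print(f"native {n:.0f}, virtual {v:.0f}, growth {(v - n) / n:.1%}")
```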

So we need to determine the source of the increase in user and kernel mode cycles despite the user mode path length staying the same and only a modest increase (when compared to the overall path length) in the kernel mode path length. Studying the full hardware counts led us to the TLB events in Table 3. (We have included the Instruction TLB events in the table; however, the counts for the virtual case are suspect. We are currently discussing these events with our Intel colleagues.)

A complete discussion of memory management in a virtual environment is outside the scope of this paper; consult [8] for an overview of this topic. In brief, the modern MMU-assist features in Intel and AMD microprocessors make it easier to develop a hypervisor and greatly improve performance for certain applications. Our earlier results with the OLTP workload showed that even this workload benefited from hardware-assisted MMU virtualization [11]. However, this doesn’t mean that address translation in a virtual environment incurs no additional overheads. There is still an overhead associated with EPT (Intel’s name for the hardware-assisted MMU for virtualized servers). The data in Table 3 shows that the EPT overhead alone accounts for 6.5% of the 11.5% growth in user-mode cycles. Also, EPT overhead accounts for 1.1% of the 2.5% growth in the cycles in kernel mode in the guest. So all in all, EPT overhead accounts for nearly half the virtualization overhead for this workload. Discussing this data with our Intel colleagues, they believe some of the CPI increase caused by higher cache misses is also due to the impact of having to traverse two page table data structures rather than one. In other words, the address translation costs are responsible for well over half the virtualization overhead.

Combining data from vmkstats and emon tells us that about 6% of the increase in CPU cycles/Oracle transaction is due to VMkernel, and another ~2% is in the VMM.

HARDWARE COUNTER                               MODE     NATIVE   ESX-EMON   GUEST EMON
ITLB miss latency cycles                       Kernel   28       14         9
                                               User     33       3          3
DTLB miss latency cycles                       Kernel   25       65         69
                                               User     39       97         104
ITLB miss page walks completed                 Kernel   15       130        71
                                               User     2,192    5,036      4,910
ITLB miss page walk cycles                     Kernel   881      2,083      616
                                               User     85,795   13,296     13,366
DTLB miss page walks completed                 Kernel   1,218    1,901      1,314
                                               User     3,548    3,604      3,385
DTLB miss PMH busy cycles (total miss cycles)  Kernel   40K      124K       91K
                                               User     179K     351K       353K
Extended Page Table walk cycles                Kernel   0        36K        35K
                                               User     0        202K       198K
Percent of instructions that hit in L1 cache   Kernel   86%      86%        85%
                                               User     87%      86%        87%
Percent of instructions that hit in L2 cache   Kernel   8%       8%         8%
                                               User     9%       8%         8%
Percent of instructions that hit in LLC        Kernel   6%       6%         N/A
                                               User     5%       5%         N/A

Table 3. TLB events per transaction for the OLTP workload 

5.2 Hadoop

The results presented here are for the map-shuffle phase of TeraSort on a 24-host cluster configured with a single VM on each host. During this phase the CPU is very nearly saturated, network bandwidth is about 1.3 Gb/sec (per host, in each direction), and total storage bandwidth is about 500 MB/sec per host.

This phase has a 16.2% greater elapsed time in the virtual case compared to native (equivalent to 14% lower throughput). As with the OLTP workload, a smaller overhead is expected based on the map-shuffle profile:

  • 82% of guest time in user mode
  • Less than 1% of total time spent in the VMM
  • About 5% of total time spent in the VMkernel

HARDWARE COUNTER       MODE     NATIVE    ESX-EMON   GUEST EMON
Thread Util            Kernel   14.8%     21.7%      15.2%
                       User     83.4%     77.7%      77.8%
Unhalted core cycles   Kernel   4,613     7,845      5,516
                       User     25,954    28,106     28,229
Thread CPI             Kernel   2.88      3.62       3.80
                       User     1.36      1.49       1.49
Path length            Kernel   1,602     2,164      1,452
                       User     19,109    18,856     18,959

Table 4. High level hardware statistics for the Hadoop workload

Table 4 shows the high-level hardware statistics. For the cycles and path length metrics a “transaction” is a single row processed.

The observed increase in total elapsed time is equal to a 17.6% increase in cycles per transaction less a small increase in CPU utilization. The increase in cycles is not much different from the 20% observed for the OLTP workload. However, the breakdown is different, which in turn can be used to characterize the differences between the workloads. First, 60% of the increase in cycles is in kernel mode, compared to a 50-50 split for the OLTP workload. This is despite the OLTP workload having a significantly higher fraction of its instructions executing in kernel mode. The big difference between the workloads is how the CPI changes with virtualization. The CPI increases by about 12% for user mode and for both hypervisor and guest kernel modes in the OLTP workload. For Hadoop, the increase is 10% for user mode and 26% for kernel mode. The difference between the esx-emon and guest emon statistics shows that the hypervisor has a CPI 14% larger than native, but the increase is 32% for the guest kernel. These large increases in kernel CPI, combined with much more modest changes in path length, result in virtualization overheads in the guest kernel and hypervisor (3% and 7.6%, respectively) that are almost identical to those of the OLTP workload.
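The relationship among the elapsed-time, throughput, and cycles-per-row figures quoted above is simple arithmetic; the sketch below reproduces it from the Table 4 values.

```python
# Per-row cycles from Table 4 (kernel + user, native vs. esx-emon view)
native_cycles  = 4_613 + 25_954
virtual_cycles = 7_845 + 28_106

cycle_growth = virtual_cycles / native_cycles - 1
print(f"cycles/row growth: {cycle_growth:.1%}")    # about 17-18%, the 17.6% in the text

elapsed_growth = 0.162                             # 16.2% longer map-shuffle phase
print(f"throughput loss:   {1 - 1 / (1 + elapsed_growth):.1%}")   # about 14% lower throughput
```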

Much of the above high-level view can be explained by the lower-level hardware counts shown in Table 5.

HARDWARE COUNTER                              MODE     NATIVE   ESX-EMON
Percent of instructions that hit in L1 cache  Kernel   95%      90%
                                              User     99.2%    98.6%
Percent of instructions that hit in L2 cache  Kernel   3.2%     5.3%
                                              User     0.8%     0.9%
Percent of instructions that hit in LLC       Kernel   1.4%     3.1%
                                              User     0.3%     0.3%
DTLB miss PMH busy cycles                     Kernel   312      1,154
                                              User     1,413    3,039
EPT cycles                                    Kernel   0        420
                                              User     0        1,377

Table 5. Low-level hardware statistics for the Hadoop workload

Unlike the OLTP workload, the Hadoop workload has a very high L1 user instruction hit ratio for both the native and virtual cases. Most of the user instruction fetches that miss in the L1 hit in the L2 cache. The result is a relatively low user CPI and a modest increase from native to virtual. Kernel instructions miss the L1 5% of the time for native and twice that for virtual. This pattern continues to the L2 cache and LLC. This is consistent with a higher kernel CPI and a large increase from native to virtual.

We speculate that the sizes of the guest kernel code required to execute the OLTP and Hadoop workloads are similar, but the DBMS user code needed for OLTP is far larger than the Java user code needed to execute TeraSort. The latter fits easily into the L1 instruction cache, while the guest and hypervisor kernel code compete with each other. For OLTP, the guest user and kernel code must compete for L1 cache space, and so adding the hypervisor kernel code does not have a big effect. One implication of this is that running a sufficiently complex Hadoop workload should show less virtualization overhead than TeraSort.

As with the OLTP workload, EPT processing accounts for the bulk of the growth in user mode time: 5.3% out of 8.3%. It accounts for a bigger relative increase in kernel mode time (9%), although this is a small part of the 70% overall increase in kernel time. Overall address translation costs are almost half of the virtualization overhead. Vmkstats shows that 5% out of the overall 16.5% overhead is due to VMkernel, and kstats shows that 0.8% is in the VMM. Most of the rest is due to increased CPI from more instruction cache misses.

5.3 Latency-sensitive applications

This section evaluates the response time overhead of four different RR workloads on vSphere (VM configuration) against a native machine (native configuration).

Four different request-response workloads are used to evaluate and analyze the response time overhead: 1) Ping, 2) Netperf_RR, 3) Apache, and 4) Gemfire_Put.

  • Ping – Default ping parameters are used except that the interval is set to .001 seconds, meaning 1000 ping requests are sent out every second.
  • Netperf_RR – The Netperf micro benchmark [6] is used to generate an RR workload in TCP.
  • Apache – The client generates HTTP traffic that is sent to the Apache Web server [3]. The Apache Web server is configured to run only one server process so that there is always one transaction handled at a time.
  • Gemfire_Put – Gemfire [13] is a Java-based distributed data management platform. Gemfire_Put is a benchmark workload built with Gemfire in which the client node sends a request with a key to update an object stored in the server nodes. There are two server nodes, which are replicated. All nodes are Java processes. The client node runs on the client machine, and the server nodes run on the server machine (or in a VM for the VM configuration).

Table 6 compares the response times of the four workloads between the native and VM configurations. Ping exhibits the lowest response time because it is the simplest workload. Gemfire_Put shows the highest response time, indicating that it is the most complicated workload. The overhead of Ping is 13 µs, while the overhead of Gemfire_Put is 23 µs. When a workload is virtualized, a certain amount of overhead is expected due to the extra layers of processing. Because a request-response workload processes exactly one request-response pair per transaction, it is reasonable to anticipate a similar amount of response-time overhead across different variations of the workload (as long as the packet size is similar). It is interesting, however, to observe that the absolute overhead is not constant, but increases with more complicated workloads.

RESPONSE TIME (µs)   PING   NETPERF   APACHE   GEMFIRE_PUT
Native               26     38        88       134
VM                   39     52        106      157
Overhead             13     14        18       23

Table 6. Response times of four RR workloads

In order to understand why the overhead is not constant, hardware performance counters are used. Hardware performance counters provide an effective way to understand where CPU cycles are spent.

Observe that there is a noticeable difference in the code path executed between the native and VM configurations mainly due to the difference in the device driver; the VM configuration uses a paravirtualized driver, VMXNET3, while the native configuration uses a native Linux device driver. This makes it hard to compare performance counters of the two configurations fairly, because they execute different guest code. For these reasons, a simpler workload is used to find any difference in hardware performance counters between the two configurations. By running both the client and server on the same VM (1 VCPU), the two configurations get to execute the same (guest) code. Any network I/Os are removed, while still exercising the (guest) kernel TCP/IP code. The workload becomes completely CPU-intensive, avoiding halting and waking up the VCPU, removing the difference in the halting path. Another aspect of this kind of workload is that guest instructions directly execute without the intervention of the hypervisor.

Table 7 compares the major performance counters of the two configurations: native and VM (vPMC-Hybrid). For both configurations, the Linux profiling tool perf [5] was used to collect performance counters. In the vPMC-Hybrid configuration, all events except instructions retired and branches retired increment regardless of whether the PCPU is running guest or hypervisor instructions on behalf of the VM; the instructions retired and branches retired events count guest instructions only. This way, metrics involving ratios of guest instructions can be used to calculate the cost of executing instructions on the virtual machine. For example, one can compare the native and VM’s effective speed using IPC.

The test runs for a fixed number of iterations (that is, transactions) and hardware counters are collected during the entire period of a run. Because both native and VM execute the same (guest) instructions, the instructions retired counters of both configurations (native and VM) are very similar. Extra cycles spent in the VM (compared to the native machine), therefore, become the direct cause of the increase in response time.

HARDWARE COUNTER         NATIVE   vPMC-HYBRID (VM)
CPI                      1.43     1.56
Unhalted Cycles          376B     406B
# Instructions           261B     259B
TLB-misses-walk-cycles   227B     255B
EPT-walk-cycles          0        16B
L1-dcache-misses-all     4.5B     6.1B

Table 7. Performance counters comparison

From Table 7, the biggest contributor to the increased cycles when guest instructions execute directly is the Extended Page Table (EPT) walk cycles, which account for 4% out of the 8% increase in cycles. The use of an additional page table mechanism (that is, EPT) to keep track of the mappings from PPN to MPN requires extra computing cycles when TLB misses occur; the EPT walk cycles count these extra cycles. Interestingly, L1/L2/L3 data cache misses also increased (only L1 data cache misses are shown in the table); 3% out of the 8% comes from L1/L2/L3 cache misses, with L1 data cache misses constituting the majority. The additional page table structures (the EPT page tables) are suspected to incur more memory accesses, increasing L1 data cache misses. An increase in TLB misses (excluding EPT walk cycles) takes the remaining 1% of the 8% increase. This is likely due to more memory accesses.
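The 4%/3%/1% attribution above can be reproduced directly from the Table 7 counters; the short sketch below shows the bookkeeping. The cache-miss and TLB shares are quoted from the discussion above rather than recomputed from raw miss counts.

```python
# Values from Table 7 (cycles accumulated over the fixed-iteration run)
native_cycles, vm_cycles = 376e9, 406e9
ept_walk_cycles = 16e9

total_increase = vm_cycles - native_cycles
print(f"total increase: {total_increase / native_cycles:.1%} of native cycles")   # ~8%
print(f"EPT walks:      {ept_walk_cycles / native_cycles:.1%} of native cycles")  # ~4%
# Per the discussion above, the remaining ~4% splits into roughly 3% from extra
# L1/L2/L3 data-cache misses and ~1% from extra TLB-miss walk cycles.
```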

Assuming that the TLB miss rate remains the same (just for the sake of analysis) when executing guest instructions, the overhead caused by using EPT and the associated cache misses grows proportionally with the duration of guest code execution. In other words, the longer the guest code runs, the more overhead will show up in response time. This explains why there is a variable component in the response time overhead of virtualization.

6. Conclusions and Future Directions

Using a combination of well-known Linux performance monitoring commands, the traditional vSphere performance evaluation tools, and hardware monitoring tools, we have shown that all the workloads studied here exhibit similar performance profiles when virtualized:

  • The ratio to native is good, usually in excess of 80%.
  • The VMM and VMkernel overheads are minimal, except for the very high I/O cases, where we see as much as 10% of the time spent in the VMkernel.
  • Much of the difference with native is due to the tax virtualization places on the hardware, specifically in traversing a two-level MMU.
  • Besides the direct EPT costs, the dual page table architecture increases the contention for data caches, further contributing to the increase in CPI.

Besides the EPT costs, most of the virtualization overheads that we discovered in our study are small. They are easy to dismiss. Yet together, they add up to a total performance gap that requires attention.

7. Acknowledgements

Many individuals contributed to the work presented here. We would like to thank Shilpi Agarwal, Nikhil Bhatia, Ole Agesen, Qasim Ali, Seongbeom Kim, Jim Mattson, Priti Mishra, Jeffrey Sheldon, Lenin Singaravelu, Garrett Smith, and Haoqiang Zheng.

References           

  1. Adams, Keith, and Agesen, Ole: “A Comparison of Software and Hardware Techniques for x86 Virtualization”, Proceedings of ASPLOS 2006.
  2. Agesen, Ole: “Software and Hardware Techniques for x86 Virtualization”; http://www.vmware.com/files/pdf/software_hardware_tech_x86_virt.pdf, 2009
  3. Apache HTTP Server Project. The Apache Software Foundation. http://httpd.apache.org
  4. Serebrin, B., Hecht, D.: Virtualizing Performance Counters. 5th Workshop on System-level Virtualization for High Performance Computing 2011, Bordeaux, France
  5. perf: Linux profiling with performance counters. https://perf.wiki.kernel.org/index.php/Main_Page
  6. Netperf. www.netperf.org, 2011. http://www.netperf.org/netperf/
  7. Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
  8. Performance Evaluation of Intel EPT Hardware Assist http://www.vmware.com/pdf/Perf_ESX_Intel-EPT-eval.pdf
  9. TPC: Detailed TPC-C Description: http://www.tpc.org/tpcc/detail.asp
  10. Virtualizing Performance Counters https://labs.vmware.com/publications/virtualizing-performance-counters
  11. Virtualizing Performance-Critical Database Applications in VMware® vSphere™: http://www.vmware.com/pdf/Perf_ESX40_Oracle-eval.pdf
  12. VMware Knowledge Base: Using Virtual CPU Performance Monitoring Counters. http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2030221
  13. VMware vFabric GemFire. VMware, Inc., 2012.

Footnotes

  1. When VMware introduced its server-class, type 1 hypervisor in 2001, it was called ESX. The name changed to vSphere with release 4.1 in 2009.