Redefining ESXi IO Multipathing in the Flash Era

Fei Meng
North Carolina State University
fmeng@ncsu.edu

Li Zhou1
Facebook Inc.
lzhou@fb.com       

Sandeep Uttamchandani
VMware, Inc.
sandeepu@vmware.com 

Xiaosong Ma
North Carolina State University & Oak Ridge National Lab
ma@csc.ncsu.edu

Abstract

At the advent of virtualization, primary storage was synonymous with spinning disks. Today, the enterprise storage landscape is rapidly changing with low-latency all-flash storage arrays, specialized flash-based I/O appliances, and hybrid arrays with built-in flash. Also, with the adoption of host-side flash cache solutions (similar to vFlash), the read-write mix of operations emanating from the server is more write-dominated (since reads are increasingly served locally from cache). Is the original ESXi I/O multipathing logic that was developed for disk-based arrays still applicable in this new flash storage era? Are there optimizations we can develop as a differentiator in the vSphere platform for supporting this core functionality?

This paper argues that existing I/O multipathing in ESXi is suboptimal for flash-based arrays. In our evaluation, the maximum I/O throughput is not bound by a hardware resource bottleneck, but rather by the Pluggable Storage Architecture (PSA) module that implements the multipathing logic. The root cause is the affinity maintained by the PSA module between the host traffic and a subset of the ports on the storage array (referred to as Active Optimized paths). Today, the Active Un-optimized paths are used only during hardware failover events, since un-optimized paths exhibit higher service time than optimized paths. Thus, even though the Host Bus Adapter (HBA) hardware is not completely saturated, we are artificially constrained in software by limiting I/O to the Active Optimized paths.

We implemented a new multipathing approach called PSP Adaptive as a Path-Selection Plug-in in the PSA. This approach detects I/O path saturation (leveraging existing SIOC techniques), and spreads the write operations across all the available paths (optimized and un-optimized), while reads continue to maintain their affinity paths. The key observation was that the higher service times in the un-optimized paths are still lower than the wait times in the optimized paths. Further, read affinity is important to maintain given the session-based prefetching and caching semantics
used by the storage arrays. During periods of non-saturation, our approach switches to the traditional affinity model for both reads and writes. In our experiments, we observed significant (up to 30%) improvements in throughput for some workload scenarios. We are currently working with a wide range of storage partners to validate this model for various Asymmetric Logical Unit Access (ALUA) storage implementations and even Metro Clusters.

1. Introduction

Traditionally, storage arrays were built of spinning disks with a few gigabytes of battery-backed NVRAM as a local cache. The typical I/O response time was multiple milliseconds, and the maximum supported IOPS were a few thousand. Today, in the flash era, arrays advertise I/O latencies of under a millisecond and IOPS on the order of millions. XtremIO [20] (now EMC), Violin Memory [16], WhipTail [18], Nimbus [7], SolidFire [22], PureStorage [14], Nimble [13], GridIron (now Violin) [23], CacheIQ (now NetApp) [21], and Avere Systems [11] are some of the emerging startups developing storage solutions that leverage flash. Established players (namely EMC, IBM, HP, Dell, and NetApp) are also actively developing solutions. Flash is also being adopted within servers as a cache to accelerate I/Os by serving them locally; example solutions include [8, 10, 17, 24]. Given current trends, it is expected that all-flash and hybrid arrays will completely replace traditional disk-based arrays by the end of this decade. To summarize, the I/O saturation bottleneck is shifting: administrators are no longer worried about how many requests the array can service, but rather about how fast the server can be configured to send I/O requests and utilize the available bandwidth.

In this paper, we explore a novel idea: using both Active Optimized and Active Un-optimized paths concurrently. Active Un-optimized paths have traditionally been used only for failover, since these paths are known to exhibit higher service times than Active Optimized paths. Our hypothesis was that these service times were high only because the contention point used to be the array itself, whose bandwidth was limited by disk IOPS; in the new flash era, the array is far from being a hardware bottleneck. We discovered that our hypothesis is half true, and we designed a plug-in solution around it called PSP Adaptive.

The key contributions of this paper are:

  • A novel approach for I/O multipathing in ESXi, specifically optimized for flash-enabled arrays. Within this context, we designed an algorithm that adaptively switches between traditional multipathing and spreading writes across Active Optimized and Active Un-optimized paths.
  • An implementation of the PSP Adaptive plug-in within the PSA module, along with an experimental evaluation.

The rest of the paper is organized as follows: Section 2 describes related work and how multipathing works in ESXi today. Section 3 covers the design and implementation details, Section 4 presents the evaluation, and Section 5 concludes.

2. Related Work

Storage arrays fall into three categories with respect to their dual-controller implementation: active/active, active/standby, and Asymmetric Logical Unit Access (ALUA) [25]. Defined in the SCSI standard, ALUA provides a standard way for hosts to discover and manage multiple paths to the target. Unlike active/active systems, ALUA designates one of the controllers as optimized, and the current VMware ESXi path selection plug-in [12] serves I/Os only from that controller. Unlike active/standby arrays, which cannot serve I/O through the standby controller, an ALUA storage array is able to serve I/O requests from both the optimized and the un-optimized controller. ALUA storage arrays have become very popular: today most mainstream storage arrays (e.g., the most popular arrays made by EMC and NetApp) support ALUA.

Multipathing for Storage Area Networks (SANs) was designed as a fault-tolerance technique to avoid a single point of failure, as well as to improve performance via load balancing [3]. Multipathing has been implemented in all major operating systems, such as Linux, at different storage stack layers [9, 26, 28]; Solaris [5]; FreeBSD [6]; and Windows [19]. Multipathing has also been offered in third-party products such as Symantec Veritas [15] and EMC PowerPath [1]. ESXi has inbox multipathing support in the Native Multipathing Plug-in (NMP), which provides several path-selection algorithms (such as Round Robin, MRU, and Fixed) for different devices. ESXi also supports third-party multipathing plug-ins in the form of PSPs (Path Selection Plug-ins), SATPs (Storage Array Type Plug-ins), and MPPs (Multi-Pathing Plug-ins), all under the ESXi PSA framework. EMC, NetApp, Dell EqualLogic, and others have developed their solutions on ESXi PSA. Most implementations do simple round robin among active paths [27, 26] based on the number of completed I/Os or bytes transferred on each path. Some third-party solutions, such as EMC's PowerPath, adopt complicated load-balancing algorithms, but their performance is only on par with, or even worse than, VMware's NMP. Ueda et al. proposed dynamic load balancing for request-based device-mapper multipath in Linux, but did not implement this feature [29].

2.1 Multipathing in ESXi Today

Figure 1. High-level architecture of I/O Multipathing in vSphere

Figure 1 shows the I/O multipathing architecture of vSphere. In a typical SAN configuration, each host has multiple Host Bus Adapter (HBA) ports connected to both of the storage array's controllers. The host therefore has multiple paths to the storage array, and performs load balancing among them to achieve better performance. In vSphere, this is done by the Path Selection Plug-ins (PSPs) at the PSA (Pluggable Storage Architecture) layer [12]. The PSA framework collapses multiple paths to the same datastore and presents one logical device to the upper layers, such as the file system. Internally, the NMP (Native Multipathing Plug-in) framework allows different path-selection policies by supporting different PSPs. The PSPs decide which path to route an I/O request to. vSphere provides three different PSPs: PSP FIXED, PSP MRU, and PSP RR. Both PSP FIXED and PSP MRU utilize only one path for I/O requests and do not do any load balancing, while PSP RR does simple round-robin load balancing among all active paths for active/active arrays.

In summary, none of the existing path load-balancing implementations concurrently utilize Active Optimized and Un-optimized paths of ALUA storage arrays for I/Os. VMware can provide a significant differentiated value by supporting PSP Adaptive as a native option for flash-based arrays.

3. Design and Implementation

Consider a highway and a local road, both leading to the same destination. When the traffic is bounded by a toll plaza at the destination, there is no point in routing traffic to the local road. However, if the toll plaza is removed, it starts to make sense to route part of the traffic to the local road during rush hours, because the contention point has shifted. The same reasoning applies to load balancing for ALUA storage arrays. When the array is the contention point, there is no point in routing I/O requests to the un-optimized paths: the latency is higher, and host-side I/O bandwidth is not the limiting factor. However, when a flash-based array is able to serve millions of IOPS, it is no longer the contention point, and the host-side I/O pipes can become the contention point under heavy I/O load. It then starts to make sense to route part of the I/O traffic to the un-optimized paths. Although the latency on the un-optimized paths is higher, when the optimized paths are saturated, using un-optimized paths can still boost aggregated system I/O performance with increased throughput and IOPS. This should only be done during "rush hours," when the I/O load is heavy and the optimized paths are saturated. PSP Adaptive implements this strategy.

3.1 Utilize Active/Non-optimized Paths for ALUA Systems

Figure 1 shows the high-level overview of multipathing in vSphere. NMP collapses all paths and presents only one logical device to the upper layers, which can be used to store virtual disks for VMs. When an I/O is issued to the device, NMP queries the path-selection plug-in PSP RR to select a path on which to issue the I/O. Internally, PSP RR uses a simple round-robin algorithm to select the path for ALUA systems. Figure 2(a) shows the default path-selection algorithm: I/Os are dispatched to all Active Optimized paths alternately, while Active Un-optimized paths are not used even if the Active Optimized paths are saturated. This wastes resources when the optimized paths are saturated.

To improve performance when the Active Optimized paths are saturated, we spread WRITE I/Os to the un-optimized
paths. Even though the latency will be higher compared to I/Os using the optimized paths, the aggregated system throughput and IOPS will be improved. Figure 2(b) illustrates the optimized path dispatching.
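
As an illustration of this dispatching, the following C sketch captures the decision: reads (and writes under normal load) keep round-robin affinity to the optimized paths, while writes during contention round-robin across all active paths. The data structure and function names are hypothetical; the real plug-in is written against VMware's proprietary PSA interfaces.

#include <stdbool.h>

/* Hypothetical per-device state; a real PSP plug-in uses VMware's PSA types. */
typedef struct {
    int  num_optimized;     /* count of Active Optimized paths      */
    int  num_unoptimized;   /* count of Active Un-optimized paths   */
    int  rr_optimized;      /* round-robin cursor over optimized    */
    int  rr_all;            /* round-robin cursor over all active   */
    bool spread_writes;     /* true while optimized paths saturated */
} adaptive_state_t;

/* Returns an index into the device's active-path list, where indices
 * [0, num_optimized) are optimized paths and the rest are un-optimized. */
static int select_path(adaptive_state_t *s, bool is_write)
{
    if (is_write && s->spread_writes && s->num_unoptimized > 0) {
        /* Writes during contention: round-robin across ALL active paths. */
        s->rr_all = (s->rr_all + 1) % (s->num_optimized + s->num_unoptimized);
        return s->rr_all;
    }
    /* Reads, and writes under normal load: keep affinity to optimized paths. */
    s->rr_optimized = (s->rr_optimized + 1) % s->num_optimized;
    return s->rr_optimized;
}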

3.2 Write Spread Only

For any active/active (including ALUA) dual-controller array, each controller has its own cache for both reads and writes. On write requests, the controllers need to synchronize their caches with each other to guarantee data integrity; for reads, no such synchronization is necessary. As a result, reads have affinity to a particular controller, while writes do not. We therefore assume that issuing writes on either controller is symmetric, while it is better to keep issuing reads to the same controller so that the cache hit rate stays higher.

Most workloads have many more reads than writes. However, with the increasing adoption of host-side flash caching, the actual I/Os hitting the storage controllers are expected to have a much higher write/read ratio: a large portion of the reads will be served from the host-side flash cache, while all writes still hit the array. In such cases, spreading writes to un-optimized paths will help lower the load on the optimized paths, and thereby boost system performance. Thus in our optimized plug-in, only writes are spread to un-optimized paths.

Figure 2. PSP Round Robin Policy

3.3 Spread Start and Stop Triggers

Because of the asymmetric performance between optimized and un-optimized paths, we should only spread I/O to the un-optimized paths when the optimized paths are saturated (i.e., there is I/O contention). Accurate I/O contention detection is therefore key. Another factor to consider is that the ALUA specification does not mandate implementation details, so arrays from different vendors can behave differently when serving I/O issued on the un-optimized paths. In our experiments, we found that at least one ALUA array shows unacceptable performance for I/Os issued on the un-optimized paths. PSP Adaptive must therefore be able to detect such behavior and stop routing WRITEs to the un-optimized paths if no I/O performance improvement is observed. The following sections describe the implementation details.

3.3.1 I/O Contention Detection

For I/O contention detection, PSP Adaptive applies the same technique that SIOC (Storage I/O Control) uses today: I/O latency thresholds. To avoid thrashing, two latency thresholds ta and tb (ta > tb) are used to trigger the start and stop of write spread to un-optimized paths. PSP Adaptive keeps monitoring the I/O latency of the optimized paths (to). If to exceeds ta, PSP Adaptive starts write spread; if to falls below tb, it stops write spread. As with SIOC, the actual values of ta and tb are set by the user and can differ across storage arrays.
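
A minimal C sketch of this two-threshold (hysteresis) trigger follows, assuming latency is tracked as a moving average in microseconds; the function and parameter names are hypothetical illustrations, not part of any VMware API.

#include <stdbool.h>

/* Start write spread when the optimized-path latency exceeds t_a; stop it
 * when the latency falls back below t_b (t_a > t_b). Latencies in between
 * leave the current mode unchanged, which is what prevents thrashing. */
static void update_spread_trigger(bool *spread_writes,
                                  unsigned long opt_latency_us,
                                  unsigned long t_a_us,  /* start threshold */
                                  unsigned long t_b_us)  /* stop threshold  */
{
    if (!*spread_writes && opt_latency_us > t_a_us)
        *spread_writes = true;    /* optimized paths saturated: start spreading */
    else if (*spread_writes && opt_latency_us < t_b_us)
        *spread_writes = false;   /* contention cleared: stop spreading */
}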

3.3.2 Max I/O Latency Threshold

As described earlier, ALUA implementations vary across storage vendors, and the I/O performance on the un-optimized paths of some ALUA arrays can be very poor. For such arrays, we should not spread I/O to the un-optimized paths. To handle such cases, we introduce a third threshold: a maximum I/O latency tc (tc > ta). Latency higher than this value is unacceptable to the user. PSP Adaptive monitors the I/O latency on the un-optimized paths (tuo) when write spread is turned on. If tuo exceeds tc, PSP Adaptive concludes that the un-optimized paths should not be used and stops write spread.
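
This cutoff can be expressed as one more guard alongside the trigger sketched above; again, the names here are illustrative assumptions only.

#include <stdbool.h>

/* If the observed un-optimized path latency exceeds the user-set maximum t_c
 * (t_c > t_a), the array serves un-optimized I/O too poorly and write spread
 * is abandoned. */
static void enforce_max_latency(bool *spread_writes,
                                unsigned long unopt_latency_us,
                                unsigned long t_c_us)
{
    if (*spread_writes && unopt_latency_us > t_c_us)
        *spread_writes = false;   /* un-optimized paths unusable on this array */
}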

A simple on/off switch is also provided as a configurable knob. If un-optimized paths should not be used at all, an administrator can simply turn the feature off through the esxcli command; PSP Adaptive then behaves the same as PSP RR and does not spread I/O to un-optimized paths.

3.3.3 I/O Performance Improvement Detection

We want to spread I/O to the un-optimized paths only if doing so improves aggregated system IOPS and/or throughput. PSP Adaptive therefore keeps monitoring the aggregated IOPS and throughput across all paths to the target. Detecting an I/O performance improvement is complicated, however, because the system load and I/O pattern (e.g., block size) can change at the same time; for this reason, I/O latency alone cannot tell us whether system performance has improved.

To handle this, we monitor and compare both IOPS and throughput. When the I/O latency on the optimized paths exceeds threshold ta, PSP Adaptive saves the current IOPS and throughput as reference values before it turns on write spread to the un-optimized paths. It then periodically checks whether the aggregated IOPS and/or throughput have improved by comparing them against the reference values. If they have not improved, it stops write spread; otherwise, no action is taken. To filter out noise, an improvement of at least 10% in either IOPS or throughput is required before performance is considered improved.

Overall system performance is considered improved if either of the two measures (IOPS or throughput) improves, because an I/O pattern change alone should not decrease both values simultaneously. For example, if the I/O block size goes down, aggregated throughput may drop, but IOPS should rise; if the system load goes up, both IOPS and throughput should go up with write spread. If both aggregated IOPS and throughput drop, PSP Adaptive concludes that the system load is going down.

If the system load goes down, the aggregated IOPS and throughput can go down as well and cause PSP Adaptive to stop write spread. This is fine, because a lower system load means I/O latency will improve; write spread will not be turned on again unless the I/O latency on the optimized paths exceeds ta again.
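
The following sketch illustrates this improvement check; how the aggregate counters are sampled, and all names used here, are assumptions for illustration rather than the actual plug-in code.

#include <stdbool.h>

#define IMPROVEMENT_MARGIN_PCT 10UL   /* minimum gain treated as real */

/* Aggregated IOPS and throughput recorded just before write spread started. */
typedef struct {
    unsigned long ref_iops;
    unsigned long ref_bytes_per_sec;
} spread_reference_t;

/* Returns true if either aggregate metric improved by at least 10% over the
 * reference. The caller keeps write spread on in that case and turns it off
 * otherwise (a drop in both metrics is attributed to falling system load). */
static bool spread_still_helps(const spread_reference_t *ref,
                               unsigned long cur_iops,
                               unsigned long cur_bytes_per_sec)
{
    unsigned long iops_floor = ref->ref_iops +
                               ref->ref_iops * IMPROVEMENT_MARGIN_PCT / 100;
    unsigned long tput_floor = ref->ref_bytes_per_sec +
                               ref->ref_bytes_per_sec * IMPROVEMENT_MARGIN_PCT / 100;

    return cur_iops >= iops_floor || cur_bytes_per_sec >= tput_floor;
}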

3.3.4 Impact on Other Hosts in the Same Cluster

In principle, one host utilizing the un-optimized paths could negatively affect other hosts connected to the same ALUA storage array. However, as explained in the earlier sections, the greatly improved I/O performance of new flash-based storage means the array is much less likely to become the contention point, even when multiple hosts are pumping heavy I/O load to the array simultaneously. The negative impact is therefore negligible, which our performance benchmarks also confirm.

4. Evaluation and Analysis

All performance numbers were collected on one ESXi server; the physical machine configuration is listed in Table 1. Iometer running inside Windows VMs was used to generate the workload. Since vFlash and VFC were not available at the time the prototype was built, we used an Iometer load of 50% random reads and 50% random writes to simulate the effect of host-side caching, which shifts the read/write ratio hitting the array toward writes.

CPU: 2 Intel Xeon, 8 cores, 16 logical CPUs
Memory: 96GB DRAM
HBA: Dual-port 8Gbps HBA
Storage Array: ALUA-enabled array with 16 LUNs on SSD, each LUN 100MB
FC Switch: 8Gbps FC switch

Table 1. Testbed Configuration

Figure 3 and Figure 4 compare the performance of PSP RR and PSP Adaptive when there is I/O contention on the HBA port. By spreading WRITEs to un-optimized paths during I/O contention, PSP Adaptive is able to increase aggregated system IOPS and throughput, at the cost of slightly higher average WRITE latency. The aggregated system throughput improvement also increases with increasing I/O block size.

Overall, the performance evaluation results show that PSP Adaptive can increase system aggregated throughput and IOPS during I/O contention, and is self-adaptive to workload changes. 

Figure 3. Throughput: PSP RR vs PSP Adaptive

Figure 4. IOPS: PSP RR vs PSP Adaptive

5. Conclusion and Future Work

With the rapid adoption of flash, it is important to revisit some of the fundamental building blocks of the vSphere stack. I/O multipathing is critical for scale and performance, and any improvements translate into a better end-user experience as well as higher VM density on ESXi. In this paper, we explored an approach that challenged the old multipathing wisdom that active un-optimized paths should not be used for load balancing. Our implementation showed that spreading writes across all paths has advantages during contention, but should be avoided under normal load. PSP Adaptive is the PSP plug-in we developed to adaptively switch load-balancing strategies based on system load.

Moving forward, we are working to further enhance the adaptive logic by introducing a path-scoring attribute that ranks paths based on I/O latency, bandwidth, and other factors. The score is used to decide whether a specific path should be used under different system I/O load conditions, and further, what percentage of I/O requests should be dispatched to a given path. We could also combine the path score with I/O priorities by introducing priority queuing within the PSA.

Another important storage trend is the emergence of active/active storage across metro distances; EMC's VPLEX [2, 4] is the leading solution in this space. Similar to ALUA, such active/active storage arrays expose asymmetry in service times, even across active optimized paths, due to unpredictable network latencies and varying numbers of intermediate hops. An adaptive multipath strategy could be useful for overall performance here as well.

References

  1. EMC PowerPath. http://www.emc.com/storage/powerpath/powerpath.htm
  2. EMC VPLEX. http://www.emc.com/storage/vplex/vplex.htm
  3. Multipath I/O. http://en.wikipedia.org/wiki/Multipath_I/O
  4. Implementing vSphere Metro Storage Cluster using EMC VPLEX. http://kb.vmware.com/kb/2007545
  5. Solaris SAN Configuration and Multipathing Guide, 2000. http://docs.oracle.com/cd/E19253-01/820-1931/820-1931.pdf
  6. FreeBSD disk multipath control. http://www.freebsd.org/cgi/man.cgi?query=gmultipath&apropos=0&sektion=0&manpath=FreeBSD+7.0-RELEASE&format=html
  7. Nimbus Data Unveils High-Performance Gemini Flash Arrays, 2012. http://www.crn.com/news/storage/240005857/nimbus-data-unveils-high-performance-gemini-flash-arrays-with-10-year-warranty.htm
  8. Proximal Data’s Caching Solutions Increase Virtual Machine Density in Virtualized Environments, 2012. http://www.businesswire.com/news/home/20120821005923/en/Proximal-Data%E2%80%99s-Caching-Solutions-Increase-Virtual-Machine
  9. DM Multipath, 2012. https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html-single/DM_Multipath/index.html
  10. EMC vfcache, Server Flash Cache, 2012 http://www.emc.com/about/news/press/2012/20120206-01.htm
  11. Flash and Virtualization Take Center Stage at SNW for Avere., 2012. http://www.averesystems.com/
  12. Multipathing policies in ESXi 4.x and ESXi 5.x, 2012. http://kb.vmware.com/kb/1011340
  13. No-Compromise Storage for the Modern Datacenter, 2012. http://info.nimblestorage.com/rs/nimblestorage/images/Nimble_Storage_Overview_White_Paper.pdf
  14. Pure Storage: 100% flash storage array: Less than the cost of spinning disk, 2012. http://www.purestorage.com/flash-array/
  15. Symantec Veritas Dynamic Multi-Pathing, 2012. http://www.symantec.com/docs/DOC5811
  16. Violin memory, 2012. http://www.violin-memory.com/
  17. VMware vFlash framework, 2012. http://blogs.vmware.com/vsphere/2012/12/virtual-flash-vflash-tech-preview.html
  18. Who is WHIPTAIL?, 2012, http://whiptail.com/papers/who-is-whiptail/
  19. Windows Multipath I/O, 2012. http://technet.microsoft.com/en-us/library/cc725907.aspx.
  20. XtremIO, 2012. http://www.xtremio.com/
  21. NetApp Quietly Absorbs CacheIQ, Nov, 2012, http://www.networkcomputing.com/storage-networkingmanagement/netapp-quietly-absorbs-cacheiq/240142457
  22. SolidFire Reveals New Arrays for White Hot Flash Market, Nov. 2012. http://siliconangle.com/blog/2012/11/13/solidfire-reveals-new-arrays-for-white-hot-flash-market/
  23. Gridiron Intros New Flash Storage Appliance, Hybrid Flash Array, Oct. 2012. http://www.crn.com/news/storage/240008223/gridiron-introsnew-flash-storage-appliance-hybrid-flash-array.htm
  24. BYAN, S., LENTINI, J., MADAN, A., AND PABON, L. Mercury: Host-side Flash Caching for the Data Center. In MSST (2012), IEEE, pp. 1–12.
  25. EMC ALUA System. http://www.emc.com/collateral/hardware/white-papers/h2890-emc-clariion-asymm-active-wp.pdf
  26. UEDA, K., NOMURA, J., AND CHRISTIE, M. Request-based Device-mapper multipath and Dynamic load balancing. In Proceedings of the Linux Symposium (2007).
  27. LUO, J., SHU, J.-W., AND XUE, W. Design and Implementation of an Efficient Multipath for a SAN Environment. In Proceedings of the 2005 international conference on Parallel and Distributed Processing and Applications (Berlin, Heidelberg, 2005), ISPA’05, Springer-Verlag, pp. 101–110.
  28. ANDERSON, M., AND MANSFIELD, P. SCSI Mid-Level Multipath. In Proceedings of the Linux Symposium (2003).
  29. UEDA, K. Request-based dm-multipath, 2008. http://lwn.net/Articles/274292/.

Footnotes

  1. Li Zhou was a VMware employee when working on this project.