Application Aware Storage Platform

Samdeep Nayak
VMware, Inc.
samdeep@vmware.com

Alan Shih
VMware, Inc.
ccshih@vmware.com

Sanjay Acharya
VMware, Inc.
sacharya@vmware.com

Sandeep Sebe
VMware, Inc.
sebes@vmware.com

Abstract

The increasing adoption of virtualization in data centers has enabled simpler deployment and management of applications. Applications running on virtual machines (VMs) typically use emulated interfaces for networking and storage. These emulated interfaces are realized in the guest operating system (GOS) in concert with the hypervisor and meet most application requirements. Recent advancements in storage hardware include differentiated data services [1] [8] [9] such as data integrity, quality of service (QoS), and security. Some cloud-scale applications have already been modified to take advantage of the capabilities exposed by the underlying hardware to prevent silent data corruption and to gain QoS benefits. In addition, changes are being proposed to existing standards to support I/O hints or tags.

These are indications of a growing momentum toward building infrastructure that is in line with applications. This approach creates a few unique challenges. For a start, it has necessitated enhancement of the virtualization platform. In this paper we discuss an I/O hinting and tagging scheme that is designed to work with new infrastructure hardware capabilities. These capabilities are exposed as a way to differentiate data services for virtualized applications in the GOS. The scheme uses a unique combination of in-band and out-of-band schemes to determine end-to-end capabilities, maps the capabilities to application needs, and coordinates hints to enforce service-level agreements (SLAs) for the I/O. We look into some real-world applications as the key use cases for our approach and outline the prototype results in an effort to validate our base assumptions.

1. Introduction

Consider a situation in which a user is working on completing a key project but sees a sudden drop in performance when accessing storage. It might be due to the fact that scheduled maintenance (e.g., a backup job) has kicked in and is contending for a set of resources. At this time, the user might be better served if she could indicate the relative priority of her application so that it takes precedence over the scheduled maintenance.

A related situation is one in which you could be building a cloud-scale application that leverages the newer advances in hardware for data services. Currently, there are no standard interfaces to exploit these capabilities. Applications are left to use a vendor-specific interface—in effect, developing plug-ins for each vendor. We propose to build a platform that extends the current I/O interface while permitting hardware vendors to bring their own value-adds.

Creating a framework that tries to match hardware capabilities to application needs is not a trivial task. Virtualization makes it even tougher, given the need to multiplex various VM requirements on the same physical hardware while still meeting the larger application requirements in the data center. In this paper we detail the Application Aware Storage Platform (AASP), which addresses the various areas detailed earlier. It uses the notion of I/O hinting/tagging to help differentiate the traffic for the application services. Some of the design challenges that the AASP has to address include

  • End-to-end storage path – Modern data centers consist of hosts, fabric, and storage. Because each device is unique in its capabilities, there is no easy way to match up end-to-end capabilities. To add to the mix, there is no universal standard for managing all the components.
  • Migration challenges in virtualized data centers – The end-to-end (host-to-storage) configurations are not static. VM migration (e.g., VMware vSphere® vMotion®) and storage migration (e.g., VMware vSphere® Storage vMotion®) are often used to balance resource consumption. The impact on the AASP is that it needs to insert itself into events such as vSphere vMotion and vSphere Storage vMotion activities to move the load to compliant infrastructure.
  • Management at scale – Modern data centers consist of hundreds of thousands of servers, an equal number of switch ports and host bus adapters (HBAs), and several hundreds of storage arrays. Based on the capability and status of these nodes, workloads might need to be moved around the data center to maintain the application’s SLA. This entails management of these entities at scale.

The rest of the paper is organized as follows. Section 2 provides an overview of the AASP. Sections 3 and 4 cover key use cases and their respective prototype results. The conclusion and future work are summarized in Sections 5 and 6.

2. Solution Overview

Figure 1 shows the position of the AASP within the hypervisor stack and an end-to-end view of its touch points. The next-generation application works through a hypervisor-based filter driver to request the services of the AASP. The AASP matches the request to the end-to-end capabilities of the combination of HBA, storage network fabric, and storage. The AASP learns the capabilities of the fabric from what the fabric advertises.

Figure 1. Application Aware Storage Platform.

Figure 2. AASP Building Blocks and Interfaces.

Figure 2 shows the key component functions of the AASP: capability discovery, policy management, and policy enforcement. It interfaces with the various components on the storage I/O end-to-end path, such as the hypervisor (VMware® ESXi™), the storage networking fabric, and the actual storage array. The following sections describe the components in greater detail.

2.1 Capability Discovery
This module discovers the capabilities of the underlying hardware that is used for data storage—namely, the capabilities of the HBA, fabric, and array. In-band extensions are used to query the adapter capability, while out-of-band schemes are used to identify the capabilities of the fabric and the storage. Discovery is typically triggered at initialization time or upon detection of any changes in topology, configuration, or capability. Topology changes can happen, for example, when a VMware ESX® host shuts down or reboots, whereas configuration changes can occur when new storage containers are added. An example of a capability change is when an HBA needs to operate in a “degraded mode” whereby no QoS functionality is possible. The discovered capabilities are maintained in a central repository on the management server.
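
As an illustration of this discovery lifecycle, the following minimal sketch (in Python) models a capability record and a change-triggered refresh of the central repository. The names NodeCapability and CapabilityRepository, the capability flags, and the event handling are hypothetical and do not correspond to AASP or vSphere interfaces.

    # Minimal sketch of the capability repository described above.
    # All names (NodeCapability, CapabilityRepository, flags) are hypothetical.
    from dataclasses import dataclass, field

    @dataclass
    class NodeCapability:
        node_id: str                 # HBA WWN, switch port, or array target port
        node_type: str               # "hba" | "fabric" | "array"
        capabilities: set = field(default_factory=set)  # e.g. {"qos", "dif"}

    class CapabilityRepository:
        """Central repository kept on the management server."""
        def __init__(self):
            self._records = {}

        def refresh(self, node: NodeCapability):
            self._records[node.node_id] = node

        def on_change_event(self, event_type: str, node: NodeCapability):
            # Topology, configuration, and capability changes all trigger a refresh.
            # A degraded HBA, for example, re-advertises itself without "qos".
            self.refresh(node)

        def capabilities_of(self, node_id: str) -> set:
            rec = self._records.get(node_id)
            return rec.capabilities if rec else set()

    # Example: an HBA dropping into degraded mode loses its QoS capability.
    repo = CapabilityRepository()
    repo.refresh(NodeCapability("hba-1", "hba", {"qos", "dif"}))
    repo.on_change_event("capability", NodeCapability("hba-1", "hba", {"dif"}))
    print(repo.capabilities_of("hba-1"))   # {'dif'}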

Figure 3. Capability Discovery in AASP.

Figure 3 summarizes the mechanism used to determine end-to-end topology. The discovery module queries VMware® vCenter Server™ through the vCenter Server API to identify all of the hosts, HBAs, and data stores that are attached to it. The storage-configuration and capability data are acquired in an out-of-band fashion using the vSphere API for Storage Awareness/Systems Management Interface-Storage (VASA/SMI-S). For fabric topology discovery, CIM [7] is the management interface used for Fibre Channel (FC) and Fibre Channel over Ethernet (FCoE). For Ethernet fabrics, Link Layer Discovery Protocol (LLDP) is used to discover the Ethernet topology.
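
The sketch below illustrates how the separately discovered pieces could be stitched into an end-to-end path (host, HBA, switch port, array port, data store). The fetch_* helpers stand in for the vCenter Server API, CIM/LLDP, and VASA/SMI-S queries; their names and return values are hypothetical.

    # Sketch of stitching an end-to-end path (host -> HBA -> fabric -> array)
    # from the separate discovery sources. The fetch_* helpers are hypothetical
    # stand-ins for the vCenter Server API, CIM/LLDP, and VASA/SMI-S queries.
    def fetch_hosts_and_hbas():
        # Would come from the vCenter Server API.
        return {"esx-01": ["hba-1"]}

    def fetch_fabric_links():
        # Would come from CIM (FC/FCoE) or LLDP (Ethernet).
        return {"hba-1": "switch-port-7", "switch-port-7": "array-port-a"}

    def fetch_array_ports():
        # Would come from VASA/SMI-S.
        return {"array-port-a": "datastore-ssd-01"}

    def build_paths():
        paths = []
        fabric = fetch_fabric_links()
        arrays = fetch_array_ports()
        for host, hbas in fetch_hosts_and_hbas().items():
            for hba in hbas:
                switch_port = fabric.get(hba)
                array_port = fabric.get(switch_port)
                datastore = arrays.get(array_port)
                if datastore:
                    paths.append((host, hba, switch_port, array_port, datastore))
        return paths

    print(build_paths())
    # [('esx-01', 'hba-1', 'switch-port-7', 'array-port-a', 'datastore-ssd-01')]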

In addition to topology discovery, we need to obtain and exploit the capabilities of the HBA, fabric, and storage target. To enable broader adoption of the AASP mechanisms, we propose standardization of the I/O hints/tags on an end-to-end basis:

  • HBA API – There is a need to enhance interfaces to provide new capabilities that can be consumed in a standard way. For ESXi, the I/O device management (IODM) framework will be used to gather the capabilities initially. Capabilities such as QoS and data integrity can be advertised using this interface.
  • Fabric – For FC and FCoE fabrics, SMI-S [6] seems to provide an ideal interface to expose the new capabilities and program service requests. However, for Ethernet a combination of SNMP and LLDP TLV enhancement might have to be considered for exporting the capabilities.
  • Target – The VASA interface [5] might be the best way to gather the new capabilities and program any service requests.

2.2 Policy Management
Storage capabilities are often classified as data-at-rest or data-in-flight capabilities. The capabilities that are advertised by the storage array are typically referred to as data-at-rest capabilities. The end-to-end capabilities provided by the HBA in the host, the fabric, and the target port in the array are classified as data-in-flight capabilities.
This classification is based on the functionality, not on the physical attributes—that is, it is possible for a future filter driver inserted in the I/O path to advertise data-at-rest capabilities.

The data-at-rest and data-in-flight capabilities are computed based on the data gathered in the discovery stage. Data-at-rest capabilities are exposed as part of storage container attributes. Data-in-flight capabilities are exposed as part of virtual adapter capabilities to the guest. Based on the underlying infrastructure capabilities, a VM can be configured to have a mix of both of these capabilities. The policy information lives throughout the life of the VM and is used for policy enforcement and migration validation. To realize this, the policy information is stored in a central management station.
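
A minimal sketch of this computation, assuming the capability sets gathered during discovery: data-in-flight services are usable only if every hop on the path supports them, while data-at-rest services come from the storage container. The function and capability names are illustrative, not part of the AASP interfaces.

    # Sketch of deriving per-VM capabilities from discovery data. The split
    # between data-at-rest (container) and data-in-flight (path intersection)
    # follows the classification above; the literal names are illustrative.
    def data_in_flight(hba_caps, fabric_caps, target_port_caps):
        # A data-in-flight service is usable only if every hop supports it.
        return hba_caps & fabric_caps & target_port_caps

    def data_at_rest(container_caps):
        return set(container_caps)

    def vm_policy(requested, container_caps, hba_caps, fabric_caps, target_caps):
        offered = data_at_rest(container_caps) | data_in_flight(
            hba_caps, fabric_caps, target_caps)
        unmet = set(requested) - offered
        if unmet:
            raise ValueError(f"cannot satisfy requested services: {unmet}")
        return {"granted": set(requested)}

    policy = vm_policy(
        requested={"dif", "qos"},
        container_caps={"dif"},                  # data-at-rest
        hba_caps={"dif", "qos"},
        fabric_caps={"qos"},
        target_caps={"dif", "qos"},
    )
    print(policy)   # {'granted': {'dif', 'qos'}} (set ordering may vary)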

After these capabilities are determined, it is possible to enable data integrity support for the applications on a VM. A typical workflow to enable this is as follows:

  • As part of VM creation, the administrator chooses a data store that offers data integrity service. Because this is more of a data-at-rest capability, the I/O is not protected against data-in-flight corruptions.
  • The administrator can then choose a virtual SCSI adapter that offers data integrity in flight service.
  • Based on the underlying hardware capability, appropriate initial placement decisions can be taken at this time.
  • At the end of VM creation, any identifier and service-requirement tags that might be required for enforcing policy are shared with the HBA, fabric, and array. For example, if the user intends to have a certain QoS for a particular application in a VM, the VM identifier, application identifier, and desired service level will be shared with the HBA, fabric, and array using the interface extensions discussed in the previous section (a minimal sketch of this step appears after the list).

  • The policy data is stored as part of VM metadata in a central management station.
  • Now the VM as a container is ready to accept I/O requests from applications with I/O integrity tags.
  • Assume that we are running an Oracle Database in a Linux VM. The paravirtual SCSI driver in the guest registers itself with DIF/DIX capabilities to the Linux SCSI midlayer using the SCSI host template. The SCSI midlayer in the guest issues SCSI inquiry and read capacity commands [2] that are returned with appropriate responses by the virtual disk, based on the policy configured in the initial steps.
  • Applications such as the Oracle Database can now send in I/O requests with data integrity application tags, creating an effective physical-to-virtual (P2V) migration path for this application.
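
The following sketch illustrates the tag-sharing step referenced in the workflow above: the same identifier and service-requirement record is pushed to the HBA, fabric, and array so that each can recognize the VM's tagged I/O later in the data path. The ServiceRequest structure and the push endpoints are hypothetical stand-ins for the interface extensions discussed in the previous section.

    # Minimal sketch of the tag-sharing step in the workflow above.
    # ServiceRequest and the endpoint callables are hypothetical stand-ins for
    # the HBA, fabric (SMI-S), and array (VASA) interface extensions.
    from dataclasses import dataclass

    @dataclass
    class ServiceRequest:
        vm_id: str
        app_id: str
        service: str      # e.g. "qos:high" or "data-integrity"

    def push_policy(req: ServiceRequest, endpoints):
        # Each endpoint (HBA, fabric, array) receives the same identifier/tag
        # mapping so it can recognize the VM's I/O later in the data path.
        for push in endpoints:
            push(req)

    hba_table, fabric_table, array_table = [], [], []
    push_policy(
        ServiceRequest(vm_id="vm-42", app_id="oracle-db", service="data-integrity"),
        endpoints=[hba_table.append, fabric_table.append, array_table.append],
    )
    print(len(hba_table), len(fabric_table), len(array_table))   # 1 1 1
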
Figure 4. I/O Hinting.

Figure 4 shows how end-to-end capabilities are assimilated and presented to applications. During the VM migration (vSphere vMotion) or storage migration (vSphere Storage vMotion), appropriate checks are made before the migration to ensure that the destination platform is capable of meeting the policy requirements for the VM. Following this, identifier and service-requirement tags are shared with the destination HBA, fabric, and array. This will enable mobility for applications with I/O hints and tags in the data center.
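
A minimal sketch of the pre-migration check, assuming capability and policy sets as in the earlier examples: the migration proceeds only if the destination's end-to-end capabilities cover the VM's policy, after which the tags are re-shared with the destination infrastructure. All names here are illustrative.

    # Sketch of the pre-migration check described above. Names are illustrative,
    # not vSphere vMotion or Storage vMotion interfaces.
    def migration_allowed(vm_policy: set, destination_caps: set) -> bool:
        return vm_policy <= destination_caps

    def migrate(vm_id, vm_policy, destination_caps, share_tags):
        if not migration_allowed(vm_policy, destination_caps):
            raise RuntimeError(
                f"{vm_id}: destination cannot honor {vm_policy - destination_caps}")
        # Re-share identifier and service-requirement tags with the destination
        # HBA, fabric, and array before the switchover.
        share_tags(vm_id, vm_policy)

    migrate("vm-42", {"qos", "dif"}, {"qos", "dif", "encryption"},
            share_tags=lambda vm, policy: print(f"re-sharing {policy} for {vm}"))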

2.3 Policy Enforcement
I/O policy enforcement includes the critical data-path elements, namely I/O hint generation, network translation, and I/O processing and error handling.

2.3.1 I/O Hinting and Tagging
We group applications that generate hints and tags into three categories, described below.

2.3.1.1 Cloud-Generation Applications
Cloud-generation applications generate application-specific hints or tags to the underlying infrastructure. These applications today run on bare-metal environments but cannot be migrated to a virtual environment due to lack of hypervisor support. Popular databases with data integrity support fall into this bucket. Based on the underlying infrastructure capability, the AASP will honor and retain the tags as the I/O trickles down the infrastructure.

2.3.1.2 Legacy Applications
In our approach for legacy applications, we propose that the VMware GOS driver inject hints and tags on behalf of popular applications running in the GOS. While no changes to the applications are required for this approach, the scope of the hinting scheme is limited to the popular applications that are supported by VMware GOS drivers.
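
As an illustration, the sketch below shows how such a driver-side mapping might look: a recognized application is mapped to a hint that is attached to each outgoing request. The application list, hint names, and detection mechanism are hypothetical; they are not the VMware GOS driver implementation.

    # Sketch of hint injection for legacy applications: the guest driver maps a
    # recognized application to a hint and attaches it to each outgoing request.
    # The application list and detection method here are purely illustrative.
    HINTS_BY_APP = {
        "oracle": "data-integrity",
        "backup-agent": "qos:low",
    }

    def inject_hint(io_request: dict, issuing_app: str) -> dict:
        hint = HINTS_BY_APP.get(issuing_app)
        if hint:
            io_request = dict(io_request, hint=hint)  # unknown apps pass through untouched
        return io_request

    print(inject_hint({"lba": 2048, "len": 8}, "oracle"))
    # {'lba': 2048, 'len': 8, 'hint': 'data-integrity'}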

Figure 5. I/O Path Policy Enforcement.

2.3.1.3 VM-Specific Tags and Kernel Services
Without unique VM tag support, I/O streams are a complete black box to the fabric. This prevents fabrics from providing services such as isolation, SLA enforcement, and network replication at VM granularity. We propose changes to the storage stack to provide VM visibility in the fabric.

In addition, to support data services for kernel extensions, kernel services will need to be modified to provide appropriate hints. Because no changes to the GOS are required, this approach creates broad coverage for applications. However, given that this approach is furthest out from the application, only specific use cases can be targeted here.

2.3.2 Policy Enforcement and Error Handling
Whenever an I/O with an appropriate tag/hint is received, each of the nodes looks up the policy that is programmed by the policy manager and tries to enforce it (see Figure 5). For example, when an I/O with a data integrity request is received by the HBA, it looks up the policy configured for the VM and tries to provide the data integrity service. If no services are configured, the I/O is failed back to the initiating application. As the I/O trickles down the VMkernel, the multipath kernel module ensures that the I/O is routed through a path that supports the tag request.
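
The enforcement decision at a single node can be sketched as follows: look up the policy programmed for the VM, fail the I/O back if the requested service is not configured, and otherwise route it down a path that supports the tag. The structures below are illustrative and are not VMkernel or HBA interfaces.

    # Sketch of per-node enforcement with error handling and path selection.
    # Policy tables, tags, and path descriptors are illustrative only.
    class PolicyError(Exception):
        pass

    def enforce(io, policy_table, paths):
        policy = policy_table.get(io["vm_id"], set())
        if io["tag"] not in policy:
            # No service configured for this tag: fail the I/O back to the
            # initiating application.
            raise PolicyError(f"no '{io['tag']}' service configured for {io['vm_id']}")
        eligible = [p for p in paths if io["tag"] in p["caps"]]
        if not eligible:
            raise PolicyError(f"no path supports '{io['tag']}'")
        return eligible[0]    # route the I/O down a compliant path

    policy_table = {"vm-42": {"dif", "qos:high"}}
    paths = [{"name": "vmhba1:C0:T0:L1", "caps": {"dif"}},
             {"name": "vmhba2:C0:T0:L1", "caps": set()}]
    print(enforce({"vm_id": "vm-42", "tag": "dif"}, policy_table, paths)["name"])
    # vmhba1:C0:T0:L1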

2.3.3 Network Translation
While the applications, HBA, and storage arrays understand the tags in an I/O request, the switches operate at the network frame level, and it might be hard to extract the tags in the SCSI command without a significant performance penalty. To avoid the performance hit, the initiators (HBA) and the target ports append the identifiers and the tags in each of the frame headers. This helps the fabric to look up the tag associated with the identifier and service it appropriately.
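
A minimal sketch of this translation, assuming a hypothetical 4-byte header field: the initiator packs a VM identifier and a tag code into the frame header so that a switch can classify the frame without parsing the SCSI command. The field layout and tag codes are illustrative only and are not part of any frame format standard.

    # Sketch of frame-level translation: pack a VM identifier and tag code into
    # a small fixed header field. The 4-byte layout and tag codes are hypothetical.
    import struct

    TAGS = {"qos:high": 1, "qos:medium": 2, "qos:low": 3}

    def pack_header(vm_index: int, tag: str) -> bytes:
        # 3 bytes of VM identifier, 1 byte of tag code.
        return struct.pack(">I", (vm_index << 8) | TAGS[tag])

    def unpack_header(raw: bytes):
        value, = struct.unpack(">I", raw)
        tag_code = value & 0xFF
        vm_index = value >> 8
        tag = next(name for name, code in TAGS.items() if code == tag_code)
        return vm_index, tag

    hdr = pack_header(vm_index=42, tag="qos:high")
    print(unpack_header(hdr))   # (42, 'qos:high')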

3. Use Cases: SSD Consolidation Using QoS Hinting

The AASP would be of little value if it did not solve real-world problems. In this use case, we focus on a centrally located solid-state device (SSD) array, along with QoS hints, to obtain performance comparable to that of locally attached SSDs. We call our approach remote SSD to highlight that the SSDs sit in a central place in the fabric rather than being attached locally. The environment is shown in Figure 6.

There are several benefits to centrally located SSDs:

  • SSDs and nonvolatile memory (NVM) are still expensive, so a mechanism to share them among many applications is desirable. While a local SSD connection has a range of several meters, fabrics span kilometers. This helps achieve better utilization of the hardware.
  • SSDs have a finite lifespan and need to be replaced when they reach it. Instead of going to each server, administrators can now add or remove capacity on the fly.
  • It simplifies vSphere vMotion because no cache warming is required after migration.
  • The solution works seamlessly with blade deployments.
Figure 6. SSD Storage Consolidation.

In our solution, we worked with our ecosystem partners QLogic and Brocade to optimize the prototype for

  • VM awareness:
    • Make the end-to-end infrastructure aware of VM identifiers and per-flow service requests.
    • Use VM identity to provide end-to-end QoS.
  • Differentiation/isolation:
    • Independent queuing at initiator, target adapter, and fabric to allow VMs with different SLAs.
    • Minimize any head-of-line blocking.
  • Prioritization:
    • Strict Priority Queueing (and Weighted Round Robin) scheduling to allow prioritization of traffic between VMs with different SLAs (a scheduling sketch follows this list).
    • Prioritizing the I/O processing at the target.
  • End-to-end QoS:
    • Identification of VMs to allow end-to-end QoS between initiator and target, through the fabric.
    • Reliance on cut-through switching to reduce fabric latency.
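
The prioritization scheme in the list above can be sketched as strict priority between service levels combined with weighted round robin (WRR) among queues of the same level. The scheduler below is a simplified illustration; the queue levels, weights, and names are hypothetical and do not reflect the QLogic or Brocade implementations.

    # Simplified sketch of strict priority across levels with WRR within a level.
    # Levels, weights, and queue contents are illustrative only.
    from collections import deque

    class PriorityWrrScheduler:
        """Strict priority across levels; weighted round robin within a level."""
        def __init__(self):
            # level -> list of [weight, remaining_credits, queue]
            self.levels = {}

        def add_queue(self, level, weight):
            q = deque()
            self.levels.setdefault(level, []).append([weight, weight, q])
            return q

        def next_io(self):
            for level in sorted(self.levels):            # strict priority first
                queues = self.levels[level]
                if not any(q for _, _, q in queues):
                    continue
                for _ in range(2):                        # second pass after credit refill
                    for entry in queues:
                        weight, credits, q = entry
                        if q and credits > 0:
                            entry[1] -= 1                 # spend one WRR credit
                            return q.popleft()
                    for entry in queues:                  # refill credits each round
                        entry[1] = entry[0]
            return None

    sched = PriorityWrrScheduler()
    high = sched.add_queue(level=0, weight=4)
    low = sched.add_queue(level=1, weight=1)
    high.extend(["h1", "h2"]); low.append("l1")
    print([sched.next_io() for _ in range(4)])   # ['h1', 'h2', 'l1', None]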

Our goal was to first compare I/O performance between a local SSD and a remote SSD setup and then observe the performance benefits when I/O hinting is enabled for caching software. For the caching software, we relied on the VMware vSphere Flash Read Cache™ framework [3] to accelerate VM I/O. We modified Flash Read Cache to add a “high” QoS tag to each of the I/Os generated by the module. The results confirmed that it is feasible to place SSDs in a central location and obtain results comparable to those of a local SSD when end-to-end QoS is configured using the AASP. Details of the results are covered in the next section.

4. Prototype Work

The goal of the prototype is to demonstrate that an SSD device can be configured as shared storage to provide caching benefits with no significant loss of performance. The prototype will be used to gather performance data for the following setups:

  • Setup A (local SSD – LSSD) – This measures performance for local SSD storage. The setup uses one ESX VM for generating I/O to an SSD disk. QoS latency I/O hinting is disabled in this setup (see Figure 7).
Figure 7. Setup A – I/O to Local SSD (I/O Hinting Disabled).

  • Setup B (remote SSD – RSSD) – This measures performance for remote shared SSD storage. The setup uses one ESX VM for generating I/O to a remote SSD disk. QoS latency I/O hinting is disabled in this setup. An 8Gb FC transport interface is used between the initiator and target. We use an Ubuntu-based Linux SCSI target for exporting the storage devices.
Figure 8. Setup B – Remote SSD (I/O Hinting Disabled).

Figure 9. Setup C – Remote SSD with I/O Hinting and Congestion.

  • Setup C (remote SSD with contention – RSSDC) – This measures performance for remote shared SSD storage with and without I/O contention. The setup uses two ESX VMs for generating I/O. One VM has Flash Read Cache caching enabled with a “low” latency QoS hint. The other VM has a “medium” latency QoS hint enabled and issues I/O to magnetic disk storage with no caching. The second VM is used only to generate contention in the SAN path. An 8Gb FC transport interface is used between the initiator and target. A Dell Compellent SC8000 array is used on the target side to generate enough traffic for I/O contention within the SAN path.

With the use of shared SSD storage with congestion (setup C shown in Figure 9), we try to optimize the fabric elements to process the I/O as quickly as possible and reduce latency. Performance is measured by using various VM I/O block sizes in each host and then comparing the results. Note that the hosts and the SSD used in gathering the data are identical. With the current infrastructure, only WRITE performance data is captured. READ performance data will be captured as part of future work.

For the above setup, the hardware used is specified below:

  • Initiator and target hosts – HP ProLiant DL380p Gen8
  • SSD – HP 200GB 6G SAS SLC SFF
  • FC initiator HBA – QLogic BR1860 2-Port 16Gbps Gen5 FC [8]
  • FC target HBA – QLogic QL2672-CK 2-Port 16 Gbps Gen5 FC [9]
  • FC switch – Brocade 6505 24-port 16Gbps Gen 5 FC [10]
  • Storage array – Dell Compellent SC8000 array

4.1 Performance Charts
4.1.1 LSSD Versus RSSD (I/O Hinting Disabled)

Figure 10. Chart A. IOPS.

Performance reduction (%):
table-1
The charts above compare the performance data for setup A (local SSD) and setup B (remote SSD). The maximum performance reduction, about 10%, is seen at the 64K block size and has yet to be investigated. The average performance degradation is about 5.4% across the entire block range. Notably, there was no significant reduction in performance when a shared SSD was used.
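
For reference, the percentage figures in these tables follow the usual reduction formula, reduction = 100 × (LSSD − RSSD) / LSSD, averaged over the measured block sizes. The short sketch below shows the computation with made-up sample IOPS values; it does not reproduce the measured data.

    # Sketch of how the percentage figures in the tables are computed. The sample
    # IOPS numbers below are made up for illustration; they are not the measured data.
    def reduction_pct(baseline_iops, measured_iops):
        return 100.0 * (baseline_iops - measured_iops) / baseline_iops

    samples = {"4K": (50000, 47500), "64K": (20000, 18000)}   # hypothetical (LSSD, RSSD)
    per_block = {bs: reduction_pct(l, r) for bs, (l, r) in samples.items()}
    average = sum(per_block.values()) / len(per_block)
    print(per_block, round(average, 1))   # {'4K': 5.0, '64K': 10.0} 7.5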

4.1.2 RSSDC (Contention Enabled [C], With and Without I/O Hinting [IOH])

Figure 12. Chart C. IOPS.

Charts C and D compare performance data for a shared SSD setup when contention is involved. Chart E compares latencies for the setup.

Figure 13. Chart D. Throughput.

Figure 14. Chart E. Completion Time in Milliseconds.

Performance gain with I/O hinting (%):

table-2

The above chart shows that, with I/O hinting, there is a noticeable performance gain when contention is involved. There is an anomaly at the 1K block size that has yet to be investigated. The average performance gain is about 8% over the entire block range.

Latency improvement with I/O hinting (%):

table-3

With I/O hinting enabled, there is an improvement in latencies from block sizes of 4K and up. There is an average latency improvement of 7.20% over the entire block range, and the improvement grows with larger block sizes.

5. Future Work

The prototype reinforces the claim that, with I/O hinting, accessing SSD storage in a shared environment does not result in a significant performance deviation compared to a local SSD. However, in a shared environment other parameters are likely to affect performance.

For example, generating more I/O traffic from multiple VMs over multiple hosts is likely to introduce congestion in the SAN environment, and this is where I/O QoS hinting will be extremely useful. The hinting enables the SAN elements to make efficient use of resources and process low-latency I/Os faster as compared to regular ones. Future work will include performance measurement by generating traffic from multiple ESX hosts to a shared storage and comparing performance for the setup. In addition, the platform might have to be enhanced to check the feasibility of other services from the guest and their impact on overall performance.

6. Summary

The I/O hinting and tagging scheme described in this paper provides a mechanism for distinguishing one I/O request from another. Storage adapters and fabric elements now provide advanced features that add special capabilities during I/O processing.

It becomes imperative to make use of these special capabilities to provide services to the host operating system and guest applications on an as-needed basis. This approach will meet evolving application needs by making efficient use of next-generation storage infrastructure while improving overall manageability in the data center.

Acknowledgments

The authors would like to thank Chris Kukkonen, David Ward, Lee Kohle, Nhan Pham, Sathish Gnanasekaran, Dennis Makishima, and Srikara Subramanyan from Brocade; Anil Gurumurthy, Chakri Kommuri, Manoj Wadekar, Praveen Midha, Srikanth Rayas, and Girish Basrur from QLogic; and T. Sridhar from VMware for providing input and helping us with the prototype work and collection of performance data.

References

1. Using the Brocade DCX Backbone to Transform a Brocade Storage Area Network into a Data Center Fabric.
2. SCSI Block Commands – 3 (SBC-3) – Rev 36, BSR Number: INCITS 514.
3. vSphere Flash Read Cache.
4. vSOM: A Framework for Virtual Machine–centric Analysis of End-to-End Storage I/O Operations.
5. vSphere 5.0 Storage Features Part 10 – VASA – vSphere Storage APIs – Storage Awareness.
6. Storage Management Initiative Specification (SMI-S).
7. Common Information Model.
8. QLogic BR-1860 Fabric Adapter.
9. QLogic 2600 Series Gen 5 16Gbps Fibre Channel-to-PCIe Adapters.
10. Brocade 6505 Switch.