vSOM: A Framework for Virtual Machine-centric Analysis of End-to-End Storage IO Operations

Sandeep Uttamchandani
VMware, Inc.
suttamchandani@vmware.com

Wenhua Liu
VMware, Inc.
liuw@vmware.com

Samdeep Nayak
VMware, Inc.
samdeep@vmware.com

Abstract

Diagnosis of an I/O performance slowdown is a complex problem. The root cause could be one of a plethora of event combinations, such as a VMware® ESXi misconfiguration, an overloaded fabric switch, disk failures on the storage arrays, and so on. As a virtualization administrator, diagnosing the end-to-end I/O path today requires working with discrete fabric and storage reporting tools, and manually correlating component statistics to find performance abnormalities and root-cause events. To address this pain point, especially for cloud-scale deployments, we developed the VMware SAN Operations Manager (vSOM), a framework for end-to-end monitoring, correlation, and analysis of storage I/O paths. It aggregates events, statistics, and configuration details across the ESXi server, host bus adapters (HBAs), fabric switches, and storage arrays. The correlated monitoring data is analyzed continuously, and alerts are generated for administrators. The current version invokes simple remediation steps, such as link and device resets, to potentially fix link errors, frame drops, I/O hangs, and so on. vSOM is designed to be leveraged by advanced analytical tools. One example described in this paper, VMware® vCenter™ Operations Manager™, uses vSOM data to provide end-to-end virtual machine-centric health analytics.

1. Introduction

Consider an application administrator responding to multiple problem tickets: “The enterprise e-mail service has a 40-60 percent higher response time compared to its average response time over the last month.” Because the e-mail service is virtualized, the administrator starts by analyzing virtual machines, manually mapping the storage paths and associated logical and physical devices. Today, there is no single tool to assist with the complete end-to-end diagnosis of the Storage Area Network (SAN), starting with the virtual machine, through the HBAs, switches, storage array ports and disks (Figure 1). In large enterprises, the problem is aggravated further with specialized storage administrators performing isolated diagnosis at the storage layer: “The disks look fine; I/O rate for e-mail related volumes has increased, and the response time seems within normal bounds.” This to-and-fro between application and storage administrators can take weeks to resolve. The real root cause could be accidental rezoning of the server ports, combined with a change in the storage array configuration that labeled those ports as “low priority traffic.”


Figure 1: Illustration of the Real-world End-to-end I/O Path in a Virtualization Environment

In summary, end-to-end analysis in a SAN environment is a complex task, even in physical environments. Virtualization makes the analysis even more complex, given the multiplexing of virtual machines on the same physical resources.

This paper describes vSOM, the VMware SAN Operations Manager, a general-purpose framework for heterogeneous, cloud-scale SAN deployments. vSOM provides a virtual machine-centric framework for end-to-end monitoring, correlation, and analysis across the I/O path, including the SAN components, namely HBAs, switches, and storage arrays. vSOM is designed to provide correlated monitoring statistics to radically simplify diagnosis, troubleshooting, provisioning, planning, and infrastructure optimization. In contrast, existing tools [1, 2, 3, 4, 5] monitor one or more components in isolation and fall short of covering the end-to-end stack. vSOM internally creates a correlation graph, mapping the logical and physical infrastructure elements, including virtual disks (VMDKs), HBA ports, switch ports, array ports, logical disks, and even physical disk details if exposed by the controller. The elements in the correlation graph are monitored continuously, aggregating both statistical metrics and events.

Developing a cloud-scale end-to-end I/O analysis framework is nontrivial. The following are some of the design challenges that vSOM addresses:

  • Heterogeneity of fabric and storage components: There is no single standard that is universally supported for out-of-band management of fabric and storage components. SNIA’s Storage Management Initiative Specification (SMI-S) [7] has been adopted by only a subset of key vendors. The VMware vSphere® Storage APIs for Storage Awareness (VASA) [10] provide a uniform syntactic and semantic interface to query storage devices, but are not supported by fabric vendors.
  • Scalability of the monitoring framework: The current version of vSphere supports 512 virtual machines per host, 60 VMDKs per virtual machine, and 2,000 VMDKs per host. The ratio of VMDKs to physical storage LUNs is typically quite large. vSOM needs to monitor and analyze a relatively large corpus of monitored data to notify administrators of hardware saturation, anomalous behavior, correlated failure events, and so on (a rough estimate of this data volume is sketched after this list).
  • Continuous refinement for configuration changes: In a virtualized environment, the end-to-end configuration is not static. It evolves with events such as VMware vSphere vMotion®, where a virtual machine moves to a different server, its associated storage relocates, or both. The analysis of monitoring data needs to take into account the temporal nature of the configuration and appropriately correlate performance anomalies with configuration changes.
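As a rough, illustrative estimate (not a measured figure), the vSphere limits quoted above already imply millions of samples per host per day at the 30-60 second monitoring interval used by vSOM (Section 3):

```python
# Back-of-the-envelope estimate of per-host monitoring volume. The sampling
# interval and counters-per-sample values are illustrative assumptions.
VMDKS_PER_HOST = 2000           # vSphere limit quoted above
SAMPLE_INTERVAL_S = 30          # lower end of vSOM's 30-60 second interval
COUNTERS_PER_SAMPLE = 10        # assumed number of statistics per VMDK sample

samples_per_day = VMDKS_PER_HOST * (24 * 3600 // SAMPLE_INTERVAL_S)
counters_per_day = samples_per_day * COUNTERS_PER_SAMPLE

print(f"{samples_per_day:,} VMDK samples per host per day")     # 5,760,000
print(f"{counters_per_day:,} counter values per host per day")  # 57,600,000
```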

vSOM employs several interesting techniques, summarized as the key contributions of this paper:

  1. Discovery and monitoring of the end-to-end I/O path that consists of several heterogeneous fabric and storage components. vSOM stitches together statistical metrics and events using a mix of standards and proprietary APIs.
  2. Correlation of the configuration details is represented internally as a directed acyclic dependency graph. The edges in the graph have a weight, representing I/O traffic between the source and destination vertices. The graph is self-evolving and updated based on configuration events and I/O load changes.
  3. Analysis of statistics and events to provide basic guidance regarding the health of the virtual machine, based on the health of the I/O path components. Additionally, vSOM plugs into the richer analytics, planning, and trending capabilities of VMware vCenter Operations Manager.

The rest of the paper is organized as follows: Section 2 gives a bird’s-eye view of vSOM. Sections 3, 4, and 5 cover details of the monitoring, correlation, and analysis modules, respectively. The conclusion and future work are summarized in Section 6.

2. A Bird’s Eye View of vSOM

The objective of vSOM is to monitor, correlate, and analyze the health of the virtual machine, as a function of the SAN I/O components (HBA, switches, and storage array). This section describes the system model and a high-level overview of the vSOM architecture.

2.1 System Model


Figure 2: vSOM System Model

The overall system model is shown in Figure 2. The hypervisor abstracts physical storage LUNs (also referred to as datastores), and exports virtual disks (VMDKs) to virtual machines. The hypervisor supports a broad variety of storage protocols, such as Fibre Channel (FC), Internet SCSI (iSCSI), Fibre Channel over Ethernet (FCoE), the Network File System (NFS), and so on. A VMDK is used by the guest operating system either as a raw physical device abstraction (referred to as Raw Device Mapping, or RDM) or as a logical volume derived from a VMFS volume or an NFS mount point. RDMs are not a common use case, since they bypass the hypervisor I/O stack, and key hypervisor features such as vMotion and resource scheduling cannot be used with them. As illustrated in Figure 2, a typical end-to-end I/O path from the virtual machine to physical storage consists of Virtual machine > VMDK > HBA Port > Switch Port > Array Port > Array LUN > Physical device. The current version of vSOM supports block devices only; NFS volumes are not supported.

Multiple industry-wide efforts try to standardize the management of HBAs, switches, and storage arrays. The most popular and widely adopted standard is SNIA’s Storage Management Initiative Specification (SMI-S) [7]. The standard defines profiles, methods, functions, and interfaces for monitoring and configuring system components. It is widely adopted by switch and HBA vendors, with limited adoption by storage array vendors. The standard is built on the Common Information Model (CIM) [8], which defines the architecture to query and interface with system management components. In the context of CIM, a CIM Object Manager (CIMOM) implements the management interface, accessible locally or through a remote connection.

2.2 vSOM Overview

vSOM tracks the end-to-end I/O path and collects data across ESXi hosts, VMware vCenter™, fabrics, and storage arrays. Each component is monitored continuously to collect configuration details, performance statistics, and events. Data collected from the components is correlated to create a virtual machine-centric analysis spanning VMDKs, HBAs, switches, and storage arrays. The precision of the end-to-end correlation depends on the configuration. For a virtual machine with a raw mapped VMDK, there is a one-to-one mapping between the VMDK and the physical LUN, so the statistics gathered from the LUN and HBA paths can be attributed directly to the virtual machine. Conversely, for a virtual machine using VMDKs carved out of a VMFS volume, the statistics of the storage array LUN and HBA paths reflect the status of the set of virtual machines sharing the LUN.
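For shared VMFS volumes, one plausible way to apportion LUN-level statistics back to individual virtual machines is to weight them by each VMDK's observed traffic. The following sketch illustrates the idea; it is an assumption for exposition, not vSOM's actual implementation:

```python
# Illustrative sketch: apportion LUN-level throughput back to the virtual
# machines sharing a VMFS volume, weighted by per-VMDK traffic seen on the host.
def attribute_lun_stats(lun_throughput_kbps, vmdk_throughput_kbps):
    """vmdk_throughput_kbps: {vm_name: per-VMDK throughput measured on the host}."""
    total = sum(vmdk_throughput_kbps.values())
    if total == 0:
        return {vm: 0.0 for vm in vmdk_throughput_kbps}
    return {vm: lun_throughput_kbps * (t / total)
            for vm, t in vmdk_throughput_kbps.items()}

# Example: two virtual machines share a datastore backed by one LUN.
print(attribute_lun_stats(1000.0, {"VM1": 300.0, "VM2": 100.0}))
# {'VM1': 750.0, 'VM2': 250.0}
```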


Figure 3: High-level vSOM Architectural Modules

The vSOM architecture consists of three key building blocks: Monitoring, Correlation, and Analysis Modules (Figure 3).

  • Monitoring Module: The monitoring framework consists of Agents and a centralized Management Station. Agents collect statistics and events on the components. The initial component discovery uses a combination of vSphere configuration details from vCenter, combined with the Service Location Protocol (SLP)[11]. Agents collect monitoring data in real time, using a combination of SMI-S standards and vSphere APIs. The Management Station aggregates data collected by the individual component agents, and internally uses different protocols to connect with the host, switch, and storage array agents.
  • Correlation Module: The configuration details collected from the component agents are used to create an end-to-end dependency graph. The graph is updated continuously for events such as vMotion, Storage vMotion, Storage DRS, HA failover, and so on. An update to the I/O path configuration is tracked with a unique Configuration ID (CID). The historical statistical data from each component is persisted and tagged with the corresponding CID. This enables statistical anomaly detection and analysis of the effect of configuration changes.
  • Analysis Module: The monitored data is analyzed to determine the health of individual components. Statistical anomaly analysis of monitored data can be absolute or relative to other components. The current version of vSOM also supports rudimentary remediation actions that are triggered when an erroneous pattern is observed over a period of time, such as an increased number of loss-of-sync or parity errors. vSOM plugs into Operations Manager, and provides data for end-to-end virtual machine-centric analysis.

3. Monitoring Module

As mentioned earlier, the Monitoring module consists of Agents and the centralized Management Station. Agents implement different mechanisms to collect data: the ESXi agent uses CIM, fabric agents use SMI-S, and storage array agents use either SMI-S or the VMware vSphere APIs for Storage Awareness (VASA). Besides the agents, the Management Station also communicates with the vCenter Server to collect event details.

The monitoring details collected from Agents are represented internally as software objects. The schema of these objects leverages CIM-defined profiles wherever possible. The objects are persisted by the Management Station using a circular buffer implementation. The size of the circular buffer is configurable, and determines the amount of history retained. Component monitoring is near-real-time, with a monitoring interval of typically 30-60 seconds.
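A minimal sketch of such a circular buffer, assuming a deque-backed implementation (the actual data store is not described here), with each sample tagged by the Configuration ID introduced in Section 2.2:

```python
from collections import deque
import time

class ComponentHistory:
    """Fixed-size circular buffer of monitoring samples for one component.

    Illustrative sketch: the configurable capacity bounds the amount of
    history retained, and old samples are overwritten automatically.
    """
    def __init__(self, capacity=2880):          # e.g., one day at 30-second samples
        self.samples = deque(maxlen=capacity)

    def record(self, stats, config_id):
        # Tag each sample with the Configuration ID (CID) of the I/O path so
        # anomalies can later be correlated with configuration changes.
        self.samples.append({"ts": time.time(), "cid": config_id, "stats": stats})

hba_port = ComponentHistory()
hba_port.record({"read_iops": 1200, "write_iops": 300}, config_id="cid-0001")
```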

This section describes the steps involved in the monitoring bootstrapping process, as well as details of the internal software object representation. vSOM implements a specialized CIM-based agent for ESXi hosts, and this section covers the key implementation details.

3.1 Bootstrapping Process

Bootstrapping involves the discovery of entities associated with a given virtual machine. This is accomplished using a combination of vSphere configuration analysis and the Service Location Protocol (SLP). The steps involved in the discovery process are summarized as follows, with an illustrative sketch after the list:

  1. vSOM queries the vCenter Server to identify all hosts and datastores that host active virtual machines and VMDKs, respectively. Querying the vCenter Server acts as white-box knowledge, limiting the search space and enabling faster convergence. In contrast, black-box discovery of all devices within a vSphere cluster and SAN setup would be much slower to complete.
  2. For each VMDK, the ESXi CIM provider is queried to provide the logical device details.
  3. Using the logical device details, vCenter is queried to retrieve the associated storage port IDs at the host and storage array (commonly referred to as initiator and target ports). Each port is uniquely identified with a World Wide ID (WWID). At the end of this step, for each VMDK, the corresponding initiator and target port WWIDs are discovered. For VMDKs mapped on the same logical device (datastore), the initiator and target ports are the same.
  4. The following schemes are adopted to discover the fabric topology, depending on whether the storage protocol is FC, FCoE, or iSCSI.

a.   FC and FCoE fabrics implement CIM. vSOM uses SLP to discover the CIMOM for each switch, followed by validation of support for the SMI-S profile. If the SMI-S profile is supported by the CIMOM, vSOM queries the switch ports associated with the initiator and target WWIDs and identifies any inter-switch links that might be present between the host and the target.

b.   For iSCSI, vSOM uses fast traceroute to identify the fabric topology between the host and the target. vSOM queries the individual port details using the Simple Network Management Protocol (SNMP).
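The following runnable sketch ties steps 1-4 together. The query_* helpers are hypothetical stand-ins for the vCenter, CIM/SLP, and traceroute/SNMP queries described above; here they return canned data purely for illustration:

```python
# Hypothetical helpers standing in for the real vCenter, CIM/SLP, and SNMP queries.
def query_active_vmdks(vcenter):
    return {"esx-01": ["vm1/vm1.vmdk"]}                              # step 1

def query_logical_device(host, vmdk):
    return {"device": "naa.600a0b80example", "protocol": "FC"}       # step 2

def query_port_wwids(vcenter, device):
    return ("20:00:00:25:b5:aa:00:01", "50:06:01:60:3b:20:11:22")    # step 3

def discover_fabric(protocol, initiator, target):
    if protocol in ("FC", "FCoE"):
        return ["SW1:P18"]                 # step 4a: SLP discovery + SMI-S queries
    return ["hop-1", "hop-2"]              # step 4b: traceroute + SNMP port details

def discover_io_paths(vcenter="vcenter-01"):
    paths = []
    for host, vmdks in query_active_vmdks(vcenter).items():
        for vmdk in vmdks:
            dev = query_logical_device(host, vmdk)
            initiator, target = query_port_wwids(vcenter, dev)
            fabric = discover_fabric(dev["protocol"], initiator, target)
            paths.append({"host": host, "vmdk": vmdk, "initiator": initiator,
                          "fabric": fabric, "target": target})
    return paths

print(discover_io_paths())
```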

3.2 End-to-end I/O Monitoring

The end-to-end I/O path is represented as a combination of Host, Switch, and Array objects. Each object stores a combination of performance statistics and events.

3.2.1 Host Object

The Host object includes monitoring of the VMDKs, SCSI disks, and HBAs. The host object is exported as a CIM profile, accessed by the central Management Station. Instead of defining a new data model, the Host object follows the SMI-S Block Server Performance Subprofile [7]. The profile defines classes and methods for managing performance information, and was originally designed for storage arrays, virtualization engines, and volume managers. In designing the data model (Figure 4), we observed a one-to-one mapping between the abstractions on an ESXi server and those of a storage array: virtual machines within ESXi are equivalent to the hosts of a storage array, VMDKs to the LUNs exported by the array, SCSI devices on the host to the physical drives of the array, HBAs to its back-end ports, and datastores to its storage pools. In other words, the Block Server Performance Subprofile maps well to vSOM requirements. The Host CIM provider implements only the subset of the classes, associations, properties, and methods of the profile required for the vSOM monitoring framework. For the HBA object, the data collected includes generic SCSI performance data and transport-specific performance data.


Figure 4: Data Model for ESXi Host Object Derived from the SMI-S Block Server Performance Sub-profile

3.2.2 Switch Object

The Switch object consists of a collection of ports across one or more switches that are in the I/O path serving the virtual machine’s storage traffic. Major SAN switch vendors have implemented SMI-S compliant CIM providers, and vSOM uses these CIM providers as data collection agents. In the context of storage, switches typically use Fibre Channel or standard Ethernet. For each FC port, vSOM switch objects use the data model defined in CIM schema 2.24 (CIM_FCPortRateStatistics and CIM_FCPortStatistics). For SCSI traffic, only Class 3 service is used on Fibre Channel; as a result, the attributes in the CIM profile that refer to Class 1 or Class 2 are ignored. Performance data is collected and persisted only for the ports to which ESXi hosts or storage arrays are connected.
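As a hedged illustration of how a management station might pull these counters, the sketch below uses the open-source pywbem client to enumerate CIM_FCPortStatistics instances from a switch CIMOM. The URL, credentials, and namespace are assumptions; vendors differ in the namespaces they expose and in which counters they populate:

```python
import pywbem

# Sketch only: connection details and namespace are illustrative assumptions.
conn = pywbem.WBEMConnection("https://switch-1:5989",
                             creds=("monitor", "secret"),
                             default_namespace="root/cimv2")       # vendor-specific

for inst in conn.EnumerateInstances("CIM_FCPortStatistics"):
    # Print a few counters of interest; vendors may populate only a subset.
    print(inst["ElementName"],
          inst["BytesTransmitted"], inst["BytesReceived"],
          inst["CRCErrors"], inst["LossOfSyncCounter"], inst["LinkFailures"])
```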

3.2.3 Storage Array Object

The Storage Array is the final destination of the I/O, and is referred to as the Target. A typical storage array consists of array ports, controllers, and logical volumes. Additionally, storage arrays can export details on physical disks and back-end storage ports. In vSOM, two mechanisms are available to collect data from the storage array: VMware’s VASA profile or the CIM storage profile. The latter provides a limited set of attributes, generalizing the differentiated capabilities of the storage arrays. The Management Station connects with the Array agents using the WSDL-based protocol [9] for VASA or a Generic Storage Adapter for CIM. Note that for the purposes of vSOM, the existing VASA specifications have been extended with certain performance attributes.

3.3 Internals for Data Collection from Host

As a part of the vSOM initiative, the ESXi host has been extended with I/O Device Management (IODM). This module provides functionality to configure, monitor, and deliver Storage object I/O details to the vSOM Management Station. This subsection describes details of the IODM implementation within ESXi (Figure 5).


Figure 5: I/O Device Management (IODM) Implementation for ESXi

IODM implementation is split into two parts: an IODM Kernel Module and an IODM CIM Provider for the user world. The kernel module captures statistics and events about VMDKs, SCSI logical devices, SCSI paths, and HBA transport details. The kernel module also presents an asynchronous mechanism to deliver events to the user world.

The IODM CIM provider consists of two parts: an Upper layer and a Bottom layer. The Upper layer is the standard CIM interface, with both the intrinsic and extrinsic interfaces implemented. The intrinsic interface includes EnumInstances, EnumInstanceNames, and GetInstance; it is used to get I/O statistics and error information for virtual machines, devices, and HBAs. The extrinsic interface controls IODM behavior with functions such as starting and stopping data collection. CIM indication for events is also part of the Upper layer; it interacts with the IODM kernel module to get events and alerts.
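A hedged sketch of how the Management Station might drive this provider over CIM-XML with pywbem. The class name VMW_IODMStats and the method names StartCollection and StopCollection are hypothetical placeholders used only to show the intrinsic versus extrinsic call pattern:

```python
import pywbem

conn = pywbem.WBEMConnection("https://esxi-01:5989",
                             creds=("root", "secret"),
                             default_namespace="root/cimv2")

# Extrinsic interface: start data collection (hypothetical class and method names).
cls = pywbem.CIMClassName("VMW_IODMStats", namespace="root/cimv2")
conn.InvokeMethod("StartCollection", cls)

# Intrinsic interface: enumerate the collected per-VM / per-device / per-HBA statistics.
for inst in conn.EnumerateInstances("VMW_IODMStats"):
    print(inst.tomof())

conn.InvokeMethod("StopCollection", cls)
```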

4. Correlation Module

The Correlation Module maintains the dependency between the components in the I/O path. The dependency details are persisted as a directed acyclic graph. Vertices represent the I/O path components, and edges carry a correlation weight, as illustrated in Figure 6. The steps involved in discovering the correlation details between the components are similar to the bootstrapping process covered in Section 3.1.


Figure 6: Representation of the Correlation between I/O Path Components

Figure 6 shows the correlation details between VM1 and LUN 6. VM1 has VMDK 4 and VMDK 8 configured as storage volumes. These VMDKs are mapped to a logical SCSI device (Datastore 1). The Datastore is connected to HBA 1 and HBA 2 on ports P0 and P1, respectively. These ports are mapped to Switches 1 and 2 (SW1 and SW2) on ports 18 (P18) and 12 (P12), respectively. Finally, the switches connect to the Storage Array Target (T1 on Port 2), accessing LUN 6. With respect to correlation granularity, VMDK 4 and VMDK 8 merge into Datastore 1; the outgoing traffic from the Datastore can include other VMDKs as well (in addition to VMDK 4 and 8). The weight of an edge is normalized to a value between 0 and 1, and the sum of the outgoing edge weights from a vertex is typically 1. Note that this might not always be the case: for instance, the traffic from Target T1 Port 2 is mapped to other LUNs besides LUN 6, so the total of the outgoing edges from T1 P2 is shown in Figure 6 as 0.3.
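A minimal sketch of the Figure 6 graph as an adjacency map, with a helper that recomputes a vertex's outgoing edge weights from observed traffic (names and numbers are illustrative):

```python
# Vertices are I/O path components; edge weights are normalized traffic shares.
graph = {
    "VM1":        {"VMDK4": 0.6, "VMDK8": 0.4},
    "VMDK4":      {"Datastore1": 1.0},
    "VMDK8":      {"Datastore1": 1.0},
    "Datastore1": {"HBA1.P0": 0.5, "HBA2.P1": 0.5},
    "HBA1.P0":    {"SW1.P18": 1.0},
    "HBA2.P1":    {"SW2.P12": 1.0},
    "SW1.P18":    {"T1.P2": 1.0},
    "SW2.P12":    {"T1.P2": 1.0},
    "T1.P2":      {"LUN6": 0.3},   # remaining traffic from T1.P2 goes to other LUNs
}

def renormalize_edges(graph, vertex, traffic_kbps):
    """Recompute outgoing edge weights from observed per-edge traffic."""
    total = sum(traffic_kbps.values())
    graph[vertex] = {dst: (kbps / total if total else 0.0)
                     for dst, kbps in traffic_kbps.items()}

renormalize_edges(graph, "Datastore1", {"HBA1.P0": 750.0, "HBA2.P1": 250.0})
print(graph["Datastore1"])   # {'HBA1.P0': 0.75, 'HBA2.P1': 0.25}
```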

vSOM continuously monitors for configuration changes and updates the dependency graph. Each version of the configuration is tracked using a unique 32-byte Configuration ID. Maintaining the versions of the dependency graph helps with time-travel analysis of configuration changes and their corresponding impact on performance. As mentioned earlier, the performance statistics are tagged with the Configuration ID.

The dependency graph is updated in response to either configuration change events or updates to the edge weights in response to workload variations. Configuration change events such as vMotion and HA failover, among others, are tracked from vCenter, while CIM indications from individual components indicate the creation, deletion, and other operational status changes of switches, FC ports, and other components. The edge weights in the dependency graph are maintained as a moving average and are updated over longer time windows (3-6 hours).
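One way to realize the moving-average edge weights (an assumption, not necessarily the exact scheme used by vSOM) is exponential smoothing applied once per update window:

```python
# Exponential smoothing of an edge weight, applied once per 3-6 hour window.
def update_edge_weight(current_weight, observed_share, alpha=0.2):
    """observed_share: traffic share measured over the most recent window."""
    return (1 - alpha) * current_weight + alpha * observed_share

weight = 0.50
for share in (0.60, 0.70, 0.65):      # three successive update windows
    weight = update_edge_weight(weight, share)
print(round(weight, 3))               # gradually tracks the observed shares
```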

5. Analysis Module

The goal of the Analysis module is to use the monitoring and correlation details to provide an intuitive representation of virtual machine health as a function of the health of the individual components (host machine, HBAs, fabric, and storage arrays). The Analysis module categorizes the health of each component into green, yellow, orange, or red:

  • Green indicates normal, with the component behaving within expected thresholds
  • Yellow means attention is needed, primarily based on reported error events
  • Orange indicates an increasingly degrading condition, based on a combination of statistical analysis and error events
  • Red means I/O can no longer flow and virtual machines cannot operate

Based on the dependency graph, the health of individual components is rolled up at the virtual machine level (Figure 7).


Figure 7: Virtual Machine Health is a Cumulative Roll Up of Individual Components in the End-to-end Path

As shown, degradation in the Switch Port affects the virtual machine, given its high correlation weight for host-to-storage-array connectivity.
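A minimal, illustrative roll-up over the dependency graph: component health is encoded as a score, and the contribution of each downstream component is scaled by its correlation weight (the scoring scheme is an assumption for exposition):

```python
# green=0, yellow=1, orange=2, red=3; downstream impact is scaled by edge weight.
HEALTH = {"green": 0, "yellow": 1, "orange": 2, "red": 3}

def rollup(vertex, graph, component_health):
    own = HEALTH[component_health.get(vertex, "green")]
    downstream = [weight * rollup(dst, graph, component_health)
                  for dst, weight in graph.get(vertex, {}).items()]
    return max([own] + downstream)

graph = {"VM1": {"Datastore1": 1.0},
         "Datastore1": {"SW1.P18": 0.75, "SW2.P12": 0.25},
         "SW1.P18": {}, "SW2.P12": {}}
health = {"SW1.P18": "orange"}          # a degraded switch port on the hot path
print(rollup("VM1", graph, health))     # 1.5 -> between yellow and orange
```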

The health of a component can be deduced using different approaches. Traditionally, administrators are expected to define alert thresholds; for example, when capacity reaches 70 percent, the health of the disk is marked yellow or orange. Defining these thresholds is typically nontrivial and arbitrary. Further, given the scale of cloud systems, it is unrealistic for administrators to define such thresholds manually.

vSOM employs two different techniques to determine component health. The first approach is referred to as Absolute Anomaly Analysis: the history of monitored data is analyzed to determine whether the current component performance is anomalous. There are standard data mining techniques for anomaly detection; vSOM uses a K-means clustering approach that detects an anomaly and associates a weight with it, to help categorize the anomaly as yellow, orange, or red. The second approach is based on Relative Analysis. In this approach, peer components (such as ports on the same switch, or events on different ports of the same HBA) are analyzed to determine whether the observed anomalous behavior also appears on similar components. In large-scale deployments, Relative Analysis is an effective approach, especially when the available history of monitored data is not sufficient for Absolute Anomaly Analysis.
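A hedged sketch of the Absolute Anomaly Analysis idea using scikit-learn's K-means: cluster a component's history and score a new sample by its distance to the nearest centroid. The feature choice, cluster count, and color thresholds are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic history of (IOPS, latency_ms) samples for one component.
history = np.random.default_rng(0).normal(loc=[2000.0, 5.0], scale=[150.0, 0.4],
                                          size=(500, 2))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(history)

def anomaly_color(sample):
    dist = km.transform([sample]).min()     # distance to the nearest centroid
    if dist < 300:
        return "green"
    if dist < 600:
        return "yellow"
    if dist < 1200:
        return "orange"
    return "red"

print(anomaly_color([2100.0, 5.2]))    # close to normal behavior
print(anomaly_color([9000.0, 45.0]))   # far from every cluster -> red
```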

Analysis can help pinpoint the root cause of the problem and can be used to trigger automated remediation. Automated root-cause analysis is nontrivial, especially in large-scale, real-world deployments where the cause and effect might not always be on the same component, or the problem might be a result of multiple correlated events. vSOM implements a limited version of auto-remediation, using link or device resets. Complex remediation actions, such as changing the I/O path or invoking vMotion, are beyond the scope of the current version of vSOM.

Reset is an effective corrective action for a common set of link-level and device-level error patterns. Errors observed over a period of time, such as an increased number of loss-of-sync or parity errors, or failures of protocol handshaking, are commonly fixed with a link reset. A link reset can reinitialize the link and put I/O back on track. If several link resets do not fix the problem, the path can be disabled to trigger a failover to a backup path, if multiple paths are available.
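A hedged sketch of this escalation policy follows. reset_link() and disable_path() are hypothetical placeholders for the actual device and multipathing operations:

```python
MAX_ERRORS_PER_WINDOW = 50    # illustrative threshold on loss-of-sync/parity errors
MAX_RESETS = 3                # escalate to path failover after repeated resets

def remediate(path, error_counts, reset_link, disable_path):
    """error_counts: per-window error counts observed on the path."""
    resets = 0
    for count in error_counts:
        if count <= MAX_ERRORS_PER_WINDOW:
            resets = 0                      # healthy window: clear reset history
            continue
        if resets < MAX_RESETS:
            reset_link(path)                # reinitialize the link
            resets += 1
        else:
            disable_path(path)              # force failover to a backup path
            return "failover"
    return "ok"

log = []
print(remediate("vmhba1:C0:T1:L6", [10, 120, 130, 140, 150],
                reset_link=lambda p: log.append(("reset", p)),
                disable_path=lambda p: log.append(("disable", p))))
print(log)
```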

vSOM correlates events from components in the end-to-end path. This helps in determining the root cause of events such as link down, frame drops, and I/O or virtual machine hangs. For instance, link-down events are collected from the ESXi host by subscribing to CIM indications, helping to isolate whether the root cause lies on the ESXi host or on a specific HBA or switch port.

The vCenter Operations Manager [6] is an existing VMware product that provides rich analytical capabilities for managing performance and capacity in virtual and physical infrastructures. It provides analytics for performance troubleshooting, diagnosis, capacity planning, trending, and so on. Using advanced statistical analysis, Operations Manager associates a health score with resources such as compute, and continuously tracks the health to raise alerts for abnormal behavior, sometimes even before a problem exhibits any symptoms. vSOM plugs into Operations Manager using its standard adapter interface. vSOM complements the existing virtual machine- and datastore-level analysis of Operations Manager with details of the SAN components (HBAs, fabric, and storage arrays). Operations Manager stores historic statistical data in a specialized database and implements anomaly detection algorithms for historic data analysis. In addition to end-to-end monitoring and troubleshooting, Operations Manager can help with planning and optimization use cases, such as balancing workloads across controllers and switch ports.

6. Conclusion and Future Work

This paper describes an end-to-end SAN management framework implemented for vSphere. It addresses the pain points associated with monitoring I/O components from the viewpoint of virtual machine-centric performance. While problem diagnosis is the most intuitive use case, vSOM is applicable to other use cases, such as planning, single-pane-of-glass monitoring, load balancing, and more.

We plan to extend this work in several dimensions. Automated remediation has significant value for administrators. We plan to extend beyond the current reset action to support complex multistep actions. For root-cause analysis, we plan to combine our current black-box anomaly analysis with rule-based techniques, particularly to correlate error events across different components in the I/O path. Finally, we are exploring monitoring module extensions to include application-level statistics in the end-to-end I/O path.

References

  1. HP Systems Insight Manager Overview, http://h18013.www1.hp.com/products/servers/management/hpsim/index.html?jumpid=go/hpsim
  2. Dell OpenManage Systems Management, http://www.dell.com/content/topics/global.aspx/sitelets/solutions/management/en/openmanage?c=us&l=en&cs=555
  3. IBM Tivoli Storage Management Solutions, http://www-01.ibm.com/software/tivoli/solutions/storage/
  4. Shen, K., Zhong, M., and Li, C. I/O System Performance Debugging Using Model-driven Anomaly Characterization. In Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies (FAST) (2005), pp. 23–23.
  5. Pollack, K. T., and Uttamchandani, S. Genesis: A Scalable Self-Evolving Performance Management Framework for Storage Systems. In IEEE International Conference on Distributed Computing Systems (ICDCS) (2006), p. 33.
  6. VMware vCenter Operations Manager, http://www.vmware.com/support/pubs/vcops-pubs.html
  7. Storage Management Initiative Specification (SMI-S), http://www.snia.org/tech_activities/standards/curr_standards/smi
  8. Common Information Model, http://en.wikipedia.org/wiki/Common_Information_Model_(computing)
  9. Web Services Description Language (WSDL), http://www.w3.org/TR/wsdl
  10. vSphere Storage API for Storage Awareness (VASA), http://blogs.vmware.com/vsphere/2011/08/vsphere-50-storage-features-part-10-vasa-vsphere-storage-apis-storage-awareness.html
  11. Service Location Protocol, http://www.ietf.org/rfc/rfc2608.txt