Diagnosing an I/O performance slowdown is a complex problem. The root cause could be any of a large number of event combinations, such as a VMware® ESXi misconfiguration, an overloaded fabric switch, or disk failures on the storage arrays. Today, a virtualization administrator diagnosing the end-to-end I/O path must work with discrete fabric and storage reporting tools and manually correlate component statistics to find performance abnormalities and root-cause events. To address this pain point, especially for cloud-scale deployments, we developed the VMware SAN Operations Manager (vSOM), a framework for end-to-end monitoring, correlation, and analysis of storage I/O paths. It aggregates events, statistics, and configuration details across the ESXi server, host bus adapters (HBAs), fabric switches, and storage arrays. The correlated monitoring data is analyzed continuously, and alerts are generated for administrators. The current version invokes simple remediation steps, such as link and device resets, to attempt to fix problems such as link errors, frame drops, and I/O hangs. vSOM is designed to be leveraged by advanced analytical tools. One example described in this paper, VMware® vCenter™ Operations Manager™, uses vSOM data to provide end-to-end, virtual machine-centric health analytics.
Sandeep Uttamchandani, Wenhua Liu, Samdeep Nayak