Virtualized datacenters contain a wide variety of storage devices with different performance characteristics and feature sets. In addition, a single storage device is typically shared among different virtual machines (VMs), both for ease of VM mobility, better consolidation, and higher utilization, and to support features that rely on shared storage, such as VMware Fault Tolerance (FT) and VMware High Availability (HA). According to some estimates, managing storage over its lifetime costs far more than its initial procurement. It is therefore highly desirable to automate the provisioning and runtime management operations for storage devices in such environments.
In this paper, we present Storage DRS as our solution for automated storage management in a virtualized datacenter. Specifically, Storage DRS handles initial placement of virtual disks and runtime migration of disks among storage devices to balance space utilization and IO load in a unified manner, while respecting business constraints. We also present how Storage DRS handles various advanced features of storage arrays and different virtual disk types. Many of these advanced features make management more difficult by hiding details across different mapping layers in the storage stack. Finally, we present best practices for using Storage DRS and some lessons learned from initial customer deployments and feedback.
Categories and Subject Descriptors
C.4 [Performance of Systems]: Modeling techniques.
C.4 [Performance of Systems]: Measurement techniques.
C.4 [Performance of Systems]: Performance Attributes.
D.4.8 [Operating Systems]: Performance—Modeling and Prediction
D.4.8 [Operating Systems]: Performance—Measurements
D.4.8 [Operating Systems]: Performance—Operational analysis
General Terms
Algorithms, Management, Performance, Design, Experimentation.
Keywords
VM, Virtualization, Resource Management, Scheduling, Storage, Hosts, Load Balancing
1. Introduction
Virtualized infrastructures provide higher utilization of physical infrastructure (servers and storage) and more agile IT operations, thereby reducing both capital and operating expenses. Virtualization offers unprecedented control and extensibility over consumption of compute and storage resources, allowing both VMs and their associated virtual disks to be placed dynamically based on current load and migrated seamlessly around the physical infrastructure when needed. Unfortunately, all this sharing and consolidation comes at the cost of extra complexity. A diverse set of workloads that were typically deployed on isolated physical silos of infrastructure now share a set of heterogeneous storage devices with a variety of capabilities and advanced features. Such environments are also dynamic, as new devices, hardware upgrades, and other configuration changes are rolled out to expand capacity or to replace aging infrastructure.
In large datacenters, the cost of storage management captures the lion’s share of the overall management overhead. Studies indicate that over its lifetime, managing storage is four times more expensive than its initial procurement. The annualized total cost of storage for virtualized systems is often three times more than server hardware and seven times more than networking-related assets. Due to the inherent complexity and stateful nature of storage devices, storage administrators make most provisioning and deployment decisions in an ad hoc manner, trying to balance space utilization and IO performance. Administrators typically rely on rules of thumb, or on risky and time-consuming trial-and-error placements, to perform workload admission, resource balancing, and congestion management.
There are several desirable features that a storage management solution needs to provide, in order to help administrators with the automated management of storage devices:
- Initial placement: Find the right place for a new workload being provisioned while ensuring that all resources (storage space and IO) are utilized efficiently and in a balanced manner.
- Load balancing: Monitor the storage devices continuously to detect if free space is getting low on a device or if IO load is imbalanced across devices. In such cases, the solution should take remediation actions or recommend a resolution to the administrators.
- Constraint handling: Handle a myriad of hardware configuration details and enforce business constraints defined by the administrator, such as anti-affinity rules for high availability.
- Online data gathering: All of the above needs to be done in an online manner by collecting runtime stats and without requiring offline modeling of devices or workloads. The solution should automatically estimate available performance in a dynamic environment.
Toward automated storage management, VMware introduced Storage Distributed Resource Scheduler (Storage DRS), the first practical, automated storage management solution that provides all of the functionality mentioned above. Storage DRS consists of three key components: 1) a continuously updated storage performance and usage model to estimate performance and capacity usage growth; 2) a decision engine that uses these models to place and migrate storage workloads; and 3) a congestion management system that automatically detects overload conditions and reacts by throttling storage workloads.
In this paper, we describe the design and implementation of Storage DRS as part of a commercial product in VMware’s vSphere management software. We start with some background on various storage concepts and common storage architectures that are used in VMware-based deployments in Section 2. Then we present a deep dive into the algorithms for initial placement and load balancing that are used for generating storage recommendations while balancing multiple objectives such as space and IO utilization (Section 3). Along with the description of these features, we provide various use case scenarios and show how Storage DRS handles them.
In practice, arrays and virtual disks come with a wide variety of options that lead to different behavior in terms of their space and IO consumption. Handling of advanced array or virtual disk features and various business constraints is explained in Sections 4 and 5. As with any solution of this complexity, there are caveats and recommended ways to use Storage DRS. We highlight some of the best practices in deploying storage devices for Storage DRS and some of the key configuration settings in Section 6.
Finally, we present several lessons learned from real world deployments and outline future work in order to evolve Storage DRS to handle the next generation of storage devices and workloads (Section 7).
We envision Storage DRS as the beginning of the journey to provide software-defined storage, where one management layer is able to handle a diverse set of underlying storage devices with different capabilities and match workloads to the desired devices, while doing runtime remediation.
2. Background
Storage arrays form the backbone of any enterprise storage infrastructure. These arrays are connected to servers using either Fibre Channel-based SANs or Ethernet-based LANs.
These arrays provide a set of data management features in addition to providing a regular device interface to do IOs.
In virtualized environments, shared access to the storage devices is desirable because several features, such as live migration of VMs (vMotion), VMware HA, and VMware FT, depend on shared storage to function. Shared storage also helps in arbitrating locks and granting access to files to only one of the servers when several servers may be contending. Some features, such as Storage IO Control (SIOC), also use shared storage to communicate information between servers in order to control IO requests from different hosts.
2.1 Storage Devices
Two main interfaces are used to connect storage devices with the VMware ESX hypervisor: a block-based interface and a file-based interface. In the first case, the storage array exposes a storage device (also called a LUN) as a set of blocks on which one can do regular IO operations using SCSI commands. The ESX hypervisor installs a clustered file system called VMFS on that LUN. VMFS allows every ESX host to see the same file system, and all changes to it, in a consistent manner. In the second case, a storage device is exported as a mount point by an NFS server. The ESX hypervisor accesses the device using the NFS protocol. Currently ESX supports and implements the NFSv3 protocol.
In both cases, ESX creates the concept of a datastore and exposes it to the administrator as a management and provisioning entity. Typically, a datastore is backed by a single LUN or NFS mount point, but we also allow a VMFS file system to extend across two devices. Such a configuration is not supported by some of the features and is uncommon in customer deployments. We use the term datastore to denote a LUN or NFS mount point in this paper.
At the storage array, the storage controllers pool a set of underlying physical devices to create a volume or a RAID group on them. Different vendors use different terms, but the overall concept of creating a pool of underlying physical resources is pervasive. This pool is typically governed by similar properties in terms of reliability, fault handling, and performance sharing across the underlying devices. On top of this volume or RAID group, one can create a LUN, which is striped across the underlying devices. Finally, these LUNs are exposed as a block device or a mount point over NFS.
This virtualization of underlying devices is hidden from the hypervisor, and the exact performance and device-level characteristics are not known outside the array. Since there are multiple layers of mappings across the storage stack, from virtual disks all the way down to physical disks, it is often quite difficult to discern where exactly a given block is stored.
In order to manage these devices, a solution such as Storage DRS needs to infer the performance characteristics in an online manner.
Storage controllers leverage the mapping of LUN address space to the physical disk pool in many different ways to provide space savings and performance-enhancing functionality. One of the most widely used features is thin provisioning, where a LUN with a fixed reported capacity is backed by a much smaller address space from among the physical drives. For example, a 2 TB datastore can be backed by a physical drive pool totaling 500 GB. Storage controllers map only the used (i.e., written) blocks in the datastore to allocated space in the backing pool.
The capacity of the backing pool can be adjusted dynamically by adding more drives when the space demand on the datastore increases over time. The available free capacity in the backing pool is managed at the controller, and controllers provide dynamic events to the storage management fabric when a certain threshold in the usage of the backing pool is reached. These thresholds can be set individually at storage controllers. The dynamic block allocation in thin-provisioned datastores allows storage controllers to manage their physical drive pools in a much more flexible manner by shifting the available free capacity to where it is needed.
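As a rough illustration of the arithmetic behind thin provisioning, the sketch below computes the overcommit ratio for the 2 TB datastore backed by a 500 GB pool and checks a usage threshold on the backing pool. The function name `backing_pool_alert` and the 75% alert threshold are assumptions made for this example; actual thresholds are set per controller.

```python
def backing_pool_alert(used_gb, pool_gb, threshold=0.75):
    """Return True when written data reaches the alert fraction of the backing pool.

    The 0.75 default is a hypothetical value chosen for illustration.
    """
    return used_gb / pool_gb >= threshold


reported_gb = 2000   # capacity the LUN reports to the hypervisor (2 TB taken as 2000 GB)
pool_gb = 500        # physical capacity actually backing the LUN

# Ratio of reported capacity to physical capacity.
overcommit_ratio = reported_gb / pool_gb
```

With these numbers the datastore is overcommitted four to one, and an alert would fire once 375 GB of the pool has been written.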
In addition, storage controllers offer features such as compression and de-duplication, which compress data and store identical blocks only once. Common content, such as OS images, copies of read-only databases, and data that does not change frequently, can be compressed transparently to generate space savings. A de-duplicated datastore can hold more logical data than its reported capacity due to the transparent removal of identical blocks.
Storage controllers also offer integration with hypervisors to offload common hypervisor functions for higher performance. VM cloning, for example, can be performed efficiently by storage controllers using copy-on-write techniques: a VM clone can continue to use the blocks of the base disk until it writes new data, and new blocks are dynamically allocated during write operations. The storage controller keeps track of references to each individual block so that common blocks are kept around until the last VM using them is removed from the system. Clones created natively at the storage controllers are stored more efficiently as long as they remain under the same controller. Since independent storage arrays can’t share reference counts, all the data of a clone needs to be copied when a native clone is moved from one controller to another.
Storage controllers also take advantage of the mapping from logical to physical blocks to offer dynamic performance optimizations by managing the usage of fast storage devices.
Nonvolatile storage technologies such as solid-state disks (SSDs), flash memory, and battery backed RAM can be used to transparently absorb a large fraction of the IO load to reduce latency and increase throughput. Controllers remap blocks dynamically across multiple performance tiers for this purpose. This form of persistent caching or tiering allows fast devices to be used efficiently depending on the changing workload characteristics.
2.2 Virtual Disks
The datastores available to the hypervisor are used to store the virtual disks belonging to a VM and other files (e.g., configuration files, snapshots, log files, and swap files). The hypervisor controls access to physical storage by mapping IO commands issued by VMs to file read and write IOs on the underlying datastores. This extra mapping layer allows hypervisors to provide different kinds of virtual disks for increased flexibility.
In the simplest form, a virtual disk is represented as a file in the underlying datastore and all of its blocks are allocated. This is called a thick-provisioned disk. Since the hypervisor controls the block mapping, not all blocks in a virtual disk have to be allocated at once. For example, VMs can start using the virtual disks while the blocks are being allocated and initialized in the background. These are called lazy-zeroed disks and allow a large virtual disk to be immediately usable without waiting for all of its blocks to be allocated at the datastore.
VMFS also implements space-saving features that enable virtual disk data to be stored only when there is actual data written to the disk. This is similar to thin provisioning of LUNs at the array level. Since no physical space is allocated before a VM writes to its virtual disk, the space usage of a thin-provisioned disk starts small and increases over time as new data is written to the virtual disk. One can overcommit the datastore space by provisioning a substantially larger number of virtual disks on a datastore, as there is no need to allocate unwritten portions of each virtual disk. This is also known as space overcommitment. Figure 1 depicts the virtual disk types.
ESX also provides support for VM cloning where clones are created without duplicating the entire virtual disk of the base VM. Instead of a full replica, a delta disk is used to store the blocks modified by the clone VM. Any block that is not stored in the delta disk is retrieved from the base disk. These VMs are called linked clones, as they share base disks for the data that is not modified in the clone. This is commonly used to create a single base disk storing the OS image and share it across many VMs. Each VM keeps only its unique data blocks in a delta disk that contains instance-specific customizations.
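The read and write paths of a linked clone can be sketched as follows. This is a toy model for illustration only, not the ESX implementation; the `LinkedCloneDisk` class and the block-number dictionaries are assumptions made for the example.

```python
class LinkedCloneDisk:
    """Toy model of a linked clone: a private delta disk over a shared base disk."""

    def __init__(self, base_blocks):
        self.base = base_blocks   # shared mapping: block number -> data (never written here)
        self.delta = {}           # blocks written by this clone only

    def write(self, block_no, data):
        # Writes always land in the delta disk; the base disk is left untouched.
        self.delta[block_no] = data

    def read(self, block_no):
        # Prefer the clone's own copy; otherwise fall back to the base disk.
        if block_no in self.delta:
            return self.delta[block_no]
        return self.base.get(block_no)


base = {0: b"os-image", 1: b"libs"}
clone_a = LinkedCloneDisk(base)
clone_b = LinkedCloneDisk(base)
clone_a.write(1, b"custom")
```

After the write, `clone_a` sees its own version of block 1 while `clone_b` still reads the shared base block, and only one block of extra space is consumed.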
Figure 1. Virtual Disk Types
Finally, we use the key primitive of live storage migration (also called Storage vMotion) provided by ESX, which allows an administrator to migrate a virtual disk from one datastore to another without any downtime of the VM. We rely heavily on this primitive to do runtime IO and space management.
2.3 Storage Management Challenges
Despite the numerous benefits of virtualization in terms of extensible and dynamic allocation of storage resources, the complexity and large number of virtual disks and datastores call for an automated solution. It is hard for an administrator to keep track of the mappings and sharing at various levels in order to make even simple decisions, such as finding the best datastore for an incoming VM. The design may need to consider several metrics, as explained below:
- Space requirements: One might think that using a datastore with the most available space is the best option. As we have described, determining the datastore with the most available space is harder in the presence of thin provisioning, de-duplication, linked clones, and unequal data growth from thin provisioned virtual disks. For example, it is often more space efficient to utilize an existing base disk for a linked clone than to create a full clone on another datastore – the former will use a fraction of the space as compared to the full clone.
- Performance requirements: Using a datastore with the most available performance headroom (IOPS or latency) seems to be a good option, but it is hard to determine the available headroom in the presence of rampant sharing of underlying physical resources behind many layers of mapping. Estimating the performance of a storage system and dynamically detecting the sharing among multiple datastores are hard problems that need to be solved to efficiently manage performance.
- Multiple-dimensions: It is quite possible that the best datastore for available space is not the same as the best datastore for available performance. Determining the relative goodness of a placement when there are multiple optimization criteria is needed in any solution.
- Constraints: Administrators can also provide constraints such as anti-affinity rules (e.g., keeping multiple VMs on different datastores), datastore compatibility (e.g., using a native datastore feature), connectivity constraints, or datastore preference (e.g., requiring a certain RAID level) that need to be taken into account for placement. In these cases, a provisioning operation might have to move other virtual disks to be successful.
Beyond just the initial placement, storage resources must be continuously monitored to detect changes in capacity consumption and IO loads. It is common practice to relocate workloads in the server infrastructure using vMotion or Storage vMotion to carry out various remediation actions.
This level of flexibility requires that the management layer be sufficiently intelligent to automatically detect resource imbalance, formulate actions that will fix the issue before it can develop into a service disruption, and carry out these actions in an automated fashion.
3. Solution: Storage DRS
Figure 2. Storage DRS as Part of VMware vSphere
Figure 2 describes the vSphere architecture with Storage DRS. As illustrated in the figure, Storage DRS runs as part of the vCenter Server management software. Storage DRS provides a new abstraction to the administrator, called a datastore cluster. This allows administrators to put together a set of datastores into a single cluster and use that as the unit for several management operations. Within each datastore cluster, VM storage is managed in the form of virtual disks (VMDKs). The configuration files associated with a VM are also pooled together as a virtual disk object with space requirements only. When a datastore cluster is created, the user can specify the following key configuration parameters:
- Space Threshold: This threshold is specified in percentage and Storage DRS tries to keep the space utilization below this value for all datastores in a datastore cluster. By default this is set to 80%.
- Space-Difference Threshold: This is an advanced setting that is used when all datastores have space utilization higher than the space threshold. In that case a migration is recommended from a source to a destination datastore only if the difference in space utilization is higher than this space-difference value. By default, this is set to 5%.
- IO Latency Threshold: A datastore is considered overloaded in terms of IO load only if its latency is higher than this threshold value. If all datastores have latency smaller than this value, no IO load balancing moves are recommended. This is set to 10 ms by default. We compute 90th percentile stats in terms of datastore IO latency to compare with this threshold.
- Automation level: Storage DRS can be configured to operate in fully automated or manual modes. In fully automated mode, it will not only make recommendations but execute them without any administrator intervention. In manual mode, administrator intervention is needed to approve the recommendations.
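The parameters above and the checks they drive can be sketched as follows. The default values for the three thresholds come from the text; the class and function names, the manual-mode default, and the nearest-rank percentile method are assumptions made for this illustration rather than the product's actual interface.

```python
from dataclasses import dataclass


@dataclass
class DatastoreClusterConfig:
    space_threshold_pct: float = 80.0      # default stated in the text
    space_difference_pct: float = 5.0      # default stated in the text
    io_latency_threshold_ms: float = 10.0  # default stated in the text
    fully_automated: bool = False          # assumed default: manual mode


def percentile_90(samples):
    """Nearest-rank 90th percentile of a non-empty list (method is an assumption)."""
    ordered = sorted(samples)
    rank = (9 * len(ordered) + 9) // 10    # integer form of ceil(0.9 * n)
    return ordered[rank - 1]


def space_violated(used_gb, capacity_gb, cfg):
    return 100.0 * used_gb / capacity_gb > cfg.space_threshold_pct


def io_overloaded(latency_samples_ms, cfg):
    return percentile_90(latency_samples_ms) > cfg.io_latency_threshold_ms


def space_move_allowed(src_util_pct, dst_util_pct, cfg):
    # Applies when every datastore is already above the space threshold.
    return src_util_pct - dst_util_pct > cfg.space_difference_pct


cfg = DatastoreClusterConfig()
```

Using the 90th percentile rather than the mean means that a single latency spike among many low samples does not mark a datastore as overloaded.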
All these parameters are configurable at any time. Based on these parameters, the following key operations are supported on the datastore cluster:
- The initial placement API for the VMDKs of a VM can be called on a datastore cluster instead of a specific datastore. Storage DRS implements that API and provides a ranked list of candidate datastores based on space and IO stats.
- Out-of-space situations are avoided in the datastore cluster by monitoring the space usage and comparing that to the user-set threshold. If there is any risk that a datastore may cross the space utilization threshold, Storage DRS recommends Storage vMotion to handle that case.
- IO load is monitored and balanced across datastores. If the 90th percentile latency is higher than the user-set threshold for any datastore, Storage DRS tries to migrate some VMDKs out of that datastore to lightly loaded ones.
- Users can put a datastore in the maintenance mode. Storage DRS tries to evacuate all VMs off of the datastore to suitable destinations within the cluster. The user can then perform maintenance operations on the datastore without impacting any running VMs.
All resource management operations performed by Storage DRS respect user set constraints (or rules). Storage DRS allows specification of anti-affinity and affinity rules between VMDKs in order to fulfill business, performance and reliability considerations.
As depicted in Figure 2, the Storage DRS architecture consists of three major components:
- Stats Collection: Storage DRS collects space and IO statistics for datastores as well as VMs. VM-level stats are collected from the corresponding ESX host running the VM, and datastore-level stats are collected from SIOC, which is a feature that runs on each ESX host and computes the aggregated datastore-level stats across all hosts. For example, SIOC can provide per-datastore IOPS and latency measures across all hosts accessing the datastore.
- Algorithm: This core component computes resource entitlements and recommends feasible moves. Multiple choices could be possible, and recommendations carry a rating indicating their relative importance or benefit.
- Execution Engine: This component monitors and executes the Storage vMotion recommendations generated by the algorithm.
In this paper, we focus on the handling of space, user constraints, and virtual disk features such as linked clones. To handle IO, Storage DRS performs online performance modeling of storage devices. In addition, it continuously monitors IO stats of datastores as well as VMDKs. Details of IO modeling and the corresponding algorithms are described in earlier papers. We do not discuss those topics here.
In the remainder of this section, we first describe automated initial placement of VMs in a Storage DRS cluster. Then we go into the details of load balancing, which forms the core of storage resource management, followed by a discussion of constraint handling.
3.1 Initial Placement
Initial placement of VMs on datastores is one of the most commonly used features of Storage DRS. Manually selecting appropriate storage for a VM often leads to problems such as poor IO performance and out-of-space scenarios. Storage DRS automatically selects the most fitting datastore on which to place a VM based on space growth modeling and the IO load history of the datastores. During initial placement, Storage DRS does not have any history of the VM’s IO profile or its future space demand, so it uses conservative estimates for space and IO: space is assumed to be the full disk size for thick-provisioned disks and a fixed small size for thin-provisioned ones. For IO load, we use the average load from the other existing VMDKs on the datastore.
In addition, Storage DRS gives preference to datastores connected to as many hosts as possible. This allows solutions like Distributed Resource Scheduler (DRS) to do load balancing of VMs across hosts more effectively. Storage DRS supports all virtual disk formats for placement: thin, thick eager zeroed, and thick lazy zeroed. Thin disks in particular start off with a small size but can consume all the space up to their configured size over time. This poses a risk of running out of space on a datastore, so such placements take future space growth into consideration.
Many aspects of Storage DRS, such as constraints, prerequisite moves, and pending recommendations, are common to initial placement and load balancing. We discuss them in detail when we describe load balancing. Figure 3 below describes the initial placement algorithm used by Storage DRS.
Figure 3. Initial Placement Algorithm
Not all available datastores can be used for initial placement: (1) a datastore may not have sufficient resources to admit the VM, (2) inter-VM anti-affinity rules may be violated by a particular placement, or (3) the VM may not be compatible with some datastores in the datastore cluster. Storage DRS applies these filters before evaluating VM placement on a datastore. For each datastore that passes the filters, the placement is checked for conflicts with pending recommendations. Then Storage DRS computes the datastore cluster imbalance as if the placement had been done on that datastore. Goodness is measured in terms of the change in imbalance as a result of the placement operation. After the candidate placements are evaluated, they are sorted by their goodness values. The datastore with the highest goodness value is the best datastore available for VM placement.
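The filter-then-rank flow described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Storage DRS algorithm: it considers space only, models imbalance as the population standard deviation of space utilization, folds compatibility and rule checks into a single hypothetical `compatible` flag, and omits the check against pending recommendations.

```python
from dataclasses import dataclass
from statistics import pstdev


@dataclass
class Datastore:
    name: str
    capacity_gb: float
    used_gb: float
    compatible: bool = True   # stand-in for compatibility and anti-affinity checks


def imbalance(utilizations):
    # Assumption: imbalance is the population standard deviation of utilization.
    return pstdev(utilizations)


def rank_placements(datastores, vm_size_gb, space_threshold=0.8):
    before = imbalance([d.used_gb / d.capacity_gb for d in datastores])
    ranked = []
    for candidate in datastores:
        # Filters: compatibility and rules first, then space under the threshold.
        if not candidate.compatible:
            continue
        if (candidate.used_gb + vm_size_gb) / candidate.capacity_gb > space_threshold:
            continue
        # Cluster imbalance as if the VM had been placed on this candidate.
        after = imbalance([
            (d.used_gb + (vm_size_gb if d is candidate else 0.0)) / d.capacity_gb
            for d in datastores
        ])
        # Goodness: reduction in imbalance caused by the placement.
        ranked.append((before - after, candidate.name))
    ranked.sort(reverse=True)   # highest goodness first
    return ranked


stores = [Datastore("DS1", 50, 20), Datastore("DS2", 50, 25), Datastore("DS3", 50, 38)]
ranking = rank_placements(stores, vm_size_gb=5)
```

In this example DS3 is filtered out because the placement would push it past 80% utilization, and DS1 ranks above DS2 because placing the VM there reduces the spread of utilization the most.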
3.2 Load Balancing
Load balancing ensures that datastores in a datastore cluster do not exceed their configured thresholds as space consumption and IO loads change during runtime. Unlike DRS, which minimizes the resource usage deviation across hosts in a cluster, Storage DRS is driven by threshold triggers. Its load balancing mechanism moves VMs out of only those datastores that exceed their configured threshold values. Figure 4 gives the outline of load balancing as used by Storage DRS. For each pass of Storage DRS, the algorithm is invoked first for datastores that exceed their space threshold and later for those violating the IO threshold, effectively fixing space violations followed by IO violations in the datastore cluster.
Figure 4. Storage DRS Load Balancing
Storage DRS is able to factor in multiple resources (space, IO, and connected compute capacity) for each datastore when generating a move proposition. It uses a weighted resource utilization vector to compare utilizations. Resources that are closer to their peak utilization get higher weights than others. The following example illustrates the effectiveness of Storage DRS in multi-resource management.
Consider a simple setup of two datastores (DS1 and DS2) with 50GB of space each and the same IO performance characteristics. Both datastores are part of the same Storage DRS cluster.
Two types of VMs are to be placed on these datastores: (1) High IO VMs, which run a workload of 5-15 outstanding IOs for the duration of the test, and (2) Low IO VMs, with an IO workload of 1-3 outstanding IOs for the duration of the test.
Each VM has a pre-allocated thick disk of 5GB. During the experiment, High IO VMs are placed on DS1 and Low IO VMs on DS2. The initial setup is described in Table 1 below:
Table 1: Initial Placement—Space and IO
The latency numbers are computed using a device model, which gives the performance of the device as a function of workload. Storage DRS is invoked to recommend placement for a new High IO VM, bringing the total number of VMs to ten. Placement on DS1 received a rating of 0, while placement on DS2 received a rating of 3, even though DS1 has more free space than DS2. This is because DS1 is closer to exhausting its IO capacity. By placing the VM on DS2, Storage DRS ensures that an IO bottleneck can be avoided. Furthermore, another Storage DRS pass over the datastore cluster recommended moving one of the High IO VMs from DS1 to DS2. The balanced cluster at the end of the evaluation is shown in Table 2 below:
|DS2||Low IO, High IO||5, 2||35GB||<10ms|
Table 2: Final Datastore Cluster State
Note that the space threshold for datastores is set to 80% of their capacity (40GB). So, from a space perspective, the final configuration is still balanced.
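One way to picture why DS2 is preferred in this example is the sketch below. The source states only that resources closer to their peak utilization receive higher weights; the specific weighting used here (each resource weighted by its share of total utilization) is an assumption, and the IO utilization figures are invented for illustration. The space figures assume four 5GB High IO VMs on DS1 and five 5GB Low IO VMs on DS2 before the new placement, which is consistent with the text.

```python
def weighted_utilization(utilization):
    """Collapse per-resource utilizations (0..1) into a single score.

    Assumption: a resource's weight is its share of the total utilization,
    so resources nearer their peak dominate the score.
    """
    total = sum(utilization.values())
    if total == 0:
        return 0.0
    return sum(u * (u / total) for u in utilization.values())


ds1 = {"space": 20 / 50, "io": 0.9}   # IO figure is illustrative only
ds2 = {"space": 25 / 50, "io": 0.3}   # IO figure is illustrative only

# The datastore with the lower weighted utilization is the better target.
best = min(
    (("DS1", ds1), ("DS2", ds2)),
    key=lambda item: weighted_utilization(item[1]),
)[0]
```

Although DS1 has more free space, its high IO utilization dominates its score, so the sketch selects DS2, matching the behavior described above.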
Storage DRS is equally effective in balancing multiple resources while load balancing VMs in a cluster. Consider the same two-datastore setup (DS1, DS2) as before. This example uses 9 VMs with a high IO workload and a 1.2GB disk each, and 8 VMs with a low IO workload and a 5GB disk each. All VMs were segregated on datastores based on their workload and space profiles, so the initial setup looks as described in Table 3.
Table 3: Imbalanced Initial Configuration
The space threshold was configured at 75% of datastore capacity (37.5GB). As is evident from the description, the setup is imbalanced from both a space and an IO perspective. Storage DRS load balancing was run multiple times on this cluster, during which 5 moves were proposed. The final configuration is shown in Table 4.
|DATASTORE||VM TYPE||NUMBER OF VMs||SPACE USED||LATENCY|
Table 4: Balanced Final Configuration
Note that the final configuration does not perfectly balance space or IO. Storage DRS performs only the moves sufficient to bring the space consumption below the configured threshold and to balance the IO load more evenly.
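The space side of this example can be checked with simple arithmetic. The sketch below covers space only and does not attempt to reproduce the five moves reported above, which also account for IO balancing; the variable names are chosen for illustration.

```python
import math

capacity_gb = 50.0
threshold_gb = 0.75 * capacity_gb          # 37.5 GB space threshold

high_io_used_gb = 9 * 1.2                  # about 10.8 GB on the high IO datastore
low_io_used_gb = 8 * 5.0                   # 40 GB on the low IO datastore

# Only the low IO datastore exceeds the threshold.
excess_gb = low_io_used_gb - threshold_gb  # 2.5 GB over the threshold

# Fewest 5 GB disks that must leave to get back under the threshold.
min_space_moves = math.ceil(excess_gb / 5.0)
```

Moving a single 5GB disk is enough to resolve the space violation, which illustrates why a threshold-driven balancer can stop well short of a perfectly even distribution.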
Load balancing is done in two phases. First, moves that balance space usage are generated, followed by moves that balance IO. Space moves are evaluated for future space growth. A move that violates the space or IO threshold, or that may cause the destination datastore to run out of space in the future, is rejected. Similarly, an IO load balancing move of a VM to a datastore with higher latency than the source is filtered out.
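These rejection rules can be sketched as a single predicate. This is an illustrative simplification: the `DatastoreState` type, the parameter names, and the use of the destination's current latency as a stand-in for its modeled latency after the move are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class DatastoreState:
    capacity_gb: float
    used_gb: float
    latency_ms: float


def move_allowed(src, dst, vmdk_gb, growth_gb=0.0, io_move=False,
                 space_threshold=0.8, io_threshold_ms=10.0):
    # Reject if the destination would cross its space threshold after the move.
    if (dst.used_gb + vmdk_gb) / dst.capacity_gb > space_threshold:
        return False
    # Reject if the destination is already above the IO latency threshold.
    if dst.latency_ms > io_threshold_ms:
        return False
    # Reject if projected growth would exhaust the destination in the future.
    if dst.used_gb + vmdk_gb + growth_gb > dst.capacity_gb:
        return False
    # IO balancing moves must not target a datastore slower than the source.
    if io_move and dst.latency_ms > src.latency_ms:
        return False
    return True


src = DatastoreState(capacity_gb=50, used_gb=45, latency_ms=15)
dst = DatastoreState(capacity_gb=50, used_gb=20, latency_ms=5)
```

For example, `move_allowed(src, dst, 5)` passes every filter, while the same move with 30GB of projected growth is rejected because the destination would eventually run out of space.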
3.2.2 Cost Benefit Analysis
Although a move may be useful in balancing the resource usage of a datastore cluster, Storage vMotion is a lengthy and costly operation. After a VM is moved to its destination, the move should keep cluster resource usage fair for a long time. Otherwise, changes in VM workload or growth rate can cause ping-pong moves. So, in addition to goodness, each move is evaluated for its cost and resulting benefit. For storage, the benefit is computed as the net reduction in normalized resource usage for space and IO on the source relative to the increase on the destination. The cost is computed in terms of the total storage transferred as part of Storage vMotion and the number of IOs that experience increased latency during that period. Linked clones indirectly affect cost-benefit analysis. A move to a datastore that already holds a larger portion of the clone disk chain is preferred, because it results in greater space savings as well as a less costly Storage vMotion.
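The effect of linked clones on migration cost can be illustrated with a small sketch. The function name `gb_to_copy`, the disk names, and the sizes are hypothetical; the sketch captures only the data-transfer part of the cost and ignores the IO latency impact.

```python
def gb_to_copy(chain, present_on_destination):
    """Data that must be transferred to move a VM's disk chain.

    chain: mapping of disk name -> size in GB for every disk in the chain.
    present_on_destination: names of chain disks already on the destination.
    """
    return sum(
        size for disk, size in chain.items()
        if disk not in present_on_destination
    )


chain = {"base.vmdk": 20.0, "delta.vmdk": 2.0}

cost_with_base = gb_to_copy(chain, {"base.vmdk"})   # destination already holds the base disk
cost_without_base = gb_to_copy(chain, set())        # destination holds nothing
```

Moving to the datastore that already holds the base disk transfers only the 2GB delta instead of 22GB, and the destination consumes correspondingly less new space.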
3.3 Pending Recommendations
Provisioning of new VMs and Storage vMotion of existing VMs can take several minutes to complete in a typical case. A new invocation of Storage DRS under such transient conditions may perceive the available free space and IO incorrectly and generate recommendations based on stale information. To prevent this, the Execution Engine tracks the lifecycle of recommendations and their corresponding actions. Prior to a load balancing run, an accurate snapshot is generated in order to produce valid recommendations. This ability is important for group deployments such as vApps and test/dev environments. Similarly, during the algorithm run, Storage DRS maintains a snapshot of the resource state of datastores and VMs in the datastore cluster. After a valid recommendation is generated, its impact is applied to the snapshot and resource values are updated before the search for the next load balancing move. That way, each subsequent recommendation does not conflict with prior recommendations.
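The snapshot technique can be sketched as follows. This is a minimal illustration that tracks space only; the dictionary layout and the `generate_moves` function are assumptions made for the example.

```python
import copy


def generate_moves(datastores, candidate_moves, space_threshold=0.8):
    """Accept candidate moves one by one against a working snapshot.

    datastores: name -> {"capacity_gb": ..., "used_gb": ...}
    candidate_moves: list of (vmdk_gb, source_name, destination_name)
    """
    snapshot = copy.deepcopy(datastores)   # live state is never mutated
    accepted = []
    for vmdk_gb, src, dst in candidate_moves:
        target = snapshot[dst]
        if (target["used_gb"] + vmdk_gb) / target["capacity_gb"] > space_threshold:
            continue   # would violate the threshold given earlier accepted moves
        # Apply the move's impact so later candidates see the updated state.
        snapshot[src]["used_gb"] -= vmdk_gb
        target["used_gb"] += vmdk_gb
        accepted.append((vmdk_gb, src, dst))
    return accepted, snapshot


live = {
    "DS1": {"capacity_gb": 50, "used_gb": 45},
    "DS2": {"capacity_gb": 50, "used_gb": 30},
}
moves = [(5, "DS1", "DS2")] * 3
accepted, final_state = generate_moves(live, moves)
```

The third identical move is rejected because the first two have already raised DS2 to 40GB in the snapshot; evaluated against the stale live state, all three would have looked valid.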
4. Constraint Handling
Storage capacity and IO performance are the primary resources Storage DRS considers in its initial placement and load balancing decisions. In this section, we describe several additional constraints that are taken into account when a virtual disk is placed on a datastore. There are two types of constraints considered by Storage DRS:
- Platform constraints: Some features that are required by a virtual disk may not be available in certain storage hardware or operations may be restricted due to firmware revision, connectivity, or configuration requirements.
- User specified constraints: Storage DRS provides a rule engine that allows users to specify affinity rules that restrict VM placement. In addition, Storage DRS behavior is controlled by a number of configuration options that can be used to relax restrictions or specify user preferences on a variety of placement choices. For example, the degree of space overcommit on a datastore with thin provisioned virtual disks can be adjusted dynamically using a configuration setting.
Storage DRS considers all constraints together when it is evaluating virtual disk placement decisions. Constraint enforcement can be strict or relaxed for a subset of constraints.
In strict enforcement, no placement or load balancing decision that violates a constraint is possible, even though such a decision may be desirable from a performance point of view. In contrast, relaxed constraint enforcement is a two-step process. In the first step, Storage DRS attempts to satisfy all constraints using strict enforcement. If this step is successful, no further action is necessary. Otherwise, in the second step, relaxed constraints can be violated when Storage DRS considers placement decisions. The set of constraints that can be relaxed, and the conditions under which they can be relaxed, are controlled by user configuration settings.
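The two-step process can be sketched as follows. The representation of constraints as named predicates, the `relaxable` set standing in for user configuration, and the `score` function are all hypothetical.

```python
def choose_datastore(candidates, constraints, relaxable, score):
    """constraints: name -> predicate(datastore) returning True if satisfied.
    relaxable: names of constraints the configuration allows to be relaxed.
    Returns (chosen datastore or None, list of violated constraint names)."""

    def satisfying(names):
        return [ds for ds in candidates
                if all(constraints[n](ds) for n in names)]

    # Step 1: strict enforcement of every constraint.
    strict = satisfying(list(constraints))
    if strict:
        return max(strict, key=score), []

    # Step 2: enforce only the constraints that may not be relaxed.
    hard = [n for n in constraints if n not in relaxable]
    relaxed = satisfying(hard)
    if relaxed:
        best = max(relaxed, key=score)
        violated = [n for n in constraints if not constraints[n](best)]
        return best, violated
    return None, []
```

Returning the list of violated constraints lets a caller report which rules were relaxed for a given placement.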
4.1 Platform Constraints
In this section, we describe platform constraints enforced by Storage DRS. The first group of platform constraints is defined as part of the VMware APIs for Storage Awareness (VASA), which enable storage devices to communicate their configuration constraints to Storage DRS. As described earlier, storage controllers determine the amount of actual capacity available for thin provisioned datastores. When the available backing space for a thin provisioned datastore runs low, storage controllers send events to the vCenter management server using the VASA API so that Storage DRS can adjust its placement decisions accordingly.
In these cases, even though the actual available space is running low, the datastore may appear to have much more free space. In response, Storage DRS does not place new virtual disks on, or move other virtual disks to, datastores that are identified as running low on capacity. These restrictions are lifted when the storage administrator provisions new physical storage to back a thin provisioned datastore. This is an example of a dynamic constraint declared entirely by the storage controllers.
VASA APIs are also used to control whether Storage DRS can move virtual disks from one datastore to another in response to imbalanced performance. Since relocating a virtual disk to a different datastore also shifts the IO workload to that datastore, a performance improvement is possible only when independent physical resources back two datastores. In a complex storage array, two different datastores may be sharing a controller or physical disks along the IO path. As a result, their performance may be coupled together. Storage DRS considers these relationships and attempts to pick other, non-correlated datastores to correct performance problems.
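As a small illustration, selecting non-correlated destinations might look like the sketch below. The correlation information is assumed to arrive as a list of datastore-name pairs; how VASA actually encodes it is not specified here.

```python
def non_correlated_targets(source, datastores, correlated_pairs):
    """Return candidate destinations that do not share backing resources
    with the source datastore, preserving the input order."""
    # Store pairs as frozensets so (a, b) and (b, a) are treated alike.
    pairs = {frozenset(p) for p in correlated_pairs}
    return [ds for ds in datastores
            if ds != source and frozenset((source, ds)) not in pairs]
```

For example, if `lun1` and `lun2` are carved from the same disks, an IO balancing move out of `lun1` would only consider the remaining datastores.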
If virtual disks are created using storage-specific APIs (native cloning or native snapshots), Storage DRS restricts the relocation of such virtual disks, as the native API must be used for their movement. Alternatively, if new provisioning operations require storage-specific APIs, Storage DRS restricts the candidate pool to datastores that have the necessary capabilities. For example, thin-provisioned virtual disks are only available in recent versions of VMFS. In these cases, the capabilities are treated as strict constraints on actions generated by Storage DRS.
Platform-specific constraints can also be specified through the VMware Storage Policy Based Management (SPBM) framework by using storage profiles. Storage profiles associate datastores with a set of capabilities generated by users, system administrators, or infrastructure providers. For example, high-performance, highly available datastores can be tagged as “Gold storage”, whereas other datastores with lower RAID protection can be tagged as “Silver storage”. Storage DRS clusters consist of datastores with identical storage profiles. That way, load balancing and initial placement operations satisfy SPBM constraints. In the future, storage clusters can be more flexible and allow for datastores with different storage profiles, as we discuss later.
Storage DRS is compatible with VMware HA. HA adds a constraint that the VM’s configuration files can only be moved to a datastore visible to the host running HA master. Storage DRS checks for such conditions before it proposes moves for HA protected VMs.
4.2 User-specified Constraints
Storage DRS provides a rich rule enforcement engine that allows users to specify affinity rules that restrict VM placement. Storage DRS supports the following rules:
- Virtual Machine Anti-Affinity: If two or more VMs are part of an anti-affinity rule, Storage DRS ensures that they are placed on different datastores. This is useful for identifying a set of VMs that should not fail together, so that the services supported by those VMs are always available. In enforcing anti-affinity rules, Storage DRS also considers physical resource sharing as reported through VASA APIs, so that anti-affine VMs are not placed on datastores that storage controllers identify as sharing resources.
- Virtual Disk Anti-Affinity: If a single VM has more than one virtual disk, Storage DRS supports placing them on different datastores. Anti-affinity is useful for managing the storage of I/O intensive applications such as databases. For example, log disks and data disks can be automatically placed on different datastores, enabling better performance and availability.
- Virtual Disk Affinity: Using this rule, all virtual disks of a virtual machine can be kept together on the same datastore. This is useful for the majority of small servers and user VMs, as it is easier for administrators when all the files comprising a VM are kept together. Furthermore, this simplifies VM restart after a failure.
Storage DRS also supports relaxation of affinity-rule enforcement during rare maintenance events, such as when VMs are evacuated from a datastore. Since affinity-rule enforcement might constrain the allocation of available resources, it is useful to temporarily relax these constraints when available resources will be reduced intentionally for maintenance.
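A minimal sketch of checking a placement against the three rule types is shown below. The placement map keyed by `(vm, disk)` and the shape of the rule arguments are hypothetical, and physical resource sharing reported through VASA is not modeled.

```python
def vm_datastores(placement, vm):
    """Set of datastores holding any disk of the given VM."""
    return {ds for (v, _disk), ds in placement.items() if v == vm}


def find_violations(placement, vm_anti_affinity=(), disk_affinity=(),
                    disk_anti_affinity=()):
    """placement: (vm, disk) -> datastore name.
    vm_anti_affinity: groups of VMs that must not share a datastore.
    disk_affinity: VMs whose disks must stay on one datastore.
    disk_anti_affinity: (vm, disks) whose disks must be on distinct datastores."""
    found = []
    for group in vm_anti_affinity:
        first_vm_on = {}
        for vm in group:
            for ds in sorted(vm_datastores(placement, vm)):
                if ds in first_vm_on and first_vm_on[ds] != vm:
                    found.append(("vm-anti-affinity", first_vm_on[ds], vm, ds))
                first_vm_on.setdefault(ds, vm)
    for vm in disk_affinity:
        if len(vm_datastores(placement, vm)) > 1:
            found.append(("disk-affinity", vm))
    for vm, disks in disk_anti_affinity:
        used = [placement[(vm, d)] for d in disks]
        if len(set(used)) < len(used):
            found.append(("disk-anti-affinity", vm))
    return found
```

During a maintenance evacuation, a caller could skip this check for the rules that are configured as relaxable.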
Storage DRS respects constraints for all of its regular operations, such as initial placement and load balancing. User-specified constraints can be added or removed at any time. If there is any constraint violation after the rule set is modified, Storage DRS immediately generates actions to fix the rule violations as a priority. User-specified constraints can also be added using certain advanced configuration settings. For example, the degree of space overprovisioning due to thin provisioning can be explicitly controlled by instructing Storage DRS to keep some reserve space for virtual disk growth. This is done by treating a certain fraction of the unallocated space of thin disks as consumed. This fraction is controlled by an advanced option.
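The reserve-space accounting described above amounts to a short calculation. In this sketch the tuple layout and the `reserve_fraction` parameter name are assumptions; the actual advanced option is not named here.

```python
def effective_used_gb(disks, reserve_fraction):
    """disks: iterable of (provisioned_gb, allocated_gb, is_thin).
    Counts a fraction of each thin disk's unallocated space as consumed."""
    total = 0.0
    for provisioned, allocated, is_thin in disks:
        total += allocated
        if is_thin:
            # Reserve part of the space the thin disk may still grow into.
            total += reserve_fraction * (provisioned - allocated)
    return total
```

A fraction of 0 counts only allocated space (maximum overcommit), while a fraction of 1 treats thin disks as if they were fully provisioned.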
5. Estimating Space Usage
In this section, we describe how Storage DRS estimates the space requirements of virtual disks for scenarios where space consumption is highly dynamic. A thin provisioned virtual disk grows over time as new data is written for the first time, up to the provisioned capacity of the virtual disk. The rate of growth of a virtual disk varies depending on the applications running inside the VM. In addition, frequent creation and deletion of virtual machine snapshots also contribute to variable growth rates. This is because there is no separate placement step for snapshot creation; the VM’s current datastore is used for newly created snapshots. Finally, linked clones themselves use delta disks that contain only the content that differs from the base disk from which the clone is created. Since a delta disk grows over time in the same way as a thin provisioned disk, different datastores experience varying space usage growth rates. It is important to keep track of datastore space usage automatically and to take actions with the unequal growth rates taken into account. This prevents a datastore from running out of space and causing a service disruption to the running VMs.
Storage DRS avoids out-of-space scenarios and extraneous moves by modeling space growth. It maintains a running average of datastore space usage over a period of time and predicts the space growth based on this running average. Using the model, Storage DRS avoids placing VMs on datastores where space will run out faster than on other datastores over a fixed time in the future. The following example illustrates space modeling. The initial setup is described in Table 5 below:
[Table 5: Initial Placement with Growing Disk; column: Time to Full Size]
Each of the VMs starts at 100 MB and eventually grows to 2 GB in size. This is analogous to the typical lifecycle of a VM with thin provisioned virtual disks. VMs on DS1 grow slowly and attain a size of 2 GB in 80 hours, while VMs on DS2 grow to 2 GB in 30 hours. After 80 hours, DS1 will have used 46 GB of space, while DS2 will hit 40 GB of space usage in 30 hours. The next time Storage DRS places a VM, it chooses DS1 even though it has less free space at the time of placement. Since the growth rate of VMs on DS1 is slower, the space usage across these datastores will be balanced over time.
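The idea of placing by projected rather than current free space can be sketched as below. The linear growth estimate from the first and last usage samples is a simplification of the running average described in the text, and the sample values in the usage note are hypothetical rather than taken from Table 5.

```python
def growth_rate(samples):
    """Average growth in GB per hour from (hour, used_gb) samples
    sorted by time; assumes at least two samples at distinct times."""
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    return (u1 - u0) / (t1 - t0)


def projected_free(capacity_gb, samples, horizon_hours):
    """Free space expected after horizon_hours at the observed growth rate."""
    used_now = samples[-1][1]
    return capacity_gb - (used_now + growth_rate(samples) * horizon_hours)


def pick_datastore(datastores, horizon_hours):
    """datastores: name -> (capacity_gb, samples).
    Choose the datastore with the most projected free space."""
    return max(datastores,
               key=lambda n: projected_free(*datastores[n], horizon_hours))
```

With two 100 GB datastores, one at 52 GB growing 0.2 GB per hour and one at 48 GB growing 0.8 GB per hour, a 30-hour horizon selects the first even though it has less free space at placement time.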
Figure 5. Space Usage with Linked Clones
Linked clones add a different type of complexity to space usage estimation: relocating a linked clone consumes different amounts of space depending on the availability of base disks. Consider the setup outlined in Figure 5. Datastore A and Datastore B both hold an identical base disk that is used by VM1, VM2, VM3, VM4, and VM5. VM1 currently uses 10 GB in its delta disk plus the 1 TB base disk at Datastore A. Similarly, VM4 is using 30 GB of delta disk and the 1 TB base disk at Datastore B. If VM1 is relocated to Datastore B, only the 10 GB delta disk has to be moved, since VM1 can start using the identical base disk at Datastore B. However, if VM1 is relocated to Datastore C, a copy of the 1 TB base disk has to be made as well, and as a result, both the base disk and the delta disk have to be moved. Note that relocating VM1 to Datastore B not only results in less total space being used but also completes faster, since a much smaller amount of data needs to be moved.
The base disk of linked clones is retained as long as there is at least one clone using it. For example, unless both VM4 and VM5 are relocated to a different datastore, the 1 TB base disk will continue to occupy space at Datastore B. Moving either VM4 or VM5 alone will result in space savings of only 30 GB and 10 GB respectively, leaving the 1 TB base disk intact.
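The linked-clone arithmetic above can be written out as a short sketch. The function names are hypothetical, and 1 TB is taken as 1024 GB for the purpose of the example.

```python
BASE_DISK_GB = 1024  # the 1 TB base disk, assuming 1 TB = 1024 GB


def data_to_copy_gb(delta_gb, dest_has_base, base_gb=BASE_DISK_GB):
    """Data moved when relocating a linked clone: the delta disk only if
    the destination already holds an identical base disk, otherwise the
    delta disk plus a copy of the base disk."""
    return delta_gb if dest_has_base else delta_gb + base_gb


def space_freed_at_source_gb(vm, deltas_on_source, base_gb=BASE_DISK_GB):
    """deltas_on_source: vm -> delta disk size for clones sharing the base
    disk on the source. The base disk is released only when the departing
    VM is the last clone that uses it."""
    freed = deltas_on_source[vm]
    others = [v for v in deltas_on_source if v != vm]
    if not others:
        freed += base_gb
    return freed
```

For VM1, moving to Datastore B copies 10 GB while moving to Datastore C copies 1034 GB under this assumption; moving VM4 alone off Datastore B frees only its 30 GB delta disk.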
6. Best Practices
Given the complexity and diversity of storage devices available today, it is often hard to design solutions that work for a large variety of devices and configurations. Although we have tried to set the default knobs based on our experimentation with a couple of storage arrays, it is not possible to procure and test many of the devices out there. In order to help system administrators to get the best out of Storage DRS, we suggest the following practices in real deployments:
(1) Use devices with similar data management features: Storage devices have two types of properties: data management related and performance related. The data management related properties include RAID level, backup capabilities, disaster recovery capabilities, de-duplication, etc. We expect all datastores in a Storage DRS cluster to have similar data management properties so that virtual disks can be migrated among them without violating any business rules. Storage DRS handles the performance variation among devices but assumes that virtual disks are compatible with all devices based on the data management features. We plan to relax this assumption in the future by using storage profiles and placing or migrating virtual disks only on datastores with a compatible profile.
(2) Use full or similar connectivity to hosts: We suggest keeping full connectivity between datastores and hosts. As an extreme case, consider a datastore DS1 that is connected to only one host. A VM whose virtual disk is placed on DS1 cannot be migrated to other hosts if that host has high CPU and memory utilization. In order to migrate the virtual machine, we would also have to migrate its disks from that datastore. In short, poor connectivity constrains the movement of VMs needed for CPU and memory balancing to a small set of hosts.
(3) Correlated datastores: Different datastores exposed via a single storage array may share the same set of underlying physical disks and other resources. For instance, in the case of an EMC CLARiiON array, one can create RAID groups using a set of disks with a certain RAID level and carve out multiple LUNs from a single RAID group. These LUNs essentially share the same set of underlying disks, and it does not make sense to move a VMDK from one to another for IO load balancing. In Storage DRS, we try to find such correlation and also allow storage arrays to report such performance correlation using VASA APIs. In the future, we are also considering exposing an API for administrators to provide this information directly if the array does not support the VASA API for correlation information.
(4) Ignoring stats during management operations: Storage DRS collects IO stats for datastores and virtual disks continuously and computes online percentiles during a day. These stats are reset once a day, and a seven-day history is kept, although only one-day stats are used right now. In many cases, nightly background tasks such as backups and virus scans lead to a very different stats profile during the night compared to the actual workday. This can also happen for tasks with a different periodicity, such as a full backup on a weekend. These tasks can distort the view of load on a datastore for Storage DRS. We have provided API support to declare time periods during which stats should not be collected and integrated into the daily profile. We suggest that storage administrators use this API to avoid spurious recommendations. Furthermore, it is a good idea to ignore recommendations generated after a day with a heavy management operation such as a RAID rebuild, unless it is actually desirable to move VMs out of the datastore that went through the rebuild process and exhibited high latencies, to protect against future faults.
(5) Use the affinity and anti-affinity rules sparingly: Using too many rules can constrain the overall placement and load-balancing moves. By default, we keep all the VMDKs of a VM together. This is done to keep the same failure domain for all VMDKs of a VM. If this is not critical, consider changing this default so that Storage DRS can move individual disks if needed. In some cases using rules is a good idea, since only the user is aware of the actual purpose of a VMDK for an application. One can use VMDK-to-VMDK anti-affinity rules to isolate the data and log disks of a database on two different datastores. This will not only improve performance by isolating a random IO stream from a sequential write stream, but also provide better fault isolation. To get high availability for a multi-tier application, different VMs running the same tier can be placed on separate datastores using VM-to-VM anti-affinity rules. We expect customers to use these rules only when needed, keeping in mind that in some cases the cluster may look less balanced due to the limitations such rules place on our load balancing operation.
(6) Use multiple datastores instead of using extents: Extents allow the administrator to extend a single datastore across multiple LUNs. These are typically used to avoid creating a separate datastore to manage. Extent-based datastores are hard to model and reason about. For instance, the performance of such a datastore is a function of two separate backing LUNs. IO stats are also a combination of the backing LUNs and are hard to use in a meaningful way. Features like SIOC are also not supported on datastores with multiple extents. With Storage DRS, the management problem with multiple datastores is already handled. So we suggest that storage administrators use a separate datastore per LUN and use Storage DRS to manage them, instead of using extents to increase the capacity of a single datastore.
(7) Storage DRS and SIOC threshold IO latencies: Users specify the threshold IO latency of a datastore while enabling Storage DRS on a datastore cluster. Storage DRS uses the threshold latency value as a trigger. If datastore latency exceeds this threshold, VMs are moved out of such datastores in order to restore the latency to a value below the threshold.
When Storage DRS is enabled on a datastore cluster, SIOC is also enabled on the individual datastores. SIOC operation is controlled by its own threshold latency value. By default, the SIOC threshold latency is higher than that of Storage DRS. SIOC operates at a much higher frequency than Storage DRS. Without SIOC, there could be situations where IO workloads kick in for short durations and IO latency shoots up for those time intervals. Until Storage DRS can remedy the situation, SIOC acts as a guard and keeps IO latency in check. It also ensures that other VMs on that datastore do not suffer during such intervals and get their proportional IO share. It is important to make sure that the SIOC threshold latency is higher than that of Storage DRS. Otherwise, datastore latency will always appear lower than the threshold value to Storage DRS, and it will not generate moves to balance IO load in the datastore cluster.
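Practices (4) and (7) above lend themselves to a short sketch. Both parts are simplifications under stated assumptions: the percentile is a batch nearest-rank computation rather than the online percentile Storage DRS maintains, the exclusion windows are passed as plain hour ranges rather than through the actual API, and SIOC is modeled as simply capping observed latency at its threshold.

```python
import math


def in_excluded_window(hour, windows):
    """True if the hour falls in any [start, end) exclusion window."""
    return any(start <= hour < end for start, end in windows)


def latency_percentile(samples, pct, excluded_windows=()):
    """Nearest-rank percentile of (hour, latency_ms) samples, skipping
    samples taken during declared management windows."""
    kept = sorted(lat for hour, lat in samples
                  if not in_excluded_window(hour, excluded_windows))
    if not kept:
        return None
    rank = max(1, math.ceil(pct / 100 * len(kept)))
    return kept[rank - 1]


def sdrs_would_trigger(raw_latency_ms, sdrs_threshold_ms, sioc_threshold_ms):
    """Simplified model: SIOC throttling keeps the latency that Storage DRS
    observes at or below the SIOC threshold."""
    observed = min(raw_latency_ms, sioc_threshold_ms)
    return observed > sdrs_threshold_ms
```

In this model, a nightly backup window that is excluded no longer inflates the daily percentile, and a SIOC threshold set below the Storage DRS threshold means `sdrs_would_trigger` never returns True.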
7. Discussion and Future Work
7.1 Storage DRS in the Field
VMware DRS technology influenced Storage DRS to a large extent. They both use similar concepts such as cluster, load balancing domain, recommendations, rules, and faults.
Over time, as Storage DRS deployments increased in the field, a few key differences have emerged.
Unlike DRS, it is critical for Storage DRS to choose the right storage for virtual disks. While placing a VM, users want simplicity and policy-driven placement. As discussed previously, we plan to expand Storage DRS to make VM placements work seamlessly across all storage in a vSphere environment. Another key difference is the scale. Ideally, Storage DRS users want to aggregate all storage into a single pool and let Storage DRS automatically manage VM placements, IO performance, space usage, and policy compliance over the entire pool. In this regard, we are working towards improvements in scale and performance.
Storage architectures have evolved significantly since the initial design of Storage DRS. As storage controllers become more intelligent, the existing black box performance models are less capable of covering all salient aspects of device performance. We are exploring new interfaces for performance modeling so that storage devices can compactly report their performance capabilities for use by Storage DRS. In addition, we are exploring combinations of active and passive performance modeling approaches.
Storage DRS best practices recommend datastores with similar data management features and even similar disk types in a datastore cluster. Many customers want the capability to include storage types that differ in capacity, performance, protocols, protection level, etc. in a single pool. This requires not only improvements in automation, but also more fine-grained controls when managing datastores in a cluster.
7.2 Future Directions
Storage technologies and storage system designs are evolving at a very rapid pace. With the introduction of multi-core CPUs, high-bandwidth interconnects and solid-state disks, many new storage architectures are coming to market. Some of the common new designs include: multi-tiered storage, scale-out storage and compute-storage converged architectures.
In the case of multi-tiered storage, SSDs are being used as a primary storage medium, either as a first tier or as a caching layer. Both designs make it harder to model the datastore from outside as a black box. This is because an outside observer cannot know the hit rate of IOs in the SSD tier.
The scale-out storage paradigm is very useful for cloud service providers. They can buy a unit of storage and scale out as demand grows. In this case, the servers may see a single connection point, but the IOs can be served from one of many backend storage controllers via internal routing. Some examples of this architecture include EMC Isilon, NetApp Data ONTAP GX, and IBM SONAS. These architectures make it hard to determine the amount of available performance left on the storage device.
Server-storage converged architectures such as Nutanix, Simplivity, and HP LeftHand take scale-out to the next level by coupling together local storage across servers to form a shared datastore. These solutions provide high-speed local access for most of the IOs and do remote IOs only when needed or for replication. In all these cases, it is quite challenging to get a sense of the performance available in a datastore. We think a good way to manage these emerging storage devices is to create a common API that brings together virtual device placement and management with the internal intelligence of the storage devices. We are working towards such a common language and incorporating it as part of VASA APIs.
So far we have talked about modeling IO performance, but similar problems arise for space management as well. Thin provisioning, compression, and de-duplication for primary storage are becoming common features for space efficiency in arrays. This makes it harder to estimate the amount of space that will be allocated to store a virtual disk on a datastore. Currently, Storage DRS uses the actual provisioned capacity as reported by the datastore. In reality the space consumed may be different, which makes space usage estimates slightly inaccurate. We are working on modifying Storage DRS to handle such cases and are also planning to add APIs for better reporting of space allocations.
Overall, building a single storage management solution that can do policy based provisioning across all types of storage devices and perform runtime remediation when things go out of compliance is the main goal for Storage DRS.
In this paper, we presented the design and implementation of a storage management feature from VMware called Storage DRS. The goal of Storage DRS is to automate several management tasks, such as initial placement of virtual disks, out-of-space avoidance, and runtime load balancing of space consumption and IO load on datastores. Storage DRS also provides a simple interface to specify business constraints using affinity and anti-affinity rules, and it enforces them while making provisioning decisions. We also highlighted some of the complex array and virtual disk features that make the accounting of resources more complex and how Storage DRS handles them. Our initial deployments in the field have received a very positive response and feedback from customers. We consider Storage DRS the beginning of the journey to software-managed storage in virtualized datacenters, and we are planning to accommodate newer storage architectures and storage disk types in the future to make Storage DRS even more widely applicable.
- Resource Management with VMware DRS, 2006, http://vmware.com/pdf/vmware_drs_wp.pdf
- M. Eisler, P. Corbett, M. Kazar and D. S. Nydick. Data Ontap GX: A scalable storage cluster. In 5th USENIX Conference on File and Storage Technologies, pages 139-152, 2007.
- EMC, Inc. EMC Isilon OneFS File System. 2012, http://simple.isilon.com/doc-viewer/1449/emc-isilon-onefs-operating-system.pdf
- A. Gulati, C. Kumar, I. Ahmad, and K. Kumar. BASIL: Automated IO Load Balancing across Storage Devices. In USENIX FAST, Feb 2010.
- A. Gulati, G. Shanmuganathan, I. Ahmad, M. Uysal and C. Waldspurger. Pesto: Online Storage Performance Management in Virtualized Datacenters. In Proc. of the 2nd ACM Symposium on Cloud Computing, Oct 2011.
- Hewlett Packard, Inc. HP LeftHand P4000 Storage. 2012, http://www.hp.com/go/storage
- IBM Systems, Inc. IBM Scale Out Network Attached Storage, 2012. http://www-03.ibm.com/systems/storage/network/sonas/index.html
- A. Mashtizadeh, E. Celebi, T. Garfinkel, and M. Cai. The Design and Evolution of Live Storage Migration in VMware ESX. In Proc. USENIX Annual Technical Conference (ATC’11), June 2011 (to appear).
- D. R. Merrill. Storage Economics: Four Principles for Reducing Total Cost of Ownership. May 2009, http://www.hds.com/assets/pdf/four-principles-for-reducing-total-cost-of-ownership.pdf
- M. Nelson, B. H. Lim, and G. Hutchins. Fast Transparent Migration of Virtual Machines. In Proc. USENIX, April 2005.
- Nutanix, Inc. The SAN free datacenter. 2012, http://www.nutanix.com
- Simplivity, Inc. The Simplivity Omnicube Global Federation. 2012, http://www.simplivity.com
- N. Simpson. Building a data center cost model. Jan 2010, http://www.burtongroup.com/Research/DocumentList.aspx?cid=49
- S. Vaghani. Virtual machine file system. In ACM SIGOPS Operating Systems Review 44.4, pages 57-70, 2010.
- VMware, Inc. VMware APIs for Storage Awareness. 2010, http://www.vmware.com/technical-resources/virtualization-topics/virtual-storage/storage-apis.html
- VMware, Inc. VMware vSphere Storage IO Control. 2010, http://www.vmware.com/files/pdf/techpaper/VMW-vSphere41-SIOC.pdf
- VMware, Inc. VMware vSphere. 2011, http://www.vmware.com/products/vsphere/overview.html
- VMware, Inc. VMware vCenter Server. 2012, http://www.vmware.com/products/vcenter-server/overview.html
- VMware, Inc. VMware vSphere Fault Tolerance. 2012, http://www.vmware.com/products/datacenter-virtualization/vsphere/fault-tolerance.html
- VMware, Inc. VMware vSphere High Availability. 2012, http://www.vmware.com/solutions/datacenter/business-continuity/high-availability.html