Scaling of Cloud Applications Using Machine Learning

Pradeep Padala
Distributed Resource
Management Team
VMware Inc.

Aashish Parikh
Distributed Resource
Management Team
VMware Inc.

Anne Holler
Distributed Resource
Management Team
VMware Inc.

Madhuri Yechuri
Distributed Resource
Management Team
VMware Inc.

Lei Lu
Distributed Resource
Management Team
VMware Inc.

Xiaoyun Zhu
Distributed Resource
Management Team
VMware Inc.


Today’s Internet applications are required to be highly scalable and available in the face of rapidly changing, unpredictable workloads. Multi-tier architecture is commonly used to build Internet applications, with different tiers providing load balancing, application logic, and persistence. The advent of cloud computing has given rise to rapid horizontal scaling of applications hosted in virtual machines (VMs) in each of the tiers. Currently, this scaling is done by monitoring system-level metrics (e.g., CPU utilization) and determining whether to scale out or in based on a threshold. These threshold-based algorithms, however, do not capture the complex interaction among multiple tiers, and determining the right set of thresholds for multiple resources to achieve a particular service level objective (SLO) is difficult.

In this paper, we present vScale, a horizontal scaling system that can automatically scale the number of VMs in a tier to meet end-to-end application SLOs. vScale uses reinforcement learning (RL) to learn the behavior of the multi-tier application while automatically adapting to changes. We provide an RL formulation of the auto-scaling problem and design a solution based on Q-learning. Our learning algorithm is also augmented with heuristics that improve responsiveness and guide the learning. A vScale prototype is implemented in Java and evaluated on a VMware vSphere® test bed. We tested vScale by replaying traces from the 1998 FIFA World Cup (World Cup ’98) to simulate production workloads. Our experiments indicate that vScale learns quickly, adapts to changing workloads, and outperforms the RightScale auto-scaling algorithm.

1. Introduction

VMware customers have complex applications with multiple tiers, and meeting end-to-end SLOs is of paramount importance. These applications are required to be highly scalable and available in the face of dynamic workloads. To support these requirements, our customers often use a multi-tier architecture in which individual tiers can be scaled independently. Although vSphere and VMware cloud products allow these tiers to be hosted in VMs that can be scaled, this scaling is not automated. Currently, cloud providers such as Amazon EC2 offer auto-scaling services for a group of VMs, which are scaled based on specific conditions (e.g., average CPU usage > 70%) set by the user. However, auto-scaling based on individual VM resource usage is insufficient for scaling multi-tier and stateful applications.

Challenges include

  • Choosing the right thresholds – Choosing thresholds for multiple resource metrics used in mechanisms such as Amazon CloudWatch [1] is not easy. Cloud administrators often set thresholds based on ad-hoc measurements and past experience, which wastes resources.
  • Meeting end-to-end application SLOs – Internet applications have strict SLOs that must be met for a good user experience, which might directly affect company revenues. Converting end-to-end application SLOs to the right number of VMs in each tier is not trivial.
  • Interaction among multiple tiers – Applications that involve multiple tiers often have complicated dependencies that cannot be easily captured. Setting thresholds or scaling parameters for a specific tier independently can be counterproductive [12]. It is also possible for the bottlenecks to oscillate among the tiers due to incorrect scaling.
  • Efficient usage of resources – Finding the right number of resources to achieve a particular SLO is essential because of the costs involved in running VMs in a cloud. Another challenge is optimizing resource usage across multiple resources.
  • Stateful applications – Applications with persistent state often exhibit data access spikes due to the increase in the intensity of requests hitting the persistence layer and changes in the object popularity [8]. Scaling persistence layers is challenging because of the latency involved in data copying and distribution.

RightScale provides a mechanism [4] for automated scaling by allowing each node to vote for scaling up or down based on thresholds set on system metrics. The votes are aggregated and another threshold is used to determine the final scaling action. Setting the right thresholds still remains a problem in this approach. Academic solutions [12] have been proposed to address automated scaling, but we found that the assumptions made in previous work (e.g., good predictions about workloads, deep application instrumentation, and full view of the system) do not hold true in real systems.

To address these challenges, we propose a scaling system called vScale, built on top of vSphere, that automatically scales out and in (horizontal scaling) VMs in individual tiers to meet end-to-end application goals. We formulate the auto-scaling problem as an RL problem and solve the problem using the Q-learning technique [13]. A vScale prototype is implemented in Java and is evaluated using a vSphere test bed comparing it to the RightScale algorithm. We replayed World Cup ’98 traces to create different scenarios including workload variation, shift in bottlenecks, and cyclical workloads.

2. System Architecture

Figure 1. Typical Multi-Tier Application Architecture


Today’s Internet applications are hosted in a multi-tier architecture, as shown in Figure 1. The architecture includes a DNS server, load balancers, application servers, and database servers. Typically, clients connect to a URL, which is resolved to multiple load balancers in a round-robin fashion. The load balancers (also known as front-end or Web servers) serve static pages and send requests to the application tier. The application tier contains the business logic for the application. Persistence for the application is provided by the database tier. The database tier consists of a database proxy that routes requests to multiple database nodes.

We assume that each of the tiers can be independently scaled. When a new VM is added to or removed from a tier, the previous tier is updated to reflect the state of the system. The vScale system monitors and manages multi-tier applications that are represented by multiple distinct VMs. In Figure 2, we show the architectural block diagram of the overall system. The client program interacts with the vScale system through a well-defined vScale API. This API can be used by products such as AppDirector and Cloud Foundry to register the topology and the applications’ VMs with vScale. The API is also used to define the SLO and, optionally, cost budgets (or resource limits) to cap the amount of auto-scaling. The SLOs are specified as a goal on the expected performance (e.g., 99 percentile latency < 500ms).
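As a rough illustration of this registration flow, consider the sketch below. All class and method names here are hypothetical; the paper describes the vScale API only at a high level (register topology and VMs, set an SLO, and optionally cap the scaling budget).

```python
# Hypothetical sketch of a vScale API client; not the actual VMware API.
class VScaleClient:
    def __init__(self):
        self.apps = {}

    def register_app(self, name, tiers):
        # tiers: mapping from tier name to the list of VM ids backing it
        self.apps[name] = {"tiers": tiers, "slo": None, "budget": None}

    def set_slo(self, name, metric, percentile, limit_ms):
        # e.g., 99 percentile latency below 500 ms
        self.apps[name]["slo"] = (metric, percentile, limit_ms)

    def set_budget(self, name, max_vms):
        # optional cost budget: cap on the total number of auto-scaled VMs
        self.apps[name]["budget"] = max_vms

client = VScaleClient()
client.register_app("auction", {"web": ["vm1"], "app": ["vm2"], "db": ["vm3"]})
client.set_slo("auction", "latency", 99, 500)
client.set_budget("auction", 20)
```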

Each VM has a VMware vCenter™ Hyperic® [5] agent installed; these agents monitor the applications running in the VMs. Hyperic agents can collect system-level statistics (e.g., CPU and memory utilization) and application-level metrics (e.g., latency, throughput). The collected statistics are aggregated in the Hyperic server, which vScale periodically polls to gather the data. Based on the measured values of the performance metrics and the SLO, vScale uses an RL algorithm to compute recommendations. Each recommendation is a tuple <tier; nvms; up|down>, where tier is the tier to be scaled, nvms is the number of VMs to scale, and up|down specifies whether to start more VMs or stop existing ones. The recommendations are converted into provisioning operations that can be executed by the vCenter instance.
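A minimal sketch of this recommendation tuple and its expansion into provisioning operations follows; the operation names are placeholders, not the vCenter API.

```python
from collections import namedtuple

# The <tier; nvms; up|down> tuple described above.
Recommendation = namedtuple("Recommendation", ["tier", "nvms", "direction"])

def to_provisioning_ops(rec):
    """Expand a recommendation into per-VM power operations (placeholder
    names; the real system issues vCenter provisioning calls)."""
    op = "power_on" if rec.direction == "up" else "power_off"
    return [(op, rec.tier, i) for i in range(rec.nvms)]

ops = to_provisioning_ops(Recommendation("app", 2, "up"))
# ops -> [("power_on", "app", 0), ("power_on", "app", 1)]
```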

3. Design

We present the design of the vScale algorithm in this section.

3.1 Brief Primer on Reinforcement Learning
An RL formulation contains an intelligent agent that automatically learns from an environment. The agent interacts with the environment by applying an action and learning from the reward (positive or negative) awarded by the environment. At each time interval t, the environment provides the state st to the agent. The agent applies an action at and receives a reward rt+1, and the environment moves to state st+1. The agent chooses the action based on a “policy.” The objective of the agent is to learn the optimal policy to achieve maximum reward in the long run. For more detailed background on RL, see [11].

3.2 Auto-Scaling Problem
The primary objective in auto-scaling is to achieve the SLO by automatically scaling VMs in and out while utilizing the fewest resources. For the multi-tier application, we define ut as the resource configuration at time t, which is a vector of the total number of VMs (nvmt) and the resource utilizations (vt). vt is a vector of CPU, memory, storage, and network resource utilization. For each resource, utilization is calculated as the ratio of the total consumed to the total configured capacity across all VMs.

We also define a limit nvmlimit, which determines the maximum total number of VMs the multi-tier application is allowed to consume. The application performance is represented by yt, which specifies the end-to-end performance of the multi-tier application. yt is a vector of the individual tier performances (ytiert) (e.g., MySQL tier latency). yref is the SLO for the application. We represent the state in our RL problem as a combination of the current resource configuration and the application performance: st = (ut; yt). We do not include the input workload in our formulation, because it cannot be directly observed. However, note that the workload is indirectly represented by the application performance (yt). The actions in our RL problem are scaling VMs either in or out in a particular tier, represented by at = (tier; howmany; up|down), where howmany specifies the number of VMs to be scaled. We define the total expected return Rt as a function of the individual rewards at each future time interval, with a discount factor of β. Intuitively, the discount factor β controls how strongly rewards from future intervals count toward the overall return, because our goal is to maximize the overall return rather than the immediate reward:

Rt = rt+1 + β·rt+2 + β²·rt+3 + … = Σ (k = 0 to ∞) β^k · rt+k+1
The reward gained by taking an action is a function of the SLO, application performance, and resource usage. If the application meets the SLO (e.g., latency < 200ms, throughput > 1000 reqs/sec), the environment awards a positive reward. However, we don’t want the application to consume too many resources (e.g., latencies far below the SLO). To penalize excessive usage of resources, the environment provides a negative reward (or penalty) for the application exceeding the SLO by a wide margin. If the application does not meet the SLO, a negative reward is provided to discourage the action taken. We use two concave functions to calculate the rewards, as explained in Algorithm 1. We compute the score by combining the application performance and resource configuration ut. Resource configuration contains the number of VMs and resource utilizations. We use the maximum constrained item in ut in computing the score.
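The reward shaping above can be sketched as follows. The paper does not give the exact concave functions used in Algorithm 1, so the square-root curves and the margin parameter below are assumptions for illustration only.

```python
import math

def reward(latency, slo, utilization, margin=0.5):
    """Hedged sketch of the reward computation: positive when the SLO is
    met, penalized when the SLO is violated or when performance exceeds
    the SLO by a wide margin (wasted resources). sqrt-shaped concave
    curves and the margin parameter are assumed for illustration."""
    if latency <= slo:
        if latency < margin * slo:
            # Far below the SLO: resources are likely over-provisioned.
            return -math.sqrt((margin * slo - latency) / slo)
        # SLO met without gross over-provisioning: reward utilization.
        return math.sqrt(max(utilization, 0.0))
    # SLO violated: penalty grows concavely with the size of the violation.
    return -math.sqrt((latency - slo) / slo)
```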

Algorithm 1. Reward computation


3.3 Solution
We solve the auto-scaling problem by using Q-learning. First, we define the value of taking an action a in state s under a policy as Q(s, a), which denotes the expected return from taking action a from state s.

Q(s, a) = E[Rt | st = s, at = a]

We learn the optimal value of the action-value function Q by using Q-learning. Algorithm 2 describes our core algorithm. We first initialize the Q value for the starting state to zero. For each interval, we iteratively update the Q values as per the Q-learning technique [13]. The action is chosen from a policy (described below). α is the learning factor, which can be changed to favor exploration or exploitation: larger values of α favor exploitation by simply reusing the Q value that has been learned so far, whereas smaller values update the Q values more aggressively with what has been learned by applying the action. β is a discount factor that weights the rewards expected in future intervals.
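The update step in Algorithm 2 can be sketched as a tabular Q-learning rule. Following the paper's convention, larger α retains more of the previously learned value; the state and action encodings are placeholders.

```python
from collections import defaultdict

# Tabular action-value store: Q[(state, action)] -> expected return.
Q = defaultdict(float)

def q_update(s, a, r, s_next, actions, alpha=0.5, beta=0.9):
    """One Q-learning step. alpha is the learning factor (larger alpha
    keeps more of the old estimate, per the paper's convention); beta
    discounts rewards expected from future intervals."""
    # Best value achievable from the next state under current estimates.
    best_next = max((Q[(s_next, a2)] for a2 in actions), default=0.0)
    Q[(s, a)] = alpha * Q[(s, a)] + (1 - alpha) * (r + beta * best_next)
    return Q[(s, a)]
```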

Algorithm 2. Q-learning algorithm for the auto-scaling problem


3.4 Challenges in Applying Q-Learning
Before we discuss the policy for choosing the action in Q-learning, there are practical difficulties to consider in applying the Q-learning algorithm. In typical Q-learning algorithms, an ε-greedy policy is used to determine the next action to be taken. In an ε-greedy policy, the action with the best Q value is chosen with probability 1 – ε, and a random action is chosen with probability ε. The value of ε can be increased or decreased to trade off exploration vs. exploitation. However, the auto-scaling problem has a large state space, making it difficult to find an optimal action. Another challenge is the time taken to provision a new VM, which can be on the order of minutes. As a result, if we apply Algorithm 2 directly, it will take a long time to converge and will not be able to adapt to changes quickly.

To avoid these problems, we define a few heuristics to bootstrap Q-learning and speed up learning. We specifically list the heuristics for SLOs specified as latency < SLO, but similar heuristics are applicable to throughput as well:

  • No-change policy – Stay unchanged when the latency for all requests is below γ * SLO and above (1 – γ) * SLO. This policy ensures that we do not aggressively scale up or down for slight deviations from the SLO. γ can be changed to control how aggressively vScale tracks the SLO.
  • Scale-up policy – Scale up when the average latency for one percentile of requests is above γ * SLO. Choose the tier with
    • Max increasing errors within the history window W (e.g., last 32 actions)
    • Greatest latency increase during the history window W
    • Highest per-tier max latency
  • Scale-down policy – Scale down when the average latency for 99 percentile of requests is below (1 – γ) * SLO. Choose the tier with
    • Greatest latency decrease during the history window W
    • Lowest per-tier max latency
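A minimal sketch of the three threshold checks above, assuming p1_latency is the average latency of the slowest one percent of requests and p99_latency is the latency bounding 99 percent of requests (this interpretation, and the default γ, are assumptions); the per-tier selection rules are omitted.

```python
def heuristic_action(p1_latency, p99_latency, slo, gamma=0.8):
    """Bootstrap heuristics for a 'latency < SLO' goal: scale up when the
    slowest requests approach the SLO, scale down when nearly all requests
    are far below it, and otherwise leave the configuration unchanged."""
    if p1_latency > gamma * slo:
        return "scale_up"
    if p99_latency < (1 - gamma) * slo:
        return "scale_down"
    return "no_change"
```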

Finally, we define our policy as a combination of learning and heuristics as follows:

ε = (# actions explored so far)/(# of possible actions from s);
With probability ε: use Q table to find action a;
With probability 1 – ε: use heuristics;

When we start the vScale system or when the system behavior changes (due to workload changes), heuristics dominate the vScale operation. As vScale learns from the actions taken due to heuristics, the number of explored actions increases and learning becomes more predominant.
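This mixing rule can be sketched as follows: as the fraction of explored actions from a state grows, the learned Q table gradually replaces the heuristics. The q_table and heuristic arguments are placeholders for the structures described above.

```python
import random

def choose_action(s, q_table, possible_actions, num_explored, heuristic):
    """Combined policy sketch: epsilon is the fraction of actions already
    explored from state s, so heuristics dominate early and the learned
    Q values take over as experience accumulates."""
    eps = num_explored / len(possible_actions)
    if random.random() < eps:
        # Exploit the learned action values.
        return max(possible_actions, key=lambda a: q_table.get((s, a), 0.0))
    # Fall back to the bootstrap heuristics.
    return heuristic()
```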

4. Experimental Test Bed

Our test bed consists of multiple enterprise-class servers, each connected to multiple datastores backed by an enterprise-class SAN. Each node is an HP ProLiant BL465c G7 with two 12-core 2.099GHz processors, 128GB RAM, and two Gigabit Ethernet cards. The SAN has 10 non-SSD datastores of 1TB each. The nodes run VMware ESXi™ 5.1.0 managed by vCenter 5.1.0, and we used Fedora 14 Linux images as guest VMs. The guest VM image contained all the applications necessary for running multi-tier applications. We used named as our DNS server, nginx as the Web server and reverse proxy, JBoss as the application tier, and Percona MySQL Server as the database tier. The guest image also contained Hyperic agents to collect application-level monitoring information. We used RUBiS [6], an online auction site benchmark, as our primary benchmark application, and replayed a trace to reproduce the behavior of a production system. RUBiS normally uses a fixed, user-specified number of client threads to simulate concurrent connections. We modified it so that the number of client threads can vary over time, allowing traces to be replayed following the workload pattern observed in them.

5. Evaluation Results

In this section, we present experimental results validating vScale behavior. We designed experiments to specifically test the following aspects of vScale:

  • Automatically detect the bottlenecked tier and scale the VMs to meet the SLO.
  • Quickly achieve the SLO.
  • Use minimum resources to achieve the SLO.

5.1 World Cup Trace
In this experiment, we used RUBiS as the multi-tier application and the World Cup ’98 trace. A detailed characterization of the trace is available in [7]. The World Cup trace spans 92 days, and we mapped each day to 5 minutes. To match the workload pattern, we mapped the number of daily access requests to the number of concurrent client threads. The length of the resulting workload is approximately 18,000 seconds, with workload intensity varying over time. We ran vScale with the “latency of 99 percentile requests < 5 secs” SLO. Figure 3(a) shows the performance of the application over time. The primary y-axis shows the total number of requests and the number of requests served below 200ms. The secondary y-axis on the right side shows the number of requests above the SLO. The fraction of requests served below 5 seconds is 99.94%, thereby achieving the SLO. Figure 3(b) shows the scaling of VMs in each tier. The scaling matches the spikes in the workload, and different tiers are scaled according to the intensity of the workload at each tier.
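The trace mapping above might be sketched as follows; the linear scaling and the max_threads cap are assumptions, since the paper does not give its exact mapping constants.

```python
def requests_to_threads(daily_requests, max_threads=500):
    """Hypothetical linear mapping from daily request counts in the
    World Cup '98 trace to concurrent RUBiS client threads; the paper's
    actual scaling constants are not specified."""
    peak = max(daily_requests)
    return [max(1, round(r / peak * max_threads)) for r in daily_requests]

threads = requests_to_threads([1000, 5000, 10000])
# each trace day is then replayed for 5 minutes at its mapped thread count
```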

Figure 3. Performance of vScale Under Single Workload


5.2 Learning with Repetitive Workloads

To evaluate vScale’s learning, we created a repetitive workload pattern that contains four repetitions of the World Cup trace. Figure 4(a) shows the performance of the application. We can see that by learning through reward/punishment of historical auto-scaling actions, vScale gradually evolves its policy throughout the four repetitions of the workload. The benefit of such evolution is twofold: First, by learning from history, vScale avoids potentially inappropriate decisions, so the right amount of resources is allocated to the application quickly; second, by avoiding unnecessary scaling actions, vScale saves resources. To further demonstrate the benefit of learning, we reran the experiment after disabling the heuristics. Figure 4(b) shows the results. vScale can still arrive at a satisfactory scaling policy without heuristics, albeit more slowly.

Figure 4. vScale Performance Under Repetitive Workloads


We compared vScale’s performance with the RightScale algorithm [4]. We set the thresholds for each tier based on the resource utilization observed when latency is above the SLO. Results are shown in Figure 4(c). As can be seen from the figure, the RightScale algorithm produces more bad requests than vScale does either with or without heuristics. The algorithm also does not learn from the workload pattern and produces the same scaling behavior for all repetitions of the workload. Table 1 shows the comparison in more detail. The SLO time specifies the total number of minutes during which the application meets the SLO for all requests in that minute. vScale performs 77% better than RightScale in this regard. We also measured the resource usage (# of VMs × # of minutes, similar to Amazon EC2’s cost policy) and found that vScale uses 20% more resources, but in exchange for the much larger 77% improvement in SLO time. Because RUBiS is a closed-loop workload, vScale also achieves much higher throughput than RightScale, serving 34% more requests.


6. Conclusion

Solving the problem of automated scaling of multi-tier applications is important for achieving efficient resource usage and low costs while meeting SLOs. Current state-of-the-art approaches use threshold-based mechanisms, whereby scaling is performed based on a threshold set for the system metrics. These approaches are insufficient for achieving end-to-end SLOs, due to the difficulty involved in setting the thresholds correctly. In this paper, we formulated the auto-scaling problem as an RL problem and designed a solution based on Q-learning. A Java prototype is implemented and is evaluated using a local vSphere test bed. Our experiments with replaying the World Cup ’98 trace to reproduce various dynamic workloads indicate that vScale outperforms the RightScale algorithm.

7. Acknowledgments

We would like to acknowledge Jeff Jiang, Venkatanathan Varadarajan, Dimitris Skourtis, and Abhishek Samanta, who worked on this project during their internships at VMware.


1. Amazon CloudWatch.
2. Decision Tree Learning.
3. Galera Cluster for MySQL.
4. Set up autoscaling using voting tags.
5. VMware vCenter Hyperic.
6. C. Amza, A. Chanda, A. L. Cox, S. Elnikety, R. Gil, K. Rajamani, E. Cecchet, and J. Marguerite. Specification and implementation of dynamic Web site benchmarks. In Proc. of IEEE Workshop on Workload Characterization, Oct. 2002.
7. M. Arlitt and T. Jin. Workload characterization of the 1998 World Cup Web site. Technical Report HPL-1999-35R1, Hewlett-Packard Laboratories, Feb. 1999.
8. P. Bodik, A. Fox, M. J. Franklin, M. I. Jordan, and D. A. Patterson. Characterizing, modeling, and generating workload spikes for stateful services. In SoCC, pp. 241–252, 2010.
9. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
10. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
11. B. Urgaonkar, G. Pacifici, P. J. Shenoy, M. Spreitzer, and A. N. Tantawi. An analytical model for multi-tier internet services and its applications. In Proceedings of the International Conference on Measurements and Modeling of Computer Systems, pp. 291–302, June 2005.
12. C. Watkins. Learning from delayed rewards. PhD thesis, Cambridge University, Psychology Dept., 1989.