Performance Group and Product Management, VMware, Inc.
The average virtualization administrator finds it difficult to manage hundreds of hosts and thousands of virtual machines. At the same time, almost anyone on earth with a smartphone is adept at using social media sites like Facebook for managing hundreds of friends. Social media succeeds because a) the interfaces are intuitive, b) the updates are configurable and relevant, and c) the user can choose arbitrary groupings for friends. Why not apply the same techniques to virtualized datacenter management?
In this paper, we propose combining the tenets of social media with the VMware® vSphere® management platform to provide an intuitive technique for virtualized datacenter management. An administrator joins a social network and “follows” a VMware vCenter™ server or another monitoring server to receive timely updates about the status of an infrastructure. The vCenter server runs a small social media client that allows it to “follow” hosts and their status updates. This client can use the messaging capabilities of social media (posting messages to lists, deleting messages from lists, sending replies in response to messages etc.) to apprise the administrator of useful events. Similarly, hosts contain a small client and can “follow” virtual machines and be organized into communities (clusters), and virtual machines can be organized into “communities” based on application type (for example, all virtual machines running Microsoft Exchange) or owner (for example, all virtual machines that belong to user XYZ). By creating a hierarchy from an administrator to the host to the virtual machine, and allowing each to post status updates to relevant communities, an administrator can easily stay informed about the status of a datacenter. By utilizing message capabilities, administrators can even send commands to hosts or virtual machines. Finally, by configuring the types of status that are sent and even the data source for status updates, and by using social media metaphors like hash tags and ‘likes’, an administrator can do first-level triaging of issues in a large virtualized environment.
Categories and Subject Descriptors
D.m [Miscellaneous]: virtual machines, system management, cloud computing.
General Terms
Performance, Management, Design.
Keywords
Virtual machine management, cloud computing, datacenter management tools.
1. Introduction
One of the most challenging problems in virtualized deployments is keeping track of the basic health of the infrastructure. Operators would like to know quickly when problems occur and would also like to have guidance about how to solve issues when they arise. These problems are exacerbated at scale: it is already difficult to visualize problems when there are 100 hosts and 1000 virtual machines (VMs), but what about in setups with 1000 hosts and 10,000 virtual machines? Conventional means for monitoring these large environments focus on reducing the amount of data to manageable quantities. Reducing data is difficult, requiring two distinct skill sets: first, knowledge of virtualization, in order to determine which issues are serious enough to warrant an alert and how to solve such issues; second, the ability to create intelligent visualizations that reduce data into manageable chunks.
Automated techniques for monitoring the health of an infrastructure have become increasingly prevalent and helpful. Such approaches leverage the collection and analysis of a large number of metrics across an environment in order to provide a concise, simplified view of the status of the entire environment. However, despite the success of such tools, significant training is often required in order to become proficient at understanding and using their output.
In this paper, we approach the problem of virtualization monitoring from a different perspective. We observe that while the average administrator might find it difficult to monitor 100 hosts and 1000 virtual machines, the same administrator might find it relatively easy to keep abreast of his hundreds of Facebook friends. The reason for this is that social networking sites allow many knobs to limit the information flow to a given user. Moreover, these knobs are extremely intuitive to use. For example, users of Google+ can organize friends into circles and limit status updates to various circles, or might choose to propagate status updates only to select listeners. The knobs are also designed with a keen understanding of the problem domain: for example, in a social network, birthdays are important events, so social networks create special notifications based solely on birthdays.
We propose organizing a virtualized environment into a social network of its own, including not just humans (system administrators), but also non-human entities (hosts, virtual machines, and vCenter servers). Each member of the community is able to contribute status updates, whether manually (in the case of humans) or programmatically (through automated scripts running on virtual machines and hosts). We organize this social network according to our understanding of the hierarchy in a virtualized environment, and we limit the information flow so that only the most important updates reach an administrator. Moreover, the administrator is capable of performing commands within the social media client. By combining the reduction of information with the ability to perform basic virtualization management operations in response to such information, and by wrapping these features into the familiar UI of a social media application, we create an intuitive, platform-independent method for basic monitoring and management of a virtualized environment.
The outline of this paper is as follows. In section 2, we explain how we map the constructs of a social network to a virtual hierarchy. In section 3, we describe a prototype design for such a monitoring scheme, leveraging one set of social networking APIs, the Socialcast Developer APIs. In section 4, we describe our initial experiences deploying such a system in a real-world environment. In section 5, we discuss related work. We provide conclusions and future directions in section 6.
2. Comparing Virtual Inventories and Social Media Networks
Figure 1 depicts a sample social network. A two-way arrow suggests a friend relationship. For example, A is friends with B and B is friends with G, but G is not friends with A. In addition, B, F, and G might choose to create a separate, private group, as indicated by the dotted rectangle in the figure. There is a distinction between physical entities, namely the members of the groups like A through G, and the logical entities, like the group consisting of B, F, and G.
Figure 1: A sample social network, illustrating friend relationships, groups, and hierarchies.
Similar to a social network, a virtualized infrastructure also consists of ‘physical’ members and logical groups. For example, consider the sample VMware vSphere inventory shown in Figure 2. As the figure indicates, vCenter server W1-PEVCLOUD-001 is composed of a datacenter named vCloud, which in turn consists of a cluster AppCloudCluster and a group of hosts. The cluster contains resource pools and virtual machines. This hierarchy can be mapped to a ‘social’ network of its own. An administrator can be a ‘friend’ of a vCenter server. A vCenter server can have hosts as friends, and hosts can have virtual machines as friends. Hierarchy is important because it is one method of limiting information flow. In a social network, a person like A might choose instead to only be friends with B, knowing that if anything interesting happens to F and G, that B will likely collect such information and share it with A. In a similar manner, the vCenter server need not choose to be friends with all virtual machines, but just with all hosts. If a host receives enough status updates from the virtual machines running on it, it might choose to signal a status change to vCenter. In a similar way, an administrator might choose to be friends only with vCenter, knowing that vCenter can accumulate status updates and propagate them to the administrator.
While hosts and VMs are entities in our virtualization social network, we currently do not add datacenters, clusters, and resource pools as entities. At present, this is because of a practical issue. Datacenters, clusters, and resource pools cannot be added as friends because they do not have a ‘physical’ manifestation. In other words, while an administrator can send and receive network packets to/from virtual machines and hosts, an administrator cannot send a message to a datacenter. Instead, datacenters, clusters, resource pools, and host/virtual machine folders are more similar to a ‘group’ in a social network. We extend the notion of a group to include not just virtualization hierarchy, but also allow arbitrary user-defined collections of entities. For example, it might be helpful to put all virtual machines that run a SpecJBB in a group labeled SpecJBB, or it might be helpful to put all virtual machines under a given resource pool in a given group.
Figure 2: A Sample VC Inventory. Hosts, virtual machines and the vCenter server itself can be mapped to members of a social network, while datacenters, clusters, and resource pools are like groups.
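The distinction drawn in this section between members and groups can be sketched as a small data model. This is only an illustration: the class names, the `follows` attribute, and the host and virtual machine names are hypothetical, and only the vCenter, cluster, and SpecJBB names come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """A 'physical' entity that can send and receive messages."""
    name: str
    kind: str  # "admin", "vcenter", "host", or "vm"
    follows: set = field(default_factory=set)


@dataclass
class Group:
    """A logical collection, such as a cluster or a user-defined set."""
    name: str
    members: set = field(default_factory=set)


# Hypothetical inventory, loosely modeled on Figure 2.
admin = Member("admin", "admin")
vcenter = Member("W1-PEVCLOUD-001", "vcenter")
host = Member("host-01", "host")
vm = Member("vm-01", "vm")

# Each level follows only the level directly beneath it.
admin.follows.add(vcenter.name)
vcenter.follows.add(host.name)
host.follows.add(vm.name)

# Clusters and workload collections become groups, not members.
cluster = Group("AppCloudCluster", {host.name, vm.name})
specjbb = Group("SpecJBB", {vm.name})
```

Because the administrator follows only the vCenter server, any update from `vm-01` must pass through its host and the vCenter server before reaching the administrator, which is where filtering can take place.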
3. Prototype Design
In this paper, we propose that virtual machines, hosts, and vCenter servers become nodes in a social network. Each one runs agents that publish various pieces of information to the social network. An administrator can then examine the web site (or her mobile device) in order to get interesting information about the virtualized infrastructure in a more intuitive manner than a standard management interface.
In order to validate this design approach, we discuss a proof-of-concept design based on using the Socialcast Developer API. The verbs of social media messaging, as encapsulated in the Socialcast Developer API, map very well to the verbs required to build an efficient notification system for virtualized infrastructure. In the next two sections, we describe the Socialcast API and then indicate how it might be used to build a management infrastructure. Using a special on-premise Socialcast virtual appliance, we have been able to prototype and validate most of the proposals described in this section.
3.1 The Socialcast API
Before discussing our prototype design, we first give a brief description of social media messaging in Socialcast. There are several kinds of messages in Socialcast. There are community streams, in which a group of users can essentially subscribe to a given topic and see messages related to that topic. There are also private messages, which are messages directed to a particular user and not viewable by anyone else. There are comments, in which users can essentially respond to existing stream messages, and there are private message replies, which are similar to comments, but are responses to private messages. Messages and comments can be liked (in which other users express approval) or un-liked (in which a previously-posted ‘like’ is removed). Messages can be tagged with categories or filtered by content. Finally, users can be followed: if user A is followed by user B, then when user A makes comments, user B is notified of them. This allows user B to keep abreast of the events in A’s life.
Based on this description of the message types in Socialcast, we can now take a brief look at the relevant portions of the API.
- Messages API: The messages API allows users to read a single stream message or a group of stream messages, create new messages, update existing messages, destroy messages, and search messages. A user can also specify the retrieval of messages since a certain date, the retrieval of messages that fit certain criteria, etc.
- Likes API: The likes API allows a user to like a message or un-like a message.
- Comments API: The comments API allows a user to retrieve comments, create comments, update comments, or delete comments. There is also a ‘comment likes’ API where a user can like or un-like a comment.
- Flagging API: With flagging, a user tags a message that she has posted as being important to her, as a reminder to her to view it later.
- Private Messages API: The private messages API allows users to perform all of the same actions as in the standard messages API, but for private messages.
- Users API: The users API allows users to retrieve information about other users, search for users, deactivate users, and retrieve messages from a specific user.
- Follow/Unfollow API: The follow API allows users to ‘follow’ other users (i.e., see comments or notifications by the other users).
- Groups API: The groups API allows users to list groups, the members of groups, and group memberships of a given user.
- Attachments API: The attachments API allows a user to create attachments (either separately or as part of a message).
These commands can be simple HTTP GET or POST requests. Simply by installing a library like libcurl in virtual machines, we are able to have virtual machines programmatically send status and receive status. Adding this library to an ESX host further enables an ESX host to send/receive status. In essence, because of the ability to programmatically interact with messages, groups, etc., the hosts and virtual machines are able to be users in the virtualization social network in the same way that human beings are members of the virtualization social network.
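To make the "simple HTTP POST" concrete, the following sketch builds (but does not send) a status update request using only the Python standard library. The endpoint path `/api/messages.json`, the form fields `message[body]` and `message[group_id]`, and the use of HTTP basic authentication are assumptions about the Socialcast deployment and should be checked against its API documentation. The server URL and credentials are placeholders.

```python
import base64
import urllib.parse
import urllib.request


def build_status_post(base_url, user, password, body, group_id=None):
    """Build (but do not send) an HTTP POST that publishes a status message."""
    # Assumed form field names; verify against the deployed API.
    fields = {"message[body]": body}
    if group_id is not None:
        fields["message[group_id]"] = str(group_id)
    data = urllib.parse.urlencode(fields).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/messages.json",  # assumed endpoint path
        data=data,
        method="POST",
    )
    # HTTP basic authentication header for the posting entity.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    return request


request = build_status_post(
    "https://socialcast.example.com",  # placeholder server
    "vm-01", "secret",                 # placeholder credentials
    "ERROR: hard drive out of space",
    group_id=7,                        # placeholder group identifier
)
```

An agent would pass the returned object to `urllib.request.urlopen` to send it. Keeping construction separate from sending makes the request easy to inspect without a live server.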
3.2 Mapping the Socialcast API to virtualization monitoring
To understand how the Socialcast model fits in with virtualization monitoring, consider how virtualized environments are monitored. If important events happen, then notifications are sent to an administrator. These notifications are acknowledged and then cleared. Multiple similar issues might happen among a group of hosts or virtual machines, suggesting a common root cause. Messages can be flagged according to severity, and messages with common headers can be additionally categorized.
Consider our canonical design in which an administrator follows the vCenter server. The vCenter server, in turn, follows hosts, as shown in Figure 3. The hosts follow virtual machines. Because hosts are often in clusters, it might be useful to organize a set of hosts into a group named after the parent cluster. Virtual machines might reside in folders or resource pools, so virtual machines might be placed in groups based on parent folder or resource pool. Moreover, other interesting groupings are possible. For example, perhaps every physical host that belongs in rack X can go into a group named X, or every virtual machine running Microsoft Exchange can go into a group named “Microsoft Exchange.” An administrator might also decide to join such a group of virtual machines to view notifications related to these ‘Microsoft Exchange’ virtual machines.
Figure 3: Mapping virtualization relationships to social relationships. In this case, a vCenter server is managing a total of 16 hosts. From a social media perspective, vCenter is ‘following’ those hosts.
This simple “social network” of persons, hosts, and virtual machines forms a powerful monitoring service. For example, when a virtual machine encounters an issue like a virtual hard drive running out of space, the virtual machine can do a simple HTTP POST request to indicate its status (“ERROR: hard drive out of space”) using the messages API, as shown in Figure 4. In this case, a custom stream for Exchange virtual machines has been created, so the virtual machine sends the message to Socialcast and specifically to this custom group. If an administrator is periodically watching updates to this stream, he might notice a flurry of activity and choose to investigate the Exchange virtual machines in his infrastructure. Alternatively, a host can have an agent running that automatically reads messages to a given stream, parses them, and performs certain actions as a result. Finally, by creating a graph connecting vCenter to its end users and to administrators, it becomes easier to notify the relevant parties when an event of interest has occurred: for example, if a virtual machine is affected, then we can limit notifications to the followers of that virtual machine (presumably the users of that virtual machine).
Figure 4: A community of Exchange virtual machines. The administrator has a stream for messages from Exchange virtual machines. The Exchange virtual machines send messages to the stream when they are running out of disk space.
Blindly sending messages to a stream can result in an overload of messages to a human. To avoid such an overload, we can take advantage of a helpful feature of the Socialcast API: the ability to read a stream before publishing to it. If several virtual machines are exhibiting the same issues (for example, hard drive failures), rather than each posting to the same stream and inundating an administrator with messages, the Socialcast agent on each virtual machine can programmatically read the public stream and find out if such a message already exists. If so, the virtual machine can ‘like’ the message instead of adding a new message to the stream. In this manner, an administrator that is subscribed to this group will not be overwhelmed with messages: instead, the administrator will see a single error message with a large number of ‘like’ messages. This might suggest to the administrator that something is seriously wrong with some shared resource associated with these virtual machines, like the datastore backing the virtual machines. An illustration of this is shown in Figure 5, in which a number of hosts lose connection to an NFS server. The first server that is affected posts a message, and the others ‘like’ that message, providing an at-a-glance view of the severity of the problem.
Figure 5: Using ‘Like’ as a technique for aggregating data. Each host has the same error (in this case, failure to connect to an NFS storage device), but rather than having each one post a separate message, the first affected host posts a message, and subsequent hosts ‘like’ that message, providing an at-a-glance view of the severity of the problem.
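The read-before-publish technique can be sketched with an in-memory stand-in for a group stream. The `Stream` class and `report` function are hypothetical and do not call the Socialcast API. They only illustrate the decision each agent makes: post if the error is new, otherwise 'like' the existing message.

```python
class Stream:
    """In-memory stand-in for a group stream; not the Socialcast API."""

    def __init__(self):
        self.messages = []  # each message: author, body, and a set of likers

    def find(self, body):
        """Return the first message with this exact body, or None."""
        for message in self.messages:
            if message["body"] == body:
                return message
        return None

    def post(self, author, body):
        message = {"author": author, "body": body, "likes": set()}
        self.messages.append(message)
        return message


def report(stream, sender, body):
    """Post the error once; later senders 'like' the existing message."""
    existing = stream.find(body)
    if existing is None:
        stream.post(sender, body)
    elif sender != existing["author"]:
        existing["likes"].add(sender)


stream = Stream()
for host in ["host-01", "host-02", "host-03", "host-04"]:
    report(stream, host, "ERROR: cannot connect to NFS server")
```

After the loop, the stream holds a single message from `host-01` with three likes. Against a real server, two agents could read the stream at the same moment and both post, so a deployment would need to tolerate or merge occasional duplicates.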
Along the same lines, consider a host that is following each of its virtual machines. The host can use a simple loop to poll for status updates by its virtual machines. When enough such ERROR messages are detected, the host might decide to post an aggregated “ERROR: VM disk failures” to its status. The host can also query Socialcast to find out the number of likes of a given message and thereby determine how many entities are affected by that error. The vCenter server that is following this host might then choose to update its status accordingly (“ERROR: HOST X shows VM disk failures”). The administrator, who is following this vCenter server, will then see the appropriate status notification and might decide to investigate the host. Notice that by utilizing the hierarchical propagation of messages, an administrator sees a greatly reduced set of error messages. The administrator might further decide to create a special group called a ‘cluster’, and put all hosts in that cluster in a group. The administrator might choose to occasionally monitor the messages in the cluster group. By seeing all messages related to the cluster in one place, the administrator might notice patterns that would not otherwise be obvious. For example, if the cluster group shows a single host disconnect message and a number of ‘likes’ for that single ‘host disconnect’ message from the other hosts, it might be the case that a power supply to a rack containing these hosts has failed, and all hosts are subsequently disconnected. Note here that the ability to read the group messages before publishing is crucial to reduce the number of messages: rather than sending discrete messages for each error, the ‘like’ attribute is used. Depending on the type of power supply (managed or not), the power supply itself might be able to join a given social network of hosts and virtual machines and emit status updates.
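The hierarchical propagation described here can be sketched as a threshold-based aggregation step that each level applies to the statuses of the entities it follows. The function name, the threshold values, and the message wording are assumptions chosen for illustration. The text does not specify how many errors count as "enough".

```python
def aggregate_status(name, child_messages, threshold, summary):
    """Return one summary status if enough children report errors, else None."""
    failing = sum(
        1 for body in child_messages.values() if body.startswith("ERROR")
    )
    if failing < threshold:
        return None
    return f"ERROR: {name} shows {summary} ({failing} affected)"


# Latest status seen from each virtual machine the host follows.
vm_status = {
    "vm-01": "ERROR: hard drive out of space",
    "vm-02": "OK",
    "vm-03": "ERROR: hard drive out of space",
    "vm-04": "ERROR: hard drive out of space",
}

# The host aggregates its virtual machines; vCenter aggregates its hosts.
host_status = aggregate_status(
    "HOST X", vm_status, threshold=3, summary="VM disk failures"
)
vcenter_status = aggregate_status(
    "vCenter", {"HOST X": host_status or "OK"}, threshold=1, summary="host errors"
)
```

Three per-VM errors collapse into one host status, which in turn becomes one vCenter status, so the administrator sees a single line instead of three.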
As yet another technique for reducing information, a host or vCenter might utilize flags or comments. For example, depending on the content of the messages (for example, ERROR vs. WARNING), a host that is following its virtual machines might examine a message stream, choose the messages with ERRORS, and flag them or comment on them, indicating that they are of particular importance. The host can later programmatically examine flagged/commented messages and send a single update to the vCenter server. The vCenter server, in turn, can notify the administrator with a single message.
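A sketch of this flag-then-summarize step, assuming that each message body begins with its severity (ERROR or WARNING) and that messages are represented as plain dictionaries. The `flagged` key is a local stand-in for the flagging API, not a call to it.

```python
def summarize_errors(messages):
    """Flag ERROR messages and build one summary line for the next level up."""
    flagged = [m for m in messages if m["body"].startswith("ERROR")]
    for message in flagged:
        message["flagged"] = True  # stand-in for a call to the flagging API
    if not flagged:
        return None
    sources = sorted({m["author"] for m in flagged})
    return f"ERROR: {len(flagged)} flagged messages from {', '.join(sources)}"


stream = [
    {"author": "vm-01", "body": "ERROR: disk full"},
    {"author": "vm-02", "body": "WARNING: high CPU"},
    {"author": "vm-03", "body": "ERROR: disk full"},
]
update = summarize_errors(stream)
```

Only the two ERROR messages are flagged, and the host forwards one update to the vCenter server rather than relaying every message.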
3.2.1 Flexibility and Extensibility
As noted earlier, because nearly all of these messages rely on simple HTTP GETs and POSTs, any virtual machine or ESX host or vCenter server can utilize the entire breadth of the Socialcast API. In each case, it is a simple matter of writing a shell script that does rudimentary monitoring and invokes appropriate GET and POST requests. Moreover, more complex workflows and messaging are possible. For example, a simple script in a Linux virtual machine that monitors vmstat might notice that the free memory has dipped below a predefined threshold. The virtual machine might decide to post a status update and use the attachment API to include as an attachment the output of vmstat. Alternatively, the virtual machine might decide to generate a graph and attach the graph to its status message. For Windows virtual machines running MS-SQL, perhaps an agent can periodically monitor disk activity using perfmon and send the results of perfmon as an attachment to the appropriate administrator. Moreover, the architecture can be extended to include any data source that can generate POST/GET requests, ranging from physical devices like core network switches to change management software or procurement software, providing a single pane of glass for any activity related to an individual entity.
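The vmstat example can be sketched as follows. The text describes a shell script; Python is used here for illustration. The sample output, the threshold, and the message wording are made up. The parser locates the `free` column by its header name, assuming the default vmstat layout in which memory values are reported in KiB.

```python
def free_memory_kb(vmstat_output):
    """Return the 'free' column (in KiB) from the last sample of vmstat output."""
    rows = [line.split() for line in vmstat_output.strip().splitlines()]
    header = next(row for row in rows if "free" in row)  # column-name row
    sample = rows[-1]                                    # most recent sample
    return int(sample[header.index("free")])


def memory_status(vmstat_output, threshold_kb):
    """Return a warning string if free memory is below the threshold, else None."""
    free = free_memory_kb(vmstat_output)
    if free < threshold_kb:
        return f"WARNING: free memory {free} KiB is below {threshold_kb} KiB"
    return None


# Illustrative output only; real values come from running vmstat in the guest.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0  48212  10240 201344    0    0     5     9   40   80  2  1 97  0  0
"""
status = memory_status(SAMPLE, threshold_kb=65536)
```

In a guest, the input could come from `subprocess.run(["vmstat"], capture_output=True, text=True).stdout`, and the raw output could accompany the status message as an attachment.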
3.2.2 Sending and receiving commands via Socialcast
Messages do not have to be limited to static read-only content. For example, perhaps an administrator can send a private message to a virtual machine that includes the body of a script. When the virtual machine reads the message, it can execute the script. Similar such commands can be sent to hosts. For example, a primitive heartbeat mechanism can also be implemented: if each virtual machine and host is configured to send a message once a day, and if a host periodically checks to see if each virtual machine has issued an update, the host can potentially detect if a virtual machine has gone offline. The host could then send itself a command to power on the virtual machine, and if no response is detected from the virtual machine, a message can be posted by the host to the administrator’s group. To prevent security issues with malicious users sending arbitrary commands to hosts and virtual machines, it is important to leverage the in-built features of a Socialcast community: namely, that only authorized community members and members of a given group (for example, an administrators group for system administrators) are allowed to send messages to other members.
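The heartbeat check reduces to comparing each entity's most recent message time against a maximum age. The following sketch assumes the host has already collected a mapping from virtual machine name to the time of its last message. The function name and the sample timestamps are illustrative.

```python
from datetime import datetime, timedelta


def missing_heartbeats(last_seen, now, max_age=timedelta(days=1)):
    """Return the names whose latest message is older than max_age."""
    return sorted(name for name, seen in last_seen.items() if now - seen > max_age)


now = datetime(2012, 8, 27, 12, 0)
last_seen = {
    "vm-01": now - timedelta(hours=2),   # recent update
    "vm-02": now - timedelta(days=2),    # silent for two days
    "vm-03": now - timedelta(hours=25),  # just past the one-day limit
}
offline = missing_heartbeats(last_seen, now)
```

Here `offline` contains `vm-02` and `vm-03`, the candidates for a power-on command or a message to the administrators' group.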
3.2.3 Message archival and search
The preceding sections have demonstrated various advantages to using a social-media API for virtualization monitoring, including techniques for information reduction and simple interfaces for generating arbitrary types of status information in the form of attachments. An additional advantage of using an online community for virtualization monitoring is that these communities can be hosted in a private cloud, avoiding storage space concerns on any of the entities themselves. Moreover, the Administrator can periodically flush old messages or messages that have been acknowledged and acted upon. Socialcast stores messages in its database and allows full-text search as well as searching using database indexes. Thus, messages can easily be searched, providing a helpful audit trail.
3.3 Implementation Details
3.3.1 System Architecture
Based on the discussion above, one possible implementation involves installing agents in all ESX hosts, virtual machines, and vCenter instances, and having them directly post updates to a Socialcast server. Currently, this might be quite difficult, mainly because IT administrators are understandably risk-averse, so any changes to infrastructure applications (such as vCenter and the software running on ESX hosts) require significant levels of approval. As the approach matures and such an agent is more hardened and tested in the field, however, this is certainly a viable approach. To accommodate existing environments, we chose a slightly less disruptive approach that leverages only publicly available, secure interfaces.
Figure 6: Initial Implementation. A service virtual machine monitors vCenter and coordinates with vCenter to monitor ESX hosts, posting status to Socialcast on their behalf. Application virtual machines do not use the service virtual machine, and instead run monitoring agents that post status directly to Socialcast.
A logical block diagram of our initial implementation is shown in Figure 6. We use a Socialcast Virtual Machine to implement the social network. We install monitoring agents in each application virtual machine. In application virtual machines running Windows, the agent monitors WMI counters or perfmon counters and application-specific log files like Apache HTTP logs, and it sends HTTP POST requests directly to the Socialcast Virtual Machine. For Linux virtual machines, the monitoring agent examines log files or the output of tools like iostat. For vCenter and for the ESX hosts, in our initial prototype, we chose not to install agents:* instead, we use a service virtual machine to bridge between the Socialcast Virtual Machine and vCenter. The service virtual machine communicates with vCenter over the vSphere API in order to read log data, event data, or statistics. In turn, vCenter is also able to retrieve similar data from each ESX host. In this manner, the service virtual machine only needs to authenticate to one server (vCenter) in order to retrieve information for any host in the infrastructure. The Service Virtual Machine, in turn, performs simple aggregations before posting the data to Socialcast on behalf of the appropriate ESX host or vCenter server. The Service Virtual Machine can also choose to like or comment on the data instead, on behalf of the appropriate ESX host or vCenter. Socialcast is then responsible for its own aggregations (for example, showing how many hosts are affected by a problem by displaying the number of ‘likes’, as shown in Figure 5). The service virtual machine monitors ESX hosts and vCenter and posts updates on their behalf to the Socialcast VM. While we could have used a Service Virtual Machine to probe application virtual machines in addition to vCenter and ESX, we made the tradeoff to install agents in application virtual machines for three main reasons.
First, we were trying to locate application-specific behaviors, and the data we needed could not be retrieved via an API. For example, we initially wanted to probe virtual machines running VMware View to detect latency issues, and such information is kept in certain log files rather than exposed publicly. Second, many application virtual machines already have existing monitoring agents, or export standard interfaces like SNMP and WMI, so there is a precedent for an administrator to admit new agents into a virtual machine. Finally, many administrators create virtual machines from templates or catalogs, so it is relatively easy to automate the process of installing an agent upon deployment.
3.3.2 Coupling Virtualization Management and Social Media
To effectively couple virtualization management to social media, our system must perform three functions:
- Discover the relationships between these entities.
- Create the appropriate mapping of these virtualization entities to entities in a social network.
- Monitor each of these entities and post interesting status to Socialcast.
* In some early prototypes, we used agents on vCenter and ESX. However, in order to allow easier deployment in the VMware Hands-on-Lab (see section 4), we opted for this approach.
To discover the relationships between these entities, the Service Virtual Machine uses the VMware Web Services SDK to retrieve topology information about the virtualization infrastructure from the vCenter server. This topology information includes the hosts being managed by the vCenter server, the virtual machines running on each host, and the virtual datacenters and clusters. Once this topology information has been gathered, the Service Virtual Machine maps these entities to members of the social network by making calls to the Socialcast Virtual Machine to create users for the vCenter server, the virtual machines, and the hosts. The Service Virtual Machine also makes calls to the Socialcast Virtual Machine to create groups for the datacenters and clusters. To create the appropriate mapping of the relationships, we have nodes in a hierarchy ‘follow’ their descendants. For example, vCenter follows the hosts it is managing. The ESX hosts ‘follow’ the virtual machines that they are running. Virtual machines are joined to the datacenters or clusters they belong to, as are hosts. As a final step, we link users to their virtual machines, although this is currently a mostly manual step, unless the user has annotated virtual machines with user information in a structured manner amenable to auto discovery.* At this point, the graph database of the social network has a complete map of the virtualization infrastructure.
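The mapping step can be sketched as a pure function from discovered topology to follow relationships and group memberships. The input shape (cluster name to host name to list of virtual machine names) is an assumed simplification of what a topology query might return. No vSphere or Socialcast calls are made, and all host and virtual machine names are hypothetical.

```python
def map_topology(vcenter, topology):
    """Translate an inventory into follow edges and group memberships.

    topology maps cluster name -> {host name -> [virtual machine names]}.
    """
    follows = []  # (follower, followed) pairs
    groups = {}   # group name -> set of member names
    for cluster, hosts in topology.items():
        members = groups.setdefault(cluster, set())
        for host, vms in hosts.items():
            follows.append((vcenter, host))  # vCenter follows each host
            members.add(host)                # hosts join their cluster's group
            for vm in vms:
                follows.append((host, vm))   # each host follows its VMs
                members.add(vm)              # VMs join the cluster's group too
    return follows, groups


topology = {"AppCloudCluster": {"host-01": ["vm-01", "vm-02"], "host-02": ["vm-03"]}}
follows, groups = map_topology("W1-PEVCLOUD-001", topology)
```

Each pair in `follows` and each entry in `groups` would then correspond to one follow or group-join request issued to the social network on the entity's behalf.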
The off-the-shelf Socialcast Virtual Machine is architected primarily for human-to-human interaction and collaboration. Thus, user creation requires an administrator logging into a Socialcast instance in order to create user information and to send an email invitation, or it requires an import from another identity source like LDAP. Moreover, the joining of a group is also typically a manual operation performed by a human. As a result, the Socialcast API in its present form does not support creation of users and joining of groups. However, we have modified the API and the Socialcast Virtual Machine to support both of these operations in an automated way, allowing us to completely script the procedure of adding virtualization entities to the social network.
For monitoring these entities, we choose a variety of metrics. For vCenter itself, several factors are important. We monitor the performance metrics of the vCenter server itself like task latencies by using the vSphere API. We also gather usage metrics like CPU, disk, memory and network: these can be gathered via standard interfaces like SNMP. We also examine the log files of the vCenter server itself: these log files are available via the API, given the user has appropriate permissions.
For ESX hosts, we examine performance statistics and kernel logs that are accessible via the vSphere API. The vSphere API allows administrator users with appropriate roles and permissions to login to the vCenter server and access the kernel logs of the ESX hosts. To reduce spew on Socialcast posts, we filter the kernel messages from the hosts and only post warnings and errors. To gain more insight into high vCenter task latencies, it is sometimes helpful to examine the communication logs between vCenter and the ESX hosts: we use the vSphere API to retrieve these logs. Finally, the vCenter server also has an API for retrieving performance statistics per host.
For virtual machine monitoring, we use a two-pronged approach: we collect resource usage statistics for the virtual machine (CPU, Disk, Memory, and network) using the vSphere API. In addition, our agents collect in-guest resource statistics and also examine log files. For example, certain applications like virtual desktops emit log statements when the frame rate of the desktop is low enough to cause user-perceived latencies. Our agent examines such log files and posts relevant message to Socialcast. For resource usage, we do preliminary trending analysis to see if a problem like high memory usage has occurred and then is resolved, and we post resolution of the message to Socialcast. Because these statistics are being gathered within the guest, we gain some visibility that might not be available by the virtual machine-level metrics collected using the vSphere API.
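One simple way to detect that a resource problem has occurred and later been resolved is to require several consecutive samples on the other side of a threshold before changing state. The sketch below illustrates that idea. It is an assumed stand-in, since the text does not specify the actual trending analysis, and the threshold, window, and sample values are arbitrary.

```python
def trend_events(samples, threshold, window=3):
    """Emit 'raised' after `window` consecutive samples above the threshold,
    and 'resolved' after `window` consecutive samples at or below it."""
    events = []
    active = False  # True while a problem is considered ongoing
    run = 0         # consecutive samples that contradict the current state
    for index, value in enumerate(samples):
        contradicts = (value > threshold) != active
        run = run + 1 if contradicts else 0
        if run >= window:
            active = not active
            run = 0
            events.append((index, "raised" if active else "resolved"))
    return events


# Memory usage in percent, sampled at a fixed interval.
memory_percent = [50, 92, 95, 97, 96, 60, 55, 50, 52]
events = trend_events(memory_percent, threshold=90)
```

For this series the function reports a problem at sample 3 and its resolution at sample 7. Requiring a run of samples keeps a single spike from generating a post.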
4. Monitoring Case Study: VMworld Hands-On-Labs
To validate our design and gain real-world feedback on our approach, we installed our monitoring service at the hands-on labs at VMworld 2012. The hands-on lab allows VMworld attendees to experiment with various VMware products by following a scripted series of steps. There are over 20 different types of labs to showcase various VMware products, and there are nearly 500 users at a time. As a user enters the hands-on lab area and indicates a preference for a lab topic, a provisioning portal determines whether a version of the requested lab is available. If not, the lab is provisioned. A lab consists of a number of virtual machines (between 10 and 17 in most cases) and encompasses a mini virtual datacenter that the user can control.
The hands-on-lab posed unique challenges for our monitoring solution, and required modifications to our original design. The main issue is that the hands-on-lab represents a high churn environment, in which virtual machines are constantly being created and destroyed, existing for an hour on average. We use the vSphere API to track the creation and deletion of virtual machines and appropriately modify the relationships within Socialcast, and this environment stresses such code severely. Moreover, because the load on the ESX hosts is highly variable, virtual machines are migrated quite frequently. Our code uses the vSphere APIs to track the motion of virtual machines and update the relationships appropriately, but the frequency of updating such relationships is much higher than might be expected from a typical social network. For example, 2000 virtual machines might be created, destroyed, or moved every hour for 8 hours. In contrast, a company like VMware, with approximately 17,000 employees, might create a dozen new Socialcast users per day.
*An alternate approach here could be to use commercially available application discovery tools and then provide an API to link the virtual machines to applications and applications to users.
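The relationship bookkeeping described here can be sketched as a small graph that is updated on every inventory event. This is an illustrative model only: the `FollowGraph` class, the event dictionary layout, and the entity names are assumptions for the example, and the real service obtains events from the vSphere API and writes the resulting relationships to Socialcast rather than to in-memory dictionaries.

```python
from collections import defaultdict


class FollowGraph:
    """Tracks which virtual machines each host 'follows'."""

    def __init__(self):
        self.host_to_vms = defaultdict(set)
        self.vm_to_host = {}

    def apply(self, event):
        """Apply one inventory event (hypothetical dictionary layout)."""
        kind, vm = event["kind"], event["vm"]
        if kind == "created":
            self._place(vm, event["host"])
        elif kind == "migrated":
            # A migration is an unfollow on the old host plus a follow on the new one.
            self._remove(vm)
            self._place(vm, event["host"])
        elif kind == "destroyed":
            self._remove(vm)

    def _place(self, vm, host):
        self.vm_to_host[vm] = host
        self.host_to_vms[host].add(vm)

    def _remove(self, vm):
        host = self.vm_to_host.pop(vm, None)
        if host is not None:
            self.host_to_vms[host].discard(vm)


# Invented events, for illustration only.
graph = FollowGraph()
for event in [
    {"kind": "created", "vm": "lab-vm-1", "host": "esx-a"},
    {"kind": "created", "vm": "lab-vm-2", "host": "esx-a"},
    {"kind": "migrated", "vm": "lab-vm-1", "host": "esx-b"},
    {"kind": "destroyed", "vm": "lab-vm-2"},
]:
    graph.apply(event)

print(dict(graph.vm_to_host))
```

Every event maps to one or two relationship updates, so thousands of events per hour translate directly into thousands of follow and unfollow operations against the social network.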
In light of these challenges and to provide adequate performance, our ultimate deployment architecture utilizes multiple Socialcast Virtual Machines organized using Socialcast clustering. We also employ multiple Service Virtual Machines and divide them among the multiple vCenter servers in the Hands-on-Lab. We also have a Service Virtual Machine for doing preliminary monitoring of the cloud management stack (VMware vCloud Director) that is controlling the vCenter servers.
For our initial monitoring, we focused on a few key areas:
1. Resource utilization of management components. We monitored the resource usage of the management software so that we could feed the information back into our core development teams. We show an example in Figure 7, which helped us isolate a given management server that showed much higher CPU usage than others and therefore merited further investigation.
Figure 7: CPU Usage of vCenter servers across infrastructure. The chart is pushed periodically to the administrator’s Socialcast stream. In this post, one of the vCenters is showing much higher CPU utilization than others and might need further investigation.
2. Operational workload on management servers. The hands-on lab represents an extreme of a cloud-like self-service portal. Creating a user’s lab from the self-service portal ultimately results in provisioning operations on the vCenter server, including the cloning of virtual machines from templates, reconfiguring those virtual machines with the proper networking, and then destroying the virtual machines when they are no longer in use. Understanding the breakdown of operations helps developers determine which operations to optimize in order to improve infrastructure performance. We see this in Figure 8, in which we show the breakdown of tasks across all vCenter instances and notice a pattern in the workflow. Specifically, vCenter appears to perform multiple reconfigure operations per VM power operation. We can thus investigate reducing the number of reconfiguration operations as a possible orchestration optimization.
Figure 8: Operational Workload on Management Servers. The standard self-service workflow requires multiple virtual machine reconfigure operations before powering on the VM. This represents a potential optimization opportunity.
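A breakdown like the one in Figure 8 amounts to counting tasks by operation type across all management servers and comparing the counts. The sketch below shows that aggregation under stated assumptions: the `task_breakdown` function, the server names, the operation names, and the task list are invented for illustration and do not reproduce the measured data.

```python
from collections import Counter


def task_breakdown(tasks):
    """Count tasks by operation type across all management servers."""
    return Counter(operation for _server, operation in tasks)


# Invented (server, operation) records, for illustration only.
tasks = [
    ("vc-a", "clone"), ("vc-a", "reconfigure"), ("vc-a", "reconfigure"),
    ("vc-a", "power_on"),
    ("vc-b", "clone"), ("vc-b", "reconfigure"), ("vc-b", "reconfigure"),
    ("vc-b", "reconfigure"), ("vc-b", "power_on"), ("vc-b", "destroy"),
]

counts = task_breakdown(tasks)
reconfigures_per_power_on = counts["reconfigure"] / counts["power_on"]
print(counts.most_common())
print(f"reconfigure operations per power-on: {reconfigures_per_power_on:.1f}")
```

A ratio well above one suggests that the workflow reconfigures each virtual machine several times before powering it on, which is the pattern that prompted the optimization discussed above.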
3. Alarms/Errors on virtual machines, ESX hosts and vCenter servers. We monitor kernel-level error events on ESX hosts and warning/error events on vCenter servers. We also monitor alarm conditions on virtual machines. An example of a kernel-level ESX host error event is loss of connectivity to shared storage. Alarm conditions on virtual machines include high CPU utilization and high disk utilization. For the alarms that are already built into the vCenter server (for example, high virtual machine CPU usage), we leverage vCenter’s alarm mechanism, while for others (like high disk usage within a virtual machine), we use agents within the virtual machines to monitor these conditions and proactively alert Socialcast.
There were several areas in which our approach had notable advantages over conventional approaches. For example, while the resource utilization of the various management components is available via the API, it can be complex to retrieve this data across multiple installations. By collecting such data and putting it in a single pane of glass, we provided an at-a-glance view of the health of the infrastructure.
By dissecting the operational workload of the datacenter, we were able to determine some possible areas of optimization for our management stack. For example, certain operations require reconfiguration of virtual machines multiple times before the virtual machine is ultimately deployed. By tracking operational metrics and posting them for a group of management components, we were able to see the severity of this problem immediately.
By maintaining relationships between virtual machines and their hosts in an intuitive way, we were able to reason about certain operations more easily. For example, we had deployed our monitoring virtual machines across the infrastructure. At some point, we wanted to move some virtual machines from one host to another. We had created a special group consisting of just our monitoring virtual machines, and by listing the members of this group, we could easily find all of our monitoring virtual machines, and then by ‘mousing over’ those virtual machines, we could find the hosts on which they resided. Note that this is possible in a standard virtualized infrastructure, but the social network metaphor provides a natural way to perform such a search.
A final example illustrates an unintentional synergy between social media metaphors and virtualization management. One of the administrators wanted to change a cluster to allow automated virtual machine migrations and wanted to see how many migrations would result. Because we had tagged each migration with a hashtag (#Success_drm_executevmotionlro in this case), we were instantly able to search for the number of instances of that hashtag within a given cluster and find out how many migrations resulted from changing the setting in two different clusters. This case is shown in Figure 9.
Figure 9: Synergy between social media and virtualization management with hashtags. By tagging successful tasks with a hashtag (#Success_drm_executevmotionlro), we were able to instantly determine the number of such tasks performed by 2 different vCenter servers (las-cg39, 909 times, and las-cg41, 534 times), without adding any customized aggregation code.
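In our deployment this count came from the built-in hashtag search, with no custom code. For readers who want to reproduce the same aggregation offline on an exported stream, a minimal sketch follows. The `count_hashtag` function, the (author, body) record layout, and the sample posts are assumptions for illustration; the server names are taken from Figure 9, but the counts below are invented and are not the figures reported there.

```python
import re
from collections import Counter

# A hashtag is '#' followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#\w+")


def count_hashtag(posts, tag):
    """Count, per author, the posts whose body contains the given hashtag."""
    wanted = tag.lower()
    counts = Counter()
    for author, body in posts:
        tags = {found.lower() for found in HASHTAG.findall(body)}
        if wanted in tags:
            counts[author] += 1
    return counts


# Invented posts, for illustration only.
posts = [
    ("las-cg39", "Migrated vm-17 #Success_drm_executevmotionlro"),
    ("las-cg39", "Migrated vm-22 #Success_drm_executevmotionlro"),
    ("las-cg41", "Migrated vm-03 #Success_drm_executevmotionlro"),
    ("las-cg41", "Powered on vm-03 #Success_poweron"),
]

print(count_hashtag(posts, "#Success_drm_executevmotionlro"))
```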
5. Related Work
Many corporations have used the Socialcast developer API to create real-time communication and collaboration tools. One example is an integration of Socialcast and Microsoft SharePoint, in which a Socialcast community can be embedded into a SharePoint site. Communication within the site can be viewed outside of SharePoint, and newcomers to the group can view the archives of previous conversations. In this paper, we have extended the notion of a community to include not just humans but also entities like virtual machines and hosts. Virtual machines and hosts can use automated monitoring in order to ‘communicate’ with each other and with humans. We have also used the metaphors of social media to assist in virtualization management. While adding non-humans to social networks is not a new idea, tying these social media metaphors to virtualization management is novel.
As indicated in section 2, virtualization management is in some sense already a form of online community. Virtual machines might generate alarms that can be viewed by vCenter, and vCenter can in turn email administrators with the alarm information. What is missing is an easy-to-use aggregation system based on arbitrary tags. VMware vSphere already contains tagging capabilities, so an administrator can aggregate virtual machines and perhaps utilize analytics tools like vCenter Operations Manager™ for generating alerts based on arbitrary aggregations.
Compared to standard virtualization management tools, our monitoring is more flexible: the monitoring can easily be customized for a particular type of virtual machine. Rather than trying to define a custom alarm for each type of application, a virtual machine owner can simply collect various application-level metrics within the virtual machine and then update status as required: an Exchange virtual machine owner might select messages processed per second; the owner of a virtual machine that does compilation jobs might choose to collect the build times and trigger a status change if the build times suddenly get worse; the owner of a virtual machine that is acting as an NFS datastore might trigger a notice if disk space within the virtual machine gets low. In each case, a simple shell-based script and GET/POST requests are all that are required to provide powerful notification capabilities.
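To make the last point concrete, the sketch below builds (but does not send) an HTTP POST carrying a low-disk-space notice. It is written in Python rather than shell for portability. Several details are assumptions rather than confirmed facts: the `/api/messages.json` path and the `message[body]` form field reflect our understanding of the Socialcast developer API and should be checked against its documentation, and the host name, credentials, `#LowDiskSpace` tag, and both function names are hypothetical.

```python
import base64
import urllib.parse
import urllib.request


def build_status_post(base_url, user, password, body):
    """Build (but do not send) an HTTP POST that publishes a status message."""
    # Assumed endpoint and form field; verify against the Socialcast API docs.
    url = base_url.rstrip("/") + "/api/messages.json"
    data = urllib.parse.urlencode({"message[body]": body}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, data=data)  # data present => POST
    request.add_header("Authorization", f"Basic {token}")
    return request


def disk_notice(used_bytes, total_bytes, limit_pct=90):
    """Return a status message if disk usage exceeds limit_pct, else None."""
    used_pct = 100 * used_bytes / total_bytes
    if used_pct > limit_pct:
        return f"#LowDiskSpace disk usage at {used_pct:.0f}%"
    return None


# Hypothetical host, account, and usage numbers, for illustration only.
notice = disk_notice(used_bytes=95, total_bytes=100)
if notice is not None:
    request = build_status_post(
        "https://socialcast.example.com", "nfs-vm-01", "secret", notice
    )
    print(request.get_method(), request.full_url)
    # urllib.request.urlopen(request) would send it; omitted in this sketch.
```

The same pattern applies to the other examples: only the function that decides whether to post changes from one application to the next.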
6. Conclusions and Future Work
In this paper, we propose a social-media approach to monitoring virtualized environments. We draw an analogy between a social network and the network created by virtual machines, hosts, and vCenter servers. In a social network, users can follow each other’s updates, send each other messages, and create closed groups within the social network for selective communication; in a similar way, we propose taking a virtualized environment and creating a ‘community’ that includes the various entities of a virtualized environment along with the system administrator. Each entity (whether human or not) is capable of using a simple API to communicate status updates. By judicious creation of hierarchies (for example, having administrators ‘follow’ vCenter servers, vCenter servers follow hosts, and hosts follow virtual machines) and by using information reduction techniques (for example, ‘liking’ various types of messages instead of posting the same messages repeatedly), we can avoid excess information flow to the administrator, while still allowing the administrator to view important status updates in the environment. Moreover, the simple API also allows administrators the ability to send commands to any entity, providing a technique for platform-independent remote management via any mobile device.
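The information reduction technique mentioned above, ‘liking’ an existing message rather than posting it again, can be modeled in a few lines. This is a simplified in-memory sketch: the `Stream` class and its `report` method are hypothetical names, and matching on the exact message body is an assumption made for the example; a real client would issue the corresponding post and like operations against the social network.

```python
class Stream:
    """A community stream that collapses repeated messages into 'likes'."""

    def __init__(self):
        # Maps message body to the number of likes it has received.
        self.posts = {}

    def report(self, body):
        """Post a new message, or like the existing one if it was already posted."""
        if body in self.posts:
            self.posts[body] += 1
            return "liked"
        self.posts[body] = 0
        return "posted"


# Invented messages, for illustration only.
stream = Stream()
actions = [
    stream.report("esx-a: lost connectivity to shared storage"),
    stream.report("esx-a: lost connectivity to shared storage"),
    stream.report("esx-a: lost connectivity to shared storage"),
    stream.report("esx-b: high CPU usage"),
]
print(actions)
print(stream.posts)
```

The administrator then sees two entries instead of four, and the like count indicates how often each condition recurred.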
The approach described in this paper might be quite disruptive for an existing environment. For the user who is reluctant to turn an entire inventory into a social network, there are intermediate steps to validate the approach. The first step is to simply incorporate the data streams from vCenter into a feed that is posted to a Socialcast site. This is the traditional use of a collaboration tool like Socialcast: administrators use a central repository for data storage, data sharing, and communication, and incorporate external data sources. The next step might be to add hosts but not virtual machines to the social network. The final stage would be to fully embrace the social networking model by adding simple Socialcast clients to each virtual machine and host and allowing virtual machines and hosts to follow and be followed.
There are many avenues for future work. First and foremost, we have learned that an integrated view of the relationships in a virtualization hierarchy is important, so we intend to continue adding more and more entities to the social network. For example, we can add virtual and physical networks, network switches, and intelligent storage devices. We can also add more application awareness and associate virtual machines with each other based on whether they communicate with each other. We can also refine our algorithms for associating users with virtual machines and applications to make that process more automated. As we increase the number of entities and possible churn, we must continue to tune the performance of our service virtual machines and the Socialcast VM.
Another promising avenue for future work is integration with other data sources like vCenter Operations Manager or Zenoss. Assuming modifications to the Operations Manager server to provide notifications on anomalies, the administrator can follow the Operations Manager and be apprised of anomalies or alarms. We can also envision even richer use of the data from this social network of virtualized entities. For example, the various community streams can be uploaded to off-line analytics engines to provide interesting statistics on a given environment, like the most commonly misbehaving virtual machines or hosts or the most common virtual machine error messages, or even the most common scripts that are run on a host. Socialcast itself has some basic analytics which are extremely valuable: we can perhaps imagine extending the architecture to allow plugins that provide virtualization-specific analytics.
While the use cases presented in this paper have focused on monitoring, we can also envision allowing simple commands to be sent via this interface. One simple example, as mentioned earlier, might be allowing a user to send private messages to an individual virtual machine to reboot or provide diagnostic information, although this would require strict security controls (limiting access to administrators or virtual machine owners, for example), or possibly enriching the interface to allow right-click actions on a given entity. This might require leveraging existing access control systems and coupling them to Socialcast’s mechanisms for creating users and assigning permissions. Finally, we can also potentially embed the administrator’s social media web page directly into the vSphere web-based client, providing a complete one-stop shop for virtualization management and monitoring, combining standard paradigms with an intuitive yet novel social media twist.
We thank Rajat Goel for providing an on-premise Socialcast virtual appliance that we could deploy for testing purposes. We thank Sean Cashin for numerous hints for using the Socialcast API effectively. We thank Steve Herrod and the office of the CTO for helpful comments on our work. Finally, we thank Conrad Albrecht-Buehler for extremely helpful comments on our paper.
- Facebook, www.facebook.com
- Haxx. ‘libcurl: the multiprotocol file transfer library,’ http://curl.haxx.se/libcurl/
- Holland, S.W. Social Networking with Autonomous Agents. United States Patent US 2012/0066301, http://www.google.com/patents/US20120066301
- Linux iostat utility, http://www.unix.com/apropos-man/all/0/iostat/
- Linux vmstat utility, http://nixdoc.net/man-pages/Linux/man8/vmstat.8.html
- Microsoft. Perfmon, http://technet.microsoft.com/en-us/library-bb490957.aspx
- Socialcast. Making Sharepoint Social: Integrating Socialcast and SharePoint Using Reach and API, http://blog.socialcast.com/making-sharepoint-social-integrating-socialcast-and-sharepoint-using-reach-and-api
- Socialcast. Socialcast Developer API, http://www.socialcast.com/resources/api.html
- SPEC. SpecJBB2005, http://www.spec.org/jbb2005/
- Twitter. “What are hashtags?” https://support.twitter.com/articles/49309-what-are-hashtags-symbols#
- VMware. VMware API Reference Documentation, https://www.vmware.com/support/pubs/sdk_pubs.html
- VMware. VMware vCenter Application Discovery Manager, http://www.vmware.com/products/application-discovery-manager/overview.html
- VMware. VMware vCenter Operations Manager, http://www.vmware.com/products/datacenter-virtualization/vcenter-operations-management/overview.html
- VMware. VMware vCloud Director, http://www.vmware.com/products/vcloud-director/overview.html
- VMware. VMware vSphere, http://www.vmware.com/products/datacenter-virtualization/vsphere/overview.html
- VMware. VMware View, http://www.vmware.com/products/view/overview.html
- VMware. VMware vStorage APIs for Array Integration, http://communities.vmware.com/docs/DOC-14090
- VMware. VMworld 2012, http://www.vmworld.com/community/conference/us
- VMware. VMworld Europe 2012, http://www.vmworld.com/community/conference/europe/
- Zenoss. Zenoss Virtualization Monitoring, http://www.zenoss.com/solution/virtualization-monitoring
- Zimman, A., Roberts, C., and Van Der Welt, Mornay. VMworld 2011 Hands-on Labs: Implementation and Workflow. VMware Technical Journal, Vol 1. No. 1. April 2012. pp. 70-80.