The Dynamic Data Tracker: A Software Package to Support Data Tracing

Mang Kwan Ma
VMware, Ecosystem
and Solution Engineering
kwanma@vmware.com

Naveen Ravenna
VMware, Ecosystem
and Solution Engineering
revannan@vmware.com

Chitrank Seshadri
VMware, Ecosystem
and Solution Engineering
chitranks@vmware.com

Abstract

When debugging kernel problems that are tied to a particular type of data, developers often instrument code to trace the execution paths on which the data of interest has travelled and to collect the change history of this data as it moves along those paths. However, manual code instrumentation has some drawbacks. It is tedious, and it is easy to miss some trace points when the code area to cover is huge. If instrumentation code is left in the source file, it can clutter the source code and obscure the main logic. If it is removed after debugging, the developer might have to go to the trouble of putting the instrumentation back in again if the need arises in the future. In this paper, we detail the Dynamic Data Tracker, a software package that uses VProbes to instrument code for tracing code paths and for collecting the change history of data of interest along those paths. The Dynamic Data Tracker is a work in progress.

1. Introduction

It is quite common to do data tracing when investigating software issues. For example, when we have data communication issues, we often resort to capturing the network traffic. By examining the data packets, we gain some understanding of the state of the communication between two network endpoints. Likewise, when we have storage issues, we trace the SCSI bus or the Fibre Channel fabric to see what is going on between the host controller and the storage targets. These types of data snooping are helpful and easy to do because the data is passing through a single physical transport medium. But what if the data path is a software path instead of a hardware path? How do we snoop the data that travels in software? By software path we mean the code path that a particular type of data can traverse between multiple layers or components of software. By data we mean the software objects that encapsulate the payload on a software path. An object can be simple primitive data, such as an integer, or a large object, such as a networking packet. One example that easily comes to mind is the code path of an I/O request from the initiator down to the device driver. Another example is tracking the movement of a particular world that is constantly migrated among different CPUs due to scheduling. In this example, the code path of interest is inside the scheduler, and the data of interest is the descriptor of the world being tracked. One can imagine many code paths that can be of interest, for whatever reason, when troubleshooting an issue. But we do not have a data analyzer that we can easily plug into software paths, as we have for hardware paths. Besides, often we do not even know where the software paths are, let alone how to trace the data that gets passed along them. This is a challenge that engineers from VMware partners often face when they work with VMware software, and vice versa.

VMware partners often write low-level system code to integrate their products with VMware® ESXi™. As in any engineering work, issues always come up during development. Partners are dealing with VMware software that is totally unfamiliar to them, and it is not easy to analyze the root cause without some understanding of the dynamics inside VMware software. Likewise, VMware engineers find it difficult to assist partners with troubleshooting issues if they do not know what is happening in the partner software. Often, both parties work independently due to time-zone differences. Having a tool that can produce some kind of execution log can be helpful. Partners can send VMware the log related to the execution in VMware software, and VMware can send the log related to the execution of the partners’ software. Together, both sides can collaborate in piecing together a picture of the dynamics happening in the software when the issue shows up. One traditional way to capture data within software is to use a debugger. But a debugger is only useful for finding the root cause of a problem when we are able to narrow down where the problem might be to a small area of software. During initial root-cause analysis, the area in software where a problem might originate can be huge. Using a debugger to do initial debug-information gathering can be time-consuming; besides, the amount of debugging data, initially, can be huge as well. A debugger is not the ideal tool in such an early phase of root-cause analysis. We want a lightweight process for gathering debugging information so that the cost of repeating the process is low.

Another traditional way to capture debugging data within software is to edit the software manually to add instrumentation code. This method can be effective, but it is tedious and requires some degree of knowledge of the areas of code involved in order to capture useful data. If the area of code involved is huge, it takes time to instrument, and the developer is likely to miss instrumenting some area that is not obvious. Another undesirable result is that if we leave too much instrumentation in our code, it will clutter the source and obscure the original logic of the code that we set out to instrument. But if we take out the instrumentation code, we might end up putting it back in again when the need arises in the future.

The idea of a dynamic data tracker is that it serves as a software tool analogous to a hardware data analyzer: a tool that can capture data on any code path. Such a tool should not leave instrumentation code behind in the original code, and it should take little effort to enable data capturing on any code path. After the data is captured, users can analyze it in any way they want.

2. How the Dynamic Data Tracker Works

Before capturing any data, we first need to identify all the access points of the data of interest. By access points we mean the code sites where the data is being read or modified. By data we mean any data; it can range from a primitive type one byte in size to a multipage data object. For now, we limit our interest to objects allocated from a heap. To detect a data object being accessed, we implemented in ESXi a mechanism similar to the data watchpoint feature normally found on a CPU. We need to reimplement data watchpoints in software because on the x86_64 architecture, processors have only four debug registers that can be used to support data watchpoints. This is a severe limitation when large numbers of objects need to be tracked at all times. Besides the constraint of allowing only four data watchpoints, the size of the object to be watched is constrained to at most 8 bytes, which is too small for the average size of objects found in the VMkernel. Reimplementing watchpoints in software removes both constraints.
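For illustration, here is a minimal sketch of the per-watchpoint bookkeeping that such a software scheme implies. The structure name and layout are our own illustration, not the actual Datavisor implementation:

typedef unsigned long long uint64;   /* assumed; matches usage in this paper */

/* Hypothetical bookkeeping for one software watchpoint. Unlike the
 * four 8-byte hardware debug registers on x86_64, any number of these
 * entries can be allocated, and they can cover objects of any size. */
struct SoftwareWatchpoint {
  uint64 watchpointAddr;   /* unbacked virtual address handed to the code */
  uint64 objectAddr;       /* real address of the tracked heap object */
  uint64 objectSize;       /* object size in bytes; not limited to 8 */
};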

When a data object is being tracked using a software data watchpoint, access to any part of the object produces a trap. The trap handler saves the location of the access point, the code path, and the type of access in an access record. Each access record represents a unique access point. Access records are written to a journal file for processing at a later time. Watchpoints are dispensed to objects using the VProbes facility on ESXi [1]. Before objects can be tracked, we first prepare an Emmett script that defines a list of dynamic probes. The dynamic probes are installed in the software at each construction and destruction site of the objects of interest. When an object of interest is created, the dynamic probe tags it with a software watchpoint; when the object is destroyed, the dynamic probe untags it. When the Emmett script that tags watchpoints on objects is ready, we can load it and run the test that reproduces the issue that we are trying to analyze. As the test runs, the objects of interest are created and tagged with software watchpoints, and when the objects are accessed, the watchpoint trap handler records the access locations. At the end of the test, we get a journal that contains all the access records.
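As a sketch, an access record might carry fields along these lines; the layout is our illustration, since the paper specifies only what information is captured:

/* Hypothetical layout of one access record in the journal.
 * One record is kept per unique access point. */
struct AccessRecord {
  uint64 instructionAddr;   /* location (code site) of the access point */
  uint64 backtraceID;       /* identifies the code path (backtrace) */
  uint64 accessType;        /* read or write */
};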

The journal file is then passed through a standalone tool to generate yet another Emmett script. The script contains dynamic probes that will be installed at each access point recorded in the journal. After the script is generated, we can manually edit the script to add whatever logic we want to produce the desired debugging information to help troubleshoot the issue at hand. When we finish editing the script, we can load it on ESXi and rerun the test. With this script, we get debugging information at the point of access of the objects. We can reuse the script again and again as we adjust the test to help narrow down the root cause of the issue.

The flow chart in Figure 1 depicts the use-case scenarios for the Dynamic Data Tracker.

Figure 1. Use-Case Scenarios for the Dynamic Data Tracker.

3. Design of the Dynamic Data Tracker

The Dynamic Data Tracker is a package that contains the Software Watchpoint Dispenser, the Datavisor, and the Data Tracker Fabricator.

3.1 Software Watchpoint Dispenser
The Software Watchpoint Dispenser is implemented as an Emmett script. It is a template for the user to use as a base for creating the watchpoint-tagging Emmett script. The Software Watchpoint Dispenser implements the basic logic for the dynamic probes that are inserted at the construction and destruction sites of the data objects of interest.
At the exit of an object constructor, the Software Watchpoint Dispenser inserts a dynamic probe that is used to communicate with the Datavisor. When this probe is fired, the object has been created, and the return value from the constructor, which is the address of the object, is passed as an argument to the probe. The probe communicates with the Datavisor to request a software watchpoint, providing the address and size of the object as input. The watchpoint allocated by the Datavisor is passed back to the original code by the probe as the apparent address of the object just constructed. Likewise, at the entry to the object destructor, the Software Watchpoint Dispenser installs a dynamic probe that informs the Datavisor to free the watchpoint associated with the object. When this dynamic probe is fired, the watchpoint is passed as an argument to the probe. The probe sends the watchpoint in a request to the Datavisor for deallocation, and the Datavisor returns the original address of the object associated with the watchpoint. The probe passes this address to the original code to free the object.
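In rough C-like pseudocode (the real probes are written in Emmett, and every name here is illustrative), the two probes behave along these lines; RequestWatchpoint() and FreeWatchpoint() stand for the fence-based exchange with the Datavisor described in section 3.2:

uint64 RequestWatchpoint(uint64 objAddr, uint64 objSize);  /* see 3.2 */
uint64 FreeWatchpoint(uint64 watchpoint);                  /* see 3.2 */

/* Constructor-exit probe: swap the real object address for a watchpoint. */
void OnConstructorExit(uint64 *retval, uint64 objSize)
{
  uint64 wp = RequestWatchpoint(*retval, objSize);  /* ask the Datavisor */
  if (wp != 0) {
    *retval = wp;   /* the calling code now holds the watchpoint address */
  }
}

/* Destructor-entry probe: swap the watchpoint back for the real address. */
void OnDestructorEntry(uint64 *objArg)
{
  uint64 orig = FreeWatchpoint(*objArg);  /* returns the original address */
  if (orig != 0) {
    *objArg = orig;   /* the destructor frees the real object */
  }
}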

The Software Watchpoint Dispenser exchanges data with the Datavisor via a per-CPU memory region maintained by the Datavisor. More details on this data exchange are given in section 3.2.

The Software Watchpoint Dispenser might need some customization for each use case, because the construction and destruction sites within the targeted software layers differ from object to object. But, for the most part, the Software Watchpoint Dispenser does not need many changes; mostly, the locations of the constructor and destructor must be specified for the dynamic probes. For objects that have well-defined constructors and destructors, the changes are minimal.

3.2 Datavisor

The Datavisor is implemented as a VMkernel module. It serves the following functions:

  • Allocates and deallocates software watchpoints
  • Keeps a mapping of software watchpoints to objects
  • Keeps a journal of the access history of the watchpoints
  • Carries out the load or store operations on the objects that are being watched

The Datavisor has five entry points: the init_module() and cleanup_module() callbacks, a page-fault handler, and a put and a get handler for a VSI node.

At Datavisor loading time, init_module() is invoked. The key thing that init_module() does, among all the essential initialization that it carries out, is to replace the system page-fault handler with the Datavisor page-fault handler. During module unload time, cleanup_module() is invoked, and it reverts the work done by init_module(). The important cleanup work is to restore the original system page-fault handler.
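A minimal sketch of this handler swap, assuming a hypothetical InstallPageFaultHandler() hook; the real ESXi interfaces are internal and not named in this paper:

/* Hypothetical handler-swap logic; the hook and handler types are
 * illustrative, not real VMkernel interfaces. */
typedef void (*PageFaultHandler)(struct FaultFrame *frame, uint64 faultAddr);

void DatavisorPageFaultHandler(struct FaultFrame *frame, uint64 faultAddr);
PageFaultHandler InstallPageFaultHandler(PageFaultHandler handler);

static PageFaultHandler origHandler;

int init_module(void)
{
  /* ...essential initialization elided... */
  origHandler = InstallPageFaultHandler(DatavisorPageFaultHandler);
  return 0;
}

void cleanup_module(void)
{
  InstallPageFaultHandler(origHandler);  /* restore the system handler */
  /* ...release remaining resources... */
}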

When a page fault happens, the Datavisor page-fault handler is invoked. The handler first checks if the faulting address is one of the addresses that Datavisor allocates. If the address is not from Datavisor, then the page fault is passed to the original system page-fault handler to process. If the address is indeed allocated by Datavisor, the Datavisor page-fault handler processes the page fault. The details of the processing are described in the sections below.

Addresses allocated by Datavisor are either software watchpoints or the fence address. These are virtual addresses that have no physical backing memory; they are allocated from one of several virtual memory regions that are not in use. The fence address is fixed, and watchpoint addresses are allocated on demand. Accessing any of these addresses causes a page fault. The Datavisor uses a per-CPU data area to communicate with the Software Watchpoint Dispenser. The per-CPU data area is defined as follows:

struct DatavisorCommBlock {
  uint64 message;      /* request/response code; see enum values below */
  uint64 objectaddr;   /* address of the tracked object */
  uint64 objectsize;   /* size of the object in bytes */
  uint64 watchpoint;   /* watchpoint address allocated by the Datavisor */
  uint64 status;       /* VMK_OK or VMK_FAILURE */
  void *fence;         /* writing here faults into the Datavisor */
};

fence is a virtual address initialized by Datavisor. message is an enum that takes one of the following values: WPALLOC, WPALLOCDONE, WPFREE, and WPFREEDONE.
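A sketch of the corresponding declaration (the ordering of the values is our assumption):

/* Request/response codes exchanged through the message field. */
enum DatavisorMessage {
  WPALLOC,       /* dispenser requests a watchpoint */
  WPALLOCDONE,   /* Datavisor: allocation request served */
  WPFREE,        /* dispenser asks to free a watchpoint */
  WPFREEDONE     /* Datavisor: free request served */
};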

When the Software Watchpoint Dispenser needs a watchpoint, it sets message to WPALLOC, fills objectaddr with the address of the data object for which the watchpoint is allocated, and fills objectsize with the size of the object in bytes. It then writes a value to the address pointed to by fence. The write to fence triggers a page fault, and the Datavisor page-fault handler is entered. On finding WPALLOC in message, the handler allocates a watchpoint address; if the allocation succeeds, it puts the address in watchpoint and VMK_OK in status, or it puts VMK_FAILURE in status if the allocation fails. It puts WPALLOCDONE in message to signify that the allocation request was served, adjusts the return address of the fault handler to the instruction after the one that triggered the page fault, and then returns from the page fault. When the Software Watchpoint Dispenser resumes execution, the new watchpoint can be found in the watchpoint field.
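A minimal sketch of this handshake from the dispenser side, assuming a hypothetical PerCpuCommBlock() accessor for the per-CPU data area:

/* Sketch of the WPALLOC exchange. The fence write below faults into
 * the Datavisor page-fault handler, which fills in the reply before
 * resuming execution at the next instruction. */
uint64 RequestWatchpoint(uint64 objAddr, uint64 objSize)
{
  struct DatavisorCommBlock *comm = PerCpuCommBlock();  /* hypothetical */

  comm->message    = WPALLOC;
  comm->objectaddr = objAddr;
  comm->objectsize = objSize;
  *(volatile uint64 *)comm->fence = 1;   /* triggers the page fault */
  /* The handler has returned; execution resumes here. */
  if (comm->message == WPALLOCDONE && comm->status == VMK_OK) {
    return comm->watchpoint;
  }
  return 0;   /* allocation failed */
}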

A similar exchange is used when the Software Watchpoint Dispenser frees a watchpoint. The difference is that message is set to WPFREE and watchpoint is filled with the watchpoint to be freed. When the page-fault handler returns from the fault caused by the write access to fence, it has put WPFREEDONE in message and VMK_OK in status if the free was successful, or VMK_FAILURE if it failed; if the free was successful, it has also put the address of the object associated with the watchpoint in objectaddr.

If a page fault happens on an address allocated by Datavisor that is not the fence address, then the address is a watchpoint. When Datavisor sees that a watchpoint is causing a page fault, it checks whether the backtrace of the stack has been seen before. If the backtrace indicates an already known access point, Datavisor executes the faulting instruction with the watchpoint address in its operands replaced by the address of the object that maps to the watchpoint. In this way, the correct load or store operation is performed on the object. After the execution completes, Datavisor arranges for the page-fault handler to return to the next instruction. If Datavisor determines that the backtrace identifies a new access point, it allocates a numeric ID and creates a mapping that associates the ID with the backtrace. Datavisor also examines the faulting instruction to determine whether it was performing a read or a write access to the watchpoint. A new access record is created to store the backtrace ID, the address of the faulting instruction, and the access type. The access record is then written to a journal.

The put and the get handler of the VSI node are for controlling the Datavisor and for getting its status and statistics. A VSI node is a simple infrastructure within ESXi that is used for communication between user space and a kernel module. VSI nodes are arranged in a hierarchical structure like a file system. Writing to a VSI node invokes the put handler and, likewise, reading a VSI node invokes the get handler. Writing the value 0 to the Datavisor’s VSI node tells Datavisor to release all allocated resources and to restart. Writing the value 1 tells Datavisor to flush all cached access records to the journal and to write out all the ID-to-backtrace mappings to a file. Performing a read on the Datavisor VSI node returns simple statistics for debugging purposes.
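A sketch of how the page-fault handler might dispatch these cases; origHandler is the saved system handler from the earlier sketch, and all other helper names are hypothetical:

/* Sketch of the Datavisor page-fault dispatch. */
void DatavisorPageFaultHandler(struct FaultFrame *frame, uint64 faultAddr)
{
  if (!IsDatavisorAddress(faultAddr)) {
    origHandler(frame, faultAddr);   /* not ours: pass it through */
    return;
  }
  if (IsFenceAddress(faultAddr)) {
    /* WPALLOC or WPFREE request from the dispenser. */
    ServiceCommBlockRequest(PerCpuCommBlock());
  } else {
    /* A watchpoint access: record the access point, then execute the
     * faulting instruction against the real object address. */
    HandleWatchpointAccess(frame, faultAddr);
  }
  SkipFaultingInstruction(frame);   /* resume at the next instruction */
}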

3.3 Data Tracker Fabricator

The Data Tracker Fabricator is a standalone tool implemented in Python. It takes the following as input: the access journal, the ID-to-backtrace mapping file produced by the Datavisor, and the symbol addresses of the ESXi host where the Datavisor is installed. The Data Tracker Fabricator produces an Emmett script that defines a list of dynamic probes. Each probe corresponds to an access point recorded in the access journal. The Data Tracker Fabricator also generates a call graph to show the location of each access point in graphical form.

The Emmett script produced can be further edited by hand to add whatever debugging logic is desirable for tracing the data objects and for troubleshooting the problem at hand. One way to use the dynamic probes is to have them log the critical data inside the data object together with the address of the data object, a timestamp, the backtrace ID, and the address of the instruction associated with the probe. We then load the Emmett script to install the dynamic probes and rerun the test to reproduce the issue. When the test is done, the probes collectively generate a log that provides the history of the data during the test. A tool can be written to replay part of the log. As it replays each log record, the tool can filter out a particular set of data objects and display their access history graphically, showing the backtrace as the code path and the content of the data at the time of each access.
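In C-like pseudocode, a generated probe with such logging added by hand might look roughly like this; one probe is emitted per access point, and the object type, field name, and logging call are all hypothetical:

/* Illustrative body of one generated dynamic probe after hand editing.
 * BACKTRACE_ID and INSTR_ADDR would be filled in by the Fabricator. */
void ProbeAtAccessPoint(struct TrackedObject *obj)
{
  Log("t=%lu obj=%p bt=%lu ip=%lx field=%lu",
      Timestamp(),            /* when the access happened */
      obj,                    /* which object was accessed */
      BACKTRACE_ID,           /* which code path reached it */
      INSTR_ADDR,             /* which instruction accessed it */
      obj->criticalField);    /* the data being traced */
}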

4. Related Work

Previously published work related to data tracking was usually done in the context of general interactive debugging. This is different from our purpose, which is to support massive data tracing for an extended period of time.

A similar concept for software watchpoints has also been developed. However, that implementation relies on dynamic code transformation, injecting additional code to check each memory reference against the set of watchpoints [2] [3]. Although such an implementation is effective, it incurs as much as a 3x performance degradation even after effective optimization techniques [2] are applied.

In our implementation, we chose a two-phase approach. In the first phase, we discover all the access points of the data of interest during the first test run. In the second phase, we install dynamic probes at the access points and rerun the test. There seems to be more work up front with our approach, but subsequent test runs incur only the overhead that is needed to fire the dynamic probe at a specific point in the code path.

The goal for our implementation is not to support interactive debugging, but rather to trace specific types of data that are of interest for debugging the problem at hand and for saving the access history of the data for subsequent root-cause analysis. We are also interested in using the data as a beacon to help trace out the code path that the data is passing along.

5. Summary

We implemented a prototype of the Dynamic Data Tracker as a proof of concept, and it demonstrated that the idea is feasible. Development work on the Dynamic Data Tracker is still ongoing. We want to build more tools to increase our capability to perform various post-run analyses of the access history of data. We expect to use this tool for root-cause analysis, especially in areas of code that are unfamiliar to VMware, such as code developed by our partners. Likewise, we expect our partners to benefit from this tool in a similar way. We also expect to use the tool to look for anomalies in various data paths, to identify bottlenecks, and to study the data access patterns of hot paths. With those patterns, we can build simulators that model the data access behavior of such paths, to gain insight into how to optimize data placement and data access sequence.

References

1. Martim Carbone, Alok Kataria, Radu Rugina, and Vivek Thampi. “VProbes: Deep Observability into the ESXi Hypervisor.” VMware Technical Journal, Summer 2014.
2. Qin Zhao, Rodric Rabbah, Saman Amarasinghe, Larry Rudolph, and Weng-Fai Wong. “How to Do a Million Watchpoints: Efficient Debugging Using Dynamic Instrumentation.” Proceedings of the International Conference on Compiler Construction (CC), 2008.
3. M. Copperman and J. Thomas. “Poor Man’s Watchpoints.”
4. R. Wahbe, S. Lucco, and S. L. Graham. “Practical Data Breakpoints: Design and Implementation.” Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (1993), pp. 1–12.