Improving Dynamic Binary Optimization Through Early-Exit Guided Code Region Formation

Chun-Chen Hsu       Pangfeng Liu
National Taiwan University
(d95006,pangfeng)@csie.ntu.edu.tw

Jan-Jan Wu
Institute of Information Science, Academia Sinica
wuj@iis.sinica.edu.tw

Pen-Chung Yew
University of Minnesota
yew@cs.umn.edu

Ding-Yong Hong
Institute of Information Science, Academia Sinica
dyhong@iis.sinica.edu.tw

Wei-Chung Hsu
National Chiao Tung University
hsu@cs.nctu.edu.tw

Chien-Min Wang
Institute of Information Science, Academia Sinica
cmwang@iis.sinica.edu.tw

Abstract
Most dynamic binary translators (DBT) and optimizers (DBO) target binary traces, i.e. frequently executed paths, as code regions to be translated and optimized. Code region formation is the most important first step in all DBTs and DBOs. The quality of the dynamically formed code regions determines the extent and the types of optimization opportunities that can be exposed to DBTs and DBOs, and thus, determines the ultimate quality of the final optimized code. The Next-Executing-Tail (NET) trace formation method used in HP Dynamo is an early example of such techniques. Many existing trace formation schemes are variants of NET. They work very well for most binary traces, but they also suffer a major problem: the formed traces may contain a large number of early exits that could be branched out during the execution. If this happens frequently, the program execution will spend more time in the slow binary interpreter or in the unoptimized code regions than in the optimized traces in code cache. The benefit of the trace optimization is thus lost. Traces/regions with frequently taken early-exits are called delinquent traces/regions. Our empirical study shows that at least 8 of the 12 SPEC CPU2006 integer benchmarks have delinquent traces.

In this paper, we propose a light-weight region formation technique called Early-Exit Guided Region Formation (EEG) to improve the quality of the formed traces/regions. It iteratively identifies and merges delinquent regions into larger code regions. We have implemented our EEG algorithm in two LLVM-based multi-threaded DBTs targeting ARM and IA32 instruction set architecture (ISA), respectively. Using SPEC CPU2006 benchmark suite with reference inputs, our results show that compared to an NET-variant currently used in QEMU, a state-of-the-art retargetable DBT, EEG can achieve a significant performance improvement of up to 72% (27% on average), and to 49% (23% on average) for IA32 and ARM, respectively.

Categories and Subject Descriptors
C.4 [Performance of Systems]: Modeling techniques; D.3.4 [Processors]: Incremental Compilers; D.3.4 [Processors]: Optimization; D.3.4 [Processors]: Run-time environments

General Terms
Design, Performance

Keywords
Dynamic Binary Translation, Trace-Based JIT Compilation, Virtual Machine, Hardware-based Performance Monitoring, Hot Region Formation

1. Introduction
Dynamic binary translation and optimization are core technologies in system virtualization [22]. Most dynamic binary translators (DBTs) and optimizers (DBOs) target binary traces, i.e. frequently executed paths, as code regions to be translated and optimized. Code region formation is the most important first step in all DBTs and DBOs. The quality of the dynamically formed code regions determines the extent and the types of optimization opportunities that can be exposed to DBTs and DBOs, and thus, determines the ultimate quality of the final optimized code. As code regions are formed by traces, we will use the terms trace and region interchangeably for the rest of the paper.

Many DBT and DBO systems [6, 7] follow the well-known runtime trace formation algorithm, called Next-Executing-Tail (NET), developed in HP Dynamo [3].

Instead of profiling all execution traces at runtime to select the hottest trace, NET forms a trace by selecting the basic blocks1 that are most recently executed. The idea is that when a basic block becomes hot, it is likely that the following basic blocks are also hot.

As a hot trace is formed by cascading a sequence of hot basic blocks, there will be a conditional branch at the end of each member basic block, referred to as the early exit of the trace. DBTs need to generate compensation code at each of such early exits to handle the case when the conditional branch is taken [22]. If early exits are frequent, then not only will such extra compensation code need

---

1 A basic block is a sequence of instructions terminated by a control transfer instruction.
to be executed, but also program execution will spend more time in the slow binary interpreter or in unoptimized code regions. The benefit of trace optimization by the DBT is thus lost. Traces with frequently taken early-exits are called delinquent traces.

Since NET does not use edge profiling [3] information to select next basic blocks, early exits may occur when program behavior changes in different execution phases. For example, the function P7Viterbi in 456.hmmer (a SPEC 2006 CPU benchmark) contributes most of its execution time. P7Viterbi updates global variables according to different conditions in a performance critical for-loop as shown in Figure 1(a).

NET splits the for-loop into four traces as shown in Figure 1(b). Each large rectangle represents a trace. The execution time of each trace, shown as the percentage of total execution time, is noted on the left top corner of the trace. The probability of an early exit being taken is also noted on each exit edge. Figure 1(b) shows a trace for a loop starting at 0x80522be. The probability of taking an early exit during the loop execution is 98%. Such a high probability for an early exit will certainly diminish the performance benefit expected from the loop trace. Our proposed region formation technique (see Section 3) will merge those four traces into a large code region shown in Figure 1(a), which can improve its performance by 68%.

To accomplish this, we propose a light-weight technique called Early-Exit Guided (EEG) region formation to detect and merge delinquent regions. There are two key issues in EEG: (1) which regions should be merged, and (2) when to merge those regions. A simple approach for the first issue is to instrument counters into all traces. However, this approach is prohibitively expensive. Instead, we employ hardware-assisted dynamic profiling to select hot regions and to avoid monitoring and merging unimportant regions. To address the second issue, we monitor regions by instrumenting counters to detect early exits. When the counter exceeds a threshold, we merge this region with the region that begins at the branch target of the early exit. We also employ a heuristic to decide whether it is beneficial to merge the selected regions or not. We will not merge regions if it will cause too much register pressure; i.e. too many store/load operations to spill and fill values between registers and the stack (see Section 3.4).

We summarize the main contributions of this work as follows:

1. Our experimental results show that there is a substantial amount of delinquent traces, and that more than 100 early exits are taken for every million executed instructions in 65% of SPEC CPU2006 integer benchmarks. We proposed an Early-Exit-Guided region formation algorithm (EEG) that uses hardware-assisted dynamic profiling and instrumented software counters to detect and merge delinquent traces/regions into larger regions.

2. We implement the EEG scheme in two LLVM-based [1] multi-threaded DBTs targeting ARM and IA32 instruction set architecture (ISAs), respectively. They off-load DBTs to other cores and allow more aggressive and sophisticated optimizations to be done on the larger code regions formed by EEG.

3. Using SPEC CPU2006 benchmark suite with reference inputs, our results show that compared to NET, EEG can achieve a significant performance improvement of up to 72% (27% on average) for IA32, and to 49% (23% on average) for ARM.

The rest of the paper is organized as follows. Section 2 presents our region-based multi-threaded DBT. Section 3 describes our early exit detection technique and early-exit guided region formation scheme. Section 4 presents our experimental results. Section 5 describes related work, and Section 6 gives some concluding remarks.

2. Region-Based Multi-threaded Dynamic Binary Translator

In this section, we describe the design of our region-based multi-threaded dynamic binary translator, called LnQ [14]. We have implemented the EEG scheme in LnQ. LnQ uses QEMU [2] as the front-end emulation engine, and uses LLVM [1] compilation infrastructure to handle its back-end code optimization and target code generation. We implement our EEG scheme using this framework. Figure 2 shows the major components and the control flow of our region-based multi-threaded dynamic binary translator.

We use code segments to refer basic blocks and traces/regions, and use code fragment to refer a translated code segment by DBT. Therefore, there are basic block fragments and trace/region fragments. Each code fragment has a prologue to load the guest architecture states, such as the content of the guest registers, from the memory to the host registers before execution. Also, each code fragment has an epilogue to store modified machine states back to memory before leaving the code fragment. Each code fragment has its own register mapping decided by the LLVM register allocator.

LnQ uses execution threads and optimization threads. Execution threads are responsible for translating basic blocks and executing translated code fragments. That is, if an execution thread reaches a new guest basic block during execution, the execution thread generates a basic block fragment using LLVM. Optimization threads generate optimized traces and regions fragments also using LLVM. Execution threads compile blocks with “OO” optimization
level to minimize compilation overhead. On the other hand, optimization threads compile traces and regions with “O2” to generate optimized code. All execution threads share one software code cache. As shown in Figure 2, when traces or regions are formed as described in Section 3. We use a lock-free concurrent FIFO queue [19] to implement the task queue so that execution threads can insert trace/region compilation tasks into the queue while the optimization threads take those tasks from the queue without locks.

When an optimization thread generates a new trace or region, it dispatch execution threads to the newly generated code fragment by atomically patching jump instructions in the code cache. To do this in IA32, we need to align the patched instructions to 4-byte alignment, and use the self-branch technique mentioned in [24] to patch jumps atomically.

3. Early Exit Index and Early-Exit Guided Region Selection

In this section, we first describe the NET algorithm used in our system. We then define an early exit index to quantify how often early exits are taken in a trace. Finally we describe our early exit guided region selection technique.

3.1 Trace Selection Algorithm

We adopt a modified NET algorithm called NET*, which is similar to [6], to builds traces. The difference is that NET* considers all basic blocks as potential trace head candidates, while NET only considers blocks which are targets of backward branches as trace head candidates in that they may form potential loops.

The NET* algorithm has two advantages. First, the NET algorithm [3] was designed for DBT systems in which a single DBT thread is responsible for both execution and trace building. To reduce the overhead of building traces, NET needs to be very selective in potential traces. In contrast, NET* can take advantage of modern multi-core platforms to offload the overhead of building traces. Hence, it can afford to try all basic blocks as potential trace heads.

Second, NET may not identify all loops by only considering targets of backward branches. By considering all basic blocks as possible trace heads, NET* can discover more hot traces than NET can. As reported in Section 4.1.1, NET* achieves 12% and 5% performance improvement on average over NET for SPEC CINT2006 and CFP2006 benchmarks, respectively.

Our NET* algorithm works as follows. We instrument software counters to record the number of times each block is executed. A block becomes a trace head when the number of times the block has been executed exceeds a threshold value. NET* forms a trace by appending blocks along the execution path until one of the following terminal conditions is met: (1) A branch to the trace head is taken, (2) The number of blocks exceeds a threshold, (3) The next block is the head of another trace, or (4) A guest system call instruction is encountered.

3.2 Early Exit Index

We first define an early exit of a trace. A trace can be a straight-line execution path or a cycle. If a trace is a straight-line path, then all exit edges along the path are early exits except the exit edge of the last basic block in the trace. If a trace is a cycle, all exit edges are early exit.

We define an Early-Exit Index (EEI) to measure the frequency of early exits taken in traces. More specifically, EEI is the number of early exits being taken for every million instructions executed in traces. It can be formally defined as in the following equation.

\[ EEI = \sum_{i \in \Gamma} n_i \times \rho_i / N \]

where \( \Gamma \) is the set of traces, \( n_i \) is the number of times early exits being taken in trace \( i \), \( \rho_i \) is the percentage of instructions executed in trace \( i \), and \( N \) is the number of million instructions executed.

3.3 Early-Exit Guided Region Selection

In this section, we describe our proposed Early-Exit Guided (EEG) region selection scheme. It detects and merges regions that have frequently taken early exits. The key issues in EEG are (1) how to efficiently detect delinquent regions; and (2) when to merge them at runtime. We address them as follows.

The simplest approach to address the first issue is to instrument counters in all traces and regions. However, this approach is inefficient and may merge too many regions that are not frequently executed. Instead, we use a dynamic profiling approach with the help of on-chip hardware performance monitor (HPM) to select hot regions.

We create a profiling thread called profiler at the beginning of execution to perform dynamic profiling. The profiler collects program counters periodically for every million instructions retired. When a threshold number of samples are collected, the profiler accumulates the sample counts for each trace to determine the degree of hotness of each trace. The hotness of a trace is measured by the following equation.

\[ H_T = \max\{\alpha, \beta\} \]

Here, \( \alpha \) is the percentage of instructions executed in the trace during the last sampling period, and \( \beta \) is the percentage of instructions executed in the trace during the entire execution. Intuitively, \( \alpha \) represents the hotness of the trace during the last period, and \( \beta \) represents the accumulated hotness during the entire execution. We choose the maximum of \( \alpha \) and \( \beta \) as its hotness measure.

When the hotness of a trace exceeds a threshold, we start monitoring the trace by instrumenting counters to its early exits. Currently, we only monitor the early exits of conditional branches. If a counter exceeds a pre-defined threshold, it means the control leaves the region through the corresponding early exit very frequently. Then, we merge the monitored region with the target region of the early exit. We translate and optimize the merged region with our LLVM-based DBT, and replace the monitored region with the merged region.

We argue that the overhead of the instrumentation is negligible because early exits should be rarely taken. A frequently taken early exit would have triggered region formation when the counter exceeded the threshold.

3.4 Spill Index of a Region

The benefits of EEG region formation come from eliminating the overhead caused by frequently taken early exits, and potential optimization opportunities from a larger code region. Despite the fact that we can mostly eliminate the overhead of frequently taken early exits via region merging, we may not always have potential optimization opportunities from the merged region. In particular, if the quality of the translated code of a region is not good enough, it is not beneficial to merge such a region.

We define an index, called Spill Index, to assess the quality of the code generated by the LLVM compiler for a region formed.
by the EEG technique. A spill instruction is an instruction for load/store operations between registers and stack. The Spill Index is the percentage of spill instructions in the translated code fragment. When the Spill Index of a code fragment exceeds a threshold, that region should not be further merged because a high percentage of spill instructions often forestalls good performance due to improper register allocation of the LLVM compiler.

3.5 Region Versus Trace

By creating larger regions, we reduce the amount of specialization that the compiler can do for traces. As we know, the benefit of traces comes from the instruction scheduling within traces [15]. However, we need to limit the instruction scheduling optimizations when we compile traces in dynamic binary translation, because we have to rematerialize full guest state in case a hardware exception or a signal was raised.

The main advantage of EEG region formation is that it can improve DBT performance by removing transition overhead among traces, such as removing redundant loads/stores of guest state among traces.

We use Figure 3 as an example to illustrate our region formation strategy. Figure 3(a) is the control flow graph (CFG) of a hot region in a guest application. During execution, each block is first translated as shown in Figure 3(b). Then NET forms three traces as in Figure 3(c). Trace A would be the first selected for early exit detection (see Figure 3(c)) since a loop is likely to become hot. Thus the early exit of A, marked by a dashed arrow from the trace started with A (enclosed by the dotted rectangular) to the trace started with B, is monitored with an instrumented software counter.

We merge Trace A and Trace B to form a code region when the Spill Index of a code fragment exceeds a threshold, that is formed, we replace Trace A and Trace B with Region A so that the percentage of spill instructions in the translated fragment. After the code fragment of Region A is enclosed in the dotted rectangular in Figure 3(d), that consists of traces A and B is formed. After the code fragment of Region A is formed, we replace Trace A and Trace B with Region A so that Trace F now branches to Region A rather than to Trace A. Note that Region A will not be monitored because the spill index of Region A exceeds the threshold.

Figure 3. Illustration of region selection.

4. Experiments

In this section, we evaluate the performance of Early-Exit-Guided region selection algorithm in our LLVM-based parallel DBT systems. We start by describing our measurement methodology.

We evaluate the performance with SPEC CPU 2006 benchmarks on a 3.3GHz quad-core Intel Core i7 machine. The machine has 12 GB main memory and the operating system is 64-bit Gentoo Linux with kernel version 2.6.30. We use the LntO [14] dynamic binary translation framework to build two translators which translate IA32 and ARM guest ISAs to x86_64 host ISA. For CFP2006 benchmarks, we only compile them into IA32 binaries because most CFP2006 benchmarks are written in Fortran and the ARM tool chain we use does not provide cross-compilation for Fortran. The result of ARM 464.h264ref is not reported because the SPEC runspec tool reports a mis-match error even when it runs 464.h264ref in a native ARM machine.

The benchmarks are compiled with GCC 4.3.4 for IA32 binaries and GCC 4.4.1 for ARM binaries. For all benchmarks, “-O2” flag is used. For IA32 benchmarks, we use “-m32” to generate IA32 binaries. For CFP2006, we use “-msse2 -mfpmath=sse” extra flags to generate SSE vector instructions. We use runspec script provided by SPEC to run benchmarks and report the median of 5 runs for all performance metrics.

We compare three region selection strategies in our experiments, which are NET, NET* and EEG as described in Section 3. In EEG strategy, we first use NET* to select traces, and use EEG to merge traces into regions. We set block count threshold to 50 and allow at most 16 blocks in a trace. For EEG strategy, the threshold of spill index is set to 15%, i.e. regions cannot be further merged when the percentage of spill instructions in the translated fragment exceeds 15%.

We use Perfmon2 [21] for hardware-assisted dynamic profiling to collect runtime information for every one million retired instructions. The early exit threshold is set to 1000 and we use two optimization threads to compile traces and regions in all experiments.

4.1 Performance Results of SPEC CPU2006

The performance results of SPEC CPU2006 are shown in Figure 4 and Figure 5. For cleanness of presentation, the benchmarks in both figures are sorted in decreasing order of speedup ratio so that it is easier to see the maximum, the minimum, and the geometric average of the results. We explain the results in the following sections.

4.1.1 Performance of NET*

The performance of NET* algorithm compared to NET in SPEC CINT2006 benchmarks is shown as red bars in Figure 4. For CINT2006 benchmarks, NET* achieves an average improvement of 12% and 10% for the IA32 and ARM benchmarks, respectively, with up to 53% and 46% for IA32 456.hmmer and ARM 471.amnistpp. The results show that NET* discovers more hot traces than NET does by considering all blocks as possible trace heads, and our DBTIs do not incur significant overhead because the compilation overhead is offloaded to optimization threads.

We notice that only ARM 462.libquantum has 8% slowdown. We compare traces generated by the two algorithms and show the difference, in Figure 6, among traces generated by NET and NET* for a hot loop in function quantum_toffoli of 462.libquantum.

As shown in Figure 6 (a) and 6 (b), both NET and NET* have the same trace T-d10c, but NET* splits trace T-d094 of NET into T-d094 and T-d0b4 because NET* generates T-d0b4 before T-d094. The transition between traces T-d094 and T-d0b4 in NET* results in 8% slowdown compared to NET.

However, both NET and NET* have the delinquent trace T-d10c with frequently taken early exit to T-d094 due to an unbi-
The performance of EEG compared to NET in SPEC CFP2006 benchmarks is shown in Figure 4. For CFP 2006 benchmarks, EEG achieves an average improvement of 27.5% and 23% for the IA32 and ARM benchmarks, respectively, with up to 71.7% for IA32 and 54% by merging the two traces into one region as shown in Figure 6(c). As shown in Figure 5, EEG improves NET by 4.8% to 7% on CFP2006. The improvement is minor because there are few early exits in these floating point benchmarks. In the next section, we measure the early exit index and show the relation between the number of early exits and the performance improvement.

We also observe that EEG loses 2.7% and 2.9% performance compared to NET in 437.leslie3d, and 459.GemsFDTD respectively. In 437.leslie3d, the time is spent in a small number of nested loops in the procedure EXTRAPI of file tm1.†. The regions generated by EEG contain nested loops while each trace generated by NET contains only the innermost loop. Therefore, in 437.leslie3d and 459.GemsFDTD, the translated code for traces is better than translated code for regions. As a result, EEG loses about 2.7% performance compared to NET.

4.2 Early Exit Index

In this section, we measure the Early Exit Index (EEI) of benchmarks with the NET* strategy. We insert counters at each side exit to collect the number of early exits taken in each trace, and we measure the execution frequency of traces by sampling program counters per one million retired instructions. We calculate EEI with the collected numbers as described in Section 3.2. The results are shown in Figure 7. The Y-axis on the left side shows the measured
In this section, we collect the number of memory, branch instructions and the L1 instruction cache misses of NET* and EEG through hardware performance monitoring. We calculate the percentage of reduced memory/branch operations and cache misses in EEG compared to NET*. We focus on the profiles of CINT2006, which are shown in Table 1.

Table 1. Reduced memory/branch instructions and cache misses of EEG for CINT2006 benchmarks.

As shown in Table 1, benchmarks with large improvement tend to have high percentage of reduced operations or L1 instruction cache misses. For example, IA32 456.hmmer reduces 52.8%, 36.9% and 31% of memory, branch instructions and L1 i-cache misses, and achieves 70% improvement over NET*. There are also significant percentage of reduced instructions and misses in 458.sjeng and 445.gobmk, which contributes to the improvement of these two benchmarks. The profiling data show that EEG can not only reduce the memory and branch instructions but also reduces L1 instruction cache misses by merging delinquent traces into regions.

4.4 Effect of The Threshold of Spill Index

In this section, we study the effect of the threshold of spill index, described in Section 3.4, on the performance of EEG. As shown in Figure 8(a), the performance of EEG is less sensitive to the threshold of spill index for IA32 benchmarks except 471.omnetpp. The results show that the register pressure is not a problem in the region fragments of IA32 benchmarks because the IA32 guest architecture has only 8 general purpose registers while there are 16 registers on x86_64 host architecture.

For 471.omnetpp, the performance degrades by 13.5% when the threshold changes from 15% to 20%. The reason is that when threshold changes from 15% to 20%, the spill index of the hottest fragment changes from 18% to 36% because that fragment merges one more region and its CFG becomes complex when threshold is set to 20%. As a result, the extra spill instructions degrade the performance of 471.omnetpp.
Table 2 shows the statistics of selected regions in NET, NET*, and EEG for CINT2006. First, the number of traces in NET* increase by 54% and 59% on average compared to NET for IA32 and ARM benchmarks respectively. The average numbers of blocks per trace are similar in NET and NET*.

13.6% and 11.5% of traces in NET* are merged into regions by EEG for the IA32 and ARM benchmarks respectively, which indicates that our HPM-based region selection approach described in Section 3.3 can effectively select hot traces to be merged. The average numbers of blocks per region are 14.4 and 13.4 for the IA32 and ARM benchmarks respectively, which are 3.4X and 2.9X larger than the traces generated by NET*.

We also compute the number of merges in EEG. There are 2.1 and 1.7 merges per region on average in IA32 and ARM benchmarks, which indicates that most regions become stable after few number of merges. The last two columns of Table 2 are percentage of execution time spent in traces and regions. On average, our DBTs spend 72.3% and 58.4% execution time in regions for the IA32 and ARM benchmarks respectively.

5. Related Works

The choice of optimization unit is critical to achieving good performance for Just-In-Time compilation systems. In this section, we categorize the related works of finding hot code region into dynamic binary translation systems, dynamic binary optimization systems, and language virtual machines.

5.1 Dynamic Binary Translation Systems

Dynamic binary translation (DBT) is widely used to support legacy binary code to run on a new architecture such as IA-32EL [4], DAISY [9], and Transmeta [8]. IA32-EL is a process virtual machine that enables IA32 applications to run on Intel Itanium. IA32-EL uses hyper-blocks as its unit of optimization in the hot code translation phase. A hyper block is a set of predicated basic blocks with a single entry and multiple exits. IA32-EL forms hyper blocks based on the execution counts of basic blocks and edge counters collected during the cold code execution.

DAISY and Transmeta are system virtual machines, where DAISY supports IBM PowerPC applications to run on VLIW processors and Transmeta supports IA-32 applications to run on a proprietary VLIW processor. Transmeta did not reveal details about how to find hot code regions. IBM DAISY uses tree groups as the translation unit. Tree groups have a single entry point and multiple exit points. No control flow joins are allowed within a tree group. Control flow joins can only occur on group transitions. Like IA32-EL, DAISY also uses profiling information collected during interpretation for tree group formation. Both hyper-blocks and tree groups have little advantage to non-VLIW machines, such as
x86,64, since they are primarily designed to maximize instruction-level parallelism in VLIW architectures. Therefore we do not apply their approach in our system.

Moreover, DAISY, Transmeta, and IA32-EL handle early exits with chaining, i.e. the execution directly transfers to another code region. The transition overhead in those systems is not as high as in LnQ because most guest architecture states are mapped to the host architecture in these systems. For example, IA32-EL maps the state of IA-32 guest registers directly to Itanium registers. On the other hand LnQ, a retargetable dynamic binary translator, does not make any assumption about the guest and host ISAs. Consequently LnQ has to load guest states in the prologue of code fragments, and save them back to memory in the exit stubs, which incurs transition overheads.

5.2 Dynamic Optimization Systems

ADORE [18] and Dynamo [3] are same-ISA dynamic binary optimizers, which means the input and the output instructions are from the same instruction set architecture. Both ADORE and Dynamo use traces, i.e. super-blocks, as the unit of optimization. ADORE uses Hardware Performance Monitor (HPM) sampling approach to collect path profiles from the Branch Target Buffer (BTB) hardware performance counters in Itanium. It forms traces based on the collected path profiles. Dynamo was the first trace-based dynamic optimizing compiler that used the Next-Executing-Target (NET) algorithm. Dynamo pioneered many early concepts of trace formation and trace runtime management. Many DBT systems [6, 7, 13, 25] and just-in-time compilers [10, 16, 26] use NET or its variants to form traces.

<table>
<thead>
<tr>
<th>IA32 CINT2006</th>
<th>NET</th>
<th>NET*</th>
<th>EEG</th>
<th>Merges</th>
<th>%Time Spent in</th>
</tr>
</thead>
<tbody>
<tr>
<td>400.perlbench</td>
<td>6646</td>
<td>4.1</td>
<td>8966</td>
<td>5.9</td>
<td>1627</td>
</tr>
<tr>
<td>401.bzip2</td>
<td>583</td>
<td>3.8</td>
<td>894</td>
<td>5.0</td>
<td>206</td>
</tr>
<tr>
<td>403.gcc</td>
<td>23058</td>
<td>4.0</td>
<td>34019</td>
<td>3.9</td>
<td>2728</td>
</tr>
<tr>
<td>429.mcf</td>
<td>239</td>
<td>4.7</td>
<td>605</td>
<td>3.5</td>
<td>83</td>
</tr>
<tr>
<td>445.gobmk</td>
<td>9468</td>
<td>2.9</td>
<td>10961</td>
<td>3.9</td>
<td>2238</td>
</tr>
<tr>
<td>450.himmar</td>
<td>424</td>
<td>3.8</td>
<td>687</td>
<td>3.6</td>
<td>61</td>
</tr>
<tr>
<td>458.sjeng</td>
<td>1216</td>
<td>3.3</td>
<td>1749</td>
<td>4.7</td>
<td>764</td>
</tr>
<tr>
<td>462.libquantum</td>
<td>200</td>
<td>2.5</td>
<td>326</td>
<td>2.2</td>
<td>20</td>
</tr>
<tr>
<td>464.h264ref</td>
<td>2434</td>
<td>3.3</td>
<td>3974</td>
<td>4.4</td>
<td>781</td>
</tr>
<tr>
<td>471.omnetpp</td>
<td>2918</td>
<td>5.0</td>
<td>4859</td>
<td>5.5</td>
<td>345</td>
</tr>
<tr>
<td>473.astar</td>
<td>613</td>
<td>5.2</td>
<td>942</td>
<td>4.5</td>
<td>185</td>
</tr>
<tr>
<td>483.xalancbmk</td>
<td>4453</td>
<td>5.9</td>
<td>8355</td>
<td>4.4</td>
<td>538</td>
</tr>
</tbody>
</table>

Geometric Mean: 3.9 | 4.2 | 889 | 14.4 | 2.1 | 12.4% | 72.3% |

<table>
<thead>
<tr>
<th>ARM CINT2006</th>
<th>NET</th>
<th>NET*</th>
<th>EEG</th>
<th>Merges</th>
<th>%Time Spent in</th>
</tr>
</thead>
<tbody>
<tr>
<td>400.perlbench</td>
<td>7839</td>
<td>5.1</td>
<td>10438</td>
<td>5.1</td>
<td>1860</td>
</tr>
<tr>
<td>401.bzip2</td>
<td>672</td>
<td>4.7</td>
<td>1125</td>
<td>4.7</td>
<td>205</td>
</tr>
<tr>
<td>403.gcc</td>
<td>24703</td>
<td>3.7</td>
<td>36710</td>
<td>3.7</td>
<td>2595</td>
</tr>
<tr>
<td>429.mcf</td>
<td>362</td>
<td>3.6</td>
<td>726</td>
<td>3.6</td>
<td>125</td>
</tr>
<tr>
<td>445.gobmk</td>
<td>14175</td>
<td>3.5</td>
<td>16587</td>
<td>3.5</td>
<td>3071</td>
</tr>
<tr>
<td>456.himmar</td>
<td>847</td>
<td>4.8</td>
<td>1378</td>
<td>4.8</td>
<td>38</td>
</tr>
<tr>
<td>458.sjeng</td>
<td>1299</td>
<td>4.6</td>
<td>1811</td>
<td>4.6</td>
<td>760</td>
</tr>
<tr>
<td>462.libquantum</td>
<td>606</td>
<td>8.6</td>
<td>951</td>
<td>8.6</td>
<td>38</td>
</tr>
<tr>
<td>471.omnetpp</td>
<td>4584</td>
<td>3.5</td>
<td>7163</td>
<td>5.1</td>
<td>364</td>
</tr>
<tr>
<td>473.astar</td>
<td>959</td>
<td>3.9</td>
<td>1432</td>
<td>4.6</td>
<td>180</td>
</tr>
<tr>
<td>483.xalancbmk</td>
<td>4844</td>
<td>5.1</td>
<td>8690</td>
<td>3.9</td>
<td>559</td>
</tr>
</tbody>
</table>

Geometric Mean: 4.1 | 4.6 | 756 | 13.4 | 1.7 | 23.0% | 58.4% |

Table 2. Statistics of Traces/Regions in NET* and EEG.

StarDBT [25] uses MRET² [27], which improves NET by increasing the completion rate of traces. MRET² first uses NET to select a potential trace, then it clears block execution counters and restarts NET to select another potential trace. Both potential traces share the same starting address but may have different tails. MRET² then improves the completion rate by selecting the common path of both potential traces as a hot trace. Hiniker et al. [12] proposed Last-Executed Iteration (LEI) and a trace combination algorithm, which needs to interpret each taken branches to form traces.

The main difference between the proposed EEG and previous works is that EEG expands the existing regions and re-optimizes them during execution. The process of region expansion in EEG can be divided into three stages. The first stage is to decide how to form the initial region. The second stage is to decide when to expand the region. The third stage is to decide which blocks are to be merged. Previous trace formation algorithms, such as LEI and MRET², could be used in the first stage of EEG to build the initial regions. Therefore, the proposed EEG can be used effectively in most trace-based dynamic binary translators.

5.3 Language Virtual Machines

5.3.1 Method-Based Language Virtual Machines

Region expansion is widely used in method-based JIT systems, e.g., HotSpot Java VM [20]. These JIT systems compile methods as follows. When a method-based JIT system compiles a method for the first time, it only compiles those basic blocks whose execution counts exceed a threshold during interpretation. If the execution frequently leaves a region from side exits, the JIT system expands...
this region to include those basic blocks that are the destinations of these side exits.

Our EEG and method-based JIT systems use similar heuristics to decide when to expand regions during the second stage of region expansion, but they are very different in the first stage and the third stage of region expansion in terms of motivation and the type of blocks they merge.

The major difference between EEG and those systems in the first stage is the motivation in forming the initial regions. EEG uses traces as initial regions for two reasons. First, traces represent those frequently executed paths that may span across several methods. Second, it takes less time to optimize traces because of their simple control flow graph and small numbers of basic blocks. For example, we found only 4.2 blocks per trace in EEG. On the other hand, method-based JIT systems build initial regions by selecting blocks from hot methods, and excluding those blocks that are rarely executed. For example, HotSpot JVM excludes blocks that are never executed during interpretation.

The major difference between EEG and method-based JIT systems in the third stage is the type of blocks they merge. In the third stage EEG merges traces that contains frequently executed paths. However, in the third stage method-based JIT systems will only merge blocks that are rarely executed in the first stage, since those frequently executed blocks in the first stage have already been merged.

Suganuma et al. [23] investigate how to use region-based compilation to improve the performance of method-based Java Just-In-Time compilation. They use region-based compilation to partially inline procedures, instead of using traditional method inlining techniques. They collect execution counts of basic blocks in order to understand program runtime behavior, and they apply static code analysis on the Java bytecode to identify those rarely executed code blocks, such as those handle exception. They use these information to identify and optimize those often executed code blocks only, without optimizing the entire method.

In our case it is difficult to identify those rarely executed regions by a static code analysis, as they did for Java bytecode. Therefore we cannot apply their approach in our system.

5.3.2 Trace-Based Language Virtual Machines

Recently, trace-based compilation has gained popularity in dynamic scripting languages [5, 10] and high level language virtual machines [11, 16, 17, 26], Wu et al. [26] and Inoue et al. [16, 17] investigate the performance of several variations of NET on trace-based Java virtual machines.

Gal et al. [10] propose merging loop traces into a trace-tree. Their approach requires adding annotation while compiling JavaScript into bytecode, and thus cannot be applied in our case.

In contrast, our EEG merges delinquent traces/regions, which are not necessarily loop traces. EEG uses hardware monitoring to identify often executed code traces, then determines whether they have many side exits, and finally merges those often executed code regions that have many side exits to avoid early exits from a region, EEG also uses spill index to prevent generating regions which may degrade performance.

6. Conclusion

We have identified and quantified the delinquent trace problem in the popular Next-Executing-Tail (NET) trace selection algorithm. Delinquent traces contain frequently taken early exits which cause significant overhead. Motivated by this problem, we develop a light-weight region formation strategy called Early-Exit Guided region selection (EEG) to improve the performance of NET by merging delinquent traces into larger code regions. The EEG algorithm is implemented in two LLVM-based parallel dynamic binary translators (DBT), the IA32-to-x86_64 and ARM-to-x86_64 DBTs.

Experiment results show that EEG achieves performance improvement of up to 72% (27% on average), and up to 49% (23% on average) in IA32 and ARM SPEC CINT2006 benchmarks respectively. The profiling results show that EEG can reduce memory and branches instructions by up to 53% and 37% respectively because the transition overhead among traces is eliminated by merging delinquent traces. It also reduces the L1 instruction cache misses by up to 43.7% in CINT2006 benchmarks.

Acknowledgments

The authors would like to thank Dr. Filip Pizlo at Apple Inc. and the anonymous reviewers for their valuable comments and suggestions to improve the quality of this paper. This work is supported by the National Science Council of Taiwan under grant number NSC99-2221-E-001-003-MY3, NSC99-2221-E-001-004-MY3, and by NSF grant CNS-0834599.

References


