-
Computer Architecture, 2003. Proceedings. 30th Annual International Symposium on; 07/2003
-
ABSTRACT: The ever-increasing computational power of contemporary microprocessors significantly reduces the execution time spent on arithmetic computations (i.e., the computations not involving slow memory operations such as cache misses). Therefore, for memory intensive workloads, it becomes more important to overlap multiple cache misses than to overlap slow memory operations with other computations. In this paper, we propose a novel technique to parallelize sequential cache misses, thereby increasing memory-level parallelism (MLP). Our idea is based on value prediction, which was proposed originally as an instruction-level-parallelism (ILP) optimization to break true data dependencies. In this paper, we advocate value prediction for its capability to enhance MLP instead of ILP. We propose to use value prediction and value speculative execution only for prefetching so that the complex prediction validation and misprediction recovery mechanisms are avoided and only minor changes in the microarchitecture are needed. The same hardware modifications also enable aggressive memory disambiguation for prefetching. The experimental results show that our technique enhances MLP effectively and achieves significant speedups even with a simple stride value predictor.
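As a rough illustration of the "simple stride value predictor" mentioned above, the sketch below predicts a load's next result as its last value plus the last observed stride and treats each prediction purely as a prefetch address, so nothing is validated or rolled back. The class name, table layout, and the `prefetch_addresses` helper are assumptions made for this example and are not taken from the paper.

```python
class StridePredictor:
    """Per-instruction stride value predictor (illustrative sketch only)."""

    def __init__(self):
        # Hypothetical table layout: pc -> (last_value, stride)
        self.table = {}

    def predict(self, pc):
        """Return last value + stride, or None if this PC has not been seen."""
        entry = self.table.get(pc)
        if entry is None:
            return None
        last, stride = entry
        return last + stride

    def update(self, pc, actual):
        """Record the actual value and recompute the stride for this PC."""
        entry = self.table.get(pc)
        stride = actual - entry[0] if entry is not None else 0
        self.table[pc] = (actual, stride)


def prefetch_addresses(predictor, pc, values):
    """Collect the predictions that would be handed to a prefetcher.

    Assumption: the predicted load result is used as the address of a
    dependent load. A wrong guess only costs a useless prefetch, so no
    validation or recovery step is modeled.
    """
    issued = []
    for actual in values:
        guess = predictor.predict(pc)
        if guess is not None:
            issued.append(guess)
        predictor.update(pc, actual)
    return issued
```

For a load at a hypothetical PC whose results advance by 64 each time, the first guess simply repeats the last value and later guesses follow the stride.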
06/2003;
-
ABSTRACT: Value prediction exploits localities in value streams. Previous research focused on exploiting two types of value localities, computational and context-based, in the local value history, which is the value sequence produced by the same instruction that is being predicted. Besides the local value history, value locality also exists in the global value history, which is the value sequence produced by all dynamic instructions according to their execution order. In this paper, a new type of value locality, the computational locality in the global value history, is studied. A novel prediction scheme, called the gDiff predictor, is designed to exploit one special and most common case of this computational model, the stride-based computation, in the global value history. Such a scheme provides a general framework to exploit global stride locality in any value stream. Experiments show that there exists a very strong stride type of locality in global value sequences. Ideally, the gDiff predictor can achieve 73% prediction accuracy for all value producing instructions without any hybrid scheme, much higher than local stride and local context prediction schemes. However, the capability of realistically exploiting locality in global value history is greatly challenged by the value delay issue, i.e., the correlated value may not be available when the prediction is being made. We study the value delay issue in an out-of-order (OOO) execution pipeline model and propose a new hybrid scheme to maximize the exploitation of the global stride locality. This new hybrid scheme shows 91% prediction accuracy and 64% coverage for all value producing instructions. We also show that the global stride locality detected by gDiff in load address streams provides strong capabilities in predicting load addre...
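The idea of stride locality in the global value history can be sketched as follows: for each instruction, remember the differences between its result and the last few values produced by any instruction, and predict once the same difference recurs at the same distance. This is only a loose illustration of the concept described above, not the gDiff design itself; the table layout, the `depth` parameter, and all names are assumptions, and the value delay issue is ignored (every earlier value is assumed to be available).

```python
from collections import deque


class GlobalStridePredictor:
    """Predicts a value as (value produced k instructions earlier) + diff."""

    def __init__(self, depth=4):
        # Global value history shared by all instructions, newest value last.
        self.history = deque(maxlen=depth)
        # Hypothetical table: pc -> {"diffs": [...], "distance": k or None}
        self.table = {}

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry is None or entry["distance"] is None:
            return None
        k = entry["distance"]
        if k > len(self.history):
            return None
        return self.history[-k] + entry["diffs"][k - 1]

    def update(self, pc, actual):
        # diffs[k-1] = actual - value produced k instructions earlier
        new_diffs = [actual - self.history[-k]
                     for k in range(1, len(self.history) + 1)]
        entry = self.table.get(pc)
        distance = None
        if entry is not None:
            # A repeated difference at the same distance marks a global stride.
            for k, (old, new) in enumerate(zip(entry["diffs"], new_diffs), 1):
                if old == new:
                    distance = k
                    break
        self.table[pc] = {"diffs": new_diffs, "distance": distance}
        self.history.append(actual)


def run_trace(trace, depth=4):
    """Feed (pc, value) pairs in execution order; return (pc, guess) pairs."""
    predictor = GlobalStridePredictor(depth)
    guesses = []
    for pc, value in trace:
        guesses.append((pc, predictor.predict(pc)))
        predictor.update(pc, value)
    return guesses
```

In a made-up trace where instruction B always produces the preceding result of A plus 8 while A's own values are irregular, B has no usable local stride, yet the sketch predicts B correctly after it has seen the difference of 8 twice.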
05/2003;
-
ABSTRACT: This paper appeared in the Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques (PACT'01), September 2001.
12/2002;
-
ABSTRACT: Lower threshold voltages in deep sub-micron technologies cause more leakage current, increasing static power dissipation. This trend, combined with the trend of larger/more cache memories dominating die area, has prompted circuit designers to develop SRAM cells with low-leakage operating modes (e.g., sleep mode). Sleep mode reduces static power dissipation but data stored in a sleeping cell is unreliable or lost. So, at the architecture level, there is interest in exploiting sleep mode to reduce static power dissipation while maintaining high performance.
04/2002;
-
ABSTRACT: In embedded computing, code size is very important for system cost and performance. In global scheduling for VLIW/EPIC style embedded processors, region-enlarging optimizations, especially tail duplication, are commonly used to exploit instruction level parallelism (ILP) to boost the performance. The code size increase due to such optimizations, however, raises serious concerns about the affected I-cache, branch and TLB performance. In this paper, we focus on the code size efficiency of code size related optimizations in global scheduling. First, we propose to use the ratio of static IPC (instructions per cycle) changes to code size changes as a quantitative measure of the code size efficiency at compile time for any code size related optimization. Then, based on the code size efficiency of tail duplication, we propose solutions to two related problems: (1) how to achieve the best performance for a given code size increase, (2) how to get the optimal code size efficiency for any program. Our study shows that code size increase resulting from tail duplication has a significant but varying impact on IPC, e.g., the first 2% code size increase results in an 18.5% increase in static IPC, while the static IPC changes by less than 1% when the code size increase ranges from 20% to 30%. We then use this feature to define the optimal code size efficiency and to derive a simple, yet robust threshold scheme for finding it. The experimental results using SPECint95 benchmarks show that this threshold scheme finds the optimal efficiency accurately. While the optimal efficiency results show an average increase of 2% in code size over original code size, improved I-cache performance (4% decrease in I-cache penalty for a 32KB I-cache) is observed due to the increased locality. Overal...
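The two data points quoted in the abstract can be plugged into the stated ratio as a quick check of how sharply the efficiency drops. The symbol E and the reading of "from 20% to 30%" as a 10-percentage-point code size change are assumptions made for this worked example.

```latex
% Code size efficiency as the ratio defined in the abstract
E = \frac{\Delta \mathrm{IPC}_{\text{static}}}{\Delta \text{code size}}

% First 2% of code size growth
E_{0 \to 2\%} = \frac{18.5\%}{2\%} = 9.25

% Growth from 20% to 30% (static IPC changes by less than 1%)
E_{20\% \to 30\%} < \frac{1\%}{30\% - 20\%} = 0.1
```

Under these assumptions the early duplication is more than 90 times as efficient as the late duplication, which is the kind of gap a threshold on E is meant to separate.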
02/2002;
-
ABSTRACT: In global scheduling for ILP processors, region-enlarging optimizations, especially tail duplication, are commonly used. The code size increase due to such optimizations, however, raises serious concerns about the affected I-cache and TLB performance. In this paper, we propose a quantitative measure of the code size efficiency at compile time for any code size related optimization. Then, based on the efficiency of tail duplication, we propose solutions to two related problems: (1) how to achieve the best performance for a given code size increase, (2) how to get the optimal code size efficiency for any program. Our study shows that code size increase has a significant but varying impact on IPC, e.g., the first 2% code size increase results in an 18.5% increase in static IPC, but less than 1% when the given code size further increases from 20% to 30%. We then use this feature to define the optimal code size efficiency and to derive a simple, yet robust threshold scheme for finding it. The experimental results using SPECint95 benchmarks show that this threshold scheme finds the optimal efficiency accurately. While the optimal efficiency results show an average increase of 2% in code size, improved I-cache performance is observed and a speedup of 17% over the natural treegion results is achieved.
02/2002;
-
ABSTRACT: This paper presents a treegion-based global scheduling technique for wide issue VLIW/EPIC processors. A treegion is a single-entry/multiple-exit global scheduling scope that consists of basic blocks with control-flow forming a tree. We propose a two-phase approach to global scheduling within a treegion scope that enables speculative code motion in the first phase and uses predication of all instructions in the second phase. In the first scheduling phase, tree traversal scheduling (TTS) takes full advantage of speculation to speed up all possible paths in a treegion. Over-aggressive speculation is limited by scheduling block-ending branches as early as possible, enabled by downward code motion. A multiway branch transformation is also performed to reduce control dependence height. In the second scheduling phase, fully resolved predicates (FRPs) are used to enable branch barrier instructions, such as stores and subroutine calls, to move across branches. Selective if-conversion can also be applied to remove hard-to-predict branches in a treegion. The simulation results based on an 8-issue EPIC style machine model show an average speedup of 21% of TTS over BB scheduling, an additional speedup of 6.4% from multiway branch transformation, and another 1.9% speedup from FRP-guarded code motion. Other code transformations such as treegion code layout and the general operation combining are also presented in this paper.
02/2002;
-
Interaction between Compilers and Computer Architectures, 2002. Proceedings. Sixth Annual Workshop on; 02/2002
-
ABSTRACT: Global scheduling in a treegion framework has been proposed to exploit instruction level parallelism (ILP) at compile time. A treegion is a single-entry/multiple-exit global scheduling scope that consists of basic blocks with control-flow that forms a tree. Because a treegion scope is nonlinear (includes multiple paths), it is distinguished from linear scopes such as traces or superblocks. Treegion scheduling has the capability of speeding up all possible paths within the scheduling scope. This paper presents a new global scheduling algorithm using treegions called Tree Traversal Scheduling (TTS). Efficient, incremental data-flow analysis in support of TTS is also presented. Performance results are compared to the scheduling of the linear regions that result from the decomposition of treegions. We refer to these resultant linear regions as linear treegions (LT) and consider them analogous to superblocks with the same amount of code expansion as the base treegion. Experimental results for TTS scheduling show a 35% speedup compared to basic block (BB) scheduling and a 4% speedup compared to LT scheduling.
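To make the treegion definition concrete, the sketch below partitions a control-flow graph into tree-shaped regions: the entry block and every block that does not have exactly one incoming edge start a new region, and a region absorbs any successor reached by a single edge. The abstract does not describe how treegions are formed, so this partitioning rule, the `form_treegions` name, and the dictionary-based CFG encoding are all assumptions for illustration.

```python
def form_treegions(cfg, entry):
    """Partition a CFG into single-entry, tree-shaped regions.

    cfg maps each basic block name to the list of its successor blocks.
    Returns a dict: region root -> list of blocks in that region.
    """
    # Count incoming edges for every block.
    preds = {block: 0 for block in cfg}
    for succs in cfg.values():
        for s in succs:
            preds[s] += 1

    # Merge points (and the entry) cannot be interior tree nodes.
    roots = [b for b in cfg if b == entry or preds[b] != 1]

    treegions = {}
    for root in roots:
        members, stack = [], [root]
        while stack:
            block = stack.pop()
            members.append(block)
            for s in cfg[block]:
                # Absorb a successor only if this edge is its sole way in.
                if s != entry and preds[s] == 1:
                    stack.append(s)
        treegions[root] = members
    return treegions
```

On a diamond-shaped graph (A branches to B and C, both of which fall through to D), this yields one region containing both paths out of A and a second region rooted at the merge point D, which is the multi-path property that separates treegions from traces or superblocks.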
09/2001;
-
ABSTRACT: Lower threshold voltages in deep sub-micron technologies cause more leakage current, increasing static power dissipation. This trend, combined with the trend of larger/more cache memories dominating die area, has prompted circuit designers to develop SRAM cells with low-leakage operating modes (e.g., sleep mode). Sleep mode reduces static power dissipation but data stored in a sleeping cell is unreliable or lost. So, at the architecture level, there is interest in exploiting sleep mode to reduce static power dissipation while maintaining high performance. Current approaches dynamically control the operating mode of large groups of cache lines or even individual cache lines. However, the performance monitoring mechanism that controls the percentage of sleep-mode lines, and identifies particular lines for sleep mode, is somewhat arbitrary. There is no way to know what the performance could be with all cache lines active, so arbitrary miss rate targets are set (perhaps on a per-benchmark basis using profile information) and the control mechanism tracks these targets. We propose applying sleep mode only to the data store and not the tag store. By keeping the entire tag store active, the hardware knows what the hypothetical miss rate would be if all data lines were active and the actual miss rate can be made to precisely track it. Simulations show an average of 73% of I-cache lines and 54% of D-cache lines are put in sleep mode with an average IPC impact of only 1.7%, for 64KB caches.
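The benefit of keeping the tag store active can be shown with a small model: because tags are always valid, every access can be classified as a miss that would occur anyway (tag mismatch) or a miss caused only by a sleeping data line (tag match, data asleep). The direct-mapped organization, the class and counter names, and the manual `sleep` call are assumptions of this sketch; the policy that decides which lines to put to sleep is not modeled.

```python
class SleepyCache:
    """Direct-mapped cache model: tags always powered, data lines may sleep."""

    def __init__(self, num_lines, line_bytes=64):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines    # tag store, never put to sleep
        self.awake = [True] * num_lines   # per-line state of the data store
        self.accesses = 0
        self.tag_misses = 0               # misses even if all data were awake
        self.sleep_misses = 0             # extra misses caused by sleep mode

    def sleep(self, index):
        """Put one data line into sleep mode (its contents are lost)."""
        self.awake[index] = False

    def access(self, addr):
        """Return True on a hit; classify and count the miss otherwise."""
        self.accesses += 1
        block = addr // self.line_bytes
        index, tag = block % self.num_lines, block // self.num_lines
        if self.tags[index] != tag:
            self.tag_misses += 1
            self.tags[index] = tag
        elif not self.awake[index]:
            self.sleep_misses += 1
        else:
            return True
        self.awake[index] = True          # the refill wakes the data line
        return False

    def miss_rates(self):
        """Return (hypothetical all-active miss rate, actual miss rate)."""
        total = max(1, self.accesses)
        hypothetical = self.tag_misses / total
        actual = (self.tag_misses + self.sleep_misses) / total
        return hypothetical, actual
```

The gap between the two rates returned by `miss_rates` is exactly the cost of sleep mode, which is the quantity a controller would need in order to make the actual miss rate track the hypothetical one.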
Parallel Architectures and Compilation Techniques, 2001. Proceedings. 2001 International Conference on; 02/2001
-
Languages and Compilers for Parallel Computing, 14th International Workshop, LCPC 2001, Cumberland Falls, KY, USA, August 1-3, 2001. Revised Papers; 01/2001
-
ABSTRACT: We advocate using performance bounds to guide code optimizations. Accurate performance bounds establish an efficient way to evaluate benefits as well as overheads of code transformations without actually performing instruction scheduling. In this paper, we introduce a novel bound-guided approach to systematically regulate code size related instruction level parallelism (ILP) optimizations including tail duplication, loop unrolling and if-conversion. Our approach is based on the notion of code size efficiency, which is defined as the ratio of ILP improvement over code size increase. With such a notion, we can 1) develop a general approach to selectively perform optimizations to maximize the ILP improvement while minimizing the cost in code size; and 2) define the optimal tradeoff between ILP improvement and code size overhead and develop a simple heuristic to achieve this optimal tradeoff. Experimental results using SPEC CINT 2000 benchmarks show that the performance improves significantly with very little code size increase using our systematic way to regulate code transformations and the simple heuristic is effective in achieving the optimal tradeoff.
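The notion of code size efficiency lends itself to a simple selection rule: rank candidate transformations by ILP improvement per unit of code size increase and keep only those above a threshold. The sketch below illustrates that rule under stated assumptions; the function name, the tuple layout, the threshold value, and the candidate numbers are all hypothetical, and the bound computation that would supply the ILP estimates is not shown.

```python
def select_transformations(candidates, threshold):
    """Pick transformations whose code size efficiency meets the threshold.

    candidates: list of (name, ilp_gain, size_increase) tuples, where
    size_increase must be positive. Efficiency is ilp_gain / size_increase.
    Returns the selected names, most efficient first.
    """
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    return [name for name, gain, size in ranked if gain / size >= threshold]


# Made-up candidates: (name, ILP gain in %, code size increase in %)
example = [
    ("tail_dup_A", 6.0, 2.0),   # efficiency 3.0
    ("unroll_B", 1.0, 10.0),    # efficiency 0.1
    ("if_conv_C", 3.0, 1.5),    # efficiency 2.0
]
chosen = select_transformations(example, threshold=1.0)
```

With these invented numbers the unrolling candidate is rejected because it buys little ILP for a large size increase, while the other two are kept.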
-
ABSTRACT: Due to the ever-increasing computational power of contemporary microprocessors, the execution time spent on actual arithmetic computations (i.e., computations not involving slow memory operations such as cache misses) is significantly reduced. Therefore, for memory intensive workloads, it is more important to overlap multiple cache misses than to overlap slow memory operations with other computations. Based on the prediction of missing loads' addresses, data prefetching techniques have the capability to bring in the required data early so that the miss latency can be overlapped with prior memory operations or other computations. In this paper, we highlight the limitations of traditional prefetching techniques and advocate that value prediction, though proposed originally as an instruction-level-parallelism (ILP) optimization, is more capable of overlapping cache misses and increasing memory-level parallelism (MLP) than traditional prefetching techniques for pointer-chasing workloads. We develop an analytical model to examine the performance potential of both data prefetching and value prediction. Important and somewhat unexpected insights are revealed, and the code characteristics leading to the performance differences between these two techniques are identified. The observations based on the model provide a sound theoretical background for recent proposals on hiding memory access latencies and highlight their performance potential.