Article

A Framework for Modeling and Optimization of Prescient Instruction Prefetch

Abstract

This paper describes a framework for modeling macroscopic program behavior and applies it to optimizing prescient instruction prefetch -- a novel technique that uses helper threads to improve single-threaded application performance by performing judicious and timely instruction prefetch. A helper thread is initiated when the main thread encounters a spawn point, and prefetches instructions starting at a distant target point. The target identifies a code region tending to incur I-cache misses that the main thread is likely to execute soon, even though intervening control flow may be unpredictable. The optimization of spawn-target pair selections is formulated by modeling program behavior as a Markov chain based on profile statistics. Execution paths are considered stochastic outcomes, and aspects of program behavior are summarized via path expression mappings. Mappings for computing reaching and posteriori probabilities; path length mean and variance; and expected path footprint are presented. These are used with Tarjan's fast path algorithm to efficiently estimate the benefit of spawn-target pair selections. Using this framework we propose a spawn-target pair selection algorithm for prescient instruction prefetch. This algorithm has been implemented and evaluated for the Itanium Processor Family architecture. A limit study finds 4.8% to 17% speedups on an in-order simultaneous multithreading processor with eight contexts, over nextline and streaming I-prefetch for a set of benchmarks with high I-cache miss rates. The framework in this paper is potentially applicable to other thread speculation techniques.
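The mappings mentioned above can be illustrated with a small sketch. The following is not the authors' implementation; it only shows how path expression operators over CFG edges might be mapped to a reaching probability and an expected path length when branch outcomes are treated as independent and taken from edge profiles, as in a Markov-chain model of control flow. The tuple representation and the example expression are assumptions made for illustration.

```python
# Minimal sketch (not the paper's code): mapping path expressions to
# (reaching probability, expected path length) when branch outcomes are
# independent and given by edge profile probabilities.
#
# A path expression is a regular expression over CFG edges; here it is a
# nested tuple: ('edge', prob, length) | ('cat', r1, r2) | ('alt', r1, r2)
# | ('star', r).  This representation is an illustrative assumption.

def evaluate(r):
    """Return (P, E): P = total probability mass of the path set,
    E = sum over paths of probability * length."""
    kind = r[0]
    if kind == 'edge':                     # single CFG edge with profile prob
        _, p, length = r
        return p, p * length
    if kind == 'cat':                      # concatenation: paths compose
        p1, e1 = evaluate(r[1]); p2, e2 = evaluate(r[2])
        return p1 * p2, e1 * p2 + p1 * e2
    if kind == 'alt':                      # union of disjoint alternatives
        p1, e1 = evaluate(r[1]); p2, e2 = evaluate(r[2])
        return p1 + p2, e1 + e2
    if kind == 'star':                     # loop: geometric number of trips
        p1, e1 = evaluate(r[1])
        assert p1 < 1.0, "loop back-edge probability must be < 1"
        return 1.0 / (1.0 - p1), e1 / (1.0 - p1) ** 2
    raise ValueError(kind)

# Example: spawn edge, then a loop body taken 90% of the time (5 instructions
# per trip), then a 3-instruction exit path to the target.
loop_body = ('edge', 0.9, 5)
expr = ('cat', ('edge', 1.0, 2),
        ('cat', ('star', loop_body), ('edge', 0.1, 3)))
p, e = evaluate(expr)
print(f"reaching probability = {p:.3f}, mean path length = {e / p:.1f} instructions")
```

The mean spawn-to-target path length follows as E divided by P; a variance mapping can be built in the same style by additionally propagating the second moment of path length.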


... Section 3 summarizes the findings of our detailed investigation into sources of modeling error in p-thread optimization. In Section 4, two extensions of our earlier path expression modeling technique [2] are proposed, which can be used instead of execution trace data. Section 5 presents our experimental evaluation of the proposed techniques. ...
... We found the impact is equivalent to a constant multiplicative factor applied to the minimum of the prefetch slack and the memory latency. Branch Outcome Correlation: The path expression framework described in [2] assumes that branch outcomes are uncorrelated. This approximation may lead the modeling framework to underestimate the frequency of execution paths from the spawn point to the target delinquent load, causing the benefit of the associated helper threads to be underestimated. ...
... In this section we describe an algorithm for computing these quantities from data that is easy to obtain within present-day feedback-directed optimizing compiler frameworks. The algorithm we describe uses path expressions [34, 35, 2]. A path expression is a regular expression summarizing all paths between two vertices in a directed graph. ...
Conference Paper
This paper investigates helper threads that improve performance by prefetching data on behalf of an application's main thread. The focus is data prefetch helper threads that lack branch instructions and which generate prefetches for one dynamic instance of a delinquent load instruction per spawned helper thread. This form of helper thread, sometimes called a simple p-thread, has been studied previously by Roth et al. (29, 26) who proposed a framework for optimizing their impact. A key step in that framework is predicting the performance impact of a helper thread. In this paper we propose and evaluate a novel performance prediction technique that achieves comparable results yet requires less detailed information about dynamic program behavior. This technique extends a path expression based statistical modeling framework (2) by incorporating information about branch correlation (which we show is important) and by considering data flow information in a statistical manner. Significantly, the profile information we use is similar to that provided within current optimizing compilers. This paper also provides the first comprehensive assessment of the sources of modeling error relevant to predicting the performance impact of simple p-threads.
... A key challenge for instruction prefetch is to accurately predict control flow sufficiently in advance of the fetch unit to tolerate the latency of the memory hierarchy. The notion of prescient instruction prefetch [1] was first introduced as a technique that uses helper threads to improve single-threaded application performance by performing judicious and timely instruction prefetch. ...
... When supplied with prescient instruction prefetch enabled application software, these hardware mechanisms can yield significant performance improvement while remaining complexity-efficient. While the potential benefit of prescient instruction prefetch was demonstrated previously via a limit study [1], the techniques presented in this paper show how to implement prescient instruction prefetch under realistic constraints. In addition, we evaluate the performance scalability of prescient instruction prefetch as memory latency scales up, and in comparison to aggressive hardware-only I-prefetch mechanisms. ...
... An efficient methodology for optimizing the selection of spawn-target pairs based on a rigorous statistical model of dynamic program execution has been described previously [1]. ...
Conference Paper
This paper proposes and evaluates hardware mechanisms for supporting prescient instruction prefetch — an approach to improving single-threaded application performance by using helper threads to perform instruction prefetch. We demonstrate the need for enabling store-to-load communication and selective instruction execution when directly pre-executing future regions of an application that suffer I-cache misses. Two novel hardware mechanisms, safe-store and YAT-bits, are introduced that help satisfy these requirements. This paper also proposes and evaluates finite state machine recall, a technique for limiting pre-execution to branches that are hard to predict by leveraging a counted I-prefetch mechanism. On a research Itanium® SMT processor with next line and streaming I-prefetch mechanisms that incurs latencies representative of next generation processors, prescient instruction prefetch can improve performance by an average of 10.0% to 22% on a set of SPEC 2000 benchmarks that suffer significant I-cache misses. Prescient instruction prefetch is found to be competitive against even the most aggressive research hardware instruction prefetch technique: fetch directed instruction prefetch.
... Other approaches to instruction prefetching include using helper prefetching threads whose sole purpose is to run ahead to provide prefetching for the main thread [1]. This option is not explored here because server applications are typically highly multithreaded, so the use of helper threads instead of worker threads may be a liability rather than an advantage. Other problems with their approach are the overhead of triggering helper threads and the need for helper threads to run far enough ahead of the work ...
Article
Full-text available
We present a detailed characterization of instruction cache performance for IBM's J2EE-enabled web server, WebSphere Application Server (WAS). When running two J2EE benchmarks on WebSphere, we find that instruction cache misses cause a 12% performance penalty on current-generation Power5-based multiprocessor systems. To mitigate this performance loss, we describe a new call-chain based algorithm for inserting software prefetch instructions and evaluate its potential for improved instruction cache performance. The performance of this algorithm depends on the selection of several independent parameters which control the distance and number of prefetches inserted for a particular method. Through characterization of the WebSphere applications, we select these parameters, and ultimately find that our call-chain based insertion algorithm achieves an 18% reduction in instruction cache miss rate for Java methods.
... For example, [9] introduces the concept of a control quasi-independent point, which they define as a future instruction 95% likely to be on a future control-flow path. Other recent works have proposed the use of control quasi-independent points for instruction prefetching [1,2]. Falcón et al. present Prophet/Critic Hybrid Branch Prediction [7]. ...
Conference Paper
Full-text available
This paper presents a novel microarchitecture technique for accurately predicting control flow reconvergence dynamically. A reconvergence point is the earliest dynamic instruction in the program where we can expect program paths to reconverge regardless of the outcome or target of the current branch. Thus, even if the immediate control flow after a branch is uncertain, execution following the reconvergence point is certain. This paper proposes a novel hardware reconvergence predictor which is both implementable and accurate, with a 4KB predictor achieving more than 95% accuracy for SPEC INT, and larger implementations achieving greater than 99% accuracy. The information provided by reconvergence prediction can increase the effectiveness of a range of previously proposed performance optimizations, including speculative multithreading, control independence, and squash reuse. This paper also demonstrates a new technique that takes advantage of the dynamic reconvergence prediction information in order to predict a wrong path excursion ahead of branch resolution. On average, 34% of wrong path fetches are eliminated.
... Other approaches to instruction prefetching include using helper prefetching threads whose sole purpose is to run ahead to provide prefetching for the main thread [1]. We do not explore this option since server applications are typically highly multithreaded and, as such, the use of helper threads instead of worker threads may be a liability rather than an advantage. Other problems with this approach are the overhead of triggering helper threads and the need for helper threads to run far enough ahead of the ...
Conference Paper
Full-text available
We present a detailed characterization of instruction cache performance for IBM's J2EE-enabled Web server, WebSphere Application Server (WAS). When running two J2EE benchmarks on WebSphere, we find that instruction cache misses cause a 12% performance penalty on current-generation Power5-based multiprocessor systems. To mitigate this performance loss, we describe a new call-chain based algorithm for inserting software prefetch instructions, and evaluate its potential for improved instruction cache performance. The performance of this algorithm depends on the selection of several independent parameters which control the distance and number of prefetches inserted for a particular method. We select these parameters through characterization of the WebSphere applications, and ultimately find that our call-chain based insertion algorithm achieves significant reduction in instruction cache miss rate for Java methods.
Article
Full-text available
ACKNOWLEDGMENTS First of all, I am indebted to my advisor Prof. Csaba Andras Moritz for his encouragement, advice, mentoring, and research support throughout my whole PhD study. Professor Moritz's insights and ideas have been of the greatest help in making this dissertation possible. I want to thank Prof. C. Mani Krishna, Prof. Israel Koren and Dr. Zahava Koren for their help during my research at UMass. Through all the research meetings and discussions, their technical advice has helped my research work and this dissertation.
Article
Modern microprocessors devote a large portion of their chip area to caches in order to bridge the speed and bandwidth gap between the core and main memory. One known problem with caches is that they are usually used with low efficiency; only a small fraction of the cache stores data that will be used before getting evicted. As the focus of microprocessor design shifts towards achieving higher performance-per-watt, cache efficiency is becoming increasingly important. This dissertation proposes techniques to improve both data cache efficiency in general and instruction cache efficiency for Explicit Data Graph Execution (EDGE) architectures. To improve the efficiency of data caches and L2 caches, dead blocks (blocks that will not be referenced again before their eviction from the cache) should be identified and evicted early. Prior schemes predict the death of a block immediately after it is accessed, based on the individual reference history of the block. Such schemes result in lower prediction accuracy and coverage. We delay the prediction to achieve better prediction accuracy and coverage. For the L1 cache, we propose a new class of dead-block prediction schemes that predict dead blocks based on cache bursts. A cache burst begins when a block moves into the MRU position and ends when it moves out of the MRU position. Cache burst history is more predictable than individual reference history and results in better dead-block prediction accuracy and coverage. Experimental results show that predicting the death of a block at the end of a burst gives the best tradeoff between timeliness and prediction accuracy/coverage. We also propose mechanisms to improve counting-based dead-block predictors, which work best at the L2 cache. These mechanisms handle reference-count variations, which cause problems for existing counting-based dead-block predictors. The new schemes can identify the majority of the dead blocks with approximately 90% or higher accuracy. For a 64KB, two-way L1 D-cache, 96% of the dead blocks can be identified with a 96% accuracy, half way into a block's dead time. For a 64KB, four-way L1 cache, the prediction accuracy and coverage are 92% and 91% respectively. At any moment, the average fraction of the dead blocks that has been correctly detected for a two-way or four-way L1 cache is approximately 49% or 67% respectively. For a 1MB, 16-way set-associative L2 cache, 66% of the dead blocks can be identified with an 89% accuracy, 1/16th way into a block's dead time. At any moment, 63% of the dead blocks in such an L2 cache, on average, have been correctly identified by the dead-block predictor. The ability to accurately identify the majority of the dead blocks in the cache long before their eviction can lead to not only higher cache efficiency, but also reduced power consumption or higher reliability. In this dissertation, we use the dead-block information to improve cache efficiency and performance by three techniques: replacement optimization, cache bypassing, and prefetching into dead blocks. Replacement optimization evicts blocks that become dead after several reuses, before they reach the LRU position. Cache bypassing identifies blocks that cause cache misses but will not be reused if they are written into the cache, and does not store these blocks in the cache. Prefetching into dead blocks replaces dead blocks with prefetched blocks that are likely to be referenced in the future.
Simulation results show that replacement optimization or bypassing improves performance by 5% and prefetching into dead blocks improves performance by 12% over the baseline prefetching scheme for the L1 cache and by 13% over the baseline prefetching scheme for the L2 cache. Each of these three techniques can turn part of the identified dead blocks into live blocks. As new techniques that can better utilize the space of the dead blocks are found, the dead-block information is likely to become more valuable. Compared to RISC architectures, the instruction cache in EDGE architectures faces challenges such as higher miss rate, because of the increase in code size, and longer miss penalty, because of the large block size and the distributed microarchitecture. To improve the instruction cache efficiency in EDGE architectures, we decouple the next-block prediction from the instruction fetch so that the next-block prediction can run ahead of instruction fetch and the predicted blocks can be prefetched into the instruction cache before they cause any I-cache misses. In particular, we discuss how to decouple the next-block prediction from the instruction fetch and how to control the run-ahead distance of the next-block predictor in a fully distributed microarchitecture. The performance benefit of such a look-ahead instruction prefetching scheme is then evaluated and the run-ahead distance that gives the best performance improvement is identified. In addition to prefetching, we also estimate the performance benefit of storing variable-sized blocks in the instruction cache. Such schemes reduce the inefficiency caused by storing NOPs in the I-cache and enable the I-cache to store more blocks with the same capacity. Simulation results show that look-ahead instruction prefetching and storing variable-sized blocks can improve the performance of the benchmarks that have high I-cache miss rates by 17% and 18% respectively, out of an ideal 30% performance improvement only achievable by a perfect I-cache. Such techniques will close the gap in I-cache hit rates between EDGE architectures and RISC architectures, although the latter will still have higher I-cache hit rates because of the smaller code size.
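As a rough illustration of the burst-based idea described above, the following toy model (not the dissertation's predictor) tracks, for one LRU set, how many MRU bursts each block completes, remembers the count observed when a block is evicted, and flags a refetched block as dead once it reaches that count again. The per-address learning table and the trace are simplifications assumed only for illustration.

```python
# Illustrative sketch of burst-based dead-block prediction for one LRU set.
# A "burst" starts when a block becomes MRU and ends when it is displaced
# from MRU.  The toy predictor learns, per address, how many bursts the block
# saw the last time it died, and predicts "dead" once the block reaches that
# count in its next generation.

from collections import OrderedDict

class BurstDeadBlockSet:
    def __init__(self, ways=2):
        self.ways = ways
        self.lru = OrderedDict()      # addr -> completed bursts (MRU = last)
        self.learned = {}             # addr -> burst count at last eviction

    def predicted_dead(self, addr):
        target = self.learned.get(addr)
        return target is not None and self.lru.get(addr, 0) >= target

    def access(self, addr):
        if addr in self.lru:
            # Returning to MRU: if the block was not already MRU, its
            # previous burst has just completed.
            was_mru = next(reversed(self.lru)) == addr
            bursts = self.lru.pop(addr) + (0 if was_mru else 1)
            self.lru[addr] = bursts
            return "hit"
        if len(self.lru) >= self.ways:   # evict true LRU, remember its bursts
            victim, bursts = self.lru.popitem(last=False)
            self.learned[victim] = bursts + 1   # include the burst in progress
        self.lru[addr] = 0               # new block starts its first burst
        return "miss"

set0 = BurstDeadBlockSet(ways=2)
for a in [1, 2, 1, 3, 1, 2, 1, 3, 2, 3]:     # toy access trace
    set0.access(a)
    print(a, "predicted dead?", set0.predicted_dead(a))
```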
Article
Full-text available
This paper examines simultaneous multithreading, a technique permitting several independent threads to issue instructions to a superscalar's multiple functional units in a single cycle. We present several models of simultaneous multithreading and compare them with alternative organizations: a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue multiprocessing architectures. Our results show that both (single-threaded) superscalar and fine-grain multithreaded architectures are limited in their ability to utilize the resources of a wide-issue processor. Simultaneous multithreading has the potential to achieve 4 times the throughput of a superscalar, and double that of fine-grain multithreading. We evaluate several cache configurations made possible by this type of organization and evaluate tradeoffs between them. We also show that simultaneous multithreading is an attractive alternative to single-chip multiprocessors; simultaneous multithreaded processors with a variety of organizations outperform corresponding conventional multiprocessors with similar execution resources. While simultaneous multithreading has excellent potential to increase processor utilization, it can add substantial complexity to the design. We examine many of these complexities and evaluate alternative organizations in the design space.
Conference Paper
Full-text available
We describe the Slice Processor micro-architecture that implements a generalized operation-based prefetching mechanism. Operation-based prefetchers predict the series of operations, or the computation slice, that can be used to calculate forthcoming memory references. This is in contrast to outcome-based predictors that exploit regularities in the (address) outcome stream. Slice processors are a generalization of existing operation-based prefetching mechanisms such as stream buffers where the operation itself is fixed in the design (e.g., address + stride). A slice processor dynamically identifies frequently missing loads and extracts on-the-fly the relevant address computation slices. Such slices are then executed in parallel with the main sequential thread, prefetching memory data. We describe the various support structures and emphasize the design of the slice detection mechanism. We demonstrate that a relatively simple organization can significantly improve performance over an aggressive, dynamically-scheduled processor for a set of pointer-intensive programs and for some integer applications from the SPEC'95 suite. In particular, a slice processor that can detect slices of up to 8 instructions extracted over a region of up to 32 instructions improves performance by 11% on average (even if slice detection requires up to 32 cycles). Allowing slices of up to 16 instructions results in an average performance improvement of 15%. Finally, we study how our operation-based predictor interacts with an outcome-based one and find them mutually beneficial.
Conference Paper
Full-text available
A large number of memory accesses in memory-bound applications are irregular, such as pointer dereferences, and can be effectively targeted by thread-based prefetching techniques like Speculative Precomputation. These techniques execute instructions, for example on an available SMT thread context, that have been extracted directly from the program they are trying to accelerate. Proposed techniques typically require manual user intervention to extract and optimize instruction sequences. This paper proposes Dynamic Speculative Precomputation, which performs all necessary instruction analysis, extraction, and optimization through the use of back-end instruction analysis hardware, located off the processor's critical path. For a set of memory limited benchmarks an average speedup of 14% is achieved when constructing simple p-slices, and this gain grows to 33% when making use of aggressive optimizations.
Conference Paper
Full-text available
Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the "best" instructions to the processor.
Conference Paper
Full-text available
Branch misprediction penalties continue to increase as microprocessor cores become wider and deeper. Thus, improving branch prediction accuracy remains an important challenge. Simultaneous subordinate microthreading (SSMT) provides a means to improve branch prediction accuracy. SSMT machines run multiple, concurrent microthreads in support of the primary thread. We propose to dynamically construct microthreads that can speculatively and accurately pre-compute branch outcomes along frequently mispredicted paths. The mechanism is intended to be implemented entirely in hardware. We present the details for doing so. We show how to select the right paths, how to generate accurate predictions, and how to get this information in a timely way. We achieve an average gain of 8.4% (42% maximum) over a very aggressive baseline machine on the SPECint95 and SPECint2000 benchmark suites.
Conference Paper
Full-text available
Data cache misses reduce the performance of wide-issue processors by stalling the data supply to the processor. Prefetching data by predicting the miss address is one way to tolerate the cache miss latencies. But current applications with irregular access patterns make it difficult to accurately predict the address sufficiently early to mask large cache miss latencies. This paper explores an alternative to predicting prefetch addresses, namely precomputing them. The Dependence Graph Precomputation scheme (DGP) introduced in this paper is a novel approach for dynamically identifying and precomputing the instructions that determine the addresses accessed by those load/store instructions marked as being responsible for most data cache misses. DGP's dependence graph generator efficiently generates the required dependence graphs at run time. A separate precomputation engine executes these graphs to generate the data addresses of the marked load/store instructions early enough for accurate prefetching. Our results show that 94% of the prefetches issued by DGP are useful, reducing the D-cache miss stall time by 47%. Thus DGP takes us about half way from an already highly tuned baseline system toward perfect D-cache performance. DGP improves the overall performance of a wide range of applications by 7% over tagged next line prefetching, by 13% over a baseline processor with no prefetching, and is within 15% of the perfect D-cache performance.
Conference Paper
Full-text available
Instruction supply is a crucial component of processor performance. Instruction prefetching has been proposed as a mechanism to help reduce instruction cache misses, which in turn can help increase instruction supply to the processor. In this paper we examine a new instruction prefetch architecture called Fetch Directed Prefetching, and compare it to the performance of next-line prefetching and streaming buffers. This architecture uses a decoupled branch predictor and instruction cache, so the branch predictor can run ahead of the instruction cache fetch. In addition, we examine marking fetch blocks in the branch predictor that are kicked out of the instruction cache, so branch predicted fetch blocks can be accurately prefetched. Finally, we model the use of idle instruction cache ports to filter prefetch requests, thereby saving bus bandwidth to the L2 cache.
Article
Full-text available
Microprocessors continue on the relentless path to provide more performance. Every new innovation in computing -- distributed computing on the Internet, data mining, Java programming, and multimedia data streams -- requires more cycles and computing power. Even traditional applications such as databases and numerically intensive codes present increasing problem sizes that drive demand for higher performance. Design innovations, compiler technology, manufacturing process improvements, and integrated circuit advances have been driving exponential performance increases in microprocessors. To continue this growth in the future, Hewlett Packard and Intel architects examined barriers in contemporary designs and found that instruction-level parallelism (ILP) can be exploited for further performance increases. This article examines the motivation, operation, and benefits of the major features of IA-64. Intel's IA-64 manual provides a complete specification of the IA-64 architecture.
Article
Full-text available
Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the "best" instructions to the processor.
Article
Full-text available
Instruction supply is a crucial component of processor performance. Instruction prefetching has been proposed as a mechanism to help reduce instruction cache misses, which in turn can help increase instruction supply to the processor. In this paper we examine a new instruction prefetch architecture called Fetch Directed Prefetching, and compare it to the performance of next-line prefetching and streaming buffers. This architecture uses a decoupled branch predictor and instruction cache, so the branch predictor can run ahead of the instruction cache fetch. In addition, we examine marking fetch blocks in the branch predictor that are kicked out of the instruction cache, so branch predicted fetch blocks can be accurately prefetched. Finally, we model the use of idle instruction cache ports to filter prefetch requests, thereby saving bus bandwidth to the L2 cache.
Article
Full-text available
Understanding program behavior is at the foundation of computer architecture and program optimization. Many programs have wildly different behavior on even the very largest of scales (over the complete execution of the program). This realization has ramifications for many architectural and compiler techniques, from thread scheduling to feedback directed optimizations, to the way programs are simulated. However, in order to take advantage of time-varying behavior, we must first develop the analytical tools necessary to automatically and efficiently analyze program behavior over large sections of execution. Our goal is to develop automatic techniques that are capable of finding and exploiting the Large Scale Behavior of programs (behavior seen over billions of instructions). The first step towards this goal is the development of a hardware independent metric that can concisely summarize the behavior of an arbitrary section of execution in a program. To this end we examine the use of Basic Block Vectors. We quantify the effectiveness of Basic Block Vectors in capturing program behavior across several different architectural metrics, explore the large scale behavior of several programs, and develop a set of algorithms based on clustering capable of analyzing this behavior. We then demonstrate an application of this technology to automatically determine where to simulate for a program to help guide computer architecture research.
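As a sketch of the basic block vector idea, the snippet below (illustrative, not the paper's tooling) cuts a toy basic-block trace into fixed-length intervals, summarizes each interval as a normalized per-block execution-count vector, and compares intervals with Manhattan distance, which makes the phase change in the middle of the trace stand out.

```python
# Toy basic block vectors (BBVs): interval length and the trace are invented.
from collections import Counter

def basic_block_vectors(block_trace, interval_len):
    vectors = []
    for start in range(0, len(block_trace), interval_len):
        counts = Counter(block_trace[start:start + interval_len])
        total = sum(counts.values())
        vectors.append({b: c / total for b, c in counts.items()})
    return vectors

def manhattan(v1, v2):
    keys = set(v1) | set(v2)
    return sum(abs(v1.get(k, 0.0) - v2.get(k, 0.0)) for k in keys)

trace = ['A', 'B'] * 100 + ['C', 'D'] * 50 + ['A', 'B'] * 50
bbvs = basic_block_vectors(trace, interval_len=100)
# Distance of each interval from the first one; the C/D phase shows up clearly.
print([round(manhattan(bbvs[0], v), 2) for v in bbvs])
```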
Article
Full-text available
This paper describes the design and implementation of an optimizing compiler that automatically generates profile information to assist classic code optimizations. This compiler contains two new components, an execution profiler and a profile-based code optimizer, which are not commonly found in traditional optimizing compilers. The execution profiler inserts probes into the input program, executes the input program for several inputs, accumulates profile information and supplies this information to the optimizer. The profile-based code optimizer uses the profile information to expose new optimization opportunities that are not visible to traditional global optimization methods. Experimental results show that the profile-based code optimizer significantly improves the performance of production C programs that have already been optimized by a high-quality global code optimizer.
Article
Full-text available
A large number of memory accesses in memory-bound applications are irregular, such as pointer dereferences, and can be effectively targeted by thread-based prefetching techniques like Speculative Precomputation. These techniques execute instructions, for example on an available SMT thread context, that have been extracted directly from the program they are trying to accelerate. Proposed techniques typically require manual user intervention to extract and optimize instruction sequences. This paper proposes Dynamic Speculative Precomputation, which performs all necessary instruction analysis, extraction, and optimization through the use of back-end instruction analysis hardware, located off the processor's critical path. For a set of memory limited benchmarks an average speedup of 14% is achieved when constructing simple p-slices, and this gain grows to 33% when making use of aggressive optimizations.
Article
This paper describes the design and implementation of an optimizing compiler that automatically generates profile information to assist classic code optimizations. This compiler contains two new components, an execution profiler and a profile-based code optimizer, which are not commonly found in traditional optimizing compilers. The execution profiler inserts probes into the input program, executes the input program for several inputs, accumulates profile information and supplies this information to the optimizer. The profile-based code optimizer uses the profile information to expose new optimization opportunities that are not visible to traditional global optimization methods. Experimental results show that the profile-based code optimizer significantly improves the performance of production C programs that have already been optimized by a high-quality global code optimizer.
Article
The memory reference trace of a computation is modeled as a probabilistic process and the information content of that process is derived. Techniques are developed for analyzing the effectiveness of the addressing architecture and Memory/CPU traffic of existing machines with respect to the information theoretic bound for a given trace. Several techniques for analyzing particular aspects of addressing architecture are also developed. Possible areas of improvement for addressing architecture, compilers, and memory architecture are suggested for performance enhancement.
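A much-simplified sketch of this information-theoretic view: treating the reference stream as a probabilistic source, the zeroth-order entropy of the address distribution gives a lower bound on the average number of bits a reference carries. The paper's model is richer than this; the trace below is made up for illustration.

```python
# Zeroth-order entropy of a toy memory reference trace (illustrative only).
import math
from collections import Counter

def trace_entropy_bits(trace):
    counts = Counter(trace)
    n = len(trace)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

trace = [0x1000, 0x1008, 0x1000, 0x1008, 0x2000, 0x1000, 0x1008, 0x1000]
print(f"{trace_entropy_bits(trace):.2f} bits/reference vs "
      f"{math.ceil(math.log2(len(set(trace))))} bits for a naive encoding")
```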
Conference Paper
Program slicing is a method for automatically decomposing programs by analyzing their data flow and control flow. Starting from a subset of a program's behavior, slicing reduces that program to a minimal form which still produces that behavior. The reduced program, called a "slice," is an independent program guaranteed to represent faithfully the original program within the domain of the specified subset of behavior. Some properties of slices are presented. In particular, finding statement-minimal slices is in general unsolvable, but using data flow analysis is sufficient to find approximate slices. Potential applications include automatic slicing tools for debugging and parallel processing of slices.
Conference Paper
Current work in Simultaneous Multithreading provides little benefit to programs that aren't partitioned into threads. We propose Simultaneous Subordinate Microthreading (SSMT) to correct this by spawning subordinate threads that perform optimizations on behalf of the single primary thread. These threads, written in microcode, are issued and executed concurrently with the primary thread. They directly manipulate the microarchitecture to improve the primary thread's branch prediction accuracy, cache hit rate, and prefetch effectiveness. All contribute to the performance of the primary thread. This paper introduces SSMT and discusses its potential to increase performance. We illustrate its usefulness with an SSMT machine that executes subordinate microthreads to improve the branch prediction of the primary thread. We show simulation results for the SPECint95 benchmarks.
Conference Paper
Recently, a number of thread-based prefetching techniques have been proposed. These techniques aim at improving the latency of single-threaded applications by leveraging multithreading resources to perform memory prefetching via speculative prefetch threads. Software-based speculative precomputation (SSP) is one such technique, proposed for multithreaded Itanium models. SSP does not require expensive hardware support-instead it relies on the compiler to adapt binaries to perform prefetching on otherwise idle hardware thread contexts at run time. This paper presents a post-pass compilation tool for generating SSP-enhanced binaries. The tool is able to: (1) analyze a single-threaded application to generate prefetch threads; (2) identify and embed trigger points in the original binary; and (3) produce a new binary that has the prefetch threads attached. The execution of the new binary spawns the speculative prefetch threads, which are executed concurrently with the main thread. Our results indicate that for a set of pointer-intensive benchmarks, the prefetching performed by the speculative threads achieves an average of 87% speedup on an in-order processor and 5% speedup on an out-of-order processor.
Article
A general method is described for solving path problems on directed graphs. Such path problems include finding shortest paths, solving sparse systems of linear equations, and carrying out global flow analysis of computer programs. The method consists of two steps. First, a collection of regular expressions representing sets of paths in the graph is constructed. This can be done by using any standard algorithm, such as Gaussian or Gauss-Jordan elimination. Next, a natural mapping from regular expressions into the given problem domain is applied. The mappings required to find shortest paths, solve sparse systems of linear equations, and carry out global flow analysis are exhibited. The results provide a general-purpose algorithm for solving any path problem and show that the problem of constructing path expressions is in some sense the most general path problem.
Article
Let G = (V, E) be a directed graph with a distinguished source vertex s. The single-source path expression problem is to find, for each vertex v, a regular expression P(s,v) which represents the set of all paths in G from s to v. A solution to this problem can be used to solve shortest path problems, solve sparse systems of linear equations, and carry out global flow analysis. A method is described for computing path expressions by dividing G into components, computing path expressions on the components by Gaussian elimination, and combining the solutions. This method requires O(m α(m, n)) time on a reducible flow graph, where n is the number of vertices in G, m is the number of edges in G, and α is a functional inverse of Ackermann's function. The method makes use of an algorithm for evaluating functions defined on paths in trees. A simplified version of the algorithm, which runs in O(m log n) time on reducible flow graphs, is quite easy to implement and efficient in practice.
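The "natural mapping" step of this approach can be sketched as follows: once a path expression P(s,v) has been built (the elimination step is not shown here), shortest-path distances over non-negative edge lengths follow from interpreting union as min, concatenation as addition, and closure as zero, since repeating a cycle never shortens a path. The expression and edge lengths below are illustrative assumptions.

```python
# Evaluate an already-built path expression under the shortest-path mapping.
INF = float('inf')

def shortest(expr):
    kind = expr[0]
    if kind == 'edge':                  # ('edge', length)
        return expr[1]
    if kind == 'empty':                 # empty path set
        return INF
    if kind == 'alt':                   # choose the cheaper alternative
        return min(shortest(expr[1]), shortest(expr[2]))
    if kind == 'cat':                   # consecutive edge lengths add
        return shortest(expr[1]) + shortest(expr[2])
    if kind == 'star':                  # best is zero iterations of the loop
        return 0.0
    raise ValueError(kind)

# P(s, v) = a ( b c | e* d ) with lengths a=2, b=5, c=1, e=4, d=7.
expr = ('cat', ('edge', 2),
        ('alt', ('cat', ('edge', 5), ('edge', 1)),
                ('cat', ('star', ('edge', 4)), ('edge', 7))))
print(shortest(expr))   # min(2+5+1, 2+0+7) = 8
```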
Article
For many applications, branch mispredictions and cache misses limit a processor's performance to a level well below its peak instruction throughput. A small fraction of static instructions, whose behavior cannot be anticipated using current branch predictors and caches, contribute a large fraction of such performance degrading events. This paper analyzes the dynamic instruction stream leading up to these performance degrading instructions to identify the operations necessary to execute them early. The backward slice (the subset of the program that relates to the instruction) of these performance degrading instructions, if small compared to the whole dynamic instruction stream, can be pre-executed to hide the instruction's latency. To overcome conservative dependence assumptions that result in large slices, speculation can be used, resulting in speculative slices. This paper provides an initial characterization of the backward slices of L2 data cache misses and branch mispredictions, and shows the effectiveness of techniques, including memory dependence prediction and control independence, for reducing the size of these slices. Through the use of these techniques, many slices can be reduced to less than one tenth of the full dynamic instruction stream when considering the 512 instructions before the performance degrading instruction.
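A minimal sketch of backward-slice construction over a dynamic instruction trace, assuming each instruction is just a destination register and its source registers: walk backwards from the performance-degrading instruction and keep only the instructions that (transitively) produce values it consumes. Memory dependences and control flow, which the paper handles with prediction and speculation, are deliberately left out of this sketch.

```python
# Toy backward slice over a dynamic register-dependence trace.
def backward_slice(trace, target_idx):
    """trace: list of (dest_reg, [src_regs]); returns trace indices in the slice."""
    live = set(trace[target_idx][1])        # registers the target still needs
    slice_idx = [target_idx]
    for i in range(target_idx - 1, -1, -1):
        dest, srcs = trace[i]
        if dest in live:                    # producer of a needed value
            slice_idx.append(i)
            live.discard(dest)
            live.update(srcs)
    return list(reversed(slice_idx))

trace = [
    ('r1', ['r9']),      # 0: r1 = load [r9]
    ('r7', ['r8']),      # 1: unrelated
    ('r2', ['r1']),      # 2: r2 = r1 + 4
    ('r7', ['r7']),      # 3: unrelated
    ('r3', ['r2']),      # 4: r3 = load [r2]   <- delinquent load
]
print(backward_slice(trace, 4))   # [0, 2, 4]
```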
Conference Paper
Pre-execution attacks cache misses for which conventional address-prediction driven prefetching is ineffective. In pre-execution, copies of cache miss computations are isolated from the main program and launched as separate threads called p-threads whenever the processor anticipates an upcoming miss. P-thread selection is the task of deciding what computations should execute on p-threads and when they should be launched such that total execution time is minimized. P-thread selection is central to the success of pre-execution. We introduce a framework for automated static p-thread selection, a static p-thread being one whose dynamic instances are repeatedly launched during the course of program execution. Our approach is to formalize the problem quantitatively and then apply standard techniques to solve it analytically. The framework has two novel components. The slice tree is a new data structure that compactly represents the space of all possible static p-threads. Aggregate advantage is a formula that uses raw program statistics and computation structure to assign each candidate static p-thread a numeric score based on estimated latency tolerance and overhead aggregated over its expected dynamic executions. Our framework finds the set of p-threads whose aggregate advantages sum to a maximum. The framework is simple and intuitively parameterized to model the salient microarchitecture features. We apply our framework to the task of choosing p-threads that cover L2 cache misses. Using detailed simulation, we study the effectiveness of our framework, and pre-execution in general, under different conditions. We measure the effect of constraining p-thread length, of adding localized optimization to p-threads, and of using various program samples as a statistical basis for the p-thread selection, and show that our framework responds to these changes in an intuitive way. In the microarchitecture dimension, we measure the effect of varying memory latency and processor width and observe that our framework adapts well to these changes. Each experiment includes a validation component which checks that the formal model presented to our framework correctly represents actual execution.
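The flavor of the aggregate-advantage score can be conveyed with a toy calculation (not the paper's exact formula): each candidate static p-thread is scored as its expected number of dynamic launches times the latency it is estimated to tolerate per launch, net of spawn and execution overhead, and candidates are ranked by this score. All field names and numbers below are invented for illustration.

```python
# Toy ranking of candidate static p-threads by an aggregate-advantage-style score.
def aggregate_advantage(cand, mem_latency=300, spawn_overhead=20):
    # A p-thread can hide at most the memory latency, and no more than the
    # slack between its launch point and the targeted miss.
    tolerated = min(cand['slack_cycles'], mem_latency) * cand['miss_prob']
    per_launch = tolerated - (spawn_overhead + cand['length'])
    return cand['expected_launches'] * per_launch

candidates = [
    {'name': 'pt_A', 'slack_cycles': 450, 'miss_prob': 0.8,
     'length': 12, 'expected_launches': 10_000},
    {'name': 'pt_B', 'slack_cycles': 60,  'miss_prob': 0.9,
     'length': 25, 'expected_launches': 50_000},
]
for c in sorted(candidates, key=aggregate_advantage, reverse=True):
    print(c['name'], round(aggregate_advantage(c)))
```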
Conference Paper
Speculative multithreading has been recently proposed to boost performance by means of exploiting thread-level parallelism in applications difficult to parallelize. The performance of these processors heavily depends on the partitioning policy used to split the program into threads. Previous work uses heuristics to spawn speculative threads based on easily-detectable program constructs such as loops or subroutines. In this work we propose a profile-based mechanism to divide programs into threads by searching for those parts of the code that have certain features that could benefit from potential thread-level parallelism. Our profile-based spawning scheme is evaluated on a Clustered Speculative Multithreaded Processor and results show large performance benefits. When the proposed spawning scheme is compared with traditional heuristics, we outperform them by almost 20%. When a realistic value predictor and an 8-cycle thread initialization penalty are considered, the performance difference between them is maintained. The speed-up over a single thread execution is higher than 5x for a 16-thread-unit processor and close to 2x for a 4-thread-unit processor.
Conference Paper
A relatively small set of static instructions has significant leverage on program execution performance. These problem instructions contribute a disproportionate number of cache misses and branch mispredictions because their behavior cannot be accurately anticipated using existing prefetching or branch prediction mechanisms. The behavior of many problem instructions can be predicted by executing a small code fragment called a speculative slice. If a speculative slice is executed before the corresponding problem instructions are fetched, then the problem instructions can move smoothly through the pipeline because the slice has tolerated the latency of the memory hierarchy (for loads) or the pipeline (for branches). This technique results in speedups of up to 43 percent over an aggressive baseline machine. To benefit from branch predictions generated by speculative slices, the predictions must be bound to specific dynamic branch instances. We present a technique that invalidates predictions when it can be determined (by monitoring the program's execution path) that they will not be used. This enables the remaining predictions to be correctly correlated.
Conference Paper
Hardly predictable data addresses in many irregular applications have rendered prefetching ineffective. In many cases, the only accurate way to predict these addresses is to directly execute the code that generates them. As multithreaded architectures become increasingly popular, one attractive approach is to use idle threads on these machines to perform pre-execution -- essentially a combined act of speculative address generation and prefetching to accelerate the main thread. In this paper we propose such a pre-execution technique for simultaneous multithreading (SMT) processors. By using software to control pre-execution, we are able to handle some of the most important access patterns that are typically difficult to prefetch. Compared with existing work on pre-execution, our technique is significantly simpler to implement (e.g., no integration of pre-execution results, no need of shortening programs for pre-execution, and no need of special hardware to copy register values upon thread spawns). Consequently, only minimal extensions to SMT machines are required to support our technique. Despite its simplicity, our technique offers an average speedup of 24% in a set of irregular applications, which is a 19% speedup over state-of-the-art software-controlled prefetching.
Conference Paper
Mispredicted branches and loads that miss in the cache cause the majority of retirement stalls experienced by sequential processors; we call these critical instructions. Despite their importance, a sequential processor has difficulty prioritizing critical computations (computations of critical instructions), because it must fetch all computations sequentially, regardless of their contribution to performance. Speculative data-driven multithreading (DDMT) is a general-purpose mechanism for overcoming this limitation. In DDMT, critical computations are annotated so that they can execute standalone. When the processor predicts an upcoming instance of a critical instruction, it microarchitecturally forks a copy of its computation as a new kind of speculative thread: a data-driven thread (DDT). The DDT executes in parallel with the main program thread, but typically generates the critical result much faster since it fetches and executes only the critical computation and not the whole program. A DDT "pre-executes" a critical computation and effectively "consumes" its latency on behalf of the main thread. A DDMT component called integration incorporates results completed in DDTs directly into the main thread, sparing it from having to repeat the work. We simulate an implementation of DDMT on top of a simultaneous multithreading (SMT) processor and use program profiles to create DDTs and annotate them into the executable. Our experiments show that DDMT pre-execution of critical loads and branches can improve performance significantly.
Conference Paper
For many applications, branch mispredictions and cache misses limit a processor's performance to a level well below its peak instruction throughput. A small fraction of static instructions, whose behavior cannot be anticipated using current branch predictors and caches, contribute a large fraction of such performance degrading events. This paper analyzes the dynamic instruction stream leading up to these performance degrading instructions to identify the operations necessary to execute them early. The backward slice (the subset of the program that relates to the instruction) of these performance degrading instructions, if small compared to the whole dynamic instruction stream, can be pre-executed to hide the instruction's latency. To overcome conservative dependence assumptions that result in large slices, speculation can be used, resulting in speculative slices. This paper provides an initial characterization of the backward slices of L2 data cache misses and branch mispredictions, and shows the effectiveness of techniques, including memory dependence prediction and control independence, for reducing the size of these slices. Through the use of these techniques, many slices can be reduced to less than one tenth of the full dynamic instruction stream when considering the 512 instructions before the performance degrading instruction.
Conference Paper
This paper examines simultaneous multithreading, a technique permitting several independent threads to issue instructions to a superscalar's multiple functional units in a single cycle. We present several models of simultaneous multithreading and compare them with alternative organizations: a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue multiprocessing architectures. Our results show that both (single-threaded) superscalar and fine-grain multithreaded architectures are limited in their ability to utilize the resources of a wide-issue processor. Simultaneous multithreading has the potential to achieve 4 times the throughput of a superscalar, and double that of fine-grain multi-threading. We evaluate several cache configurations made possible by this type of organization and evaluate tradeoffs between them. We also show that simultaneous multithreading is an attractive alternative to single-chip multiprocessors; simultaneous multithreaded processors with a variety of organizations outperform corresponding conventional multiprocessors with similar execution resources. While simultaneous multithreading has excellent potential to increase processor utilization, it can add substantial complexity to the design. We examine many of these complexities and evaluate alternative organizations in the design space.
Article
This paper examines simultaneous multithreading, a technique permitting several independent threads to issue instructions to a superscalar's multiple functional units in a single cycle. We present several models of simultaneous multithreading and compare them with alternative organizations: a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue multiprocessing architectures. Our results show that both (single-threaded) superscalar and fine-grain multithreaded architectures are limited in their ability to utilize the resources of a wide-issue processor. Simultaneous multithreading has the potential to achieve 4 times the throughput of a superscalar, and double that of fine-grain multithreading. We evaluate several cache configurations made possible by this type of organization and evaluate tradeoffs between them. We also show that simultaneous multithreading is an attractive alternative to single-chip multiprocessors; simultaneous multithreaded processors with a variety of organizations outperform corresponding conventional multiprocessors with similar execution resources.
Article
Current work in Simultaneous Multithreading provides little benefit to programs that aren't partitioned into threads. We propose Simultaneous Subordinate Microthreading (SSMT) to correct this by spawning subordinate threads that perform optimizations on behalf of the single primary thread. These threads, written in microcode, are issued and executed concurrently with the primary thread. They directly manipulate the microarchitecture to improve the primary thread's branch prediction accuracy, cache hit rate, and prefetch effectiveness. All contribute to the performance of the primary thread. This paper introduces SSMT and discusses its potential to increase performance. We illustrate its usefulness with an SSMT machine that executes subordinate microthreads to improve the branch prediction of the primary thread. We show simulation results for the SPECint95 benchmarks.
Article
Conventional dataflow analysis computes information about what facts may or will not hold during the execution of a program. Sometimes it is useful, for program optimization, to know how often or with what probability a fact holds true during program execution. In this paper, we provide a precise formulation of this problem for a large class of dataflow problems --- the class of finite bi-distributive subset problems. We show how it can be reduced to a generalization of the standard dataflow analysis problem, one that requires a sum-over-all-paths quantity instead of the usual meet-over-all-paths quantity. We show that Kildall's result expressing the meet-over-all-paths value as a maximal fixed point carries over to the generalized setting. We then outline ways to adapt the standard dataflow analysis algorithms to solve this generalized problem, both in the intraprocedural and the interprocedural case.
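On an acyclic flow graph the sum-over-all-paths quantity has a simple reading: the expected number of times a point is reached is the sum of the probabilities of all paths reaching it, which forward propagation in topological order computes directly. The tiny graph below is illustrative only; the paper's formulation is more general.

```python
# Expected node execution counts on a toy acyclic CFG with branch probabilities.
edges = {                      # node -> [(successor, branch probability)]
    'entry': [('a', 0.7), ('b', 0.3)],
    'a': [('join', 1.0)],
    'b': [('join', 1.0)],
    'join': [],
}
freq = {n: 0.0 for n in edges}
freq['entry'] = 1.0
for node in ['entry', 'a', 'b', 'join']:        # topological order
    for succ, p in edges[node]:
        freq[succ] += freq[node] * p
print(freq)    # {'entry': 1.0, 'a': 0.7, 'b': 0.3, 'join': 1.0}
```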
Article
This paper explores Speculative Precomputation, a technique that uses idle thread contexts in a multithreaded architecture to improve performance of single-threaded applications. It attacks program stalls from data cache misses by pre-computing future memory accesses in available thread contexts, and prefetching these data. This technique is evaluated by simulating the performance of a research processor based on the Itanium™ ISA supporting Simultaneous Multithreading. Two primary forms of Speculative Precomputation are evaluated. If only the non-speculative thread spawns speculative threads, performance gains of up to 30% are achieved when assuming ideal hardware. However, this speedup drops considerably with more realistic hardware assumptions. Permitting speculative threads to directly spawn additional speculative threads reduces the overhead associated with spawning threads and enables significantly more aggressive speculation, overcoming this limitation. Even with realistic costs for spawning threads, speedups as high as 169% are achieved, with an average speedup of 76%.
Article
We introduce a new execution paradigm called assisted execution. In this model, a set of auxiliary "assistant" threads, called nanothreads, is attached to each thread of an application. Nanothreads are very lightweight threads which run on the same processor as the main (application) thread and help execute the main thread as fast as possible. Nanothreads exploit resources that are idled in the processor because of dependencies and memory access delays. Assisted execution has the potential to alter the current trade-offs between static and dynamic execution mechanisms. Nanothreads can monitor and reconfigure the underlying hardware, can emulate hardware and can profile applications with little or no interference to improve the program on-line or off-line. We demonstrate the power of assisted execution with an important application, namely data prefetching to fight the memory wall problem. Simulation results on several SPEC95 benchmarks show that sequential and stride prefet...
G. Hinton and J. Shen. Intel's multi-threading technology. Microprocessor Forum, Oct. 2001.