Conference Paper · PDF Available

Speculative Prefetching of Induction Pointers

Abstract

We present an automatic approach for prefetching data for linked list data structures. The main idea is based on the observation that linked list elements are frequently allocated at constant distance from one another in the heap. When linked lists are traversed, a regular pattern of memory accesses with constant stride emerges. This regularity in the memory footprint of linked lists enables the development of a prefetching framework where the address of the element accessed in one of the future iterations of the loop is dynamically predicted based on its previous regular behavior. We automatically identify pointer-chasing recurrences in loops that access linked lists. This identification uses a surprisingly simple method that looks for induction pointers — pointers that are updated in each loop iteration by a load with a constant offset. We integrate induction pointer prefetching with loop scheduling. A key intuition incorporated in our framework is to insert prefetches only if there are processor resources and memory bandwidth available. In order to estimate available memory bandwidth we calculate the number of potential cache misses in one loop iteration. Our estimation algorithm is based on an application of graph coloring on a memory access interference graph derived from the control flow graph. We implemented the prefetching framework in an industry-strength production compiler, and performed experiments on ten benchmark programs with linked lists. We observed performance improvements between 15% and 35% in three of them.
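The core observation above — that back-to-back heap allocation gives the induction pointer a near-constant stride, so a future iteration's address can be predicted from past behavior — can be illustrated with a small simulation. This is a Python sketch of the prediction arithmetic only (node size, addresses, and lookahead are invented for illustration), not the compiler's implementation:

```python
# Illustrative simulation: when linked-list nodes are heap-allocated back to
# back, the induction pointer p = p->next advances by a near-constant stride,
# so the address touched k iterations ahead can be predicted as
# current address + k * stride.

def predict_prefetch_addr(prev_addr, curr_addr, lookahead):
    """Predict the address `lookahead` iterations ahead from the last stride."""
    stride = curr_addr - prev_addr
    return curr_addr + lookahead * stride

# A list whose 24-byte nodes were allocated consecutively:
addrs = [0x1000 + 24 * i for i in range(10)]

# While traversing, predict 4 nodes ahead and check each prediction.
hits = sum(
    predict_prefetch_addr(addrs[i - 1], addrs[i], 4) == addrs[i + 4]
    for i in range(1, len(addrs) - 4)
)
print(hits)  # 5 — every prediction is correct for a constant-stride layout
```

In the paper's setting the predicted address feeds a non-faulting prefetch instruction, so a misprediction wastes bandwidth but cannot crash the program.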
... Namely, the difference between two successive data addresses changes only infrequently at runtime. Stoutchinin et al. [23] and Collins et al. [4] observe that several important loads in 181.mcf of the CPU2000 integer benchmark suite have near-constant strides. Our experience indicates that many irregular programs, in addition to 181.mcf, contain loads with near-constant strides. ...
... The most relevant compile-time stride prefetching method is proposed by Stoutchinin et al. [23]. It uses compiler analysis to detect induction pointers and inserts instructions into user programs to compute strides and perform stride prefetching for the induction pointers. ...
... This experiment compares VPGSP with a state-of-the-art static prefetching technique implemented in a production compiler for IPF. The production compiler includes implementations of a number of prefetching techniques for array and irregular code [23] [20]. Specifically, it includes the technique proposed by Stoutchinin et al. [23] to detect induction pointers for stride prefetching. ...
Conference Paper
Full-text available
Memory operations in irregular code are difficult to prefetch, as the future address of a memory location is hard to anticipate by a compiler. However, recent studies as well as our experience indicate that many irregular programs contain loads with near-constant strides. This paper presents a novel compiler technique to profile and prefetch for those loads. The profile captures not only the dominant stride values for each profiled load, but also the differences between the successive strides of the load. The profile information helps the compiler to classify load instructions into strongly or weakly strided and single-strided or phased multi-strided. The prefetching decisions guided by the load classifications are highly selective and beneficial. We obtain significant performance improvement for the CPU2000 integer programs running on Itanium™ machines. For example, we achieve a 1.55x speedup for "181.mcf", 1.15x for "254.gap", 1.08x for "197.parser", and smaller gains in other benchmarks. We also show that the performance gain is stable across profile data sets and that the profiling overhead is low. These benefits make the new technique suitable for a production compiler.
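The classification the abstract describes (strongly vs. weakly strided, single- vs. multi-strided) can be sketched as a small profile post-pass. The thresholds and category names below are my own illustrative choices, not the paper's actual parameters:

```python
# Sketch of profile-guided load classification: a load is "strongly
# single-strided" if one stride dominates its sampled stride profile, and
# "phased multi-strided" if two strides together cover most samples.
# Thresholds (0.7, 0.9) are invented for illustration.

from collections import Counter

def classify_load(strides, strong=0.7, multi=0.9):
    counts = Counter(strides)
    top = counts.most_common(2)
    top1 = top[0][1] / len(strides)
    if top1 >= strong:
        return "strongly single-strided"
    top2 = sum(c for _, c in top) / len(strides)
    if top2 >= multi:
        return "phased multi-strided"
    return "weakly strided"

# A 181.mcf-style load: almost always advances by one 0x40-byte record.
print(classify_load([0x40] * 95 + [0x90] * 5))             # strongly single-strided
print(classify_load([0x40] * 50 + [0x80] * 45 + [7] * 5))  # phased multi-strided
```

Only loads in the first two classes would receive prefetches, which is what makes the insertion "highly selective".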
... Deciding which addresses to prefetch can be done in many different ways, such as greedily prefetching all pointer fields of an object when it is encountered [10,45], imposing particular layouts on the RDS which allow address arithmetic to be used [45], adding explicit prefetch pointer field(s) to objects [10,40,45,59], and software or hardware runtime methods to detect patterns in addresses and perform prefetching (including the linear address prefetch units in modern desktop processors) [3,16,36,47,53,58,67,75,76]. In all these schemes, the emphasis is on prefetching what is needed and no more, because unnecessary prefetches will significantly increase the required memory bandwidth, evict useful data from the cache, incur an instruction overhead, and increase node size unnecessarily when prefetch field(s) are in use. ...
... ., where b is the stride) and prefetches ahead of the current access. Stoutchinin et al. [67] modified a compiler to identify pointer-chasing loops and conservatively determine whether there is available bandwidth to perform prefetching. Wu et al. and others [47,75,76] use profiling to guide the insertion of code into suitable loops, which allows less conservatism, and report that their scheme outperforms the Itanium's hardware prefetch unit by about a third. ...
... Stoutchinin et al. (2001) presented speculative stride prefetching for linked lists by computing the stride value at run time. ...
... Recent studies (Wu 2002; Stoutchinin et al. 2001) show that there are irregular references whose stride between the addresses accessed in two consecutive iterations appears as a constant value, such as p = p->next when the list is allocated in a sequential memory space. This implicit stride information is not guaranteed at run time; it is only probabilistic. ...
Article
Full-text available
Software data prefetching is a key technique for hiding memory latencies on modern, high-performance processors. Stride memory references are prime candidates for software prefetches on architectures with, and without, support for hardware prefetching. Compilers typically implement software prefetching in the context of the loop nest optimizer (LNO), which focuses on affine references in well-formed loops but misses out on opportunities in C++ STL-style codes. In this paper, we describe a new inductive data prefetching algorithm implemented in the global optimizer. It bases the prefetching decisions on demand-driven speculative recognition of inductive expressions, which is equivalent to strongly connected component detection in the data-flow graph, thus eliminating the need to invoke the loop nest optimizer. This technique allows accurate computation of stride values and exploits phase ordering. We present an efficient implementation after SSAPRE optimization, which fur…
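The abstract's equivalence between inductive-expression recognition and strongly connected component detection can be made concrete: a value that feeds its own definition (as in p = p->next) forms a cycle in the data-flow graph. Below is a minimal Tarjan SCC sketch applied to a toy def-use graph; the graph shape and node names are my own illustrative assumptions:

```python
# Tarjan's strongly-connected-components algorithm on a tiny def-use graph.
# A non-trivial SCC corresponds to an inductive expression: the pointer's
# new value is produced by a load whose address depends on the old value.

def tarjan_sccs(graph):
    index, low, on_stack, stack = {}, {}, set(), []
    sccs = []

    def strongconnect(v):
        index[v] = low[v] = len(index)
        stack.append(v); on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            scc = []
            while True:
                w = stack.pop(); on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs

# p feeds itself through a load (p -> load(p+offset) -> p): an inductive cycle.
dfg = {"p": ["load_p_next"], "load_p_next": ["p"], "sum": []}
cycles = [s for s in tarjan_sccs(dfg) if len(s) > 1]
print(cycles)  # the {p, load_p_next} component marks the induction pointer
```

Running this in the global optimizer rather than the LNO is what lets the technique reach non-loop-nest code such as STL iterator loops.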
... Given an object o, we know the address of objects that o references, and we cannot prefetch other objects without following pointer chains. Recent pointer prefetching work considers C programs only [66,70,90,56,101]. Object-oriented Java programs pose additional analysis challenges because they mostly allocate data dynamically, contain frequent method invocations, and often implement loops with recursion. ...
... based upon the idea of induction pointers [101]. They identify linked-structure traversals in a loop through pointer load instructions that are updated by a constant offset in each iteration. ...
Article
EFFECTIVE COMPILE-TIME ANALYSIS FOR DATA … September 2002. Brendon D. Cahoon. B.A., Clark University; M.S., University of Massachusetts, Amherst; Ph.D., University of Massachusetts Amherst. Directed by: Professor Kathryn S. McKinley. The memory hierarchy in modern architectures continues to be a major performance bottleneck. Many existing techniques for improving memory performance focus on Fortran and C programs, but memory latency is also a barrier to achieving high performance in...
... Previous approaches that use static analysis to detect access to persistent data have targeted specific types of data structures such as linked data structures [1,4,8], recursive data structures [11,13] or matrices [12]. To the best of our knowledge, our work is the first that predicts access to persistent objects of any type prior to application execution. ...
Conference Paper
Full-text available
In this paper, we present a fully-automatic, high-accuracy approach to predict access to persistent objects through static code analysis of object-oriented applications. The most widely-used previous technique uses a simple heuristic to make the predictions while approaches that offer higher accuracy are based on monitoring application execution. These approaches add a non-negligible overhead to the application’s execution time and/or consume a considerable amount of memory. By contrast, we demonstrate in our experimental study that our proposed approach offers better accuracy than the most common technique used to predict access to persistent objects, and makes the predictions farther in advance, without performing any analysis during application execution.
... Some of them have been designed to be completely implemented in hardware [24,2,33,17,34,26,14,9], having the advantages of being performed at run-time, being transparent to the programmer, not introducing explicit execution overhead on the program by prefetch instruction addition, and requiring no code transformations. Others are software solutions [4,16,6,29,20,31,3], allowing a larger analysis scope, adding no complexity to the processor and allowing to implement sophisticated optimization strategies. Finally, a third class of methods are hybrid solutions [25,1,30,32], combining both hardware and software approaches in order to reduce hardware requirements and adjust to dynamic behavior. ...
Article
Full-text available
The locations in the heap where the nodes of a linked data structure will be allocated during the execution of a dynamic code are unknown at compile time. This lack of information limits our ability to exploit reference locality when such data structures are created, traversed, or modified. For pointer-based codes, where the heap memory locations that will be referenced in the near future are unknown, the performance efficiency of the prefetcher greatly depends on the accuracy of the memory-location prediction techniques. For this reason, many predictors in the context of irregular and pointer-based applications have been proposed in the literature. Unfortunately, each of these predictors performs well only for certain data access patterns and poorly for others. In this work we propose a model to parameterize a wide selection of prediction methods in the context of pointer-based codes. With this model we intend to establish a baseline to compare the performance of these predictors and determine in which conditions (code and data properties) each predictor works best. In the context of this model, we present an experimental evaluation of the behavior of several predictors on a representative set of pointer-intensive codes extracted from the Olden suite. These results show the usefulness of our model as a tool to conduct unbiased comparisons of such predictors.
Conference Paper
Irregular data references are difficult to prefetch, as the future memory address of a load instruction is hard to anticipate by a compiler. However, recent studies as well as our experience indicate that some important load instructions in irregular programs contain stride access patterns. Although the load instructions with stride patterns are difficult to identify with static compiler techniques, we developed an efficient profiling method to discover these load instructions. The new profiling method integrates the profiling for stride information and the traditional profiling for edge frequency into a single profiling pass. The integrated profiling pass runs only 17% slower than the frequency profiling alone. The collected stride information helps the compiler to identify load instructions with stride patterns that can be prefetched efficiently and beneficially. We implemented the new profiling and prefetching techniques in a research compiler for Itanium Processor Family (IPF), and obtained significant performance improvement for the SPECINT2000 programs running on Itanium machines. For example, we achieved a 1.59x speedup for 181.mcf, 1.14x for 254.gap, and 1.08x for 197.parser. We also showed that the performance gain is stable across input data sets. These benefits make the new profiling and prefetching techniques suitable for production compilers.
Article
Full-text available
This article presents Forma, a practical, safe, and automatic data reshaping framework that reorganizes arrays to improve data locality. Forma splits large aggregated data-types into smaller ones to improve data locality. Arrays of these large data types are then replaced by multiple arrays of the smaller types. These new arrays form natural data streams that have smaller memory footprints, better locality, and are more suitable for hardware stream prefetching. Forma consists of a field-sensitive alias analyzer, a data type checker, a portable structure reshaping planner, and an array reshaper. An extensive experimental study compares different data reshaping strategies in two dimensions: (1) how the data structure is split into smaller ones (maximal partition × frequency-based partition × affinity-based partition); and (2) how partitioned arrays are linked to preserve program semantics (address arithmetic-based reshaping × pointer-based reshaping). This study exposes important characteristics of array reshaping. First, a practical data reshaper needs not only an inter-procedural analysis but also a data-type checker to make sure that array reshaping is safe. Second, the performance improvement due to array reshaping can be dramatic: standard benchmarks can run up to 2.1 times faster after array reshaping. Array reshaping may also result in some performance degradation for certain benchmarks. An extensive micro-architecture-level performance study identifies the causes for this degradation. Third, the seemingly naive maximal partition achieves best or close-to-best performance in the benchmarks studied. This article presents an analysis that explains this surprising result. Finally, address-arithmetic-based reshaping always performs better than its pointer-based counterpart.
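The structure-splitting idea behind Forma — replacing an array of wide records with one array per field — is easy to illustrate. The field names and the "maximal partition" example below are my own invention, not Forma's code; the point is that a loop touching only one field now streams over a dense array instead of striding across whole records:

```python
# Toy illustration of maximal-partition structure splitting (AoS -> SoA):
# each wide record is split field-by-field, so a hot loop over `weight`
# has a small, contiguous memory footprint suited to stream prefetching.

records = [{"id": i, "weight": i * 2, "payload": b"\x00" * 48} for i in range(100)]

# Maximal partition: one array per field.
ids      = [r["id"] for r in records]
weights  = [r["weight"] for r in records]
payloads = [r["payload"] for r in records]

# The hot loop now reads a contiguous stream and skips the 48-byte payloads.
total = sum(weights)
print(total)  # 9900
```

Forma's contribution is doing this transformation safely in C, which is why it needs the field-sensitive alias analyzer and data-type checker mentioned above.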
Conference Paper
The memory wall problem is one of the important issues in modern computer system, and it affects the system performance in spite of the powerful processor. The emergence of multi-core processors has further exacerbated the problem. On the other hand, the increasing use of the linked data structure in applications aggravates the memory access latency. This paper utilizes multi-threading technology based on CMP, and dispatches a helper thread when the program is running which prefetches the demanded data into the shared cache in advance to hide the long memory access latency. The helper thread shows great performance by controlling the distance between helper thread and main thread. Simple analysis of the effect of the computation workload between the accesses of the pointers to prefetching is also provided.
Article
Full-text available
Modulo scheduling is a framework within which algorithms for software pipelining innermost loops may be defined. The framework specifies a set of constraints that must be met in order to achieve a legal modulo schedule. A wide variety of algorithms and heuristics can be defined within this framework. Little work has been done to evaluate and compare alternative algorithms and heuristics for modulo scheduling from the viewpoints of schedule quality as well as computational complexity. This, along with a vague and unfounded perception that modulo scheduling is computationally expensive as well as difficult to implement, has inhibited its incorporation into product compilers. This paper presents iterative modulo scheduling, a practical algorithm that is capable of dealing with realistic machine models. The paper also characterizes the algorithm in terms of the quality of the generated schedules as well as the computational expense incurred.
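The constraints the framework imposes boil down to two standard lower bounds on the initiation interval (II): a resource bound (ResMII) and a recurrence bound (RecMII); iterative modulo scheduling starts at their maximum and increases II until a legal schedule is found. The formulas below are the standard textbook bounds, with invented example numbers:

```python
# Minimum initiation interval (MII) = max(ResMII, RecMII).
# ResMII: each resource r used uses[r] times per iteration, with units[r]
# copies available, forces II >= ceil(uses[r] / units[r]).
# RecMII: each dependence cycle with total latency L carried over D
# iterations forces II >= ceil(L / D).

from math import ceil

def res_mii(uses, units):
    """uses[r] = ops per iteration needing resource r; units[r] = copies of r."""
    return max(ceil(uses[r] / units[r]) for r in uses)

def rec_mii(cycles):
    """cycles = [(total_latency, total_iteration_distance), ...]."""
    return max(ceil(lat / dist) for lat, dist in cycles)

uses  = {"mem": 3, "alu": 4}   # invented per-iteration resource usage
units = {"mem": 1, "alu": 2}   # invented machine model
# One recurrence: a 4-cycle chain carried over one iteration (p = p->next).
mii = max(res_mii(uses, units), rec_mii([(4, 1)]))
print(mii)  # 4 — the pointer-chasing recurrence dominates here
```

This is also why the induction-pointer paper integrates prefetching with loop scheduling: a prefetch is only free when it fits under the II already dictated by these bounds.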
Article
Full-text available
This paper is a scientific comparison of two code generation techniques with identical goals: generation of the best possible software-pipelined code for computers with instruction-level parallelism. Both are variants of modulo scheduling, a framework for generation of software pipelines pioneered by Rau and Glaeser [RaG81], but are otherwise quite dissimilar. One technique was developed at Silicon Graphics and is used in the MIPSpro compiler. This is the production compiler for SGI's systems, which are based on the MIPS R8000 processor [Hsu94]. It is essentially a branch-and-bound enumeration of possible schedules with extensive pruning. This method is heuristic because of the way it prunes and also because of the interaction between register allocation and scheduling. The second technique aims to produce optimal results by formulating the scheduling and register allocation problem as an integrated integer linear programming (ILP) problem. This idea has received much recent exposure in the literature [AlGoGa95, Feautrier94, GoAlGa94a, GoAlGa94b, Eichenberger95], but to our knowledge all previous implementations have been too preliminary for detailed measurement and evaluation. In particular, we believe this to be the first published measurement of runtime performance for ILP-based generation of software pipelines. A particularly valuable result of this study was evaluation of the heuristic pipelining technology in the SGI compiler. One of the motivations behind the McGill research was the hope that optimal software pipelining, while not in itself practical for use in production compilers, would be useful for their evaluation and validation. Our comparison has indeed provided a quantitative validation of the SGI compiler's pipeliner, leading us to increased confidence in both techniques.
Article
We introduce a dynamic scheme that captures the access patterns of linked data structures and can be used to predict future accesses with high accuracy. Our technique exploits the dependence relationships that exist between loads that produce addresses and loads that consume these addresses. By identifying producer-consumer pairs, we construct a compact internal representation for the associated structure and its traversal. To achieve a prefetching effect, a small prefetch engine speculatively traverses this representation ahead of the executing program. Dependence-based prefetching achieves speedups of up to 25% on a suite of pointer-intensive programs.
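The producer-consumer pairing at the heart of this scheme can be sketched as a trace analysis: a load is a producer for another load when the value it returns later appears (plus a small field offset) as that load's address. The trace format and IDs below are my own illustrative assumptions, not the hardware mechanism:

```python
# Toy producer-consumer pairing over a load trace.
# Each entry: (load_id, address_used, value_loaded).
trace = [
    ("ld1", 0x1000, 0x2000),   # loads the next-node pointer
    ("ld2", 0x2008, 0xdead),   # consumes 0x2000 (+8 field offset)
    ("ld1", 0x2000, 0x3000),
    ("ld2", 0x3008, 0xbeef),
]

def find_pairs(trace, max_offset=64):
    """Report (producer, consumer) pairs where a loaded value feeds an address."""
    pairs = set()
    for i, (prod, _, value) in enumerate(trace):
        for cons, addr, _ in trace[i + 1:]:
            if 0 <= addr - value <= max_offset:
                pairs.add((prod, cons))
    return pairs

# ('ld1', 'ld1') is the pointer-chasing self-recurrence; ('ld1', 'ld2') is a
# field access off the chased pointer.
print(sorted(find_pairs(trace)))  # [('ld1', 'ld1'), ('ld1', 'ld2')]
```

The self-pair is exactly the induction-pointer recurrence that the compile-time approach detects statically; this scheme finds it dynamically and lets a prefetch engine run the chain ahead of the program.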
Conference Paper
Previous research on hiding memory latencies has tended to focus on regular numerical programs. This paper presents a latency-hiding compiler technique that is applicable to general-purpose C programs. By assuming a lock-up free cache and instruction scoreboarding, our technique 'preloads' the data that are likely to cause a cache miss before they are used, thereby hiding the cache-miss latency. We have developed simple compiler heuristics to identify load instructions that are likely to cause a cache miss. Experimentation with a set of SPEC92 benchmarks shows that our heuristics are successful in identifying 85% of cache misses. We have also developed an algorithm that flexibly schedules the selected load instructions and the instructions that use the loaded data to hide memory latency. Our simulation suggests that our technique is successful in hiding memory latency and improves the overall performance.
Article
An algorithm to enumerate all the elementary circuits of a directed graph is presented. The algorithm uses backtracking with lookahead to avoid unnecessary work, and it has a time bound of $O((V+E)(C+1))$ when applied to a graph with $V$ vertices, $E$ edges, and $C$ elementary circuits. Keywords: algorithm, circuit, cycle, graph
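For intuition, a straightforward enumeration of elementary circuits can be written as a DFS from each root vertex, finding each circuit exactly once at its smallest vertex. This sketch is for clarity only — it lacks the blocking/lookahead machinery that gives Johnson's algorithm the $O((V+E)(C+1))$ bound stated above:

```python
# Brute-force elementary-circuit enumeration: each circuit is reported once,
# rooted at its smallest vertex; vertices below the root are excluded so
# rotations and duplicates never appear.

def elementary_circuits(graph):
    circuits = []
    for root in sorted(graph):
        def dfs(v, path):
            for w in graph.get(v, []):
                if w == root:
                    circuits.append(path[:])          # closed a circuit
                elif w > root and w not in path:      # stay above the root
                    dfs(w, path + [w])
        dfs(root, [root])
    return circuits

g = {1: [2], 2: [3, 1], 3: [1]}
print(elementary_circuits(g))  # [[1, 2, 3], [1, 2]]
```

In the prefetching context, circuit enumeration like this is how pointer-chasing recurrences are located in the dependence graph of a loop body.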
Article
To narrow the widening gap between processor and memory performance, the authors propose improving the cache locality of pointer-manipulating programs and bolstering performance by careful placement of structure elements. It is concluded that considering past trends and future technology, it seems clear that the processor-memory performance gap will continue to increase and software will continue to grow larger and more complex. Although cache-conscious algorithms and data structures are the first and perhaps best place to attack this performance problem, the complexity of software design and an increasing tendency to build large software systems by assembling smaller components does not favor a focused, integrated approach. We propose another, more incremental approach of cache-conscious data layout, which uses techniques such as clustering, coloring, and compression to enhance data locality by placing structure elements more carefully in the cache.
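A back-of-the-envelope measure shows why the clustering technique above pays off: counting the distinct 64-byte cache lines a traversal touches when small nodes are packed contiguously versus spread across the heap. The node size, line size, and addresses are invented for illustration:

```python
# Compare cache-line footprint of clustered vs. scattered node placement.
LINE = 64  # assumed cache-line size in bytes

def lines_touched(addrs, size=16):
    """Distinct cache lines covered by `size`-byte nodes at these addresses."""
    return len({a // LINE for a in addrs} | {(a + size - 1) // LINE for a in addrs})

clustered = [0x1000 + 16 * i for i in range(64)]    # pool-allocated nodes
scattered = [0x1000 + 4096 * i for i in range(64)]  # one node per page

print(lines_touched(clustered))  # 16 lines for 64 nodes (4 nodes per line)
print(lines_touched(scattered))  # 64 lines: every node brings in its own line
```

Clustering turns three out of every four node visits into hits on an already-fetched line, which is the locality gain the authors obtain by careful placement of structure elements.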
Article
The Cydra 5 is a VLIW minisupercomputer with hardware designed to accelerate a broad class of inner loops, presenting unique challenges to its compilers. We discuss the organization of its Fortran/77 compiler and several of the key approaches developed to fully exploit the hardware. These include the intermediate representation used; the preparation, overlapped scheduling, and register allocation of inner loops; the speculative execution model used to control global code motion; and the machine model and local instruction scheduling approach.
Conference Paper
Global Instruction Schedulers can be classified as either structure or profile driven. Structure driven approaches attempt to find instruction level parallelism by redistributing instructions along all possible execution paths. When resources are limited, poor choices may penalize the frequently executed paths. By contrast, profile driven approaches use feedback information to identify frequently executed (hot) regions, and attempt to improve their performance. This may be at the expense of less frequently executed (cold) regions, for instance by inserting fixup code. The overall performance improves if the frequency information is accurate and there is a dominant trace in the program. If either of these conditions does not hold, performance may degrade. We present a novel algorithm that attempts to combine the individual merits of the above two approaches while avoiding some of their drawbacks. We have also incorporated several techniques which improve the global scheduling performance on out-of-order (OOO) processors. Our algorithm is integrated with a parametric resource model and can be applied both before and after register allocation. It has been implemented in the SGI MIPSpro compiler, and the results have been evaluated on the MIPS R8000 and R10000 processors
Conference Paper
Software prefetching, typically in the context of numeric- or loop-intensive benchmarks, has been proposed as one remedy for the performance bottleneck imposed on computer systems by the cost of servicing cache misses. This paper proposes a new heuristic, SPAID, for utilizing prefetch instructions in pointer- and call-intensive environments. We use trace-driven cache simulation of a number of pointer- and call-intensive benchmarks to evaluate the benefits and implementation trade-offs of SPAID. Our results indicate that a significant proportion of the cost of data cache misses can be eliminated or reduced with SPAID without unduly increasing memory traffic.