Conference Paper

Cache performance of combinator graph reduction

Abstract

The Threaded Interpretive Graph Reduction Engine (TIGRE) was developed for the efficient reduction of combinator graphs in support of functional programming languages and other applications. Results are presented of cache simulations of the TIGRE graph reducer with the following parameters varied: cache size, cache organization, block size, associativity, replacement policy, write policy, and write allocation. As a check on these results, the simulations are compared to measured performance on real hardware. From the results of the simulation study, it is concluded that graph reduction in TIGRE depends very heavily on a write-allocate strategy for good performance, and exhibits very high spatial and temporal locality.
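The write-allocate dependence follows from the allocation pattern of graph reduction: freshly allocated heap cells are written before they are ever read, so a cache that does not allocate a line on a write miss pays a miss on every initializing store and gets no credit on the reads that follow. The sketch below illustrates the effect with a tiny direct-mapped simulator; the geometry, trace, and counts are illustrative, not the paper's.

    /* Sketch: why write-allocate matters for allocate-then-read heaps.
     * Direct-mapped cache, 1024 lines of 16 bytes; all values illustrative. */
    #include <stdio.h>

    #define LINES 1024
    #define LINE_BYTES 16

    static unsigned long tags[LINES];
    static int valid[LINES];
    static long misses;

    /* 'alloc_on_write' selects the write-allocation policy */
    static void access(unsigned long addr, int is_write, int alloc_on_write)
    {
        unsigned long line = (addr / LINE_BYTES) % LINES;
        unsigned long tag  = addr / (LINE_BYTES * LINES);
        if (valid[line] && tags[line] == tag)
            return;                          /* hit */
        misses++;
        if (!is_write || alloc_on_write) {   /* fill on read miss, or on
                                                write miss if allocating */
            valid[line] = 1;
            tags[line] = tag;
        }
    }

    int main(void)
    {
        int policy, i, f;
        for (policy = 0; policy <= 1; policy++) {
            unsigned long heap = 0;
            misses = 0;
            for (i = 0; i < LINES; i++) valid[i] = 0;
            /* Allocate-then-read pattern typical of graph reduction:
               write a fresh two-word cell, read it back shortly after. */
            for (f = 0; f < 100000; f++) {
                access(heap,     1, policy);   /* initialize cell head */
                access(heap + 8, 1, policy);   /* initialize cell tail */
                access(heap,     0, policy);   /* unwind reads it back */
                access(heap + 8, 0, policy);
                heap += 16;
            }
            printf("write-%sallocate: %ld misses\n",
                   policy ? "" : "no-", misses);
        }
        return 0;
    }

Under write-no-allocate the initializing stores miss without filling the line, so each cell costs three misses; under write-allocate the first store fills the line and the rest of the pattern hits.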
Article
Full-text available
We present a new abstract machine for graph reduction called TIGRE. Benchmark results show that TIGRE's execution speed compares quite favorably with previous combinator-graph reduction techniques on similar hardware. Furthermore, the mapping of TIGRE onto conventional hardware is simple and efficient. Mainframe implementations of TIGRE provide performance levels exceeding those previously available on custom graph reduction hardware.
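For readers unfamiliar with the technique being accelerated, the sketch below is a conventional pointer-walking combinator graph reducer over binary application cells. It is not TIGRE's threaded encoding; the node layout, fixed spine-stack depth, and combinator set (S, K, I) are illustrative.

    /* Sketch of combinator graph reduction over binary application cells.
     * Plain pointer-walking dispatch, not TIGRE's threaded code. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Node Node;
    struct Node { enum { APP, S, K, I, NUM } tag; Node *l, *r; long n; };

    static Node *mk(int tag, Node *l, Node *r) {
        Node *p = malloc(sizeof *p);
        p->tag = tag; p->l = l; p->r = r; p->n = 0;
        return p;
    }
    static Node *num(long v) { Node *p = mk(NUM, 0, 0); p->n = v; return p; }

    /* Reduce to weak head normal form: unwind the spine onto a stack
     * (no overflow check in this sketch) and rewrite redexes in place,
     * so shared subgraphs see the update. */
    static Node *whnf(Node *g) {
        Node *stack[256]; int sp = 0;
        for (;;) {
            while (g->tag == APP) { stack[sp++] = g; g = g->l; }
            if (g->tag == I && sp >= 1) {
                g = stack[--sp]->r;        /* I x => x (no update, brevity) */
            } else if (g->tag == K && sp >= 2) {
                Node *x = stack[sp-1]->r, *top = stack[sp-2];
                sp -= 2;
                top->l = mk(I, 0, 0);      /* overwrite root: K x y => I x */
                top->r = x;
                g = x;
            } else if (g->tag == S && sp >= 3) {
                Node *x = stack[sp-1]->r, *y = stack[sp-2]->r,
                     *z = stack[sp-3]->r, *top = stack[sp-3];
                sp -= 3;
                top->l = mk(APP, x, z);    /* S x y z => (x z)(y z),
                                              z shared between both sides */
                top->r = mk(APP, y, z);
                g = top;
            } else {
                return g;                  /* value or under-applied head */
            }
        }
    }

    int main(void) {
        /* S K K 42 reduces to 42 (S K K behaves as the identity). */
        Node *g = mk(APP,
                     mk(APP, mk(APP, mk(S,0,0), mk(K,0,0)), mk(K,0,0)),
                     num(42));
        printf("%ld\n", whnf(g)->n);
        return 0;
    }

TIGRE's contribution lies in how the equivalent of this dispatch loop is mapped onto threaded code; the graph rewrites themselves are the standard ones shown here.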
Conference Paper
Cache performance is critical in high-speed computing systems. However, heap-intensive programs such as LISP codes typically have low cache performance because of inherently poor data locality. To improve cache performance, the system should reuse heap cells while they are in cache, thus reducing the number of cache misses due to heap references. Furthermore, the system can adopt a multi-threaded architecture to hide the cache miss overhead by switching to a different control thread when a cache miss occurs. In this paper, a novel architectural scheme called cache-level garbage collection based on multi-threading is developed to improve the cache performance of heap-intensive programs. Consequently, the cache hit ratio is improved and the cache miss overhead is masked, thereby minimizing the total cache miss penalty. We present the garbage collection algorithm and its architectural support features, together with an initial performance evaluation.
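The arithmetic behind miss masking is simple. Assuming a 1-cycle hit, a 20-cycle miss, and enough ready threads to cover every miss (all numbers illustrative, not the paper's), the bound on what switch-on-miss multithreading can recover is:

    /* Toy cost model for switch-on-miss multithreading; the hit/miss
     * costs and miss ratio are illustrative assumptions. */
    #include <stdio.h>

    #define MISS_CYCLES 20

    int main(void) {
        double miss_ratio = 0.10;    /* assumed per-reference miss ratio */
        double refs = 1e6;
        /* Single thread stalls for the full miss latency. */
        double stalled = refs * (1.0 + miss_ratio * (MISS_CYCLES - 1));
        /* Ideal case: every miss cycle overlapped with another thread's
           useful work, so the processor never idles. */
        double overlapped = refs;
        printf("single-thread cycles:     %.0f\n", stalled);
        printf("ideal multithread cycles: %.0f (%.1fx)\n",
               overlapped, stalled / overlapped);
        return 0;
    }

This upper bound is only reachable when enough independent threads are ready; cache-level garbage collection then works on the other side of the problem by raising the hit ratio itself.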
Conference Paper
Several researchers have attempted to improve locality in garbage-collected heaps by changing the traversal algorithm used by a copying garbage collector. Unfortunately, these studies met with small success. We hypothesized that the disappointing results of these previous studies were due to two flaws in the traversal algorithms tested. They failed to group data structures in a manner reflecting their hierarchical organization, and more importantly, they ignored the disastrous grouping effects caused by reaching data structures from a linear traversal of hash tables (i.e., in pseudo-random order). To test this hypothesis, we modified the garbage collector of a Lisp system (specifically, the Scheme-48 system) to avoid both problems in reorganizing the system heap image. We implemented our “hierarchical decomposition” algorithm (a cousin of Moon’s “approximately depth-first” algorithm) that is quite efficient on stock hardware. We also changed the collector to traverse global variable bindings in the order of their creation rather than in the memory order imposed by hash tables. The effects of these changes confirm our hypothesis. Some improvement comes from the basic traversal algorithm, and a greater effect results from the special treatment of hash tables. Initial page faults are reduced significantly and repeated page faults are reduced tremendously (by roughly an order of magnitude). In addition, improved measures of static locality (such as the percentage of on-page pointers) indicate that heap data can be cheaply and effectively compressed, and this may allow more effective paging and prefetching strategies; we suggest a level of “compressed in-RAM storage”, with price and performance between those of RAM and disk.
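The grouping a copying collector produces is exactly its copy order. Cheney's classic scan, sketched below, copies breadth-first, which places siblings together but separates parents from deep substructure; traversal-order work such as the hierarchical decomposition described here changes the order in which children are copied. Cell layout and arena sizes in the sketch are illustrative.

    /* Sketch: Cheney two-space copying of binary cells. The copy order IS
     * the resulting heap layout, which is why changing the traversal
     * changes locality. */
    #include <stdio.h>

    #define CELLS 1024
    typedef struct Cell { struct Cell *car, *cdr; int forwarded; } Cell;

    static Cell from[CELLS], to[CELLS];
    static int alloc_to;                 /* next free slot in to-space */

    static Cell *copy(Cell *p) {
        if (!p) return NULL;
        if (p->forwarded) return p->car; /* car holds forwarding pointer */
        Cell *q = &to[alloc_to++];
        *q = *p;
        p->forwarded = 1;
        p->car = q;
        return q;
    }

    /* Breadth-first scan: copy the root, then sweep to-space left to
     * right, copying children as they are reached. */
    static void collect(Cell **root) {
        int scan = 0;
        alloc_to = 0;
        *root = copy(*root);
        while (scan < alloc_to) {
            to[scan].car = copy(to[scan].car);
            to[scan].cdr = copy(to[scan].cdr);
            scan++;
        }
    }

    int main(void) {
        /* Build a 3-cell list in from-space, then collect. */
        from[0] = (Cell){ &from[1], NULL, 0 };
        from[1] = (Cell){ &from[2], NULL, 0 };
        from[2] = (Cell){ NULL, NULL, 0 };
        Cell *root = &from[0];
        collect(&root);
        printf("copied %d cells; root now at to[%ld]\n",
               alloc_to, (long)(root - to));
        return 0;
    }

A hierarchical-decomposition or approximately depth-first scheme replaces the single left-to-right sweep with an order that keeps each subtree's cells near each other; the paper's point is that this only pays off once hash tables stop imposing pseudo-random root order.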
Article
Full-text available
Multiparadigm research is a relatively new direction in programming language design. In this paper we discuss several aspects of this research area. We consider some of the ideas that underlie the multiparadigm point of view, we examine some of the motivations ...
Article
Full-text available
Dynamic memory allocation has been a fundamental part of most computer systems since roughly 1960, and memory allocation is widely considered to be either a solved problem or an insoluble one. In this survey, we describe a variety of memory allocator designs and point out issues relevant to their design and evaluation. We then chronologically survey most of the literature on allocators between 1961 and 1995. (Scores of papers are discussed, in varying detail, and over 150 references are given.) We argue that allocator designs have been unduly restricted by an emphasis on mechanism, rather than policy, while the latter is more important; higher-level strategic issues are still more important, but have not been given much attention. Most theoretical analyses and empirical allocator evaluations to date have relied on very strong assumptions of randomness and independence, but real program behavior exhibits important regularities that must be exploited if allocator...
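As a concrete instance of the mechanism/policy distinction, the sketch below is a minimal first-fit free-list allocator over a static arena, one of the classic designs such surveys cover. The header layout, alignment, and omission of coalescing are simplifications.

    /* Sketch: first-fit free-list allocation over a static arena.
     * First fit is the policy; the singly linked header list is the
     * mechanism. No coalescing, for brevity. */
    #include <stddef.h>
    #include <stdio.h>

    typedef struct Block { size_t size; struct Block *next; } Block;

    static unsigned char arena[1 << 16];
    static Block *freelist;

    static void arena_init(void) {
        freelist = (Block *)arena;
        freelist->size = sizeof arena - sizeof(Block);
        freelist->next = NULL;
    }

    static void *ff_alloc(size_t n) {
        Block **prev = &freelist, *b;
        n = (n + 7) & ~(size_t)7;                 /* 8-byte alignment */
        for (b = freelist; b; prev = &b->next, b = b->next) {
            if (b->size < n) continue;            /* first block that fits */
            if (b->size >= n + sizeof(Block) + 8) {   /* split remainder */
                Block *rest = (Block *)((unsigned char *)(b + 1) + n);
                rest->size = b->size - n - sizeof(Block);
                rest->next = b->next;
                b->size = n;
                *prev = rest;
            } else {
                *prev = b->next;                  /* take whole block */
            }
            return b + 1;
        }
        return NULL;                              /* out of memory */
    }

    static void ff_free(void *p) {                /* push back, no coalesce */
        Block *b = (Block *)p - 1;
        b->next = freelist;
        freelist = b;
    }

    int main(void) {
        arena_init();
        void *a = ff_alloc(100), *b = ff_alloc(200);
        ff_free(a);
        void *c = ff_alloc(50);                   /* reuses a's block */
        printf("a=%p b=%p c=%p\n", a, b, c);
        return 0;
    }

The survey's argument is that details like the header list matter far less than the policy question of which block to pick, because real programs' regularities (not random workloads) decide how much fragmentation each policy produces.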
Conference Paper
This paper is a description of the three-instruction machine Tim, an abstract machine for the execution of supercombinators. Tim usually executes programmes faster than the G-machine style of abstract machine while being at least as easy to implement as an S-K combinator reducer. It has a lower overhead for passing unevaluated arguments than the G-machine, resulting in good performance even without strictness analysis, and is probably easier to implement in hardware. The description begins with a presentation of the instruction set of the machine, followed by the operational semantics of the normal order version and the algorithm to convert combinators to instructions. It then develops the machine to allow lazy evaluation and the use of sharing and strictness analysis. The final sections of the paper give some performance figures and comment upon the suitability of the machine for hardware implementation.
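The three instructions are Take, Push, and Enter, each with several addressing modes. The sketch below is a toy evaluator in that style, shown reducing K A B to A; the instruction encoding, the halt pseudo-op used for leaves, and the frame allocation scheme are illustrative, not the paper's.

    /* Sketch of a TIM-style evaluator: Take pops arguments into a frame,
     * Push builds a closure, Enter jumps through a closure. PUSH_LABEL,
     * ENTER_ARG, ENTER_LABEL stand for addressing modes of Push/Enter. */
    #include <stdio.h>

    typedef struct { int op, k; } Ins;
    enum { TAKE, PUSH_LABEL, ENTER_ARG, ENTER_LABEL, HALT };

    typedef struct Closure { const Ins *code; struct Closure *frame; } Closure;

    static const Ins K_code[]    = { {TAKE,2}, {ENTER_ARG,0} }; /* K x y = x */
    static const Ins A_code[]    = { {HALT,'A'} };
    static const Ins B_code[]    = { {HALT,'B'} };
    static const Ins main_code[] = { {PUSH_LABEL,2},    /* push closure B */
                                     {PUSH_LABEL,1},    /* push closure A */
                                     {ENTER_LABEL,3} }; /* enter K */
    static const Ins *labels[]   = { main_code, A_code, B_code, K_code };

    int main(void) {
        Closure stack[64]; int sp = 0;   /* argument stack */
        Closure heap[256]; int hp = 0;   /* frames live here */
        const Ins *pc = main_code;
        Closure *fp = NULL;
        for (;;) {
            Ins i = *pc++;
            switch (i.op) {
            case TAKE:                   /* pop i.k args into a new frame */
                fp = &heap[hp]; hp += i.k;
                for (int j = 0; j < i.k; j++) fp[j] = stack[--sp];
                break;
            case PUSH_LABEL:             /* closure = (code, current frame) */
                stack[sp++] = (Closure){ labels[i.k], fp };
                break;
            case ENTER_ARG: {            /* jump through argument closure */
                Closure c = fp[i.k];
                pc = c.code; fp = c.frame;
                break;
            }
            case ENTER_LABEL:
                pc = labels[i.k];        /* frame unchanged */
                break;
            case HALT:
                printf("result: %c\n", i.k);
                return 0;
            }
        }
    }

The low cost of passing unevaluated arguments is visible here: pushing an argument is just building a two-word closure, with no graph node construction at all.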
Conference Paper
Advances in integrated circuit density are permitting the implementation on a single chip of functions and performance enhancements beyond those of a basic processor. One performance enhancement of proven value is a cache memory; placing a cache on the processor chip can reduce both mean memory access time and bus traffic. In this paper we use trace driven simulation to study design tradeoffs for small (on-chip) caches. Miss ratio and traffic ratio (bus traffic) are the metrics for cache performance. Particular attention is paid to sub-block caches (also known as sector caches), in which address tags are associated with blocks, each of which contains multiple sub-blocks; sub-blocks are the transfer unit. Using traces from two 16-bit architectures (Z8000, PDP-11) and two 32-bit architectures (VAX-11, System/370), we find that general purpose caches of 64 bytes (net size) are marginally useful in some cases, while 1024-byte caches perform fairly well; typical miss and traffic ratios for a 1024 byte (net size) cache, 4-way set associative with 8 byte blocks are: PDP-11: .039, .156, Z8000: .015, .060, VAX 11: .080, .160, Sys/370: .244, .489. (These figures are based on traces of user programs and the performance obtained in practice is likely to be less good.) The use of sub-blocks allows tradeoffs between miss ratio and traffic ratio for a given cache size. Load forward is quite useful. Extensive simulation results are presented.
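A sub-block cache fits in a few lines: one tag per block frame, one valid bit per sub-block, and the sub-block as the transfer unit, so a miss refills only the referenced sub-block rather than the whole block. The geometry below is illustrative, not one of the paper's configurations.

    /* Sketch: sub-block (sector) cache. One tag per block, one valid bit
     * per sub-block; only the referenced sub-block crosses the bus. */
    #include <stdio.h>

    #define BLOCKS    64
    #define SUBS      4           /* sub-blocks per block */
    #define SUB_BYTES 8

    static unsigned long tag[BLOCKS];
    static unsigned char valid[BLOCKS][SUBS];
    static long misses, bus_bytes;

    static void access(unsigned long addr) {
        unsigned long sub = (addr / SUB_BYTES) % SUBS;
        unsigned long blk = (addr / (SUB_BYTES * SUBS)) % BLOCKS;
        unsigned long t   = addr / (SUB_BYTES * SUBS * BLOCKS);
        if (tag[blk] != t) {          /* tag miss: keep the frame, but all
                                         sub-blocks become invalid */
            tag[blk] = t;
            for (int i = 0; i < SUBS; i++) valid[blk][i] = 0;
        }
        if (!valid[blk][sub]) {       /* fetch only this sub-block */
            misses++;
            bus_bytes += SUB_BYTES;
            valid[blk][sub] = 1;
        }
    }

    int main(void) {
        for (unsigned long a = 0; a < 4096; a += 4) access(a);
        printf("misses=%ld bus bytes=%ld\n", misses, bus_bytes);
        return 0;
    }

This is the miss-ratio/traffic-ratio trade the paper studies: a whole-block fill would take fewer misses on sequential code but move more bytes per miss over a narrow on-chip bus.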
Article
Cache memories are used in modern, medium and high-speed CPUs to hold temporarily those portions of the contents of main memory which are (believed to be) currently in use. Since instructions and data in cache memories can usually be referenced in 10 to 25 percent of the time required to access main memory, cache memories permit the execution rate of the machine to be substantially increased. In order to function effectively, cache memories must be carefully designed and implemented. In this paper, we explain the various aspects of cache memories and discuss in some detail the design features and trade-offs. A large number of original, trace-driven simulation results are presented. Consideration is given to practical implementation questions as well as to more abstract design issues.
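Two of the design features such surveys quantify, associativity and replacement policy, fit in a short sketch: a 4-way set-associative lookup with LRU replacement. The geometry and the timestamp encoding of LRU are illustrative.

    /* Sketch: 4-way set-associative cache with LRU replacement. */
    #include <stdio.h>

    #define SETS 64
    #define WAYS 4
    #define LINE 16

    static unsigned long tags[SETS][WAYS], stamp[SETS][WAYS];
    static int valid[SETS][WAYS];
    static unsigned long clock_;
    static long misses;

    static void access(unsigned long addr) {
        unsigned long set = (addr / LINE) % SETS;
        unsigned long tag = addr / (LINE * SETS);
        int w, victim = 0;
        for (w = 0; w < WAYS; w++)
            if (valid[set][w] && tags[set][w] == tag) {
                stamp[set][w] = ++clock_;      /* refresh LRU order */
                return;
            }
        misses++;
        for (w = 1; w < WAYS; w++)             /* pick least recently used */
            if (stamp[set][w] < stamp[set][victim]) victim = w;
        valid[set][victim] = 1;
        tags[set][victim]  = tag;
        stamp[set][victim] = ++clock_;
    }

    int main(void) {
        /* Two interleaved streams whose lines fall in the same sets. */
        for (int i = 0; i < 10000; i++) {
            access((unsigned long)i % 2048);
            access(65536 + (unsigned long)i % 2048);
        }
        printf("misses=%ld\n", misses);
        return 0;
    }

In the driver, the two streams together occupy exactly four lines per set, so with LRU only the cold misses remain; a direct-mapped cache of the same total size would thrash on the same trace.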
Article
It is shown how by using results from combinatory logic an applicative language, such as LISP, can be translated into a form from which all bound variables have been removed. A machine is described which can efficiently execute the resulting code. This implementation is compared with a conventional interpreter and found to have a number of advantages. Of these the most important is that programs which exploit higher order functions to achieve great compactness of expression are executed much more efficiently.
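The variable-removal step is bracket abstraction. Writing [x]E for the result of abstracting x from expression E, the standard S-K rules (with I used when the abstracted variable itself is reached) are:

    [x] x        = I
    [x] c        = K c                (c a constant, or a variable other than x)
    [x] (E1 E2)  = S ([x] E1) ([x] E2)

For example, abstracting x from (+ x 1), parsed as ((+ x) 1):

    [x] ((+ x) 1) = S (S (K +) I) (K 1)

and indeed S (S (K +) I) (K 1) a reduces to (+ a) 1. The repeated application of the S rule is what makes naive abstraction blow up in size as more variables are abstracted in turn.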
Article
This short article presents an algorithm for bracket abstraction [1] which avoids a combinatorial explosion in the size of the resulting expression when applied repeatedly for abstraction in a series of variables. It differs from a previous solution [2] in introducing only a finite number of additional combinators and in not requiring that all the variables to be abstracted be treated together in a single operation.
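One common formulation of the improvement (assumed here; the article's own rule set differs in detail and also introduces primed variants) adds the combinators B and C, defined by B f g x = f (g x) and C f g x = f x g, as simplification rules applied to the output of the S-K rules above:

    S (K p) (K q)  =>  K (p q)
    S (K p) I      =>  p
    S (K p) q      =>  B p q
    S p (K q)      =>  C p q

Applied to the earlier example, S (S (K +) I) (K 1): the inner S (K +) I simplifies to +, and then S + (K 1) becomes C + 1. The abstraction of x from (+ x 1) thus shrinks to C + 1, which applied to a gives + a 1.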
Pittsburgh Supercomputer Center (1989) Facilities and Services Guide, Pittsburgh, PA.