Graphics Processing Units (GPUs) are widely used to accelerate scientific applications. Many successes have been reported with speedups of two or three orders of magnitude over serial implementations of the same algorithms. These speedups typically pertain to a specific implementation with fixed parameters mapped to a specific hardware implementation. The implementations are not designed to be easily ported to other GPUs, even from the same manufacturer. When target hardware changes, the application must be re-optimized.
In this paper we address a different problem. We aim to deliver working, efficient GPU code in a library that is downloaded and run by many different users. The issue is to deliver efficiency independent of the individual user parameters and without a priori knowledge of the hardware the user will employ. This problem requires a different set of tradeoffs than finding the best runtime for a single solution. Solutions must be adaptable to a range of different parameters both to solve users' problems and to make the best use of the target hardware.
Another issue is the integration of GPUs into a Problem Solving Environment (PSE) where the use of a GPU is almost invisible from the perspective of the user. Ease of use and smooth interactions with the existing user interface are important to our approach. We illustrate our solution with the incorporation of GPU processing into the Scientific Computing Institute (SCI)Run Biomedical PSE developed at the University of Utah. SCIRun allows scientists to interactively construct many different types of biomedical simulations. We use this environment to demonstrate the effectiveness of the GPU by accelerating time-consuming algorithms in the scientist's simulations. Specifically, we target the linear solver module, including Conjugate Gradient, Jacobi, and MinRes solvers for sparse matrices.
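As a point of reference for the solvers named above, the following is a minimal CPU-side sketch of a Jacobi iteration on a sparse matrix. It is not the SCIRun or GPU implementation: the function name `jacobi`, the list-of-dicts row storage, and the stopping rule are assumptions chosen for illustration, and convergence is only expected for suitable matrices (for example, strictly diagonally dominant ones).

```python
def jacobi(rows, b, iterations=200, tol=1e-10):
    """Solve A x = b with Jacobi iteration.

    rows[i] is a dict {column: value} holding the nonzeros of row i
    (a hypothetical storage layout, chosen only for this sketch).
    """
    x = [0.0] * len(b)
    for _ in range(iterations):
        x_new = []
        for i, row in enumerate(rows):
            # Sum the off-diagonal contributions using the previous iterate.
            off_diag = sum(v * x[j] for j, v in row.items() if j != i)
            x_new.append((b[i] - off_diag) / row[i])
        # Stop once successive iterates agree to within tol.
        if max(abs(a - c) for a, c in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    return x


# Example: 4x + y = 1, 2x + 3y = 2
solution = jacobi([{0: 4.0, 1: 1.0}, {0: 2.0, 1: 3.0}], [1.0, 2.0])
```

Each row update depends only on the previous iterate, which is why this solver maps naturally onto data-parallel hardware.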
The results of a study of the VAX 8800 processor performance using a hardware monitor that collects histograms of the processor's micro-PC and memory bus status are presented. The monitor keeps a count of all machine cycles executed at each micro-PC location, as well as counting all occurrences of each bus transaction. It can measure a running system without interfering with it, and these results are based on measurements of live timesharing. Because the 8800 is a microcoded machine, a great deal of information can be gleaned from these data. Opcode and operand specifier frequencies are reported, as well as the amount of time spent in instruction execution and various kinds of overhead, such as memory management and cache-wait stalls. The histogram method yields a detailed picture of the amount of time an average VAX instruction spends in various activities on the 8800
Address transformation schemes, such as skewing and linear transformations, have been proposed to achieve conflict-free vector access for some strides in vector processors with multi-module memories. In this paper, we extend these schemes to achieve this conflict-free access for a larger number of strides. The basic idea is to perform an out-of-order access to vectors of fixed length, equal to that of the vector registers of the processor. Both matched and unmatched memories are considered: we show that the number of strides is even larger for the latter case. The hardware for address calculations and access control is described and shown to be of similar complexity as that required for access in order.
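To make the notion of skewing concrete, here is a small sketch comparing plain low-order interleaving with one simple row-skewing function. The mapping `(addr + addr // m) % m` is a textbook-style assumption chosen for illustration; it is not the extended out-of-order scheme the paper proposes, and the function names are hypothetical.

```python
def module_interleaved(addr, m):
    """Plain low-order interleaving: module = address mod m."""
    return addr % m


def module_skewed(addr, m):
    """Simple row skew: each row of m addresses is rotated by its row number."""
    return (addr + addr // m) % m


def modules_for_stride(mapping, stride, m, length):
    """Memory modules touched by a vector access of the given stride."""
    return [mapping(i * stride, m) for i in range(length)]


# With 4 modules, a stride-4 access hits one module under plain interleaving
# but is spread over all four modules under the skewed mapping.
plain = modules_for_stride(module_interleaved, 4, 4, 4)
skewed = modules_for_stride(module_skewed, 4, 4, 4)
```

An access is conflict-free when the modules touched by consecutive elements are all distinct, which is the property the address transformation is meant to provide for more strides.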
Accurate branch prediction is critical to performance; mispredicted branches mean that tens of cycles may be wasted in superscalar architectures. Architectures combining very effective branch prediction mechanisms with modified branch target buffers (BTBs) have been proposed for wide-issue processors. These mechanisms require considerable processor resources. Concurrently, the larger address space of 64-bit architectures introduces new obstacles and opportunities. A larger address space means branch target buffers become more expensive. The authors show how a combination of less expensive mechanisms can achieve better performance than BTBs. This combination relies on a number of design choices described in the paper. They used trace-driven simulation to show that their proposed design, which uses fewer resources, offers better performance than previously proposed alternatives for most programs, and indicate how to further improve this design.
Four alternative implementations for achieving higher data rates in a disk subsystem (parallel heads without replication, parallel heads with replication, parallel actuators without replication, and parallel actuators with replication) are studied. Focus is on the tradeoffs between the number of devices and the number of data paths while keeping the number of physical devices constant (which may keep the cost roughly constant). The performance advantages and limitations of the alternative implementations are analyzed using an analytic queuing model and compared to a conventional disk subsystem. The study shows that parallel heads with replication from a single actuator performs the best for the average application environments, although other configurations may be more cost-effective
Several experiments using a versatile optimizing compiler to evaluate the benefit of four forms of microarchitectural parallelisms (multiple microoperations issued per cycle, multiple result-distribution buses, multiple execution units, and pipelined execution units) are described. The first 14 Livermore loops and 10 of the linpack subroutines are used as the preliminary benchmarks. The compiler generates optimized code for different microarchitecture configurations. It is shown how the compiler can help to derive a balanced design for high performance. For each given set of technology constraints, these experiments can be used to derive a cost-effective microarchitecture to execute each given set of workload programs at high speed
Improvement of message latency and network utilization in torus interconnection networks by increasing adaptivity in wormhole routing algorithms is studied. A recently proposed partially adaptive algorithm and four new fully-adaptive routing algorithms are compared with the well-known e-cube algorithm for uniform, hotspot, and local traffic patterns. Our simulations indicate that the partially adaptive northlast algorithm, which causes unbalanced traffic in the network, performs worse than the nonadaptive e-cube routing algorithm for all three traffic patterns. Another result of our study is that the performance does not necessarily improve with full-adaptivity. In particular, a commonly discussed fully-adaptive routing algorithm, which uses 2n virtual channels per physical channel of a k-ary n-cube, performs worse than e-cube for uniform and hotspot traffic patterns. The other three fully-adaptive algorithms, which give priority to messages based on distances traveled, perform much better than the e-cube and partially-adaptive algorithms for all three traffic patterns. One of the conclusions of this study is that adaptivity, full or partial, is not necessarily a benefit in wormhole routing.
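For readers unfamiliar with the e-cube baseline, the sketch below shows dimension-order routing on a k-ary n-cube: a message corrects its coordinates one dimension at a time, in a fixed order. The function name `ecube_route` and the tie-breaking rule (prefer the increasing direction) are assumptions for illustration; virtual channels, flow control, and deadlock avoidance on the wraparound links are not modeled.

```python
def ecube_route(src, dst, k):
    """Return the list of nodes visited when routing from src to dst.

    Nodes are coordinate tuples on a k-ary n-cube (torus). Dimensions are
    corrected in increasing order, taking the shorter way around each ring.
    """
    path = [tuple(src)]
    cur = list(src)
    for dim in range(len(cur)):
        forward = (dst[dim] - cur[dim]) % k  # hops in the +1 direction
        step = 1 if forward <= k - forward else -1
        while cur[dim] != dst[dim]:
            cur[dim] = (cur[dim] + step) % k
            path.append(tuple(cur))
    return path


# On a 4-ary 2-cube: two hops in dimension 0, then one wraparound hop in dimension 1.
route = ecube_route((0, 0), (2, 3), 4)
```

Because the path is fully determined by source and destination, this algorithm is nonadaptive, which is the property the adaptive algorithms in the study relax.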
Parallel programs that use critical sections and are executed on a shared-memory multiprocessor with a write-invalidate protocol result in invalidation actions that could be eliminated. For this type of sharing, called migratory sharing, each processor typically causes a cache miss followed by an invalidation request which could be merged with the preceding cache-miss request.
In this paper we propose an adaptive protocol that invokes this optimization dynamically for migratory blocks. For other blocks, the protocol works as an ordinary write-invalidate protocol. We show that the protocol is a simple extension to a write-invalidate protocol.
Based on a program-driven simulation model of an architecture similar to the Stanford DASH, and a set of four benchmarks, we evaluate the potential performance improvements of the protocol. We find that it effectively eliminates most single invalidations which improves the performance by reducing the shared access penalty and the network traffic.
As the issue rate and pipeline depth of high-performance superscalar processors increase, an excellent branch predictor becomes more vital to delivering the potential performance of a wide-issue, deeply pipelined microarchitecture. We propose a new dynamic branch predictor (Two-Level Adaptive Branch Prediction) that achieves substantially higher accuracy than any other scheme reported in the literature. The mechanism uses two levels of branch history information to make predictions: the history of the last k branches encountered, and the branch behavior for the last s occurrences of the specific pattern of these k branches. We have identified three variations of Two-Level Adaptive Branch Prediction, depending on how finely we resolve the history information gathered. We compute the hardware costs of implementing each of the three variations, and use these costs in evaluating their relative effectiveness. We measure the branch prediction accuracy of the three variations of Two-Level Adaptive Branch Prediction, along with several other popular proposed dynamic and static prediction schemes, on the SPEC benchmarks. We show that the average prediction accuracy for Two-Level Adaptive Branch Prediction is 97 percent, while the other known schemes achieve at most 94.4 percent average prediction accuracy. We measure the effectiveness of different prediction algorithms and different amounts of history and pattern information. We measure the costs of each variation to obtain the same prediction accuracy.
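The two levels described above can be illustrated with a short simulation. This is a sketch under stated assumptions, not the authors' design: it uses a single global k-bit history register as the first level and a table of 2-bit saturating counters (one per history pattern) as the second level. The class and function names are hypothetical, and the 2-bit counter stands in for the "last s occurrences" pattern history.

```python
class TwoLevelPredictor:
    """Global-history two-level predictor with 2-bit saturating counters."""

    def __init__(self, k):
        self.k = k
        self.history = 0                 # level 1: outcomes of the last k branches
        self.counters = [1] * (1 << k)   # level 2: one counter per history pattern

    def predict(self):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self.history] >= 2

    def update(self, taken):
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the newest outcome into the k-bit history register.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.k) - 1)


def count_mispredictions(predictor, outcomes):
    """Run the predictor over a sequence of outcomes and count wrong guesses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A strictly alternating branch defeats a lone 2-bit counter but is learned here.
pattern = [i % 2 == 0 for i in range(100)]
predictor = TwoLevelPredictor(k=2)
warmup_misses = count_mispredictions(predictor, pattern[:10])
steady_misses = count_mispredictions(predictor, pattern[10:])
```

Once the counters for the two recurring history patterns have trained, every later outcome of the alternating branch is predicted correctly.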
The MIPS R6000 microprocessor relies on a new type of translation lookaside buffer — called a TLB slice — which is less than one-tenth the size of a conventional TLB and as fast as one multiplexer delay, yet has a high enough hit rate to be practical. The fast translation makes it possible to use a physical cache without adding a translation stage to the processor's pipeline. The small size makes it possible to include address translation on-chip, even in a technology with a limited number of devices.
The key idea behind the TLB slice is to have both a virtual tag and a physical tag on a physically-indexed cache. Because of the virtual tag, the TLB slice needs to hold only enough physical page number bits (typically 4 to 8) to complete the physical cache index, in contrast with a conventional TLB, which needs to hold both a virtual page number and a physical page number. The virtual page number is unnecessary because the TLB slice needs to provide only a hint for the translated physical address rather than a guarantee. The full physical page number is unnecessary because the cache hit logic is based on the virtual tag. Furthermore, if the cache is multi-level and references to the TLB slice are “shielded” by hits in a virtually indexed primary cache, the slice can get by with very few entries, once again lowering its cost and increasing its speed. With this mechanism, the simplicity of a physical cache can be combined with the speed of a virtual cache.
Existing methods of generating and analyzing traces suffer from a variety of limitations, including complexity, inaccuracy, short length, inflexibility, or applicability only to CISC (complex-instruction-set-computer) machines. The authors use a trace-generation mechanism based on link-time code modification which is simple to use, generates accurate long traces of multiuser programs, runs on a RISC (reduced-instruction-set-computer) machine, and can be flexibly controlled. Accurate performance data for large second-level caches can be obtained by on-the-fly analysis of the traces. A comparison is made of the performance of systems with 512 KB to 16 MB second-level caches, and it is shown that, for today's large programs, second-level caches of more than 4 MB may be unnecessary. It is also shown that set associativity in second-level caches of more than 1 MB does not significantly improve system performance. In addition, the experiments provide insights into first-level and second-level cache line size.
Recognizing various subcubes in a hypercube computer fully and efficiently is nontrivial because of the specific structure of the hypercube. The authors propose a method that has much less complexity than the multiple-GC strategy in generating the search space, while achieving complete subcube recognition. This method is referred to as a dynamic processor allocation scheme because the search space generated is dependent upon the dimension of the requested subcube dynamically, instead of being predetermined and fixed. The basic idea of this strategy lies in collapsing the binary tree representations of a hypercube successively so that the nodes which form a subcube but are distant would be brought close to each other for recognition. The strategy can be implemented efficiently by using shuffle operations on the leaf node addresses of binary tree representations. Extensive simulation runs are carried out to collect experimental performance measures of interest of different allocation strategies. It is shown from analytic and experimental results that this strategy compares favorably in many situations with any other known allocation scheme capable of achieving complete subcube recognition
This paper explores how the choice of sparing method impacts the performance of RAID level 5 (or parity striped) disk arrays. The three sparing methods examined are dedicated sparing, distributed sparing, and parity sparing. For database type workloads with random single block reads and writes, array performance is compared in four different modes: normal mode (no disks have failed), degraded mode (a disk has failed and its data has not been reconstructed), rebuild mode (a disk has failed and its data is being reconstructed), and copyback mode (which is needed for distributed sparing and parity sparing when failed disks are replaced with new disks). Attention is concentrated on small disk subsystems (fewer than 32 disks) where choice of sparing method has significant impact on array performance, rather than large disk subsystems (64 or more disks). It is concluded that, for disk subsystems with a small number of disks, distributed sparing offers major advantages over dedicated sparing in normal, degraded and rebuild modes of operation, even if one has to pay a copyback penalty. Furthermore, it is better than parity sparing in rebuild mode and similar to it in other operating modes, making it the sparing method of choice.
The performance of message-passing applications depends on cpu speed, communication throughput and latency, and message handling overhead. In this paper we investigate the effect of varying these parameters and applying techniques to reduce message handling overhead on the execution efficiency of ten different applications. Using a message level simulator set up for the architecture of the AP1000, we showed that improving communication performance, especially message handling, improves total performance. If a cpu that is 32 times faster is provided, the total performance increases by less than ten times unless message handling overhead is reduced. Overlapping computation with message reception improves performance significantly. We also discuss how to improve the AP1000 architecture.
Low-latency communication is the key to achieving a high-performance parallel computer. In using state-of-the-art processors, we must take cache memory into account. This paper presents an architecture for low-latency message communication, its implementation, and a performance evaluation.
We developed a message controller (MSC) to support low-latency message passing communication for the AP1000, to minimize message handling overhead. MSC sends messages directly from cache memory and automatically receives messages in the circular buffer. We designed communication functions between cells and evaluated communication performance by running benchmark programs such as the Pingpong benchmark, the LINPACK benchmark, the SLALOM benchmark, and a solver using the scaled conjugate gradient method.
A performance metric, normalized time, which is closely related to such measures as the area-time product of VLSI theory and the price/performance ratio of advertising literature is introduced. This metric captures the idea of a piece of hardware `pulling its own weight', that is, contributing as much to performance as it costs in resources. The authors prove general theorems for stating when the size of a given part is in balance with its utilization and give specific formulas for commonly found linear and quadratic devices. They also apply these formulas to an analysis of a specific processor element and discuss the implications for bit-serial-versus-word-parallel, RISC-versus-CISC (reduced-versus complex-instruction-set-computer), and VLIW (very-long-instruction-word) designs
A shared data structure is lock-free if its operations do not require mutual exclusion. If one process is interrupted in the middle of an operation, other processes will not be prevented from operating on that object. In highly concurrent systems, lock-free data structures avoid common problems associated with conventional locking techniques, including priority inversion, convoying, and difficulty of avoiding deadlock. This paper introduces transactional memory, a new multiprocessor architecture intended to make lock-free synchronization as efficient (and easy to use) as conventional techniques based on mutual exclusion. Transactional memory allows programmers to define customized read-modify-write operations that apply to multiple, independently-chosen words of memory. It is implemented by straightforward extensions to any multiprocessor cache-coherence protocol. Simulation results show that transactional memory matches or outperforms the best known locking techniques for simple benchmarks, even in the absence of priority inversion, convoying, and deadlock.
This paper studies the behavior of scientific applications running on distributed memory parallel computers. Our goal is to quantify the floating point, memory, I/O and communication requirements of highly parallel scientific applications that perform explicit communication. In addition to quantifying these requirements for fixed problem sizes and numbers of processors, we develop analytical models for the effects of changing the problem size and the degree of parallelism for several of the applications. We use the results to evaluate the trade-offs in the design of multicomputer architectures.
The VAX architecture has been extended to include an integrated, register-based vector processor. This extension allows both high-end and low-end implementations and can be supported with only small changes by VAX/VMS and VAX/ULTRIX operating systems. The extension is effectively exploited by the new vectorizing capabilities of VAX Fortran. Features of the VAX vector architecture and the design decisions which make it a consistent extension of the VAX architecture are discussed
RAID-5 arrays need 4 disk accesses to update a data block: 2 to read old data and parity, and 2 to write new data and parity. Schemes previously proposed to improve the update performance of such arrays are the Log-Structured File System and the Floating Parity Approach. Here, we consider a third approach, called Fast Write, which eliminates disk time from the host response time to a write, by using a Non-Volatile Cache in the disk array controller. We examine three alternatives for handling Fast Writes and describe a hierarchy of destage algorithms with increasing robustness to failures. These destage algorithms are compared against those that would be used by a disk controller employing mirroring. We show that array controllers require considerably more (2 to 3 times more) bus bandwidth and memory bandwidth than do disk controllers that employ mirroring. So, array controllers that use parity are likely to be more expensive than controllers that do mirroring, though mirroring is more expensive when both controllers and disks are considered.
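The four accesses of a RAID-5 small write follow from the parity identity new_parity = old_parity XOR old_data XOR new_data. The sketch below walks through that read-modify-write sequence on an in-memory stripe. The names `xor_blocks` and `small_write`, and the use of a Python list as a stand-in for disks, are assumptions for illustration; the Fast Write cache and destage algorithms discussed in the abstract are not modeled.

```python
from functools import reduce


def xor_blocks(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(disks, data_idx, parity_idx, new_data):
    """Update one data block and its parity; return the number of disk accesses."""
    old_data = disks[data_idx]        # access 1: read old data
    old_parity = disks[parity_idx]    # access 2: read old parity
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
    disks[data_idx] = new_data        # access 3: write new data
    disks[parity_idx] = new_parity    # access 4: write new parity
    return 4


# A stripe with three data blocks and one parity block (index 3).
data = [b"\x01\x02", b"\x0f\x00", b"\xff\x10"]
stripe = data + [reduce(xor_blocks, data)]
accesses = small_write(stripe, 1, 3, b"\xaa\x55")
```

The other data blocks in the stripe are never read, which is why the cost stays at four accesses regardless of stripe width.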
The Flagship project aims to produce a computing technology based on the declarative style of programming. A major component of that technology is the design for a parallel machine that can efficiently utilize the implicit parallelism in declarative programs. The computational models that expose this implicit parallelism are described, and an architecture designed to use it is outlined. The operational issues, such as dynamic load balancing, that arise in such a system are discussed, and the mechanisms being used to evaluate the architecture are described
The architecture of a coprocessor that supports the communication primitives of the Linda parallel-programming environment in hardware is described. The coprocessor is a critical element in the architecture of the Linda machine, a MIMD (multiple-instruction, multiple-data-stream) parallel-processing system that is designed top-down from the specifications of Linda. Communication in Linda programs takes place through a logically shared associative memory mechanism called tuple space. The Linda machine, however, has no physically shared memory. The microprogrammable coprocessor implements distributed protocols for executing tuple-space operations over the Linda machine communication network. The coprocessor has been designed and is in the process of fabrication. The projected performance of the coprocessor is discussed and compared with software implementation of Linda
General purpose microprocessor based computers usually speed their arithmetic processing performance by using a floating point co-processor. Because adding more co-processors represents neither a technological nor a cost problem, the authors investigated a system based on a MIPS R2000 and 4 floating point units. They show a block diagram of such an implementation and how two important scientific operations can be accelerated using a single unmodified data bus. A large percentage of the engineering applications are solved with the help of linear algebra methods like BLAS3 algorithms; it is precisely for these primitives that the proposed architecture brings significant performance gains. The first operation described is a matrix multiplication algorithm, its timing diagram and some results. Next a polynomial evaluation technique is examined. Finally they show how to use the same ideas with various other microprocessors
A multithreaded processor architecture that improves machine throughput is proposed. Instructions from different threads (not a single thread) are issued simultaneously to multiple functional units, and these instructions can begin execution unless there are functional unit conflicts. This parallel execution scheme greatly improves the utilization of the functional unit. Simulation results show that by executing two and four threads in parallel on a nine-functional-unit processor, a 2.02 and a 3.72 times speedup can be achieved over a conventional RISC processor. The architecture is also applicable to the efficient execution of a single loop. In order to control functional unit conflicts between loop iterations, a new static code scheduling technique has been developed. Another loop execution scheme that uses the multiple control flow mechanism of the architecture makes it possible to parallelize loops that are difficult to parallelize in vector or VLIW machines.
Decoupled computer architectures partition the memory access and execute functions in a computer program and achieve high performance by exploiting the fine-grain parallelism between the two. These architectures make use of an access processor to perform the data fetch ahead of demand by the execute process and hence are often less sensitive to memory access delays than conventional architectures. Past performance studies of decoupled computers used memory systems that are interleaved or pipelined. We undertake a simulation study of the latency effects in decoupled computers when connected to a single, conventional non-interleaved data memory module so that the effect of decoupling is isolated from the improvement caused by interleaving. We compare decoupled computer performance to single processors with caches, study the memory latency sensitivity of the decoupled systems, and also perform simulations to determine the significance of data caches in a decoupled computer architecture. The Lawrence Livermore Loops and two signal processing algorithms are used as the simulation benchmark.
Several local data buffers are proposed and measurements are presented for variations of the Warren abstract machine (WAM) architecture for Prolog. Choice-point buffers, stack buffers, split-stack buffers, multiple-register sets, copyback caches, and smart caches are examined. Statistics collected from four benchmark programs indicate that small conventional local memories perform quite well because of the WAM's high locality. The data memory performance results are equally valid for native code and reduced instruction set implementations of Prolog
Parity encoded redundant disk arrays provide highly reliable, cost effective secondary storage with high performance for read accesses and large write accesses. Their performance on small writes, however, is much worse than mirrored disks—the traditional, highly reliable, but expensive organization for secondary storage. Unfortunately, small writes are a substantial portion of the I/O workload of many important, demanding applications such as on-line transaction processing. This paper presents parity logging, a novel solution to the small write problem for redundant disk arrays. Parity logging applies journalling techniques to substantially reduce the cost of small writes. We provide a detailed analysis of parity logging and competing schemes—mirroring, floating storage, and RAID level 5— and verify these models by simulation. Parity logging provides performance competitive with mirroring, the best of the alternative single failure tolerating disk array organizations. However, its overhead cost is close to the minimum offered by RAID level 5. Finally, parity logging can exploit data caching much more effectively than all three alternative approaches.
Performance tuning becomes harder as computer technology advances. One of the factors is the increasing complexity of memory hierarchies. Most modern machines now use at least one level of cache memory. To reduce execution stalls, cache miss ratios must be very low. Software techniques to improve locality, such as loop blocking and copying, have been developed for numerical codes. Unfortunately, the behavior of direct-mapped and set-associative caches is still erratic when large numerical data sets are accessed. Execution time can vary drastically for the same loop kernel depending on uncontrolled factors such as the array's leading dimension. The only software method available to improve execution time stability is the copying of frequently used data, which is costly in execution time. Users are not usually cache organisation experts. They are not aware of such phenomena, and have no control over them. In this paper, we show that the recently proposed four-way skewed associative cache yields very stable execution times and good average miss ratios on blocked algorithms. As a result, execution is faster and much more predictable than with conventional caches. Because of this better behavior, it is possible to use larger block sizes with blocked algorithms, which further reduces blocking overhead costs.
It is often very difficult for programmers of parallel computers to understand how their parallel programs behave at execution time, because there is not enough insight into the interactions between concurrent activities in the parallel machine. Programmers do not only wish to obtain statistical information that can be supplied by profiling, for example. They need to have detailed knowledge about the functional behaviour of their programs. Considering performance aspects, they need timing information as well. Monitoring is a technique well suited to obtain information about both functional behaviour and timing. Global time information is essential for determining the chronological order of events on different nodes of a multiprocessor or of a distributed system, and for determining the duration of time intervals between events from different nodes. A major problem on multiprocessors is the absence of a global clock with high resolution. This problem can be overcome if a monitor system capable of supplying globally valid time stamps is used.
In this paper, the behaviour and performance of a parallel program on the SUPRENUM multiprocessor is studied. The method used for gaining insight into the runtime behaviour of a parallel program is hybrid monitoring, a technique that combines advantages of both software monitoring and hardware monitoring. A novel interface makes it possible to measure program activities on SUPRENUM. The SUPRENUM system and the ZM4 hardware monitor are briefly described. The example program under study is a parallel ray tracer. We show that hybrid monitoring is an excellent method to provide programmers with valuable information for debugging and tuning of parallel programs.
The I/O behavior of some scientific applications, a subset of Perfect benchmarks, executing on a multiprocessor is studied. The aim of this study is to explore the various patterns of I/O access of large scientific applications, and to understand the impact of this observed behavior on the I/O subsystem architecture. I/O behavior of the program is characterized by the demands it imposes on the I/O subsystem. It is observed that implicit I/O or paging is not a major problem for the applications considered and the I/O problem is mainly manifest in the explicit I/O done in the program. Various characteristics of I/O accesses are studied, and their impact on architecture design is discussed
Characteristics and constraints of real-time geometric-feature extraction are discussed. Extracting geometric features from a digital image can be characterized as a computation-intensive task in the environment of a real-time automated vision system. Such tasks require algorithms with a high degree of parallelism and pipelining under the raster-scan I/O constraint. Using the divide-and-conquer technique, many feature extractions have been formulated as a pyramid structure and then mapped into a binary tree. An efficient mapping from a tree structure into a pipelined array of 2 log N stages is presented for processing an N × N image. In the proposed mapping structure, the identification of the information growing property allows the exploitation of bit-level concurrency in the architecture design. Accordingly, the design of each staged pipelined processor is simplified, containing only bit-serial arithmetic. A single VLSI chip that can generate (p+1)(q+1) moments concurrently in real-time applications is described. This chip has a hardware complexity of O(pq(p+q) log^2 N) units, where p, q stand for the orthogonal orders of the moment. This hardware complexity is better than the O(pq(p+q)^2 log^2 N) units required by the other methods. A single VLSI chip to generate ten moments for a 512×512, 8-bit/pixel image in real time is presented.
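As a software reference for what the chip computes, the sketch below evaluates the (p+1)(q+1) geometric moments of an image directly from the usual definition m_pq = sum over pixels of x^p * y^q * f(x, y). This is a straightforward serial loop, not the pipelined bit-serial architecture of the paper; the function name `moments` and the convention that x is the column index and y the row index are assumptions.

```python
def moments(image, p_max, q_max):
    """Return m[p][q] for 0 <= p <= p_max and 0 <= q <= q_max.

    image is a list of rows of pixel intensities; pixels are visited in
    raster-scan order (row by row, left to right).
    """
    m = [[0] * (q_max + 1) for _ in range(p_max + 1)]
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            for p in range(p_max + 1):
                for q in range(q_max + 1):
                    m[p][q] += (x ** p) * (y ** q) * pixel
    return m


# A 2x2 image of ones: m00 is the total intensity, m10 and m01 are the
# first-order sums in x and y.
m = moments([[1, 1], [1, 1]], 1, 1)
```

The serial loop performs work proportional to N^2 (p+1)(q+1) for an N × N image, which is the cost a pipelined hardware design aims to hide behind the raster scan.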
The interactions between a cache's block size, fetch size, and fetch policy from the perspective of maximizing system-level performance are explored. It has been previously noted that, given a simple fetch strategy, the performance optimal block size is almost always four or eight words. If there is even a small cycle time penalty associated with either longer blocks or fetches, then the performance optimal size is noticeably reduced. In split cache organizations, where the fetch and block sizes of instruction and data caches are all independent design variables, instruction cache block size and fetch size should be the same. For the workload and write-back write policy used in this trace-driven simulation study, the instruction cache block size should be about a factor of 2 greater than the data cache fetch size, which in turn should be equal to or double the data cache block size. The simplest fetch strategy of fetching only on a miss and stalling the CPU until the fetch is complete works well. Complicated fetch strategies do not produce the performance improvements indicated by the accompanying reductions in miss ratios because of limited memory resources and a strong temporal clustering of cache misses. For the environments simulated, the most effective fetch strategy improved performance by between 1.7% and 4.5% over the simplest strategy described above
Non-blocking loads are a very effective technique for tolerating the cache-miss latency on data cache references. The authors describe several methods for implementing non-blocking loads. A range of resulting hardware complexity/performance tradeoffs are investigated using an object-code translation and instrumentation system. The authors investigate the SPEC92 benchmarks and have found that for the integer benchmarks, a simple hit-under-miss implementation achieves almost all of the available performance improvement for relatively little cost. However, for most of the numeric benchmarks, more expensive implementations are worthwhile. The results also point out the importance of using a compiler capable of scheduling load instructions for cache misses rather than cache hits in non-blocking systems
General synthesis methods for efficiently implementing self-timed
combinational logic (CL) and finite-state machines (FSM) are presented.
The resulting CL is shown to require fewer gates than other proposed
methods. The FSM is implemented by interconnecting a CL module with a
self-timed master-slave register. Alternate FSM synthesis methods are also
considered. A formal system of behavioral sequential constraints is
presented for each of the systems and their behavior is proven correct.
Thus, the synthesized CLs and FSMs can serve as correct-by-construction
building blocks for self-timed silicon system compilation
Today's commodity microprocessors require a low latency memory system to achieve high sustained performance. The conventional high-performance memory system provides fast data access via a large secondary cache. But large secondary caches can be expensive, particularly in large-scale parallel systems with many processors (and thus many caches). The authors evaluate a memory system design that can both be cost-effective and provide better performance, particularly for scientific workloads: a single level of (on-chip) cache backed up only by Jouppi's stream buffers and a main memory. This memory system requires very little hardware compared to a large secondary cache and doesn't require modifications to commodity processors. The authors use trace-driven simulation of fifteen scientific applications from the NAS and PERFECT suites in their evaluation. They present two techniques to enhance the effectiveness of Jouppi's original stream buffers: filtering schemes to reduce their memory bandwidth requirement and a scheme that enables stream buffers to prefetch data being accessed in large strides. The results show that, for the majority of the benchmarks, stream buffers can attain hit rates that are comparable to typical hit rates of secondary caches. Also, the authors find that as the data-set size of the scientific workload increases the performance of streams typically improves relative to secondary cache performance, showing that streams are more scalable to large data-set sizes
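The basic stream-buffer idea can be sketched as a small FIFO that prefetches consecutive cache lines after a miss. The following is a minimal model of a single unit-stride buffer only, assuming line-granularity addresses and a hypothetical depth of four entries; the class and method names are illustrative, and the filtering and large-stride extensions evaluated in the paper are not modeled.

```python
from collections import deque


class StreamBuffer:
    """Single unit-stride stream buffer (simplified sketch, not the paper's design)."""

    def __init__(self, depth=4):              # depth is an assumed parameter
        self.depth = depth
        self.fifo = deque()                    # line addresses already prefetched
        self.next_line = None                  # next line address to prefetch

    def _refill(self):
        # Keep the FIFO full by prefetching successive lines
        while len(self.fifo) < self.depth:
            self.fifo.append(self.next_line)
            self.next_line += 1

    def access_on_miss(self, line):
        """Called when the primary cache misses on `line`; True if the buffer supplies it."""
        if self.fifo and self.fifo[0] == line:
            self.fifo.popleft()                # head hit: hand the line to the cache
            self._refill()
            return True
        # Head miss: flush and start a new stream just past the missing line
        self.fifo.clear()
        self.next_line = line + 1
        self._refill()
        return False


sb = StreamBuffer()
hits = [sb.access_on_miss(line) for line in [10, 11, 12, 13, 50, 51]]
```

In this trace the first access to each sequential run misses and allocates the stream, and the following sequential lines are supplied by the buffer.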
Transactional memory (TM), thread-level speculation (TLS), and checkpointed multiprocessors are three popular architectural techniques based on the execution of multiple, cooperating speculative threads. In these environments, correctly maintaining data dependences across threads requires mechanisms for disambiguating addresses across threads, invalidating stale cache state, and making committed state visible. These mechanisms are both conceptually involved and hard to implement. In this paper, we present Bulk, a novel approach to simplify these mechanisms. The idea is to hash-encode a thread's access information in a concise signature, and then support in hardware signature operations that efficiently process sets of addresses. Such operations implement the mechanisms described. Bulk operations are inexact but correct, and provide substantial conceptual and implementation simplicity. We evaluate Bulk in the context of TLS using SPECint2000 codes and TM using multithreaded Java workloads. Despite its simplicity, Bulk has competitive performance with more complex schemes. We also find that signature configuration is a key design parameter
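The "inexact but correct" property of address signatures can be illustrated with a one-hash, Bloom-filter-style bit vector. This is a minimal sketch under stated assumptions: the 64-bit signature width, the 64-byte line size, the modulo hash, and all function names are hypothetical and are not Bulk's actual encoding.

```python
SIG_BITS = 64      # assumed signature width
LINE_SHIFT = 6     # assumed 64-byte cache lines


def _bit(address):
    # Illustrative hash: line address modulo the signature width (an assumption)
    return (address >> LINE_SHIFT) % SIG_BITS


def make_signature(addresses):
    """Hash-encode a set of addresses into one integer used as a bit vector."""
    sig = 0
    for a in addresses:
        sig |= 1 << _bit(a)
    return sig


def may_contain(sig, address):
    """Membership test: may report false positives, never false negatives."""
    return ((sig >> _bit(address)) & 1) == 1


def may_conflict(sig_a, sig_b):
    """Set intersection as a bitwise AND; zero proves the address sets are disjoint."""
    return (sig_a & sig_b) != 0


writes = make_signature([0x1000, 0x1040])
reads_same = make_signature([0x1040])
reads_alias = make_signature([0x2000])   # different line that hashes like 0x1000
```

A true overlap is always reported, so correctness is preserved; the aliasing case shows the cost of inexactness, which is an unnecessary conflict rather than a missed one.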
The IEEE Futurebus+ is a very fast (3GB/sec.), industry standard backplane bus specification for computer systems. Futurebus+ was designed independent of any CPU architecture so it is truly open. With this open architecture Futurebus+ can be applied to many different computing applications. Profile B is a subset of the IEEE 896 Futurebus+ standard and targets high performance, general purpose computer I/O applications. This paper describes how and why the functional, electrical, mechanical and environmental characteristics were chosen.
The design and performance analysis of partial-multiple-bus interconnection networks is described. One such structure, called processor-oriented partial-multiple-bus (or PPMB), is proposed. It serves as an alternative to the conventional structure called memory-oriented partial-multiple-bus (or MPMB) and is aimed at higher system performance at lower or equal system cost. PPMB's structural feature, which distinguishes it from the conventional structure, is to provide every memory module with B paths to processors (where B is the total number of buses). This, in contrast to the B / g paths provided in the conventional MPMB structure (where g is the number of groups), suggests a potential for higher system bandwidth. This potential is fully realized by the load-balancing arbitration mechanism suggested, which in turn highlights the advantages of the proposed structure. As a result, it has been shown, both analytically and by simulation, that a substantial increase in system bandwidth (up to 20%) is achieved by the PPMB structure over the MPMB structure. In addition to the fact that the cost of PPMB is less than, or equal to, that of MPMB, its reliability is shown to be slightly increased
This paper introduces an innovative cache design for vector computers, called prime-mapped cache. By utilizing the special properties of a Mersenne prime, the new design does not increase the critical path length of a processor, nor does it increase the cache access time as compared to a direct-mapped cache. The prime-mapped cache minimizes cache miss ratio caused by line interferences that have been shown to be critical for numerical applications by previous investigators. We show that significant performance gains are possible by adding the proposed cache memory into an existing vector computer provided that application programs can be blocked. The performance gain will increase with the increase of the speed gap between processors and memories. We develop an analytical performance model based on a generic vector computation model to study the performance of the design. Our preliminary performance analysis on various vector access patterns shows that the prime-mapped cache can provide as much as a factor of 2 to 3 performance improvement over the conventional direct-mapped cache in the vector processing environment. Moreover, the additional hardware cost introduced by the new mapping scheme is negligible.
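The effect of prime mapping on strided vector accesses can be sketched by comparing index functions. The example assumes the cache index is the line address modulo a Mersenne prime 2^c − 1 (here c = 7, so 127 lines) versus the low c bits for a direct-mapped cache; the function names are hypothetical, and `mersenne_mod` only illustrates the shift-and-add identity that makes such a modulus cheap, not the paper's hardware.

```python
def direct_mapped_index(line_addr, c):
    """Conventional mapping: low c bits of the line address (2**c lines)."""
    return line_addr & ((1 << c) - 1)


def prime_mapped_index(line_addr, c):
    """Assumed prime mapping: line address modulo 2**c - 1."""
    return line_addr % ((1 << c) - 1)


def mersenne_mod(value, c):
    """Compute value mod (2**c - 1) by folding c-bit chunks, using 2**c == 1 (mod 2**c - 1)."""
    m = (1 << c) - 1
    while value > m:
        value = (value & m) + (value >> c)
    return 0 if value == m else value


def distinct_lines(index_fn, c, stride, count):
    """Number of different cache lines touched by `count` accesses with a fixed stride."""
    return len({index_fn(i * stride, c) for i in range(count)})


C = 7                                                     # 2**7 - 1 = 127 is prime
direct = distinct_lines(direct_mapped_index, C, 128, 100)
prime = distinct_lines(prime_mapped_index, C, 128, 100)
```

With a stride of 128 lines, every access collides on one line in the direct-mapped cache, while the prime-mapped index spreads the same 100 accesses over 100 different lines.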
A technique for reducing direct-mapped cache misses caused by conflicts for a particular cache line is introduced. A small finite state machine recognizes the common instruction reference patterns for which storing an instruction in the cache actually harms performance. Such instructions are dynamically excluded, that is, they are passed directly through the cache without being stored. This reduces misses to the instructions that would have been replaced. The effectiveness of dynamic exclusion is dependent on the severity of cache conflicts and thus on the particular program and cache size of interest. However, across the SPEC benchmarks, simulation results show an average reduction in miss rate of 33% for a 32-KB instruction cache with 16 B lines. Applying dynamic exclusion to one level of a cache hierarchy can improve the performance of the next level, since instructions do not need to be stored on both levels. Dynamic exclusion also improves combined instruction and data cache miss rates.
Software-assisted cache coherence enforcement schemes for large multiprocessor systems with shared global memory and interconnection network have gained increasing attention. The authors propose a new solution that offers the fast operation of the indiscriminate invalidation approach and can selectively invalidate cache items without extensive run-time book-keeping and checking. The solution relies on the combination of compile-time reference tagging and individual invalidation of potentially stale cache lines only when referenced. Performance improvement over an indiscriminate invalidation approach is presented
Considers three simple extensions to directory-based cache coherence protocols in shared-memory multiprocessors. These extensions are aimed at reducing the penalties associated with memory accesses and include a hardware prefetching scheme, a migratory sharing optimization, and a competitive-update mechanism. Since they target different components of the read and write penalties, they can be combined effectively. Detailed architectural simulations using five benchmarks show substantial combined performance gains obtained at a modest additional hardware cost. Prefetching in combination with competitive-update is the best combination under release consistency in systems with sufficient network bandwidth. By contrast, prefetching plus the migratory sharing optimization is advantageous under sequential consistency and/or in systems with limited network bandwidth
An instruction-level simulator for IBM 3090 with VF (vector facility) has been developed for studying the performance of vector processors and their memory hierarchies. Results of a study of the locality of several large scientific applications are presented. The cache miss ratios of vectorized applications are found to be almost equal to those of their original scalar executions. Moreover, both the spatial and temporal locality of these applications (in scalar and vector executions) are strong enough to show a sufficiently high hit ratio on conventional cache structures
DASH is a scalable shared-memory multiprocessor whose architecture consists of powerful processing nodes, each with a portion of the shared-memory, connected to a scalable interconnection network. A key feature of DASH is its distributed directory-based cache coherence protocol. Unlike traditional snoopy coherence protocols, the DASH protocol does not rely on broadcast; instead it uses point-to-point messages sent between the processors and memories to keep caches consistent. Furthermore, the DASH system does not contain any single serialization or control point. While these features provide the basis for scalability, they also force a reevaluation of many fundamental issues involved in the design of a protocol. These include the issues of correctness, performance, and protocol complexity. The design of the DASH coherence protocol is presented and discussed from the viewpoint of how it addresses the above issues. Also discussed is a strategy for verifying the correctness of the protocol. A brief comparison of the protocol with the IEEE Scalable Coherent Interface protocol is made