Graphics Processing Units (GPUs) are widely used to accelerate scientific applications. Many successes have been reported with speedups of two or three orders of magnitude over serial implementations of the same algorithms. These speedups typically pertain to a specific implementation with fixed parameters mapped to a specific hardware implementation. The implementations are not designed to be easily ported to other GPUs, even from the same manufacturer. When target hardware changes, the application must be re-optimized.
In this paper we address a different problem. We aim to deliver working, efficient GPU code in a library that is downloaded and run by many different users. The issue is to deliver efficiency independent of the individual user parameters and without a priori knowledge of the hardware the user will employ. This problem requires a different set of tradeoffs than finding the best runtime for a single solution. Solutions must be adaptable to a range of different parameters both to solve users' problems and to make the best use of the target hardware.
Another issue is the integration of GPUs into a Problem Solving Environment (PSE) where the use of a GPU is almost invisible from the perspective of the user. Ease of use and smooth interactions with the existing user interface are important to our approach. We illustrate our solution with the incorporation of GPU processing into the Scientific Computing Institute (SCI)Run Biomedical PSE developed at the University of Utah. SCIRun allows scientists to interactively construct many different types of biomedical simulations. We use this environment to demonstrate the effectiveness of the GPU by accelerating time-consuming algorithms in the scientist's simulations. Specifically, we target the linear solver module, including Conjugate Gradient, Jacobi and MinRes solvers for sparse matrices.
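To make the targeted solver concrete, the following is a minimal CPU-side sketch of the Conjugate Gradient method on a sparse matrix stored in compressed sparse row (CSR) form. It is not SCIRun's code and not the GPU implementation described above; the function names and the CSR argument layout (data, indices, indptr) are assumptions chosen for illustration. The matrix is assumed to be symmetric positive definite.

```python
def csr_matvec(data, indices, indptr, x):
    """Multiply a CSR sparse matrix by a dense vector."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def conjugate_gradient(data, indices, indptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite CSR matrix A."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x, with x starting at zero
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break            # residual norm is small enough
        Ap = csr_matvec(data, indices, indptr, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Hypothetical 2x2 system: A = [[4, 1], [1, 3]], b = [1, 2].
solution = conjugate_gradient([4.0, 1.0, 1.0, 3.0], [0, 1, 0, 1], [0, 2, 4], [1.0, 2.0])
```

The sparse matrix-vector product and the dot products dominate each iteration, and they are the parts a GPU port would typically parallelize.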
The results of a study of the VAX 8800 processor performance using a hardware monitor that collects histograms of the processor's micro-PC and memory bus status are presented. The monitor keeps a count of all machine cycles executed at each micro-PC location, as well as counting all occurrences of each bus transaction. It can measure a running system without interfering with it, and these results are based on measurements of live timesharing. Because the 8800 is a microcoded machine, a great deal of information can be gleaned from these data. Opcode and operand specifier frequencies are reported, as well as the amount of time spent in instruction execution and various kinds of overhead, such as memory management and cache-wait stalls. The histogram method yields a detailed picture of the amount of time an average VAX instruction spends in various activities on the 8800.
Address transformation schemes, such as skewing and linear transformations, have been proposed to achieve conflict-free vector access for some strides in vector processors with multi-module memories. In this paper, we extend these schemes to achieve this conflict-free access for a larger number of strides. The basic idea is to perform an out-of-order access to vectors of fixed length, equal to that of the vector registers of the processor. Both matched and unmatched memories are considered: we show that the number of strides is even larger for the latter case. The hardware for address calculations and access control is described and shown to be of similar complexity as that required for access in order.
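As a small illustration of why address transformations matter, the sketch below compares plain low-order interleaving with one simple row-skewing function and counts how many distinct memory modules a strided vector access touches. The particular skew (module = (address + address // M) mod M), the function names, and the module count are assumptions for illustration only. The sketch does not implement the out-of-order access scheme proposed in the paper.

```python
def interleaved(addr, modules):
    """Plain low-order interleaving: module is the address modulo M."""
    return addr % modules


def skewed(addr, modules):
    """One simple skewing scheme: rotate each row of M addresses by its row index."""
    return (addr + addr // modules) % modules


def distinct_modules(mapping, stride, length, modules):
    """Number of different modules touched by a vector access of the given stride."""
    return len({mapping(i * stride, modules) for i in range(length)})


M = 8  # hypothetical number of memory modules; vector length equals M here
for stride in (1, 2, 8):
    plain = distinct_modules(interleaved, stride, M, M)
    skew = distinct_modules(skewed, stride, M, M)
    print(f"stride {stride}: interleaved uses {plain} modules, skewed uses {skew}")
```

An access is conflict-free when the count equals the vector length. This particular skew still leaves some strides with conflicts (stride 9 with 8 modules, for example), which is the kind of gap that motivates extending the set of conflict-free strides.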
Accurate branch prediction is critical to performance; mispredicted branches mean that tens of cycles may be wasted in superscalar architectures. Architectures combining very effective branch prediction mechanisms with modified branch target buffers (BTBs) have been proposed for wide-issue processors. These mechanisms require considerable processor resources. Concurrently, the larger address space of 64-bit architectures introduces new obstacles and opportunities. A larger address space means branch target buffers become more expensive. The authors show how a combination of less expensive mechanisms can achieve better performance than BTBs. This combination relies on a number of design choices described in the paper. They used trace-driven simulation to show that their proposed design, which uses fewer resources, offers better performance than previously proposed alternatives for most programs, and indicate how to further improve this design.
Four alternative implementations for achieving higher data rates in a disk subsystem (parallel heads without replication, parallel heads with replication, parallel actuators without replication, and parallel actuators with replication) are studied. The focus is on the tradeoffs between the number of devices and the number of data paths while keeping the number of physical devices constant (which may keep the cost roughly constant). The performance advantages and limitations of the alternative implementations are analyzed using an analytic queuing model and compared to a conventional disk subsystem. The study shows that parallel heads with replication from a single actuator performs best in average application environments, although other configurations may be more cost-effective.
Several experiments using a versatile optimizing compiler to evaluate the benefit of four forms of microarchitectural parallelism (multiple microoperations issued per cycle, multiple result-distribution buses, multiple execution units, and pipelined execution units) are described. The first 14 Livermore loops and 10 of the LINPACK subroutines are used as the preliminary benchmarks. The compiler generates optimized code for different microarchitecture configurations. It is shown how the compiler can help to derive a balanced design for high performance. For each given set of technology constraints, these experiments can be used to derive a cost-effective microarchitecture to execute each given set of workload programs at high speed.
As the issue rate and depth of pipelining of high performance superscalar processors increase, an excellent branch predictor becomes more vital to delivering the potential performance of a wide-issue, deeply pipelined microarchitecture. We propose a new dynamic branch predictor (Two-Level Adaptive Branch Prediction) that achieves substantially higher accuracy than any other scheme reported in the literature. The mechanism uses two levels of branch history information to make predictions: the history of the last k branches encountered, and the branch behavior for the last s occurrences of the specific pattern of these k branches. We have identified three variations of the Two-Level Adaptive Branch Prediction, depending on how finely we resolve the history information gathered. We compute the hardware costs of implementing each of the three variations, and use these costs in evaluating their relative effectiveness. We measure the branch prediction accuracy of the three variations of Two-Level Adaptive Branch Prediction, along with several other popular proposed dynamic and static prediction schemes, on the SPEC benchmarks. We show that the average prediction accuracy for Two-Level Adaptive Branch Prediction is 97 percent, while the other known schemes achieve at most 94.4 percent average prediction accuracy. We measure the effectiveness of different prediction algorithms and different amounts of history and pattern information. We measure the costs of each variation to obtain the same prediction accuracy.
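The two-level idea can be sketched in a few lines. The version below is a simplified illustration, not the paper's exact design: it assumes a single global history register holding the last k outcomes (first level) and a table of 2-bit saturating counters indexed by that history (second level), with the counter standing in for the summary of the last s occurrences of each pattern. The class name, the initial counter value, and the test pattern are all hypothetical.

```python
class TwoLevelPredictor:
    """Global history register plus a pattern table of 2-bit saturating counters."""

    def __init__(self, history_bits=4):
        self.k = history_bits
        self.history = 0                            # last k outcomes, newest in bit 0
        self.counters = [1] * (1 << history_bits)   # start at "weakly not taken"

    def predict(self):
        # Predict taken when the counter for the current pattern is 2 or 3.
        return self.counters[self.history] >= 2

    def update(self, taken):
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the new outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.k) - 1)


def accuracy(predictor, outcomes):
    """Fraction of outcomes predicted correctly, training as we go."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


# Hypothetical branch that repeats taken, taken, not taken.
pattern = [True, True, False] * 200
print(accuracy(TwoLevelPredictor(history_bits=4), pattern))
```

On a short repeating pattern like this one, each history value is always followed by the same outcome, so the predictor mispredicts only during warm-up. A single 2-bit counter without history would keep mispredicting the not-taken branch in every period.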
Improvement of message latency and network utilization in torus interconnection networks by increasing adaptivity in wormhole routing algorithms is studied. A recently proposed partially adaptive algorithm and four new fully-adaptive routing algorithms are compared with the well-known e-cube algorithm for uniform, hotspot, and local traffic patterns. Our simulations indicate that the partially adaptive northlast algorithm, which causes unbalanced traffic in the network, performs worse than the nonadaptive e-cube routing algorithm for all three traffic patterns. Another result of our study is that the performance does not necessarily improve with full-adaptivity. In particular, a commonly discussed fully-adaptive routing algorithm, which uses 2n virtual channels per physical channel of a k-ary n-cube, performs worse than e-cube for uniform and hotspot traffic patterns. The other three fully-adaptive algorithms, which give priority to messages based on distances traveled, perform much better than the e-cube and partially-adaptive algorithms for all three traffic patterns. One of the conclusions of this study is that adaptivity, full or partial, is not necessarily a benefit in wormhole routing.
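For reference, the nonadaptive e-cube baseline is dimension-order routing: a message corrects its offset in dimension 0 completely, then dimension 1, and so on. The sketch below shows this for a k-ary n-cube (torus), taking the shorter way around each ring. The function name and the tie-breaking rule are assumptions; virtual channels, wormhole flow control, and deadlock avoidance are not modeled.

```python
def ecube_route(src, dst, k):
    """Return the list of nodes visited by dimension-order routing in a k-ary n-cube."""
    path = [tuple(src)]
    cur = list(src)
    for dim in range(len(src)):
        while cur[dim] != dst[dim]:
            forward = (dst[dim] - cur[dim]) % k      # hops needed in the + direction
            # Take the shorter direction around the ring (ties go to +).
            step = 1 if forward <= k - forward else -1
            cur[dim] = (cur[dim] + step) % k
            path.append(tuple(cur))
    return path


# Hypothetical 4-ary 2-cube (a 4x4 torus).
print(ecube_route((0, 0), (3, 1), 4))
```

Because the route is fully determined by source and destination, the algorithm cannot steer around congestion, which is the limitation the adaptive algorithms in the study try to address.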
Parallel programs that use critical sections and are executed on a shared-memory multiprocessor with a write-invalidate protocol result in invalidation actions that could be eliminated. For this type of sharing, called migratory sharing, each processor typically causes a cache miss followed by an invalidation request which could be merged with the preceding cache-miss request.
In this paper we propose an adaptive protocol that invokes this optimization dynamically for migratory blocks. For other blocks, the protocol works as an ordinary write-invalidate protocol. We show that the protocol is a simple extension to a write-invalidate protocol.
Based on a program-driven simulation model of an architecture similar to the Stanford DASH, and a set of four benchmarks, we evaluate the potential performance improvements of the protocol. We find that it effectively eliminates most single invalidations, which improves the performance by reducing the shared access penalty and the network traffic.
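The effect of the optimization can be illustrated with a toy message count for a single shared block. The sketch below is not the protocol from the paper. In particular, the detection rule (classify a block as migratory when a write arrives while exactly two copies exist and the writer differs from the last writer) is an assumption made for illustration, and the model never reverts a block to ordinary handling. Writes without a cached copy are counted as invalidation requests for simplicity.

```python
def simulate(accesses, adaptive):
    """Count (read_misses, invalidation_requests) for one shared block."""
    sharers = set()        # processors currently holding a copy
    exclusive = False      # True when the single holder may write silently
    last_writer = None
    migratory = False      # set once the block is classified as migratory
    read_misses = invalidations = 0

    for proc, op in accesses:
        if op == "read":
            if proc in sharers:
                continue                   # cache hit, no message
            read_misses += 1
            if adaptive and migratory:
                # Merged request: the read miss also takes ownership.
                sharers = {proc}
                exclusive = True
            else:
                sharers.add(proc)
                exclusive = False
        else:  # write
            if sharers == {proc} and exclusive:
                last_writer = proc         # silent write, no message
                continue
            invalidations += 1
            # Assumed detection rule (not taken from the paper).
            if (adaptive and len(sharers) == 2 and proc in sharers
                    and last_writer not in (None, proc)):
                migratory = True
            sharers = {proc}
            exclusive = True
            last_writer = proc
    return read_misses, invalidations


# Hypothetical trace: four processors take turns in a critical section,
# each reading and then writing the same block, for two rounds.
accesses = [(p % 4, op) for p in range(8) for op in ("read", "write")]
print(simulate(accesses, adaptive=False))
print(simulate(accesses, adaptive=True))
```

In this trace the plain write-invalidate model pays one read miss and one invalidation request per critical section, while the adaptive model stops issuing separate invalidation requests once the block is classified as migratory.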
Existing methods of generating and analyzing traces suffer from a variety of limitations, including complexity, inaccuracy, short length, inflexibility, or applicability only to CISC (complex-instruction-set-computer) machines. The authors use a trace-generation mechanism based on link-time code modification which is simple to use, generates accurate long traces of multiuser programs, runs on a RISC (reduced-instruction-set-computer) machine, and can be flexibly controlled. Accurate performance data for large second-level caches can be obtained by on-the-fly analysis of the traces. A comparison is made of the performance of systems with 512 KB to 16 MB second-level caches, and it is shown that, for today's large programs, second-level caches of more than 4 MB may be unnecessary. It is also shown that set associativity in second-level caches of more than 1 MB does not significantly improve system performance. In addition, the experiments provide insights into first-level and second-level cache line size.
The MIPS R6000 microprocessor relies on a new type of translation lookaside buffer — called a TLB slice — which is less than one-tenth the size of a conventional TLB and as fast as one multiplexer delay, yet has a high enough hit rate to be practical. The fast translation makes it possible to use a physical cache without adding a translation stage to the processor's pipeline. The small size makes it possible to include address translation on-chip, even in a technology with a limited number of devices.
The key idea behind the TLB slice is to have both a virtual tag and a physical tag on a physically-indexed cache. Because of the virtual tag, the TLB slice needs to hold only enough physical page number bits (typically 4 to 8) to complete the physical cache index, in contrast with a conventional TLB, which needs to hold both a virtual page number and a physical page number. The virtual page number is unnecessary because the TLB slice needs to provide only a hint for the translated physical address rather than a guarantee. The full physical page number is unnecessary because the cache hit logic is based on the virtual tag. Furthermore, if the cache is multi-level and references to the TLB slice are “shielded” by hits in a virtually indexed primary cache, the slice can get by with very few entries, once again lowering its cost and increasing its speed. With this mechanism, the simplicity of a physical cache can be combined with the speed of a virtual cache.
Recognizing various subcubes in a hypercube computer fully and efficiently is nontrivial because of the specific structure of the hypercube. The authors propose a method that has much less complexity than the multiple-GC strategy in generating the search space, while achieving complete subcube recognition. This method is referred to as a dynamic processor allocation scheme because the search space generated is dependent upon the dimension of the requested subcube dynamically, instead of being predetermined and fixed. The basic idea of this strategy lies in collapsing the binary tree representations of a hypercube successively so that the nodes which form a subcube but are distant would be brought close to each other for recognition. The strategy can be implemented efficiently by using shuffle operations on the leaf node addresses of binary tree representations. Extensive simulation runs are carried out to collect experimental performance measures of interest for different allocation strategies. It is shown from analytic and experimental results that this strategy compares favorably in many situations with any other known allocation scheme capable of achieving complete subcube recognition.
This paper explores how the choice of sparing method impacts the performance of RAID level 5 (or parity striped) disk arrays. The three sparing methods examined are dedicated sparing, distributed sparing, and parity sparing. For database type workloads with random single block reads and writes, array performance is compared in four different modes: normal mode (no disks have failed), degraded mode (a disk has failed and its data has not been reconstructed), rebuild mode (a disk has failed and its data is being reconstructed), and copyback mode (which is needed for distributed sparing and parity sparing when failed disks are replaced with new disks). Attention is concentrated on small disk subsystems (fewer than 32 disks) where the choice of sparing method has significant impact on array performance, rather than large disk subsystems (64 or more disks). It is concluded that, for disk subsystems with a small number of disks, distributed sparing offers major advantages over dedicated sparing in normal, degraded and rebuild modes of operation, even if one has to pay a copyback penalty. Furthermore, it is better than parity sparing in rebuild mode and similar to it in other operating modes, making it the sparing method of choice.
The performance of message-passing applications depends on CPU speed, communication throughput and latency, and message handling overhead. In this paper we investigate the effect of varying these parameters and applying techniques to reduce message handling overhead on the execution efficiency of ten different applications. Using a message level simulator set up for the architecture of the AP1000, we show that improving communication performance, especially message handling, improves total performance. If a CPU that is 32 times faster is provided, the total performance increases by less than ten times unless message handling overhead is reduced. Overlapping computation with message reception improves performance significantly. We also discuss how to improve the AP1000 architecture.
Low-latency communication is the key to achieving a high-performance parallel computer. In using state-of-the-art processors, we must take cache memory into account. This paper presents an architecture for low-latency message communication, its implementation, and a performance evaluation.
We developed a message controller (MSC) to support low-latency message passing communication for the AP1000, to minimize message handling overhead. MSC sends messages directly from cache memory and automatically receives messages in the circular buffer. We designed communication functions between cells and evaluated communication performance by running benchmark programs such as the Pingpong benchmark, the LINPACK benchmark, the SLALOM benchmark, and a solver using the scaled conjugate gradient method.
A performance metric, normalized time, which is closely related to such measures as the area-time product of VLSI theory and the price/performance ratio of advertising literature, is introduced. This metric captures the idea of a piece of hardware 'pulling its own weight', that is, contributing as much to performance as it costs in resources. The authors prove general theorems for stating when the size of a given part is in balance with its utilization and give specific formulas for commonly found linear and quadratic devices. They also apply these formulas to an analysis of a specific processor element and discuss the implications for bit-serial-versus-word-parallel, RISC-versus-CISC (reduced- versus complex-instruction-set-computer), and VLIW (very-long-instruction-word) designs.
A shared data structure is lock-free if its operations do not require mutual exclusion. If one process is interrupted in the middle of an operation, other processes will not be prevented from operating on that object. In highly concurrent systems, lock-free data structures avoid common problems associated with conventional locking techniques, including priority inversion, convoying, and difficulty of avoiding deadlock. This paper introduces transactional memory, a new multiprocessor architecture intended to make lock-free synchronization as efficient (and easy to use) as conventional techniques based on mutual exclusion. Transactional memory allows programmers to define customized read-modify-write operations that apply to multiple, independently-chosen words of memory. It is implemented by straightforward extensions to any multiprocessor cache-coherence protocol. Simulation results show that transactional memory matches or outperforms the best known locking techniques for simple benchmarks, even in the absence of priority inversion, convoying, and deadlock.
This paper studies the behavior of scientific applications running on distributed memory parallel computers. Our goal is to quantify the floating point, memory, I/O and communication requirements of highly parallel scientific applications that perform explicit communication. In addition to quantifying these requirements for fixed problem sizes and numbers of processors, we develop analytical models for the effects of changing the problem size and the degree of parallelism for several of the applications. We use the results to evaluate the trade-offs in the design of multicomputer architectures.
The Flagship project aims to produce a computing technology based on the declarative style of programming. A major component of that technology is the design for a parallel machine that can efficiently utilize the implicit parallelism in declarative programs. The computational models that expose this implicit parallelism are described, and an architecture designed to use it is outlined. The operational issues, such as dynamic load balancing, that arise in such a system are discussed, and the mechanisms being used to evaluate the architecture are described.
The VAX architecture has been extended to include an integrated, register-based vector processor. This extension allows both high-end and low-end implementations and can be supported with only small changes by VAX/VMS and VAX/ULTRIX operating systems. The extension is effectively exploited by the new vectorizing capabilities of VAX Fortran. Features of the VAX vector architecture and the design decisions which make it a consistent extension of the VAX architecture are discussed.
The architecture of a coprocessor that supports the communication primitives of the Linda parallel-programming environment in hardware is described. The coprocessor is a critical element in the architecture of the Linda machine, a MIMD (multiple-instruction, multiple-data-stream) parallel-processing system that is designed top-down from the specifications of Linda. Communication in Linda programs takes place through a logically shared associative memory mechanism called tuple space. The Linda machine, however, has no physically shared memory. The microprogrammable coprocessor implements distributed protocols for executing tuple-space operations over the Linda machine communication network. The coprocessor has been designed and is in the process of fabrication. The projected performance of the coprocessor is discussed and compared with a software implementation of Linda.
General-purpose microprocessor-based computers usually speed their arithmetic processing performance by using a floating point co-processor. Because adding more co-processors represents neither a technological nor a cost problem, the authors investigated a system based on a MIPS R2000 and 4 floating point units. They show a block diagram of such an implementation and how two important scientific operations can be accelerated using a single unmodified data bus. A large percentage of engineering applications are solved with the help of linear algebra methods like BLAS3 algorithms; it is precisely for these primitives that the proposed architecture brings significant performance gains. The first operation described is a matrix multiplication algorithm, with its timing diagram and some results. Next a polynomial evaluation technique is examined. Finally they show how to use the same ideas with various other microprocessors.
A multithreaded processor architecture that improves machine throughput is proposed. Instructions from different threads (not a single thread) are issued simultaneously to multiple functional units, and these instructions can begin execution unless there are functional unit conflicts. This parallel execution scheme greatly improves the utilization of the functional units. Simulation results show that by executing two and four threads in parallel on a nine-functional-unit processor, speedups of 2.02 and 3.72 times, respectively, can be achieved over a conventional RISC processor. The architecture is also applicable to the efficient execution of a single loop. In order to control functional unit conflicts between loop iterations, a new static code scheduling technique has been developed. Another loop execution scheme that uses the multiple control flow mechanism of the architecture makes it possible to parallelize loops that are difficult to parallelize in vector or VLIW machines.