Article

Analysis of Memory Latency Factors and their Impact on KSR1 MPP Performance


Abstract

The Kendall Square Research KSR1 MPP system has a shared address space that spreads over physically distributed memory modules. Thus, memory access time can vary over a wide range even when accessing the same variable, depending on how that variable is referenced and updated by the various processors. Since the processor stalls during this access time, KSR1 performance depends considerably on the program's locality of reference. The KSR1 provides two novel features to reduce such long memory latencies: prefetch and post-store instructions. This paper analyzes the various memory latency factors that stall the processor during program execution. A suitable model for evaluating these factors is developed for the execution of FORTRAN DO-loops parallelized with the Tile construct using the Slice strategy. The DO-loops used in the benchmark program perform sparse matrix-vector multiply, vector-vector dot product, and vector-vector addition, which are typically executed in an it...
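The three kernels named in the abstract are standard iterative-solver building blocks. As a point of reference, here is a minimal C sketch of each (the paper's versions are FORTRAN DO-loops tiled with the Slice strategy; the CSR storage format and function names here are illustrative assumptions):

```c
#include <stddef.h>

/* y = A*x for an n-row sparse matrix in CSR (compressed sparse row) form */
void spmv_csr(size_t n, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];  /* indirect read: the locality-critical reference */
        y[i] = sum;
    }
}

/* dot product of x and y */
double dot(size_t n, const double *x, const double *y)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* z = x + alpha*y (vector-vector addition) */
void vadd(size_t n, const double *x, double alpha, const double *y, double *z)
{
    for (size_t i = 0; i < n; i++)
        z[i] = x[i] + alpha * y[i];
}
```

Parallelizing the outer loops across processors, as the Tile/Slice construct does for the DO-loops, makes the indirect read of x[col_idx[k]] the main source of remote references.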


... The performance of the KSR1 system has been analyzed by Dunigan [14], Windheiser et al. [13], Kahhaleh [17], and Rosti et al. [15]. ...
Article
Full-text available
Communication has a dominant impact on the performance of massively parallel processors (MPPs). We propose a methodology to evaluate the internode communication performance of MPPs using a controlled set of synthetic workloads. By generating a range of sparse matrices and measuring the performance of a simple parallel algorithm that repeatedly multiplies a sparse matrix by a dense vector, we can determine the relative performance of different communication workloads. Specifiable communication parameters include the number of nodes, the average amount of communication per node, the degree of sharing among the nodes, and the computation-communication ratio. We describe a general procedure for constructing sparse matrices that have these desired communication and computation parameters, and apply a range of these synthetic workloads to evaluate the hierarchical ring interconnection and cache-only memory architecture (COMA) of the Kendall Square Research KSR1 MPP. This analysis discusses th...
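A sketch of the construction idea: the number of nonzeros placed per row fixes the communication per node, since each nonzero in column j forces a read of the vector element owned by j's node. A hypothetical minimal generator in C (the paper's procedure additionally controls the degree of sharing and the computation-communication ratio):

```c
#include <stdlib.h>

/* Fill a CSR pattern for an n x n matrix with nnz_per_row random
 * column positions per row. Caller allocates row_ptr[n+1] and
 * col_idx[n*nnz_per_row]. Duplicate columns are tolerated in this
 * sketch; a careful generator would also place columns to hit a
 * target sharing degree across nodes. */
void synth_sparse(size_t n, size_t nnz_per_row,
                  size_t *row_ptr, size_t *col_idx)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        row_ptr[i] = k;
        for (size_t j = 0; j < nnz_per_row; j++)
            col_idx[k++] = (size_t)rand() % n;  /* each column is a communication target */
    }
    row_ptr[n] = k;
}
```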
Conference Paper
This paper proposes and evaluates Sharing/Timing Adaptive Push (STAP), a dynamic scheme for preemptively sending data from producers to consumers to minimize critical-path communication latency. STAP uses small hardware buffers to dynamically detect sharing patterns and timing requirements. The scheme applies to both intra-node and inter-socket directory-based shared memory networks. We integrate STAP into a MOESI cache-coherence (prefetching-enabled) protocol using heuristics to detect different data sharing patterns, including broadcasts, producer/consumer, and migratory-data sharing. Using 15 benchmarks from the PARSEC and SPLASH-2 suites we show that our scheme significantly reduces communication latency in NUMA systems and achieves an average of 9% performance improvement, with at most 3% on-chip storage overhead.
Article
Direct and iterative methods are often used to solve linear systems in engineering. The matrices involved can be large, which leads to heavy computations on the central processing unit. A graphics processing unit can be used to accelerate these computations. In this paper, we propose a new library, named Alinea, for advanced linear algebra. This library is implemented in C++, CUDA and OpenCL. It includes several linear algebra operations and numerous algorithms for solving linear systems. For both central processing unit and graphics processing unit devices, there are different matrix storage formats, and real and complex arithmetic in single and double precision. The CUDA version includes self-tuning of the grid, i.e. the threading distribution, depending upon the hardware configuration and the size of the problems. Numerical experiments and comparisons with existing libraries illustrate the efficiency, accuracy and robustness of the proposed library.
Article
Many large-scale computational problems are based on irregular (unstructured) domains. Some examples are finite element methods in structural analysis, finite volume methods in fluid dynamics, and circuit simulation for VLSI design. Domain decomposition is a common technique for distributing the data and work of irregular scientific applications across a distributed memory parallel machine. To obtain efficiency, subdomains must be constructed such that the work is divided with a reasonable balance among the processors while the communication-causing subdomain boundary is kept small. Application- and machine-specific information can be used in conjunction with domain decomposition to achieve a level of performance not possible with traditional domain decomposition methods. Application profiling characterizes the performance of an application on a specific machine. We present a method that uses curve-fitting of application profile data to calculate vertex and edge weights for use with weighted graph decomposition algorithms. We demonstrate its potential on two routines from a production finite element application running on the IBM SP2. Our method combined with a multilevel spectral algorithm reduced load imbalance from 52% to less than 10% for one routine in our study. Many irregular applications have several phases that must be load balanced individually to achieve high overall application performance. We propose finding one decomposition that can be used effectively for each phase of the application, and introduce a decomposition algorithm which load balances according to two vertex weight sets for use on two-phase applications. We show that this dual weight algorithm can be as successful at load balancing two individual routines together as the traditional single weight algorithm is at load balancing each routine independently. Domain decomposition algorithms take a simplistic view of multiprocessor communication. Higher performance can be achieved by considering the communication characteristics of the target multiprocessor in conjunction with decomposition techniques. We provide a methodology for tuning an application for a shared-address-space multiprocessor by using intelligent layout of the application data to reduce coherence traffic and employing latency-hiding mechanisms to overlap communication with useful work. These techniques have been applied to a finite element radar application running on the Kendall Square KSR1.
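The curve-fitting step can be sketched simply: measure per-vertex cost in a profiling run, fit it against a per-vertex size metric, and hand the fitted costs to the partitioner as vertex weights. A hedged C sketch assuming an ordinary least-squares linear model t ≈ a·s + b (the paper's actual model and metrics may differ):

```c
#include <stddef.h>

/* Ordinary least-squares fit of t[i] ~ a*s[i] + b over n samples. */
void fit_linear(size_t n, const double *s, const double *t,
                double *a, double *b)
{
    double sum_s = 0, sum_t = 0, sum_ss = 0, sum_st = 0;
    for (size_t i = 0; i < n; i++) {
        sum_s  += s[i];
        sum_t  += t[i];
        sum_ss += s[i] * s[i];
        sum_st += s[i] * t[i];
    }
    double dn = (double)n;
    *a = (dn * sum_st - sum_s * sum_t) / (dn * sum_ss - sum_s * sum_s);
    *b = (sum_t - *a * sum_s) / dn;
}

/* Vertex weight = predicted cost of a vertex of size s under the fit. */
double vertex_weight(double a, double b, double s)
{
    return a * s + b;
}
```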
Conference Paper
Full-text available
We have developed an automatic technique for evaluating the communication performance of massively parallel processors (MPPs). Both communication latency and the amount of communication are investigated as a function of a few basic parameters that characterize an application workload. Parameter values are captured in an automatically generated sparse matrix that multiplies a dense vector in the synthetic workload. Our approach is capable of explaining the degradation of processor performance caused by communication. Using the Kendall Square Research KSR1 MPP as a case study, we demonstrate the effectiveness of the technique through a series of experiments used to characterize the communication performance. We show that read and write communication latencies vary from 150 to 180 and from 80 to 100 processor cycles, respectively. We show that read communication latency approximates a linear function of the total system communication (in subpages), that write communication latency approximates a linear function of the number of distinct shared subpages, and that KSR's automatic update feature is effective in reducing the number of read communications given careful binding of threads to processors.
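The reported linear dependences can be written as a simple model (the slopes and intercepts below are placeholders to fix notation, not fitted values from the paper):

```latex
T_{\mathrm{read}}  \approx \alpha_r + \beta_r \, C_{\mathrm{total}}, \qquad
T_{\mathrm{write}} \approx \alpha_w + \beta_w \, S_{\mathrm{distinct}}
```

where C_total is the total system communication in subpages and S_distinct is the number of distinct shared subpages, with the resulting latencies falling in the measured 150-180 (read) and 80-100 (write) processor-cycle ranges.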
Conference Paper
This paper presents insight into important aspects of the performance of the KSR1 multiprocessor. We report performance degradations caused by false sharing of memory subpages (128-byte units of transfer and consistency) between local caches in the KSR1. In other words, performance is measured when multiple processing nodes issue simultaneous write requests for a single subpage. Our measurements show that low-level knowledge about the organization of the ALLCACHE memory and explicit use of non-shared variables in parallel threads can result in a performance improvement of almost an order of magnitude. However, writing such programs is much closer to distributed memory MIMD programming than to shared memory programming.
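The practical remedy the measurements point to is keeping each thread's write targets on distinct subpages. A minimal C sketch, assuming per-thread counters padded out to the 128-byte subpage size (the identifiers are illustrative):

```c
#include <stdint.h>

#define SUBPAGE 128  /* KSR1 unit of transfer and consistency, in bytes */

/* One counter per thread, padded so that no two counters share a
 * subpage; simultaneous writes then never invalidate each other. */
struct padded_counter {
    volatile int64_t value;
    char pad[SUBPAGE - sizeof(int64_t)];
};

struct padded_counter counters[64];  /* indexed by thread id */
```

The cost is memory: each 8-byte counter occupies a full 128-byte subpage, which is the trade the near order-of-magnitude speedup pays for.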
Article
The performance of a computation on a machine with a hierarchical memory organization depends to a great extent on the success of tolerating high latencies. The Kendall Square Research KSR1 provides four architectural features that hide long remote memory accesses: auto-update, combining, poststore, and prefetch. In this paper we investigate the benefits and limitations of auto-update. We also derive accurate analytical models that capture the interaction of the different components of the memory system. Finally, we apply the model to predict the memory access times for several architectural alternatives. The effectiveness of large-scale computing on scalable multiprocessors depends to a great degree on how well the memory hierarchy is managed, and to what extent locality of communication can be achieved automatically. The KSR1 multiprocessor includes four architectural features that address the memory latency problem: (i) automatic update, (ii) prefetch, ...
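A hedged sketch of the starting point for such models: expected access time as a latency-weighted sum over the levels of the KSR1 hierarchy that can satisfy a reference (the hit fractions h and latencies t are symbolic placeholders; the paper's models additionally capture the interaction of auto-update, combining, poststore, and prefetch):

```latex
\bar{T} = h_{\mathrm{local}}\, t_{\mathrm{local}}
        + h_{\mathrm{ring:0}}\, t_{\mathrm{ring:0}}
        + h_{\mathrm{ring:1}}\, t_{\mathrm{ring:1}},
\qquad h_{\mathrm{local}} + h_{\mathrm{ring:0}} + h_{\mathrm{ring:1}} = 1
```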
Article
... yield a better overall scheme. We give a detailed description of the compiler analysis necessary for integrated prefetching. The performance of integrated prefetching is compared to software and hardware prefetching, and we show the effect of adapting the scheduling of prefetches at compile time. Finally, we discuss approaches that combine integrated prefetching with the adaptive hardware prefetching technique.
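For readers unfamiliar with software prefetching, the compiler transformation at issue looks roughly like the following C sketch (using the GCC/Clang __builtin_prefetch intrinsic as a stand-in; the dissertation's integrated scheme also adapts the prefetch schedule at compile time):

```c
#define DIST 16  /* prefetch distance: iterations of lead time, tuned to memory latency */

/* dst[i] = a * src[i], with each src element requested DIST iterations
 * before it is consumed so the fetch overlaps useful work. */
void scaled_copy(int n, const double *src, double *dst, double a)
{
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&src[i + DIST], /*rw=*/0, /*locality=*/1);
        dst[i] = a * src[i];
    }
}
```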
Conference Paper
Full-text available
This paper analyzes and evaluates some novel latency-hiding features of the KSR1 multiprocessor: prefetch and poststore instructions and automatic updates. As a case study, the authors analyze the performance of an iterative sparse solver which generates irregular communications. They show that automatic updates significantly reduce the amount of communication. Although prefetch and poststore instructions reduce the coherence miss ratios, they do not significantly improve the sparse solver's performance, due to the overhead of executing these instructions.
Article
Full-text available
An edge-based finite element formulation with vector absorbing boundary conditions is presented for scattering by composite structures having boundaries satisfying impedance and/or transition conditions. Remarkably accurate results are obtained by placing the mesh a small fraction of a wavelength away from the scatterer.
Article
Full-text available
The long latencies introduced by remote accesses in a large multiprocessor can be hidden by caching. Caching also decreases the network load. We introduce a new class of architectures called Cache Only Memory Architectures (COMA). These architectures provide the programming paradigm of the shared-memory architectures, but have no physically shared memory; instead, the caches attached to the processors contain all the memory in the system, and their size is therefore large. A datum is allowed to be in any or many of the caches, and will automatically be moved to where it is needed by a cache-coherence protocol, which also ensures that the last copy of a datum is never lost. The location of a datum in the machine is completely decoupled from its address. We also introduce one example of COMA: the Data Diffusion Machine (DDM), and its simulated performance for large applications. The DDM is based on a hierarchical network structure, with processor/memory pairs at its tips. Remote accesses...
Article
The Wisconsin Multicube is a large-scale, shared-memory multiprocessor architecture that employs a snooping cache protocol over a grid of buses. Each processor has a conventional (SRAM) cache optimized to minimize memory latency and a large (DRAM) snooping cache optimized to reduce bus traffic and to maintain consistency. The large snooping cache should guarantee that nearly all the traffic on the buses will be generated by I/O and accesses to shared data. The programmer's view of the system is like a multi -- a set of processors having access to a common shared memory with no notion of geographical locality. Thus writing software, including the operating system, should be a straightforward extension of the techniques being developed for multis. The interconnection topology allows for a cache-coherent protocol for which most bus requests can be satisfied with no more than twice the number of bus operations required of a single-bus multi. The total symmetry guarantees that there are no topology-induced bottlenecks. The total bus bandwidth grows in proportion to the product of the number of processors and the average path length. The proposed architecture is an example of a new class of interconnection topologies -- the Multicube -- which consists of N = n^k processors, where each processor is connected to k buses and each bus is connected to n processors. The hypercube is a special case where n = 2. The Wisconsin Multicube is a two-dimensional Multicube (k = 2), where n scales to about 32, resulting in a proposed system of over 1,000 processors.
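Concretely, in the two-dimensional case a processor at grid position (r, c) snoops exactly two buses: row bus r and column bus c. A small illustrative C helper (the bus numbering is an assumption for the sketch):

```c
/* Bus connections of processor (r, c) in an n x n (k = 2) Multicube:
 * row buses are numbered 0..n-1 and column buses n..2n-1. */
void multicube2_buses(int n, int r, int c, int buses[2])
{
    buses[0] = r;      /* row bus, shared with the n processors in row r    */
    buses[1] = n + c;  /* column bus, shared with the n processors in col c */
}
```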
Conference Paper
The Wisconsin Multicube, a large-scale, shared-memory multiprocessor architecture that uses a snooping cache protocol over a grid of buses, is introduced. The authors describe its cache coherence protocol and discuss efficient synchronization primitives. They then discuss a number of other important design issues and modeling results. They introduce the general Multicube topology and discuss the scalability of the Wisconsin Multicube. A formal description of the cache consistency protocol is also given.
Article
Initial performance results and early experiences are reported for the Kendall Square Research multiprocessor. The basic architecture of the shared-memory multiprocessor is described, and computational and I/O performance is measured for both serial and parallel programs. Experiences in porting various applications are described. In September of 1991, a Kendall Square Research (KSR) multiprocessor was installed at Oak Ridge National Laboratory (ORNL). This report describes the results of this initial field test. The performance of the KSR shared-memory multiprocessor is compared with other shared-memory and distributed-memory multiprocessors, using synthetic benchmarks and real applications. Performance figures must be considered preliminary, since the KSR system was in its first field test. The KSR multiprocessor runs a modified version of OSF/1 (Mach). To the user, the KSR system appears like typical UNIX™, but provides performance advantages similar to...