COMIC++: A software SVM system for heterogeneous multicore accelerator clusters.
ABSTRACT: In this paper, we propose a software shared virtual memory (SVM) system for heterogeneous multicore accelerator clusters with explicitly managed memory hierarchies. The target cluster consists of a single manager node and many compute nodes. The manager node contains a general-purpose processor and a large main memory, and each compute node contains a heterogeneous multicore processor and a smaller main memory. These nodes are connected by an interconnection network, such as Gigabit Ethernet. The heterogeneous multicore processor in each compute node consists of a general-purpose processor element (GPE) and multiple accelerator processor elements (APEs). The GPE runs an OS, and the APEs are dedicated to compute-intensive workloads. The GPE is typically backed by a deep on-chip cache hierarchy and hardware cache coherence. In contrast, each APE has a small, explicitly addressed local memory instead of caches, and this local memory is not coherent with the main memory. The main and local memory units in the accelerator cluster can be viewed as an explicitly managed memory hierarchy: global memory, node local memory, and APE local memory. Since the coherence protocols of previous software SVM proposals cannot effectively handle such a memory hierarchy, we propose a new coherence and consistency protocol, called hierarchical centralized release consistency (HCRC). Our software SVM system is built on top of HCRC and software-managed caches. We evaluate the effectiveness and analyze the performance of our software SVM system on a 32-node heterogeneous multicore cluster (a total of 192 APEs).
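The following is a minimal sketch of how a release-consistent SVM over this kind of explicitly managed hierarchy might be used from an APE kernel. It is not the COMIC++ API: the names (svm_acquire, svm_release), the buffer sizes, and the memcpy-based data movement are hypothetical stand-ins for the runtime's hierarchical transfers between global memory, node local memory, and APE local memory.

```c
/* Hypothetical sketch of an acquire/release interface over an explicitly
 * managed memory hierarchy; not the COMIC++ API. Data movement is modeled
 * with memcpy purely for illustration. */
#include <stdio.h>
#include <string.h>

#define BLOCK_FLOATS 8

/* Stand-in for the globally shared memory kept coherent by the manager node. */
static float global_mem[BLOCK_FLOATS] = {1, 2, 3, 4, 5, 6, 7, 8};

/* On acquire, the runtime would copy the latest version of a shared block
 * down the hierarchy (global -> node -> APE local store); on release, it
 * would propagate the APE's modifications back up. */
static void svm_acquire(float *local, size_t n) { memcpy(local, global_mem, n * sizeof(float)); }
static void svm_release(const float *local, size_t n) { memcpy(global_mem, local, n * sizeof(float)); }

int main(void)
{
    float local[BLOCK_FLOATS];        /* explicitly addressed APE local memory     */
    svm_acquire(local, BLOCK_FLOATS); /* bring a coherent copy into the local store */
    for (size_t i = 0; i < BLOCK_FLOATS; i++)
        local[i] *= 2.0f;             /* compute on the local copy                 */
    svm_release(local, BLOCK_FLOATS); /* writes become visible at the next acquire */
    printf("%f\n", global_mem[0]);    /* prints 2.000000 */
    return 0;
}
```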
ABSTRACT: Most software-based distributed shared memory (DSM) systems rely on the operating system's virtual memory interface to detect writes to shared data. Strategies based on virtual memory page protection create two problems for a DSM system. First, writes can have high overhead, since they are detected with a page fault; as a result, a page must be written many times to amortize the cost of that fault. Second, the size of a virtual memory page is too big to serve as a unit of coherence, inducing false sharing. Mechanisms to handle false sharing can increase runtime overhead and may cause data to be unnecessarily communicated between processors. In this paper, we present a new method for write detection that solves these problems. Our method relies on the compiler and runtime system to detect writes to shared data without invoking the operating system. We measure and compare implementations of a distributed shared memory system using both strategies, virtual memory and compiler/runtime, run… 11/1994
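A minimal sketch of the compiler/runtime approach the abstract describes, with hypothetical names and a hypothetical block size: instead of write-protecting pages and paying a page fault on the first store, the compiler emits a few instructions before each store to shared data that set a dirty flag for a small coherence block, which the runtime later scans.

```c
/* Hypothetical sketch of compiler-inserted write detection; not the paper's
 * implementation. The coherence unit is a small block, not a VM page, which
 * also reduces false sharing. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_WORDS  16                            /* fine-grained coherence unit */
#define SHARED_WORDS 1024

static int     shared_region[SHARED_WORDS];        /* shared data                 */
static uint8_t dirty[SHARED_WORDS / BLOCK_WORDS];  /* one dirty flag per block    */

/* The check the compiler would emit before each store to shared data. */
static void mark_dirty(const int *addr)
{
    size_t idx = (size_t)(addr - shared_region);
    dirty[idx / BLOCK_WORDS] = 1;
}

/* An instrumented store: detection costs a few instructions, not an OS trap. */
static void shared_store(int *p, int value)
{
    mark_dirty(p);
    *p = value;
}

int main(void)
{
    shared_store(&shared_region[33], 42);
    /* At a synchronization point the runtime scans `dirty` for modified blocks. */
    for (size_t b = 0; b < sizeof(dirty); b++)
        if (dirty[b])
            printf("block %zu is dirty\n", b);
    return 0;
}
```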
Conference Proceeding: Cashmere-2L: Software Coherent Shared Memory on a Clustered Remote-Write Network.
ABSTRACT: Low-latency remote-write networks, such as DEC's Memory Channel, provide the possibility of transparent, inexpensive, large-scale shared-memory parallel computing on clusters of shared-memory multiprocessors (SMPs). The challenge is to take advantage of hardware shared memory for sharing within an SMP, and to ensure that software overhead is incurred only when actively sharing data across SMPs in the cluster. In this paper, we describe a "two-level" software coherent shared memory system, Cashmere-2L, that meets this challenge. Cashmere-2L uses hardware to share memory within a node, while exploiting the Memory Channel's remote-write capabilities to implement "moderately lazy" release consistency with multiple concurrent writers, directories, home nodes, and page-size coherence blocks across nodes. Cashmere-2L employs a novel coherence protocol that allows a high level of asynchrony by eliminating global directory locks and the need for TLB shootdown. Remote interrupts are minimized by exploiting the remote-write capabilities of the Memory Channel network. Cashmere-2L currently runs on an 8-node, 32-processor DEC AlphaServer system. Speedups range from 8 to 31 on 32 processors for our benchmark suite, depending on the application's characteristics. We quantify the importance of our protocol optimizations by comparing performance to that of several alternative protocols that do not share memory in hardware within an SMP, and require more synchronization. In comparison to a one-level protocol that does not share memory in hardware within an SMP, Cashmere-2L improves performance by up to 46%. 01/1997
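Protocols of this kind typically support multiple concurrent writers with the twin/diff technique: make a pristine copy (twin) of a page at the first local write, then at a release compare the page against its twin and forward only the modified words to the page's home node. The sketch below is a hypothetical illustration of that technique, not Cashmere-2L code, and it omits the Memory Channel remote-write path.

```c
/* Hypothetical twin/diff sketch for multiple-writer release consistency;
 * not Cashmere-2L code. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_WORDS 1024

static uint32_t page[PAGE_WORDS];   /* locally writable working copy          */
static uint32_t twin[PAGE_WORDS];   /* pristine copy taken at the first write */

static void on_first_write(void) { memcpy(twin, page, sizeof(page)); }

/* At a release, transmit only the words that differ from the twin; the home
 * node merges diffs arriving from different writers of the same page. */
static void send_diff_to_home(void (*send)(size_t index, uint32_t value))
{
    for (size_t i = 0; i < PAGE_WORDS; i++)
        if (page[i] != twin[i])
            send(i, page[i]);
}

static void fake_send(size_t index, uint32_t value)
{
    printf("diff: word %zu = %u\n", index, (unsigned)value);
}

int main(void)
{
    on_first_write();             /* twin the page before modifying it           */
    page[3]   = 7;                /* concurrent writers touch disjoint words     */
    page[700] = 9;
    send_diff_to_home(fake_send); /* only two words are sent, not the whole page */
    return 0;
}
```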
Conference Proceeding: Sequoia: programming the memory hierarchy
ABSTRACT: We present Sequoia, a programming language designed to facilitate the development of memory hierarchy-aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within it. We have implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed memory clusters, and demonstrate efficient performance running Sequoia programs on both of these platforms. SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing; 01/2006
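To illustrate the model without writing Sequoia syntax, the following C sketch (hypothetical names and block sizes) shows the underlying idea: an inner task performs no computation itself and only partitions its working set into child tasks sized to fit the next, smaller memory level, while a leaf task computes entirely within that level; in Sequoia, the runtime rather than the programmer would move each child's data down the hierarchy.

```c
/* Hypothetical C sketch of Sequoia-style hierarchical task decomposition;
 * Sequoia expresses this in its own language, with data movement handled
 * by the compiler and runtime. */
#include <stdio.h>

#define LEAF_BLOCK 4   /* working-set size assumed to fit the lowest memory level */

/* Leaf task: runs entirely within the smallest memory level (e.g. SPE local store). */
static void vadd_leaf(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Inner task: only splits the problem into child tasks whose data fits one
 * level further down the hierarchy (cluster node memory, then local store). */
static void vadd_task(const float *a, const float *b, float *c, int n)
{
    if (n <= LEAF_BLOCK) {
        vadd_leaf(a, b, c, n);
        return;
    }
    int half = n / 2;
    vadd_task(a, b, c, half);
    vadd_task(a + half, b + half, c + half, n - half);
}

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];
    vadd_task(a, b, c, 8);
    printf("%.1f %.1f\n", c[0], c[7]);  /* both 9.0 */
    return 0;
}
```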