COMIC++: A software SVM system for heterogeneous multicore accelerator clusters.
ABSTRACT: In this paper, we propose a software shared virtual memory (SVM) system for heterogeneous multicore accelerator clusters with explicitly managed memory hierarchies. The target cluster consists of a single manager node and many compute nodes. The manager node contains a general-purpose processor and larger main memory, and each compute node contains a heterogeneous multicore processor and smaller main memory. These nodes are connected with an interconnection network, such as Gigabit Ethernet. The heterogeneous multicore processor in each compute node consists of a general-purpose processor element (GPE) and multiple accelerator processor elements (APEs). The GPE runs an OS and the multiple APEs are dedicated to compute-intensive workloads. The GPE is typically backed by a deep on-chip cache hierarchy and hardware cache coherence. On the other hand, the APEs have small, explicitly addressed local memory instead of caches. This APE local memory is not coherent with the main memory. The different main and local memory units in the accelerator cluster can be viewed as an explicitly managed memory hierarchy: global memory, node local memory, and APE local memory. Since the coherence protocols of previous software SVM proposals cannot effectively handle such a memory hierarchy, we propose a new coherence and consistency protocol, called hierarchical centralized release consistency (HCRC). Our software SVM system is built on top of HCRC and software-managed caches. We evaluate the effectiveness and analyze the performance of our software SVM system on a 32-node heterogeneous multicore cluster (a total of 192 APEs).
ABSTRACT: We propose a software transactional memory (STM) for heterogeneous multicores with small local memory. The heterogeneous multicore architecture consists of a general-purpose processor element (GPE) and multiple accelerator processor elements (APEs). The GPE is typically backed by a deep, on-chip cache hierarchy and hardware cache coherence. On the other hand, the APEs have small, explicitly addressed local memory that is not coherent with the main memory. Programmers of such multicore architectures must deal with explicit memory management and coherence problems. An STM for such multicores can ease the programmer's burden by transparently handling data transfers at run time. Moreover, it frees the programmer from managing locks. Our TM is based on an existing software SVM for the accelerator architecture. The software SVM exploits software-managed caches and coherence protocols between the GPE and APEs. We also propose an optimization technique for the TM, called abort prediction. It blocks a transaction from running until the chance of potential conflicts is eliminated. We implement the TM system and the optimization technique for a single Cell BE processor and evaluate their effectiveness with six compute-intensive benchmark applications.
19th International Conference on Parallel Architecture and Compilation Techniques (PACT 2010), Vienna, Austria, September 11-15, 2010; 01/2010
Conference Paper: An OpenCL framework for heterogeneous multicores with local memory.
ABSTRACT: In this paper, we present the design and implementation of an Open Computing Language (OpenCL) framework that targets heterogeneous accelerator multicore architectures with local memory. The architecture consists of a general-purpose processor core and multiple accelerator cores that typically do not have any cache. Each accelerator core, instead, has a small internal local memory. To overcome the limited size of the local memory, our OpenCL runtime is based on software-managed caches and coherence protocols that guarantee OpenCL memory consistency. To boost performance, the runtime relies on three source-code transformation techniques performed by our OpenCL C source-to-source translator: work-item coalescing, web-based variable expansion, and preload-poststore buffering. Work-item coalescing serializes multiple SPMD-like tasks that execute concurrently in the presence of barriers and runs them sequentially on a single accelerator core. It requires the web-based variable expansion technique to allocate local memory for private variables. Preload-poststore buffering is a buffering technique that eliminates the overhead of software cache accesses. Together with work-item coalescing, it has a synergistic effect on boosting performance. We show the effectiveness of our OpenCL framework by evaluating its performance on a system that consists of two Cell BE processors. The experimental results show that our approach is promising.
19th International Conference on Parallel Architecture and Compilation Techniques (PACT 2010), Vienna, Austria, September 11-15, 2010; 01/2010
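As a rough illustration of the work-item coalescing idea described in the abstract above, the following sketch serializes a barrier-separated kernel into loops over work-items. It is written in Python rather than OpenCL C purely for readability, and it is not the authors' translator: the kernel, the function names (`reference_result`, `coalesced_kernel`), and the manual expansion of the private variable `tmp` into a per-work-item array are all assumptions made for this example.

```python
# Hypothetical kernel, per work-item i (conceptual SPMD form):
#     tmp = a[i] * 2            # private variable
#     shared[i] = tmp           # write to work-group local memory
#     barrier()
#     out[i] = shared[(i + 1) % n] + tmp


def reference_result(a):
    """What the SPMD kernel computes if all work-items run concurrently."""
    n = len(a)
    return [2 * a[(i + 1) % n] + 2 * a[i] for i in range(n)]


def coalesced_kernel(a):
    """Run every work-item sequentially on one core.

    The kernel body is split at the barrier and each region is wrapped in
    its own loop over work-item ids. The private variable `tmp` is live
    across the barrier, so it is expanded to one slot per work-item
    (a hand-written stand-in for variable expansion).
    """
    n = len(a)
    shared = [0] * n  # work-group local memory
    tmp = [0] * n     # expanded private variable
    out = [0] * n

    # Region before the barrier, executed for all work-items.
    for i in range(n):
        tmp[i] = a[i] * 2
        shared[i] = tmp[i]

    # The barrier is implied by the boundary between the two loops.

    # Region after the barrier, executed for all work-items.
    for i in range(n):
        out[i] = shared[(i + 1) % n] + tmp[i]

    return out
```

Running both regions inside a single loop would read `shared[(i + 1) % n]` before the neighbouring work-item had written it, which is why the body is split at the barrier.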
ABSTRACT: The trend in industry is towards heterogeneous multicore processors (HMCs), including chips with CPUs and massively-threaded throughput-oriented processors (MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the cores with cache-coherent shared virtual memory (CCSVM), this is not the communication paradigm used by any current HMC. In this paper, we present a CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads programming model, called xthreads, for programming this HMC. Our goal is to evaluate the potential performance benefits of tightly coupling heterogeneous cores with CCSVM.
10/2013