-
41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41 2008), November 8-12, 2008, Lake Como, Italy; 01/2008
-
IEEE Trans. Parallel Distrib. Syst. 01/2007; 18:1028-1040.
-
Proceedings of the 19th Annual International Conference on Supercomputing, ICS 2005, Cambridge, Massachusetts, USA, June 20-22, 2005; 01/2005
-
[show abstract]
[hide abstract]
ABSTRACT: Multiprocessing and multithreading are becoming ubiquitous even on single chips. With increasing cache sizes, coherence misses in such systems will account for a larger fraction of all cache misses. As communication latencies increase, this larger fraction of coherence misses will cause significant and increased performance losses. Tuning coherence protocols for specific communication patterns and applications can reduce communication latencies. However, these optimizations increase a protocol's design complexity, making the protocol difficult to verify. A competing approach requires parallel programmers to tune applications to work well with simpler protocols. Speculative execution has successfully improved performance in various scenarios. We propose a new type of load speculation, called coherence decoupling. Coherence decoupling is a microarchitectural mechanism that implements separate protocols for speculative use and for the eventual verification of values. The technique reduces the effect of long communication latencies while mitigating the burdens on the coherence protocol designer and the parallel programmer
IEEE Micro 12/2004; · 1.78 Impact Factor
-
[show abstract]
[hide abstract]
ABSTRACT: This paper explores a new technique called coherence decoupling, which breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce --- if not eliminate --- the e#ects of false sharing. Furthermore, coherence decoupling can also reduce latencies incurred by true sharing. SCL protocols reduce those latencies by speculatively writing updates into invalid lines, thereby increasing the accuracy of speculation, without complicating the simple, underlying coherence protocol that guarantees correctness.
09/2004;
-
TACO. 01/2004; 1:62-93.
-
IEEE Micro. 01/2004; 24:104-109.
-
[show abstract]
[hide abstract]
ABSTRACT: The Tera-op reliable intelligently adaptive processing system (TRIPS) architecture seeks to deliver system-level configurability to applications and runtime systems. It does so by employing the concept of polymorphism, which permits the runtime system to configure the hardware execution resources to match the mode of execution and demands of the compiler and application.
IEEE Micro 12/2003; 23(6):46- 51. · 1.78 Impact Factor
-
[show abstract]
[hide abstract]
ABSTRACT: We describe the polymorphous TRIPS architecture, which can be configured for different granularities and types of parallelism. TRIPS contains mechanisms that enable the processing cores and the on-chip memory system to be configured and combined in different modes for instruction, data, or thread-level parallelism. To adapt to small and large-grain concurrency, the TRIPS architecture contains four out-of-order, 16-wide-issue grid processor cores, which can be partitioned when easily extractable fine-grained parallelism exists. This approach to polymorphism provides better performance across a wide range of application types than an approach in which many small processors are aggregated to run workloads with irregular parallelism. Our results show that high performance can be obtained in each of the three modes-ILP, TLP, and DLP-demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.
Computer Architecture, 2003. Proceedings. 30th Annual International Symposium on; 07/2003
-
[show abstract]
[hide abstract]
ABSTRACT: False sharing of data is an important phenomenon affecting performance in shared memory multiprocessors. False sharing results in unnecessary coherency overhead by causing invalidation of the shared cache line, and increasing the latency of the load accessing the cache line. As microprocessors incorporate increasingly large caches with large cache lines, false sharing will become more common, and unless methods are proposed to alleviate, can reduce the performance of shared memory multiprocessors. In this report, we propose sharing speculation, a mechanism to speculatively load values from cache lines that are falsely shared to reduce the latency of these loads, and later validating the speculation when the coherence permissions are eventually granted. We give a specific example of our proposed mechanism in the context of the Grid Processor, and show how the inherent data speculation mechanisms in the Grid Processor lend themselves to the efficient implementation of sharing speculation.
06/2003;
-
[show abstract]
[hide abstract]
ABSTRACT: This paper describes the polymorphous TRIPS architecture which can be configured for different granularities and types of parallelism. TRIPS contains mechanisms that enable the processing cores and the on-chip memory system to be configured and combined in different modes for instruction, data, or thread-level parallelism. To adapt to small and large-grain concurrency, the TRIPS architecture contains four out-of-order, 16-wide-issue Grid Processor cores, which can be partitioned when easily extractable fine-grained parallelism exists. This approach to polymorphism provides better performance across a wide range of application types than an approach in which many small processors are aggregated to run workloads with irregular parallelism. Our results show that high performance can be obtained in each of the three modes--ILP, TLP, and DLP--demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.
05/2003;
-
[show abstract]
[hide abstract]
ABSTRACT: This paper describes the polymorphous TRIPS architecture which can be configured for different granularities and types of parallelism. TRIPS contains mechanisms that enable the processing cores and the on-chip memory system to be configured and combined in different modes for instruction, data, or thread-level parallelism. To adapt to small and large-grain concurrency, the TRIPS architecture contains four out-of-order, 16-wide-issue Grid Processor cores, which can be partitioned when easily extractable fine-grained parallelism exists. This approach to polymorphism provides better performance across a wide range of application types than an approach in which many small processors are aggregated to run workloads with irregular parallelism. Our results show that high performance can be obtained in each of the three modes--ILP, TLP, and DLP--demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.
05/2003;
-
[show abstract]
[hide abstract]
ABSTRACT: This paper describes the polymorphous TRIPS architecture which can be configured for different granularities and types of parallelism. TRIPS contains mechanisms that enable the processing cores and the on-chip memory system to be configured and combined in different modes for instruction, data, or thread-level parallelism. To adapt to small and large-grain concurrency, the TRIPS architecture contains four out-of-order, 16-wide-issue Grid Processor cores, which can be partitioned when easily extractable fine-grained parallelism exists. This approach to polymorphism provides better performance across a wide range of application types than an approach in which many small processors are aggregated to run workloads with irregular parallelism. Our results show that high performance can be obtained in each of the three modes-ILP, TLP, and DLP-demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.
Computer Architecture, International Symposium on. 05/2003;
-
[show abstract]
[hide abstract]
ABSTRACT: This paper describes the polymorphous TRIPS architecture which can be configured for different granularities and types of parallelism. TRIPS contains mechanisms that enable the processing cores and the on-chip memory system to be configured and combined in different modes for instruction, data, or thread-level parallelism. To adapt to small and large-grain concurrency, the TRIPS architecture contains four out-of-order, 16-wide-issue Grid Processor cores, which can be partitioned when easily extractable fine-grained parallelism exists. This approach to polymorphism provides better performance across a wide range of application types than an approach in which many small processors are aggregated to run workloads with irregular parallelism. Our results show that high performance can be obtained in each of the three modes--ILP, TLP, and DLP--demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.
05/2003;
-
[show abstract]
[hide abstract]
ABSTRACT: In this paper, we study the space of chip multiprocessor (CMP) organizations. We compare the area and performance trade-offs for CMP implementations to determine how many processing cores future server CMPs should have, whether the cores should have in-order or out-of-order issue, and how big the per-processor on-chip caches should be. We find that, contrary to some conventional wisdom, out-of-order processing cores will maximize job throughput on future CMPs. As technology shrinks, limited off-chip bandwidth will begin to curtail the number of cores that can be effective on a single die. Current projections show that the transistor/signal pin ratio will increase by a factor of 45 between 180 and 35 nanometer technologies. That disparity will force increases in per-processor cache capacities as technology shrinks, from 128KB at lOOnre, to 256KB at 7Onto, and to 1MB at 50 and 35rim, reducing the number of cores that would otherwise be possible.
05/2002;
-
[show abstract]
[hide abstract]
ABSTRACT: We study the space of chip multiprocessor (CMP) organizations. We compare the area and performance trade-offs for CMP implementations to determine how many processing cores future server CMPs should have, whether the cores should have in-order or out-of-order issues, and how big the per-processor on-chip caches should be. We find that, contrary to some conventional wisdom, out-of-order processing cores will maximize job throughput on future CMPs. As technology shrinks, limited off-chip bandwidth will begin to curtail the number of cores that can be effective on a single die. Current projections show that the transistor/signal pin ratio will increase by a factor of 45 between 180 and 35 nanometer technologies. That disparity will force increases in per-processor cache capacities as technology shrinks, from 128KB at 100nm, to 256KB at 70nm, and to 1MB at 50 and 35nm, reducing the number of cores that would otherwise be possible
Parallel Architectures and Compilation Techniques, 2001. Proceedings. 2001 International Conference on; 02/2001
-
2001 International Conference on Parallel Architectures and Compilation Techniques (PACT 2001), 8-12 September 2001, Barcelona, Spain; 01/2001
-
[show abstract]
[hide abstract]
ABSTRACT: This paper describes the polymorphous TRIPS architecture that can be configured for different granularities and types of parallelism. The TRIPS architecture is the first in a class of post-RISC, dataflow-like instruction sets called explicit data-graph execution (EDGE). This EDGE ISA is coupled with hardware mechanisms that enable the processing cores and the on-chip memory system to be configured and combined in different modes for instruction, data, or thread-level parallelism. To adapt to small and large-grain concurrency, the TRIPS architecture prototype contains two out-of-order, 16-wide-issue grid processor cores, which can be partitioned when easily extractable fine-grained parallelism exists. This approach to polymorphism provides better performance across a wide range of application types than an approach in which many small processors are aggregated to run workloads with irregular parallelism. Our results show that high performance can be obtained in each of the three modes---ILP, TLP, and DLP---demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.
ACM Transactions on Architecture and Code Optimization 1(1):62-93. · 0.57 Impact Factor