[Show abstract][Hide abstract] ABSTRACT: Addresses suffering from cache misses typically exhibit repetitive patterns due to the temporal locality inherent in the access stream. However, we observe that the number of in- tervening misses at the last-level cache between the eviction of a particular block and its reuse can be very large, pre- venting traditional victim caching mechanisms from exploiting this repeating behavior. In this paper, we present Scavenger, a new architecture for last-level caches. Scavenger divides the total storage budget into a conventional cache and a novel victim file architecture, which employs a skewed Bloom filter in conjunction with a pipelined priority heap to identify and retain the blocks that most frequently missed in the conven- tional part of the cache in the recent past. When compared against a baseline configuration with a 1MB 8-way L2 cache, a Scavenger configuration with a 512kB 8-way conventional cache and a 512kB victim file achieves an IPC improvement of up to 63% and on average (geometric mean) 14.2% for nine memory-bound SPEC 2000 applications. On a larger set of sixteen SPEC 2000 applications, Scavenger achieves an aver- age speedup of 8%.
[Show abstract][Hide abstract] ABSTRACT: We present core fusion, a reconfigurable chip multiprocessor (CMP) architecture where groups of fundamentally independent cores can dynamically morph into a larger CPU, or they can be used as distinct processing elements, as needed at run time by applications. Core fusion gracefully accommodates software diversity and incremental parallelization in CMPs. It provides a single execution model across all configurations, requires no additional programming effort or specialized compiler support, maintains ISA compatibility, and leverages mature micro-architecture technology.
[Show abstract][Hide abstract] ABSTRACT: This work investigates the integration of CMOS-compatible optical technology to on-chip coherent buses for future CMPs. The analysis results in a hierarchical optoelectrical bus that exploits the advantages of optical technology while abiding by projected limitations. This bus achieves significant performance improvement for high-bandwidth applications relative to a state-of-the-art fully electrical bus
[Show abstract][Hide abstract] ABSTRACT: This paper presents core fusion, a reconfigurable chip mul- tiprocessor (CMP) architecture where groups of fundamen- tally independent cores can dynamically morph into a larger CPU, or they can be used as distinct processing elements, as needed at run time by applications. Core fusion grace- fully accommodates software diversity and incremental par- allelization in CMPs. It provides a single execution model across all configurations, requires no additional program- ming effort or specialized compiler support, maintains ISA compatibility, and leverages mature micro-architecture tech- nology.
[Show abstract][Hide abstract] ABSTRACT: Although silicon optical technology is still in its formative stages, and the more near-term application is chip-to-chip communication, rapid advances have been made in the de- velopment of on-chip optical interconnects. In this paper, we investigate the integration of CMOS-compatible optical technology to on-chip cache-coherent buses in future CMPs. While not exhaustive, our investigation yields a hierarchi- cal opto-electrical system that exploits the advantages of op- tical technology while abiding by projected limitations. Our evaluation shows that, for the applications considered, com- pared to an aggressive all-electrical bus of similar power and area, significant performance improvements can be achieved using an opto-electrical bus. This performance improvement is largely dependent on the application's bandwidth demand and on the number of implemented wavelengths per optical waveguide. We also present a number of critical areas for future work that we discover in the course of our research.
[Show abstract][Hide abstract] ABSTRACT: Checkpointed early resource recycling (Cherry) is a recently-proposed microarchitectural technique that aims at improving critical resource utilization by performing aggressive resource recycling decoupled from instruction retirement, using a checkpoint/rollback mechanism to recover from occasional incorrect execution. In this paper, we explore correctness and performance issues that arise when Cherry-enabled processors are used in chip multiprocessor architectures. We propose mechanisms to address cache coherence, memory consistency, and forward progress issues in such environments. We also provide quantitative insight on the performance impact of the Cherry mechanism on parallel processing.
[Show abstract][Hide abstract] ABSTRACT: Long-latency loads are critical in today's processors due to the ever-increasing speed gap with memory. Not only do these loads block the execution of dependent instructions, they also prevent other instructions from moving through the in-order reorder buffer (ROB) and retire. As a result, the processor quickly fills up with uncommitted instructions, and computation ultimately stalls. To attack this problem, we propose checkpointed early load retirement, a mechanism that combines register checkpointing and back-end .e., at retirement - load-value prediction. When a long-latency load hits the ROB head unresolved, the processor enters clear mode by (1) taking a checkpoint of the architectural registers, (2) supplying a load-value prediction to consumers, and (3) early-retiring the long-latency load. This unclogs the ROB, thereby "clearing the way" for subsequent instructions to retire, and also allowing instructions dependent on the long-latency load to execute sooner. When the actual value returns from memory, it is compared against the prediction. A misprediction causes the processor to roll back to the checkpoint, discarding all subsequent computation. The benefits of executing in clear mode come from providing early forward progress on correct predictions, and from warming up caches and other structures on wrong predictions. Our evaluation shows that a clear implementation with support for four checkpoints yields an average speedup of 1.12 for both eleven integer and eight floating-point applications (1.27 and 1.19 for five integer and five floating point memory-bound applications, respectively), relative to a contemporary out-of-order processor with an aggressive hardware prefetcher.