Modern graphics processing units (GPUs) deliver tremendous computing horsepower by running tens of thousands of threads concurrently. The massively parallel execution model has been effective at hiding the long latency of off-chip memory accesses in graphics and other general computing applications that exhibit regular memory behavior. With the fast-growing demand for general-purpose computing on GPUs (GPGPU), GPU workloads are becoming highly diversified and thus require a synergistic coordination of both computing and memory resources to unleash the computing power of GPUs. Accordingly, recent graphics processors have begun to integrate an on-die level-2 (L2) cache. The huge number of threads on GPUs, however, poses significant challenges to L2 cache design. Experiments on a variety of GPGPU applications reveal that the L2 cache may or may not improve overall performance, depending on the characteristics of the application. In this paper, we propose efficient techniques to improve GPGPU performance by orchestrating both the L2 cache and memory in a unified framework. The basic philosophy is to exploit the temporal locality among the massive number of concurrent memory requests and to minimize the impact of memory divergence among simultaneously executed groups of threads. Our major contributions are twofold. First, a priority-based cache management scheme is proposed to maximize the chance that frequently revisited data is kept in the cache. Second, an effective memory scheduling scheme is introduced that reorders memory requests in the memory controller according to their divergence behavior to reduce the average waiting time of warps. Simulation results reveal that our techniques enhance overall performance by 10% on average for memory-intensive benchmarks, with a maximum gain of up to 30%.
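As a minimal illustration of why reordering requests by divergence can reduce the average waiting time of warps, the Python sketch below assumes that every memory request takes one time unit and that a warp resumes only after its last outstanding request has been served. The shortest-warp-first rule and the names `average_warp_wait` and `divergence_aware_order` are illustrative assumptions, not the scheduling policy evaluated in the abstract above.

```python
def average_warp_wait(order):
    """Average completion time over warps.

    `order` lists one warp id per memory request, in service order.
    Assumption: each request takes one time unit, and a warp resumes
    when its last request has been served.
    """
    finish = {}
    for t, warp in enumerate(order, start=1):
        finish[warp] = t  # later requests overwrite earlier finish times
    return sum(finish.values()) / len(finish)


def divergence_aware_order(queue):
    """Group requests by warp and serve the least divergent warps first.

    Hypothetical policy: a warp's divergence is approximated by its
    number of pending requests in the queue.
    """
    counts = {}
    for warp in queue:
        counts[warp] = counts.get(warp, 0) + 1
    # sorted() is stable, so warps with equal counts keep arrival order.
    warps = sorted(counts, key=lambda w: counts[w])
    return [w for w in warps for _ in range(counts[w])]
```

For the queue `[0, 1, 0, 2, 0, 0]`, serving in arrival order gives an average wait of 4.0 time units, whereas the reordered queue `[1, 2, 0, 0, 0, 0]` gives 3.0, because the two single-request warps no longer wait behind the divergent warp.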
Modern graphics processing units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight into designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA's CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices, including the choice of interconnect topology, use of caches, design of the memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that, for the applications we study, performance is more sensitive to interconnect bisection bandwidth than to latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.
This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley's dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
We consider the problem of how to improve memory latency tolerance in massively multithreaded GPGPUs when the thread-level parallelism of an application is not sufficient to hide memory latency. One solution used in conventional CPU systems is prefetching, both in hardware and software. However, we show that straightforwardly applying such mechanisms to GPGPU systems does not deliver the expected performance benefits and can in fact hurt performance when not used judiciously. This paper proposes new hardware and software prefetching mechanisms tailored to GPGPU systems, which we refer to as many-thread aware prefetching (MT-prefetching) mechanisms. Our software MT-prefetching mechanism, called inter-thread prefetching, exploits the existence of common memory access behavior among fine-grained threads. For hardware MT-prefetching, we describe a scalable prefetcher training algorithm along with a hardware-based inter-thread prefetching mechanism. In some cases, blindly applying prefetching degrades performance. To reduce such negative effects, we propose an adaptive prefetch throttling scheme, which permits automatic GPGPU application- and hardware-specific adjustment. We show that adaptation reduces the negative effects of prefetching and can even improve performance. Overall, compared to state-of-the-art software and hardware prefetching, our MT-prefetching improves performance on our benchmarks by 16% (software prefetching) and 15% (hardware prefetching) on average.
Practical cache replacement policies attempt to emulate optimal replacement by predicting the re-reference interval of a cache block. The commonly used LRU replacement policy always predicts a near-immediate re-reference interval on cache hits and misses. Applications that exhibit a distant re-reference interval perform badly under LRU. Such applications usually have a working set larger than the cache or have frequent bursts of references to non-temporal data (called scans). To improve the performance of such workloads, this paper proposes cache replacement using Re-reference Interval Prediction (RRIP). We propose Static RRIP (SRRIP), which is scan-resistant, and Dynamic RRIP (DRRIP), which is both scan-resistant and thrash-resistant. Both RRIP policies require only 2 bits per cache block and easily integrate into existing LRU approximations found in modern processors. Our evaluations using PC games, multimedia, server, and SPEC CPU2006 workloads on a single-core processor with a 2MB last-level cache (LLC) show that SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 4% and 10% respectively. Our evaluations with over 1000 multi-programmed workloads on a 4-core CMP with an 8MB shared LLC show that SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 7% and 9% respectively. We also show that RRIP outperforms LFU, the state-of-the-art scan-resistant replacement algorithm to date. For the cache configurations under study, RRIP requires 2X less hardware than LRU and 2.5X less hardware than LFU.
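The behavior of SRRIP on a single cache set can be sketched in Python as follows. The abstract states only that each block carries 2 bits; the specific choices below (reset the value to 0 on a hit, insert new blocks at 2, evict a block at 3, and age all blocks when none is at 3) follow the commonly described hit-priority variant and should be read as assumptions, as should the names `SRRIPSet` and `lru_hits`.

```python
class SRRIPSet:
    """One cache set with a 2-bit re-reference prediction value (RRPV) per block."""

    MAX_RRPV = 3  # 0 = near-immediate re-reference, 3 = distant re-reference

    def __init__(self, ways):
        self.tags = [None] * ways
        self.rrpv = [self.MAX_RRPV] * ways  # empty ways are the first victims

    def access(self, tag):
        """Return True on a hit; on a miss, insert the block and return False."""
        if tag in self.tags:
            self.rrpv[self.tags.index(tag)] = 0  # hit: predict near-immediate reuse
            return True
        # Miss: age all blocks until at least one predicts a distant re-reference.
        while self.MAX_RRPV not in self.rrpv:
            self.rrpv = [v + 1 for v in self.rrpv]
        victim = self.rrpv.index(self.MAX_RRPV)
        self.tags[victim] = tag
        self.rrpv[victim] = self.MAX_RRPV - 1  # insert with a long interval
        return False


def lru_hits(ways, trace):
    """Count hits for a plain LRU set, for comparison."""
    stack, hits = [], 0  # least recently used block sits at index 0
    for tag in trace:
        if tag in stack:
            hits += 1
            stack.remove(tag)
        elif len(stack) == ways:
            stack.pop(0)
        stack.append(tag)
    return hits
```

On a 4-way set with the trace `A B A B s1 s2 s3 s4 A B`, where `s1` to `s4` form a scan, this sketch keeps `A` and `B` resident and scores 4 hits, while the LRU model evicts both during the scan and scores 2.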
Hybrid main memory with both DRAM and emerging non-volatile memory (NVM) has become a promising solution for high-performance and energy-efficient embedded systems. The cache plays an important role and strongly affects the number of write-backs to NVM and DRAM blocks. However, existing cache policies fail to fully address the significant asymmetry between NVM operations (especially writes) and DRAM operations, leading to non-optimal system designs. We propose a write-back aware last-level cache management scheme for hybrid main memory, which improves the cache hit ratio of NVM memory blocks and minimizes write-backs to NVM. Experimental results show that our proposed framework leads to better performance and energy savings compared with the state-of-the-art cache management scheme for hybrid main memory architectures.
Data-center servers require large-capacity main memory to run multiple workloads simultaneously. However, the scalability and power consumption of DRAM limit its ability to provide large-capacity memory. Emerging non-volatile memories (NVMs), such as PCM and STT-RAM, provide better scalability and lower leakage power than DRAM. In particular, hybrid memory consisting of DRAM and NVM can exploit the advantages of the different memory media. However, NVMs have a few drawbacks, such as relatively longer read and write latency. A cache miss at the shared last-level cache (LLC) suffers from longer latency if the missing data resides in NVM. Current LLC policies manage the cache space without being aware of the underlying heterogeneous media. This degrades cache performance if a large share of the missing data comes from NVM. Taking the asymmetric cache miss cost into account, we first propose a new performance metric, TMPKI, which accurately reflects LLC performance on top of hybrid memories. Then we propose a hybrid memory aware cache partitioning technique (HAP) to dynamically adjust the cache space for DRAM and NVM data based on TMPKI. Experimental results show that HAP improves performance over the traditional LRU policy by up to 54.3% (19.6% on average) while incurring only a small storage overhead (0.2%).
GPUs employ massive multithreading and fast context switching to provide high throughput and hide memory latency. Multithreading can, however, increase contention for various system resources, which may result in suboptimal utilization of shared resources. Previous research has proposed variants of throttling thread-level parallelism to reduce cache contention and improve performance. Throttling approaches can, however, lead to under-utilizing thread contexts, on-chip interconnect, and off-chip memory bandwidth. This paper proposes to tightly couple the thread scheduling mechanism with the cache management algorithms such that GPU cache pollution is minimized while off-chip memory throughput is enhanced. We propose priority-based cache allocation (PCAL), which provides preferential cache capacity to a subset of high-priority threads while simultaneously allowing lower-priority threads to execute without contending for the cache. By tuning thread-level parallelism while optimizing both caching efficiency and other shared resource usage, PCAL builds upon previous thread throttling approaches, improving overall performance by 17% on average, with a maximum of 51%.
The massively parallel architecture enables graphics processing units (GPUs) to boost performance for a wide range of applications. Initially, GPUs employed only scratchpad memory as on-chip memory. Recently, to broaden the scope of applications that can be accelerated by GPUs, GPU vendors have used caches in conjunction with scratchpad memory as on-chip memory in the new generations of GPUs. Unfortunately, GPU caches face many performance challenges that arise from excessive thread contention for cache resources. Cache bypassing, where memory requests can selectively bypass the cache, is one solution that can help to mitigate the cache resource contention problem. In this paper, we propose coordinated static and dynamic cache bypassing to improve application performance. At compile time, we identify through profiling the global loads that indicate strong preferences for caching or bypassing. For the remaining global loads, our dynamic cache bypassing has the flexibility to cache only a fraction of threads. In the CUDA programming model, the threads are divided into work units called thread blocks. Our dynamic bypassing technique modulates the ratio of thread blocks that cache or bypass at run time. We choose to modulate at the thread block level in order to avoid memory divergence problems. Our approach combines compile-time analysis that determines the cache or bypass preferences for global loads with run-time management that adjusts the ratio of thread blocks that cache or bypass. Our coordinated static and dynamic cache bypassing technique achieves up to 2.28X (average 1.32X) performance speedup for a variety of GPU applications.
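The thread-block-level decision described above can be illustrated with a short Python sketch. It assumes, hypothetically, that the run-time ratio is expressed as a number of caching blocks out of every `group_size` consecutive thread blocks; the function name and both parameters are illustrative and do not come from the abstract. Because the decision depends only on the block index, all threads of a block (and therefore all threads of a warp) agree on whether to cache or bypass.

```python
def block_uses_cache(block_id, cached_per_group, group_size=8):
    """Return True if loads issued by this thread block should use the cache.

    Assumption: out of each group of `group_size` consecutive thread blocks,
    the first `cached_per_group` blocks cache and the remaining ones bypass.
    """
    return block_id % group_size < cached_per_group


# Example: let 3 of every 8 thread blocks use the cache.
decisions = [block_uses_cache(b, cached_per_group=3) for b in range(16)]
```

A run-time controller would raise or lower `cached_per_group` based on observed performance; how that adjustment is made is not specified in the abstract and is left out of this sketch.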
Phase-change memory (PCM) has many advantages compared to conventional DRAM, including nonvolatility, low static energy consumption, and high reliability. Its drawbacks include limited write-endurance and higher energy consumption for write operations. We consider a hybrid main memory consisting of PCM and DRAM at the same level of the memory hierarchy, and address the problem of partitioning and allocating data variables in a multitasking system to either memory type to minimize the energy consumption in the hyper-period while respecting the user-specified upper bounds on CPU utilization. We present an optimal integer linear programming (ILP) formulation and a heuristic algorithm with polynomial time complexity.
Phase change memory (PCM) has emerged as one of the most promising technologies to incorporate into the memory hierarchy of future computer systems. However, PCM has two critical weaknesses that prevent it from replacing DRAM entirely. First, the number of write operations allowed to each PCM cell is limited. Second, write access to PCM is about 6-10 times slower than to DRAM. To cope with this situation, hybrid memory architectures that use a small amount of DRAM together with PCM have been suggested. This paper presents a new memory management technique for hybrid PCM and DRAM memory architectures that efficiently hides the slow write performance of PCM. Specifically, we aim to estimate future write references accurately and then absorb frequent memory writes into DRAM. To do this, we analyze the characteristics of memory write references and find two noticeable phenomena. First, using write history alone performs better than using both read and write history in estimating future write references. Second, the frequency characteristic is a better estimator than temporal locality in predicting future memory writes. Based on these two observations, we present a new page replacement algorithm called CLOCK-DWF that significantly reduces the number of writes that occur on PCM and also increases the lifespan of PCM memory.
While GPUs are designed to hide memory latency with massive multi-threading, the tremendous demands for memory bandwidth and power consumption constrain the scaling of system performance. In this paper, we propose a hybrid graphics memory architecture with different memory technologies (DRAM, STT-RAM, and RRAM) to improve the memory bandwidth and reduce the power consumption. In addition, we present an adaptive data migration mechanism that exploits various memory access patterns of GPGPU applications for further memory power reduction. We evaluate our design with a set of multi-threaded GPU workloads. Compared to traditional GDDR5 memory, our design reduces GPU system power by 16% and improves system throughput and energy efficiency by 12% and 33% respectively.
GPUs have been used to accelerate many regular applications and, more recently, irregular applications in which the control flow and memory access patterns are data-dependent and statically unpredictable. This paper defines two measures of irregularity called control-flow irregularity and memory-access irregularity, and investigates, using performance-counter measurements, how irregular GPU kernels differ from regular kernels with respect to these measures. For a suite of 13 benchmarks, we find that (i) irregularity at the warp level varies widely, (ii) control-flow irregularity and memory-access irregularity are largely independent of each other, and (iii) most kernels, including regular ones, exhibit some irregularity. A program's irregularity can change between different inputs, systems, and arithmetic precision but generally stays in a specific region of the irregularity space. Whereas some highly tuned implementations of irregular algorithms exhibit little irregularity, trading off extra irregularity for better locality or less work can improve overall performance.
Hybrid memory designs, such as DRAM plus Phase Change Memory (PCM), have shown some promise for alleviating power and density issues faced by traditional memory systems. But previous studies have concentrated on CPU systems with a modest level of parallelism. This work studies the problem in a massively parallel setting. Specifically, it investigates the special implications for hybrid memory imposed by the massive parallelism in GPUs. It empirically shows that, contrary to the promising results demonstrated for CPUs, previous designs of PCM-based hybrid memory result in significant degradation of the energy efficiency of GPUs. It reveals that the fundamental reason is a multifaceted mismatch between those designs and the massive parallelism in GPUs. It presents a solution that centers around a close cooperation between compiler-directed data placement and hardware-assisted runtime adaptation. The co-design approach helps tap into the full potential of hybrid memory for GPUs without requiring dramatic hardware changes over previous designs, yielding 6% and 49% energy savings on average compared to pure DRAM and pure PCM respectively, while keeping performance loss below 2%.
In recent years, non-volatile memory (NVM) technologies have emerged as candidates for future universal memory. NVMs generally have advantages such as low leakage power, high density, and fast read speed. At the same time, NVMs also have disadvantages. For example, NVMs often have asymmetric read and write speeds and energy costs, which poses new challenges when applying NVMs. This paper contains a collection of four contributions, presenting a basic introduction to three emerging NVM technologies, their unique characteristics, potential challenges, and new opportunities that they may bring forward in memory systems.
Phase-Change Memory (PCM) technology has received substantial attention recently. Because PCM is byte-addressable and exhibits access times in the nanosecond range, it can be used in main memory designs. In fact, PCM has higher density and lower idle power consumption than DRAM. Unfortunately, PCM is also slower than DRAM and has limited endurance. For these reasons, researchers have proposed memory systems that combine a small amount of DRAM and a large amount of PCM. In this paper, we propose a new hybrid design that features a hardware-driven page placement policy. The policy relies on the memory controller (MC) to monitor access patterns, migrate pages between DRAM and PCM, and translate the memory addresses coming from the cores. Periodically, the operating system updates its page mappings based on the translation information used by the MC. Detailed simulations of 27 workloads show that our system is more robust and exhibits lower energy-delay² than state-of-the-art hybrid systems.
This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance from additional cache resources. It is beneficial for performance to invest cache resources in the application that benefits more from the cache resources rather than in the application that has more demand for the cache resources. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective, hardware circuit that requires less than 2kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves performance of a dual-core system by up to 23% and on average 11% over LRU-based cache partitioning.
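The idea of allocating cache by utility rather than by demand can be sketched in Python as below. The sketch assumes that a monitor supplies, for each application, a hypothetical miss curve giving its miss count for every possible number of ways, and it uses a simple greedy rule that hands out one way at a time. The function name, the input format, and the greedy rule are assumptions for illustration and are not necessarily the partitioning algorithm used in the paper.

```python
def partition_ways(miss_curves, total_ways):
    """Greedily split cache ways among applications by marginal utility.

    miss_curves[i][w] is the (hypothetical) miss count of application i
    when it owns w ways, for w = 0..total_ways.
    """
    alloc = [1] * len(miss_curves)  # every application keeps at least one way
    for _ in range(total_ways - sum(alloc)):
        # Utility of one more way = misses avoided by growing from w to w + 1.
        gains = [curve[w] - curve[w + 1] for curve, w in zip(miss_curves, alloc)]
        winner = gains.index(max(gains))  # ties go to the lower-numbered application
        alloc[winner] += 1
    return alloc


# A streaming application (no reuse) next to a cache-friendly one, 8 ways total.
streaming = [100] * 9
friendly = [90, 80, 60, 40, 25, 15, 10, 8, 8]
allocation = partition_ways([streaming, friendly], total_ways=8)
```

Here the streaming application generates many misses, yet extra ways never reduce them, so the greedy rule gives it the minimum of one way and assigns the remaining seven to the cache-friendly application.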
Many multi-core processors employ a large last-level cache (LLC) shared among the multiple cores. Past research has demonstrated that sharing-oblivious cache management policies (e.g., LRU) can lead to poor performance and fairness when the multiple cores compete for the limited LLC capacity. Different memory access patterns can cause cache contention in different ways, and various techniques have been proposed to target some of these behaviors. In this work, we propose a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism. By handling multiple types of memory behaviors, our proposed technique outperforms techniques that target only either capacity partitioning or adaptive insertion.
Y. Liang, X. Xie, G. Sun, and D. Chen, "An Efficient Compiler Framework for Cache Bypassing on GPUs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1677-1690, 2015.
N. B. Lakshminarayana, "Many-thread aware prefetching mechanisms for GPGPU applications."
D. Li, M. Rhu, D. R. Johnson, M. O'Connor, M. Erez, D. Burger, D. S. Fussell, and S. W. Keckler, "Priority-based cache allocation in throughput processors," International Symposium on High Performance Computer Architecture, Feb. 2015, pp. 89-100.
X. Xie, Y. Liang, Y. Wang, G. Sun, and T. Wang, "Coordinated static and dynamic cache bypassing for GPUs," International Symposium on High Performance Computer Architecture, Feb. 2015, pp. 76-88.