Conference Paper

An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth

Sch. of Electr. & Comput. Eng., Georgia Inst. of Technol., Atlanta, GA, USA
DOI: 10.1109/HPCA.2010.5416628
Conference: High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on
Source: IEEE Xplore

ABSTRACT Memory bandwidth has become a major performance bottleneck as ever more cores are integrated onto a single die, demanding ever more data from system memory. Several prior studies have demonstrated that this memory bandwidth problem can be addressed by employing a 3D-stacked memory architecture, which provides a wide, high-frequency memory-bus interface. Although previous 3D proposals already provide as much bandwidth as a traditional L2 cache can consume, the dense through-silicon vias (TSVs) of 3D chip stacks can provide still more bandwidth. In this paper, we contend that the memory hierarchy, including the L2 cache and DRAM interface, must be re-architected so that it can take full advantage of this massive bandwidth. Our technique, SMART-3D, is a new 3D-stacked memory architecture with a vertical L2 fetch/write-back network using a large array of TSVs. Simply stated, we leverage the TSV bandwidth to hide latency behind very large data transfers. We analyze the design trade-offs for the DRAM arrays, taking care to avoid compromising DRAM density because of TSV placement. Moreover, we propose an efficient mechanism to manage the false sharing problem when implementing SMART-3D in a multi-socket system. For single-threaded memory-intensive applications, the SMART-3D architecture achieves speedups from 1.53 to 2.14 over planar designs and from 1.27 to 1.72 over prior 3D designs. We achieve similar speedups for multi-program and multi-threaded workloads on multi-core and multi-socket processors. Furthermore, SMART-3D can even lower the energy consumption in the L2 cache and 3D DRAM because it reduces the total number of row buffer misses.

  • ABSTRACT: Energy efficiency is the major optimization criterion for systems-on-chip (SoCs) for mobile devices (smartphones and tablets). Through-silicon via (TSV) technology enables 3-D integration of dies and the heterogeneous stacking of multiple memory or logic layers, allowing increased bandwidth and lower energy consumption of the memory interface compared to traditional approaches. In this paper, we explore the 3-D-DRAM architecture design space. The result is an optimized 2 Gb 3-D-DRAM, which shows an 83% lower energy/bit than a 2 Gb device. Furthermore, we propose a highly energy-efficient DRAM subsystem for next-generation 3-D-integrated SoCs, consisting of an SDR/DDR 3-D-DRAM controller and an attached 3-D-DRAM cube with fine-grained access and a flexible (WIDE-IO) interface. We assess the energy efficiency using a synthesizable model of the SDR/DDR 3-D-DRAM channel controller (CC) as well as functional models of the 3-D-stacked DRAM, including an accurate power estimation engine. We also investigate different DRAM families (WIDE IO SDR/DDR, LPDDR, and LPDDR2) and densities from 256 Mb to 4 Gb per channel. The implementation results of the proposed 3-D-DRAM subsystem show that energy-optimized accesses to the 3-D-DRAM enable up to 50% energy savings compared to standard accesses. To the best of our knowledge, this is the first design space exploration for 3-D-stacked DRAM considering different technologies based on real-world physical data and the first design of a 3-D-DRAM CC and 3-D-DRAM model featuring co-optimization of memory and controller architecture.
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 01/2013; 32(4):597-610. · 1.09 Impact Factor
  • ABSTRACT: A 3D architecture with DRAM memory stacked on a multi-core processor has many benefits for embedded systems. Compared with a conventional 2D design, it reduces memory access latency, increases memory bandwidth, and reduces energy consumption. However, it poses a thermal challenge, as the heat generated by the processor cannot dissipate efficiently through the DRAM memory layer. Because DRAM is very sensitive to high temperature as well as temperature variance, 3D stacking causes more failures to occur: DRAM thermal variance is higher than in the conventional 2D architecture. To address this thermal challenge, we propose to reduce the temperature variance and peak temperature of a 3D multi-core processor and stacked DRAM by thermally aware thread migration among processor cores. This method has very limited impact on processor performance. Using a migration-based policy, we reduce peak steady-state temperature in the processor by up to 8.3 degrees Celsius, with an average of 4.7 degrees.
    Quality Electronic Design (ISQED), 2013 14th International Symposium on; 01/2013
  • ABSTRACT: This paper presents an innovative memory management approach to utilize both 3D-DRAM and external DRAM (ex-DRAM). Our approach dynamically allocates and relocates memory blocks between the 3D-DRAM and the ex-DRAM to exploit the high memory bandwidth and the low memory latency of the 3D-DRAM as well as the high capacity and the low cost of the ex-DRAM. Our simulation shows that in workloads that are not memory intensive, our memory management technique transfers all active memory blocks to the 3D-DRAM, which runs faster than the ex-DRAM. In memory-intensive workloads, our memory management technique utilizes both the 3D-DRAM and the ex-DRAM to increase the memory bandwidth and alleviate bandwidth congestion. Our approach supports Quality of Service (QoS) for “latency sensitive”, “bandwidth sensitive”, and “insensitive” applications. To improve performance and satisfy a certain level of QoS, memory blocks of different application types are allocated differently. Compared to the scratchpad memory management mechanism, the average memory access latency of our approach decreases by 19% and 23%, while performance improves by up to 5% and 12%, in single-threaded and multi-threaded benchmarks respectively. Moreover, using our approach, applications do not need to manage memory explicitly as in the scratchpad case. Our memory block relocation comes with negligible performance overhead, particularly for applications that have high spatial memory locality.

