Conference Paper

An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth

Sch. of Electr. & Comput. Eng., Georgia Inst. of Technol., Atlanta, GA, USA
DOI: 10.1109/HPCA.2010.5416628 Conference: High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on
Source: IEEE Xplore

ABSTRACT Memory bandwidth has become a major performance bottleneck as more and more cores are integrated onto a single die, demanding more and more data from the system memory. Several prior studies have demonstrated that this memory bandwidth problem can be addressed by employing a 3D-stacked memory architecture, which provides a wide, high frequency memory-bus interface. Although previous 3D proposals already provide as much bandwidth as a traditional L2 cache can consume, the dense through-silicon-vias (TSVs) of 3D chip stacks can provide still more bandwidth. In this paper, we contest that we need to re-architect our memory hierarchy, including the L2 cache and DRAM interface, so that it can take full advantage of this massive bandwidth. Our technique, SMART-3D, is a new 3D-stacked memory architecture with a vertical L2 fetch/write-back network using a large array of TSVs. Simply stated, we leverage the TSV bandwidth to hide latency behind very large data transfers. We analyze the design trade-offs for the DRAM arrays, careful enough to avoid compromising the DRAM density because of TSV placement. Moreover, we propose an efficient mechanism to manage the false sharing problem when implementing SMART-3D in a multi-socket system. For single-threaded memory-intensive applications, the SMART-3D architecture achieves speedups from 1.53 to 2.14 over planar designs and from 1.27 to 1.72 over prior 3D designs. We achieve similar speedups for multi-program and multi-threaded workloads on multi-core and multi-socket processors. Furthermore, SMART-3D can even lower the energy consumption in the L2 cache and 3D DRAM for it reduces the total number of row buffer misses.

  • Source
    • "T HE development of exa-flop-scale high-performance data center for cloud computing has imposed the need of tera-flop-scale high performance data server with hundreds of processing cores integrated on a single chip [1], [2], [3]. 3D integration [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18] is one of the promising solutions for integration of many-core microprocessors with memory. However, such a high density integration in 3D can introduce severe power and thermal issues, which may significantly affect the system performance and reliability. "
    IEEE Transactions on Computers 01/2015; DOI:10.1109/TC.2015.2389827 · 1.47 Impact Factor
  • Source
    • "One of the first solutions implementing multiple memory layers on top of processors was presented by Kgil [15], modeling a web server as a CMP built of four DRAM layers stacked on top of a processing die hosting up to eight parallel cores. Other CMPs have been designed in later years exploiting multiple 3-D-DRAM layers [16], [17]; these solutions showed the possibility to reorganize modules and interconnections in order to have a significant bandwidth increase, resulting in a relevant speedup in the routine execution. Loh's [18] solution demonstrated an achievable speed-up of 280% with respect to the baseline CMP (an Intel QuadCore) connected to off-chip DRAM. "
    [Show abstract] [Hide abstract]
    ABSTRACT: An innovative modular 3-D stacked multi-processor architecture is presented. The platform is composed of completely identical stacked dies connected together by through-silicon-vias (TSVs). Each die features four 32-bit embedded processors and associated memory modules, interconnected by a 3-D network-on-chip (NoC), which can route packets in the vertical direction. Superimposing identical planar dies minimizes design effort and manufacturing costs, ensuring at the same time high flexibility and reconfigurability. A single die can be used either as a fully testable standalone chip multi-processor (CMP), or integrated in a 3-D stack, increasing the overall core count and consequently the system performance. To demonstrate the feasibility of this architecture, fully functional samples have been fabricated using a conventional UMC 90 nm complementary metal–oxide–semiconductor process and stacked using an in-house, via-last Cu-TSV process. Initial results show that the proposed 3-D-CMP is capable of operating at a target frequency of 400 MHz, supporting a vertical data bandwidth of 3.2 Gb/s.
    06/2012; 2(2):295-306. DOI:10.1109/JETCAS.2012.2193837
  • Source
    • "We set up three types of sample chips (small/medium/large) based on TSV count, that is, thousands [3]/tens of thousands [17]/hundreds of thousands [7] of TSVs. We group TSVs as bundles (16 × 16 for small and medium chip and 32 × 32 for large chip) and place them randomly on the chip. "
    Proc. IEEE/ACM Design, Automation, and Test in Europe; 01/2012
Show more


Available from