Conference Paper

An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth

Sch. of Electr. & Comput. Eng., Georgia Inst. of Technol., Atlanta, GA, USA
DOI: 10.1109/HPCA.2010.5416628
Conference: High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on
Source: IEEE Xplore

ABSTRACT Memory bandwidth has become a major performance bottleneck as more and more cores are integrated onto a single die, demanding ever more data from system memory. Several prior studies have demonstrated that this memory bandwidth problem can be addressed by employing a 3D-stacked memory architecture, which provides a wide, high-frequency memory-bus interface. Although previous 3D proposals already provide as much bandwidth as a traditional L2 cache can consume, the dense through-silicon vias (TSVs) of 3D chip stacks can provide still more bandwidth. In this paper, we contend that the memory hierarchy, including the L2 cache and DRAM interface, needs to be re-architected so that it can take full advantage of this massive bandwidth. Our technique, SMART-3D, is a new 3D-stacked memory architecture with a vertical L2 fetch/write-back network using a large array of TSVs. Simply stated, we leverage the TSV bandwidth to hide latency behind very large data transfers. We analyze the design trade-offs for the DRAM arrays, taking care to avoid compromising DRAM density because of TSV placement. Moreover, we propose an efficient mechanism to manage the false sharing problem when implementing SMART-3D in a multi-socket system. For single-threaded memory-intensive applications, the SMART-3D architecture achieves speedups from 1.53 to 2.14 over planar designs and from 1.27 to 1.72 over prior 3D designs. We achieve similar speedups for multi-program and multi-threaded workloads on multi-core and multi-socket processors. Furthermore, SMART-3D can even lower the energy consumption in the L2 cache and 3D DRAM because it reduces the total number of row buffer misses.
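
As a rough illustration of the bandwidth argument in the abstract, the sketch below counts how many bus beats a transfer needs at a given bus width. The function name, the bus widths, and the transfer sizes are hypothetical values chosen for illustration and are not taken from the paper; the model also ignores DRAM access latency, clock frequency, and protocol overhead.

```python
def transfer_beats(payload_bytes: int, bus_width_bits: int) -> int:
    """Return the number of bus beats needed to move payload_bytes
    over a bus that carries bus_width_bits per beat (rounded up)."""
    if payload_bytes <= 0 or bus_width_bits <= 0:
        raise ValueError("payload and bus width must be positive")
    payload_bits = payload_bytes * 8
    # Integer ceiling division: a partial final beat still costs a full beat.
    return -(-payload_bits // bus_width_bits)


if __name__ == "__main__":
    # Hypothetical configurations, for illustration only.
    cases = [
        ("64 B line over a 64-bit bus", 64, 64),
        ("4 KiB block over a 64-bit bus", 4096, 64),
        ("4 KiB block over a 32768-bit TSV array", 4096, 32768),
    ]
    for label, payload, width in cases:
        print(f"{label}: {transfer_beats(payload, width)} beat(s)")
```

Under these assumed numbers, a very large transfer over a narrow bus takes hundreds of beats, while the same transfer over a sufficiently wide vertical interface completes in one, which is the sense in which a wide TSV array can make large transfers cheap.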

  • ABSTRACT: DRAM is a major contributor to the total power consumption of modern computing systems. Consequently, reducing DRAM power is critical to improving system-level power efficiency. Fine-grained DRAM architectures [1, 2] have been proposed to reduce activation/precharge power. However, that prior work either incurs significant performance degradation or introduces large area overhead. In this paper, we propose a novel memory architecture, Half-DRAM, in which the DRAM array is reorganized so that only half of a row is activated. Half-row activation effectively reduces activation power while sustaining the full bandwidth one bank can provide. In addition, half-row activation in Half-DRAM relaxes the power constraint in DRAM and opens up opportunities for further performance gain. Furthermore, two half-row accesses can be issued in parallel by integrating sub-array-level parallelism, which improves memory-level parallelism. The experimental results show that Half-DRAM can achieve both significant performance improvement and power reduction, with negligible design overhead.
    2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA); 06/2014
  • ABSTRACT: Die-stacking technology is opening up new options for memory system design. Through-silicon vias (TSVs) provide a configurable interface between memory and the processing unit and allow high bandwidth. System performance can be increased significantly by a sophisticated DRAM architecture design. This paper presents a framework that provides a memory design recommendation based on application execution data. The analysis approach can be adapted for different system configurations and applications. In this work, a single-core execution of the JPEG2000 algorithm is analyzed with the tool for different picture sizes.
    2014 3rd Mediterranean Conference on Embedded Computing (MECO); 06/2014
  • ABSTRACT: The lack of accurate yet publicly available simulation infrastructure has puzzled researchers in the memcomputing area for some time. In this paper, we propose for the first time a full tool chain, called MSim, that supports cycle-accurate, microarchitecture-level simulation for memcomputing studies. With MSim, the performance gains of utilizing memcomputing for arbitrary applications on user-configurable computer system architectures can be evaluated with high accuracy. In addition, MSim provides flexible interfaces with a pervasive object-oriented design, which makes it well suited as a base platform for researchers to explore new memcomputing technologies.
    Design Automation and Test in Europe; 01/2014

