H.-H.S. Lee

Georgia Institute of Technology, Atlanta, Georgia, United States

Publications (28) · 17.21 Total impact

  •
    ABSTRACT: As scaling DRAM cells becomes more challenging and energy-efficient DRAM chips are in high demand, the DRAM industry has started to undertake an alternative approach to address these looming issues: vertically stacking DRAM dies with through-silicon vias (TSVs) using 3-D IC technology. This emerging integration technology also makes heterogeneous die stacking in one DRAM package possible. Such a heterogeneous DRAM chip provides a unique, promising opportunity for computer architects to contemplate a new memory hierarchy for future system design. In this paper, we study how to design such a heterogeneous DRAM chip to improve both performance and energy efficiency. In particular, we found that simple stacking alone cannot address the majority of traditional SRAM row cache design issues when an SRAM row cache is built into a DRAM chip. To address these issues, we propose a novel floorplan and several architectural techniques that fully exploit the benefits of 3-D stacking technology. Our multi-core simulation results with memory-intensive applications suggest that, by tightly integrating a small row cache with its corresponding DRAM array, we can improve performance by 30% while reducing dynamic energy by 31%.
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems 01/2013; 21(1):1-13. · 1.22 Impact Factor
  • Sungkap Yeo, H.-H.S. Lee
    ABSTRACT: To optimize datacenter energy efficiency, SimWare analyzes the power consumption of servers, cooling units, and fans as well as the effects of heat recirculation and air supply timing. Experiments using SimWare show a high loss of cooling efficiency due to the nonuniform inlet air temperature distribution across servers.
    Computer 01/2012; 45(9):48-55. · 1.68 Impact Factor
  • Xiaodong Wang, D. Vasudevan, H.-H.S. Lee
    ABSTRACT: 3D integration is a promising technology that provides high memory bandwidth, reduced power, shortened latency, and a smaller form factor. Among the many issues in 3D IC design and production, testing remains one of the major challenges. This paper introduces a new design-for-test technique called 3D-GESP, an efficient built-in self-repair (BISR) algorithm that fulfills the test and reliability needs of 3D-stacked memories. Instead of the local testing and redundancy allocation employed by most current BISR techniques, we introduce a global 3D BISR scheme that not only enables redundancy sharing but also parallelizes the BISR procedure across all stacked layers of a 3D memory. Our simulation results show that the proposed technique significantly increases the memory repair rate and reduces test time.
    3D Systems Integration Conference (3DIC), 2011 IEEE International; 01/2012
  •
    ABSTRACT: This paper presents Ally, a server platform architecture that supports compute-intensive management services on multi-core processors. Ally introduces simple hardware mechanisms to sequester cores to run a separate software environment dedicated to management tasks, including packet processing software appliances (e.g., for Deep Packet Inspection, DPI), with efficient mechanisms to safely and transparently intercept network packets. Ally enables distributed deployment of compute-intensive management services throughout a data center. Importantly, it uniquely allows these services to be deployed independently of the arbitrary OSs and/or hypervisors that users may choose to run on the remaining cores, with hardware isolation preventing the host environment from tampering with the management environment. Experiments using full-system emulation and a Linux-based prototype validate Ally's functionality and demonstrate low-overhead packet interception; for example, using Ally to host the well-known Snort packet inspection software incurs less overhead than deploying Snort as a Xen virtual machine appliance, resulting in up to 2× higher throughput for some workloads.
    Architectures for Networking and Communications Systems (ANCS), 2011 Seventh ACM/IEEE Symposium on; 11/2011
  •
    ABSTRACT: As DRAM scaling becomes more challenging and energy efficiency becomes a growing concern for data center operation, industry is undertaking an alternative approach to address these looming issues: stacking DRAM dies with through-silicon vias (TSVs) using 3-D integration technology. Furthermore, 3-D technology also enables heterogeneous die stacking within one DRAM package. In this paper, we study how to design such a heterogeneous DRAM chip to improve both performance and energy efficiency. In particular, we propose a novel floorplan and several architectural techniques that fully exploit the benefits of 3-D die stacking when integrating an SRAM row cache into a DRAM chip. Our multi-core simulation results show that, by tightly integrating a small row cache with its corresponding DRAM array, we can improve performance by 30% while saving 31% of dynamic energy for memory-intensive applications.
    Circuits and Systems (MWSCAS), 2011 IEEE 54th International Midwest Symposium on; 09/2011
  • Sungkap Yeo, H.-H.S. Lee
    ABSTRACT: Cloud computing has emerged as a highly cost-effective computation paradigm for IT enterprise applications, scientific computing, and personal data management. Because cloud services are provided by machines of various capabilities, performance, power, and thermal characteristics, it is challenging for providers to understand their cost effectiveness when deploying their systems. This article analyzes a parallelizable task in a heterogeneous cloud infrastructure with mathematical models to evaluate the energy and performance trade-off. As the authors show, to achieve the optimal performance per utility, the slowest node's response time should be no more than three times that of the fastest node. The theoretical analysis presented can be used to guide allocation, deployment, and upgrades of computing nodes for optimizing utility effectiveness in cloud computing services.
    Computer 09/2011; · 1.68 Impact Factor
  •
    ABSTRACT: Memory bandwidth has become a major performance bottleneck as more and more cores are integrated onto a single die, demanding more and more data from the system memory. Several prior studies have demonstrated that this memory bandwidth problem can be addressed by employing a 3D-stacked memory architecture, which provides a wide, high-frequency memory-bus interface. Although previous 3D proposals already provide as much bandwidth as a traditional L2 cache can consume, the dense through-silicon vias (TSVs) of 3D chip stacks can provide still more. In this paper, we contend that the memory hierarchy, including the L2 cache and DRAM interface, must be re-architected to take full advantage of this massive bandwidth. Our technique, SMART-3D, is a new 3D-stacked memory architecture with a vertical L2 fetch/write-back network using a large array of TSVs. Simply stated, we leverage the TSV bandwidth to hide latency behind very large data transfers. We analyze the design trade-offs for the DRAM arrays, taking care not to compromise DRAM density through TSV placement. Moreover, we propose an efficient mechanism to manage the false-sharing problem when implementing SMART-3D in a multi-socket system. For single-threaded memory-intensive applications, SMART-3D achieves speedups from 1.53 to 2.14 over planar designs and from 1.27 to 1.72 over prior 3D designs. We achieve similar speedups for multi-program and multi-threaded workloads on multi-core and multi-socket processors. Furthermore, SMART-3D can even lower energy consumption in the L2 cache and 3D DRAM because it reduces the total number of row buffer misses.
    High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on; 02/2010
  • H.-H.S. Lee, K. Chakrabarty
    ABSTRACT: One of the challenges for 3D technology adoption is the insufficient understanding of 3D testing issues and the lack of DFT solutions. This article describes testing challenges for 3D ICs, including problems that are unique to 3D integration, and summarizes early research results in this area. Researchers are investigating various 3D IC manufacturing processes that are particularly relevant to testing and DFT. In terms of the process and the level of assembly that 3D ICs require, we can broadly classify the techniques as monolithic or as die stacking.
    IEEE Design and Test of Computers 11/2009; · 1.62 Impact Factor
  • D.L. Lewis, H.-H.S. Lee
    ABSTRACT: The first memristor, originally theorized by Dr. Leon Chua in 1971, was identified by a team at HP Labs in 2008. This new fundamental circuit element is unique in that its resistance changes as current passes through it, giving the device a memory of the past system state. The immediately obvious application of such a device is non-volatile memory, wherein high- and low-resistance states store binary values. A memory array of memristors forms what is called a resistive RAM, or RRAM. In this paper, we survey the memristors produced by a number of different research teams and present a point-by-point comparison between DRAM and this new RRAM, based on both existing and expected near-term memristor devices. In particular, we consider the case of a die-stacked 3D memory integrated onto a logic die and evaluate which memory is best suited for the job. While it still suffers a few shortcomings, RRAM proves to be a very interesting design alternative to well-established DRAM technologies.
    3D System Integration, 2009. 3DIC 2009. IEEE International Conference on; 10/2009
  • Dong Hyuk Woo, H.-H.S. Lee
    ABSTRACT: An updated take on Amdahl's analytical model uses modern design constraints to analyze many-core design alternatives. The revised models provide computer architects with a better understanding of many-core design types, enabling them to make more informed tradeoffs.
    Computer 01/2009; · 1.68 Impact Factor
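
The many-core analysis above builds on the classic symmetric formulation of Amdahl's law for multicore chips (Hill and Marty's model), which the article then extends with modern design constraints. The sketch below shows only that well-known baseline, not the article's exact revised model; the `sqrt(r)` core-performance rule and the parameter values are the customary illustrative assumptions:

```python
# Baseline symmetric many-core Amdahl's-law model (Hill-Marty style).
# perf(r) and the sweep parameters are illustrative assumptions.

def perf(r):
    """Assumed sequential performance of a core built from r base-core
    resources, using the common sqrt(r) scaling rule."""
    return r ** 0.5

def symmetric_speedup(f, n, r):
    """Speedup on a chip of n base-core-equivalents grouped into cores of
    size r, for a workload with parallel fraction f."""
    serial = (1 - f) / perf(r)           # serial part runs on one core
    parallel = f * r / (perf(r) * n)     # parallel part uses all n/r cores
    return 1 / (serial + parallel)

# Example: sweep core size for a 90%-parallel workload on a 256-resource chip.
for r in (1, 4, 16, 64):
    print(r, round(symmetric_speedup(0.9, 256, r), 1))
```

Sweeping `r` exposes the trade-off the article analyzes: with a non-zero serial fraction, neither many tiny cores nor one huge core is optimal, and the revised models add energy and other constraints on top of this structure.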
  • C.S. Ballapuram, H.-H.S. Lee
    ABSTRACT: Java platforms are widely deployed, ranging from ultra-mobile embedded devices to servers, for their portability and security. The TLB, a content-addressable memory, can consume significant power in these systems due to the associative nature of its search mechanism. In this paper, we propose and investigate three optimizations of the TLB design, aiming to reduce its power consumption for Java applications running on top of Java virtual machines. Our techniques exploit unique memory reference characteristics of the JVM and its interaction with the Java applications running atop it. Our first technique, the J-iTLB, achieves an average 12.7% energy reduction in the iTLB with around 1% performance improvement by eliminating conflict misses between JVM code and Java application code. The second technique combines the J-iTLB with an object iTLB scheme and achieves 51% energy savings with a small 1% performance impact. Our third technique, a read-write partitioned J-dTLB, shows an average of 34% energy savings in the dTLB with 1% performance impact. Finally, combining the J-iTLB and object iTLB with the J-dTLB yields 42% overall TLB energy savings.
    Embedded Computer Systems: Architectures, Modeling, and Simulation, 2008. SAMOS 2008. International Conference on; 08/2008
  •
    ABSTRACT: To build a future many-core processor, industry must address the challenges of energy consumption and performance scalability. A 3D-integrated broad-purpose accelerator architecture called parallel-on-demand (POD) integrates a specialized SIMD-based die layer on top of a CISC superscalar processor to accelerate a variety of data-parallel applications. It also maintains binary compatibility and facilitates extensibility by virtualizing the acceleration capability.
    IEEE Micro 08/2008; · 2.39 Impact Factor
  •
    ABSTRACT: In this paper, we present a novel design methodology to combat the ever-worsening high-frequency power supply noise (di/dt) in modern microprocessors. Our methodology integrates microarchitectural profiling for noise-aware floorplanning, dynamic runtime noise control to prevent unsustainable noise emergencies, and decap allocation, all to produce a design targeted at the average-case current consumption scenario. The dynamic controller contributes a microarchitectural technique that eliminates occurrences of the worst-case noise scenario; our method can therefore focus on average-case noise behavior.
    Design Automation Conference, 2008. ASPDAC 2008. Asia and South Pacific; 04/2008
  • M. Ghosh, H.-H.S. Lee
    ABSTRACT: This paper proposes virtual exclusion, an architectural technique to reduce leakage energy in the L2 caches of cache-coherent multiprocessor systems. The technique leverages two previously proposed circuit techniques, gated V<sub>dd</sub> and drowsy cache, in a low-cost, easily implementable scheme for cache-coherent multiprocessors. The virtual exclusion scheme saves leakage energy by keeping the data portion of repetitive cache lines powered off in the large higher-level caches while still maintaining multi-level inclusion, an essential property for an efficient implementation of conventional cache coherence protocols. By exploiting the existing state information in the snoop-based cache coherence protocol, our scheme incurs almost no extra hardware overhead. In our experiments, the SPLASH-2 multiprocessor benchmark suite executed correctly under the new virtual exclusion policy and showed up to 72% leakage energy savings in the L2 (on average, 46% for SMP and 35% for multicore) over a baseline drowsy L2 cache.
    Parallel and Distributed Systems, 2007 International Conference on; 01/2008
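
The per-line decision described above is simple to sketch: every L2 tag stays powered so inclusion is preserved, while a line's data array is gated off when the coherence state indicates an L1 already holds the data, and kept in the state-retaining drowsy mode otherwise. The state names and structures below are a hypothetical simplification, not the paper's hardware:

```python
# Minimal sketch of the virtual-exclusion power policy (hypothetical names).
from dataclasses import dataclass

@dataclass
class L2Line:
    tag: int
    cached_in_l1: bool   # derived from snoop-based coherence state

def data_array_mode(line):
    """Pick the leakage-saving mode for one L2 line's data array.
    Tags are always powered, so inclusion checks still work either way."""
    if line.cached_in_l1:
        return "gated-vdd"   # data is redundant with the L1 copy: power off
    return "drowsy"          # no L1 copy: keep data in low-leakage mode

lines = [L2Line(0x1A, True), L2Line(0x2B, False), L2Line(0x3C, True)]
print([data_array_mode(l) for l in lines])
```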
  • E. Fontaine, H.-H.S. Lee
    ABSTRACT: The Katsevich image reconstruction algorithm is the first theoretically exact cone-beam image reconstruction algorithm for a helical scanning path in computed tomography (CT). However, it requires much more computation and memory than other CT algorithms. Fortunately, there are many opportunities for coarse-grained parallelism using multiple threads and fine-grained parallelism using SIMD units that can be exploited by emerging multi-core processors. In this paper, we implemented and optimized Katsevich image reconstruction based on the previously proposed pi-interval and cone-beam cover methods, parallelized them using the OpenMP API and SIMD instructions, and exploited symmetry in the backprojection stage. Our results show that reconstructing a 1024 × 1024 × 1024 image from 5120 projections of 512 × 128 pixels on a dual-socket quad-core system took 23,798 seconds with our baseline and 642 seconds with our final version, a speedup of more than 37×. Furthermore, by parallelizing the code with more threads, we found that scalability is ultimately limited by front-side bus bandwidth.
    Parallel and Distributed Systems, 2007 International Conference on; 01/2008
  • D.L. Lewis, H.-H.S. Lee
    ABSTRACT: Die stacking is a promising new technology that enables integration of devices in the third dimension. Recent research thrusts in 3D-integrated microprocessor design have demonstrated significant improvements in both power consumption and performance. However, this technology is currently being held back by the lack of test technology. Because processor functionality is partitioned across different silicon die layers, only partial circuitry exists on each layer pre-bond. In current 3D manufacturing, layers in the die stack are simply bonded together to form the complete processor; no testing is performed at the pre-bond stage. Such a strategy leads to an exponential decay in the yield of the final product and places an economic limit on the number of dies that can be stacked. To overcome this limit, pre-bond test is a necessity. In this paper, we present a technique to enable pre-bond test of each layer and address several issues with integrating this new test hardware into the final design. Finally, we use a sample 3D floorplan based on the Alpha 21264 to show that our technique can be implemented at minimal cost (0.2% area overhead). Our design for pre-bond testability enables the structural test necessary to continue 3D integration for microprocessors beyond a few layers.
    Test Conference, 2007. ITC 2007. IEEE International; 11/2007
  • R.M. Yoo, H.-H.S. Lee, Han Lee, Kingsum Chow
    ABSTRACT: Benchmark suite scores are typically calculated by averaging the performance of the individual workloads, so the scores are inherently affected by the distribution of those workloads. Because the applications in a benchmark suite are typically contributed by many consortium members, workload redundancy becomes inevitable; in particular, merging benchmark suites can significantly increase artificial redundancy. Redundancy among the workloads of a benchmark suite biases the benchmark scores, making the score of a suite susceptible to malicious tweaks. The current standard workaround is to weight each individual workload during the final score calculation. Unfortunately, such weight-based score adjustment can significantly undermine the credibility and objectivity of benchmark scores. In this paper, we propose a set of benchmark suite score calculation methods, called hierarchical means, that incorporate cluster analysis to amortize the negative effect of workload redundancy. These methods not only improve the accuracy and robustness of the score but also improve objectivity over the weight-based approach. In addition, they can be used to analyze the inherent redundancy and cluster characteristics quantitatively when evaluating a new benchmark suite. In our case study, the hierarchical geometric mean was applied to a hypothetical Java benchmark suite that models the upcoming release of the new SPECjvm benchmark suite. We also show that benchmark suite clustering heavily depends on how the workloads are characterized.
    Workload Characterization, 2007. IISWC 2007. IEEE 10th International Symposium on; 10/2007
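
Of the hierarchical means proposed above, the hierarchical geometric mean is easy to sketch: take a geometric mean within each workload cluster, then across the clusters, so a cluster of near-duplicate workloads contributes only once to the suite score. The workload names, scores, and clustering below are hypothetical examples, not data from the paper:

```python
# Hierarchical geometric mean: per-cluster geomean, then geomean of clusters.
# Workloads, scores, and the clustering are hypothetical illustrations.
import math

def geomean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def hierarchical_geomean(scores, clusters):
    """scores: workload name -> score; clusters: list of workload-name lists.
    Each cluster contributes a single value, so redundant workloads inside a
    cluster cannot inflate the suite score."""
    per_cluster = [geomean([scores[w] for w in c]) for c in clusters]
    return geomean(per_cluster)

scores = {"compress": 4.0, "compress2": 4.0, "crypto": 1.0, "xml": 2.0}
flat = geomean(list(scores.values()))   # redundancy-biased flat mean
hier = hierarchical_geomean(scores, [["compress", "compress2"], ["crypto"], ["xml"]])
print(round(flat, 2), round(hier, 2))
```

Here the duplicated `compress` workloads pull the flat geometric mean above the hierarchical one, which is exactly the bias the paper's methods amortize.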
  • Taeweon Suh, Shih-Lien Lu, H.-H.S. Lee
    ABSTRACT: Recently, there has been a surge of interest in using FPGAs for computer architecture research, with applications ranging from emulating and analyzing new platforms to accelerating microarchitectural simulation for design space exploration. This paper proposes and demonstrates a novel use of FPGAs: measuring the efficiency of coherence traffic in an actual computer system. Our approach employs an FPGA acting as a bus agent, interacting with a real CPU in a dual-processor system to measure the intrinsic delay of coherence traffic. This technique eliminates non-deterministic factors in the measurement, such as arbitration delay and stalls in the pipelined bus, and completely isolates the impact of pure coherence traffic delay on system performance while executing workloads natively. Our experiments show that the overall execution time of the benchmark programs on a system with coherence traffic was actually higher than on one without it. This indicates that cache-to-cache transfers are less efficient in an Intel-based server system and that there is room for further improvement, such as the inclusion of the O state and cache line buffers in the memory controller.
    Field Programmable Logic and Applications, 2007. FPL 2007. International Conference on; 09/2007
  •
    ABSTRACT: This paper presents the first multiobjective microarchitectural floorplanning algorithm for high-performance processors implemented in two-dimensional (2-D) and three-dimensional (3-D) ICs. The floorplanner takes a microarchitectural netlist and determines the dimensions and placement of the functional modules in single or multiple device layers while simultaneously achieving high performance and thermal reliability. The traditional design objectives, area and wirelength, are also considered. The 3-D floorplanning algorithm addresses the following 3-D-specific issues: vertical overlap optimization and bonding-aware layer partitioning. The hybrid floorplanning approach combines linear programming and simulated annealing, which proves very effective in obtaining high-quality solutions in a short runtime under multiobjective goals. This paper provides comprehensive experimental results on trade-offs among performance, temperature, area, and wirelength for both 2-D and 3-D ICs.
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 02/2007; · 1.09 Impact Factor
  •
    ABSTRACT: As very large scale integration (VLSI) process technology migrates to the nanoscale, with feature sizes of less than 100 nm, global wire delay is becoming a major hindrance to keeping the latency of intrachip communication within a single cycle, substantially degrading performance scalability. In addition, an effective microarchitectural floorplanning algorithm can no longer ignore the dynamic communication patterns of applications. Using profile information acquired at the microarchitecture level, this article proposes a profile-guided microarchitectural floorplanner that considers both the impact of wire delay and architectural behavior, namely intermodule communication, to reduce the latency of frequent routes inside a processor and maintain performance scalability. Based on our simulation results, the proposed profile-guided method shows a 5%-40% average instructions-per-cycle (IPC) improvement at a fixed clock frequency. From the perspective of instruction throughput in billions of instructions per second (BIPS), the floorplanner is much more scalable than a conventional wirelength-based floorplanner.
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 08/2006; · 1.09 Impact Factor
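
The profile-guided objective described above, weighting wirelength by how often two modules communicate, can be illustrated with a toy cost function. The module names, access frequencies, grid model, and the two candidate floorplans below are hypothetical; the real floorplanner also optimizes module dimensions, area, and timing:

```python
# Toy profile-guided floorplanning objective: communication cost =
# profiled access frequency x Manhattan distance between module centers.
# Modules, frequencies, and placements are hypothetical examples.

FREQ = {("fetch", "decode"): 10, ("decode", "alu"): 8,
        ("alu", "lsu"): 6, ("lsu", "l1d"): 9}

def comm_cost(pos):
    """Sum of frequency-weighted Manhattan distances over profiled routes."""
    return sum(f * (abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1]))
               for (a, b), f in FREQ.items())

# A placement chosen without profile data (arbitrary row-major order).
naive = {"alu": (0, 0), "decode": (0, 1), "fetch": (1, 0),
         "l1d": (1, 1), "lsu": (2, 0)}

# A placement that keeps frequently communicating modules adjacent.
guided = {"fetch": (0, 0), "decode": (0, 1), "alu": (1, 1),
          "lsu": (1, 0), "l1d": (2, 0)}

print(comm_cost(naive), comm_cost(guided))
```

In this toy example the profile-aware placement cuts the weighted wirelength from 58 to 33; a profile-guided floorplanner searches for such placements automatically while also honoring the other objectives.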