H.-H.S. Lee

Georgia Institute of Technology, Atlanta, Georgia, United States

Publications (32) · 13.87 total impact

  • Sungkap Yeo · H.-H.S. Lee
    ABSTRACT: As the concept of cloud computing gains popularity, more data centers are being built to support the demand. Data centers, which consumed 1.5% of the total electrical energy generated in the USA in 2006, pay the majority of their maintenance costs in electricity bills. Reducing power consumption in data centers is now a must, not only for achieving sustainable growth but also for keeping our planet green. Toward the goal of building power-efficient data centers, this chapter starts by answering the ultimate question: where did the power go? Taking a top-down approach from the data center level all the way down to the microarchitectural level, this chapter visualizes the power breakdowns and discusses power optimization techniques for each layer. © 2012 Springer Science+Business Media, LLC. All rights reserved.
    No preview · Article · Aug 2013
  • Lifeng Nai · H.-H.S. Lee
    ABSTRACT: Conflict detection and resolution are among the most fundamental issues in transactional memory systems. Hardware transactional memory (HTM) systems such as AMD's Advanced Synchronization Facility (ASF) rely on the inherent cache coherence protocol messages to perform conflict detection among transactions. Such an implementation has the advantage of design simplicity; nonetheless, it also generates false transactional conflicts due to false sharing within cache lines, unnecessarily reducing overall performance. In this work, we first investigated the behavior of false transactional conflicts under AMD's ASF system. We found that false conflicts show a rather stable pattern within each cache line, which inspired our false-conflict reduction technique based on a proposed speculative sub-blocking state. By adding an extra speculative state for each sub-block of a cache line, we can perform conflict detection at the granularity of sub-blocks while keeping the original cache coherence protocol intact. The overall design is simple and highly implementable, achieving a high-efficiency HTM system with minimal hardware impact. We evaluated our proposed technique using PTLsim-ASF and compared it with a baseline ASF HTM system and an ideal system with no false transactional conflicts. Our results show that the proposed lightweight technique avoids false conflicts effectively and efficiently. With four sub-blocks per cache line, our technique eliminates 56.4% of false transactional conflicts and 31.3% of all transactional conflicts on average, approaching the performance of the ideal system.
    No preview · Conference Paper · May 2013
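The sub-block idea above can be illustrated with a small sketch. This is not the paper's hardware design: the line and sub-block sizes and the function names are illustrative, and real detection happens inside the coherence protocol rather than in software.

```python
# Illustrative sketch of line- vs. sub-block-granularity conflict detection.
# Assumes a 64-byte cache line split into four sub-blocks (sizes are examples).
LINE_SIZE = 64
NUM_SUBBLOCKS = 4
SUBBLOCK_SIZE = LINE_SIZE // NUM_SUBBLOCKS

def subblock_index(addr):
    """Map a byte address to its sub-block index within its cache line."""
    return (addr % LINE_SIZE) // SUBBLOCK_SIZE

def conflicts(writer_addrs, reader_addrs):
    """Line-granularity detection flags any writer/reader pair that shares a
    cache line; sub-block granularity flags only pairs that also share a
    sub-block, filtering out false conflicts caused by false sharing."""
    line_conflict = any((w // LINE_SIZE) == (r // LINE_SIZE)
                        for w in writer_addrs for r in reader_addrs)
    sub_conflict = any((w // LINE_SIZE) == (r // LINE_SIZE) and
                       subblock_index(w) == subblock_index(r)
                       for w in writer_addrs for r in reader_addrs)
    return line_conflict, sub_conflict

# Two transactions touching the same line but different sub-blocks:
# line-level detection reports a (false) conflict, sub-block level does not.
line, sub = conflicts(writer_addrs=[0], reader_addrs=[16])
```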
  • Dong Hyuk Woo · Nak Hee Seong · H.-H.S. Lee
    ABSTRACT: As scaling DRAM cells becomes more challenging and energy-efficient DRAM chips are in high demand, the DRAM industry has started to undertake an alternative approach to address these looming issues: vertically stacking DRAM dies with through-silicon vias (TSVs) using 3-D-IC technology. Furthermore, this emerging integration technology also makes heterogeneous die stacking in one DRAM package possible. Such a heterogeneous DRAM chip provides a unique, promising opportunity for computer architects to contemplate a new memory hierarchy for future system design. In this paper, we study how to design such a heterogeneous DRAM chip for improving both performance and energy efficiency. In particular, we found that, if we want to design an SRAM row cache in a DRAM chip, simple stacking alone cannot address the majority of traditional SRAM row cache design issues. To address these issues, we propose a novel floorplan and several architectural techniques that fully exploit the benefits of 3-D stacking technology. Our multi-core simulation results with memory-intensive applications suggest that, by tightly integrating a small row cache with its corresponding DRAM array, we can improve performance by 30% while saving dynamic energy by 31%.
    No preview · Article · Jan 2013 · IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  • Sungkap Yeo · H.-H.S. Lee
    ABSTRACT: To optimize datacenter energy efficiency, SimWare analyzes the power consumption of servers, cooling units, and fans as well as the effects of heat recirculation and air supply timing. Experiments using SimWare show a high loss of cooling efficiency due to the nonuniform inlet air temperature distribution across servers.
    No preview · Article · Sep 2012 · Computer
  • Xiaodong Wang · D. Vasudevan · H.-H.S. Lee
    ABSTRACT: 3D integration is a promising technology that provides high memory bandwidth, reduced power, shortened latency, and a smaller form factor. Among the many issues in 3D IC design and production, testing remains one of the major challenges. This paper introduces a new design-for-test technique called 3D-GESP, an efficient built-in self-repair (BISR) algorithm that fulfills the test and reliability needs of 3D-stacked memories. Instead of the local testing and redundancy allocation employed by most current BISR techniques, we introduce a global 3D BISR scheme, which not only enables redundancy sharing but also parallelizes the BISR procedure among all stacked layers of a 3D memory. Our simulation results show that the proposed technique significantly increases the memory repair rate and reduces test time.
    Preview · Conference Paper · Jan 2012
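The benefit of pooling redundancy across stacked layers can be sketched with a toy model. This is illustrative only; the actual 3D-GESP algorithm also parallelizes the repair procedure and manages redundancy allocation in hardware.

```python
# Toy model: local (per-layer) vs. global (pooled) spare allocation in a
# 3D-stacked memory. Fault and spare counts are illustrative.
def repairable_local(faults_per_layer, spares_per_layer):
    """Local BISR: each layer may only use its own spare resources."""
    return all(f <= s for f, s in zip(faults_per_layer, spares_per_layer))

def repairable_global(faults_per_layer, spares_per_layer):
    """Global 3D BISR: spares are shared across all stacked layers."""
    return sum(faults_per_layer) <= sum(spares_per_layer)

# A stack where one layer exceeds its local spares but the shared pool
# suffices: local repair fails, global redundancy sharing succeeds.
local_ok = repairable_local([3, 0], [2, 2])    # False
global_ok = repairable_global([3, 0], [2, 2])  # True
```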
  • Source
    ABSTRACT: This paper presents Ally, a server platform architecture that supports compute-intensive management services on multi-core processors. Ally introduces simple hardware mechanisms to sequester cores to run a separate software environment dedicated to management tasks, including packet-processing software appliances (e.g., for Deep Packet Inspection, DPI), with efficient mechanisms to safely and transparently intercept network packets. Ally enables distributed deployment of compute-intensive management services throughout a data center. Importantly, it uniquely allows these services to be deployed independent of the arbitrary OSs and/or hypervisors that users may choose to run on the remaining cores, with hardware isolation preventing the host environment from tampering with the management environment. Experiments using full-system emulation and a Linux-based prototype validate Ally's functionality and demonstrate low-overhead packet interception; for example, using Ally to host the well-known Snort packet inspection software incurs less overhead than deploying Snort as a Xen virtual machine appliance, resulting in up to 2× improvement in throughput for some workloads.
    Preview · Conference Paper · Nov 2011
  • M. Ghosh · R. Nathuji · Min Lee · K. Schwan · H.H.S. Lee
    ABSTRACT: As the trends of more cores sharing common resources on a single die and more systems crammed into enterprise computing space continue, optimizing the economies of scale for a given compute capacity is becoming more critical. One major challenge to performance scalability is the growing L2 cache contention caused by multiple contexts running on a multi-core processor, either natively or in a virtual machine environment. Currently, an OS at best relies on history-based affinity information to dispatch a process or thread onto a particular processor core. Unfortunately, this simple method can easily lead to destructive performance effects due to conflicts in common resources, thereby slowing down all processes. To improve the allocation and management policy of a shared cache on a multi-core, in this paper we propose Bloom filter signatures, a low-complexity architectural support that allows an OS or a virtual machine monitor to infer the cache footprint characteristics and interference of applications, and then perform job scheduling based on symbiosis. Our scheme integrates hardware-level counting Bloom filters in caches to efficiently summarize cache usage behavior on a per-core, per-process, or per-VM basis. We then propose and study three resource allocation algorithms to determine the optimal process-to-core mapping that minimizes interference in the L2. We executed applications using allocations generated by our new process-to-core mapping algorithms on an Intel Core 2 Duo machine and observed an average 22% (up to 54%) improvement when applications run natively, and an average 9.5% (up to 26%) improvement when running inside VMs.
    Preview · Conference Paper · Oct 2011
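A minimal counting Bloom filter along the lines the abstract describes might look as follows. The filter size, hash construction, and the interference heuristic are assumptions for illustration, not the paper's hardware parameters.

```python
import hashlib

class CountingBloomFilter:
    """Minimal counting Bloom filter, sketching how per-core or per-VM cache
    usage could be summarized (parameters are illustrative)."""
    def __init__(self, size=256, num_hashes=3):
        self.counters = [0] * size
        self.size = size
        self.num_hashes = num_hashes

    def _indices(self, item):
        # Derive k counter indices from independent seeded hashes.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def insert(self, item):          # e.g., a cache line is allocated
        for idx in self._indices(item):
            self.counters[idx] += 1

    def remove(self, item):          # e.g., the cache line is evicted
        for idx in self._indices(item):
            if self.counters[idx] > 0:
                self.counters[idx] -= 1

    def may_contain(self, item):     # possibly cached; no false negatives
        return all(self.counters[idx] > 0 for idx in self._indices(item))

def estimated_interference(f1, f2):
    """Rough footprint-overlap signal between two signatures: a symbiosis-aware
    scheduler could co-locate the pair with the smallest overlap."""
    return sum(min(a, b) for a, b in zip(f1.counters, f2.counters))
```

Counting (rather than single-bit) entries are what make eviction-time removal possible, which is why a cache-resident signature needs them.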
  • Dong Hyuk Woo · Nak Hee Seong · H.-H.S. Lee
    ABSTRACT: As DRAM scaling becomes more challenging and its energy efficiency becomes a growing concern for data center operation, industry is undertaking an alternative approach to address these looming issues: stacking DRAM dies with through-silicon vias (TSVs) using 3-D integration technology. Furthermore, 3-D technology also enables heterogeneous die stacking within one DRAM package. In this paper, we study how to design such a heterogeneous DRAM chip to improve both performance and energy efficiency. In particular, we propose a novel floorplan and several architectural techniques to fully exploit the benefits of 3-D die stacking when integrating an SRAM row cache into a DRAM chip. Our multi-core simulation results show that, by tightly integrating a small row cache with its corresponding DRAM array, we can improve performance by 30% while saving dynamic energy by 31% for memory-intensive applications.
    Preview · Conference Paper · Sep 2011
  • Sungkap Yeo · H.-H.S. Lee
    ABSTRACT: Cloud computing has emerged as a highly cost-effective computation paradigm for IT enterprise applications, scientific computing, and personal data management. Because cloud services are provided by machines of various capabilities, performance, power, and thermal characteristics, it is challenging for providers to understand their cost effectiveness when deploying their systems. This article analyzes a parallelizable task in a heterogeneous cloud infrastructure with mathematical models to evaluate the energy and performance trade-off. As the authors show, to achieve the optimal performance per utility, the slowest node's response time should be no more than three times that of the fastest node. The theoretical analysis presented can be used to guide allocation, deployment, and upgrades of computing nodes for optimizing utility effectiveness in cloud computing services.
    Preview · Article · Sep 2011 · Computer
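A simplified model of the trade-off the article analyzes: if a perfectly parallelizable task is split in proportion to node speeds, all nodes finish together, and the three-times guideline bounds acceptable heterogeneity. The functions below are an illustrative sketch, not the paper's full energy/utility analysis.

```python
def balanced_shares(speeds, total_work=1.0):
    """Split a perfectly parallelizable task across heterogeneous nodes in
    proportion to their speeds, so every node finishes at the same time.
    (Simplified model; ignores communication and energy costs.)"""
    total_speed = sum(speeds)
    return [total_work * s / total_speed for s in speeds]

def within_three_times_rule(speeds):
    """The article's guideline: the slowest node's response time should be at
    most 3x the fastest node's. For equal per-node work, response time is
    inversely proportional to speed, so the check reduces to a speed ratio."""
    return max(speeds) / min(speeds) <= 3.0

# A 1x/2x/3x cluster satisfies the guideline; a 1x/5x cluster does not.
shares = balanced_shares([1.0, 2.0, 3.0])
```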
  • Source
    ABSTRACT: We describe the design and analysis of 3D-MAPS, a 64-core 3D-stacked memory-on-processor running at 277 MHz with 63 GB/s memory bandwidth, sent for fabrication using Tezzaron's 3D stacking technology. We also describe the design flow used to implement it using industrial 2D tools and custom add-ons to handle 3D specifics.
    Full-text · Conference Paper · Oct 2010
  • Dong Hyuk Woo · Nak Hee Seong · D.L. Lewis · H.-H.S. Lee
    ABSTRACT: Memory bandwidth has become a major performance bottleneck as more and more cores are integrated onto a single die, demanding more and more data from system memory. Several prior studies have demonstrated that this memory bandwidth problem can be addressed by employing a 3D-stacked memory architecture, which provides a wide, high-frequency memory-bus interface. Although previous 3D proposals already provide as much bandwidth as a traditional L2 cache can consume, the dense through-silicon vias (TSVs) of 3D chip stacks can provide still more. In this paper, we contend that we need to re-architect the memory hierarchy, including the L2 cache and DRAM interface, so that it can take full advantage of this massive bandwidth. Our technique, SMART-3D, is a new 3D-stacked memory architecture with a vertical L2 fetch/write-back network using a large array of TSVs. Simply stated, we leverage the TSV bandwidth to hide latency behind very large data transfers. We analyze the design trade-offs for the DRAM arrays, taking care to avoid compromising DRAM density with TSV placement. Moreover, we propose an efficient mechanism to manage the false sharing problem when implementing SMART-3D in a multi-socket system. For single-threaded memory-intensive applications, SMART-3D achieves speedups from 1.53 to 2.14 over planar designs and from 1.27 to 1.72 over prior 3D designs. We achieve similar speedups for multi-program and multi-threaded workloads on multi-core and multi-socket processors. Furthermore, SMART-3D can even lower energy consumption in the L2 cache and 3D DRAM because it reduces the total number of row-buffer misses.
    Preview · Conference Paper · Feb 2010
  • H.-H.S. Lee · Krishnendu Chakrabarty
    ABSTRACT: One of the challenges for 3D technology adoption is the insufficient understanding of 3D testing issues and the lack of DFT solutions. This article describes testing challenges for 3D ICs, including problems that are unique to 3D integration, and summarizes early research results in this area. Researchers are investigating various 3D IC manufacturing processes that are particularly relevant to testing and DFT. In terms of the process and the level of assembly that 3D ICs require, we can broadly classify the techniques as monolithic or as die stacking.
    Preview · Article · Nov 2009 · IEEE Design and Test of Computers
  • Dean L. Lewis · H.-H.S. Lee
    ABSTRACT: The first memristor, originally theorized by Dr. Leon Chua in 1971, was identified by a team at HP Labs in 2008. This new fundamental circuit element is unique in that its resistance changes as current passes through it, giving the device a memory of the past system state. The immediately obvious application of such a device is in a non-volatile memory, wherein high- and low-resistance states are used to store binary values. A memory array of memristors forms what is called a resistive RAM or RRAM. In this paper, we survey the memristors that have been produced by a number of different research teams and present a point-by-point comparison between DRAM and this new RRAM, based on both existent and expected near-term memristor devices. In particular, we consider the case of a die-stacked 3D memory that is integrated onto a logic die and evaluate which memory is best suited for the job. While still suffering a few shortcomings, RRAM proves itself a very interesting design alternative to well-established DRAM technologies.
    Preview · Conference Paper · Oct 2009
  • Dong Hyuk Woo · H.-H.S. Lee
    ABSTRACT: An updated take on Amdahl’s analytical model uses modern design constraints to analyze many-core design alternatives. The revised models provide computer architects with a better understanding of many-core design types, enabling them to make more informed trade-offs. Unsustainable power consumption and ever-increasing design and verification complexity have driven the microprocessor industry to integrate multiple cores on a single die, or multicore, as an architectural solution for sustaining Moore’s law. With dual-core and quad-core processors …
    Preview · Article · Jan 2009 · Computer
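The classic Amdahl model the article revisits is easy to state in code. The performance-per-watt variant below is a simplified model in the spirit of the article, with an assumed idle-power fraction k; it is not the authors' exact formulation.

```python
def amdahl_speedup(f, n):
    """Classic Amdahl's law: f is the parallelizable fraction of the work,
    n is the number of cores."""
    return 1.0 / ((1.0 - f) + f / n)

def perf_per_watt(f, n, k=0.3):
    """Speedup per watt for an n-core chip where an idle core draws a
    fraction k of active power (k = 0.3 is an illustrative value).
    Sequential phase: 1 active + (n-1) idle cores; parallel phase: n active."""
    seq_time = 1.0 - f
    par_time = f / n
    energy = seq_time * (1 + (n - 1) * k) + par_time * n
    avg_power = energy / (seq_time + par_time)
    return amdahl_speedup(f, n) / avg_power

# Diminishing returns: even with 95% parallel code, 256 cores give under 20x,
# and idle-core power makes large n increasingly costly per unit of speedup.
```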
  • C.S. Ballapuram · H.-H.S. Lee
    ABSTRACT: Java platforms are widely deployed, from ultra-mobile embedded devices to servers, for their portability and security. The TLB, a content-addressable memory, can consume significant power in these systems due to the nature of its associative search mechanism. In this paper, we propose and investigate three different optimizations of the TLB design, aiming to improve its power consumption for Java applications running on top of Java virtual machines. Our techniques exploit the unique memory reference characteristics demonstrated by the JVM and its interaction with the Java applications running atop it. Our first technique, J-iTLB, shows an average 12.7% energy reduction in the iTLB with around 1% performance improvement by eliminating conflict misses between JVM code and Java application code. The second technique combines the J-iTLB with an object iTLB scheme and achieves 51% energy savings with a small 1% performance impact. Our third technique, a read-write partitioned J-dTLB, shows an average of 34% energy savings in the dTLB with 1% performance impact. Finally, when the J-iTLB with an object iTLB is combined with the J-dTLB, we obtain 42% overall TLB energy savings.
    Preview · Conference Paper · Aug 2008
  • Source
    ABSTRACT: To build a future many-core processor, industry must address the challenges of energy consumption and performance scalability. A 3D-integrated broad-purpose accelerator architecture called parallel-on-demand (POD) integrates a specialized SIMD-based die layer on top of a CISC superscalar processor to accelerate a variety of data-parallel applications. It also maintains binary compatibility and facilitates extensibility by virtualizing the acceleration capability.
    Full-text · Article · Aug 2008 · IEEE Micro
  • Source
    ABSTRACT: In this paper, we present a novel design methodology to combat the ever-worsening high-frequency power supply noise (di/dt) in modern microprocessors. Our methodology integrates microarchitectural profiling for noise-aware floorplanning, dynamic runtime noise control to prevent unsustainable noise emergencies, and decap allocation, all to produce a design for the average-case current consumption scenario. The dynamic controller contributes a microarchitectural technique to eliminate occurrences of the worst-case noise scenario; thus our method focuses on average-case noise behavior.
    Full-text · Conference Paper · Apr 2008
  • Mrinmoy Ghosh · H.-H.S. Lee
    ABSTRACT: This paper proposes virtual exclusion, an architectural technique to reduce leakage energy in the L2 caches of cache-coherent multiprocessor systems. The technique leverages two previously proposed circuit techniques, gated-Vdd and drowsy cache, in a low-cost, easily implementable scheme for cache-coherent multiprocessor systems. The virtual exclusion scheme saves leakage energy by keeping the data portion of repetitive cache lines off in the large higher-level caches while still maintaining multi-level inclusion, an essential property for an efficient implementation of conventional cache coherence protocols. By exploiting the existing state information in the snoop-based cache coherence protocol, our scheme incurs almost no extra hardware overhead. In our experiments, the SPLASH-2 multiprocessor benchmark suite executed correctly under the new virtual exclusion policy and showed up to 72% leakage energy savings (on average, 46% for SMP and 35% for multicore in the L2) over a baseline drowsy L2 cache.
    Preview · Conference Paper · Jan 2008
  • Eric Fontaine · H.-H.S. Lee
    ABSTRACT: The Katsevich image reconstruction algorithm is the first theoretically exact cone-beam image reconstruction algorithm for a helical scanning path in computed tomography (CT). However, it requires much more computation and memory than other CT algorithms. Fortunately, there are many opportunities for coarse-grained parallelism using multiple threads and fine-grained parallelism using SIMD units that can be exploited by emerging multi-core processors. In this paper, we implemented and optimized Katsevich image reconstruction based on the previously proposed pi-interval method and cone-beam cover method, and parallelized them using the OpenMP API and SIMD instructions. We also exploited symmetry in the backprojection stage. Our results show that reconstructing a 1024 × 1024 × 1024 image from 5120 projections of 512 × 128 pixels on a dual-socket quad-core system took 23,798 seconds with our baseline and 642 seconds with our final version, a speedup of more than 37×. Furthermore, by parallelizing the code with more threads, we found that scalability is ultimately limited by the front-side bus bandwidth.
    Preview · Conference Paper · Jan 2008
  • R.M. Yoo · H.-H.S. Lee · Han Lee · Kingsum Chow
    ABSTRACT: Benchmark suite scores are typically calculated by averaging the performance of each individual workload, so the scores are inherently affected by the distribution of workloads. Given that the applications of a benchmark suite are typically contributed by many consortium members, workload redundancy becomes inevitable; in particular, merging benchmarks can significantly increase artificial redundancy. Redundancy among the workloads of a benchmark suite renders the scores biased, making the score of a suite susceptible to malicious tweaks. The current standard workaround for the redundancy issue is to weight each individual workload during the final score calculation. Unfortunately, such weight-based score adjustment can significantly undermine the credibility and objectivity of benchmark scores. In this paper, we propose a set of benchmark suite score calculation methods called hierarchical means, which incorporate cluster analysis to amortize the negative effect of workload redundancy. These methods not only improve the accuracy and robustness of the score but also improve objectivity over the weight-based approach. In addition, they can be used to quantitatively analyze the inherent redundancy and cluster characteristics when evaluating a new benchmark suite. In our case study, the hierarchical geometric mean was applied to a hypothetical Java benchmark suite that models the upcoming release of the new SPECjvm benchmark suite. We also show that benchmark suite clustering depends heavily on how the workloads are characterized.
    Preview · Conference Paper · Oct 2007
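The hierarchical geometric mean described above is straightforward to compute once workloads are clustered; the clustering step itself (the paper uses cluster analysis) is assumed given here, and the scores are made-up numbers for illustration.

```python
import math

def geomean(xs):
    """Geometric mean via log-space averaging (scores must be positive)."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def hierarchical_geomean(clusters):
    """Hierarchical geometric mean: take the geomean within each cluster of
    similar workloads, then the geomean across cluster scores, so redundant
    (clustered) workloads no longer dominate the suite score."""
    return geomean([geomean(cluster) for cluster in clusters])

# Three redundant copies of one fast workload inflate the flat score,
# but count only once per cluster in the hierarchical score.
flat = geomean([8, 8, 8, 2])                   # ~5.66, biased upward
hier = hierarchical_geomean([[8, 8, 8], [2]])  # 4.0, redundancy amortized
```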