K. Skadron

University of Virginia, Charlottesville, VA, USA

Publications (57) · 17.59 Total impact

  • Source
    Article: Scaling with Design Constraints: Predicting the Future of Big Chips
    ABSTRACT: The past few years have witnessed high-end processors with increasing numbers of cores and larger dies. With limited instruction-level parallelism, chip power constraints, and technology-scaling limitations, designers have embraced multiple cores rather than single-core performance scaling to improve chip throughput. This article examines whether this approach is sustainable by scaling from a state-of-the-art big-chip design point using analytical models.
    IEEE Micro 09/2011; · 1.78 Impact Factor
  • Source
    Conference Proceeding: Reducing the cost of redundant execution in safety-critical systems using relaxed dedication
    ABSTRACT: We introduce on-demand redundancy, a set of architectural techniques that leverage the tightly-coupled nature of components in systems-on-chip to reduce the cost of safety-critical systems. On-demand redundancy eases the assumptions that traditionally segregate the execution of critical and non-critical tasks (NCTs), making resources available for critical tasks at potentially arbitrary points in both space and time, and otherwise freeing resources to execute non-critical tasks when critical tasks are not executing. Relaxed dedication is one such technique that allows non-critical tasks to execute on critical task resources. Our results demonstrate that for a wide variety of applications and architectures, relaxed dedication is more cost-effective than a traditional approach that employs dedicated resources executing in lockstep. Applied to dual-modular redundancy (DMR), relaxed dedication exposes 73% more NCT cycles than traditional DMR on average, across a wide variety of usage scenarios.
    Design, Automation & Test in Europe Conference & Exhibition (DATE), 2011; 04/2011
  • Source
    Conference Proceeding: A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads
    ABSTRACT: The recently released Rodinia benchmark suite enables users to evaluate heterogeneous systems including both accelerators, such as GPUs, and multicore CPUs. As Rodinia sees higher levels of acceptance, it becomes important that researchers understand this new set of benchmarks, especially in how they differ from previous work. In this paper, we present recent extensions to Rodinia and conduct a detailed characterization of the Rodinia benchmarks (including performance results on an NVIDIA GeForce GTX480, the first product released based on the Fermi architecture). We also compare and contrast Rodinia with Parsec to gain insights into the similarities and differences of the two benchmark collections; we apply principal component analysis to analyze the application space coverage of the two suites. Our analysis shows that many of the workloads in Rodinia and Parsec are complementary, capturing different aspects of certain performance metrics.
    Workload Characterization (IISWC), 2010 IEEE International Symposium on; 01/2011
  • Source
    Conference Proceeding: Temperature-to-power mapping
    ABSTRACT: Accurate power maps are useful for power model validation, process variation characterization, leakage estimation, and power optimization, but are hard to measure directly. Deriving power maps from measured thermal maps is the inverse problem of the power-to-temperature mapping, extensively studied through thermal simulation. Until recently this inverse heat conduction problem has received little attention in the microarchitecture research community. This paper first identifies the source of difficulties for the problem. The inverse mapping is then performed by applying constraints from microarchitecture-level observations. The inherent large sensitivity of the resultant power map is minimized through thermal map-filtering and constrained least-squares optimization. Choices of filter parameters and optimization constraints are investigated and their effects are evaluated. Furthermore, the paper highlights the differences between grid and block modeling in the inverse mapping, which were often ignored by previous schemes. The proposed methods reduce the mapping error by more than 10× compared to unoptimized solutions. To the best of our knowledge, this is the first work to quantitatively evaluate and minimize the noise effect in the temperature-to-power mapping problem at the microarchitecture level for both grid and block modes, and for both the steady-state and transient cases.
    Computer Design (ICCD), 2010 IEEE International Conference on; 11/2010
  • Source
    Conference Proceeding: Exploiting inter-thread temporal locality for chip multithreading
    ABSTRACT: Multi-core organizations increasingly support multiple threads per core. Threads on a core usually share a single first-level data cache, so thread schedulers must try to minimize cache contention among threads. While this has been studied for concurrent threads with disjoint working sets, the problem has not been addressed for multi-threaded data-parallel workloads in which threads can be scheduled or constructed to improve inter-thread cache sharing. This paper proposes the symbiotic affinity scheduling (SAS) algorithm in which work is first partitioned according to the number of cores (i.e., the number of caches), and these partitions are then subdivided and scheduled among each core's available thread contexts so that threads sharing a core operate on neighboring elements to maximize cache locality. We demonstrate this concept with a series of data-parallel benchmarks. Simulations on M5 achieve an average speedup of 1.69× and 36% energy savings over conventional scheduling techniques that are oblivious to whether threads share a cache. Even compared to an approach that extends oblivious scheduling to ensure that the sum of the threads' working sets fits in the cache, symbiotic affinity scheduling is able to exploit greater temporal locality and provide 30% performance gains on average. Symbiosis also outperforms adaptive contention reduction techniques by 17%.
    Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on; 05/2010
  • Source
    Conference Proceeding: Exploring the thermal impact on manycore processor performance
    ABSTRACT: Performance of processors with many simple parallel cores is limited by the serial part of the workload, requiring an asymmetric core organization with one or more aggressive "primary" cores for better serial performance. A primary core introduces power-hungry microarchitectural structures and usually causes severe local hot spots. This paper explores the thermal impact on manycore processor architecture and evaluates its performance. Preliminary results show that thermal constraints reduce performance as expected, but also make performance almost insensitive to the complexity of the primary core across diverse degrees of parallelism, which greatly reduces design complexity.
    Semiconductor Thermal Measurement and Management Symposium, 2010. SEMI-THERM 2010. 26th Annual IEEE; 03/2010
  • Source
    Conference Proceeding: Interaction of scaling trends in processor architecture and cooling
    ABSTRACT: It is predicted that two important trends are likely to accompany traditional CMOS semiconductor technology scaling - chip multiprocessors and 3D integration. With the ever-increasing power consumption and the consequent difficulty in heat removal, it is important to consider the limits and implications of different cooling methods for the upcoming many-core and 3D era. In this paper, we consider both technology scaling and many-core architecture scaling trends in conjunction with conventional air cooling and advanced microchannel cooling for both 2D and 3D microprocessors and identify interesting inflection design points down the road.
    Semiconductor Thermal Measurement and Management Symposium, 2010. SEMI-THERM 2010. 26th Annual IEEE; 03/2010
  • Source
    Conference Proceeding: Differentiating the roles of IR measurement and simulation for power and temperature-aware design
    ABSTRACT: In temperature-aware design, the presence or absence of a heatsink fundamentally changes the thermal behavior with important design implications. In recent years, chip-level infrared (IR) thermal imaging has been gaining popularity in studying thermal phenomena and thermal management, as well as reverse-engineering chip power consumption. Unfortunately, IR thermal imaging needs a peculiar cooling solution, which removes the heatsink and applies an IR-transparent liquid flow over the exposed bare die to carry away the dissipated heat. Because this cooling solution is drastically different from a normal thermal package, its thermal characteristics need to be closely examined. In this paper, we characterize the differences between two cooling configurations: forced air flow over a copper heatsink (AIR-SINK) and laminar oil flow over bare silicon (OIL-SILICON). For the comparison, we modify the HotSpot thermal model by adding the IR-transparent oil flow and the secondary heat transfer path through the package pins, hence modeling what the IR camera actually sees at runtime. We show that OIL-SILICON and AIR-SINK are significantly different in both transient and steady-state thermal responses. OIL-SILICON has a much slower short-term transient response, which makes dynamic thermal management less efficient. In addition, for OIL-SILICON, the direction of oil flow plays an important role by changing hot spot location, thus impacting hot spot identification and thermal sensor placement. These results imply that the power- and temperature-aware design process cannot just rely on IR measurements. Simulation and IR measurement are both needed and are complementary techniques.
    Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on; 05/2009
  • Source
    Conference Proceeding: Accelerating Compute-Intensive Applications with GPUs and FPGAs
    ABSTRACT: Accelerators are special purpose processors designed to speed up compute-intensive sections of applications. Two extreme endpoints in the spectrum of possible accelerators are FPGAs and GPUs, which can often achieve better performance than CPUs on certain workloads. FPGAs are highly customizable, while GPUs provide massive parallel execution resources and high memory bandwidth. Applications typically exhibit vastly different performance characteristics depending on the accelerator. This is an inherent problem attributable to architectural design, middleware support and programming style of the target platform. For the best application-to-accelerator mapping, factors such as programmability, performance, programming cost and sources of overhead in the design flows must be all taken into consideration. In general, FPGAs provide the best expectation of performance, flexibility and low overhead, while GPUs tend to be easier to program and require less hardware resources. We present a performance study of three diverse applications - Gaussian elimination, data encryption standard (DES), and Needleman-Wunsch - on an FPGA, a GPU and a multicore CPU system. We perform a comparative study of application behavior on accelerators considering performance and code complexity. Based on our results, we present an application characteristic to accelerator platform mapping, which can aid developers in selecting an appropriate target architecture for their chosen application.
    Application Specific Processors, 2008. SASP 2008. Symposium on; 07/2008
  • Source
    Article: Low-Power Design and Temperature Management
    ABSTRACT: One of the primary concerns for microprocessor designers has always been balancing power and thermal management while minimizing performance loss. Rather than generate solutions to this dilemma, the advent of multicore chips has raised a host of new challenges. This discussion with Pradip Bose and Kanad Ghose, excerpted from a 2007 Card Workshop Panel, explores the future of low-power design and temperature management.
    IEEE Micro 12/2007; · 1.78 Impact Factor
  • Source
    Article: Dynamic Voltage Scaling in Multitier Web Servers with End-to-End Delay Control
    ABSTRACT: The energy and cooling costs of Web server farms are among their main financial expenditures. This paper explores the benefits of dynamic voltage scaling (DVS) for power management in server farms. Unlike previous work, which addressed DVS on individual servers and on load-balanced server replicas, this paper addresses DVS in multistage service pipelines. Contemporary Web server installations typically adopt a three-tier architecture in which the first tier presents a Web interface, the second executes scripts that implement business logic, and the third serves database accesses. From a user's perspective, only the end-to-end response across the entire pipeline is relevant. This paper presents a rigorous optimization methodology and an algorithm for minimizing the total energy expenditure of the multistage pipeline subject to soft end-to-end response-time constraints. A distributed power management service is designed and evaluated on a real three-tier server prototype for coordinating DVS settings in a way that minimizes global energy consumption while meeting end-to-end delay constraints. The service is shown to consume as much as 30 percent less energy than the default (Linux) energy-saving policy.
    IEEE Transactions on Computers 05/2007; 56(4):444-458. · 1.10 Impact Factor
  • Source
    Conference Proceeding: Enhancing Energy Efficiency in Multi-tier Web Server Clusters via Prioritization
    T. Horvath, K. Skadron, T. Abdelzaher
    ABSTRACT: This paper investigates the design issues and energy savings benefits of service prioritization in multi-tier Web server clusters. In many services, classes of clients can be naturally assigned different priorities based on their performance requirements. We show that if the whole multi-tier system is effectively prioritized, additional power and energy savings are realizable while keeping an existing cluster-wide energy management technique, through exploiting the different performance requirements of separate service classes. We find a simple prioritization scheme to be highly effective without requiring intrusive modifications to the system. In order to quantify its benefits, we perform extensive experimental evaluation on a real testbed. It is shown that the scheme significantly improves both total system power savings and energy efficiency, at the same time as improving throughput and enabling the system to meet per-class performance requirements.
    Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International; 04/2007
  • Source
    Article: Interconnect Lifetime Prediction for Reliability-Aware Systems
    ABSTRACT: Thermal effects are becoming a limiting factor in high-performance circuit design due to the strong temperature dependence of leakage power, circuit performance, IC package cost, and reliability. While many interconnect reliability models assume a constant temperature, this paper analyzes the effects of temporal and spatial thermal gradients on interconnect lifetime in terms of electromigration, and presents a physics-based dynamic reliability model which returns a reliability-equivalent temperature and current density that can be used in traditional reliability analysis tools. The model is verified with numerical simulations and reveals that blindly using the maximum temperature leads to overly pessimistic lifetime estimates. Therefore, the proposed model not only increases the accuracy of reliability estimates, but also enables designers to reclaim design margin in reliability-aware design. In addition, the model is useful for improving the performance of temperature-aware runtime management by modeling system lifetime as a resource to be consumed at a stress-dependent rate.
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems 03/2007; · 1.22 Impact Factor
  • Source
    Article: HotSpot: a compact thermal modeling methodology for early-stage VLSI design
    ABSTRACT: This paper presents HotSpot, a modeling methodology for developing compact thermal models based on the popular stacked-layer packaging scheme in modern very large-scale integration systems. In addition to modeling silicon and packaging layers, HotSpot includes a high-level on-chip interconnect self-heating power and thermal model such that the thermal impacts on interconnects can also be considered during early design stages. The HotSpot compact thermal modeling approach is especially well suited for pre-register-transfer-level (pre-RTL) and pre-synthesis thermal analysis and is able to provide detailed static and transient temperature information across the die and the package, while also being computationally efficient.
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems 06/2006; · 1.22 Impact Factor
  • Source
    Conference Proceeding: CMP design space exploration subject to physical constraints
    ABSTRACT: This paper explores the multi-dimensional design space for chip multiprocessors, exploring the inter-related variables of core count, pipeline depth, superscalar width, L2 cache size, and operating voltage and frequency, under various area and thermal constraints. The results show the importance of joint optimization. Thermal constraints dominate other physical constraints such as pin-bandwidth and power delivery, demonstrating the importance of considering thermal constraints while optimizing these other parameters. For aggressive cooling solutions, reducing power density is at least as important as reducing total power, while for low-cost cooling solutions, reducing total power is more important. Finally, the paper shows the challenges of accommodating both CPU-bound and memory-bound workloads on the same design. Their respective preferences for more cores and larger caches lead to increasingly irreconcilable configurations as area and other constraints are relaxed; rather than accommodating a happy medium, the extra resources simply encourage more extreme optimization points.
    High-Performance Computer Architecture, 2006. The Twelfth International Symposium on; 03/2006
  • Source
    Article: Parameterized physical compact thermal modeling
    Wei Huang, M.R. Stan, K. Skadron
    ABSTRACT: This paper presents a compact thermal modeling (CTM) approach, which is fully parameterized according to design geometries and material physical properties. While most compact modeling approaches facilitate thermal characterization of existing package designs, our method is better suited for preliminary exploration of the design space at both the silicon level and the package level. We show that our modeling method achieves reasonable boundary condition independence (BCI) by comparing a CTM example with a BCI model for a benchmark ball grid array single-chip package under the same standard set of boundary conditions. In essence, the presented CTM method can act as a convenient medium for enhanced interactions and collaborations among designers at the package, circuit and computer architecture levels, leading to efficient early evaluations of different thermally-related design trade-offs at all the above levels of abstraction before the actual detailed design is available. The presented modeling method can be easily extended to model emerging packaging schemes such as stacked chip-scale packaging and three-dimensional integration.
    IEEE Transactions on Components and Packaging Technologies 01/2006; · 0.94 Impact Factor
  • Source
    Article: Improved thermal management with reliability banking
    Z. Lu, J. Lach, M.R. Stan, K. Skadron
    ABSTRACT: Using a fixed temperature for thermal throttling is pessimistic. Reduced aging during periods of low temperature can compensate for accelerated aging during periods of high temperature. Runtime tracking of the temperature-dependent aging rate means that throttling is engaged only when necessary to maintain reliability. In this article, we show that the effect of cool (low-temperature) phases can compensate for that of hot (high-temperature) phases on reliability. Existing dynamic thermal management (DTM) techniques ignore the effects of temperature fluctuations on chip lifetime and can unnecessarily impose performance penalties for hot phases. Using electromigration as the targeted failure mechanism, we apply a dynamic reliability model and propose a dynamic reliability management (DRM) technique to dynamically track the consumption of chip lifetime during operation.
    IEEE Micro 12/2005; 25(6):40-49. · 1.78 Impact Factor
  • Source
    Conference Proceeding: Analytical model for sensor placement on microprocessors
    Kyeong-Jae Lee, K. Skadron, W. Huang
    ABSTRACT: Thermal management in microprocessors has become a major design challenge in recent years. Thermal monitoring through hardware sensors is important, and these sensors must be carefully placed on the chip to account for thermal gradients. In this paper, we present an analytical model that describes the maximum temperature differential between a hot spot and a region of interest based on their distance and processor packaging information. We also use a run-time thermal model, as an illustration of virtual sensors, and examine two benchmarks that exhibit highly concentrated thermal stress. We then use our analytical model to demonstrate the safety margins of the chip. Ultimately, the mathematical expression allows designers to obtain worst-case behavior of thermal heatup and select the optimal location of additional sensors.
    Computer Design: VLSI in Computers and Processors, 2005. ICCD 2005. Proceedings. 2005 IEEE International Conference on; 11/2005
  • Source
    Conference Proceeding: Monitoring temperature in FPGA based SoCs
    ABSTRACT: FPGA logic densities continue to increase at a tremendous rate. This has had the undesired consequence of increased power density, which manifests itself as higher on-die temperatures and local hotspots. Sophisticated packaging techniques have become essential to maintain the health of the chip. In addition to static techniques to reduce the temperature, dynamic thermal management techniques are essential. Such techniques rely on accurate on-chip temperature information. In this paper, we present the design of a system that monitors the temperatures at various locations on the FPGA. This system is composed of a controller interfacing to an array of temperature sensors that are implemented on the FPGA fabric. Such a system can be used to implement dynamic thermal management techniques. We cross validate the sensor readings with values obtained from HotSpot, a pre-RTL architectural level thermal modeling tool.
    Computer Design: VLSI in Computers and Processors, 2005. ICCD 2005. Proceedings. 2005 IEEE International Conference on; 11/2005
  • Source
    Conference Proceeding: The need for a full-chip and package thermal model for thermally optimized IC designs
    ABSTRACT: Modeling and analyzing detailed die temperature with a full-chip thermal model at early design stages is important to discover and avoid potential thermal hazards. However, omitting important aspects of package details in a thermal model can result in significant temperature estimation errors. In this paper, we discuss the applications of an existing compact thermal model that models both die and package temperature details. As an example, a thermally self-consistent leakage power calculation of a POWER4-like microprocessor design is presented. We then demonstrate the importance of including detailed package information in the thermal model by several examples considering the impact of thermal interface material (TIM), which glues the die to the heat spreader. The fact that detailed package information is needed to build an accurate compact thermal model implies a design flow, in which the chip- and package-level compact thermal model acts as a convenient medium for more productive collaborations among circuit designers, computer architects and package designers, leading to early and efficient evaluations of different design tradeoffs for an optimal design from a thermal point of view.
    Low Power Electronics and Design, 2005. ISLPED '05. Proceedings of the 2005 International Symposium on; 09/2005