David H. Albonesi

Cornell University, Ithaca, New York, United States


Publications (123) · 38.41 Total Impact

  • ABSTRACT: Heating, ventilation, and air-conditioning (HVAC) systems consume a significant portion of the energy used within buildings. Current HVAC control systems rely on simple fixed occupancy schedules, while proposed energy optimization schemes do not consider past discomfort when making future energy optimization decisions. We propose a model-based predictive control (MPC) algorithm that adaptively balances energy and comfort while the system is in operation. The algorithm combines occupancy prediction with the history of occupant discomfort to constrain expected discomfort to an allowed budget. Our approach saves energy by dynamically shifting discomfort over time based on real-time performance. The system adapts its behavior according to past discomfort and thus plays a dual role: saving energy when discomfort is below the target budget, and maintaining comfort when the discomfort margin is small. Simulation results using synthetic benchmarks and occupancy traces demonstrate considerable energy savings over a smart reactive approach while meeting occupant comfort objectives.
    No preview · Article · Feb 2015
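The discomfort-budget idea in this abstract can be sketched roughly as follows. This is a minimal illustration with hypothetical names and a toy time-proportional budget; it is not the paper's controller.

```python
# Minimal sketch of budget-constrained comfort/energy control.
# All names, fields, and the budget-sharing rule are illustrative assumptions.

def control_step(candidates, discomfort_so_far, budget, elapsed, horizon):
    """Pick the lowest-energy action whose expected discomfort keeps the
    running total within a time-proportional share of the budget."""
    allowed = budget * (elapsed + 1) / horizon  # discomfort allowed so far
    feasible = [a for a in candidates
                if discomfort_so_far + a["discomfort"] <= allowed]
    # If no action fits the budget, fall back to the least uncomfortable one.
    pool = feasible if feasible else candidates
    key = (lambda a: a["energy"]) if feasible else (lambda a: a["discomfort"])
    return min(pool, key=key)
```

Early in the horizon the allowed share is small, so the controller spends energy to maintain comfort; once accumulated discomfort is well under budget, cheaper (less comfortable) actions become feasible, which mirrors the "dynamically shifting discomfort over time" behavior the abstract describes.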
  • Article: Flicker

    No preview · Article · Jul 2013 · ACM SIGARCH Computer Architecture News
  • ABSTRACT: Future microprocessors may become so power constrained that not all transistors can be powered on at once. These systems will be required to nimbly adapt to changes in the chip power that is allocated to general-purpose cores and to specialized accelerators. This paper presents Flicker, a general-purpose multicore architecture that dynamically adapts to varying and potentially stringent limits on allocated power. The Flicker core microarchitecture includes deconfigurable lanes--horizontal slices through the pipeline--that permit tailoring an individual core to the running application with lower overhead than microarchitecture-level adaptation and greater flexibility than core-level power gating. To exploit Flicker's flexible pipeline architecture, a new online multicore optimization algorithm combines reduced sampling techniques, response surface models applied to online optimization, and heuristic online search. The approach efficiently finds a near-global-optimum configuration of lanes without requiring offline training, microarchitectural state, or foreknowledge of the workload. At high power allocations, core-level gating is highly effective and slightly outperforms Flicker overall. However, under stringent power constraints, Flicker significantly outperforms core-level gating, achieving an average 27% performance improvement.
    No preview · Conference Paper · Jun 2013
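The kind of online search over per-core lane configurations described here can be sketched as a simple hill climb under a power cap. The power and throughput models below are made-up callables, not Flicker's actual response-surface optimizer.

```python
# Hedged sketch of online configuration search under a power budget:
# hill-climb over per-core active-lane counts, accepting any single-lane
# change that fits the budget and improves measured throughput.
# All names and models are illustrative assumptions.

def hill_climb(lane_counts, power, throughput, budget, max_lanes):
    """lane_counts: starting lanes per core. power/throughput: callables
    that 'measure' a candidate configuration. Returns an improved config."""
    best = list(lane_counts)
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            for delta in (-1, 1):  # try one lane fewer / one lane more
                trial = list(best)
                trial[i] += delta
                if not (1 <= trial[i] <= max_lanes):
                    continue
                if power(trial) <= budget and throughput(trial) > throughput(best):
                    best, improved = trial, True
    return best
```

A real system would replace the exhaustive neighbor probing with the reduced-sampling and model-fitting steps the abstract mentions, since each "measurement" costs execution time.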
  • Abhinandan Majumdar · David H. Albonesi · Pradip Bose
    ABSTRACT: The increasing worldwide concern over the energy consumption of commercial buildings calls for new approaches that analyze scheduled occupant activities and proactively take steps to curb building energy use. As one step in this direction, we propose to automate the scheduling of meetings in a way that uses available meeting rooms in an energy efficient manner, while adhering to time conflicts and capacity constraints. We devise a number of scheduling algorithms, ranging from greedy to heuristic approaches, and demonstrate up to a 70% reduction in energy use, with the best algorithms producing schedules whose energy use matches that of a brute force oracle.
    No preview · Conference Paper · Nov 2012
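A greedy scheduler of the flavor this abstract describes can be sketched as follows: pack meetings into as few already-active rooms as possible, respecting capacity and time conflicts. The consolidation heuristic and tie-breaking are illustrative assumptions, not the paper's algorithms.

```python
# Hedged sketch of greedy energy-aware meeting scheduling.
# Room/meeting representations and the preference order are our own.

def schedule(meetings, rooms):
    """meetings: list of (start, end, attendees); rooms: {name: capacity}.
    Returns {meeting_index: room_name}, preferring rooms already in use."""
    assignment = {}
    bookings = {name: [] for name in rooms}

    def free(name, start, end):
        return all(e <= start or s >= end for s, e in bookings[name])

    # Place the largest meetings first so big rooms are not wasted on small groups.
    order = sorted(range(len(meetings)), key=lambda i: -meetings[i][2])
    for i in order:
        start, end, people = meetings[i]
        # Prefer already-used rooms, then smaller rooms, to consolidate HVAC load.
        candidates = sorted(
            (n for n, cap in rooms.items() if cap >= people and free(n, start, end)),
            key=lambda n: (len(bookings[n]) == 0, rooms[n]))
        if candidates:
            assignment[i] = candidates[0]
            bookings[candidates[0]].append((start, end))
    return assignment
```

Reusing a room that is already conditioned back-to-back avoids heating or cooling a second space, which is the intuition behind the energy savings claimed in the abstract.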
  • ABSTRACT: In a processor having multiple clusters which operate in parallel, the number of clusters in use can be varied dynamically. At the start of each program phase, the configuration option for an interval is run to determine the optimal configuration, which is used until the next phase change is detected. The optimum instruction interval is determined by starting with a minimum interval and doubling it until a low stability factor is reached.
    No preview · Patent · Jan 2012
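The interval-doubling rule in this abstract amounts to a short loop; the function and threshold names below are hypothetical, added only to make the rule concrete.

```python
# Illustrative sketch of the interval-doubling rule: grow the sampling
# interval until the measured stability factor drops below a threshold.
# stability is a caller-supplied measurement; all names are assumptions.

def pick_interval(stability, min_interval, max_interval, threshold):
    interval = min_interval
    while interval < max_interval and stability(interval) >= threshold:
        interval *= 2  # double until behavior over the interval is stable
    return interval
```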
  • ABSTRACT: Resizable caches can trade off capacity for access speed to dynamically match the needs of the workload. In single-threaded cores, resizable caches have demonstrated their ability to improve processor performance by adapting to the phases of the running application. In Simultaneous Multi-Threaded (SMT) cores, the caching needs can vary greatly across the number of threads and their characteristics, thus offering even more opportunities to dynamically adjust cache resources to the workload. In this paper, we demonstrate that the preferred control methodology for data cache reconfiguration in an SMT core changes as the number of running threads increases. In workloads with one or two threads, the resizable cache control algorithm should optimize for cache miss behavior because misses typically form the critical path. In contrast, with several independent threads running, we show that optimizing for cache hit behavior has more impact, since large SMT workloads have other threads to run during a cache miss. Moreover, we demonstrate that these seemingly diametrically opposed policies are closely related mathematically; the former minimizes the arithmetic mean cache access time (which we will call AMAT), while the latter minimizes its harmonic mean. We introduce an algorithm (HAMAT) that smoothly and naturally adjusts between the two strategies with the degree of multi-threading. We extend a previously proposed Globally Asynchronous, Locally Synchronous (GALS) processor core with SMT support and dynamically resizable caches. We show that the HAMAT algorithm significantly outperforms the AMAT algorithm on four-thread workloads while matching its performance on one- and two-thread workloads. Moreover, HAMAT achieves overall performance improvements of 18.7%, 10.1%, and 14.2% on one-, two-, and four-thread workloads, respectively, over the best fixed-configuration cache design.
    No preview · Article · Nov 2011 · Microprocessors and Microsystems
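The arithmetic-versus-harmonic-mean distinction at the heart of this abstract is easy to see numerically. The example below is illustrative only; the latencies and function names are ours, not the paper's.

```python
# Illustrative comparison of arithmetic vs. harmonic mean access time
# across per-thread cache access latencies (cycles). Names are assumptions.
from statistics import harmonic_mean

def amat(latencies):
    """Arithmetic mean access time: dominated by slow, miss-heavy threads."""
    return sum(latencies) / len(latencies)

def hamat(latencies):
    """Harmonic mean access time: dominated by fast, hit-heavy threads."""
    return harmonic_mean(latencies)

lat = [1.0, 1.0, 1.0, 20.0]  # three hit-dominated threads, one miss-dominated
print(amat(lat))   # pulled far above the hit latency by the 20-cycle thread
print(hamat(lat))  # stays close to the hit latency of the majority
```

Minimizing the arithmetic mean rewards shrinking the worst (miss) latency, while minimizing the harmonic mean rewards keeping the common (hit) latency low, matching the one-thread versus many-thread policies the abstract contrasts.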
  • Mark J. Cianchetti · David H. Albonesi
    ABSTRACT: Tens and eventually hundreds of processing cores are projected to be integrated onto future microprocessors, making the global interconnect a key component to achieving scalable chip performance within a given power envelope. While CMOS-compatible nanophotonics has emerged as a leading candidate for replacing global wires beyond the 16nm timeframe, on-chip optical interconnect architectures are typically limited in scalability or are dependent on comparatively slow electrical control networks. In this article, we present a hybrid electrical/optical router for future large scale, cache coherent multicore microprocessors. The heart of the router is a low-latency optical crossbar that uses predecoded source routing and switch state preconfiguration to transmit cache-line-sized packets several hops in a single clock cycle under contentionless conditions. Overall, our optical router achieves 2X better network performance than a state-of-the-art electrical baseline in a mesh topology while consuming 30% less network power.
    No preview · Article · Jun 2011 · ACM Journal on Emerging Technologies in Computing Systems
  • Matthew A. Watkins · David H. Albonesi
    ABSTRACT: ReMAP is a reconfigurable architecture for accelerating and parallelizing applications within a heterogeneous chip multiprocessor (CMP). Clusters of cores share a common reconfigurable fabric adaptable for individual thread computation or fine-grained communication with integrated computation. ReMAP demonstrates significantly higher performance and energy efficiency than hard-wired communication-only mechanisms, and than designs that instead allocate the fabric area to additional or more powerful cores.
    No preview · Article · Jan 2011 · IEEE Micro
  • Matthew A. Watkins · David H. Albonesi
    ABSTRACT: This paper presents ReMAP, a reconfigurable architecture geared towards accelerating and parallelizing applications within a heterogeneous CMP. In ReMAP, threads share a common reconfigurable fabric that can be configured for individual thread computation or fine-grained communication with integrated computation. The architecture supports both fine-grained point-to-point communication for pipeline parallelization and fine-grained barrier synchronization. The combination of communication and configurable computation within ReMAP provides the unique ability to perform customized computation while data is transferred between cores, and to execute custom global functions after barrier synchronization. ReMAP demonstrates significantly higher performance and energy efficiency compared to hard-wired communication-only mechanisms, and over what can ideally be achieved by allocating the fabric area to additional or more powerful cores.
    Preview · Conference Paper · Dec 2010
  • David H. Albonesi
    ABSTRACT: Having reached the end of my second term as editor in chief, the time has come for Micro to move forward to the next phase of its evolution. I am delighted to announce that Dr. Erik Altman from IBM Research is the new Micro EIC. He has already begun making his mark on Micro, and my transition to "former EIC" has been a breeze given Erik's experience, enthusiasm, and ideas for moving Micro forward. The six articles in this issue certainly end my tenure as EIC on a high note.
    No preview · Article · Nov 2010 · IEEE Micro
  • ABSTRACT: Resizable caches can trade off capacity for access speed to dynamically match the needs of the workload. In Simultaneous Multi-Threaded (SMT) cores, the caching needs can vary greatly across the number of threads and their characteristics, offering opportunities to dynamically adjust cache resources to the workload. In this paper we propose the use of resizable caches to improve the performance of SMT cores, and introduce a new control algorithm that provides good results independent of the number of running threads. In workloads with a single thread, the resizable cache control algorithm should optimize for cache miss behavior because misses typically form the critical path. In contrast, with several independent threads running, we show that optimizing for cache hit behavior has more impact, since large SMT workloads have other threads to run during a cache miss. Moreover, we demonstrate that these seemingly diametrically opposed policies can be simultaneously satisfied by using the harmonic mean of the per-thread speedups as the metric to evaluate system performance, and to smoothly and naturally adjust to the degree of multithreading.
    Full-text · Conference Paper · Sep 2010
  • Paula Petrica · Jonathan A. Winter · David H. Albonesi
    ABSTRACT: Future chip multiprocessors (CMPs) will be capable of deconfiguring faulty units in order to permit continued operation in the presence of wear-out failures. However, the unforeseen downside is pipeline imbalance due to other portions of the pipeline now being overprovisioned with respect to the deconfigured functionality. We propose PowerTransfer, a novel CMP architecture that dynamically redistributes the chip power under pipeline imbalances that arise from deconfiguring faulty units. Through rebalancing – achieved by temporary, symbiotic deconfiguration of additional functionality within the degraded core – power is harnessed for use elsewhere on the chip. This additional power is dynamically transferred to portions of the multi-core chip that can realize a performance boost from turning on previously dormant microarchitectural features. We demonstrate that a realistic PowerTransfer manager achieves chip-wide performance improvements of up to 25% compared to architectures that simply deconfigure faulty units without regard to the resulting inefficiency.
    Preview · Article · Jun 2010
  • Matthew A. Watkins · David H. Albonesi
    ABSTRACT: While reconfigurable computing has traditionally involved attaching a reconfigurable fabric to a single processor core, the prospect of large-scale CMPs calls for a reevaluation of reconfigurable computing from the perspective of multicore architectures. We present ReMAPP, a reconfigurable architecture geared towards application acceleration and parallelization. In ReMAPP, parallel threads share a common reconfigurable fabric which can be configured for individual thread computation or fine-grained communication with integrated computation. The architecture supports both fine-grained barrier synchronization and fine-grained point-to-point communication for pipeline parallelization. The combination of communication and configurable computation within ReMAPP provides the unique ability to perform customized computation while data is transferred between cores, and to execute custom global functions after barrier synchronization. We demonstrate that ReMAPP achieves significantly higher performance and energy efficiency compared to hard-wired communication-only mechanisms, and over what can ideally be achieved by allocating the fabric area to more cores.
    Preview · Article · Jun 2010
  • David H. Albonesi
    ABSTRACT: A year ago, the 36th International Symposium on Computer Architecture featured the latest installment of the Computer Architecture Research Directions workshop. CARD is a series of minipanels, in which two experts take somewhat opposing viewpoints on important topics related to the future of computer architecture, under the direction of a moderator. As in previous years, attendees flocked to the CARD workshop to hear these debates. Two years ago, Micro featured a special issue on the 2007 CARD workshop. This issue features two articles derived from those minipanels, followed by two excellent general-interest articles.
    No preview · Article · May 2010 · IEEE Micro
  • Matthew A. Watkins · David H. Albonesi
    ABSTRACT: Prior work has demonstrated that reconfigurable logic can significantly benefit certain applications. However, reconfigurable architectures have traditionally suffered from high area overhead and limited application coverage. We present a dynamically managed multithreaded reconfigurable architecture consisting of multiple clusters of shared reconfigurable fabrics that greatly reduces the area overhead of reconfigurability while still offering the same power efficiency and performance benefits. Like other shared SMT and CMP resources, the dynamic partitioning of the reconfigurable resource among sharing threads, along with the co-scheduling of threads among different reconfigurable clusters, must be intelligently managed for the full benefits of the shared fabrics to be realized. We propose a number of sophisticated dynamic management approaches, including the application of machine learning, multithreaded phase-based management, and stability detection. Overall, we show that, with our dynamic management policies, multithreaded reconfigurable fabrics can achieve better energy × delay², at far less area and power, than providing each core with a much larger private fabric. Moreover, our approach achieves dramatically higher performance and energy efficiency for particular workloads compared to what can be ideally achieved by allocating the fabric area to additional cores.
    Preview · Conference Paper · Jan 2010
  • Jonathan A. Winter · David H. Albonesi · Christine A. Shoemaker
    ABSTRACT: Future many-core microprocessors are likely to be heterogeneous, by design or due to variability and defects. The latter type of heterogeneity is especially challenging due to its unpredictability. To minimize the performance and power impact of these hardware imperfections, the runtime thread scheduler and global power manager must be nimble enough to handle such random heterogeneity. With hundreds of cores expected on a single die in the future, these algorithms must provide high power-performance efficiency, yet remain scalable with low runtime overhead. This paper presents a range of scheduling and power management algorithms and performs a detailed evaluation of their effectiveness and scalability on heterogeneous many-core architectures with up to 256 cores. We also conduct a limit study on the potential benefits of coordinating scheduling and power management and demonstrate that coordination yields little benefit. We highlight the scalability limitations of previously proposed thread scheduling algorithms that were designed for small-scale chip multiprocessors and propose a Hierarchical Hungarian Scheduling Algorithm that dramatically reduces the scheduling overhead without loss of accuracy. Finally, we show that the high computational requirements of prior global power management algorithms based on linear programming make them infeasible for many-core chips, and that an algorithm that we call Steepest Drop achieves orders of magnitude lower execution time without sacrificing power-performance efficiency.
    Full-text · Conference Paper · Jan 2010
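A greedy power manager in the spirit of the "Steepest Drop" idea can be sketched as follows: repeatedly take the step that saves the most power per unit of performance lost until the chip fits its budget. The per-core setting tables and ratio rule below are illustrative assumptions, not the paper's algorithm.

```python
# Hedged sketch of a steepest-drop-style greedy power manager.
# cores: per-core lists of (power, performance) settings, each sorted from
# highest power to lowest. All data shapes here are our own assumptions.

def steepest_drop(cores, budget):
    """Returns the chosen setting index for each core after greedily
    dropping whichever core offers the best power-saved / perf-lost ratio."""
    choice = [0] * len(cores)
    total = sum(levels[0][0] for levels in cores)
    while total > budget:
        best, best_ratio = None, -1.0
        for i, levels in enumerate(cores):
            j = choice[i]
            if j + 1 < len(levels):  # core can still drop a level
                dp = levels[j][0] - levels[j + 1][0]      # power saved
                dperf = levels[j][1] - levels[j + 1][1]   # performance lost
                ratio = dp / dperf if dperf > 0 else float("inf")
                if ratio > best_ratio:
                    best, best_ratio = i, ratio
        if best is None:
            break  # every core is at its lowest setting; budget unreachable
        total -= cores[best][choice[best]][0] - cores[best][choice[best] + 1][0]
        choice[best] += 1
    return choice
```

Because each iteration is a single scan over cores rather than solving a linear program, this style of heuristic scales to hundreds of cores, which is the scalability argument the abstract makes against LP-based global power management.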
  • David H. Albonesi
    ABSTRACT: IEEE Micro Editor in Chief David H. Albonesi welcomes six new members to the IEEE Micro Editorial Board and previews this general-interest issue.
    No preview · Article · Sep 2009 · IEEE Micro
  • Article: Phastlane
    Mark J. Cianchetti · Joseph C. Kerekes · David H. Albonesi

    No preview · Article · Jun 2009 · ACM SIGARCH Computer Architecture News
  • ABSTRACT: In a processor having multiple clusters which operate in parallel, the number of clusters in use can be varied dynamically. At the start of each program phase, the configuration option for an interval is run to determine the optimal configuration, which is used until the next phase change is detected. The optimum instruction interval is determined by starting with a minimum interval and doubling it until a low stability factor is reached.
    No preview · Article · Jan 2009

Publication Stats

4k Citations
38.41 Total Impact Points

Institutions

  • 2005-2013
    • Cornell University
      • Department of Electrical and Computer Engineering
      Ithaca, New York, United States
  • 1997-2006
    • University of Rochester
      • Department of Electrical and Computer Engineering
      Rochester, NY, United States
  • 2003
    • Rochester Institute of Technology
      • Department of Computer Engineering
      Rochester, New York, United States
  • 1995-1998
    • University of Massachusetts Amherst
      • Department of Electrical and Computer Engineering
      Amherst Center, MA, United States