David H. Albonesi

Cornell University, Ithaca, New York, United States

Publications (120) · 44.23 Total impact

  • ABSTRACT: Future microprocessors may become so power constrained that not all transistors can be powered on at once. These systems will be required to nimbly adapt to changes in the chip power that is allocated to general-purpose cores and to specialized accelerators. This paper presents Flicker, a general-purpose multicore architecture that dynamically adapts to varying and potentially stringent limits on allocated power. The Flicker core microarchitecture includes deconfigurable lanes--horizontal slices through the pipeline--that permit tailoring an individual core to the running application with lower overhead than microarchitecture-level adaptation, and greater flexibility than core-level power gating. To exploit Flicker's flexible pipeline architecture, a new online multicore optimization algorithm combines reduced sampling techniques, application of response surface models to online optimization, and heuristic online search. The approach efficiently finds a near-global-optimum configuration of lanes without requiring offline training, microarchitecture state, or foreknowledge of the workload. At high power allocations, core-level gating is highly effective and slightly outperforms Flicker overall. However, under stringent power constraints, Flicker significantly outperforms core-level gating, achieving an average 27% performance improvement.
    Proceedings of the 40th Annual International Symposium on Computer Architecture; 06/2013
  • Abhinandan Majumdar · David H. Albonesi · Pradip Bose
    ABSTRACT: The increasing worldwide concern over the energy consumption of commercial buildings calls for new approaches that analyze scheduled occupant activities and proactively take steps to curb building energy use. As one step in this direction, we propose to automate the scheduling of meetings in a way that uses available meeting rooms in an energy efficient manner, while adhering to time conflicts and capacity constraints. We devise a number of scheduling algorithms, ranging from greedy to heuristic approaches, and demonstrate up to a 70% reduction in energy use, with the best algorithms producing schedules whose energy use matches that of a brute force oracle.
    Proceedings of the Fourth ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Buildings; 11/2012
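The greedy end of the algorithm range described above can be sketched roughly as follows. This is an illustrative toy, not the paper's algorithm: the room model, capacities, and the packing heuristic are assumptions, built on the idea that every room left unopened can stay unconditioned.

```python
def greedy_schedule(meetings, room_capacities):
    """Toy greedy scheduler: pack meetings into as few rooms as possible,
    since each unused room can remain unconditioned (saving HVAC energy).
    meetings: list of (start, end, attendees); returns (plan, rooms_used)."""
    rooms = sorted(range(len(room_capacities)), key=lambda r: room_capacities[r])
    bookings = {r: [] for r in rooms}  # room -> list of (start, end) slots
    opened = []                        # rooms already in use, in opening order
    plan = []
    for start, end, size in sorted(meetings):
        # Prefer rooms that are already open; open the smallest new one only if needed.
        for r in opened + [r for r in rooms if r not in opened]:
            fits = room_capacities[r] >= size
            free = all(end <= s or start >= e for s, e in bookings[r])
            if fits and free:
                bookings[r].append((start, end))
                if r not in opened:
                    opened.append(r)
                plan.append((start, end, size, r))
                break
        else:
            raise ValueError("no feasible room for meeting")
    return plan, len(opened)
```

With three meetings and two rooms of capacity 5 and 10, the two overlapping meetings open both rooms, and the later meeting reuses the small room rather than conditioning a third.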
  • ABSTRACT: In a processor having multiple clusters that operate in parallel, the number of clusters in use can be varied dynamically. At the start of each program phase, each configuration option is run for a sampling interval to determine the optimal configuration, which is then used until the next phase change is detected. The optimal instruction interval is determined by starting with a minimum interval and doubling it until a low stability factor is reached.
    Year: 01/2012
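The interval-doubling step in the abstract can be sketched as a minimal loop; the stability metric, threshold, and all names here are illustrative assumptions rather than the patent's definitions:

```python
def choose_interval(stability, min_interval=1000, max_interval=1 << 20, threshold=0.05):
    """Start from a minimum instruction interval and keep doubling it
    until the measured stability factor falls to an acceptably low value
    (or a cap is reached). `stability(interval)` is assumed to return a
    phase-variability score for that sampling interval."""
    interval = min_interval
    while stability(interval) > threshold and interval < max_interval:
        interval *= 2
    return interval
```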
  • ABSTRACT: Resizable caches can trade off capacity for access speed to dynamically match the needs of the workload. In single-threaded cores, resizable caches have demonstrated their ability to improve processor performance by adapting to the phases of the running application. In Simultaneous Multi-Threaded (SMT) cores, the caching needs can vary greatly across the number of threads and their characteristics, thus offering even more opportunities to dynamically adjust cache resources to the workload. In this paper, we demonstrate that the preferred control methodology for data cache reconfiguration in an SMT core changes as the number of running threads increases. In workloads with one or two threads, the resizable cache control algorithm should optimize for cache miss behavior because misses typically form the critical path. In contrast, with several independent threads running, we show that optimizing for cache hit behavior has more impact, since large SMT workloads have other threads to run during a cache miss. Moreover, we demonstrate that these seemingly diametrically opposed policies are closely related mathematically; the former minimizes the arithmetic mean cache access time (which we call AMAT), while the latter minimizes its harmonic mean. We introduce an algorithm (HAMAT) that smoothly and naturally adjusts between the two strategies with the degree of multi-threading. We extend a previously proposed Globally Asynchronous, Locally Synchronous (GALS) processor core with SMT support and dynamically resizable caches. We show that the HAMAT algorithm significantly outperforms the AMAT algorithm on four-thread workloads while matching its performance on one- and two-thread workloads. Moreover, HAMAT achieves overall performance improvements of 18.7%, 10.1%, and 14.2% on one-, two-, and four-thread workloads, respectively, over the best fixed-configuration cache design.
    Microprocessors and Microsystems 11/2011; 35:683-694. DOI:10.1016/j.micpro.2011.08.008 · 0.60 Impact Factor
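The two control objectives above differ only in which mean they minimize, and the harmonic mean's sensitivity to slow threads is what makes it the better target at high thread counts. A tiny numeric illustration (the speedup values are made up, not data from the paper):

```python
def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def harmonic_mean(xs):
    # Dominated by the smallest values: one slow thread drags it down
    # far more than it drags down the arithmetic mean.
    return len(xs) / sum(1.0 / x for x in xs)

speedups = [1.0, 1.0, 0.25]  # two healthy threads, one cache-starved thread
```

Here the arithmetic mean is 0.75 while the harmonic mean is 0.5, so an optimizer targeting the harmonic mean is pushed to help the straggling thread first.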
  • Mark J. Cianchetti · David H. Albonesi
    ABSTRACT: Tens and eventually hundreds of processing cores are projected to be integrated onto future microprocessors, making the global interconnect a key component to achieving scalable chip performance within a given power envelope. While CMOS-compatible nanophotonics has emerged as a leading candidate for replacing global wires beyond the 16nm timeframe, on-chip optical interconnect architectures are typically limited in scalability or are dependent on comparatively slow electrical control networks. In this article, we present a hybrid electrical/optical router for future large scale, cache coherent multicore microprocessors. The heart of the router is a low-latency optical crossbar that uses predecoded source routing and switch state preconfiguration to transmit cache-line-sized packets several hops in a single clock cycle under contentionless conditions. Overall, our optical router achieves 2X better network performance than a state-of-the-art electrical baseline in a mesh topology while consuming 30% less network power.
    ACM Journal on Emerging Technologies in Computing Systems 06/2011; 7:9. DOI:10.1145/1970406.1970411 · 0.83 Impact Factor
  • Matthew A. Watkins · D.H. Albonesi
    ABSTRACT: ReMAP is a reconfigurable architecture for accelerating and parallelizing applications within a heterogeneous chip multiprocessor (CMP). Clusters of cores share a common reconfigurable fabric adaptable for individual thread computation or fine-grained communication with integrated computation. ReMAP demonstrates significantly higher performance and energy efficiency than hard-wired communication-only mechanisms, and than allocating the fabric area to additional or more powerful cores.
    IEEE Micro 01/2011; 31(1):65-77. DOI:10.1109/MM.2011.14 · 1.81 Impact Factor
  • David H. Albonesi
    ABSTRACT: Having reached the end of my second term as editor in chief, the time has come for Micro to move forward to the next phase of its evolution. I am delighted to announce that Dr. Erik Altman from IBM Research is the new Micro EIC. He has already begun making his mark on Micro, and my transition to "former EIC" has been a breeze given Erik's experience, enthusiasm, and ideas for moving Micro forward. The six articles in this issue certainly end my tenure as EIC on a high note.
    IEEE Micro 11/2010; 30:4-5. DOI:10.1109/MM.2010.117 · 1.81 Impact Factor
  • Source
    Matthew A. Watkins · David H. Albonesi
    ABSTRACT: While reconfigurable computing has traditionally involved attaching a reconfigurable fabric to a single processor core, the prospect of large-scale CMPs calls for a reevaluation of reconfigurable computing from the perspective of multicore architectures. We present ReMAP, a reconfigurable architecture geared towards application acceleration and parallelization. In ReMAP, parallel threads share a common reconfigurable fabric which can be configured for individual thread computation or fine-grained communication with integrated computation. The architecture supports both fine-grained barrier synchronization and fine-grained point-to-point communication for pipeline parallelization. The combination of communication and configurable computation within ReMAP provides the unique ability to perform customized computation while data is transferred between cores, and to execute custom global functions after barrier synchronization. We demonstrate that ReMAP achieves significantly higher performance and energy efficiency compared to hard-wired communication-only mechanisms, and over what can ideally be achieved by allocating the fabric area to more cores.
  • David H. Albonesi
    ABSTRACT: A year ago, the 36th International Symposium on Computer Architecture featured the latest installment of the Computer Architecture Research Directions workshop. CARD is a series of minipanels, in which two experts take somewhat opposing viewpoints on important topics related to the future of computer architecture, under the direction of a moderator. As in previous years, attendees flocked to the CARD workshop to hear these debates. Two years ago, Micro featured a special issue on the 2007 CARD workshop. This issue features two articles derived from those minipanels, followed by two excellent general-interest articles.
    IEEE Micro 05/2010; 30:5. DOI:10.1109/MM.2010.52 · 1.81 Impact Factor
  • Source
    ABSTRACT: Resizable caches can trade-off capacity for access speed to dynamically match the needs of the workload. In Simultaneous Multi-Threaded (SMT) cores, the caching needs can vary greatly across the number of threads and their characteristics, offering opportunities to dynamically adjust cache resources to the workload. In this paper we propose the use of resizable caches in order to improve the performance of SMT cores, and introduce a new control algorithm that provides good results independent of the number of running threads. In workloads with a single thread, the resizable cache control algorithm should optimize for cache miss behavior because misses typically form the critical path. In contrast, with several independent threads running, we show that optimizing for cache hit behavior has more impact, since large SMT workloads have other threads to run during a cache miss. Moreover, we demonstrate that these seemingly diametrically opposed policies can be simultaneously satisfied by using the harmonic mean of the per-thread speedups as the metric to evaluate the system performance, and to smoothly and naturally adjust to the degree of multithreading.
    13th Euromicro Conference on Digital System Design, Architectures, Methods and Tools, DSD 2010, 1-3 September 2010, Lille, France; 01/2010
  • Source
    Matthew A. Watkins · David H. Albonesi
    ABSTRACT: This paper presents ReMAP, a reconfigurable architecture geared towards accelerating and parallelizing applications within a heterogeneous CMP. In ReMAP, threads share a common reconfigurable fabric that can be configured for individual thread computation or fine-grained communication with integrated computation. The architecture supports both fine-grained point-to-point communication for pipeline parallelization and fine-grained barrier synchronization. The combination of communication and configurable computation within ReMAP provides the unique ability to perform customized computation while data is transferred between cores, and to execute custom global functions after barrier synchronization. ReMAP demonstrates significantly higher performance and energy efficiency compared to hard-wired communication-only mechanisms, and over what can ideally be achieved by allocating the fabric area to additional or more powerful cores.
    43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2010, 4-8 December 2010, Atlanta, Georgia, USA; 01/2010
  • Source
    Jonathan A. Winter · David H. Albonesi · Christine A. Shoemaker
    ABSTRACT: Future many-core microprocessors are likely to be heterogeneous, by design or due to variability and defects. The latter type of heterogeneity is especially challenging due to its unpredictability. To minimize the performance and power impact of these hardware imperfections, the runtime thread scheduler and global power manager must be nimble enough to handle such random heterogeneity. With hundreds of cores expected on a single die in the future, these algorithms must provide high power-performance efficiency, yet remain scalable with low runtime overhead. This paper presents a range of scheduling and power management algorithms and performs a detailed evaluation of their effectiveness and scalability on heterogeneous many-core architectures with up to 256 cores. We also conduct a limit study on the potential benefits of coordinating scheduling and power management and demonstrate that coordination yields little benefit. We highlight the scalability limitations of previously proposed thread scheduling algorithms that were designed for small-scale chip multiprocessors and propose a Hierarchical Hungarian Scheduling Algorithm that dramatically reduces the scheduling overhead without loss of accuracy. Finally, we show that the high computational requirements of prior global power management algorithms based on linear programming make them infeasible for many-core chips, and that an algorithm that we call Steepest Drop achieves orders of magnitude lower execution time without sacrificing power-performance efficiency.
    19th International Conference on Parallel Architecture and Compilation Techniques (PACT 2010), Vienna, Austria, September 11-15, 2010; 01/2010
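The flavor of a low-overhead greedy global power manager can be illustrated with a sketch like the one below: repeatedly step down the core whose next setting sacrifices the least performance per watt saved, until the chip fits its budget. This is an assumed simplification for illustration, not the paper's actual Steepest Drop algorithm:

```python
def greedy_power_cap(levels, budget):
    """Greedy global power manager sketch. levels[c] is a list of
    (power, performance) settings for core c, ordered best-first.
    Steps cores down one setting at a time, always choosing the core
    whose next step loses the least performance per watt saved."""
    idx = [0] * len(levels)  # current setting index per core

    def total_power():
        return sum(levels[c][idx[c]][0] for c in range(len(levels)))

    while total_power() > budget:
        best_core, best_ratio = None, None
        for c in range(len(levels)):
            if idx[c] + 1 >= len(levels[c]):
                continue  # core already at its lowest setting
            p0, f0 = levels[c][idx[c]]
            p1, f1 = levels[c][idx[c] + 1]
            ratio = (f0 - f1) / (p0 - p1)  # performance lost per watt saved
            if best_ratio is None or ratio < best_ratio:
                best_core, best_ratio = c, ratio
        if best_core is None:
            raise ValueError("power budget infeasible")
        idx[best_core] += 1
    return idx
```

With two identical cores under a 15 W budget, the manager throttles only the core whose step-down costs less performance for the same 5 W savings.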
  • Source
    Matthew A. Watkins · David H. Albonesi
    ABSTRACT: Prior work has demonstrated that reconfigurable logic can significantly benefit certain applications. However, reconfigurable architectures have traditionally suffered from high area overhead and limited application coverage. We present a dynamically managed multithreaded reconfigurable architecture consisting of multiple clusters of shared reconfigurable fabrics that greatly reduces the area overhead of reconfigurability while still offering the same power efficiency and performance benefits. Like other shared SMT and CMP resources, the dynamic partitioning of the reconfigurable resource among sharing threads, along with the co-scheduling of threads among different reconfigurable clusters, must be intelligently managed for the full benefits of the shared fabrics to be realized. We propose a number of sophisticated dynamic management approaches, including the application of machine learning, multithreaded phase-based management, and stability detection. Overall, we show that, with our dynamic management policies, multithreaded reconfigurable fabrics can achieve better energy×delay², at far less area and power, than providing each core with a much larger private fabric. Moreover, our approach achieves dramatically higher performance and energy-efficiency for particular workloads compared to what can be ideally achieved by allocating the fabric area to additional cores.
    19th International Conference on Parallel Architecture and Compilation Techniques (PACT 2010), Vienna, Austria, September 11-15, 2010; 01/2010
  • David H. Albonesi
    ABSTRACT: IEEE Micro Editor in Chief David H. Albonesi welcomes six new members to the IEEE Micro Editorial Board and previews this general interest issue.
    IEEE Micro 09/2009; 29(5):2-5. DOI:10.1109/MM.2009.78 · 1.81 Impact Factor
  • Source
    Mark J. Cianchetti · Joseph C. Kerekes · David H. Albonesi
    ABSTRACT: Tens and eventually hundreds of processing cores are projected to be integrated onto future microprocessors, making the global interconnect a key component to achieving scalable chip performance within a given power envelope. While CMOS-compatible nanophotonics has emerged as a leading candidate for replacing global wires beyond the 22nm timeframe, on-chip optical interconnect architectures proposed thus far are either limited in scalability or are dependent on comparatively slow electrical control networks. In this paper, we present Phastlane, a hybrid electrical/optical routing network for future large scale, cache coherent multicore microprocessors. The heart of the Phastlane network is a low-latency optical crossbar that uses simple predecoded source routing to transmit cache-line-sized packets several hops in a single clock cycle under contentionless conditions. When contention exists, the router makes use of electrical buffers and, if necessary, a high speed drop signaling network. Overall, Phastlane achieves 2X better network performance than a state-of-the-art electrical baseline while consuming 80% less network power.
    36th International Symposium on Computer Architecture (ISCA 2009), June 20-24, 2009, Austin, TX, USA; 01/2009
  • Source
    M.A. Watkins · M.J. Cianchetti · D.H. Albonesi
    ABSTRACT: This paper investigates reconfigurable architectures suitable for chip multiprocessors (CMPs). Prior research has established that augmenting a conventional processor with reconfigurable logic can dramatically improve the performance of certain application classes, but this comes at non-trivial power and area costs. Given substantial observed time and space differences in fabric usage, we propose that pools of programmable logic should be shared among multiple cores. While a common shared pool is more compact and power efficient, fabric conflicts may lead to large performance losses relative to per-core private fabrics. We identify particular characteristics of past reconfigurable fabric designs that are particularly amenable to fabric sharing. We then propose spatially and temporally shared fabrics in a CMP. The sharing policies that we devise incur negligible performance loss compared to private fabrics, while cutting the area and peak power of the fabric by 4X.
    Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on; 10/2008
  • Source
    J.A. Winter · D.H. Albonesi
    ABSTRACT: In future large-scale multi-core microprocessors, hard errors and process variations will create dynamic heterogeneity, causing performance and power characteristics to differ among the cores in an unanticipated manner. Under this scenario, naive assignments of applications to cores degraded by various faults and variations may result in large performance losses and power inefficiencies. We propose scheduling algorithms based on the Hungarian Algorithm and artificial intelligence (AI) search techniques that account for this future uncertainty in core characteristics. These thread assignment policies effectively match the capabilities of each degraded core with the requirements of the applications, achieving an ED² only 3.2% and 3.7% higher, respectively, than a baseline eight core chip multiprocessor with no degradation, compared to over 22% for a round robin policy.
    Dependable Systems and Networks With FTCS and DCC, 2008. DSN 2008. IEEE International Conference on; 07/2008
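The assignment problem at the heart of this approach, matching threads to unevenly degraded cores, can be solved exactly for small core counts by brute force; the Hungarian Algorithm reaches the same answer in O(n^3) time. The cost values below are invented for illustration:

```python
from itertools import permutations

def best_assignment(cost):
    """Find the thread->core mapping that minimizes total cost by brute force.
    cost[t][c] = estimated penalty (e.g., ED^2) of running thread t on core c,
    where degraded cores carry higher costs for demanding threads."""
    n = len(cost)
    total = lambda perm: sum(cost[t][perm[t]] for t in range(n))
    best = min(permutations(range(n)), key=total)
    return list(best), total(best)
```

For a toy 3x3 cost matrix, the optimal mapping routes each thread around the core that would penalize it most, rather than assigning round-robin.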

Publication Stats

4k Citations
44.23 Total Impact Points

Institutions

  • 2005–2013
    • Cornell University
      • Department of Electrical and Computer Engineering
      Ithaca, New York, United States
  • 1997–2005
    • University of Rochester
      • Department of Electrical and Computer Engineering
      Rochester, NY, United States
  • 2004
    • Rochester Institute of Technology
      • Department of Computer Engineering
      Rochester, NY, United States
  • 1995–1998
    • University of Massachusetts Amherst
      • Department of Electrical and Computer Engineering
      Amherst Center, MA, United States