David H. Albonesi

Cornell University, Ithaca, New York, United States

Publications (112), 36.69 Total impact

  • ABSTRACT: In a processor having multiple clusters which operate in parallel, the number of clusters in use can be varied dynamically. At the start of each program phase, the configuration option for an interval is run to determine the optimal configuration, which is used until the next phase change is detected. The optimum instruction interval is determined by starting with a minimum interval and doubling it until a low stability factor is reached.
    Year: 01/2012
  • ABSTRACT: Resizable caches can trade off capacity for access speed to dynamically match the needs of the workload. In single-threaded cores, resizable caches have demonstrated their ability to improve processor performance by adapting to the phases of the running application. In Simultaneous Multi-Threaded (SMT) cores, the caching needs can vary greatly across the number of threads and their characteristics, thus offering even more opportunities to dynamically adjust cache resources to the workload. In this paper, we demonstrate that the preferred control methodology for data cache reconfiguring in an SMT core changes as the number of running threads increases. In workloads with one or two threads, the resizable cache control algorithm should optimize for cache miss behavior because misses typically form the critical path. In contrast, with several independent threads running, we show that optimizing for cache hit behavior has more impact, since large SMT workloads have other threads to run during a cache miss. Moreover, we demonstrate that these seemingly diametrically opposed policies are closely related mathematically; the former minimizes the arithmetic mean cache access time (which we will call AMAT), while the latter minimizes its harmonic mean. We introduce an algorithm (HAMAT) that smoothly and naturally adjusts between the two strategies with the degree of multi-threading. We extend a previously proposed Globally Asynchronous, Locally Synchronous (GALS) processor core with SMT support and dynamically resizable caches. We show that the HAMAT algorithm significantly outperforms the AMAT algorithm on four-thread workloads while matching its performance on one- and two-thread workloads. Moreover, HAMAT achieves overall performance improvements of 18.7%, 10.1%, and 14.2% on one-, two-, and four-thread workloads, respectively, over the best fixed-configuration cache design.
    Microprocessors and Microsystems - Embedded Hardware Design. 01/2011; 35:683-694.
  • M.A. Watkins, D.H. Albonesi
    ABSTRACT: ReMAP is a reconfigurable architecture for accelerating and parallelizing applications within a heterogeneous chip multiprocessor (CMP). Clusters of cores share a common reconfigurable fabric adaptable for individual thread computation or fine-grained communication with integrated computation. ReMAP demonstrates significantly higher performance and energy efficiency than hard-wired communication-only mechanisms, and over allocating the fabric area to additional or more powerful cores.
    IEEE Micro 01/2011; 31(1):65-77. · 2.39 Impact Factor
  • Mark J. Cianchetti, David H. Albonesi
    ABSTRACT: Tens and eventually hundreds of processing cores are projected to be integrated onto future microprocessors, making the global interconnect a key component to achieving scalable chip performance within a given power envelope. While CMOS-compatible nanophotonics has emerged as a leading candidate for replacing global wires beyond the 16nm timeframe, on-chip optical interconnect architectures are typically limited in scalability or are dependent on comparatively slow electrical control networks. In this article, we present a hybrid electrical/optical router for future large scale, cache coherent multicore microprocessors. The heart of the router is a low-latency optical crossbar that uses predecoded source routing and switch state preconfiguration to transmit cache-line-sized packets several hops in a single clock cycle under contentionless conditions. Overall, our optical router achieves 2X better network performance than a state-of-the-art electrical baseline in a mesh topology while consuming 30% less network power.
    JETC. 01/2011; 7:9.
  • Matthew A. Watkins, David H. Albonesi
    ABSTRACT: This paper presents ReMAP, a reconfigurable architecture geared towards accelerating and parallelizing applications within a heterogeneous CMP. In ReMAP, threads share a common reconfigurable fabric that can be configured for individual thread computation or fine-grained communication with integrated computation. The architecture supports both fine-grained point-to-point communication for pipeline parallelization and fine-grained barrier synchronization. The combination of communication and configurable computation within ReMAP provides the unique ability to perform customized computation while data is transferred between cores, and to execute custom global functions after barrier synchronization. ReMAP demonstrates significantly higher performance and energy efficiency compared to hard-wired communication-only mechanisms, and over what can ideally be achieved by allocating the fabric area to additional or more powerful cores.
    43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2010, 4-8 December 2010, Atlanta, Georgia, USA; 01/2010
  • Matthew A. Watkins, David H. Albonesi
    ABSTRACT: Prior work has demonstrated that reconfigurable logic can significantly benefit certain applications. However, reconfigurable architectures have traditionally suffered from high area overhead and limited application coverage. We present a dynamically managed multithreaded reconfigurable architecture consisting of multiple clusters of shared reconfigurable fabrics that greatly reduces the area overhead of reconfigurability while still offering the same power efficiency and performance benefits. Like other shared SMT and CMP resources, the dynamic partitioning of the reconfigurable resource among sharing threads, along with the co-scheduling of threads among different reconfigurable clusters, must be intelligently managed for the full benefits of the shared fabrics to be realized. We propose a number of sophisticated dynamic management approaches, including the application of machine learning, multithreaded phase-based management, and stability detection. Overall, we show that, with our dynamic management policies, multithreaded reconfigurable fabrics can achieve better energy×delay², at far less area and power, than providing each core with a much larger private fabric. Moreover, our approach achieves dramatically higher performance and energy-efficiency for particular workloads compared to what can be ideally achieved by allocating the fabric area to additional cores.
    19th International Conference on Parallel Architecture and Compilation Techniques (PACT 2010), Vienna, Austria, September 11-15, 2010; 01/2010
  • ABSTRACT: Resizable caches can trade off capacity for access speed to dynamically match the needs of the workload. In Simultaneous Multi-Threaded (SMT) cores, the caching needs can vary greatly across the number of threads and their characteristics, offering opportunities to dynamically adjust cache resources to the workload. In this paper we propose the use of resizable caches in order to improve the performance of SMT cores, and introduce a new control algorithm that provides good results independent of the number of running threads. In workloads with a single thread, the resizable cache control algorithm should optimize for cache miss behavior because misses typically form the critical path. In contrast, with several independent threads running, we show that optimizing for cache hit behavior has more impact, since large SMT workloads have other threads to run during a cache miss. Moreover, we demonstrate that these seemingly diametrically opposed policies can be simultaneously satisfied by using the harmonic mean of the per-thread speedups as the metric to evaluate the system performance, and to smoothly and naturally adjust to the degree of multithreading.
    13th Euromicro Conference on Digital System Design, Architectures, Methods and Tools, DSD 2010, 1-3 September 2010, Lille, France; 01/2010
  • ABSTRACT: Future many-core microprocessors are likely to be heterogeneous, by design or due to variability and defects. The latter type of heterogeneity is especially challenging due to its unpredictability. To minimize the performance and power impact of these hardware imperfections, the runtime thread scheduler and global power manager must be nimble enough to handle such random heterogeneity. With hundreds of cores expected on a single die in the future, these algorithms must provide high power-performance efficiency, yet remain scalable with low runtime overhead. This paper presents a range of scheduling and power management algorithms and performs a detailed evaluation of their effectiveness and scalability on heterogeneous many-core architectures with up to 256 cores. We also conduct a limit study on the potential benefits of coordinating scheduling and power management and demonstrate that coordination yields little benefit. We highlight the scalability limitations of previously proposed thread scheduling algorithms that were designed for small-scale chip multiprocessors and propose a Hierarchical Hungarian Scheduling Algorithm that dramatically reduces the scheduling overhead without loss of accuracy. Finally, we show that the high computational requirements of prior global power management algorithms based on linear programming make them infeasible for many-core chips, and that an algorithm that we call Steepest Drop achieves orders of magnitude lower execution time without sacrificing power-performance efficiency.
    19th International Conference on Parallel Architecture and Compilation Techniques (PACT 2010), Vienna, Austria, September 11-15, 2010; 01/2010
  • ABSTRACT: In a processor having multiple clusters which operate in parallel, the number of clusters in use can be varied dynamically. At the start of each program phase, the configuration option for an interval is run to determine the optimal configuration, which is used until the next phase change is detected. The optimum instruction interval is determined by starting with a minimum interval and doubling it until a low stability factor is reached.
    01/2009;
  • ABSTRACT: Tens and eventually hundreds of processing cores are projected to be integrated onto future microprocessors, making the global interconnect a key component to achieving scalable chip performance within a given power envelope. While CMOS-compatible nanophotonics has emerged as a leading candidate for replacing global wires beyond the 22nm timeframe, on-chip optical interconnect architectures proposed thus far are either limited in scalability or are dependent on comparatively slow electrical control networks. In this paper, we present Phastlane, a hybrid electrical/optical routing network for future large scale, cache coherent multicore microprocessors. The heart of the Phastlane network is a low-latency optical crossbar that uses simple predecoded source routing to transmit cache-line-sized packets several hops in a single clock cycle under contentionless conditions. When contention exists, the router makes use of electrical buffers and, if necessary, a high speed drop signaling network. Overall, Phastlane achieves 2X better network performance than a state-of-the-art electrical baseline while consuming 80% less network power.
    36th International Symposium on Computer Architecture (ISCA 2009), June 20-24, 2009, Austin, TX, USA; 01/2009
  • M.A. Watkins, M.J. Cianchetti, D.H. Albonesi
    ABSTRACT: This paper investigates reconfigurable architectures suitable for chip multiprocessors (CMPs). Prior research has established that augmenting a conventional processor with reconfigurable logic can dramatically improve the performance of certain application classes, but this comes at non-trivial power and area costs. Given substantial observed time and space differences in fabric usage, we propose that pools of programmable logic should be shared among multiple cores. While a common shared pool is more compact and power efficient, fabric conflicts may lead to large performance losses relative to per-core private fabrics. We identify particular characteristics of past reconfigurable fabric designs that are particularly amenable to fabric sharing. We then propose spatially and temporally shared fabrics in a CMP. The sharing policies that we devise incur negligible performance loss compared to private fabrics, while cutting the area and peak power of the fabric by 4X.
    Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on; 10/2008
  • J.A. Winter, D.H. Albonesi
    ABSTRACT: In future large-scale multi-core microprocessors, hard errors and process variations will create dynamic heterogeneity, causing performance and power characteristics to differ among the cores in an unanticipated manner. Under this scenario, naive assignments of applications to cores degraded by various faults and variations may result in large performance losses and power inefficiencies. We propose scheduling algorithms based on the Hungarian Algorithm and artificial intelligence (AI) search techniques that account for this future uncertainty in core characteristics. These thread assignment policies effectively match the capabilities of each degraded core with the requirements of the applications, achieving an ED² only 3.2% and 3.7% higher, respectively, than a baseline eight core chip multiprocessor with no degradation, compared to over 22% for a round robin policy.
    Dependable Systems and Networks With FTCS and DCC, 2008. DSN 2008. IEEE International Conference on; 07/2008
  • Jonathan A. Winter, David H. Albonesi
    ABSTRACT: We explore DTM techniques within the context of uniform and nonuniform SMT workloads. While DVS is suitable for addressing workloads with uniformly high temperatures, for nonuniform workloads, performance loss occurs because of the slowdown of the cooler thread. To address this, we propose and evaluate DTM mechanisms that exploit the steering-based thread management mechanisms inherent in a clustered SMT architecture. We show that in contrast to DVS, which operates globally, our techniques are more effective at controlling temperature for nonuniform workloads. Furthermore, we devise a DTM technique that combines steering and DVS to achieve consistently good performance across all workloads.
    TACO. 01/2008; 5.
  • David H. Albonesi
    ABSTRACT: A productive, healthy debate is informative and insightful, and collectively moves the debaters and the audience closer to a potential solution or agreement by examining the issues from multiple viewpoints and critiquing those arguments. Micro editor in chief David Albonesi introduces four debates on computer architecture from the Computer Architecture Research Directions (CARD) workshop that was held as part of the 13th Annual IEEE/ACM International Symposium on High-Performance Computer Architecture in February 2007.
    IEEE Micro 12/2007; 27(6):6-6. · 2.39 Impact Factor
  • David H. Albonesi
    ABSTRACT: Micro's editor in chief introduces the topics covered by the four articles in this general-interest issue: an interconnection network using highly integrated photonic technology; the ManySim simulation framework for future large-scale chip-multiprocessors; the SimWattch simulation framework, which integrates the Simics functional simulator with the SimpleScalar/Wattch microarchitecture simulators; and self-configuring embedded systems.
    IEEE Micro 08/2007; 27(4):3-4. · 2.39 Impact Factor
  • David H. Albonesi
    ABSTRACT: While leading computing corporations have instituted "green data center" and "eco-responsible computing" initiatives, the computer architecture community as a whole has drifted away from power-aware architecture and on to the next topic. Arguably, power remains the computer architecture topic with the most potential for societal impact. Albonesi exhorts Micro readers to re-emphasize power-related research and outlines a few of the most pressing issues.
    IEEE Micro 06/2007; · 2.39 Impact Factor
  • David H. Albonesi
    ABSTRACT: Despite the move away from very high-frequency, high-ILP cores to multiple, more modest cores ("multicore"), power is still a huge, unsolved problem for the microprocessor industry. The emphasis is no longer power-aware processor microarchitecture but power-aware systems architecture. The "system" extends from the multicore system-on-chip to the external memory, disks, indeed to the entire enterprise. The data center has arisen as a major target of power-related computer architecture research. The greater question is, in our attempts to make the world's information available to all in the blink of an eye, what is the environmental cost, and how can we as a research community address this problem?
    IEEE Micro 04/2007; 27(2):4-5. · 2.39 Impact Factor
  • ABSTRACT: This work investigates the integration of CMOS-compatible optical technology to on-chip coherent buses for future CMPs. The analysis results in a hierarchical optoelectrical bus that exploits the advantages of optical technology while abiding by projected limitations. This bus achieves significant performance improvement for high-bandwidth applications relative to a state-of-the-art fully electrical bus.
    IEEE Micro 02/2007; · 2.39 Impact Factor
  • David H. Albonesi
    ABSTRACT: The new Editor in Chief of IEEE Micro introduces himself and the first issue of 2007. He thanks outgoing Editor in Chief Pradip Bose for his outstanding work on Micro during his tenure. He assesses the current state of the microarchitecture field, speculates on the future, and asks readers for their suggestions on topics the magazine should cover in coming issues.
    IEEE Micro 02/2007; · 2.39 Impact Factor

Publication Stats

3k Citations
36.69 Total Impact Points

Institutions

  • 2005–2011
    • Cornell University
      • Department of Electrical and Computer Engineering
      Ithaca, New York, United States
  • 1997–2005
    • University of Rochester
      • Department of Electrical and Computer Engineering
      • Department of Computer Science
      Rochester, NY, United States
  • 2004
    • Rochester Institute of Technology
      • Department of Computer Engineering
      Rochester, NY, United States
  • 1995–1998
    • University of Massachusetts Amherst
      • Department of Electrical and Computer Engineering
      Amherst Center, MA, United States