Doug Burger's research while affiliated with Microsoft and other places
What is this page?
This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.
It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.
If you're a ResearchGate member, you can follow this page to keep up with this author's work.
If you are this author, and you don't want us to display this page anymore, please let us know.
Publications (239)
This paper introduces Block Data Representations (BDR), a framework for exploring and evaluating a wide spectrum of narrow-precision formats for deep learning. It enables comparison of popular quantization standards, and through BDR, new formats based on shared microexponents (MX) are identified, which outperform other state-of-the-art quantization...
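As a rough illustration of the block-scaled quantization idea this abstract describes, the sketch below quantizes values in small blocks that share a single power-of-two scale. The block size, mantissa width, and rounding are illustrative assumptions, not the MX format definition from the paper.

```python
# Illustrative sketch only: block-wise quantization with one shared
# power-of-two scale per block, in the spirit of shared microexponents.
import numpy as np

def quantize_blockwise(x, block_size=16, mantissa_bits=4):
    """Quantize a 1-D array in blocks that share a power-of-two scale."""
    x = np.asarray(x, dtype=np.float32)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    # Shared exponent per block, chosen from the largest magnitude in the block.
    max_mag = np.max(np.abs(blocks), axis=1, keepdims=True)
    safe_mag = np.where(max_mag > 0, max_mag, 1.0)
    shared_exp = np.floor(np.log2(safe_mag))
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))

    # Narrow-precision mantissas relative to the shared scale.
    q = np.clip(np.round(blocks / scale),
                -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1)
    return (q * scale).reshape(-1)[:len(x)]

if __name__ == "__main__":
    weights = np.random.randn(64).astype(np.float32)
    approx = quantize_blockwise(weights)
    print("max abs error:", np.max(np.abs(weights - approx)))
```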
Low-power potential of mixed-signal design makes it an alluring option to accelerate Deep Neural Networks (DNNs). However, mixed-signal circuitry suffers from limited range for information encoding, susceptibility to noise, and Analog to Digital (A/D) conversion overheads. This paper aims to address these challenges by offering and leveraging the i...
Growing computational demands from deep neural networks (DNNs), coupled with diminishing returns from general-purpose architectures, have led to a proliferation of Neural Processing Units (NPUs). This paper describes the Project Brainwave NPU (BW-NPU), a parameterized microarchitecture specialized at synthesis time for convolutional and recurrent D...
To meet the computational demands required of deep learning, cloud operators are turning toward specialized hardware for improved efficiency and performance. Project Brainwave, Microsoft's principal infrastructure for AI serving in real time, accelerates deep neural network (DNN) inferencing in major services such as Bing's intelligent search feature...
Hyperscale datacenter providers have struggled to balance the growing need for specialized hardware (efficiency) with the economic benefits of homogeneity (manageability). In this paper we propose a new cloud architecture that uses reconfigurable logic to accelerate both network plane functions and applications. This Configurable Cloud architecture...
Hyperscale datacenter providers have struggled to balance the growing need for specialized hardware with the economic benefits of homogeneity. The Configurable Cloud datacenter architecture introduces a layer of reconfigurable logic (FPGAs) between the network switches and servers. This enables line-rate transformation of network packets, accelerat...
Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we designed and built a composable, reconfigurable hardware fabric based on field programmabl...
In 2015, a team of software and hardware developers at Microsoft shipped the world's first commercial search engine accelerated using FPGAs in the datacenter. During the sprint to production, new algorithms in the Bing ranking service were ported into FPGAs and deployed to a production bed within several weeks of conception, leading to significant...
Data compression techniques have been the subject of intense study over the past several decades due to exponential increases in the quantity of data stored and transmitted by computer systems. Compression algorithms are traditionally forced to make tradeoffs between throughput and compression quality (the ratio of original file size to compressed...
Trending search topics cause unpredictable query load spikes that hurt the end-user search experience, particularly the mobile one, by introducing longer delays. To understand how trending search topics are formed and evolve over time, we analyze 21 million queries submitted during periods where popular events caused search query volume spikes. Bas...
Power dissipation trends are leading high-performance processors to a regime in which all chip elements cannot be operated simultaneously at maximum frequency. Consequently, energy-efficiency will increase even more in importance, and performance must be achieved within strict power budgets. Current designs employ techniques such as dynamic voltage...
In 2005, as chip multiprocessors started to appear widely, it became possible for the on-chip cores to share the last-level cache. At the time, architects either considered the last-level cache to be divided into per-core private segments, or wholly shared. The shared cache utilized the capacity more efficiently but suffered from high, uniform laten...
Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large...
As improvements in per-transistor speed and energy efficiency diminish, radical departures from conventional approaches are becoming critical to improving the performance and energy efficiency of general-purpose processors. We propose a solution, from circuit to compiler, that enables general-purpose use of limited-precision, analog hardware to accel...
Data compression is crucial in large-scale storage servers to save both storage and network bandwidth, but it suffers from high computational cost. In this work, we present a high throughput FPGA based compressor as a PCIe accelerator to achieve CPU resource saving and high power efficiency. The proposed compressor is differentiated from previous h...
The memory industry faces significant disruption due to challenges related to scaling. Future memory systems will have more heterogeneity at individual levels of the hierarchy, with management support from multiple layers across the stack.
This paper describes a first order multicore model to project a tighter upper bound on performance than previous Amdahl's Law based approaches. The speedup over a known baseline is a function of the core performance, microarchitectural features, application parameters, chip organization, and multicore topology. The model is flexible enough to consi...
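The abstract describes a first-order analytical model; the hedged sketch below shows the general shape of such a model, Amdahl-style speedup as a function of core count, per-core performance, and the application's parallel fraction, rather than the paper's exact formulation.

```python
# Minimal sketch of a first-order multicore speedup model (assumed form,
# not the paper's): speedup over a baseline core of performance 1.0.
def multicore_speedup(parallel_fraction, cores, core_perf=1.0):
    """Speedup over a single baseline core."""
    serial = (1.0 - parallel_fraction) / core_perf
    parallel = parallel_fraction / (core_perf * cores)
    return 1.0 / (serial + parallel)

if __name__ == "__main__":
    for n in (2, 4, 16, 64):
        print(n, "cores:", round(multicore_speedup(0.95, n), 2))
```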
New memory technologies, such as phase-change memory (PCM), promise denser and cheaper main memory, and are expected to displace DRAM. However, many of them experience permanent failures far more quickly than DRAM. DRAM mechanisms that handle permanent failures rely on very low failure rates and, if directly applied to PCM, are extremely inefficien...
This work proposes an approximate algorithmic transformation and a new class of accelerators, called neural processing units (NPUs). NPUs leverage the approximate algorithmic transformation that converts regions of code from a Von Neumann model to a neural model. NPUs achieve an average 2.3× speedup and 3.0× energy savings for general-purpose appro...
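To make the transformation idea concrete, here is a hedged sketch of the general pattern: profile an approximable code region, fit a small learned model to its input/output behavior, and redirect callers to the approximation. The polynomial stand-in below is illustrative only and is not the paper's compiler flow or neural model.

```python
# Hedged illustration of approximating a code region with a learned model.
import numpy as np

def exact_region(x):
    # Stand-in for an expensive, error-tolerant code region.
    return np.sin(x) + 0.5 * x

# 1. Profile: collect representative input/output pairs.
xs = np.linspace(-3, 3, 200)
ys = exact_region(xs)

# 2. Fit a small approximate model (a cubic polynomial as a stand-in for
#    the neural network an NPU would evaluate).
coeffs = np.polyfit(xs, ys, deg=3)

def approximate_region(x):
    return np.polyval(coeffs, x)

# 3. Redirect callers to the approximate version and check the error.
test = np.linspace(-3, 3, 17)
print("mean abs error:",
      np.mean(np.abs(exact_region(test) - approximate_region(test))))
```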
Examples of a system, method and computer accessible medium are provided to generate a predicate prediction for a distributed multi-core architecture. Using such system, method and computer accessible medium, it is possible to intelligently encode approximate predicate path information on branch instructions. Using this statically generated informa...
Summary form only given, as follows. Although transistor densities continue to scale exponentially, the failure of Dennard scaling prevents us from maximally utilizing die area in future power-constrained multicore processors, a phenomenon referred to as "Dark Silicon". Alternative energy-efficient architectures based on FPGAs, GPGPUs, ASICs, MPPAs, etc. ar...
Dynamic multicore architectures, that fuse and split cores at run time, potentially offer a level of performance/energy agility that static multicore designs cannot achieve. Conventional ISAs, however, have scalability limits to fusion. EDGE-based designs offer greater scalability but to date have been performance limited by significant microarchit...
Starting in 2004, the microprocessor industry has shifted to multicore scaling---increasing the number of cores per die each generation---as its principal strategy for continuing performance growth. Many in the research community believe that this exponential core scaling will continue into the hundreds or thousands of cores per chip, auguring a pa...
Since 2004, processor designers have increased core counts to exploit Moore’s Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scalin...
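A back-of-the-envelope illustration of the power-constrained scaling argument: under a fixed chip power budget, the fraction of cores that can be simultaneously active shrinks when core counts grow faster than per-core power drops. All numbers below are assumptions, not results from the paper.

```python
# Illustrative dark-silicon arithmetic under an assumed fixed power budget.
def active_core_fraction(cores, power_per_core_w, chip_budget_w):
    """Fraction of cores that can run simultaneously at full power."""
    affordable = chip_budget_w / power_per_core_w
    return min(1.0, affordable / cores)

if __name__ == "__main__":
    budget = 100.0  # watts, assumed fixed across generations
    for gen, (cores, per_core) in enumerate([(8, 10.0), (16, 7.0), (32, 5.0), (64, 3.5)]):
        frac = active_core_fraction(cores, per_core, budget)
        print(f"gen {gen}: {cores} cores, {frac:.0%} simultaneously active")
```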
This column discusses the life and achievements of Chuck Moore and his impact on the computer architecture community.
Disciplined approximate programming lets programmers declare which parts of a program can be computed approximately and consequently at a lower energy cost. The compiler proves statically that all approximate computation is properly isolated from precise computation. The hardware is then free to selectively apply approximate storage and approximate...
Disciplined approximate programming lets programmers declare which parts of a program can be computed approximately and consequently at a lower energy cost. The compiler proves statically that all approximate computation is properly isolated from precise computation. The hardware is then free to selectively apply approximate storage and approximate...
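The isolation guarantee can be pictured in a few lines. The sketch below is a simplified, dynamic Python analogue in which approximate values must be explicitly endorsed before entering precise state; the systems described above enforce the same separation statically in the compiler and type system.

```python
# Simplified analogue of disciplined approximation: tagged approximate values
# cannot flow into precise state without an explicit endorsement.
class Approx:
    """Wrapper marking a value as approximate."""
    def __init__(self, value):
        self.value = value

def endorse(v):
    """Explicitly allow an approximate value to be used precisely."""
    return v.value if isinstance(v, Approx) else v

def precise_store(v):
    if isinstance(v, Approx):
        raise TypeError("approximate value used in precise context without endorse()")
    return v

noisy = Approx(3.141)            # result of an approximate computation
precise_store(endorse(noisy))    # OK: programmer explicitly accepts the error
try:
    precise_store(noisy)         # rejected: approximate data leaking into precise state
except TypeError as e:
    print("blocked:", e)
```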
With the power constraints of the dark silicon era looming, architectural changes to improve energy efficiency are critical. Disciplined approximate computing, in which unreliable hardware components are safely used in error-tolerant software, is a promising approach. We discuss programming models, architectures, and accelerators for approximat...
Widespread adoption of Phase Change Memory (PCM) requires solutions to several problems recently addressed in the literature, including limited endurance, increased write latencies, and system-level changes required to exploit non-volatility. One important difference between PCM and DRAM that has received less attention is the increased need for wr...
Since 2005, processor designers have increased core counts to exploit Moore's Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scalin...
SONGO (Search ON the GO) is a web search caching system for mobile devices that allows most of the submitted queries to be served locally without having to use the 3G link. Initially, a community-based search cache is generated by mining the most popular queries and links from the mobile search logs. This cache is updated daily making sure that the...
Predication of control edges has the potential advantages of improving fetch bandwidth and reducing branch mispredictions. However, heavily predicated code in out-of-order processors can lose significant performance by deferring resolution of the predicates until they are executed, whereas in non-predicated code those control arcs would have rem...
Cloud services accessed through mobile devices suffer from high network access latencies and are constrained by energy budgets dictated by the devices' batteries. Radio and battery technologies will improve over time, but are still expected to be the bottlenecks in future systems. Non-volatile memories (NVM), however, may continue experiencing sign...
Cloud services accessed through mobile devices suffer from high network access latencies and are constrained by energy budgets dictated by the devices' batteries. Radio and battery technologies will improve over time, but are still expected to be the bottlenecks in future systems. Non-volatile memories (NVM), however, may continue experiencing sign...
Composable multicore systems merge multiple independent cores for running sequential single-threaded workloads. The performance scalability of these systems, however, is limited due to partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multicore systems. We present a critical path analysis...
Power consumption, complexity, and on-chip latency are forcing computer systems to exploit more parallelism efficiently. Explicit Dataflow Graph Execution (EDGE) architectures seek to expose parallelism by dividing programs into blocks of efficient dataflow operations, exposing inter- and intra-block concurrency. This paper studies the balance of...
Caches mitigate the long memory latency that limits the performance of modern processors. However, caches can be quite inefficient. On average, a cache block in a 2MB L2 cache is dead 59% of the time, i.e., it will not be referenced again before it is evicted. Increasing cache efficiency can improve performance by reducing miss rate, or alternatel...
Previous research has shown that Explicit Data Graph Execution (EDGE) instruction set architectures (ISA) allow for power efficient performance scaling. In this paper we describe the preliminary design of a new dynamic multicore processor called E2 that utilizes an EDGE ISA to allow for the dynamic composition of physical cores into logical process...
As computer architectures become increasingly complex, hand-tuning compiler heuristics becomes increasingly tedious and time consuming for compiler developers. This paper presents a case study that uses a genetic algorithm to learn a compiler policy. The target policy implicitly balances communication and contention among processing elements of the...
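A minimal sketch of the search loop such an approach implies: a genetic algorithm evolves heuristic parameters against a cost function. The placeholder cost model below stands in for the compiler-and-simulator evaluation used in the paper.

```python
# Hedged sketch of learning compiler-heuristic parameters with a genetic algorithm.
import random

def evaluate(params):
    # Placeholder cost: lower is better, with an arbitrary assumed optimum.
    target = [0.3, 0.7, 0.5]
    return sum((p - t) ** 2 for p, t in zip(params, target))

def genetic_search(pop_size=20, generations=30, n_params=3):
    pop = [[random.random() for _ in range(n_params)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=evaluate)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            child = [random.choice(pair) for pair in zip(a, b)]   # crossover
            if random.random() < 0.2:                             # mutation
                child[random.randrange(n_params)] = random.random()
            children.append(child)
        pop = survivors + children
    return min(pop, key=evaluate)

print("best policy parameters:", genetic_search())
```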
Memory scaling is in jeopardy as charge storage and sensing mechanisms become less reliable for prevalent memory technologies, such as dynamic random access memory (DRAM). In contrast, phase change memory (PCM) relies on programmable resistances, as well as scalable current and thermal mechanisms. To deploy PCM as a DRAM alternative and to exploit...
As leakage and other charge storage limitations begin to impair the scalability of DRAM, non-volatile resistive memories are being developed as a potential replacement. Unfortunately, current error correction techniques are poorly suited to this emerging class of memory technologies. Unlike DRAM, PCM and other resistive memories have wear lifetimes...
As leakage and other charge storage limitations begin to impair the scalability of DRAM, non-volatile resistive memories are being developed as a potential replacement. Unfortunately, current error-correction techniques are poorly suited to this emerging class of memory technologies. Unlike DRAM, PCM and other resistive memories have wear lifetimes...
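One way to picture correction tailored to wear-induced hard failures (as opposed to DRAM-style transient errors) is sketched below: a row keeps a few spare entries, each recording the position and correct value of a failed bit, and reads are patched from them. This illustrates the pointer-based correction idea at a high level; field widths and policies are assumptions, not the paper's exact design.

```python
# Hedged sketch: patch reads over worn-out cells using a few spare entries per row.
class PatchedRow:
    def __init__(self, bits, spare_entries=4):
        self.bits = list(bits)            # raw (possibly faulty) storage
        self.failed = {}                  # bit position -> correct value
        self.spare_entries = spare_entries

    def mark_failed(self, pos, correct_value):
        if len(self.failed) >= self.spare_entries:
            raise RuntimeError("row exhausted: no spare correction entries left")
        self.failed[pos] = correct_value

    def read(self):
        patched = list(self.bits)
        for pos, val in self.failed.items():
            patched[pos] = val            # override stuck/failed cells
        return patched

row = PatchedRow([0, 1, 1, 0, 1, 0, 0, 1])
row.bits[2] = 0                           # cell 2 wears out and sticks at 0
row.mark_failed(2, 1)                     # record its correct value in a spare entry
print(row.read())
```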
Simulation is an indispensable tool for evaluation and analysis throughout the development cycle of a computer system, and even after the computer system is built. How simulation should evolve as the complexity of computer systems continues to grow is an open question and the subject of this panel from the 2009 Workshop on Computer Architecture Res...
DRAM is facing severe scalability challenges in sub-45nm technology nodes due to precise charge placement and sensing hurdles in deep-submicron geometries. Resistive memories, such as phase-change memory (PCM), already scale well beyond DRAM and are a promising DRAM replacement. Unfortunately, PCM is write-limited, and current approaches to man...
DRAM is facing severe scalability challenges in sub-45nm technology nodes due to precise charge placement and sensing hurdles in deep-submicron geometries. Resistive memories, such as phase-change memory (PCM), already scale well beyond DRAM and are a promising DRAM replacement. Unfortunately, PCM is write-limited, and current approaches to man...
DRAM is facing severe scalability challenges in sub-45nm technology nodes due to precise charge placement and sensing hurdles in deep-submicron geometries. Resistive memories, such as phase-change memory (PCM), already scale well beyond DRAM and are a promising DRAM replacement. Unfortunately, PCM is write-limited, and current approaches to man...
As DRAM and other charge memories reach scaling limits, resistive memories, such as phase change memory (PCM), may permit continued scaling of main memories. However, while PCM may scale further than DRAM, PCM presently wears out more quickly, has higher read latency, has much higher write latency, and has much higher write power than DRAM. Thi...
As computer architectures become increasingly complex, hand-tuning compiler heuristics becomes increasingly tedious and time consuming for compiler developers. This paper presents a case study that uses a genetic algorithm to learn a compiler policy. The target policy implicitly balances communication and contention among processing elements of...
Modern computer systems have been built around the assumption that persistent storage is accessed via a slow, block-based interface. However, new byte-addressable, persistent memory technologies such as phase change memory (PCM) offer fast, fine-grained access to persistent storage. In this paper, we present a file system and a hardware architec...
When designing a multicore chip, two primary concerns are how large and powerful to make each core and how many cores to include. A design with a few large cores is attractive for general purpose computing that has coarse-grained threads which depend on instruction-level parallelism for performance. At the other end of the processor granularity s...
While researchers have invested substantial effort to build architectural power models, validating such models has proven difficult at best. In this paper, we examine the accuracy of commonly used architectural power models on a custom ASIC microprocessor. Our platform is the TRIPS system for which we have readily available high-level simulators,...
Memory scaling is in jeopardy as charge storage and sensing mechanisms become less reliable for prevalent memory technologies, such as DRAM. In contrast, phase change memory (PCM) storage relies on scalable current and thermal mechanisms. To exploit PCM's scalability as a DRAM alternative, PCM must be architected to address relatively long latencie...
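A small sketch of one write-reduction tactic relevant to write-limited memories such as PCM: compare incoming data with the currently stored data and write only the words that actually changed. The granularity and the read-before-write policy here are assumptions for illustration, not the buffer organization proposed in the paper.

```python
# Illustrative partial-write sketch: write only modified words back to the array.
def partial_write(stored, new, word_size=4):
    """Return (updated_contents, words_written), writing only modified words."""
    assert len(stored) == len(new)
    updated, written = list(stored), 0
    for i in range(0, len(stored), word_size):
        if stored[i:i + word_size] != new[i:i + word_size]:
            updated[i:i + word_size] = new[i:i + word_size]
            written += 1
    return updated, written

old = [0] * 16
new = [0] * 16
new[5] = 7                                   # only one word actually changes
_, words = partial_write(old, new)
print("words written:", words, "of", len(old) // 4)
```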
This paper analyzes the performance of the TRIPS prototype chip's block predictor. The prototype is the first implementation of the block-atomic TRIPS architecture, wherein the unit of execution is a TRIPS hyperblock. The TRIPS prototype predictor uses a two-step prediction process: it first predicts the exit from the current hyperblock and uses th...
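The two-step structure described above can be sketched in a few lines: predict the likely exit of the current block, then use that exit to look up the next block's target. The tables below are deliberately simplified assumptions, not the prototype's predictor organization.

```python
# Hedged sketch of a two-step block predictor: exit prediction, then target lookup.
from collections import defaultdict

class BlockPredictor:
    def __init__(self):
        self.exit_table = defaultdict(lambda: defaultdict(int))     # block -> exit -> count
        self.target_table = {}                                       # (block, exit) -> next block

    def predict(self, block):
        exits = self.exit_table[block]
        if not exits:
            return None
        exit_id = max(exits, key=exits.get)              # step 1: most likely exit
        return self.target_table.get((block, exit_id))   # step 2: that exit's target

    def train(self, block, exit_id, next_block):
        self.exit_table[block][exit_id] += 1
        self.target_table[(block, exit_id)] = next_block

p = BlockPredictor()
p.train(block=0x100, exit_id=1, next_block=0x200)
p.train(block=0x100, exit_id=1, next_block=0x200)
print(hex(p.predict(0x100)))
```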
As transistors shrink and processors trend toward low power, maintaining precise digital behavior grows more expensive. Replacing digital units with analog equivalents sometimes allows similar computation to be performed at higher speed using less power. As a case study in mixed-signal approximate computation, the authors describe an enhanced neura...
The TRIPS system employs a new instruction set architecture (ISA) called Explicit Data Graph Execution (EDGE) that renegotiates the boundary between hardware and software to expose and exploit concurrency. EDGE ISAs use a block-atomic execution model in which blocks are composed of dataflow instructions. The goal of the TRIPS design is to mine conc...
The TRIPS system employs a new instruction set architecture (ISA) called Explicit Data Graph Execution (EDGE) that renegotiates the boundary between hardware and software to expose and exploit concurrency. EDGE ISAs use a block-atomic execution model in which blocks are composed of dataflow instructions. The goal of the TRIPS design is to mine conc...
The TRIPS system employs a new instruction set architecture (ISA) called Explicit Data Graph Execution (EDGE) that renegotiates the boundary between hardware and software to expose and exploit concurrency. EDGE ISAs use a block-atomic execution model in which blocks are composed of dataflow instructions. The goal of the TRIPS design is to mine conc...
One way to exploit more ILP and improve the performance of a single-threaded application is to speculatively execute multiple code regions on multiple cores. In such a system, communication of operands among in-flight instructions can be power intensive, especially in superscalar processors where all result tags are broadcast to a small number o...
The TRIPS system employs a new instruction set architecture (ISA) called Explicit Data Graph Execution (EDGE) that renegotiates the boundary between hardware and software to expose and exploit concurrency. EDGE ISAs use a block-atomic execution model in which blocks are composed of dataflow instructions. The goal of the TRIPS design is to mine c...
Shrinking transistor sizes and a trend toward low-power processors have caused increased leakage, high per-device variation and a larger number of hard and soft errors. Maintaining precise digital behavior on these devices grows more expensive with each technology generation. In some cases, replacing digital units with analog equivalents allows sim...
Distributed processors must balance communication and concurrency. When dividing instructions among the processors, key factors are the available concurrency, criticality of dependence chains, and communication penalties. The amount of concurrency determines the importance of the other factors: if concurrency is high, wider distribution of inst...
Data caches in general-purpose microprocessors often contain mostly dead blocks and are thus used inefficiently. To improve cache efficiency, dead blocks should be identified and evicted early. Prior schemes predict the death of a block immediately after it is accessed; however, these schemes yield lower prediction accuracy and coverage. Instead, w...
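As a rough illustration of dead-block prediction in general (not the specific scheme in this paper, whose description is truncated above), the sketch below learns an expected hit count per fill PC and flags a block as dead once it reaches that count.

```python
# Hedged, counter-based dead-block predictor sketch.
from collections import defaultdict

class DeadBlockPredictor:
    def __init__(self):
        self.learned_hits = defaultdict(lambda: 1)   # per-PC expected hit count
        self.live_hits = {}                          # block -> hits since fill

    def on_fill(self, block):
        self.live_hits[block] = 0

    def on_hit(self, block, pc):
        self.live_hits[block] = self.live_hits.get(block, 0) + 1
        # Predicted dead once the block has seen its usual number of hits.
        return self.live_hits[block] >= self.learned_hits[pc]

    def on_evict(self, block, pc):
        # Learn the observed hit count for blocks brought in by this PC.
        self.learned_hits[pc] = max(1, self.live_hits.pop(block, 0))

p = DeadBlockPredictor()
p.on_fill(0xA0)
p.on_hit(0xA0, pc=0x400)
p.on_evict(0xA0, pc=0x400)
p.on_fill(0xB0)
print("predicted dead:", p.on_hit(0xB0, pc=0x400))
```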
Communication overheads are one of the fundamental challenges in a multiprocessor system. As the number of processors on a chip increases, communication overheads and the distribution of computation and data become increasingly important performance factors. Explicit Dataflow Graph Execution (EDGE) processors, in which instructions communicat...
Demand for instruction level parallelism calls for increasing register bandwidth without increasing the number of register ports. Emerging architectures address this need by partitioning registers into multiple distributed banks, which offers a technology scalable substrate but a challenging compilation target. This paper introduces a register allo...
Modern processors rely on memory dependence prediction to execute load instructions as early as possible, speculating that they are not dependent on an earlier, unissued store. To date, the most sophisticated dependence predictors, such as Store Sets, have been tightly coupled to the fetch and execution streams, requiring global knowledge of the in...
Modern processors rely on memory dependence prediction to execute load instructions as early as possible, speculating that they are not dependent on an earlier, unissued store. To date, the most sophisticated dependence predictors, such as Store Sets, have been tightly coupled to the fetch and execution streams, requiring global knowledge of the in...
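The store-set style of dependence prediction referred to above can be sketched compactly: loads and stores that have conflicted in the past are grouped into the same set, and a load waits for the most recent in-flight store in its set before issuing. The table organization below is a simplified assumption, not the decoupled design these papers propose.

```python
# Hedged sketch of store-set-style memory dependence prediction.
class StoreSetPredictor:
    def __init__(self):
        self.store_set_id = {}      # instruction PC -> store set id (SSIT-like)
        self.last_store = {}        # store set id -> last in-flight store tag (LFST-like)
        self.next_id = 0

    def on_violation(self, load_pc, store_pc):
        # A load issued before a store it depended on: put both instructions
        # in the same store set so the load waits next time.
        sid = self.store_set_id.get(store_pc)
        if sid is None:
            sid, self.next_id = self.next_id, self.next_id + 1
        self.store_set_id[load_pc] = self.store_set_id[store_pc] = sid

    def on_store_dispatch(self, store_pc, store_tag):
        sid = self.store_set_id.get(store_pc)
        if sid is not None:
            self.last_store[sid] = store_tag

    def load_must_wait_for(self, load_pc):
        sid = self.store_set_id.get(load_pc)
        return self.last_store.get(sid) if sid is not None else None

p = StoreSetPredictor()
p.on_violation(load_pc=0x40, store_pc=0x30)
p.on_store_dispatch(store_pc=0x30, store_tag=7)
print("load waits for store tag:", p.load_must_wait_for(0x40))
```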
While technology trends have ushered in the age of chip multiprocessors (CMP) and enabled designers to place an increasing number of cores on chip, a fundamental question is what size to make each core. Most current commercial designs are symmetric CMPs in which each core is identical and range from a relatively simple RISC pipeline to a large and...
As technology trends have limited the performance scaling of conventional processors, industry and academic research has turned to parallel architectures on a single chip, including distributed uniprocessors and multicore chips. This paper examines how to extend the archetypal operation of dense linear algebra, matrix multiply, to an emerging...
Modern chip multiprocessors (CMPs) are designed to exploit both instruction-level parallelism (ILP) within processors and thread-level parallelism (TLP) within and across processors. However, the number of processors and the granularity of each processor are fixed at design time. This paper evaluates a flexible architectural approach, called Co...
One way to improve the performance of a single-threaded application is to run regions of the code speculatively on a multi-core system. In a composable multi-core system, resources such as the L1 cache, register file, and instruction window are distributed among the cores. In such a system, there are many ways to map instructions to the cores....
Driven by the need for higher bandwidth and complexity reduction, off-chip interconnect has evolved from proprietary busses to networked architectures. A similar evolution is occurring in on-chip interconnect. This paper presents the design, implementation and evaluation of one such on-chip network, the TRIPS OCN. The OCN is a wormhole routed, 4x10...
In this paper, we describe the design and implementation of the primary memory system of the TRIPS processor. To match the aggressive execution bandwidth and support high levels of memory parallelism, the primary memory system is completely partitioned into four banks, can support up to 256 in-flight memory instructions, aggressive reordering of in...
The TRIPS chip prototypes two networks on chip to demonstrate the viability of a routed interconnection fabric for memory and operand traffic. In a 170-million-transistor custom ASIC chip, these NoCs provide system performance within 28 percent of ideal noncontended networks at a cost of 20 percent of the die area. Our experience shows that NoCs ar...
We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of...
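To picture the non-uniform access such an organization implies, the sketch below maps an address to one of many banks and charges a hop-dependent latency from the requesting core. The grid size, mapping, and cycle counts are assumptions for demonstration, not the evaluated design.

```python
# Illustrative NUCA-style bank mapping with distance-dependent hit latency.
def bank_of(addr, n_banks=64, line_size=64):
    return (addr // line_size) % n_banks

def hit_latency(core_xy, bank_index, grid=(8, 8), bank_access=3, hop=1):
    bx, by = bank_index % grid[0], bank_index // grid[0]
    hops = abs(core_xy[0] - bx) + abs(core_xy[1] - by)
    return bank_access + hop * hops

addr = 0x1F2C40
b = bank_of(addr)
print(f"address {addr:#x} -> bank {b}, latency {hit_latency((0, 0), b)} cycles")
```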
Conventional load/store queues (LSQs) are an impediment to both power-efficient execution in superscalar processors and scaling to large-window designs. In this paper, we propose techniques to improve the area and power efficiency of LSQs by allocating entries when instructions issue ("late binding"), rather than when they are dispatched. This appro...
Microarchitecturally integrated on-chip networks, or micronets, are candidates to replace busses for processor component interconnect in future processor designs. For micronets, tight coupling between processor microarchitecture and network architecture is one of the keys to improving processor performance. This paper presents the design, implement...
Well-engineered compilers use a carefully selected set of optimizations, heuristic optimization policies, and a phase ordering. Designing a single optimization heuristic that works well with other optimization phases is a challenging task. Although compiler designers evaluate heuristics and phase orderings before deployment, compilers typically do...
The TRIPS chip prototypes two networks on chip to demonstrate the viability of a routed interconnection fabric for memory and operand traffic. In a 170-million-transistor custom ASIC chip, these NoCs provide system performance within 28 percent of ideal noncontended networks at a cost of 20 percent of the die area. Our experience shows that NoCs ar...
Growing on-chip wire delays, coupled with complexity and power limitations, have placed severe constraints on the issue-width scaling of centralized superscalar architectures. As a result, recent microprocessor designs have backed away from powerful uniprocessors, instead favoring multiple simpler cores on a single die. Partitioning the chip into a...
Conventional load/store queues (LSQs) are an impediment to both power-efficient execution in superscalar processors and scaling to large-window designs. In this paper, we propose techniques to improve the area and power efficiency of LSQs by allocating entries when instructions issue ("late binding"), rather than when they are dispatched. This...
We propose an organization for the on-chip memory system of a chip multiprocessor, in which 16 processors share a 16MB pool of 256 L2 cache banks. The L2 cache is organized as a non-uniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support the spectrum of degrees...