Trevor N. Mudge

University of Michigan · Department of Electrical Engineering and Computer Science (EECS)

About

427 Publications · 101,468 Reads · 24,080 Citations

Publications (427)
Article
We present Versa, an energy-efficient 36-core systolic multiprocessor with dynamically reconfigurable interconnects and memory. Versa leverages reconfigurable functional units and systolic-enhanced ARM cores to adapt for different algorithm characteristics, providing optimized bandwidth, access latency, and data reuse. Hardware support for crucial...
Article
STT-MRAM is a promising candidate for drop-in replacement for DRAM-based main memory because of its higher energy efficiency and similar latency to DRAM. However, simply replacing DRAM with STT-MRAM without optimizations severely limits STT-MRAM from exploiting its full potential. STT-MRAM employs costly sense amplifiers that demand an order of mag...
Preprint
We present Versa, an energy-efficient processor with 36 systolic ARM Cortex-M4F cores and a runtime-reconfigurable memory hierarchy. Versa exploits algorithm-specific characteristics in order to optimize bandwidth, access latency, and data reuse. Measured on a set of kernels with diverse data access, control, and synchronization characteristics, re...
Preprint
Full-text available
Computation and data reuse (CoDR) are critical for resource-limited Convolutional Neural Network (CNN) accelerators. This paper presents Universal Computation Reuse to exploit weight sparsity, repetition, and similarity simultaneously in a convolutional layer. Moreover, CoDR decreases the cost of weight memory access by proposing a customized Run-Lengt...
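The run-length idea mentioned above can be illustrated with a minimal sketch, not the paper's customized scheme: a sparse, quantized weight stream is stored as (value, run length) pairs so zeros and repeated values are kept only once.

```python
# Minimal run-length encoding sketch for a sparse weight vector (illustrative only).
def rle_encode(weights):
    """Return (value, run_length) pairs for consecutive equal weights."""
    runs = []
    for w in weights:
        if runs and runs[-1][0] == w:
            runs[-1] = (w, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((w, 1))               # start a new run
    return runs

def rle_decode(runs):
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out

weights = [0, 0, 0, 3, 3, 0, 0, 5, 5, 5, 0]
encoded = rle_encode(weights)
assert rle_decode(encoded) == weights
print(encoded)   # [(0, 3), (3, 2), (0, 2), (5, 3), (0, 1)]
```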
Chapter
With the growing size of real-world datasets running on CPUs, address translation has become a significant performance bottleneck. To translate virtual addresses into physical addresses, modern operating systems perform several levels of page table walks (PTWs) in memory. Translation look-aside buffers (TLBs) are used as caches to keep recently use...
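As a rough software model of the mechanism described above (not the chapter's proposal), the sketch below walks a two-level page table and uses a small TLB as a translation cache; the page size and index widths are assumptions.

```python
# Toy two-level page table walk with a TLB front-end (illustrative parameters).
PAGE_BITS = 12          # 4 KiB pages (assumption)
L2_BITS = 10            # 10-bit second-level index (assumption)

class MMU:
    def __init__(self):
        self.l1 = {}      # level-1 table: index -> level-2 table (dict)
        self.tlb = {}     # virtual page number -> physical frame number

    def map(self, vaddr, frame):
        vpn = vaddr >> PAGE_BITS
        i1, i2 = vpn >> L2_BITS, vpn & ((1 << L2_BITS) - 1)
        self.l1.setdefault(i1, {})[i2] = frame

    def translate(self, vaddr):
        vpn, offset = vaddr >> PAGE_BITS, vaddr & ((1 << PAGE_BITS) - 1)
        if vpn in self.tlb:                      # TLB hit: no walk needed
            return (self.tlb[vpn] << PAGE_BITS) | offset
        i1, i2 = vpn >> L2_BITS, vpn & ((1 << L2_BITS) - 1)
        frame = self.l1[i1][i2]                  # page table walk (two memory references)
        self.tlb[vpn] = frame                    # fill the TLB for next time
        return (frame << PAGE_BITS) | offset

mmu = MMU()
mmu.map(0x0040_2000, frame=0x1234)
print(hex(mmu.translate(0x0040_2ABC)))   # 0x1234abc
```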
Article
Full-text available
A sparse matrix-matrix multiplication (SpMM) accelerator with 48 heterogeneous cores and a reconfigurable memory hierarchy is fabricated in 40-nm CMOS. The compute fabric consists of dedicated floating-point multiplication units, and general-purpose Arm Cortex-M0 and Cortex-M4 cores. The on-chip memory reconfigures scratchpad or cache, depending on...
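For context, the kernel such an accelerator targets can be sketched in a few lines; this is a plain Gustavson-style (row-by-row) SpMM over CSR operands, not the chip's dataflow.

```python
# Sparse matrix-matrix multiplication over CSR inputs (software sketch only).
def spmm_csr(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val):
    """C = A * B for CSR matrices; returns C as a list of {col: value} rows."""
    c_rows = []
    for i in range(len(a_ptr) - 1):
        acc = {}                                   # sparse accumulator for row i of C
        for jj in range(a_ptr[i], a_ptr[i + 1]):
            j, a_ij = a_idx[jj], a_val[jj]
            for kk in range(b_ptr[j], b_ptr[j + 1]):
                k = b_idx[kk]
                acc[k] = acc.get(k, 0.0) + a_ij * b_val[kk]
        c_rows.append(acc)
    return c_rows

# A = [[1, 0], [0, 2]],  B = [[0, 3], [4, 0]]
a = ([0, 1, 2], [0, 1], [1.0, 2.0])
b = ([0, 1, 2], [1, 0], [3.0, 4.0])
print(spmm_csr(*a, *b))   # [{1: 3.0}, {0: 8.0}]
```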
Conference Paper
STT-MRAM is a promising drop-in replacement for DRAM as main memory, because it can offer higher energy efficiency than DRAM with comparable latency. However, implementing STT-MRAM like traditional DRAM, that is, keeping the same policies while only replacing the technology, severely limits the benefits that this new technology can off...
Conference Paper
A Sparse Matrix-Matrix multiplication (SpMM) accelerator with 48 heterogeneous cores and a reconfigurable memory hierarchy is fabricated in 40 nm CMOS. On-chip memories are reconfigured as scratchpad or cache and interconnected with synthesizable coalescing crossbars for efficient memory access in each phase of the algorithm. The 2.0 mm x 2.6 mm ch...
Article
Designing error correction code (ECC) to guarantee strong reliability for high bandwidth memory (HBM) is imperative in high performance computers, especially for systems equipped with graphics processing units (GPUs). The design of ECC is challenging because future GPUs are expected to implement a memory subsystem supporting fine and coarse-grained...
Conference Paper
The performance demands placed on memory systems by growing volumes of data from emerging applications, such as machine learning and big data analytics, continue to increase. As a result, HBM has been introduced in GPUs and throughput-oriented processors. HBM is a stack of multiple DRAM devices across a number of memory channels. Although HBM p...
Conference Paper
This paper investigates the feasibility of a unified processor architecture to enable error coding flexibility and secure communication in low power Internet of Things (IoT) wireless networks. Error coding flexibility for wireless communication allows IoT applications to exploit the large tradeoff space in data rate, link distance and energy-effici...
Article
This paper investigates the feasibility of a unified processor architecture to enable error coding flexibility and secure communication in low power Internet of Things (IoT) wireless networks. Error coding flexibility for wireless communication allows IoT applications to exploit the large tradeoff space in data rate, link distance and energy-effici...
Conference Paper
The computation for today's intelligent personal assistants such as Apple Siri, Google Now, and Microsoft Cortana is performed in the cloud. This cloud-only approach requires significant amounts of data to be sent to the cloud over the wireless network and puts significant computational pressure on the datacenter. However, as the computational res...
Article
The computation for today's intelligent personal assistants such as Apple Siri, Google Now, and Microsoft Cortana is performed in the cloud. This cloud-only approach requires significant amounts of data to be sent to the cloud over the wireless network and puts significant computational pressure on the datacenter. However, as the computational res...
Article
Most server-grade systems provide Chipkill-Correct error protection at the expense of power and performance. In this paper we present a low overhead solution to improving the reliability of commodity DRAM systems with no change in the existing memory architecture. Specifically, we propose five erasure and error correction (E-ECC) schemes that provi...
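As a minimal sketch of the erasure half of such schemes (not the proposed E-ECC constructions), the example below rebuilds the data of a chip that is known to have failed from the surviving chips plus an XOR-parity chip.

```python
# XOR-parity erasure reconstruction across DRAM chips (illustrative only).
def parity_chip(data_chips):
    """Compute the XOR parity over the per-chip data words."""
    p = 0
    for word in data_chips:
        p ^= word
    return p

def reconstruct(chips, parity, failed):
    """Rebuild the word of the chip at index `failed` (treated as a known erasure)."""
    rebuilt = parity
    for i, word in enumerate(chips):
        if i != failed:
            rebuilt ^= word
    return rebuilt

data = [0xDEAD, 0xBEEF, 0x1234, 0x5678]   # hypothetical per-chip data words
p = parity_chip(data)
assert reconstruct(data, p, failed=2) == 0x1234
```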
Conference Paper
Building exascale supercomputers requires resilience to failing components such as processor, memory, storage, and network devices. Checkpoint/restart is a key ingredient in attaining resilience, but providing fast and reliable checkpointing is becoming more challenging as the amount of data to checkpoint and the number of components that can fail...
Article
This paper presents an energy-autonomous wireless communication system for ultra-small Internet-of-Things (IoT) platforms. In the proposed system, all necessary components including the battery, energy-harvesting solar cells, and the RF antenna are fully integrated within a millimeter-scale form-factor. Designing an energy-optimized wireless commun...
Conference Paper
DRAM can enter self-refresh mode to save power during idle periods. However, self-refresh mode does not modify or reduce the number of refresh operations, so the refresh energy stays the same. We observe that in self-refresh mode DRAM cells are in two distinct modes, static (idle) and dynamic (refreshing), and that the switching between these...
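A back-of-envelope model, with illustrative numbers rather than the paper's, shows how self-refresh power splits into the static and dynamic components described above:

```python
# Toy self-refresh power model (all constants are assumptions, not measurements).
rows = 8192                    # rows to refresh per retention window (assumption)
t_retention = 64e-3            # 64 ms retention window
e_per_row_refresh = 30e-9      # 30 nJ per row refresh (assumption)
p_static = 5e-3                # 5 mW idle/background power (assumption)

p_dynamic = rows * e_per_row_refresh / t_retention
print(f"dynamic {p_dynamic*1e3:.2f} mW, static {p_static*1e3:.2f} mW, "
      f"total {(p_dynamic + p_static)*1e3:.2f} mW")
```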
Conference Paper
In recent years, operating at near-threshold supply voltages has been proposed to improve energy efficiency in circuits, yet decreased efficacy of dynamic voltage scaling has been observed in recent planar technologies. However, foundries have introduced a shift from planar to FinFET fabrication processes. In this paper, we study 7nm FinFET's abili...
Article
As user demand scales for intelligent personal assistants (IPAs) such as Apple's Siri, Google's Google Now, and Microsoft's Cortana, we are approaching the computational limits of current datacenter architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an o...
Article
As user demand scales for intelligent personal assistants (IPAs) such as Apple's Siri, Google's Google Now, and Microsoft's Cortana, we are approaching the computational limits of current datacenter (DC) architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of...
Conference Paper
Although graphics processing units (GPUs) are capable of high compute throughput, their memory systems need to supply the arithmetic pipelines with data at a sufficient rate to avoid stalls. For benchmarks that have divergent access patterns or cause the L1 cache to run out of resources, the link between the GPU's load/store unit and the L1 cache b...
Conference Paper
Most server-grade memory systems provide Chipkill-Correct error protection at the expense of power and/or performance overhead. In this paper we present low overhead schemes for improving the reliability of commodity DRAM systems with better power and IPC performance compared to Chipkill-Correct solutions. Specifically, we propose two erasure and e...
Article
As applications such as Apple Siri, Google Now, Microsoft Cortana, and Amazon Echo continue to gain traction, web-service companies are adopting large deep neural networks (DNNs) for machine learning challenges such as image processing, speech recognition, and natural language processing, among others. A number of open questions arise as to the design o...
Article
This paper proposes a novel 3D switch, called 'Hi-Rise', that employs high-radix switches to efficiently route data across multiple stacked layers of dies. The proposed interconnect is hierarchical and composed of two switches per silicon layer and a set of dedicated layer to layer channels. However, a hierarchical 3D switch can lead to unfair arbi...
Patent
Interconnect circuitry 2 has a plurality of data source circuits 8 connected to respective input paths 4 and a plurality of data destination circuits 10 connected to respective output paths 6. Connection cells 12 provide selective connections between input paths 4 and output paths 6. Arbitration circuitry 26 provides adaptive priority arbitration b...
Conference Paper
In this work we explore the tradeoffs between energy and performance for several last-level cache configurations in an asymmetric multi-core system. We show that for switching threads between cores at intervals on the order of 100k or more instructions, the performance difference is negligible when private last-level caches are used in place of sha...
Conference Paper
Current mobile devices are unable to execute complex vision applications in a timely and power efficient manner without offloading some of the computation. This paper examines the tradeoffs that arise from executing some of the workload onboard and some remotely. Feature extraction and matching play an essential role in image classification and hav...
Article
Key-value stores, such as Memcached, have been used to scale web services since the beginning of the Web 2.0 era. Data center real estate is expensive, and several industry experts we have spoken to have suggested that a significant portion of their data center space is devoted to key value stores. Despite its wide-spread use, there is little in th...
Conference Paper
Key-value stores, such as Memcached, have been used to scale web services since the beginning of the Web 2.0 era. Data center real estate is expensive, and several industry experts we have spoken to have suggested that a significant portion of their data center space is devoted to key value stores. Despite its wide-spread use, there is little in th...
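For readers unfamiliar with the workload, a Memcached-style key-value cache boils down to get/set over an LRU-evicted store; the sketch below is a toy model with illustrative names, not Memcached's implementation.

```python
# Minimal key-value cache with LRU eviction (illustrative only).
from collections import OrderedDict

class KVCache:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.store = OrderedDict()

    def set(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)     # evict least recently used entry

    def get(self, key):
        if key not in self.store:
            return None                        # cache miss: would fetch from backend
        self.store.move_to_end(key)            # refresh recency on hit
        return self.store[key]

cache = KVCache(capacity=2)
cache.set("user:1", "alice")
cache.set("user:2", "bob")
cache.get("user:1")
cache.set("user:3", "carol")                   # evicts "user:2"
print(cache.get("user:2"))                     # None
```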
Conference Paper
The power target for exascale supercomputing is 20MW, with about 30% budgeted for the memory subsystem. Commodity DRAMs will not satisfy this requirement. Additionally, the large number of memory chips (>10M) required will result in crippling failure rates. Although specialized DRAM memories have been reorganized to reduce power through 3D-stacking...
Article
Process scaling has resulted in an exponential increase of the number of transistors available to designers. Meanwhile, global interconnect has not scaled nearly as well, because global wires scale only in one dimension instead of two, resulting in fewer, high-resistance routing tracks. This paper evaluates the use of three-dimensional (3D) integra...
Conference Paper
Due to the rapid growth of mobile communication, wireless base stations are becoming a significant consumer of computational resources. Historically, base stations have been built from ASICs, DSP processors, or FPGAs. This paper studies the feasibility of building wireless base stations from commercial graphics processing units (GPUs). GPUs are att...
Article
Supply-voltage scaling has stagnated in recent technology nodes, leading to so-called dark silicon. To increase overall chip multiprocessor (CMP) performance, it is necessary to improve the energy efficiency of individual tasks so that more tasks can be executed simultaneously within thermal limits. In this article, the authors investigate the limi...
Conference Paper
The rapid growth in the number of mobile devices and the higher data rate requirements of mobile subscribers have made wireless signal processing a key driving application of mobile computing technology. To design better mobile platforms and the supporting wireless infrastructure, it is very important for computer architects and system designers to...
Conference Paper
In this paper, we study different schemes to parallelize trellis algorithms for efficient implementation on a GPU. We consider parallelization schemes at the packet-level, subblock-level and trellis-level to increase the number of threads in a GPU implementation. At the trellis-level, we consider state-level, forward-backward traversal and branch-m...
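The trellis computation in question can be sketched as a small log-domain Viterbi decode; under packet-level parallelism each GPU thread would simply run one such decode, while the paper's subblock- and trellis-level schemes split the work further. The transition and emission tables below are illustrative.

```python
# Log-domain Viterbi decode over a tiny two-state trellis (illustrative only).
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence."""
    metric = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
              for s in states}
    backptr = []
    for obs in observations[1:]:
        new_metric, bp = {}, {}
        for s in states:
            prev = max(states, key=lambda p: metric[p] + math.log(trans_p[p][s]))
            new_metric[s] = (metric[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][obs]))
            bp[s] = prev
        metric = new_metric
        backptr.append(bp)
    state = max(states, key=metric.get)          # best final state
    path = [state]
    for bp in reversed(backptr):                 # trace the survivor path back
        state = bp[state]
        path.append(state)
    return list(reversed(path))

states = ("S0", "S1")
print(viterbi([0, 1, 1], states,
              start_p={"S0": 0.6, "S1": 0.4},
              trans_p={"S0": {"S0": 0.7, "S1": 0.3}, "S1": {"S0": 0.4, "S1": 0.6}},
              emit_p={"S0": {0: 0.9, 1: 0.1}, "S1": {0: 0.2, 1: 0.8}}))
# ['S0', 'S1', 'S1']
```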
Patent
An integrated circuit includes a plurality of processing stages each including processing logic 1014, a non-delayed signal-capture element 1016, a delayed signal-capture element 1018 and a comparator 1024. The non-delayed signal-capture element 1016 captures an output from the processing logic 1014 at a non-delayed capture time. At a later delayed...
Article
Centip3De uses the synergy between 3D integration and near-threshold computing to create a reconfigurable system that provides both energy-efficient operation and techniques to address single-thread performance bottlenecks. The original Centip3De design is a seven-layer 3D stacked design with 128 cores and 256 Mbytes of DRAM. Silicon results show a...
Patent
Full-text available
A method of generating valid vertical interconnect positions for a multiple layer integrated circuit including multiple layers stacked vertically above one another and having a bonding interface between at least one pair of layers. The interface is formed by the coupling of a pair of conductive bond patterns formed on facing surfaces of the pair of...
Conference Paper
In this paper, we explore the challenges in scaling on-chip networks towards kilo-core processors. Current low-radix topologies optimize for fast local communication, but do not scale well to kilo-core systems because of the large number of routers required. These increase both power and hop count. In contrast, symmetric high-radix topologies optim...
Article
We present Centip3De, a large-scale 3D CMP with a cluster-based near-threshold computing (NTC) architecture. Centip3De uses a 3D stacking technology in conjunction with 130 nm CMOS. Measured results for a two-layer, 64-core system are discussed, with the system achieving 3930 DMIPS/W energy efficiency, which is a >3x improvement over traditional op...
Article
The rapid advancements in the computational capabilities of the graphics processing unit (GPU) as well as the deployment of general programming models for these devices have made the vision of a desktop supercomputer a reality. It is now possible to assemble a system that provides several TFLOPs of performance on scientific applications for the cos...
Conference Paper
Just-in-time compilation with dynamic code optimization is often used to help improve the performance of applications that utilize high-level languages and virtual run-time environments, such as those found in smartphones. Just-in-time compilation introduces additional overhead into the instruction fetch stage of a processor that is particularly pr...
Conference Paper
With multi-core processors now mainstream, the shift to many-core processors poses a new set of design challenges. In particular, the scalability of coherence protocols remains a significant challenge. While complex Network-on-Chip interconnect fabrics have been proposed and in some cases implemented, most of industry has slowly evolved existing co...
Conference Paper
This article consists of a collection of slides from the author's conference presentation on Centip3De, a 64-core, three dimensional (3D) stacked near-threshold system. Some of the specific topics discussed include: power management considerations with processor performance; threshold computing and design; architectural impact of near threshold com...
Article
Full-text available
Near-threshold operation has emerged as a competitive approach for energy-efficient architecture design. In particular, a combination of near-threshold circuit techniques and parallel SIMD computations achieves excellent energy efficiency for easy-to-parallelize applications. However, near-threshold operations suffer from delay variations due to in...
Article
A scalable architecture for designing high-radix switch fabrics is presented. It uses circuit techniques to reuse the existing input and output data buses and switching logic for fabric configuration and for supporting multiple arbitration policies. In addition, it integrates 4-level message-based priority arbitration for quality of service. Fine-grain clock...
Article
Full-text available
Supply voltage scaling has stagnated in recent technology nodes, leading to so-called "dark silicon." In this paper, we investigate the limit of voltage scaling together with task parallelization to maintain task completion latency. When accounting for parallelization overheads, minimum task energy is obtained at "near threshold" supply-voltages ac...
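The voltage/energy tradeoff behind this result can be illustrated with a toy model (not the paper's analysis), assuming dynamic energy scales with V^2, delay follows a simple alpha-power law, and leakage energy grows with supply voltage times runtime; parallelization overheads are ignored. The minimum-energy point then lands well below nominal but above threshold.

```python
# Toy minimum-energy-voltage sweep (all constants are assumptions).
VTH = 0.3   # assumed threshold voltage (V)

def task_energy(v, e_dyn_nom=1.0, p_leak_nom=0.05, v_nom=1.0):
    # normalized delay from a simple alpha-power law: delay ~ V / (V - Vth)^2
    delay = (v / (v - VTH) ** 2) / (v_nom / (v_nom - VTH) ** 2)
    e_dyn = e_dyn_nom * (v / v_nom) ** 2         # dynamic energy ~ V^2
    e_leak = p_leak_nom * (v / v_nom) * delay    # leakage power times longer runtime
    return e_dyn + e_leak

voltages = [round(0.35 + 0.05 * i, 2) for i in range(14)]   # 0.35 V .. 1.00 V
best = min(voltages, key=task_energy)
print(best, round(task_energy(best), 3))         # ~0.5 V in this toy model
```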
Article
The adoption of non-volatile memories (NVMs) in system architecture and the growth in data-centric workloads offer exciting opportunities for new designs. In this paper, we examine the potential and limit of designs that move compute in close proximity to NVM-based data stores. To address the challenges in evaluating such system architectures for d...
Article
High-speed and low-power routers form the basic building blocks of on-die interconnect fabrics that are critical to overall throughput and energy efficiency of high performance systems [1,2]. Conventional routers use distinct logic blocks for routing data and handling arbitration [3,4]. At higher radices, connections between these blocks become a b...
Article
Full-text available
Smartphones have recently overtaken PCs as the primary consumer computing device in terms of annual unit shipments. Given this rapid market growth, it is important that mobile system designers and computer architects analyze the characteristics of the interactive applications users have come to expect on these platforms. With the introduction of hi...
Article
The National Institute of Standards and Technology (NIST) has initiated a process to develop a Federal Information Processing Standard (FIPS) for the Advanced Encryption Standard (AES), specifying an Advanced Encryption Algorithm to replace the Data Encryption Standard (DES) that expired in 1998. NIST has solicited candidate algorithms for inclusion...
Conference Paper
Full-text available
Error control coding (ECC) is essential for correcting soft errors in Flash memories. In such memories, as the number of erase/program cycles increases over time, the number of errors increases. In this paper we propose a flexible product code based ECC scheme that can support ECC of higher strength when needed. Specifically, we propose product cod...
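The product-code idea can be illustrated with the simplest possible component code: even parity per row and per column of a data grid, where a single bit error is located by the one failing row check and the one failing column check. This is only a sketch of the principle; the paper's construction uses stronger, adjustable component codes.

```python
# Row/column parity product code correcting a single bit error (illustrative only).
def encode(rows):
    row_par = [sum(r) % 2 for r in rows]
    col_par = [sum(c) % 2 for c in zip(*rows)]
    return row_par, col_par

def correct_single_error(rows, row_par, col_par):
    bad_r = [i for i, r in enumerate(rows) if sum(r) % 2 != row_par[i]]
    bad_c = [j for j, c in enumerate(zip(*rows)) if sum(c) % 2 != col_par[j]]
    if len(bad_r) == 1 and len(bad_c) == 1:
        i, j = bad_r[0], bad_c[0]
        rows[i][j] ^= 1                      # flip the located bit
    return rows

data = [[1, 0, 1, 1],
        [0, 1, 1, 0],
        [1, 1, 0, 0]]
rp, cp = encode(data)
data[1][2] ^= 1                              # inject a single soft error
assert correct_single_error(data, rp, cp) == [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
```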
Conference Paper
Full-text available
The rapid advancements in the computational capabilities of the graphics processing unit (GPU) as well as the deployment of general programming models for these devices have made the vision of a desktop supercomputer a reality. It is now possible to assemble a system that provides several TFLOPs of performance on scientific applications for the cos...
Article
A 32×32 64b self-arbitrating switch fabric called SWIFT achieves a bandwidth of 2.1 Tb/s with single-cycle arbitration and data transfer latency in 65 nm technology while operating at 1026 MHz at 1.2 V. SWIFT co-optimizes arbiter and crossbar logic using a unique fabric architecture that integrates conflict resolution with data routing to optimally use...
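The headline bandwidth follows directly from the port count, data width, and clock; a quick consistency check:

```python
# 32 ports x 64-bit data paths at 1026 MHz gives roughly the reported 2.1 Tb/s.
ports, width_bits, freq_hz = 32, 64, 1026e6
print(ports * width_bits * freq_hz / 1e12)   # ~2.10 Tb/s
```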
Article
The impact of simultaneous multithreading (SMT) on small-scale designs (2-4 threads) has been successfully demonstrated in both academia and industry, but large-scale SMT (16-32 threads) is still impractical due to the adverse...
Article
Full-text available
Commercial and research work in the field of software defined radio (SDR) has produced designs which have been able to deliver the efficiency and computational power needed to process 3G wireless technologies. Though efficient 3G processing has been achieved by these designs, next generation 4G SDR technology requires 10–1000x more computational pe...
Conference Paper
Full-text available
Graphics processing units (GPUs) provide a low cost platform for accelerating high performance computations. The introduction of new programming languages, such as CUDA and OpenCL, makes GPU programming attractive to a wide variety of programmers. However, programming GPUs is still a cumbersome task for two primary reasons: tedious performance opti...
Conference Paper
Full-text available
Driven by continued scaling of Moore's Law, the number of processing elements on a die is increasing dramatically. Recently there has been a surge of wide single-instruction multiple-data (SIMD) architectures designed to handle computationally intensive applications like 3D graphics, high definition video, image processing, and wireless communication. A...
Conference Paper
Full-text available
Contention management is an important design component to a transactional memory system. Without effective contention management to ensure forward progress, a transactional memory system can experience live-lock, which is difficult to debug in parallel programs. Early work in contention management focused on heuristic managers that reacted to confl...
Conference Paper
Full-text available
This paper presents an ultra low power programmable processor architecture for wireless devices that support 4G wireless communications and video decoding. To derive such an architecture, first we analyzed the kernel algorithms that constitute these applications. The characteristics of these algorithms helped define the wide-SIMD architecture, wher...