Norman P. Jouppi, Ph.D.
Google Inc. | Google
237 Publications
124,889 Reads
29,858 Citations
Publications (237)
Quantization has become a mainstream compression technique for reducing model size, computational requirements, and energy consumption for modern deep neural networks (DNNs). With the improved numerical support in recent hardware, including multiple variants of integer and floating point, mixed-precision quantization has become necessary to achieve...
We expect that many-core microprocessors will push performance per chip from the 10 gigaflop to the 10 teraflop range in the coming decade. To support this increased performance, memory and inter-core bandwidths will also have to scale by orders of magnitude. Pin limitations, the energy cost of electrical signaling, and the non-scalability of chip-...
The 2008 Corona effort was inspired by a pressing need for more of everything, as demanded by the salient problems of the day. Dennard scaling was no longer in effect. A lot of computer architecture research was in the doldrums. Papers often showed incremental subsystem performance improvements, but at incommensurate cost and complexity. The many-c...
In response to innovations in machine learning (ML) models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization,...
Neural Architecture Search (NAS), together with model scaling, has shown remarkable progress in designing high accuracy and fast convolutional architecture families. However, as neither NAS nor model scaling considers sufficient hardware architecture details, they do not take full advantage of the emerging datacenter (DC) accelerators. In this pape...
Five years ago, few would have predicted that a software company like Google would build its own computers. Nevertheless, Google has been deploying computers for machine learning (ML) training since 2017, powering key Google services. These Tensor Processing Units (TPUs) are composed of chips, systems, and software, all co-designed in-house. In thi...
Data-parallel ML models can take several days or weeks to train on several accelerators. Such long training runs require the cluster's resources to remain available for the entire duration of the job. On a mesh network this is challenging because failures will create holes in the mesh. Packets must be routed around the failed...
Google's TPU supercomputers train deep neural networks 50x faster than general-purpose supercomputers running a high-performance computing benchmark.
Tensor processing units improve performance per watt of neural networks in Google datacenters by roughly 50x.
The first-generation tensor processing unit (TPU) runs deep neural network (DNN) inference 15-30 times faster with 30-80 times better energy efficiency than contemporary CPUs and GPUs in similar semiconductor technologies. This domain-specific architecture (DSA) is a custom chip that has been deployed in Google datacenters since 2015, where it serv...
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix mult...
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multip...
This column features retrospectives from the authors of six MICRO Test of Time award-winning papers: "MIPS: A Microprocessor Architecture" by Norman Jouppi and colleagues; "HPS, A New Microarchitecture: Rationale and Introduction" by Yale Patt, Wen-Mei Hwu, and Mike Shebanow; "Critical Issues Regarding HPS, A High Performance Microarchitecture" by...
In this paper, we describe CACTI-IO, an extension to CACTI that includes power, area, and timing models for the IO and PHY of the off-chip memory interface for various server and mobile configurations. CACTI-IO enables design space exploration of the off-chip IO along with the dynamic random access memory and cache parameters. We describe the model...
3D-stacked DRAM has the potential to provide high performance and large capacity memory for future high performance computing systems and datacenters, and the integration of a dedicated logic die opens up opportunities for architectural enhancements such as DRAM row-buffer caches. However, high performance and cost-effective row-buffer cache design...
A processing system is provided for processing signals in a processor system including first and second conjoined-cores, and sharing a single floating point unit or a single memory interconnection network port by the first and second conjoined-cores.
Embodiments herein relate to a method for remapping data. In an embodiment, it is determined if a first memory block is faulty. A pointer is stored to the first memory block and a pointer flag of the first memory block is set when the first memory block is faulty. Data previously stored at the first memory block is written to a second memory block,...
An error test routine tests for a type of memory error by changing a content of a memory module. A memory handling procedure isolates the memory error in response to a positive outcome of the error test routine. The error test routine and memory handling procedure are to be performed at runtime transparent to an operating system. Information corres...
Embodiments of the present invention are directed to optoelectronic network switches. In one embodiment, an optoelectronic switch includes a set of roughly parallel input waveguides and a set of roughly parallel output waveguides positioned roughly perpendicular to the input waveguides. Each of the output waveguides crosses the set of input wavegui...
A disclosed example apparatus includes an interface (702, 726) to receive a request to access a memory (602a) of a memory module (600) and a data store status monitor (730) to determine a status of the memory. The example apparatus also includes a message output subsystem (732) to, when the memory is busy, respond to the request with a negative ack...
Various embodiments of the present invention are directed to multi-core memory modules. In one embodiment, a memory module (500) includes memory chips, and a demultiplexer register (502) electronically connected to each of the memory chips and a memory controller. The memory controller groups one or more of the memory chips into at least one virtual m...
New phase-change memory (PCM) devices have low-access latencies (like DRAM) and high capacities (i.e., low cost per bit, like Flash). In addition to being able to scale to smaller cell sizes than DRAM, a PCM cell can also store multiple bits per cell (referred to as multilevel cell, or MLC), enabling even greater capacity per bit. However, reading...
A circuit switched optical interconnection fabric includes a first hollow metal waveguide and a second hollow metal waveguide which intersects the first hollow metal waveguide to form an intersection. An optical element within the intersection is configured to selectively direct an optical signal between the first hollow metal waveguide and a secon...
Nonvolatile memories (NVMs) are promising technologies for replacing SRAM or eDRAM in low-level on-chip caches and main memories because they can save standby power and provide high cache capacity. However, limited write endurance is a common problem for NVM technologies. The current memory management policies are not write variation aware and resu...
A system and method for control of video bandwidth based on the pose of a person. In one embodiment, a plurality of video streams is obtained that are representative of images at a first location. The video streams are communicated from the first location to a second location. A pose of the head of a person is determined wherein the person is at on...
Various embodiments of the present invention are directed to methods that enable a memory controller to choose a particular operation mode for virtual memory devices of a memory module based on dynamic program behavior. In one embodiment, a method for determining an operation mode for each virtual memory device of a memory module includes selecting...
Various embodiments of the present invention are directed to multi-core memory modules. In one embodiment, a memory module (500) includes at least one virtual memory device and a demultiplexer register (502) disposed between the at least one virtual memory device and a memory controller. The demultiplexer register receives a command identifying one...
Example methods, apparatus, and articles of manufacture to perform error detection and correction are disclosed. A disclosed example method involves enabling a memory controller to operate in one of a tagged memory mode or a non-tagged memory mode. In addition, when the tagged memory mode is enabled in the memory controller, a five-error-correction...
A telescopic spatial radio system is provided for sending a signal representative of a sound at a speaker location to a listener location, the signal providing positioning information of the speaker location relative to the listener location and processing the signal using the positioning information to provide a telescopic zoomable binaural sound...
Various embodiments of the present invention relate to systems for reducing the amount of power consumed in temperature tuning resonator-based transmitters and receivers. In one aspect, a system comprises an array of resonators (801-806) disposed adjacent to a waveguide (646) and a heating element (808). The heating element is operated to thermally...
Nonvolatile memories (NVMs) have the potential to replace low-level SRAM or eDRAM on-chip caches because NVMs save standby power and provide large cache capacity. However, limited write endurance is a common problem for NVM technologies, and today's cache management might result in unbalanced cache write traffic, causing heavily written cache block...
A chat room system is provided that includes sending a signal representative of a sound at a first user location to a second user location and establishing, in a chat room, a virtual first user location and a virtual second user location. It further includes establishing the orientation of a listening system at the second user location and processi...
Persistent memory is an emerging technology which allows in-memory persistent data objects to be updated at much higher throughput than when using disks as persistent storage. Previous persistent memory designs use logging or copy-on-write mechanisms to update persistent data, which unfortunately reduces the system performance to roughly half that...
Many new memory technologies are available for building future energy-efficient memory hierarchies. It is necessary to have a framework that can quickly find the optimal memory technology at each hierarchy level. In this work, we first build a circuit-architecture joint design space exploration framework by combining RC circuit analysis and Artific...
Multilevel-cell (MLC) phase change memory (PCM) may provide both high capacity main memory and faster-than-Flash persistent storage. But slow growth in cell resistance with time, resistance drift, can cause transient errors in MLC-PCM. Drift errors increase with time, and prior work suggests refresh before the cell loses data. The need for refresh...
Metal-Oxide Resistive Random Access Memory (ReRAM) technology is gaining popularity due to its superior write bandwidth, high density, and low operating power. An ReRAM array structure can be built with three different approaches: a traditional design with a dedicated access transistor (1T1R) or an access diode (1D1R) for each cell, or an intrinsic...
The datacenter today is almost everybody’s other computer. The primary computer is increasingly a mobile device such as a smart phone, tablet, or laptop. The bulk of both the computing and the storage is done in the datacenter since the mobile devices are energy constrained. Datacenters vary in both the size and the performance of the constituent c...
A method and system for flexible prefetching of data and/or instructions for applications are described. A prefetching mechanism monitors program instructions and tag information associated with the instructions. The tag information is used to determine when a prefetch operation is desirable. The prefetching mechanism then requests data and/or inst...
Resistive Random Access Memory (ReRAM) is one of the most promising emerging memory technologies as a potential replacement for DRAM memory and/or NAND Flash. Multi-level cell (MLC) ReRAM, which can store multiple bits in a single ReRAM cell, can further improve density and reduce cost-per-bit, and therefore has recently been investigated extensive...
Energy-efficient and cost-effective memory hierarchies are needed in the next era of computing. Currently, many emerging non-volatile memory technologies such as PCRAM, STTRAM, and ReRAM, can potentially meet this requirement. It is necessary to have a framework that can quickly find the optimal memory technology choice and the corresponding circui...
With their significant performance and energy advantages, emerging manycore processors have also brought new challenges to the architecture research community. Manycore processors are highly integrated complex system-on-chips with complicated core and uncore subsystems. The core subsystems can consist of a large number of traditional and asymmetric...
This article introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At microarchitectural level, McPAT includes models for the fundamental components of a complete chip multiprocessor, in...
Over the next decade, significant progress must be made in research on computer architectures that enable unprecedented improvements in the efficiency of large-scale computing systems, particularly to support applications that require exascale algorithmic performance. Here, we review the performance requirements for both high-performance computing...
Modern computers require large on-chip caches, but the scalability of traditional SRAM and eDRAM caches is constrained by leakage and cell density. Emerging non-volatile memory (NVM) is a promising alternative to build large on-chip caches. However, limited write endurance is a common problem for non-volatile memory technologies. In addition, today...
A first step toward advanced optical interconnect technologies is a data-center switch using a multibus optical backplane. This network switch replaces the electronic backplane and communications fabric application-specific integrated circuit (ASIC) with an optical fabric based on multiple optical broadcast buses. The prototype demonstrated reliabl...
We describe CACTI-IO, an extension to CACTI that includes power, area and timing models for the IO and PHY of the off-chip memory interface for various server and mobile configurations. CACTI-IO enables quick design space exploration of the off-chip IO along with the DRAM and cache parameters. We describe the models added to CACTI-IO that help incl...
We describe CACTI-IO, an extension to CACTI [4] that includes power, area and timing models for the IO and PHY of the off-chip memory interface for various server and mobile configurations. CACTI-IO enables design space exploration of the off-chip IO along with the DRAM and cache parameters. We describe the models added and three case studies that...
Resiliency is one of the toughest challenges in high-performance computing, and memory accounts for a significant fraction of errors. Providing strong error tolerance in memory usually requires a wide memory channel that incurs a large access granularity (hence, a large cache line). Unfortunately, applications with limited spatial locality waste me...
Memory system reliability is a serious and growing concern in modern servers. Existing chipkill-level memory protection mechanisms suffer from several drawbacks. They activate a large number of chips on every memory access -- this increases energy consumption, and reduces performance due to the reduction in rank-level parallelism. Additionally, the...
With conventional memory technologies approaching their scaling limit, the search for a new technology has gained increased attention in recent years. Resistive RAM (ReRAM), with its superior write latency and energy, small cell size (4F² for a single level cell, where F is the feature size), and support for 3D stacking, has been a promising candidat...
Various new nonvolatile memory (NVM) technologies have emerged recently. Among all the investigated new NVM candidate technologies, spin-torque-transfer memory (STT-RAM, or MRAM), phase-change random-access memory (PCRAM), and resistive random-access memory (ReRAM) are regarded as the most promising candidates. As the ultimate goal of this NVM rese...
Memory system reliability is a serious and growing concern in modern servers. Existing chipkill-level memory protection mechanisms suffer from several drawbacks. They activate a large number of chips on every memory access -- this increases energy consumption, and reduces performance due to the reduction in rank-level parallelism. Additionally, the...
Free-p—fine-grained remapping with error checking and correcting (ECC) and embedded pointers—remaps worn-out nonvolatile RAM (NVRAM) blocks at a fine granularity without requiring large dedicated storage and protects NVRAM against both hard and soft errors. Furthermore, Free-p can be implemented purely in the memory controller, avoiding custom NVRA...
Networking consumes up to 33 percent of modern data center power. Network switches are the key source of inefficiency: a switch traversal costs an order of magnitude more than a link traversal. The authors propose a new high-radix switch architecture that uses emerging integrated optical interconnect technology to reduce switch power. They tailor e...
Advanced optical interconnect technologies can enable significantly higher radix and higher bandwidth network switches. These can provide lower power, higher bandwidth, and lower latency networks for future high-performance computing systems.
VLSI process technology scaling has enabled dramatic improvements in the capacity and peak bandwidth of DRAM devices. However, current standard DDRx DIMM memory interfaces are not well tailored to achieve high energy efficiency and performance in modern chip-multiprocessor-based computer systems. Their suboptimal performance and energy inefficiency...
Emerging 3D die-stacked DRAM technology is one of the most promising solutions for future memory architectures to satisfy the ever-increasing demands on performance, power, and cost. This paper introduces CACTI-3DD, the first architecture-level integrated power, area, and timing modeling framework for 3D die-stacked off-chip DRAM main memory. CACTI...
Main memory latencies have always been a concern for system performance. Given that reads are on the critical path for CPU progress, reads must be prioritized over writes. However, writes must be eventually processed and they often delay pending reads. In fact, a single channel in the main memory system offers almost no parallelism between reads...
A System-on-Chip (SoC) integrates multiple discrete components into a single chip, for example by placing CPU cores, network interfaces and I/O controllers on the same die. While SoCs have dominated high-end embedded products for over a decade, system-level integration is a relatively new trend in servers, and is driven by the opportunity to lower...
Resiliency will be one of the toughest challenges in future exascale systems. Memory errors contribute more than 40% of the total hardware-related failures and are projected to increase in future exascale systems. The use of error correction codes (ECC) and checkpointing are two effective approaches to fault tolerance. While there are numerous stud...
This paper introduces CACTI-P, the first architecture-level integrated power, area, and timing modeling framework for SRAM-based structures with advanced leakage power reduction techniques. CACTI-P supports modeling of major leakage power reduction approaches including power-gating, long channel devices, and Hi-k metal gate devices. Because it acco...
Keywords: non-blocking cache; MSHR; out-of-order processors. Non-blocking caches are an effective technique for tolerating cache-miss latency. They can reduce miss-induced processor stalls by buffering the misses and continuing to serve other independent access requests. Previous research on the complexity and performance of non-blocking caches supporting non...
The latency, bandwidth, and power consumption of on-chip interconnection networks are central concerns in the design of multi- and many-core microprocessors. When the global network-on-chip (NoC) is electrical, the power consumption and the limited connectivity caused by difficulties associated with global wires will limit network performance due t...
For large-scale networks, high-radix switches reduce hop and switch count, which decreases latency and power. The ITRS projections for signal-pin count and per-pin bandwidth are nearly flat over the next decade, so increased radix in electronic switches will come at the cost of less per-port bandwidth. Silicon nanophotonic technology provides a lon...
It is well-known that memory latency, energy, capacity, bandwidth, and scalability will be critical bottlenecks in future large-scale systems. This paper addresses these problems, focusing on the interface between the compute cores and memory, comprising the physical interconnect and the memory access protocol. For the physical interconnect, we stu...
The scalability of future Massively Parallel Processing (MPP) systems is being severely challenged by high failure rates. Current centralized Hard Disk Drive (HDD) checkpointing results in overhead of 25% or more at petascale. Since systems become more vulnerable as the node count keeps increasing, novel techniques that enable fast and frequent che...
It is well-known that memory latency, energy, capacity, bandwidth, and scalability will be critical bottlenecks in future large-scale systems. This paper addresses these problems, focusing on the interface between the compute cores and memory, comprising the physical interconnect and the memory access protocol. For the physical interconnect, we st...
A key determinant of overall system performance and power dissipation is the cache hierarchy since access to off-chip memory consumes many more cycles and energy than on-chip accesses. In addition, multi-core processors are expected to place ever higher bandwidth demands on the memory system. All these issues make it important to avoid off-chip mem...
Emerging non-volatile memory (NVM) technologies have matured in recent years. These emerging NVM technologies have demonstrated great potential for universal memory hierarchy design. Among all the technology candidates, resistive random-access memory (RRAM) is considered to be the most promising as it operates faster than phase-change me...
Emerging non-volatile memories such as phase-change RAM (PCRAM) offer significant advantages but suffer from write endurance problems. However, prior solutions are oblivious to soft errors (recently raised as a potential issue even for PCRAM) and are incompatible with high-level fault tolerance techniques such as chipkill. To additionally address s...
The ACM International Conference Proceedings Series (ICPS) has recently been relaunched as a publication venue for research activities.
This chapter presents a landscape of cache hierarchy implementations commonly employed in research/development and identifies their key distinguishing features. These features include the following: shared vs. private, centralized vs. distributed, and uniform vs. non-uniform access. There is little consensus in the community about what constitutes...
The previous chapter focused on policies that blurred the line between shared and private caches, and that attempted to bring data closer to requesting cores. Most of the papers discussed in this chapter are oblivious/agnostic to the non-uniform nature of cache access latencies. Therefore, much of the discussion here will focus on hit rates. For th...