John L. Hennessy
Stanford University | SU · Department of Computer Science

About

295
Publications
191,834
Reads
32,684
Citations
Additional affiliations
September 1987 - October 2016
Stanford University
Position
  • Professor
September 1977 - present
Stanford University
Position
  • Professor
January 1974 - August 1977
Stony Brook University
Position
  • PhD Student

Publications

Publications (295)
Research
Full-text available
"Computer Organization and Design: The Hardware/Software Interface" by David A. Patterson and John L. Hennessy is a renowned textbook that explores the fundamental principles of computer architecture and organization. It provides a comprehensive and in-depth examination of the interplay between hardware and software in computer systems. The book cov...
Article
This paper first reviews the Spectre and Meltdown processor security vulnerabilities that were revealed during January-October 2018 and that allow the extraction of protected information from billions of processors in systems large and small. It then discusses short-term mitigation actions and speculates on the longer-term implications to computer...
Article
Innovations like domain-specific hardware, enhanced security, open instruction sets, and agile chip development will lead the way.
Article
Full-text available
This column features retrospectives from the authors of six MICRO Test of Time award-winning papers: "MIPS: A Microprocessor Architecture" by Norman Jouppi and colleagues; "HPS, A New Microarchitecture: Rationale and Introduction" by Yale Patt, Wen-Mei Hwu, and Mike Shebanow; "Critical Issues Regarding HPS, A High Performance Microarchitecture" by...
Article
Single chip processor performance has improved dramatically since the inception of the four-bit microprocessor in 1971. This is due in part to technological advances (i.e., faster devices and greater device density), but also to the adoption of architectural approaches well suited to the opportunities and limitations of VLSI. The most appr...
Article
Full-text available
While the desire to use commodity parts in the communication architecture of a DSM multiprocessor offers advantages in cost and design time, the impact on application performance is unclear. We study this performance impact through detailed simulation, analytical modeling, and experiments on a flexible DSM prototype, using a range of paral...
Article
Full-text available
In modern processors, the performance of the memory hierarchy is crucial in determining the overall performance of a CPU. Among the most important factors in deciding the performance of a CPU is the cache performance, and particularly, the cache miss rate. Determining and improving the hit rate of a cache is one of the most important tasks undertak...
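A minimal sketch of the miss-rate bookkeeping discussed above, as a direct-mapped cache simulation in C; the cache geometry and the address trace are hypothetical, chosen only to show how conflict misses arise, not taken from the paper:

    #include <stdio.h>

    /* Hypothetical direct-mapped cache: 64 sets of 32-byte blocks (2 KB). */
    #define NUM_SETS   64
    #define BLOCK_BITS 5

    int main(void) {
        unsigned long tags[NUM_SETS];
        int valid[NUM_SETS] = {0};
        /* Hypothetical address trace; a real study replays a program trace. */
        unsigned long trace[] = {0x1000, 0x1004, 0x1020, 0x9000, 0x1000, 0x9004};
        int n = (int)(sizeof trace / sizeof trace[0]);
        int misses = 0;

        for (int i = 0; i < n; i++) {
            unsigned long block = trace[i] >> BLOCK_BITS;
            unsigned long set = block % NUM_SETS;
            unsigned long tag = block / NUM_SETS;
            if (!valid[set] || tags[set] != tag) {  /* miss: allocate block */
                misses++;
                valid[set] = 1;
                tags[set] = tag;
            }
        }
        printf("miss rate = %d/%d = %.2f\n", misses, n, (double)misses / n);
        return 0;
    }

Here 0x1000 and 0x9000 map to the same set, so they repeatedly evict each other: a conflict-miss pattern a set-associative cache would avoid.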
Article
Simulation is the primary method for evaluating computer systems during all phases of the design process. One significant problem with simulation is that it rarely models the system exactly, and quantifying the resulting simulator error can be difficult. More importantly, architects often assume without proof that although their simulator may make...
Article
Simulation is the primary method for evaluating computer systems during all phases of the design process. One significant problem with simulation is that it rarely models the system exactly, and quantifying the resulting simulator error can be difficult. More importantly, architects often assume without proof that although their simulator may make...
Conference Paper
Simulation is the primary method for evaluating computer systems during all phases of the design process. One significant problem with simulation is that it rarely models the system exactly, and quantifying the resulting simulator error can be difficult. More importantly, architects often assume without proof that although their simulator may make...
Article
Full-text available
Simulation is the primary method for evaluating computer systems during all phases of the design process. One significant problem with simulation is that it rarely models the system exactly, and quantifying the resulting simulator error can be difficult. More importantly, architects often assume without proof that although their simulator may make in...
Article
Shared-memory multiprocessors that use the latest microprocessors are becoming widely used both as compute servers and as desktop computers. But the difficulty in developing parallel software is a major obstacle to the effective use of the multiprocessors to solve a single task. To increase the productivity of multiprocessor programmers, we develop...
Article
Multimedia applications are becoming ubiquitous. Unlike conventional interactive and batch applications, these applications often have real-time requirements. As multimedia applications are integrated with conventional non-real-time applications in the general-purpose computing environment, the problem arises of how to support the resulting mix of a...
Conference Paper
Full-text available
Generating an accurate estimate of the performance of a program on a given system is important to a large number of people. Computer architects, compiler writers, and developers all need insight into a machine's performance. There are a number of performance estimation techniques in use, from profile-based approaches to full machine simulation. Thi...
Article
Generating an accurate estimate of the performance of a program on a given system is important to a large number of people. Computer architects, compiler writers, and developers all need insight into a machine's performance. There are a number of performance estimation techniques in use, from profile-based approaches to full machine simulation. Thi...
Article
Full-text available
Object-oriented programming languages promise to improve programmer productivity by supporting abstract data types, inheritance, and message passing directly within the language. Unfortunately, traditional implementations of object-oriented language features, particularly message passing, have been much slower than traditional implementations of th...
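The dispatch cost referred to above can be made concrete in C; a minimal sketch of vtable-style message passing (type and function names are illustrative, not from the paper):

    #include <stdio.h>

    /* Each object carries a pointer to a table of method pointers, so every
     * "message send" is an indirect call -- the overhead the abstract cites. */
    struct shape;
    struct shape_vtable { double (*area)(const struct shape *); };
    struct shape { const struct shape_vtable *vt; double w, h; };

    static double rect_area(const struct shape *s) { return s->w * s->h; }
    static double tri_area(const struct shape *s)  { return 0.5 * s->w * s->h; }

    static const struct shape_vtable rect_vt = { rect_area };
    static const struct shape_vtable tri_vt  = { tri_area };

    int main(void) {
        struct shape r = { &rect_vt, 3.0, 4.0 };
        struct shape t = { &tri_vt,  3.0, 4.0 };
        /* Dispatch: load vt, load the method pointer, make an indirect call. */
        printf("%.1f %.1f\n", r.vt->area(&r), t.vt->area(&t));
        return 0;
    }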
Article
Full-text available
Thesis (Ph. D.)--Stanford University, 1999. Submitted to the Department of Electrical Engineering. Copyright by the author.
Article
Full-text available
After 20 years in academia and Silicon Valley, the new Provost of Stanford University calls for a shift in focus for systems research. Performance, long the centerpiece, needs to share the spotlight with availability, maintainability, and other qualities. Although performance increases over the past 15 years have been truly amazing, it will be ha...
Article
Full-text available
Distributed shared memory is an architectural approach that allows multiprocessors to support a single shared address space that is implemented with physically distributed memories. Hardware-supported distributed shared memory is becoming the dominant approach for building multiprocessors with moderate to large numbers of processors. Cache coherenc...
Article
...this paper, a few aspects of the design are important to understanding the motivation for the design strategy that was employed. The R4000 is a general-purpose processor and its design goals included a variety of functional capabilities as well as performance goals. Among the key functional capabilities required for the R4000 are: the implementatio...
Article
Full-text available
Hardware/software codesign is a methodology for solving design problems in systems with processors or embedded controllers where the design requirements mandate a functionality and performance level for the system, independent of the hardware and software boundary. In addition to the challenges of functional correctness and total system performance...
Article
Full-text available
The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory and high-performance message passing, while minimizing both the hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of the machine's global memory, a port to the interconnection network, an I/O interface, and a custom node co...
Article
Full-text available
Scalable cache coherence protocols have become the key technology for creating moderate to large-scale shared-memory multiprocessors. Although the performance of such multiprocessors depends critically on the performance of the cache coherence protocol, little comparative performance data is available. Existing commercial implementations use a vari...
Article
Full-text available
This research focused on the design and development of scalable shared-memory machines, in particular those using directory-based cache coherence. This research led to the design and fabrication of the Stanford DASH machine.
Conference Paper
Full-text available
The problem of cache coherence in shared-memory multiprocessors is addressed using two basic approaches: directory schemes and snoopy cache systems. Directory schemes for cache coherence are potentially attractive in large multiprocessor systems that are beyond the scaling limits of the snoopy cache schemes. Slight modifications to directory scheme...
Conference Paper
Full-text available
Given the limitations of bus-based multiprocessors, CC-NUMA is the scalable architecture of choice for shared-memory machines. The most important characteristic of the CC-NUMA architecture is that the latency to access data on a remote node is considerably larger than the latency to access local memory. On such machines, good data locality can redu...
Article
[Preview shows only reference-list fragments: "...benefits of multiple hardware contexts in a multiprocessor architecture: Preliminary results," in Proceedings of the 16th International Symposium on Computer Architecture, pages 273-280, June 1989; and [98] M. E. Wolf and M. Lam, "A data locality optimizing algorithm," in Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implement...]
Article
Full-text available
This thesis provides design and analysis of techniques for global load balancing on ensemble architectures running soft-real-time object-oriented applications with statistically periodic loads. It focuses on estimating the instantaneous average load over all the processing elements. The major contribution is the use of explicit stochastic process m...
Article
Full-text available
The problem of cache coherence in shared-memory multiprocessors has been addressed using two basic approaches: directory schemes and snoopy cache schemes. Directory schemes have been given less attention in the past several years, while snoopy cache methods have become extremely popular. Directory schemes for cache coherence are potentially attract...
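A minimal C sketch of the per-block state a full-bit-vector directory scheme maintains (field and function names are illustrative assumptions, not from the paper):

    #include <stdint.h>
    #include <stdio.h>

    enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED };

    struct dir_entry {
        enum dir_state state; /* coherence state of this memory block */
        uint64_t presence;    /* bit i set => processor i caches a copy */
    };

    /* On a write by processor p, invalidate every other cached copy. */
    static void handle_write(struct dir_entry *e, int p,
                             void (*send_inval)(int proc)) {
        for (int i = 0; i < 64; i++)
            if (((e->presence >> i) & 1) && i != p)
                send_inval(i);
        e->presence = (uint64_t)1 << p;
        e->state = DIR_MODIFIED;
    }

    static void print_inval(int proc) { printf("invalidate P%d\n", proc); }

    int main(void) {
        struct dir_entry e = { DIR_SHARED, 0x5 }; /* cached by P0 and P2 */
        handle_write(&e, 0, print_inval);         /* P0 writes: P2 invalidated */
        return 0;
    }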
Article
Full-text available
States that the move to large-scale parallelism has been the greatest gain in computing performance since 1989, according to the National Science Foundation (NSF). Information on the use of shared-memory Parallel Vector Processors (PVP) in 1985; discussion of the introduction of Massively Parallel Processing (MPP) in 1989; information on the percentage of gr...
Article
Full-text available
The construction of a cache-coherent distributed shared memory (DSM) machine involves many organizational and implementation trade-offs. This paper studies the performance implications of these trade-offs as made on some real DSM machines. We focus on characteristics related to communication and examine their impact on delivered application perform...
Article
Full-text available
The memory consistency model supported by a multiprocessor architecture determines the amount of buffering and pipelining that may be used to hide or reduce the latency of memory accesses. Several different consistency models have been proposed. These range from sequential consistency on one end, allowing very limited buffering, to release consiste...
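The buffering at stake here is visible in the classic store-buffering litmus test; a C11 sketch using POSIX threads (an illustration of the concept, not code from the paper), where the outcome r1 == r2 == 0 is forbidden by sequential consistency but permitted once stores may be buffered past loads:

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int x, y;  /* both start at 0 */
    int r1, r2;

    static void *t1(void *arg) {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return NULL;
    }
    static void *t2(void *arg) {
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
    }
    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* r1 == 0 && r2 == 0 cannot occur under sequential consistency. */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }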
Article
This paper describes an optimizing compiler system that solves the key problem of aggregate copy elimination. The methods developed rely exclusively on compile-time algorithms, including interprocedural analysis, that are applied to an intermediate data flow representation. By dividing the problem into update-in-place and build-in-place analysis, a...
Article
Full-text available
Hardware/software co-design is a methodology for solving design problems in systems with processors or embedded controllers where the design requirements mandate a functionality and performance level for the system, independent of the hardware and software boundary. In addition to the challenges of functional correctness and total system performanc...
Conference Paper
Full-text available
Studies done with academic CC-NUMA machines and simulators indicate a good potential for application performance. Our goal, therefore, is to investigate whether the CONVEX Exemplar, a commercial distributed shared memory machine, lives up to the expected potential of CC-NUMA machines. If not, we would like to understand what architectural or implemen...
Article
Thesis (Ph. D.)--Stanford University, 1997. Submitted to the Department of Electrical Engineering. Copyright by the author.
Conference Paper
Full-text available
One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment across the clusters, it is possible to use a virtual shared-memory software layer. Because of the...
Article
One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment across the clusters, it is possible to use a virtual shared-memory software layer. Because of the...
Article
One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment across the clusters, it is possible to use a virtual shared-memory software layer. Because of the...
Article
A universal lower-bound technique for the size and other implementation characteristics of an arbitrary Boolean function as a decision tree and as a two-level AND/OR circuit is derived. The technique is based on the power spectrum coefficients of the n-dimensional Fourier transform of the function. The bounds vary from constant to exponential and a...
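For reference, the power spectrum in question is standardly defined as follows for a Boolean function f: {0,1}^n -> {-1,+1} (a sketch of the usual normalization; the paper's conventions may differ):

    \hat{f}(S) = 2^{-n} \sum_{x \in \{0,1\}^n} f(x) \chi_S(x),
    \qquad \chi_S(x) = (-1)^{\sum_{i \in S} x_i},

with Parseval's identity \sum_{S \subseteq \{1,\dots,n\}} \hat{f}(S)^2 = 1, which is the kind of constraint such lower-bound arguments exploit.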
Conference Paper
Full-text available
Many of the programming challenges encountered in small to moderate-scale hardware cache-coherent shared memory machines have been extensively studied. While work remains to be done, the basic techniques needed to efficiently program such machines have been well explored. Recently, a number of researchers have presented architectural techniques for...
Article
Many of the programming challenges encountered in small to moderate-scale hardware cache-coherent shared memory machines have been extensively studied. While work remains to be done, the basic techniques needed to efficiently program such machines have been well explored. Recently, a number of researchers have presented architectural techniques for...
Article
Full-text available
Compiler infrastructures that support experimental research are crucial to the advancement of high-performance computing. New compiler technology must be implemented and evaluated in the context of a complete compiler, but developing such an infrastructure requires a huge investment in time and resources. We have spent a number of years building th...
Article
Today, many VLSI designs have processors at their core. Microprocessors are one obvious example; however, other examples abound. Many special-purpose embedded controllers are built around a microprocessor. Digital Signal Processors (DSPs) are special-purpose processors. Special-purpose engines for functions such as graphics and video...
Article
The University of Illinois has traditionally been one of the nation's major centers for computer architecture education and research. This short paper briefly describes the computer architecture curriculum at the University of Illinois and discusses a few ...
Article
Contents: Fundamentals of computer design; instruction set principles and examples; instruction-level parallelism and its dynamic exploitation; exploiting instruction-level parallelism with software approaches; memory hierarchy design; multiprocessors and thread-level parallelism...
Article
Full-text available
The memory consistency model supported by a multiprocessor directly affects its performance. Thus, several attempts have been made to relax the consistency models to allow for more buffering and pipelining of memory accesses. Unfortunately, the potential increase in performance afforded by relaxing the consistency model is accompanied by a more com...
Article
Full-text available
Scalable shared-memory multiprocessors distribute memory among the processors and use scalable interconnection networks to provide high bandwidth and low latency communication. In addition, memory accesses are cached, buffered, and pipelined to bridge the gap between the slow shared memory and the fast processors. Unless carefully controlled, such...
Article
Integrating support for block data transfer has become an important emphasis in recent cache-coherent shared address space multiprocessors. This paper examines the potential performance benefits of adding this support. A set of ambitious hardware mechanisms is used to study performance gains in five important scientific computations that appear to...
Article
Hierarchical N-body methods, which are based on a fundamental insight into the nature of many physical processes, are increasingly being used to solve large-scale problems in a variety of scientific/engineering domains. Applications that use these methods are challenging to parallelize effectively, however, owing to their nonuniform, dynamically ch...
Article
Full-text available
Designers of distributed shared memory (DSM) multiprocessors are moving toward the use of commodity parts, not only in the processor and memory subsystem but also in the communication architecture. While the desire to use commodity parts and not perturb the underlying uniprocessor node can compromise the efficiency of the communication architecture,...
Article
Full-text available
In order to design effective large-scale multiprocessors, designers must understand the characteristics of the applications that will use the machines. One important class of applications is based on hierarchical N-body methods. In this article, the key architectural implications of representative applications that employ the two dominant hierarchi...
Article
Full-text available
The fundamental premise behind the DASH project is that it is feasible to build large-scale shared-memory multiprocessors with hardware cache coherence. While paper studies and software simulators are useful for understanding many high-level design trade-offs, prototypes are essential to ensure that no critical details are overlooked. A prototype...
Article
Full-text available
The paper, Programming for Different Memory Consistency Models [GAG+92], defines the PLpc memory model. This companion note formalizes the system requirements for PLpc along with a proof that shows these requirements are sufficient for supporting this model. In addition, we prove the correctness of the conditions presented in the original paper [...
Article
This chapter discusses memory hierarchy. Programs exhibit both temporal locality, that is, the tendency to re-use recently accessed data items, and spatial locality, that is, the tendency to reference data items that are close to other recently accessed items. Memory hierarchies take advantage of temporal locality by keeping more recently accessed...
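Both kinds of locality show up in a simple C matrix traversal (array size illustrative): row-major order touches consecutive addresses and exploits spatial locality, while column-major order strides across whole rows and does not:

    #include <stdio.h>

    #define N 1024
    static double a[N][N];

    /* Good spatial locality: inner loop walks consecutive addresses. */
    static double sum_row_major(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Poor spatial locality: inner loop strides N doubles per access. */
    static double sum_col_major(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%f %f\n", sum_row_major(), sum_col_major());
        return 0;
    }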
Chapter
This chapter focuses on the basic ideas and definitions, the major components of software and hardware, and integrated circuits, the technology that fuels the computer revolution. Both hardware and software designers construct computer systems in hierarchical layers, with each lower layer hiding details from the level above. This principle of abstr...
Chapter
This chapter presents the construction of the datapath and control unit for two different implementations of the MIPS instruction set. It reviews the core of the MIPS instruction set, including the memory-reference instructions load word (lw) and store word (sw), the arithmetic-logical instructions add, sub, a...
Chapter
This chapter explores the instruction set of a real computer, both in the form written by humans and in the form read by the machine. Starting from a notation that looks like a restricted programming language, it is refined step-by-step until one sees the real language of a real computer. The chapter presents an instruction set that follows the adv...
Chapter
This chapter provides an overview of interfacing processors and peripherals. Many of the characteristics of input/output (I/O) systems are driven by technology in processors, for example, the properties of disk drives affect how the disks should be connected to the processor and how the operating system interacts with the disks. I/O systems, howeve...
Chapter
Computer words are composed of bits; thus, words can be represented as binary numbers. Although the natural numbers 0, 1, 2, and so on can be represented either in decimal or binary form, the question arises of how to represent the other numbers that commonly occur. This chapter discusses the representation of numbers, arit...
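A small C example of the dominant convention, two's complement, in which negation is bitwise complement plus one and the same bit pattern reads differently as signed versus unsigned (a worked illustration, not from the chapter):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        int8_t  v = -6;
        uint8_t u = (uint8_t)v;  /* same 8 bits, unsigned view */
        printf("-6 viewed as an unsigned byte: %u\n", (unsigned)u); /* 250 */
        printf("~6 + 1 = %d\n", (int8_t)(~(uint8_t)6 + 1));         /* -6  */
        return 0;
    }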
Chapter
Pipelining is a technique that exploits parallelism among the instructions in a sequential instruction stream. It has the substantial advantage that, unlike some speedup techniques, it can be invisible to the programmer. This chapter reviews the concept of pipelining using the MIPS instruction subset and a simpl...
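To first order (the standard textbook model, sketched here with made-up numbers): an s-stage pipeline that issues one instruction per cycle approaches a speedup of s over the unpipelined machine, degraded by stalls:

    \text{Speedup} = \frac{s}{1 + \text{stall cycles per instruction}}

A 5-stage pipeline averaging 0.25 stall cycles per instruction, for example, achieves 5 / 1.25 = 4x rather than the ideal 5x.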
Chapter
This chapter provides an overview of parallel processors. It discusses single instruction stream, multiple data streams (SIMD) computers, multiple instruction streams, multiple data streams (MIMD) computers, programming MIMDs, and MIMDs connected by a single bus and a network. The virtues of SIMD are that all the parallel execution units are synchr...
Chapter
This chapter focuses on performance and its evaluation. All computer designers must balance performance and cost. There exists a domain of high-performance design, in which performance is the primary goal and cost is secondary. Much of the supercomputer industry designs in this fashion. At the other extreme is low-cost design, where cost takes prec...
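Such evaluation rests on the familiar CPU performance equation (worked here with made-up numbers):

    \text{CPU time} = \text{instruction count} \times \text{CPI} \times \text{clock cycle time}

For instance, a program executing 10^9 instructions at a CPI of 1.5 on a 1 GHz clock (1 ns cycle) takes 10^9 x 1.5 x 1 ns = 1.5 s.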
Conference Paper
Full-text available
Several multiprocessors have been proposed that offer programmable implementations of scalable cache coherence as well as support for message passing. In the FLASH machine, flexibility is obtained by the use of a programmable node controller, called MAGIC, through which all transactions in a node pass. We use the actual code sequences that implemen...
Conference Paper
Full-text available
Integrating support for block data transfer has become an important emphasis in recent cache-coherent shared address space multiprocessors. This paper examines the potential performance benefits of adding this support. A set of ambitious hardware mechanisms is used to study performance gains in five important scientific computations...
Article
Integrating support for block data transfer has become an important emphasis in recent cache-coherent shared address space multiprocessors. This paper examines the potential performance benefits of adding this support. A set of ambitious hardware mechanisms is used to study performance gains in five important scientific computations that appear to...
Article
Full-text available
Modern compilers generate good code by performing global optimizations. Unlike other functions of the compiler such as parsing and code generation which examine only one statement or one basic block at a time, optimizers examine large parts of a program and coordinate changes in widely separated parts of a program. Thus optimizers use more complex...
Article
Full-text available
Effectively using shared-memory multiprocessors requires substantial programming effort. We present the programming language COOL (Concurrent Object-Oriented Language), which was designed to exploit coarse-grained parallelism at the task level in shared-memory multiprocessors. COOL's primary design goals are efficiency and expressiveness. By effici...