Urs Hölzle's research while affiliated with Google Inc. and other places

Publications (52)

Chapter
The applications that run on warehouse-scale computers (WSCs) dominate many system design trade-off decisions. This chapter outlines some of the distinguishing characteristics of software that runs in large internet services and the system software and tools needed for a complete computing platform. Here are some terms used to describe the differen...
Chapter
As described in Chapter 1, one of the defining characteristics of WSCs is their emphasis on cost efficiency at scale. To better understand this, let us examine the total cost of ownership (TCO) of a data center. At the top level, costs split into capital expenses (Capex) and operational expenses (Opex). Capex refers to investments that must be made...
Chapter
As mentioned earlier, the architecture of WSCs is largely defined by the hardware building blocks chosen. This process is analogous to choosing logic elements for implementing a microprocessor, or selecting the right set of chipsets and components for a server platform. In this case, the main building blocks are server hardware, networking fabric,...
Chapter
The promise of web-based, service-oriented computing will be fully realized only if users can trust that the services they increasingly rely on will be always available. This expectation translates into a high-reliability requirement for building-sized computers. Determining the appropriate level of reliability is fundamentally a tradeoff between t...
Chapter
Internet and cloud services run on a planet-scale computer with workloads distributed across multiple data center buildings around the world. These data centers are designed to house computing, storage, and networking infrastructure. The main function of the buildings is to deliver the utilities needed by equipment and personnel there: power, cooli...
Chapter
Energy efficiency has been a major technology driver in the mobile and embedded areas for a long time. Work in this area originally emphasized extending battery life, but then expanded to include reducing peak power because thermal constraints began to limit further CPU performance improvements or packaging density in small devices. However, energy...
Article
This book describes warehouse-scale computers (WSCs), the computing platforms that power cloud computing and all the great web services we use every day. It discusses how these new systems treat the datacenter itself as one massive computer designed at warehouse scale, with hardware and software working in concert to deliver good levels of internet...
Article
We present our approach for overcoming the cost, operational complexity, and limited scale endemic to datacenter networks a decade ago. Three themes unify the five generations of datacenter networks detailed in this paper. First, multi-stage Clos topologies built from commodity switch silicon can support cost-effective deployment of building-scale n...
Article
We present our approach for overcoming the cost, operational complexity, and limited scale endemic to datacenter networks a decade ago. Three themes unify the five generations of datacenter networks detailed in this paper. First, multi-stage Clos topologies built from commodity switch silicon can support cost-effective deployment of building-scale...
Conference Paper
We present the design, implementation, and evaluation of B4, a private WAN connecting Google's data centers across the planet. B4 has a number of unique characteristics: i) massive bandwidth requirements deployed to a modest number of sites, ii) elastic traffic demand that seeks to maximize average bandwidth, and iii) full control over the edge ser...
Conference Paper
We present the design, implementation, and evaluation of B4, a private WAN connecting Google's data centers across the planet. B4 has a number of unique characteristics: i) massive bandwidth requirements deployed to a modest number of sites, ii) elastic traffic demand that seeks to maximize average bandwidth, and iii) full control over the edge ser...
Article
In this point-counterpoint discussion, Trevor Mudge argues for the combination of near-threshold voltage processors with techniques such as boosting to address the needs of datacenter workloads. Urs Hölzle offers a cautionary note on the wisdom of giving up too much single-threaded performance to achieve energy-efficiency in large Internet service...
Article
More than a dozen leading experts give their opinions on where the Internet is headed and where it will be in the next decade in terms of technology, policy, and applications. They cover topics ranging from the Internet of Things to climate change to the digital storage of the future. A summary of the articles is available in the Web extras section...
Book
As computation continues to move into the cloud, the computing platform of interest no longer resembles a pizza box or a refrigerator, but a warehouse full of computers. These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portio...
Conference Paper
The performance of popular Internet Web services is governed by a complex combination of server behavior, network characteristics and client workload - all interacting through the actions of the underlying transport control protocol (TCP). Consequently, even small changes to TCP or to the network infrastructure can have significant impact on...
Article
Program errors are hard to find because of the cause-effect gap between the instant when an error occurs and when the error becomes apparent to the programmer. Although debugging techniques such as conditional and data breakpoints help in finding errors in simple cases, they fail to effectively bridge the cause-effect gap in many situations. This p...
Conference Paper
The proliferation of the Internet is fueling the development of mobile computing environments in which mobile code is executed on remote sites. In such environments, the end user must often wait while the mobile program is transferred from the server to the client where it executes. This downloading can create significant delays, hurting the intera...
Conference Paper
Java programs perform many synchronization operations on data structures. Some of these synchronizations are unnecessary; in particular, if an object is reachable only by a single thread, concurrent access is impossible and no synchronization is needed. We describe an interprocedural, flow- and context-insensitive dataflow analysis that finds such s...
Conference Paper
Two-level predictors deliver highly accurate conditional branch prediction, indirect branch target prediction and value prediction. Accurate prediction enables speculative execution of instructions, a technique that increases instruction level parallelism. Unfortunately, the accuracy of a two-level predictor is limited by the cost of the predictor...
Conference Paper
jContractor is a purely library based approach to support Design By Contract specifications such as preconditions, postconditions, class invariants, and recovery and exception handling in Java. jContractor uses an intuitive naming convention, and standard Java syntax to instrument Java classes and enforce Design By Contract constructs. The design...
Conference Paper
Program errors are hard to find because of the cause-effect gap between the time when an error occurs and the time when the error becomes apparent to the programmer. Although debugging techniques such as conditional and data breakpoints help to find error causes in simple cases, they fail to effectively bridge the cause-effect gap in many situation...
Conference Paper
We present an analysis of the memory usage for six of the Java programs in the SPECjvm98 benchmark suite. Most of the programs are real-world applications with high demands on the memory system. For each program, we measured as much low level data as possible, including age and size distribution, type distribution, and the overhead of object align...
Conference Paper
Binary component adaptation (BCA) [KH98] is a mechanism to modify existing components (such as Java classfiles) to the specific needs of a programmer. Binary component adaptation allows components to be adapted and evolved in binary form. BCA rewrites component binaries while they are loaded, requires no source code access and guarantees release-to...
Article
OSUIF is an extension to SUIF 2.0 that provides support for the compilation of object-oriented languages. OSUIF extends standard SUIF in three main areas: symbol table, intermediate language, and exception handling. The resulting system should be able to support compilers for many (but not all) object-oriented languages. The two initial OSUIF front...
Conference Paper
Object relationships in modern software systems are becoming increasingly numerous and complex. Programmers who try to find violations of such relationships need new tools that allow them to explore objects in a large system more efficiently. Many existing debuggers present only a low-level, one-object-at-a-time view of objects and their relationshi...
Conference Paper
We study the direct cost of virtual function calls in C++ programs, assuming the standard implementation using virtual function tables. We measure this overhead experimentally for a number of large benchmark programs, using a combination of executable inspection and processor simulation. Our results show that the C++ programs measured spend a media...
Article
Dynamically dispatched calls often limit the performance of object-oriented programs, since object-oriented programming encourages factoring code into small, reusable units, thereby increasing the frequency of these expensive operations. Frequent calls not only slow down execution with the dispatch overhead per se, but more importantly they hinder...
Conference Paper
Row displacement dispatch tables implement message dispatching for dynamically-typed languages with a run time overhead of one memory indirection plus an equality test. The technique is similar to virtual function table lookup, which is, however, restricted to statically typed languages like C++. We show how to reduce the space requirements of disp...
Conference Paper
Object-oriented systems must implement message dispatch efficiently in order not to penalize the object-oriented programming style. We characterize the performance of most previously published dispatch techniques for both statically- and dynamically-typed languages with both single and multiple inheritance. Hardware organization (in particular, bra...
Conference Paper
Two promising optimization techniques for object-oriented languages are type feedback (profile-based receiver class prediction) and concrete type inference (static analysis). We directly compare the two techniques, evaluating their effectiveness on a suite of 23 SELF programs while keeping other factors constant. Our results show that both systems i...
Article
Object-oriented programs can be optimized either dynamically, i.e., based on run-time information, or statically, i.e., based on program analysis alone. Two promising optimization techniques for object-oriented languages are type feedback (dynamic) and concrete type inference (static). We directly compare the two techniques, evaluating their effect...
Conference Paper
Previous studies have shown that object-oriented programs have different execution characteristics than procedural programs, and that special object-oriented hardware can improve performance. The results of these studies may no longer hold because compiler optimizations can remove a large fraction of the differences. Our measurements show that SELF...
Article
Programming systems should be both responsive (to support rapid development) and efficient (to complete computations quickly). Pure object-oriented languages are harder to implement efficiently since they need optimization to achieve good performance. Unfortunately, optimization conflicts with interactive responsiveness because it tends to produce...
Conference Paper
Object-oriented programs are difficult to optimize because they execute many dynamically-dispatched calls. These calls cannot easily be eliminated because the compiler does not know which callee will be invoked at runtime. We have developed a simple technique that feeds back type information from the runtime system to the compiler. With th...
Article
Object-oriented programming languages confer many benefits, including abstraction, which lets the programmer hide the details of an object's implementation from the object's clients. Unfortunately, crossing abstraction boundaries often incurs a substantial run-time overhead in the form of frequent procedure calls. Thus, pervasive use of...
Conference Paper
Object-oriented programming promises to increase programmer productivity through better reuse of existing code. However, reuse is not yet pervasive in today's object-oriented programs. Why is this so? We argue that one reason is that current programming languages and environments assume that components are perfectly coordinated. Yet in a world wher...
Conference Paper
SELF's debugging system provides complete source-level debugging (expected behavior) with globally optimized code. It shields the debugger from optimizations performed by the compiler by dynamically deoptimizing code on demand. Deoptimization only affects the procedure activations that are actively being debugged; all other code runs at full speed....
Conference Paper
Polymorphic inline caches (PICs) provide a new way to reduce the overhead of polymorphic message sends by extending inline caches to include more than one cached lookup result per call site. For a set of typical object-oriented SELF programs, PICs achieve a median speedup of 11%. As an important side effect, PICs collect type information by recordi...
Article
All organizational functions carried out by classes can be accomplished in a simple and natural way by object inheritance in classless languages, with no need for special mechanisms. A single model, dividing types into prototypes and traits, supports sharing of behavior and extending or replacing representations. A natural extension, dynamic ob...
Article
The design of inheritance and encapsulation in SELF, an object-oriented language based on prototypes, results from understanding that inheritance allows parents to be shared parts of their children. The programmer resolves ambiguities arising from multiple inheritance by prioritizing an object's parents. Unifying unordered and ordered multiple in...
Article
This paper gives a detailed description of our implementation of binary component adaptation (BCA) (KH98) for Java. We describe the adaptation specification, its translation into the delta file, and how class files are modified during class loading. We also explain how we integrated BCA into the JDK1.1.5 and how we modified javac to compile against...
Article
Dynamically-dispatched calls often limit the performance of object-oriented programs since object-oriented programming encourages factoring code into small, reusable units, thereby increasing the frequency of these expensive operations. Frequent calls not only slow down execution with the dispatch overhead per se, but more importantly they hinder o...
Article
Object-oriented components are hard to integrate if developed independently of each other, and difficult to evolve without affecting existing clients, particularly with widely distributed components that have thousands of re-users. We propose binary component adaptation (BCA), a new solution that allows components to be adapted and evolved in bina...
Article
We describe the use and implementation of mixins [BC90] in the Animorphic Smalltalk system, a high performance Smalltalk virtual machine and programming environment. Mixins are the basic unit of implementation, and are directly supported by the VM. At the language level, code can be defined either in mixins or in classes, but classes are merely su...
Article
Most modern programming languages require efficient automatic memory management (garbage collection, GC) as part of the runtime system. Since GC is very memory intensive it can potentially suffer significantly from poor memory access times. Unfortunately memory performance develops at a slower pace than processor speed, thus making memory accesses...

Citations

... But turning big data into insights or true treasure heavily relies upon and hence boosts deployments of massive big data and AI systems. As architecture, systems, data management, and machine learning communities pay greater attention to innovative big data and AI algorithms, architecture, and systems [3,4,5,6,7], the pressure of measuring, comparing, and evaluating these systems rises [8]. Benchmarking are the foundation of those efforts [9,10]. ...
... This comes at the expense of application performance becoming more vulnerable to events that result in "killer" microsecond scale idleness [9]. This is acute for user-facing applications with tight tail-latency requirements whereby serving a user query typically consists of executing numerous interacting microservices that explicitly communicate with each other [9,10,73]. The communication latency limits the time available to execute a microservice and magnifies the impact of microsecond scale idleness (e.g., events related to NVM, main memory access, and power management) [9,15,17]. ...
... Companies such as Google, Cisco, VMWare, and FlexiWAN are currently offering SD-WAN as a service. 4 ISPs have to offer MPLS and SD-WAN at the same time as an intermediate solution. Several solutions for these large-scale SD-WAN setups are currently being developed like the automatic network orchestration 5 and the use of deep reinforcement learning for the quality of those connections. ...
... In this way, even if some of the encoded packets are lost, the original packets can still be reconstructed at the receiver. • Bursting over multiple paths: Cloudburst spreads encoded packets over multiple paths, which obliviously exploits the rich path diversity in modern DCNs [27]-[29], as well as the temporary network under-utilization [30]. If any congestion-free path exists, Cloudburst will take advantage of them without extra signalling overhead. ...
... All hosts belonging to the same ToR or edge switch are grouped accordingly. The importance of these topologies is well justified, as they have been leveraged by popular cloud providers like Google [34] or Facebook [5]. Additionally, using two types of networks let us prove the flexibility of eTorii, which is suitable for any hierarchical DCN. ...
... A typical two-tier architecture has an interpreter for quick startup, and a compiler for peak performance. In this architecture, when an implementation needs to tier-up as it is evaluating a long-running function, one should not need to wait for the interpreter to complete, but rather switch immediately to a natively compiled version [17]. Or, when speculative compilation is found to be wrong, execution must not continue and the current code must be replaced with a correct version [16]. ...
... inlining or interprocedural analysis. [AH95,DCG95] The following code example illustrates possible optimizations well: here the compiler (in general) cannot determine at compile time whether it should call Circle::getArea() or Rectangle::getArea() at run time. ...
... Heil and Smith [21] exploit profiling hardware to log information about stores to the heap. Dieckmann and Hölzle investigated the use of active memory systems to support GC [17]. Chang et al. investigated architecture support for a bitmap-based allocator and mark-sweep garbage collector [12]. ...
... When a struct A embeds a struct B, all of B's member variables will be included in the declaration of A and all methods defined on B will also work on A. An example of the syntax is shown in listing 2.10. This allows for easy pick-and-choose of functionalities, a concept that is known in some languages as mixins [31]. In Go programs, it is common to define small and coherent interfaces that can be reused by embedding. ...
... Furthermore, SDN has been successfully deployed in wide area networks. For example, B4 [1] is a private wide area network connecting Google's data centers geographically distributed all over the world that leverages SDN principles for improving link utilization. ...