Sanjay Ghemawat’s research while affiliated with Google Inc. and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (54)


Figure 2: The Pathways system (Barham et al., 2022) scales training across two TPU v4 pods using two-way data parallelism at the pod level.
Figure 11: Examples from the PaLM-Coder 540B model. (top left) GSM8K-Python question converted from the OpenAI GSM8K math dataset. (bottom left) TransCoder example translating a simple function from C++ to Python. (right) Converted HumanEval example.
Figure 13: An example DeepFix problem with the original broken code on the left and the PaLM-Coder 540B model's prediction on the right. The predicted code contains fixes for all of the compilation errors (undeclared variables), as well as other stylistic improvements (declaring variables together) and logic improvements (reading numbers into array a in a loop and not using index i outside the loop).
Figure 14: Another example DeepFix problem. The predicted code fixes the compilation error (missing braces for the if block, causing a scope error for variable t) and makes other improvements (declaring variables together and removing the line t = 0; which has no effect).
Figure 21 presents disaggregated accuracy (Barocas et al., 2021), further broken down by gender. We find that accuracy is higher on stereotypical examples than on gotcha examples, and that accuracy is lowest on gotcha examples for the female gender. Promisingly, the performance gap across these slices narrows with the number of shots: from 14.1 to 10.1 percentage points in the 1-shot setting, and from 18.3 to 9.2 percentage points in the 4-shot setting. Differences in performance may be related to differences in the frequency of English pronouns in the training set (770M neutral, 620M male, and 381M female), but we see no clear relationship between accuracy and the rank of occupations identified in Appendix C.

PaLM: Scaling Language Modeling with Pathways
  • Preprint
  • File available

April 2022 · 3,354 Reads · 42 Citations

Sharan Narang · Jacob Devlin · [...] · Noah Fiedel

Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call the Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
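The few-shot setup the abstract describes amounts to prepending a handful of solved examples to the query instead of fine-tuning. A minimal sketch of the prompt construction, with invented exemplars (not from the PaLM evaluation suite):

```python
# Few-shot prompting: the model adapts from in-context exemplars rather
# than task-specific gradient updates. Exemplars here are illustrative.

def build_few_shot_prompt(exemplars, query):
    """Concatenate (question, answer) exemplars followed by the open query."""
    lines = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

exemplars = [
    ("What is 2 + 2?", "4"),
    ("What is 7 - 3?", "4"),
]
prompt = build_few_shot_prompt(exemplars, "What is 5 + 1?")
```

The prompt ends with an unanswered `A:` so the model's continuation is read as the prediction; "1-shot" and "4-shot" in the figure above refer to the number of exemplars prepended this way.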


Figure 10: A 3B Transformer model pipelined over 128 TPUs: PATHWAYS can efficiently train models over islands of TPUs connected via DCN, achieving the same throughput (131.4k tokens/sec) on 4 islands of 32 cores each in configuration (C) as on a single island of 128 cores in configuration (B).
Training throughput (tokens/s) of Text-to-text Transformer model configurations from (Raffel et al., 2019) on JAX multi-controller and PATHWAYS.
Pathways: Asynchronous Distributed Dataflow for ML

March 2022 · 289 Reads · 1 Citation

We present the design of a new large-scale orchestration layer for accelerators. Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state-of-the-art performance for current models. Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns. We demonstrate that Pathways can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations over 2048 TPUs, while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network.
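The key idea in the abstract, operators that consume and produce futures so the control plane runs ahead of the data plane, can be sketched in a few lines. This toy version (not the Pathways implementation; names and the thread pool are stand-ins) shows a single controller wiring a whole graph before any upstream result exists:

```python
# Toy sketch of asynchronous dataflow over futures: the controller
# enqueues downstream operators immediately; only workers block on
# their inputs, so dispatch latency overlaps with "accelerator" work.
from concurrent.futures import ThreadPoolExecutor
import time

pool = ThreadPoolExecutor(max_workers=4)

def op(name, *deps):
    """An asynchronous operator: consumes futures, produces a future."""
    def run():
        inputs = [d.result() for d in deps]  # wait only inside the worker
        time.sleep(0.01)                     # stand-in for device compute
        return (name, inputs)
    return pool.submit(run)

# The control plane builds the chain without waiting on any result.
a = op("load")
b = op("fwd", a)
c = op("bwd", b)
result = c.result()                          # data plane resolves in order
pool.shutdown(wait=True)
```

In the real system the "operators" are gang-scheduled computations on TPU islands and the futures refer to sharded device-resident data, but the control-plane/data-plane decoupling is the same shape.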


Dynamic Control Flow in Large-Scale Machine Learning

May 2018 · 2 Reads

Many recent machine learning models rely on fine-grained dynamic control flow for training and inference. In particular, models based on recurrent neural networks and on reinforcement learning depend on recurrence relations, data-dependent conditional execution, and other features that call for dynamic control flow. These applications benefit from the ability to make rapid control-flow decisions across a set of computing devices in a distributed system. For performance, scalability, and expressiveness, a machine learning system must support dynamic control flow in distributed and heterogeneous environments. This paper presents a programming model for distributed machine learning that supports dynamic control flow. We describe the design of the programming model, and its implementation in TensorFlow, a distributed machine learning system. Our approach extends the use of dataflow graphs to represent machine learning models, offering several distinctive features. First, the branches of conditionals and bodies of loops can be partitioned across many machines to run on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs. Second, programs written in our model support automatic differentiation and distributed gradient computations, which are necessary for training machine learning models that use control flow. Third, our choice of non-strict semantics enables multiple loop iterations to execute in parallel across machines, and to overlap compute and I/O operations. We have done our work in the context of TensorFlow, and it has been used extensively in research and production. We evaluate it using several real-world applications, and demonstrate its performance and scalability.
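The programming model described above exposes control flow as functions over loop-carried state (a `cond` predicate and a `body` transformer), in the style of TensorFlow's `tf.while_loop`. A pure-Python sketch of that interface, which omits what makes the real operator interesting (cross-device partitioning, automatic differentiation, non-strict parallel iterations):

```python
# While-loop as data: the trip count depends on runtime values, not on
# the graph structure. This toy version runs eagerly on one machine.

def while_loop(cond, body, loop_vars):
    """Iterate body over loop_vars while cond holds."""
    while cond(*loop_vars):
        loop_vars = body(*loop_vars)
    return loop_vars

# Example: accumulate integers until the running total reaches 10.
i, total = while_loop(
    cond=lambda i, total: total < 10,
    body=lambda i, total: (i + 1, total + i),
    loop_vars=(1, 0),
)
```

Expressing the loop this way, rather than as host-language control flow, is what lets the system distribute the body across devices and differentiate through it.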


Dynamic Control Flow in Large-Scale Machine Learning

April 2018 · 295 Reads · 77 Citations

Many recent machine learning models rely on fine-grained dynamic control flow for training and inference. In particular, models based on recurrent neural networks and on reinforcement learning depend on recurrence relations, data-dependent conditional execution, and other features that call for dynamic control flow. These applications benefit from the ability to make rapid control-flow decisions across a set of computing devices in a distributed system. For performance, scalability, and expressiveness, a machine learning system must support dynamic control flow in distributed and heterogeneous environments. This paper presents a programming model for distributed machine learning that supports dynamic control flow. We describe the design of the programming model, and its implementation in TensorFlow, a distributed machine learning system. Our approach extends the use of dataflow graphs to represent machine learning models, offering several distinctive features. First, the branches of conditionals and bodies of loops can be partitioned across many machines to run on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs. Second, programs written in our model support automatic differentiation and distributed gradient computations, which are necessary for training machine learning models that use control flow. Third, our choice of non-strict semantics enables multiple loop iterations to execute in parallel across machines, and to overlap compute and I/O operations. We have done our work in the context of TensorFlow, and it has been used extensively in research and production. We evaluate it using several real-world applications, and demonstrate its performance and scalability.


Figure 4: Three parameter synchronization schemes for a single parameter in data-parallel training (§4.4): (a) asynchronous, (b) synchronous without backup workers, and (c) synchronous with backup workers.
Figure 6: Baseline throughput for synchronous replication with a null model. Sparse accesses enable TensorFlow to handle larger models, such as embedding matrices (§4.2).
TensorFlow: A system for large-scale machine learning

May 2016 · 12,007 Reads · 15,205 Citations

TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with particularly strong support for training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model in contrast to existing systems, and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.
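The contrast the abstract draws with parameter-server designs is that mutable state lives in the graph as variable nodes, and mutation is just another operation. A minimal sketch of that idea (the classes are invented for illustration, not the TensorFlow API):

```python
# State as a graph node: a Variable holds a value, and an update such as
# a gradient-descent step is an ordinary operation applied to it, rather
# than logic baked into a separate parameter-server service.

class Variable:
    def __init__(self, value):
        self.value = value

def assign_sub(var, delta):
    """A mutating operation: subtract delta from the variable in place."""
    var.value -= delta
    return var.value

w = Variable(10.0)
lr, grad = 0.5, 4.0
new_w = assign_sub(w, lr * grad)  # the update step is an op like any other
```

Because updates are ordinary ops, a developer can swap in a different optimizer or synchronization scheme by rewiring the graph, which is the flexibility the abstract claims over fixed parameter-server designs.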


TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

March 2016 · 10,165 Reads · 15,282 Citations

TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.


Associating summaries with pointers in persistent data structures

April 2015 · 8 Reads

Methods for organizing and retrieving data values in a persistent data structure are provided. Data values are grouped into data blocks and pointers are obtained for each data block. In addition, one or more summaries, related to properties of the data block, are created and associated with the data block's pointer. The summaries allow for more efficient retrieval of data values from the data structure by preventing unnecessary retrieval calls to persistent storage when the summaries do not match query criteria.
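A small sketch of the technique the abstract describes: attach a cheap summary to each block pointer so a lookup can skip fetching blocks that cannot contain the key. Here the summary is the exact key set for clarity; a Bloom filter would be the space-efficient choice in practice. The class names and fetch counter are invented for illustration:

```python
# Summary-guided lookup: consult the per-block summary attached to the
# pointer before paying for a read from persistent storage.

class BlockPointer:
    def __init__(self, block_id, summary):
        self.block_id = block_id
        self.summary = summary          # keys present in the block

fetches = 0

def fetch_block(store, block_id):
    global fetches
    fetches += 1                        # stands in for a storage read
    return store[block_id]

def lookup(store, pointers, key):
    for p in pointers:
        if key in p.summary:            # cheap in-memory check first
            block = fetch_block(store, p.block_id)
            if key in block:
                return block[key]
    return None

store = {0: {"a": 1, "b": 2}, 1: {"c": 3}}
pointers = [BlockPointer(0, {"a", "b"}), BlockPointer(1, {"c"})]
value = lookup(store, pointers, "c")    # only block 1 is fetched
```

With an approximate summary such as a Bloom filter, false positives cost only a wasted fetch; a negative answer still guarantees the block can be skipped.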


Permitting users to remove documents

March 2015 · 6 Reads

A system may present information regarding a document and provide an option for removing the document. The system may also receive selection of the option and remove the document when the option is selected. The system may aggregate information regarding documents that have been removed by a group of users and assign scores to a set of documents based on the aggregated information.
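The scoring step the abstract describes, aggregating removals across users and adjusting document scores accordingly, might look like the following sketch. The linear-penalty formula and all names are invented for illustration:

```python
# Aggregate per-user removal events and discount each document's score
# by how often the group removed it.
from collections import Counter

def score_documents(base_scores, removal_events, penalty=0.1):
    """removal_events is a list of (user, doc) pairs."""
    removals = Counter(doc for _user, doc in removal_events)
    return {doc: s - penalty * removals[doc] for doc, s in base_scores.items()}

base = {"doc1": 1.0, "doc2": 1.0}
events = [("u1", "doc2"), ("u2", "doc2"), ("u3", "doc1")]
scores = score_documents(base, events)   # doc2 removed twice, scored lower
```

A production system would presumably normalize by impressions and guard against abuse, but the aggregation shape is the same.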


TensorFlow : Large-Scale Machine Learning on Heterogeneous Distributed Systems

January 2015 · 3,995 Reads · 8,903 Citations

TensorFlow [1] is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.


Providing posts from an extended network

October 2014 · 7 Reads

A system includes: an engaging post identifier for identifying and retrieving engaging posts; an extended network post identifier for identifying extended posts from an extended network; a combining module for creating a combined list of added posts from the engaging post and the extended posts, the combining module generating one or more ranked posts by ranking the list of added posts by relevance to a user; and a user interface module for providing the one or more ranked posts. The disclosure also includes a method for finding and providing engaging posts that includes determining engaging posts; determining extended posts from an extended social network using a social graph of the user; adding the engaging posts and the extended posts to create a combined list of added posts; ranking the added posts by relevance to a user; and providing one or more of the ranked posts.
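The claimed method reduces to three steps: combine engaging posts with extended-network posts, rank the combined list by per-user relevance, and serve the top of the ranking. A sketch with an invented relevance function:

```python
# Combine-then-rank: merge the two post sources into one list and sort
# by a per-user relevance score, highest first.

def rank_posts(engaging, extended, relevance):
    combined = engaging + extended        # the "combined list of added posts"
    return sorted(combined, key=relevance, reverse=True)

engaging = [{"id": "p1", "likes": 5}, {"id": "p2", "likes": 9}]
extended = [{"id": "p3", "likes": 7}]
ranked = rank_posts(engaging, extended, relevance=lambda p: p["likes"])
top = [p["id"] for p in ranked]
```

In the patent's terms, `relevance` would draw on the user's social graph rather than raw engagement counts, but the combining and ranking modules compose the same way.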


Citations (29)


... For the retrieval, we apply the BGE-M3 embedding model [15] to our database and the questions; based on these embeddings the semantic search is carried out. For the second part, answer generation, we use three state-of-the-art LLMs: Llama 2, in its fine-tuned 13B chat version [16], GPT, in its 3.5 version [17] and PaLM [18]. ...

Reference:

Exploring the Behavior and Performance of Large Language Models: Can LLMs Infer Answers to Questions Involving Restricted Information?
PaLM: Scaling Language Modeling with Pathways

... Dataflow graphs, however, require statically known tensor shapes and thus cannot directly represent dynamic loop cycles, dynamic dependencies and dynamic tensor shapes. To implement DRL algorithms as a single static graph, users must (i) use complex higher-order control-flow operations [28] together with (ii) over-approximations of loop-carried state into static tensor shapes [29] and (iii) apply dynamic tensor slicing to extract required inputs at each timestep from the state. This is not only cumbersome for users, but it often results in excessive peak memory usage due to the over-approximation of loop-carried state. ...

Dynamic Control Flow in Large-Scale Machine Learning
  • Citing Conference Paper
  • April 2018

... CloudSimEx is a set of tools featuring disk operations, the simulation of MapReduce clusters [14], web sessions, and probabilistic cloudlet arrivals, among others. CloudSimDisk [29] from Luleå University of Technology is an alternative to the simple disk features of CloudSimEx that focuses on modeling the energy consumption of storage systems within a Cloud infrastructure. ...

MapReduce: Simplified data processing on large clusters
  • Citing Article
  • January 2004

... For this, we construct a fully connected neural network with 12 layers and 40 neurons in each layer, with sine as an activation function. We write our formulation in Python 3.8, and employ the Tensorflow framework [23] with Keras [24] to take advantage of its automatic differentiation capability. We also use the extended Adam algorithm to optimize the cost function. ...

TensorFlow: A system for large-scale machine learning

... Each user sends its results to the server, which aggregates them using a weighted average method in order to update the parameters of the global model and send them back to the users. For this implementation the authors employed the TensorFlow framework, Abadi, Agarwal, Barham, Brevdo, Chen, Citro, Corrado, Davis, Dean, Devin, Ghemawat, Goodfellow, Harp, Irving, Isard, Jia, Jozefowicz, Kaiser, Kudlur, Levenberg, Mané, Monga, Moore, Murray, Olah, Schuster, Shlens, Steiner, Sutskever, Talwar, Tucker, Vanhoucke, Vasudevan, Viégas, Vinyals, Warden, Wattenberg, Wicke, Yu and Zheng (2015), along with the scikit-learn library, while the fusion technique that was utilized was FedAvg. At first, the Bi-LSTM was compared with a Convolutional Neural Network (CNN) and the results indicated that the former produced higher accuracy and lower loss. ...

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

... The Google Earth Engine relies on multiple technologies within the Google data center, such as Flume Java framework, Bigtable, Spanner distributed databases [15], [16], Borg cluster management system, Colossus, and Google Fusion Tables [17], [18]. GEE provides accessibility via its API and web-based interface, enabling swift prototyping and visualization of outcomes. ...

Spanner

ACM Transactions on Computer Systems

... Sanjay Ghemawat presents a new storage management architecture that substantially improves disk performance of a distributed object-oriented database system [11]. The storage architecture is built around a large modified object buffer (MOB) that is stored in primary memory. ...

The Modified Object Buffer: A Storage Management Technique for Object-Oriented Databases
  • Citing Article
  • October 1995

... The impact of profiling in non-deterministic systems: Most prediction models forecast based on historical knowledge for short-term requests (Mallick et al., 2012). The historical knowledge is taken from the monitoring service, which logs information continuously as a profile in a searchable database (Anderson et al., 1997); the whole process is shown in Fig. 3. The proper analysis tool dissects the stored profile information at several levels. ...

Continuous profiling
  • Citing Article
  • December 1997

ACM SIGOPS Operating Systems Review

... Distributed protocols are the foundation of many modern computer systems. They find use in domains such as finance [8,7] and cloud computing [9,10]. Distributed protocols are notoriously difficult to design and verify. ...

Spanner: Google’s Globally Distributed Database
  • Citing Article
  • August 2013

ACM Transactions on Computer Systems