Paul Barham’s research while affiliated with Google Inc. and other places


Publications (54)


[Figure 1: Comparing memorization rates. Significantly lower memorization rates across the board. Left: overall memorization across model families. Right: exact and approximate memorization per data source.]
[Figure: Evaluation of Gemma 2 instruction-tuned models on the Chatbot Arena (Chiang et al., 2024); models are compared through blind side-by-side evaluations by human raters and assigned scores based on the Elo rating system.]
[Table: Vulnerability detection accuracy on PrimeVul, DiverseVul, and SPI.]
[Table: Charm Offensive results on a sample of 100 human participants; the percentage of participants who attribute some human trait (e.g., funny) to a model.]
Gemma 2: Improving Open Language Models at a Practical Size
  • Preprint

July 2024 · 188 Reads · 4 Citations

Gemma Team · Morgane Riviere · Shreya Pathak · [...] · Alek Andreev

In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.
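
A minimal sketch of the distillation objective mentioned above, in plain NumPy with hypothetical logits and an assumed temperature; it illustrates training a small model against a teacher's softened output distribution instead of one-hot next-token targets, and is not the Gemma 2 training code:

    import numpy as np

    def softmax(logits, temperature=1.0):
        z = logits / temperature
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    # Hypothetical logits for two positions over a five-token vocabulary.
    teacher_logits = np.array([[2.0, 1.0, 0.1, -1.0, 0.5],
                               [0.3, 2.5, 0.0, 0.2, -0.7]])
    student_logits = np.array([[1.5, 0.8, 0.2, -0.5, 0.4],
                               [0.1, 2.0, 0.3, 0.0, -0.2]])

    # Distillation loss: KL(teacher || student) between softened distributions,
    # used in place of the usual one-hot next-token cross-entropy.
    T = 2.0  # assumed temperature, for illustration only
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kd_loss = np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1))
    print(kd_loss)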


PaLM 2 Technical Report

May 2023 · 579 Reads · 25 Citations

We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities. When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.


PaLM: Scaling Language Modeling with Pathways

April 2022 · 3,309 Reads · 33 Citations

Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
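
Few-shot adaptation as described above amounts to conditioning the model on a handful of worked examples in the prompt rather than fine-tuning on task-specific data. A minimal sketch follows; the task, examples, and the generate call are hypothetical, not part of the paper:

    # Hypothetical few-shot prompt for a sentiment task: the model adapts from
    # the in-context examples alone, with no gradient updates.
    examples = [
        ("The plot was gripping from start to finish.", "positive"),
        ("I left halfway through, bored.", "negative"),
    ]
    query = "The acting was superb and the score unforgettable."

    prompt = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
    prompt += f"\nReview: {query}\nSentiment:"

    print(prompt)
    # completion = model.generate(prompt)  # hypothetical call to a served model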


[Figure 10: A 3B Transformer model pipelined over 128 TPUs. Pathways can efficiently train models over islands of TPUs connected via DCN, achieving the same throughput (131.4k tokens/sec) on four islands of 32 cores each (configuration C) as on a single island of 128 cores (configuration B).]
[Table: Training throughput (tokens/s) of text-to-text Transformer model configurations from Raffel et al. (2019) on JAX multi-controller and Pathways.]
Pathways: Asynchronous Distributed Dataflow for ML

March 2022 · 276 Reads · 1 Citation

We present the design of a new large scale orchestration layer for accelerators. Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state of the art performance for current models. Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns. We demonstrate that Pathways can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations over 2048 TPUs, while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network.
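
The asynchronous, futures-based control plane described above can be illustrated with ordinary Python futures: a single controller enqueues sharded computations, receives futures immediately, and schedules dependent work against those futures without blocking. This is only a conceptual sketch with a thread pool standing in for accelerator islands, not the Pathways API:

    from concurrent.futures import ThreadPoolExecutor

    def shard_compute(shard_id, data):
        # Stand-in for a compiled program running on one accelerator island.
        return sum(x * x for x in data)

    def combine(partials):
        # Consumes upstream futures produced by the sharded computations.
        return sum(f.result() for f in partials)

    with ThreadPoolExecutor(max_workers=4) as pool:
        shards = [range(i * 1000, (i + 1) * 1000) for i in range(4)]
        # The controller runs ahead of the data plane: it submits the combine
        # step before the shard results have resolved, wiring the dataflow
        # through futures.
        partial_futures = [pool.submit(shard_compute, i, s) for i, s in enumerate(shards)]
        total_future = pool.submit(combine, partial_futures)
        print(total_future.result())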


Machine Learning Systems are Stuck in a Rut

May 2019 · 468 Reads · 72 Citations

In this paper we argue that systems for numerical computing are stuck in a local basin of performance and programmability. Systems researchers are doing an excellent job improving the performance of 5-year-old benchmarks, but gradually making it harder to explore innovative machine learning research ideas. We explain how the evolution of hardware accelerators favors compiler back ends that hyper-optimize large monolithic kernels, show how this reliance on high-performance but inflexible kernels reinforces the dominant style of programming model, and argue these programming abstractions lack expressiveness, maintainability, and modularity; all of which hinders research progress. We conclude by noting promising directions in the field, and advocate steps to advance progress towards high-performance general purpose numerical computing systems on modern accelerators.


Dynamic Control Flow in Large-Scale Machine Learning

April 2018 · 292 Reads · 75 Citations

Many recent machine learning models rely on fine-grained dynamic control flow for training and inference. In particular, models based on recurrent neural networks and on reinforcement learning depend on recurrence relations, data-dependent conditional execution, and other features that call for dynamic control flow. These applications benefit from the ability to make rapid control-flow decisions across a set of computing devices in a distributed system. For performance, scalability, and expressiveness, a machine learning system must support dynamic control flow in distributed and heterogeneous environments. This paper presents a programming model for distributed machine learning that supports dynamic control flow. We describe the design of the programming model, and its implementation in TensorFlow, a distributed machine learning system. Our approach extends the use of dataflow graphs to represent machine learning models, offering several distinctive features. First, the branches of conditionals and bodies of loops can be partitioned across many machines to run on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs. Second, programs written in our model support automatic differentiation and distributed gradient computations, which are necessary for training machine learning models that use control flow. Third, our choice of non-strict semantics enables multiple loop iterations to execute in parallel across machines, and to overlap compute and I/O operations. We have done our work in the context of TensorFlow, and it has been used extensively in research and production. We evaluate it using several real-world applications, and demonstrate its performance and scalability.
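
A minimal sketch of the kind of dynamic control flow the paper describes, using TensorFlow's public tf.cond and tf.while_loop operations together with automatic differentiation; the particular loop and values are illustrative only:

    import tensorflow as tf

    x = tf.constant(2.0)

    def cond(i, acc):
        # Data-dependent loop condition.
        return i < 5

    def body(i, acc):
        # Conditional execution inside the loop body.
        acc = tf.cond(acc > 10.0, lambda: acc, lambda: acc * x)
        return i + 1, acc

    with tf.GradientTape() as tape:
        tape.watch(x)
        _, result = tf.while_loop(cond, body, [tf.constant(0), tf.constant(1.0)])

    # Gradients flow through the dynamic control flow.
    grad = tape.gradient(result, x)
    print(result.numpy(), grad.numpy())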


Incremental, iterative data processing with timely dataflow

September 2016 · 159 Reads · 41 Citations

Communications of the ACM

We describe the timely dataflow model for distributed computation and its implementation in the Naiad system. The model supports stateful iterative and incremental computations. It enables both low-latency stream processing and high-throughput batch processing, using a new approach to coordination that combines asynchronous and fine-grained synchronous execution. We describe two of the programming frameworks built on Naiad: GraphLINQ for parallel graph processing, and differential dataflow for nested iterative and incremental computations. We show that a general-purpose system can achieve performance that matches, and sometimes exceeds, that of specialized systems.
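
A toy illustration of the incremental style that differential dataflow enables: outputs are updated from input deltas (record, +1/-1 multiplicity pairs) rather than recomputed from scratch. This is a conceptual sketch in plain Python, not the Naiad programming model itself:

    from collections import Counter

    counts = Counter()

    def apply_delta(delta):
        # delta: iterable of (key, diff) pairs; updates the running counts
        # in place and returns only the changed outputs.
        changes = []
        for key, diff in delta:
            before = counts[key]
            counts[key] += diff
            changes.append((key, before, counts[key]))
        return changes

    print(apply_delta([("a", +1), ("b", +1), ("a", +1)]))
    # A later update retracts one "a" without reprocessing the whole collection.
    print(apply_delta([("a", -1)]))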


[Figure 4: Three parameter synchronization schemes for a single parameter in data-parallel training (§4.4): (a) asynchronous, (b) synchronous without backup workers, and (c) synchronous with backup workers.]
[Figure 6: Baseline throughput for synchronous replication with a null model. Sparse accesses enable TensorFlow to handle larger models, such as embedding matrices (§4.2).]
TensorFlow: A system for large-scale machine learning

May 2016 · 11,569 Reads · 14,918 Citations

TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with particularly strong support for training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model in contrast to existing systems, and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.
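
The core idea of dataflow with mutable shared state can be sketched with today's TensorFlow 2 API (a later interface than the one described in the paper): a tf.Variable holds shared state, and a traced function applies an operation that mutates it. Values and shapes here are illustrative:

    import tensorflow as tf

    w = tf.Variable(tf.zeros([3]))           # shared, mutable state
    x = tf.constant([1.0, 2.0, 3.0])

    @tf.function                              # traces the computation as a dataflow graph
    def train_step(lr):
        with tf.GradientTape() as tape:
            loss = tf.reduce_sum((w - x) ** 2)
        grad = tape.gradient(loss, w)
        w.assign_sub(lr * grad)               # state-mutating operation in the graph
        return loss

    for _ in range(3):
        print(train_step(tf.constant(0.1)).numpy())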


TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

March 2016 · 9,765 Reads · 15,007 Citations

TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.



Citations (43)


... Codegen Multi (6B) [18] 0.5 2022 CodeLlama (7B) [19] 2.5 2023 LlaMa 3.1 (8B / 70B) [20] 15.0 2024 StarCoder 2 (7B) [21] 3.5 2024 Gemma 2 (2B / 27B) [22] 2.0 / 13.0 2024 CodeGemma (7B) [23] 6.5 2024 Mistral (7B) [24] -from 2024 that contained data from these older, highly popular repositories. Finally, to manage computational resources, we randomly sampled files from the collected data (k=250) for our evaluation. ...

Reference:

Are Large Language Models Memorizing Bug Benchmarks?
Gemma 2: Improving Open Language Models at a Practical Size

... Typically, LLMs (e.g., LLaMA [Meta AI], GPT-4 [OpenAI], Med PALM-2, and Claude) are trained on massive amounts of data (e.g., TB) integrated from multiple data sources, including those available publicly [32,33]. Large datasets used to train LLMs often lead to sizeable models with billions of parameters with a direct impact on their performance [27], while marking the transition from language models to LLMs with emergent abilities. ...

PaLM 2 Technical Report

... Nonetheless, the unique principles and architectures of IMC (especially analog IMC) make these tools generally unsuitable for IMC chips, which have a strong coupling with algorithm models. For example, in low-level deployment, making good utilization of the relatively limited IMC array space and allocating computation parameters in the IMC array efficiently are crucial problems to be solved, which, however, are not solved in SOTA tools [22]. ...

Machine Learning Systems are Stuck in a Rut
  • Citing Conference Paper
  • May 2019

... The second, more elaborate way is to assume that the dataflow graph is fixed, i.e., it cannot change at runtime. As it turns out, this fixed-graph approach is preferred by the established deep learning frameworks such as TensorFlow [29]. There are good reasons for choosing the fixed-graph idea. ...

Dynamic Control Flow in Large-Scale Machine Learning
  • Citing Conference Paper
  • April 2018

... such as computation power, parallel strategy, communication operators, computing clusters, etc. This makes it difficult to achieve optimal training performance by directly using deep learning frameworks such as TensorFlow [1,106] and PyTorch [81]. Therefore, how to effectively optimize the performance of distributed training has become a key focus of Artificial Intelligence (AI) technology. ...

Tensorflow: Large-scale machine learning on heterogeneous distributed systems
  • Citing Article
  • January 2016

... TensorFlow Federated (TFF) and PySyft, for example, offer robust frameworks designed with privacy and security at the forefront, making them suitable for projects that prioritize these concerns. TFF's ability to deploy models on real-world datasets 34 and PySyft's focus on encrypted, privacy-preserving computations 35 suggest their potential in overcoming some of the challenges we have identified. ...

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
  • Citing Technical Report
  • January 2015

... Many existing single-machine graph stream processing systems are optimized for the batch-parallel model where stream updates arrive in large batches consisting entirely of insertions or entirely of deletions. Some batch-parallel systems answer queries synchronously, meaning they stop ingesting stream updates while computing queries [60,11,21,55,69,68]. Others periodically take "snapshots" of the graph during ingestion that enable asynchronous query evaluation [18,14,35,37,51]. ...

Incremental, iterative data processing with timely dataflow
  • Citing Article
  • September 2016

Communications of the ACM

... For this, we construct a fully connected neural network with 12 layers and 40 neurons in each layer, with sine as an activation function. We write our formulation in Python 3.8, and employ the TensorFlow framework [23] with Keras [24] to take advantage of its automatic differentiation capability. We also use the extended Adam algorithm to optimize the cost function. ...

TensorFlow: A system for large-scale machine learning

... The classification of the data under analysis was based on what was taught by the training bank, and was collected following the criteria described for each category. We used the TensorFlow (Abadi et al. 2016), Matplotlib (Hunter 2007), Numpy (Harris et al. 2020), PIL (Clark 2015), and Keras (Chollet et al. 2015) packages. ...

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems