Noam Shazeer’s research while affiliated with Google Inc. and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (53)


Figure 2: The Pathways system (Barham et al., 2022) scales training across two TPU v4 pods using two-way data parallelism at the pod level.
Figure 11: Examples from the PaLM-Coder 540B model. (top left) GSM8K-Python question converted from the OpenAI GSM8K math dataset. (bottom left) TransCoder example translating a simple function from C++ to Python. (right) Converted HumanEval example.
Figure 13: An example DeepFix problem with the original broken code on the left and the PaLM-Coder 540B model's prediction on the right. The predicted code contains fixes for all of the compilation errors (undeclared variables), as well as other stylistic improvements (declaring variables together) and logic improvements (reading numbers into array a in a loop and not using index i outside the loop).
Figure 14: Another example DeepFix problem. The predicted code fixes the compilation error (missing braces for the if block, causing a scope error for variable t) and makes other improvements (declaring variables together and removing the line t = 0; which has no effect).
Figure 21 presents disaggregated accuracy (Barocas et al., 2021), further broken down by gender. We find that accuracy is higher on stereotypical examples than on gotcha examples, and that accuracy is lowest on gotcha examples for the female gender. Promisingly, the performance gap across these slices improves with the number of shots: from 14.1 to 10.1 percentage points in the 1-shot setting, and from 18.3 to 9.2 percentage points in the 4-shot setting. Differences in performance may be related to differences in the frequency of English pronouns in the training set (770M neutral, 620M male, and 381M female), but we see no clear relationship between accuracy and the rank of occupations identified in Appendix C.


PaLM: Scaling Language Modeling with Pathways
  • Preprint
  • File available

April 2022 · 3,371 Reads · 48 Citations

Sharan Narang · Jacob Devlin · [...] · Noah Fiedel

Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
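The few-shot setup described in the abstract can be illustrated with a short sketch: the snippet below assembles a k-shot prompt from worked examples and a new query, which is how task-specific examples are supplied without any gradient updates. The helper name and the toy arithmetic task are hypothetical, not taken from the paper, and the actual prompt formats used for PaLM vary by benchmark.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Assemble a k-shot prompt: a few worked (input, output) pairs
    followed by the new input, leaving the answer for the model to complete.
    Purely illustrative; not the exact format used in the paper."""
    parts = [instruction] if instruction else []
    for inp, out in examples:
        parts.append(f"Q: {inp}\nA: {out}")
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

# Hypothetical 2-shot arithmetic example.
prompt = build_few_shot_prompt(
    examples=[("What is 2 + 2?", "4"), ("What is 7 + 5?", "12")],
    query="What is 9 + 6?",
)
print(prompt)
```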


Figure 2: Structure of a seqio Task, highlighting customizable use of APIs.
Scaling Up Models and Data with t5x and seqio

March 2022 · 387 Reads · 5 Citations

Recent neural network-based language models have benefited greatly from scaling up the size of training datasets and the number of parameters in the models themselves. Scaling can be complicated due to various factors including the need to distribute computation on supercomputer clusters (e.g., TPUs), prevent bottlenecks when infeeding data, and ensure reproducible results. In this work, we present two software libraries that ease these issues: t5x simplifies the process of building and training large language models at scale while maintaining ease of use, and seqio provides a task-based API for simple creation of fast and reproducible training data and evaluation pipelines. These open-source libraries have been used to train models with hundreds of billions of parameters on datasets with multiple terabytes of training data. Along with the libraries, we release configurations and instructions for T5-like encoder-decoder models as well as GPT-like decoder-only architectures. t5x and seqio are open source and available at https://github.com/google-research/t5x and https://github.com/google/seqio, respectively.
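As a concrete illustration of the task-based API mentioned above, here is a minimal sketch of registering a seqio Task over a TFDS dataset. The task name, dataset version, vocabulary path, and sequence lengths are placeholders, and the preprocessor set is an assumption rather than a recipe from the paper.

```python
import functools
import seqio

# Placeholder vocabulary; any SentencePiece model path would work here.
vocab = seqio.SentencePieceVocabulary("/path/to/sentencepiece.model")

seqio.TaskRegistry.add(
    "example_lm_task",                                     # hypothetical task name
    source=seqio.TfdsDataSource(tfds_name="c4/en:3.0.1"),  # dataset version is an assumption
    preprocessors=[
        functools.partial(seqio.preprocessors.rekey,
                          key_map={"inputs": None, "targets": "text"}),
        seqio.preprocessors.tokenize,
        seqio.preprocessors.append_eos,
    ],
    output_features={
        "inputs": seqio.Feature(vocabulary=vocab),
        "targets": seqio.Feature(vocabulary=vocab),
    },
    metric_fns=[],
)

# The registered task can then be mixed, cached, and fed to a t5x training run.
ds = seqio.get_mixture_or_task("example_lm_task").get_dataset(
    sequence_length={"inputs": 512, "targets": 512}, split="train")
```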


Designing Effective Sparse Expert Models

February 2022 · 196 Reads · 3 Citations

Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).


LaMDA: Language Models for Dialog Applications

January 2022 · 3,804 Reads · 55 Citations

We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvement on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model's responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of human values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.
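The classifier-based filtering mentioned in the abstract can be sketched as a simple sample-then-filter loop. The functions below (sample_response, safety_score) are hypothetical stand-ins for a dialog generator and a fine-tuned safety classifier, and the candidate count and threshold are arbitrary illustrations, not values from the paper.

```python
def respond(context, sample_response, safety_score, num_candidates=8, threshold=0.8):
    """Sample several candidate responses and discard any that the safety
    classifier scores below a threshold; return the safest survivor.
    Illustrative only; LaMDA's actual candidate ranking combines several scores."""
    candidates = [sample_response(context) for _ in range(num_candidates)]
    safe = [c for c in candidates if safety_score(c) >= threshold]
    return max(safe, key=safety_score) if safe else None
```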


Primer: Searching for Efficient Transformers for Language Modeling

September 2021 · 80 Reads · 3 Citations

Large Transformer models have been central to recent advances in natural language processing. The training and inference costs of these models, however, have grown rapidly and become prohibitively expensive. Here we aim to reduce the costs of Transformers by searching for a more efficient variant. Compared to previous approaches, our search is performed at a lower level, over the primitives that define a Transformer TensorFlow program. We identify an architecture, named Primer, that has a smaller training cost than the original Transformer and other variants for auto-regressive language modeling. Primer's improvements can be mostly attributed to two simple modifications: squaring ReLU activations and adding a depthwise convolution layer after each Q, K, and V projection in self-attention. Experiments show Primer's gains over Transformer increase as compute scale grows and follow a power law with respect to quality at optimal model sizes. We also verify empirically that Primer can be dropped into different codebases to significantly speed up training without additional tuning. For example, at a 500M parameter size, Primer improves the original T5 architecture on C4 auto-regressive language modeling, reducing the training cost by 4X. Furthermore, the reduced training cost means Primer needs much less compute to reach a target one-shot performance. For instance, in a 1.9B parameter configuration similar to GPT-3 XL, Primer uses 1/3 of the training compute to achieve the same one-shot performance as Transformer. We open source our models and several comparisons in T5 to help with reproducibility.
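The two modifications credited above are simple enough to show directly. Below is a minimal NumPy sketch of a squared-ReLU activation and a causal depthwise convolution of the kind Primer adds after each Q, K, and V projection (per attention head, along the sequence axis). The width-3 kernel follows the paper's description; the function names and toy shapes are illustrative.

```python
import numpy as np

def squared_relu(x):
    # Primer's feed-forward activation: ReLU followed by squaring.
    return np.maximum(x, 0.0) ** 2

def causal_depthwise_conv(x, kernel):
    # x: [seq_len, head_dim]; kernel: [width, head_dim].
    # Each channel is convolved independently over the sequence axis,
    # with left padding so position t only sees positions <= t.
    width = kernel.shape[0]
    padded = np.pad(x, ((width - 1, 0), (0, 0)))
    return np.stack([np.sum(padded[t:t + width] * kernel, axis=0)
                     for t in range(x.shape[0])])

# Toy usage on a single attention head's query projection.
seq_len, head_dim = 5, 4
q = np.random.randn(seq_len, head_dim)
q = causal_depthwise_conv(q, kernel=np.random.randn(3, head_dim))
h = squared_relu(np.random.randn(seq_len, 16))  # feed-forward hidden activations
```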


GSPMD: General and Scalable Parallelization for ML Computation Graphs

May 2021 · 134 Reads

We present GSPMD, an automatic, compiler-based parallelization system for common machine learning computation graphs. It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute tensors, based on which GSPMD will parallelize the computation. Its representation of partitioning is simple yet general, allowing it to express different or mixed paradigms of parallelism on a wide variety of models. GSPMD infers the partitioning for every operator in the graph based on limited user annotations, making it convenient to scale up existing single-device programs. It solves several technical challenges for production usage, such as static shape constraints, uneven partitioning, exchange of halo data, and nested operator partitioning. These techniques allow GSPMD to achieve 50% to 62% compute utilization on 128 to 2048 Cloud TPUv3 cores for models with up to one trillion parameters. GSPMD produces a single program for all devices, which adjusts its behavior based on a run-time partition ID, and uses collective operators for cross-device communication. This property allows the system itself to be scalable: the compilation time stays constant with increasing number of devices.
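Since GSPMD is the SPMD partitioner behind JAX's sharding annotations, the hint-then-propagate workflow described above can be sketched roughly as follows. This is a minimal example assuming 8 local devices; the mesh shape, axis names, and tensor sizes are arbitrary choices, not anything prescribed by the paper.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical 2x4 mesh over 8 devices: 'data' for the batch, 'model' for weights.
mesh = Mesh(np.array(jax.devices()).reshape(2, 4), axis_names=("data", "model"))

x = jnp.ones((16, 1024))     # activations
w = jnp.ones((1024, 4096))   # weights

# A few annotations describe how the inputs are split; the compiler then
# infers a partitioning for every operator and inserts the needed collectives.
x = jax.device_put(x, NamedSharding(mesh, P("data", None)))
w = jax.device_put(w, NamedSharding(mesh, P(None, "model")))

y = jax.jit(jnp.dot)(x, w)   # result ends up sharded as ("data", "model")
```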


Do Transformer Modifications Transfer Across Implementations and Applications?

February 2021 · 42 Reads · 1 Citation

The research community has proposed copious modifications to the Transformer architecture since it was introduced over three years ago, relatively few of which have seen widespread adoption. In this paper, we comprehensively evaluate many of these modifications in a shared experimental setting that covers most of the common uses of the Transformer in natural language processing. Surprisingly, we find that most modifications do not meaningfully improve performance. Furthermore, most of the Transformer variants we found beneficial were either developed in the same codebase that we used or are relatively minor changes. We conjecture that performance improvements may strongly depend on implementation details and correspondingly make some recommendations for improving the generality of experimental results.


Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

January 2021 · 288 Reads · 7 Citations

In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.
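To make the routing simplification above concrete, here is a small NumPy sketch of top-1 ("switch") routing with an expert capacity limit. Variable names and the capacity factor value are illustrative; the real layer also adds a load-balancing auxiliary loss and the precision tricks mentioned in the abstract, which are omitted here.

```python
import numpy as np

def switch_route(tokens, router_weights, capacity_factor=1.25):
    """Top-1 routing: each token is sent to the single expert with the
    highest router probability, and its output is scaled by that probability.
    tokens: [num_tokens, d_model]; router_weights: [d_model, num_experts]."""
    logits = tokens @ router_weights
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    expert_index = probs.argmax(axis=-1)                 # chosen expert per token
    gate = probs[np.arange(len(tokens)), expert_index]   # scaling factor per token

    num_experts = router_weights.shape[-1]
    capacity = int(capacity_factor * len(tokens) / num_experts)  # tokens each expert may keep
    return expert_index, gate, capacity

# Toy usage: route 8 tokens of width 16 across 4 experts.
idx, gate, cap = switch_route(np.random.randn(8, 16), np.random.randn(16, 4))
```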



GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

June 2020 · 316 Reads · 7 Citations

Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.


Citations (38)


... Large language models (LLMs) such as GPT [10], GLM [11], LLaMA [12], and Qwen [13] have recently demonstrated impressive capabilities in natural language understanding and in-context learning (ICL) [14], enabling them to perform complex language tasks with minimal fine-tuning and limited annotated data [15,16]. By leveraging prompt engineering, these models can generate targeted outputs from well-crafted inputs, reducing the need for extensive datasets and making them particularly valuable in low-resource scenarios [17][18][19]. Moreover, LLMs have shown the ability to perform tasks in zero-shot and few-shot settings [20,21], allowing them to generalize from minimal demonstrations (examples) [22]. ...

Reference:

Prompt Framework for Extracting Scale-Related Knowledge Entities from Chinese Medical Literature: Development and Evaluation Study
PaLM: Scaling Language Modeling with Pathways

... In recent years, the advent of generative artificial intelligence (AI) has marked a transformative era in our society [1][2][3][4][5][6][7][8] . These advanced computational systems have demonstrated exceptional proficiency in interpreting and generating human language, thereby setting new benchmarks in AI's capabilities. ...

LaMDA: Language Models for Dialog Applications

... Attention-based models have shown great success on NLP tasks. In this work, we use the small BERT model (Turc et al., 2019) for the following reasons: 1) while there are many transformer-based models, they show only incremental improvements compared to the original BERT model (Narang et al., 2021), 2) transformer-based models have high VRAM requirements, which makes them cost-prohibitive in experimental settings. The small BERT allows us to train a model with a batch size of 32 on consumer-grade GPUs within a reasonable time. ...

Do Transformer Modifications Transfer Across Implementations and Applications?
  • Citing Conference Paper
  • January 2021

... Positional encoding techniques such as No Position Encoding (NoPE) [47], Absolute Position Encoding (APE) [41], Rotary Position Encoding (RoPE) [48], T5 relative bias [49], and ALiBi [50] have been developed to encode positional information effectively. Additionally, studies have been conducted [51][52][53][54][55] to compare activation functions such as ReLU [56], ReLU² [51], and various GLU variants like GELU, ReGLU, and Swish to determine the best option for the task. ...

Primer: Searching for Efficient Transformers for Language Modeling
  • Citing Preprint
  • September 2021

... Vision Transformers (ViT) [7] utilize a self-attention mechanism that effectively captures long-range dependencies between different parts of an image. They perform well in tasks requiring the analysis of complex structures and exhibit high accuracy on large datasets. ...

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
  • Citing Preprint
  • January 2021

... The factual knowledge of LLMs has been widely studied. Early work considered a model to know a fact if it correctly completed a cloze sentence (Petroni et al., 2019; Jiang et al., 2020; Kassner et al., 2020, inter alia), or directly answered a question, either in a zero-shot setting (Radford et al., 2019) or after fine-tuning (Roberts et al., 2020). Modern LLMs, capable of instruction following, are typically directly prompted to answer questions (Singhal et al., 2023; Anil et al., 2023; Dubey et al., 2024; Cohen et al., 2023, inter alia). ...

How Much Knowledge Can You Pack Into the Parameters of a Language Model?
  • Citing Conference Paper
  • January 2020

... As a result, the architecture can scale more efficiently, allowing for higher model capacity without the need for proportionally increased computation for every input. By activating only a small number of experts at a time, MoE provides a computationally efficient solution for handling large and complex models [27]. Moreover, recent advancements in explainability for MoE architectures, such as analyzing expert utilization and routing behavior, have deepened our understanding of their decision-making processes and modality-specific specializations [28]. ...

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
  • Citing Preprint
  • June 2020

... The introduction of ChatGPT in November 2022 demonstrated the potential for LLMs to interact with users in a remarkably lifelike way, and this capability is already being incorporated into AIED tools (Stamper et al., 2024). Natural language processing LLMs are pre-trained and display a remarkable ability to formulate in-depth knowledge from their training sets without any access to external memory (Roberts et al., 2020). This is accomplished by fine-tuning models on large amounts of text data, creating a series of vector text embeddings which serve as a parameterized knowledge base. ...

How Much Knowledge Can You Pack Into the Parameters of a Language Model?
  • Citing Preprint
  • February 2020