Noam Shazeer's research while affiliated with Google Inc. and other places

Publications (53)

Preprint
Full-text available
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we traine...
Preprint
Recent neural network-based language models have benefited greatly from scaling up the size of training datasets and the number of parameters in the models themselves. Scaling can be complicated due to various factors including the need to distribute computation on supercomputer clusters (e.g., TPUs), prevent bottlenecks when infeeding data, and en...
Preprint
Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by traini...
Preprint
Full-text available
We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvements on safety and factual groun...
Preprint
Large Transformer models have been central to recent advances in natural language processing. The training and inference costs of these models, however, have grown rapidly and become prohibitively expensive. Here we aim to reduce the costs of Transformers by searching for a more efficient variant. Compared to previous approaches, our search is perf...
Preprint
We present GSPMD, an automatic, compiler-based parallelization system for common machine learning computation graphs. It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute tensors, based on which GSPMD will parallelize the computation. Its representation of partition...
Preprint
The research community has proposed copious modifications to the Transformer architecture since it was introduced over three years ago, relatively few of which have seen widespread adoption. In this paper, we comprehensively evaluate many of these modifications in a shared experimental setting that covers most of the common uses of the Transformer...
Preprint
In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of Mo...
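The routing idea described in this snippet can be illustrated in a few lines of NumPy. Below is a minimal, hypothetical sketch of top-1 ("switch"-style) routing with an expert capacity limit; the names (`switch_route`, `W_gate`) and the pass-through handling of overflow tokens are illustrative assumptions, not the paper's implementation.

```python
# Minimal numpy sketch of top-1 routing with an expert capacity limit
# (hypothetical sizes): each token goes to its single highest-scoring expert,
# and tokens beyond an expert's capacity simply pass through unchanged.
import numpy as np

def switch_route(tokens, W_gate, experts, capacity):
    """tokens: [n, d], W_gate: [d, num_experts], experts: callables [*, d] -> [*, d]."""
    probs = np.exp(tokens @ W_gate)
    probs /= probs.sum(axis=-1, keepdims=True)
    choice = probs.argmax(axis=-1)                     # top-1 expert per token
    gate = probs[np.arange(len(tokens)), choice]       # gate value of the chosen expert
    out = tokens.copy()                                # overflow tokens pass through
    for e, expert in enumerate(experts):
        idx = np.where(choice == e)[0][:capacity]      # respect the capacity limit
        if len(idx):
            out[idx] = gate[idx, None] * expert(tokens[idx])
    return out

rng = np.random.default_rng(0)
n, d, E = 6, 8, 3
experts = [lambda x, W=rng.normal(size=(d, d)): np.tanh(x @ W) for _ in range(E)]
print(switch_route(rng.normal(size=(n, d)), rng.normal(size=(d, E)), experts, capacity=3).shape)
```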
Preprint
Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming,...
Preprint
We introduce "talking-heads attention" - a variation on multi-head attention which includes linearprojections across the attention-heads dimension, immediately before and after the softmax operation.While inserting only a small number of additional parameters and a moderate amount of additionalcomputation, talking-heads attention leads to better pe...
Preprint
Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706....
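A minimal sketch of one such variant, assuming a Swish (SiLU) gate in place of the sigmoid (often called SwiGLU) and hypothetical weight names; the original GLU uses a sigmoid gate.

```python
# Minimal numpy sketch of a GLU-variant feed-forward sublayer: the component-wise
# product of two linear projections, one of which passes through a nonlinearity.
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))        # x * sigmoid(x)

def ffn_swiglu(x, W, V, W2):
    """x: [n, d_model]; W, V: [d_model, d_ff]; W2: [d_ff, d_model]."""
    return (swish(x @ W) * (x @ V)) @ W2

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 16))
out = ffn_swiglu(x, rng.normal(size=(16, 64)), rng.normal(size=(16, 64)),
                 rng.normal(size=(64, 16)))
print(out.shape)  # (3, 16)
```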
Preprint
It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries. In this short paper, we measure the practical utility of this approach by fine-tuning pre-trained models to answer questions without access to any external context or knowledge. We show t...
Preprint
Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence $S=s_1, \ldots, s_S$, we propose truncating the target-side window used for computing self-attention by making an $N$-gram assumption. Experiments on WMT EnDe and EnFr data sets show that the $N$-gram masked self-attentio...
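The truncation described here amounts to a banded causal attention mask. A minimal sketch, assuming each position may attend only to itself and the previous N-1 target tokens:

```python
# Minimal sketch of an N-gram (banded, causal) self-attention mask.
import numpy as np

def ngram_causal_mask(seq_len, n):
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    return (j <= i) & (j > i - n)     # causal AND within the last n tokens

print(ngram_causal_mask(6, 3).astype(int))
```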
Preprint
Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such parallelization is impossible) i...
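One remedy explored in this line of work is to shrink the key/value tensors that must be reloaded at every incremental decoding step by sharing a single key/value head across all query heads (multi-query attention). A minimal NumPy sketch with hypothetical shapes, omitting the input/output projections:

```python
# Minimal numpy sketch of multi-query attention: queries keep per-head
# projections while one key/value head is shared across all heads.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(Q, K, V):
    """Q: [h, n, d_k] per-head queries; K: [m, d_k], V: [m, d_v] shared."""
    d_k = Q.shape[-1]
    logits = np.einsum('hnd,md->hnm', Q, K) / np.sqrt(d_k)
    return np.einsum('hnm,md->hnd', softmax(logits), V)   # [h, n, d_v]

rng = np.random.default_rng(0)
out = multi_query_attention(rng.normal(size=(8, 4, 32)),
                            rng.normal(size=(10, 32)),
                            rng.normal(size=(10, 64)))
print(out.shape)  # (8, 4, 64)
```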
Preprint
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of tr...
Preprint
Medical images such as 3D computerized tomography (CT) scans and pathology images, have hundreds of millions or billions of voxels/pixels. It is infeasible to train CNN models directly on such high resolution images, because neural activations of a single image do not fit in the memory of a single GPU/TPU. Existing image analysis approaches allevia...
Preprint
Grammatical Error Correction (GEC) has been recently modeled using the sequence-to-sequence framework. However, unlike sequence transduction problems such as machine translation, GEC suffers from the lack of plentiful parallel data. We describe two approaches for generating large parallel datasets for GEC using publicly available Wikipedia data. Th...
Preprint
Deep autoregressive sequence-to-sequence models have demonstrated impressive performance across a wide variety of tasks in recent years. While common architecture classes such as recurrent, convolutional, and self-attention networks make different trade-offs between the amount of computation needed per layer and the length of the critical path at t...
Preprint
Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high lat...
Preprint
We describe an approach to Grammatical Error Correction (GEC) that is effective at making use of models trained on large amounts of weakly supervised bitext. We train the Transformer sequence-to-sequence model on 4B tokens of Wikipedia revisions and employ an iterative decoding strategy that is tailored to the loosely-supervised nature of the Wikip...
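The iterative decoding strategy mentioned in the snippet can be sketched as a simple fixed-point loop; `correct_once` below is a hypothetical stand-in for a trained correction model, not the paper's decoder.

```python
# Minimal sketch of iterative decoding for correction models: the model is
# re-applied to its own output until it makes no further edits (or a step
# limit is reached).
def iterative_decode(sentence, correct_once, max_steps=5):
    for _ in range(max_steps):
        corrected = correct_once(sentence)
        if corrected == sentence:      # fixed point: no further modifications
            break
        sentence = corrected
    return sentence

# Toy usage with a stand-in "model" that fixes one error per pass.
fixes = {"Ths is a tst.": "This is a tst.", "This is a tst.": "This is a test."}
print(iterative_decode("Ths is a tst.", lambda s: fixes.get(s, s)))  # This is a test.
```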
Preprint
Music relies heavily on self-reference to build structure and meaning. We explore the Transformer architecture (Vaswani et al., 2017) as a generative model for music, as self-attention has shown compelling results on tasks that require long-term structure such as Wikipedia summary generation (Liu et al., 2018). However, timing information is critica...
Article
In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients. Maintaining these per-parameter second-moment estimators requires memory equal to the number of parameters. For the case of neural network wei...
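The memory cost described here can be reduced by factoring the second-moment accumulator. A minimal sketch, assuming a matrix parameter and omitting the update clipping and decay schedule used in practice:

```python
# Minimal numpy sketch of factored second-moment estimation: instead of a full
# [n, m] accumulator of squared gradients, keep running row and column sums and
# reconstruct a rank-1 approximation, reducing memory from O(n*m) to O(n+m).
import numpy as np

def factored_second_moment_update(R, C, grad, beta2=0.999, eps=1e-30):
    """R: [n] row accumulator, C: [m] column accumulator, grad: [n, m]."""
    sq = grad * grad + eps
    R = beta2 * R + (1 - beta2) * sq.sum(axis=1)    # row sums of squared grads
    C = beta2 * C + (1 - beta2) * sq.sum(axis=0)    # column sums of squared grads
    V_hat = np.outer(R, C) / R.sum()                # rank-1 approximation of V
    update = grad / np.sqrt(V_hat)                  # Adam-style scaled step
    return R, C, update

rng = np.random.default_rng(0)
n, m = 4, 6
R, C = np.zeros(n), np.zeros(m)
R, C, step = factored_second_moment_update(R, C, rng.normal(size=(n, m)))
print(step.shape)  # (4, 6)
```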
Article
Full-text available
Tensor2Tensor is a library for deep learning models that is well-suited for neural machine translation and includes the reference implementation of the state-of-the-art Transformer model.
Article
Autoregressive sequence models based on deep neural networks, such as RNNs, Wavenet and the Transformer attain state-of-the-art results on many tasks. However, they are difficult to parallelize and are thus slow at processing long sequences. RNNs lack parallelism both during training and decoding, while architectures like WaveNet and Transformer ar...
Article
Image generation has been successfully cast as an autoregressive sequence generation or transformation problem. Recent work has shown that self-attention is an effective way of modeling textual sequences. In this work, we generalize a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation o...
Article
We show that generating English Wikipedia articles can be approached as a multi- document summarization of source documents. We use extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. For the abstractive model, we introduce a decoder-only architecture that can scalably attend to...
Article
Deep learning yields great results across many fields, from speech recognition, image classification, to translation. But for each problem, getting a deep model to work well involves research into the architecture and a long period of tuning. We present a single model that yields good results on a number of problems spanning multiple domains. In pa...
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing...
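The core operation of this architecture is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. A minimal single-head NumPy sketch:

```python
# Minimal numpy sketch of scaled dot-product attention.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V

rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(5, 16)),
                                   rng.normal(size=(7, 16)),
                                   rng.normal(size=(7, 32)))
print(out.shape)  # (5, 32)
```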
Article
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significa...
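Conditional computation of this kind is typically realized with a gating network that activates only a few experts per example. A minimal sketch of top-k gating, with hypothetical sizes and without the noise term and load-balancing losses used in practice:

```python
# Minimal numpy sketch of sparsely-gated top-k routing: a gating network scores
# all experts, only the top-k are evaluated, and their outputs are combined
# with renormalized gate weights.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, W_gate, experts, k=2):
    """x: [d], W_gate: [d, num_experts], experts: list of callables d -> d."""
    logits = x @ W_gate
    top_k = np.argsort(logits)[-k:]                 # indices of the k best experts
    gates = softmax(logits[top_k])                  # renormalize over selected experts
    return sum(g * experts[i](x) for g, i in zip(gates, top_k))

d, num_experts = 8, 4
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(d, d)) for _ in range(num_experts)]
experts = [lambda x, W=W: np.tanh(x @ W) for W in Ws]
out = moe_layer(rng.normal(size=d), rng.normal(size=(d, num_experts)), experts, k=2)
print(out.shape)  # (8,)
```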
Article
Full-text available
We present Sparse Non-negative Matrix (SNM) estimation, a novel probability estimation technique for language modeling that can efficiently incorporate arbitrary features. We evaluate SNM language models on two corpora: the One Billion Word Benchmark and a subset of the LDC English Gigaword corpus. Results show that SNM language models trained with...
Article
Full-text available
We present NN-grams, a novel, hybrid language model integrating n-grams and neural networks (NN) for speech recognition. The model takes as input both word histories as well as n-gram counts. Thus, it combines the memorization capacity and scalability of an n-gram model with the generalization ability of neural networks. We report experiments where...
Article
In this work we explore recent advances in Recurrent Neural Networks for large scale Language Modeling, a task central to language understanding. We extend current models to deal with two key challenges present in this task: corpora and vocabulary sizes, and complex, long term structure of language. We perform an exhaustive study on techniques such...
Article
We present Submatrix-wise Vector Embedding Learner (Swivel), a method for generating low-dimensional feature embeddings from a feature co-occurrence matrix. Swivel performs approximate factorization of the point-wise mutual information matrix via stochastic gradient descent. It uses a piecewise loss with special handling for unobserved co-occurrenc...
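A rough sketch of the overall recipe (build a PMI matrix from co-occurrence counts, then factorize it into row and column embeddings with gradient steps). The handling of unobserved pairs below is a crude clipping stand-in rather than the paper's piecewise loss, and full-batch gradient descent replaces SGD for brevity.

```python
# Minimal numpy sketch: approximate factorization of a pointwise mutual
# information matrix built from toy co-occurrence counts.
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(50, 50)).astype(float)   # toy co-occurrence matrix
total = counts.sum()
row = counts.sum(axis=1, keepdims=True)
col = counts.sum(axis=0, keepdims=True)
pmi = np.log((counts * total + 1e-12) / (row * col + 1e-12))
pmi = np.maximum(pmi, 0.0)                               # crude stand-in for unobserved pairs

dim, lr = 16, 0.05
U = 0.1 * rng.normal(size=(50, dim))                     # row (word) embeddings
V = 0.1 * rng.normal(size=(50, dim))                     # column (context) embeddings
for step in range(200):
    err = U @ V.T - pmi                                  # reconstruction error
    U, V = U - lr * err @ V / 50, V - lr * err.T @ U / 50
print(np.abs(U @ V.T - pmi).mean())                      # mean reconstruction error
```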
Article
In this paper we present a data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system's components using the same evaluation protocol and metric as at test time. Such an approach will result in simple and efficient...
Article
Recurrent Neural Networks can be trained to produce sequences of tokens given some input, as exemplified by recent results in machine translation and image captioning. The current approach to training them consists in maximizing the likelihood of each token in the sequence given the current (recurrent) state and the previous token. At inference, th...
Article
Full-text available
We present a novel family of language model (LM) estimation techniques named Sparse Non-negative Matrix (SNM) estimation. A first set of experiments empirically evaluating it on the One Billion Word Benchmark shows that SNM $n$-gram LMs perform almost as well as the well-established Kneser-Ney (KN) models. When using skip-gram features the models a...
Patent
Full-text available
A system for generating a model is provided. The system generates, or selects, candidate conditions and generates, or otherwise obtains, statistics regarding the candidate conditions. The system also forms rules based, at least in part, on the statistics and the candidate conditions and selectively adds the rules to the model.
Patent
Full-text available
One embodiment of the present invention provides a system characterizes a document with respect to clusters of conceptually related words. Upon receiving a document containing a set of words, the system selects “candidate clusters” of conceptually related words that are related to the set of words. These candidate clusters are selected using a mode...
Patent
Full-text available
Systems and techniques relating to ranking search results of a search query include, in general, subject matter that can be embodied in a computer-implemented method that includes determining a measure of relevance for a document result within a context of a search query for which the document result is returned, the determining being based on a fi...
Patent
A computer-implemented method for determining whether a target text-string is correctly spelled is provided. The target text-string is compared to a corpus to determine a set of contexts which each include an occurrence of the target text-string. Using heuristics, each context of the set is characterized based on occurrences in the corpus of the ta...
Patent
Full-text available
A system may track statistics for a number of features using an approximate counting technique by: subjecting each feature to multiple, different hash functions to generate multiple, different hash values, where each of the hash values may identify a particular location in a memory, and storing statistics for each feature at the particular location...
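The technique described in this patent abstract resembles count-min-style approximate counting. A minimal sketch with hypothetical table sizes and hash seeds:

```python
# Minimal sketch of approximate counting with multiple hash functions,
# count-min style: each feature updates one cell per hash table, and its
# estimated count is the minimum over those cells.
import hashlib

NUM_TABLES, TABLE_SIZE = 4, 1024
tables = [[0] * TABLE_SIZE for _ in range(NUM_TABLES)]

def _slots(feature):
    for seed in range(NUM_TABLES):
        digest = hashlib.blake2b(feature.encode(), person=bytes([seed]) * 16).digest()
        yield seed, int.from_bytes(digest[:8], "big") % TABLE_SIZE

def increment(feature):
    for t, slot in _slots(feature):
        tables[t][slot] += 1

def estimate(feature):
    return min(tables[t][slot] for t, slot in _slots(feature))

for word in ["the", "the", "cat", "the"]:
    increment(word)
print(estimate("the"), estimate("cat"))  # 3 1 (collisions can only overestimate)
```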
Patent
Full-text available
One embodiment of the present invention provides a system that learns a generative model for textual documents. During operation, the system receives a current model, which contains terminal nodes representing random variables for words and cluster nodes representing clusters of conceptually related words. Within the current model, nodes are couple...
Article
Full-text available
We introduce a framework for representing a variety of interesting problems as inference over the execution of probabilistic model programs. We represent a "solution" to such a problem as a guide program which runs alongside the model program and influences the model program's random choices, leading the model program to sample from a different dis...

Citations

... We further explore the commonsense-lacking issue in current fundamental VL data by comparing them with common natural language processing (NLP) data. Here we compare the distributions of the syntactic categories and words of the most popular VL datasets (COCO [37] and CC 12M [9]) with three commonly used NLP datasets: ConceptNet [66], a knowledge base dataset; Wikipedia [19], the popular [14,29,39,56,81] cleaned English-language articles with a size of 16GB; and C4 [53], the popularly used [18,26,30,46,57,68,69] English-language text sourced from the Common Crawl web scrape with a size of 745GB. The syntactic categories and word distributions comparison is shown in Fig. 8. ...
... Large language models (LLMs) allow us to test the possibility that GEK can emerge naturally from tracking co-occurrence patterns in linguistic input. State-of-the-art LLMs, trained to predict words based on their context, have achieved remarkable success across a variety of tasks, such as generating syntactically and semantically coherent paragraphs of text (Brown et al., 2020), sentiment analysis and logical inference (e.g., Devlin et al., 2018;Liu et al., 2019;Radford et al., 2019;Yang et al., 2019), closed-book QA (Roberts et al., 2020), and certain aspects of commonsense reasoning (Talmor et al., 2020;Zellers et al., 2018). ...
... Ultimately, we found that we could boost detection recall with inference-chaining techniques. The first technique, called iterative inference, was recently introduced in previous work and involves the use of the output of the model as the input iteratively until the model makes no modifications, at which point the process terminates [48]. The second technique, elicitive inference, is a technique we developed to take advantage of the power of beam search to induce multiple model output candidates. ...
... However, the self-attention mechanism of Transformer makes it good at capturing global information, but weak at capturing the local information (such as structural information) from SMILES sequences [22,23]. In summary, RNN and Transformer tend to focus on different types of features in the process of feature extraction [24], and both of them have their tendencies in molecular property prediction. However, there has been little work integrating both models for molecular property prediction. ...
... To this end, we build an ensemble of sub-problem specialized models (branches), split the input space into multiple partitions, enforce a specialization of the models for each partition, and train a gating mechanism to decide which branch should be used for each input sample. Similar to [5], we take the top-K predictions of the gater to combine the branches' projections and obtain the final classification. The conditional execution is not new in the literature, but our original contributions include (1) an extra step to enforce individual model specialization and (2) a dynamical exclusion of some of the top-K predictions, based on how confident the gater is (i.e., the confidence is high). ...
... For an optimizer, we use BertAdam 4 , which is a variant of Adam. For ULM-Large, we train for 10 epochs using AdamW (Loshchilov and Hutter, 2017) for KLUE-TC and Adafactor (Shazeer and Stern, 2018) for KLUE-STS and NSMC. We empirically find the best hyperparameter setting for each task in the following choices: ...
... The pre-trained networks analyze the inputs and generates results. Over the last decade, there has been extensive work on processing natural language and an increasing uptick in usage of these models in both industries [23,53,68] and academia [77,78,82]. However, NLP models require more training data [6] and extensive computation [73], compared to other deep learning (DL)-based models. ...
... (Section 3). • We propose a code resampling technique to prevent the issue of index collapse when learning high-dimensional codes with vector quantized autoencoders (Kaiser et al., 2018), which we employ in the skill discovery process (Section 3.2); • We show that Choreographer can learn skills both from offline data or collecting data with exploration, and from both states and pixels inputs. The skills are adaptable for multiple tasks, as shown in the URL benchmark, where we outperform all baselines (Section 4.1); • We show the skills learned by Choreographer are effective for exploration, discovering sparse rewards in the environment more likely than other methods (Section 4.2), and we further visualize and analyze them to provide additional insights (Section 4.3). ...
... For obtaining topical keys, we first find all hyperlinked articles A h appearing in the section. We then use the value of the "instance of" or "subclass of" tuple in the Wikidata (Liu et al., 2018). The documents are citations in the Wikipedia article obtained by Common-Crawl or web pages returned by Google Search. ...
... Pre-trained language models have been shown to learn skills that can transfer to new modalities [35], however, this will not be effective for task-specific skills such as a desired captioning style or learning the space of output labels. Several multi-modal, multi-task models have learned many tasks in different modalities simultaneously [22,31,34,62] and could thus potentially transfer skills between them, with HighMMT in particular showing positive results [31]. Our work studies the more challenging zero-shot setting (meaning no training data in the target modality is available), and therefore requires all the needed skills to be learned from a modality different than the one used in evaluation. ...