Karen Simonyan’s research while affiliated with Google Inc. and other places


Publications (75)


Flamingo: a Visual Language Model for Few-Shot Learning
  • Preprint

April 2022 · 263 Reads · 6 Citations

Jean-Baptiste Alayrac · Jeff Donahue · [...] · Karen Simonyan

Building models that can be rapidly adapted to numerous tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. Flamingo models include key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endowing them with in-context few-shot learning capabilities. We perform a thorough evaluation of the proposed Flamingo models, exploring and measuring their ability to rapidly adapt to a variety of image and video understanding benchmarks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, we demonstrate that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples. On many of these benchmarks, Flamingo actually surpasses the performance of models that are fine-tuned on thousands of times more task-specific data.
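
As an illustration of the few-shot interface the abstract describes, the sketch below assembles an interleaved image/text prompt from a handful of annotated examples. It is a minimal sketch: the `("image", ...)`/`("text", ...)` segment convention and the `Output:` marker are illustrative assumptions, not Flamingo's actual API.

```python
# Minimal sketch of building an interleaved image/text few-shot prompt.
# The segment convention and "Output:" marker are assumptions for
# illustration; Flamingo's real interface is defined by its tokenizer.

def build_fewshot_prompt(support_examples, query_image):
    """Interleave (image, annotation) support pairs, then append the query."""
    segments = []
    for image, text in support_examples:
        segments.append(("image", image))              # visual input
        segments.append(("text", f"Output: {text}"))   # its annotated answer
    segments.append(("image", query_image))            # the new task instance
    segments.append(("text", "Output:"))               # the model completes this
    return segments

# Two annotated examples are enough to specify a captioning task in context.
prompt = build_fewshot_prompt(
    support_examples=[("img_dog.png", "A dog catching a frisbee."),
                      ("img_cat.png", "A cat sleeping on a sofa.")],
    query_image="img_bird.png",
)
```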


Training Compute-Optimal Large Language Models

March 2022 · 356 Reads · 33 Citations

We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.
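
The "scale model size and tokens equally" finding supports simple back-of-the-envelope sizing. A minimal sketch, assuming the standard C ≈ 6·N·D FLOPs approximation for training cost and the roughly 20-tokens-per-parameter ratio implied by Chinchilla itself (70B parameters, ~1.4T tokens); both constants are approximations rather than the paper's fitted laws.

```python
import math

# Back-of-the-envelope compute-optimal sizing. Assumes (i) training cost
# C ≈ 6·N·D FLOPs for N parameters and D tokens, and (ii) a fixed
# tokens-per-parameter ratio k ≈ 20 (Chinchilla: 70B params, ~1.4T tokens).
# These constants are approximations, not the paper's fitted scaling laws.

def compute_optimal(flops_budget: float, tokens_per_param: float = 20.0):
    # C = 6 * N * D with D = k * N  =>  N = sqrt(C / (6 * k))
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Gopher's reported budget of roughly 5.76e23 FLOPs recovers a ~70B-parameter
# model trained on ~1.4T tokens, i.e. Chinchilla's shape.
n, d = compute_optimal(5.76e23)
print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")
```

Note that under this rule doubling the model size also doubles the token count, which quadruples the required FLOPs budget.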


Figure 3. Example masked auto-encoding results on one ImageNet image, using an 85% masking rate with groupwise masking. On the left we show the original image, in the middle the corresponding masked image, and on the right the outputs of the 16-group model. Note that the masks were shared across groups (groups in HiP-16 for 224×224 images are sequences of 14 consecutive pixel rows), and this is visible as a vertically recurring pattern.
Hierarchical Perceiver
  • Preprint
  • File available

February 2022 · 96 Reads

General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs. They achieve this generality by exclusively using global attention operations. This however hinders them from scaling up to the input sizes required to process raw high-resolution images or video. In this paper, we show that some degree of locality can be introduced back into these models, greatly improving their efficiency while preserving their generality. To scale them further, we introduce a self-supervised approach that enables learning dense low-dimensional positional embeddings for very large signals. We call the resulting model a Hierarchical Perceiver (HiP). HiP retains the ability to process arbitrary modalities, but now at higher resolution and without any specialized preprocessing, improving over flat Perceivers in both efficiency and accuracy on the ImageNet, AudioSet and PASCAL VOC datasets.
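
The locality argument can be made concrete with a toy computation: restricting attention to G contiguous groups cuts the quadratic cost from O(M²) to O(M²/G) while leaving the interface (a flat array of tokens) unchanged. The sketch below illustrates the principle only; it is not the HiP architecture.

```python
import numpy as np

# Toy illustration of locality: split a flat input of M tokens into G
# contiguous groups and attend within each group, reducing total attention
# cost from O(M^2) to O(M^2 / G). Not the HiP architecture itself.

def grouped_attention(x: np.ndarray, num_groups: int) -> np.ndarray:
    m, d = x.shape                                   # M tokens, d channels
    g = x.reshape(num_groups, m // num_groups, d)    # (G, M/G, d)
    scores = g @ g.transpose(0, 2, 1) / np.sqrt(d)   # per-group logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax within groups
    return (weights @ g).reshape(m, d)               # re-flatten outputs

x = np.random.randn(1024, 64).astype(np.float32)
y = grouped_attention(x, num_groups=16)              # 16 groups of 64 tokens
```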


Unified Scaling Laws for Routed Language Models

February 2022 · 55 Reads

The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better performance. In this work we derive and justify scaling laws defined on these two variables which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained via three different techniques. Afterwards we provide two applications of these laws: first deriving an Effective Parameter Count along which all models scale at the same rate, and then using the scaling coefficients to give a quantitative comparison of the three routing techniques considered. Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters.
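
For reference, the dense-model baseline that these laws generalize is a power law in parameter count N; a routed model adds a second axis for the expert count E. The bivariate form below is a schematic illustration of such a generalization (exponents fit to data), not the paper's exact fitted law.

```latex
% Dense baseline: loss as a power law in parameter count N.
L(N) = \left(\frac{N_c}{N}\right)^{\alpha}

% Schematic bivariate generalization for routed models with E experts:
% power-law terms in N and E plus an interaction term, so slices at fixed
% E (or fixed N) remain power laws. a, b, c, d are fit to data.
\log L(N, E) = a \log N + b \log E + c \log N \log E + d
```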


Scaling Language Models: Methods, Analysis & Insights from Training Gopher

December 2021 · 1,133 Reads · 16 Citations

Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and the model's behaviour, covering the intersection of model scale with bias and toxicity. Finally, we discuss the application of language models to AI safety and the mitigation of downstream harms.


Improving language models by retrieving from trillions of tokens

December 2021 · 530 Reads

We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen BERT retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.
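
At a high level, the retrieval pipeline splits the input into fixed-size chunks, embeds each chunk with the frozen retriever, and fetches nearest neighbours from a precomputed index; the neighbours then condition generation through chunked cross-attention. The sketch below is a stand-in for that pipeline: `embed`, `db_keys` and `db_chunks` are hypothetical placeholders for the frozen BERT embedder and the token database.

```python
import numpy as np

# High-level sketch of chunked retrieval. `embed`, `db_keys` and `db_chunks`
# are hypothetical stand-ins for the frozen BERT embedder and a precomputed
# database index; only the chunk-wise nearest-neighbour flow is illustrated.

CHUNK = 64  # tokens per retrieval chunk (RETRO's reported chunk size)

def embed(chunk_tokens) -> np.ndarray:
    """Stand-in for a frozen encoder; maps a chunk to a fixed-size vector."""
    rng = np.random.default_rng(abs(hash(tuple(chunk_tokens))) % (2**32))
    return rng.standard_normal(128).astype(np.float32)

def retrieve_neighbours(tokens, db_keys, db_chunks, k=2):
    """For each input chunk, return the k most similar database chunks."""
    chunks = [tokens[i:i + CHUNK] for i in range(0, len(tokens), CHUNK)]
    neighbours = []
    for c in chunks:
        sims = db_keys @ embed(c)            # inner-product similarity
        top = np.argsort(-sims)[:k]          # indices of k nearest chunks
        neighbours.append([db_chunks[i] for i in top])
    return neighbours  # fed to the decoder via chunked cross-attention
```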


Skilful precipitation nowcasting using deep generative models of radar

September 2021 · 1,318 Reads · 701 Citations · Nature

Precipitation nowcasting, the high-resolution forecasting of precipitation up to two hours ahead, supports the real-world socioeconomic needs of many sectors reliant on weather-dependent decision-making. State-of-the-art operational nowcasting methods typically advect precipitation fields with radar-based wind estimates, and struggle to capture important non-linear events such as convective initiations. Recently introduced deep learning methods use radar to directly predict future rain rates, free of physical constraints. While they accurately predict low-intensity rainfall, their operational utility is limited because their lack of constraints produces blurry nowcasts at longer lead times, yielding poor performance on rarer medium-to-heavy rain events. Here we present a deep generative model for the probabilistic nowcasting of precipitation from radar that addresses these challenges. Using statistical, economic and cognitive measures, we show that our method provides improved forecast quality, forecast consistency and forecast value. Our model produces realistic and spatiotemporally consistent predictions over regions up to 1,536 km × 1,280 km and with lead times from 5–90 min ahead. Using a systematic evaluation by more than 50 expert meteorologists, we show that our generative model ranked first for its accuracy and usefulness in 89% of cases against two competitive methods. When verified quantitatively, these nowcasts are skilful without resorting to blurring. We show that generative nowcasting can provide probabilistic predictions that improve forecast value and support operational utility, and at resolutions and lead times where alternative methods struggle. A deep generative model using radar observations is used to create skilful precipitation predictions that are accurate and support real-world utility.
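
The operational payoff of probabilistic nowcasting is that an ensemble of samples from the generator can be turned into products such as per-pixel exceedance probabilities. A minimal sketch of that post-processing step, where `generator` is a hypothetical stand-in for the trained model rather than its actual interface:

```python
import numpy as np

# Sketch of turning a generative nowcaster into a probabilistic product:
# draw an ensemble conditioned on recent radar frames, then compute the
# per-pixel probability that rain rate exceeds a threshold. `generator`
# is a hypothetical stand-in for the trained model.

def exceedance_probability(generator, context_frames,
                           threshold_mm_h=5.0, n_samples=20):
    """P(rain rate > threshold) per lead time and pixel, from an ensemble."""
    samples = np.stack([generator(context_frames) for _ in range(n_samples)])
    # samples: (n_samples, lead_times, H, W) of predicted rain rates
    return (samples > threshold_mm_h).mean(axis=0)

# Dummy generator showing the interface: 18 lead times of 5 min each cover
# the 5-90 min range quoted above.
dummy = lambda ctx: np.random.gamma(shape=0.5, scale=2.0, size=(18, 64, 64))
probs = exceedance_probability(dummy, context_frames=None)
```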


Machine Translation Decoding beyond Beam Search

April 2021 · 44 Reads

Beam search is the go-to method for decoding auto-regressive machine translation models. While it yields consistent improvements in terms of BLEU, it is only concerned with finding outputs with high model likelihood, and is thus agnostic to whatever end metric or score practitioners care about. Our aim is to establish whether beam search can be replaced by a more powerful metric-driven search technique. To this end, we explore numerous decoding algorithms, including some which rely on a value function parameterised by a neural network, and report results on a variety of metrics. Notably, we introduce a Monte-Carlo Tree Search (MCTS) based method and showcase its competitiveness. We provide a blueprint for how to use MCTS fruitfully in language applications, which opens promising future directions. We find that which algorithm is best heavily depends on the characteristics of the goal metric; we believe that our extensive experiments and analysis will inform further research in this area.
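
For readers unfamiliar with MCTS in a decoding setting, the sketch below shows one simplified search step over partial token sequences: select a leaf by an upper-confidence rule, expand it with the model's next-token proposals, score it with a value function, and propagate the score back up. `top_tokens` and `value` are hypothetical stand-ins for the translation model's proposals and the learned (or metric-based) value network.

```python
import math, random

# Simplified value-guided MCTS step over partial token sequences.
# `top_tokens(seq)` (the model's next-token proposals) and `value(seq)`
# (a learned or metric-based scorer) are hypothetical stand-ins.

class Node:
    def __init__(self, seq):
        self.seq, self.children = seq, {}
        self.visits, self.total_value = 0, 0.0

def ucb(parent, child, c=1.4):
    if child.visits == 0:
        return float("inf")                  # always try unvisited children
    exploit = child.total_value / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore

def mcts_step(root, top_tokens, value):
    path, node = [root], root
    while node.children:                     # 1. select down to a leaf
        node = max(node.children.values(), key=lambda ch: ucb(node, ch))
        path.append(node)
    for tok in top_tokens(node.seq):         # 2. expand with model proposals
        node.children[tok] = Node(node.seq + [tok])
    leaf = random.choice(list(node.children.values()))
    v = value(leaf.seq)                      # 3. evaluate the new hypothesis
    for n in path + [leaf]:                  # 4. backpropagate the score
        n.visits += 1
        n.total_value += v
```

After a fixed budget of such steps, the most-visited child of the root is emitted as the next token and the search restarts from there.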


Skillful Precipitation Nowcasting using Deep Generative Models of Radar

April 2021 · 343 Reads · 1 Citation

Precipitation nowcasting, the high-resolution forecasting of precipitation up to two hours ahead, supports the real-world socio-economic needs of many sectors reliant on weather-dependent decision-making. State-of-the-art operational nowcasting methods typically advect precipitation fields with radar-based wind estimates, and struggle to capture important non-linear events such as convective initiations. Recently introduced deep learning methods use radar to directly predict future rain rates, free of physical constraints. While they accurately predict low-intensity rainfall, their operational utility is limited because their lack of constraints produces blurry nowcasts at longer lead times, yielding poor performance on rarer medium-to-heavy rain events. To address these challenges, we present a deep generative model for the probabilistic nowcasting of precipitation from radar. Our model produces realistic and spatio-temporally consistent predictions over regions up to 1,536 km × 1,280 km and with lead times from 5–90 min ahead. In a systematic evaluation by more than fifty expert forecasters from the Met Office, our generative model ranked first for its accuracy and usefulness in 88% of cases against two competitive methods, demonstrating its decision-making value and ability to provide physical insight to real-world experts. When verified quantitatively, these nowcasts are skilful without resorting to blurring. We show that generative nowcasting can provide probabilistic predictions that improve forecast value and support operational utility, and at resolutions and lead times where alternative methods struggle.


Variable-rate discrete representation learning

March 2021 · 47 Reads · 2 Citations

Semantically meaningful information content in perceptual signals is usually unevenly distributed. In speech signals for example, there are often many silences, and the speed of pronunciation can vary considerably. In this work, we propose slow autoencoders (SlowAEs) for unsupervised learning of high-level variable-rate discrete representations of sequences, and apply them to speech. We show that the resulting event-based representations automatically grow or shrink depending on the density of salient information in the input signals, while still allowing for faithful signal reconstruction. We develop run-length Transformers (RLTs) for event-based representation modelling and use them to construct language models in the speech domain, which are able to generate grammatical and semantically coherent utterances and continuations.
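
The "run-length" in run-length Transformers refers to the classical encoding: a variable-rate event sequence is stored as (token, duration) pairs, so stretches of dense information produce many events while silences produce few. A minimal sketch of that representation (the SlowAE and RLT models themselves are beyond a snippet):

```python
# Minimal sketch of the run-length view of a discrete code sequence:
# slowly varying codes compress into (token, run_length) events, so dense
# signal regions yield many events and silences yield few. This shows the
# representation only, not the SlowAE or RLT models.

def run_length_encode(tokens):
    events, prev, count = [], None, 0
    for t in tokens:
        if t == prev:
            count += 1
        else:
            if prev is not None:
                events.append((prev, count))
            prev, count = t, 1
    if prev is not None:
        events.append((prev, count))
    return events

def run_length_decode(events):
    return [t for t, n in events for _ in range(n)]

codes = [7, 7, 7, 7, 0, 0, 3, 3, 3, 3, 3, 3, 0, 0, 0]
assert run_length_decode(run_length_encode(codes)) == codes
print(run_length_encode(codes))  # [(7, 4), (0, 2), (3, 6), (0, 3)]
```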


Citations (54)


... The Flamingo model demonstrated the potential of large-scale multimodal models to perform few-shot learning across a wide range of vision and language tasks, highlighting the adaptability of pretrained representations (Alayrac et al. 2022). This adaptability is particularly relevant for domains like manufacturing, where the ability to quickly adapt to new processes or products is crucial. ...

Reference:

Unsupervised Multimodal Fusion of In-process Sensor Data for Advanced Manufacturing Process Monitoring
Flamingo: a Visual Language Model for Few-Shot Learning
  • Citing Preprint
  • April 2022

... We also include a foundation model trained on public data (BERT SQuAD - BERT for Multiple Choice) [54] as a benchmark. We use this simpler model (of an older vintage and with fewer parameters) to test whether present-day LLMs, which are oversized and more intensive in their use of computational resources [55], are needed for this augmentation task or if a simpler model would do. We do not present results for GPT-4 and Claude 3 as fine-tuning options were not yet publicly available at the time of writing this paper. ...

Training Compute-Optimal Large Language Models
  • Citing Preprint
  • March 2022

... [5][6][7][8] There have been calls for thoughtful evaluation and regulation of artificial intelligence prior to implementation in the healthcare setting. 9 Limited studies have evaluated the potential of LLMs to evaluate medication regimens, identify drug-drug interactions (DDIs), and provide clinical recommendations to potentially serve as a supportive tool in clinical pharmacy. The purpose of this study was to evaluate the performance of ChatGPT (GPT-4) to identify clinically relevant DDIs and assess accuracy of recommendations provided. ...

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

... Vigorous head movements can overwhelm the subtle signals of facial muscle activities, making them difficult to detect accurately. Secondly, autoregressive architectures like transformers inherently suffer from exposure bias [25,29], a problem exacerbated in our context by the high-frequency nature of our predictions (30 outputs per second). This issue becomes more pronounced with such frequent outputs, severely impacting the reliability of long-term predictions. ...

Machine Translation Decoding beyond Beam Search
  • Citing Conference Paper
  • January 2021

... Through an adversarial process between the generator and the discriminator, GANs allow the generator to produce samples that closely resemble real data. This approach has achieved remarkable success in super-resolution applications (Wang et al., 2018a) and has proven effective in addressing issues in short-term forecasting tasks, such as the tendency for predictions to become increasingly smooth and lose intensity over time (Ravuri et al., 2021; Zhang et al., 2023). GANs have also demonstrated strong performance in statistical downscaling within the meteorological field (Leinonen et al., 2021; Price and Rasp, 2022; Singh et al., 2019). ...

Skilful precipitation nowcasting using deep generative models of radar

Nature

... Despite the impressive advancements in the application of BMA and EMOS for precipitation forecasting, studies have shown limited accuracy for heavy-precipitation events (Liu and Xie 2014; Scheuerer and Möller 2015). This is due to an imbalance of sample size between heavy-precipitation events and light or nonprecipitation events within the rolling training period (Ravuri et al. 2021), leading to an underestimation of potential heavy-precipitation events. To address this issue, Ji et al. (2019) proposed a conditional BMA method for precipitation probability forecasting, which split the training samples according to the predicted precipitation intensity by the raw ensemble mean and established conditional BMA models for light-, moderate-, and heavy-precipitation events. ...

Skillful Precipitation Nowcasting using Deep Generative Models of Radar
  • Citing Preprint
  • April 2021

... Tokenized units have also been exploited within speech recognition systems, both textually derived (e.g., via BPE as in [24]) and acoustically derived (e.g., the ADSM model of [25]). The successive merging of DAUs within BPE leads to a variable-rate inventory, an idea closely related to work on event-driven audio representations [26,27]. ...

Variable-rate discrete representation learning
  • Citing Preprint
  • March 2021

... By integrating transfer learning techniques and capitalizing on pretrained weights designed for image classification tasks, ViT mitigates the impact of the lack of inductive bias, yielding substantial performance improvements in action recognition tasks. The comparative analysis presented in Table 4 underscores the superior performance of ViT when contrasted with two CNN-based neural networks, SE-ResNet152 [29] and NFNet-F1 [30]. The latter networks excel primarily in image classification tasks but exhibit limitations stemming from their inductive biases, such as translation invariance and local sensitivity, which hinder their ability to comprehensively capture global image information and evaluate feature interdependencies. ...

High-Performance Large-Scale Image Recognition Without Normalization
  • Citing Preprint
  • February 2021

... Extensive research focuses on optimizing DNN models for mobile deployment, leading to the development of mobile-friendly, compact models such as MobileNet [4][5][6], which deliver respectable performance with reduced computational demand and memory usage. Model compression techniques such as network pruning [7][8][9][10][11][12][13][14][15][16] and quantization [17][18][19] have been explored to further decrease these demands while maintaining accuracy. ...

Fast Sparse ConvNets
  • Citing Conference Paper
  • June 2020