Aidan N. Gomez’s research while affiliated with the University of Oxford and other institutions


Publications (21)


Figure 2: Pairwise win-rates on the Dolly evaluation set [Singh et al., 2024b], averaged across 23 languages: We compare Aya Expanse 8B (left) with Gemma 2 9B, Llama 3.1 8B, Ministral 8B and Qwen 2.5 7B. Aya Expanse 32B (right) is compared with Gemma 2 27B, Qwen 2.5 32B, Mixtral 8x22B and Llama 3.1 70B. We used the instruct fine-tuned (via SFT and RLHF) versions of all models.
Aya Expanse: Combining Research Breakthroughs for a New Multilingual Frontier
  • Preprint

December 2024 · 51 Reads

John Dang · Shivalika Singh · Daniel D'souza · [...]

We introduce the Aya Expanse model family, a new generation of 8B and 32B parameter multilingual language models, aiming to address the critical challenge of developing highly performant multilingual models that match or surpass the capabilities of monolingual models. By leveraging several years of research at Cohere For AI and Cohere, including advancements in data arbitrage, multilingual preference training, and model merging, Aya Expanse sets a new state-of-the-art in multilingual performance. Our evaluations on the Arena-Hard-Auto dataset, translated into 23 languages, demonstrate that Aya Expanse 8B and 32B outperform leading open-weight models in their respective parameter classes, including Gemma 2, Qwen 2.5, and Llama 3.1, achieving up to a 76.6% win-rate. Notably, Aya Expanse 32B outperforms Llama 3.1 70B, a model with twice as many parameters, achieving a 54.0% win-rate. In this short technical report, we present extended evaluation results for the Aya Expanse model family and release their open-weights, together with a new multilingual evaluation dataset m-ArenaHard.
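The win-rates quoted above are pairwise comparisons macro-averaged over the 23 evaluation languages. Below is a minimal sketch of that bookkeeping; the judgement format and function names are illustrative assumptions, not the authors' evaluation harness.

```python
# Hypothetical sketch: averaging pairwise win-rates across languages.
from statistics import mean

def win_rate(judgements: list[str]) -> float:
    """Fraction of head-to-head comparisons won; ties count as half."""
    wins = sum(1.0 if j == "win" else 0.5 if j == "tie" else 0.0
               for j in judgements)
    return wins / len(judgements)

def averaged_win_rate(per_language: dict[str, list[str]]) -> float:
    """Macro-average the win-rate over languages, as in Figure 2."""
    return mean(win_rate(js) for js in per_language.values())

# Example: two languages, model A vs. model B on a shared prompt set.
judgements = {"en": ["win", "win", "loss", "tie"],
              "fr": ["win", "loss", "win", "win"]}
print(f"average win-rate: {averaged_win_rate(judgements):.1%}")
```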


Figure 1: Speedup on large-scale classification of web-scraped data (Clothing-1M). RHO-LOSS trains all architectures with fewer gradient steps than standard uniform data selection (i.e. shuffling), helping reduce training time. Thin lines: ResNet-50, MobileNet v2, DenseNet121, Inception v3, GoogleNet, mean across seeds. Bold lines: mean across all architectures.
Figure 6: RHO-LOSS is robust to a variety of label noise patterns, while other selection methods degrade. A step corresponds to lines 6–11 in Algorithm 1. Lines correspond to means and shaded areas to minima and maxima across 3 random seeds.
Figure 7: Desired properties of the irreducible loss model approximation. Left: the approximated selection function selects fewer corrupted points later in training. Right: the test set accuracy of the irreducible loss model deteriorates over time if it is updated on D_t; with the approximation, the irreducible loss is not updated during target model training. Results on CIFAR-10 with 20% of data points corrupted with uniform label noise. Shaded areas represent standard deviation across three random seeds.
Figure 8: Varying the percent of data points selected in each training batch. Average over 3 random seeds.
Fig. 9 shows training curves for our method, uniform sampling, and the active learning baselines. Our method accelerates training across both datasets. The active learning methods accelerate training for MNIST but not for CIFAR10. This highlights that active learning methods, if naively applied to online batch selection, may not accelerate model training.
Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt

June 2022 · 154 Reads · 1 Citation

Training on web-scale data can take months. But much computation and time are wasted on redundant and noisy points that are already learnt or not learnable. To accelerate training, we introduce Reducible Holdout Loss Selection (RHO-LOSS), a simple but principled technique which selects approximately those points for training that most reduce the model's generalization loss. As a result, RHO-LOSS mitigates the weaknesses of existing data selection methods: techniques from the optimization literature typically select 'hard' (e.g. high loss) points, but such points are often noisy (not learnable) or less task-relevant. Conversely, curriculum learning prioritizes 'easy' points, but such points need not be trained on once learned. In contrast, RHO-LOSS selects points that are learnable, worth learning, and not yet learnt. RHO-LOSS trains in far fewer steps than prior art, improves accuracy, and speeds up training on a wide range of datasets, hyperparameters, and architectures (MLPs, CNNs, and BERT). On the large web-scraped image dataset Clothing-1M, RHO-LOSS trains in 18x fewer steps and reaches 2% higher final accuracy than uniform data shuffling.
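The selection rule itself is compact: score each candidate point by its loss under the current model minus its loss under a model trained on holdout data (the "irreducible loss" model of Figure 7), then train only on the highest-scoring points. A minimal PyTorch-style sketch, with the keep fraction as an illustrative assumption:

```python
# Sketch of RHO-LOSS-style online batch selection for a PyTorch classifier.
import torch
import torch.nn.functional as F

def select_batch(model, irreducible_loss_model, x, y, keep_frac=0.1):
    """Keep the points with the highest reducible holdout loss."""
    with torch.no_grad():
        loss = F.cross_entropy(model(x), y, reduction="none")
        irreducible = F.cross_entropy(irreducible_loss_model(x), y,
                                      reduction="none")
    # High reducible loss => learnable, worth learning, not yet learnt.
    reducible = loss - irreducible
    k = max(1, int(keep_frac * len(y)))
    idx = torch.topk(reducible, k).indices
    return x[idx], y[idx]  # train the target model only on these points
```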


Figure 3. Combining autoregressive inference and retrieval inference. Predictions in Tranception are based on two complementary modes of inference: autoregressive predictions based on the context of previously generated tokens and predictions based on the empirical distribution of amino acids at each position in the retrieved set of homologous sequences.
Figure 6. Model performance on the ProteinGym substitution benchmark. We report the DMS-level performance of Tranception with retrieval, ESM-1v (Meier et al., 2021), MSA Transformer (Rao et al., 2021) and EVE (Frazer et al., 2021) for each DMS in the ProteinGym substitution benchmark. Performance is measured by the Spearman's rank correlation ρ between model scores and experimental measurements.
Performance of the different model variants in ablation studies. Performance is measured via Spearman's rank correlation ρ between model scores and experimental measurements, following the approach discussed in E.1. Retrieval inference is excluded from this analysis. Model selection is performed on the validation set described in Appendix B.1.
Average AUC between model scores and experimental measurements by taxon.
Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval

May 2022 · 136 Reads · 4 Citations

The ability to accurately model the fitness landscape of protein sequences is critical to a wide range of applications, from quantifying the effects of human variants on disease likelihood, to predicting immune-escape mutations in viruses and designing novel biotherapeutic proteins. Deep generative models of protein sequences trained on multiple sequence alignments have been the most successful approaches so far to address these tasks. The performance of these methods is however contingent on the availability of sufficiently deep and diverse alignments for reliable training. Their potential scope is thus limited by the fact that many protein families are hard, if not impossible, to align. Large language models trained on massive quantities of non-aligned protein sequences from diverse families address these problems and show potential to eventually bridge the performance gap. We introduce Tranception, a novel transformer architecture leveraging autoregressive predictions and retrieval of homologous sequences at inference to achieve state-of-the-art fitness prediction performance. Given its markedly higher performance on multiple mutants, robustness to shallow alignments and ability to score indels, our approach offers a significant gain in scope over existing approaches. To enable more rigorous model testing across a broader range of protein families, we develop ProteinGym -- an extensive set of multiplexed assays of variant effects, substantially increasing both the number and diversity of assays compared to existing benchmarks.
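Figure 3 above describes the two inference modes that are combined: autoregressive predictions and the empirical amino-acid distribution of retrieved homologs. Here is a hedged sketch of one such combination in log space; the interpolation weight `alpha` and array layout are illustrative assumptions, not the paper's exact scheme.

```python
# Sketch: blend autoregressive and retrieval-based log-probabilities.
import numpy as np

def fitness_score(ar_logprobs: np.ndarray,
                  msa_frequencies: np.ndarray,
                  sequence_ids: np.ndarray,
                  alpha: float = 0.6,
                  eps: float = 1e-6) -> float:
    """ar_logprobs: (L, 20) autoregressive log P(token | context).
    msa_frequencies: (L, 20) empirical amino-acid frequencies at each
    position of the retrieved homologous sequences.
    sequence_ids: (L,) integer-encoded amino acids of the scored sequence."""
    retrieval_logprobs = np.log(msa_frequencies + eps)
    combined = alpha * ar_logprobs + (1 - alpha) * retrieval_logprobs
    positions = np.arange(len(sequence_ids))
    return combined[positions, sequence_ids].sum()  # sequence log-likelihood
```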



Disease variant prediction with deep generative models of evolutionary data

November 2021 · 1,123 Reads · 534 Citations · Nature

Quantifying the pathogenicity of protein variants in human disease-related genes would have a marked effect on clinical decisions, yet the overwhelming majority (over 98%) of these variants still have unknown consequences [1–3]. In principle, computational methods could support the large-scale interpretation of genetic variants. However, state-of-the-art methods [4–10] have relied on training machine learning models on known disease labels. As these labels are sparse, biased and of variable quality, the resulting models have been considered insufficiently reliable [11]. Here we propose an approach that leverages deep generative models to predict variant pathogenicity without relying on labels. By modelling the distribution of sequence variation across organisms, we implicitly capture constraints on the protein sequences that maintain fitness. Our model EVE (evolutionary model of variant effect) not only outperforms computational approaches that rely on labelled data but also performs on par with, if not better than, predictions from high-throughput experiments, which are increasingly used as evidence for variant classification [12–16]. We predict the pathogenicity of more than 36 million variants across 3,219 disease genes and provide evidence for the classification of more than 256,000 variants of unknown significance. Our work suggests that models of evolutionary information can provide valuable independent evidence for variant interpretation that will be widely useful in research and clinical settings.
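The label-free scoring idea reduces to comparing the model's approximate log-likelihood of a variant against the wild type. A minimal sketch, assuming an `elbo_sample` routine (hypothetical, supplied by the caller) that returns one Monte Carlo estimate of a VAE's evidence lower bound for a sequence:

```python
# Sketch: score a variant by its approximate log-likelihood ratio
# against the wild type, each estimated via a VAE's ELBO.
from statistics import mean

def evolutionary_index(elbo_sample, wild_type: str, variant: str,
                       n_samples: int = 20) -> float:
    """Monte Carlo estimate of log p(variant) - log p(wild_type)."""
    delta = (mean(elbo_sample(variant) for _ in range(n_samples))
             - mean(elbo_sample(wild_type) for _ in range(n_samples)))
    return delta  # more negative => more likely pathogenic
```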


Figure 1. Percentage of selected points that are corrupted with label noise using different online batch selection criteria. Training data are corrupted with 10% uniform label noise. Mean along with minimum and maximum across 3 runs shown. Using LeNet model; low epoch setting; no data augmentation.
Figure 4. Transfer to a larger model with a different architecture, and rank correlations. Test set accuracy for image classification tasks using reducible loss with a small proxy and transferring the selected GoldiProx sequence to a large model with a different architecture that uses 29x more FLOPs (7x on QMNIST). Methods: reducible loss (GoldiProx Selection) with and without proxy, and random selection. Right axis: rank correlation between reducible loss computed with the small proxy model and the large model. Positive correlations indicate similar selections.
Size and architecture of the proxy and large model used in QMNIST experiments.
Prioritized training on points that are learnable, worth learning, and not yet learned

July 2021 · 64 Reads

We introduce Goldilocks Selection, a technique for faster model training which selects a sequence of training points that are "just right". We propose an information-theoretic acquisition function -- the reducible validation loss -- and compute it with a small proxy model -- GoldiProx -- to efficiently choose training points that maximize information about a validation set. We show that the "hard" (e.g. high loss) points usually selected in the optimization literature are typically noisy, while the "easy" (e.g. low noise) samples often prioritized for curriculum learning confer less information. Further, points with uncertain labels, typically targeted by active learning, tend to be less relevant to the task. In contrast, Goldilocks Selection chooses points that are "just right" and empirically outperforms the above approaches. Moreover, the selected sequence can transfer to other architectures; practitioners can share and reuse it without the need to recreate it.
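Because the acquisition function is computed with a small proxy model, the selected sequence of point indices is itself an artifact that can be shared and reused. A rough sketch of that workflow; the loader interface and keep fraction are illustrative assumptions:

```python
# Sketch of GoldiProx-style selection with a small proxy model; the
# resulting index sequence can be reused to train other architectures.
import torch
import torch.nn.functional as F

def select_sequence(proxy, irreducible_proxy, loader, keep_frac=0.1):
    """Rank points by reducible validation loss computed with a cheap
    proxy model, recording the selected indices as a reusable sequence."""
    sequence = []
    for x, y, idx in loader:  # loader is assumed to yield global indices
        with torch.no_grad():
            reducible = (F.cross_entropy(proxy(x), y, reduction="none")
                         - F.cross_entropy(irreducible_proxy(x), y,
                                           reduction="none"))
        k = max(1, int(keep_frac * len(y)))
        sequence.extend(idx[torch.topk(reducible, k).indices].tolist())
    return sequence  # share this; train any larger model on it
```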


Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning

June 2021 · 19 Reads · 1 Citation

We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric attention mechanisms. However, unlike conventional non-parametric models, we let the model learn end-to-end from the data how to make use of other datapoints for prediction. Empirically, our models solve cross-datapoint lookup and complex reasoning tasks unsolvable by traditional deep learning models. We show highly competitive results on tabular data, early results on CIFAR-10, and give insight into how the model makes use of the interactions between points.
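The architectural move is to run attention across the datapoint axis rather than across the features of a single input. A self-contained PyTorch sketch of one such layer, with embedding sizes chosen purely for illustration:

```python
# Sketch: self-attention between datapoints, not within one input.
import torch
import torch.nn as nn

class DatapointAttention(nn.Module):
    """Treats the whole batch/dataset as one sequence, so each datapoint's
    representation can depend on every other datapoint."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (n_datapoints, dim). Add a singleton batch axis so attention
        # runs across the datapoint axis.
        z = z.unsqueeze(0)
        out, _ = self.attn(z, z, z)
        return self.norm(z + out).squeeze(0)

dataset = torch.randn(128, 32)   # 128 datapoints embedded in 32 dims
print(DatapointAttention(32)(dataset).shape)  # torch.Size([128, 32])
```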


Figure 1. Existing model-size-based complexity measures fail to capture the observation that larger models generalize better. Such measures based on norms or parameter counts grow as model size is increased while the generalization error (orange) decreases. The same applies to existing measures based on the model size of compressed models (green) and a particularly strong norm-based baseline (red) studied by Jiang et al. (2020). We propose a new generalization measure based on the relative model size after pruning (blue) and show that this measure correctly predicts the decrease in generalization error as model size increases. This example of ResNets trained on CIFAR-10 is merely an illustration of our claim. We provide a rigorous examination in Section 4.
Prunability is competitive with strong baseline generalization measures. Comparison of generalization measures' correlation with test performance on a set of convolutional networks trained on CIFAR-10. Higher values are better across all metrics. The standard error of the Kendall's τ is s_τ = 0.037.
Prunability achieves a stronger granulated rank correlation than measures based on random-perturbation robustness or norms. Columns labeled with hyperparameters (width, dropout rate, etc.) indicate Kendall's τ if we only vary this hyperparameter. The last two columns are the overall Kendall's τ and the granulated Kendall's coefficient ψ, which is the average of the per-hyperparameter columns. Higher values are better across all columns.
Robustness to Pruning Predicts Generalization in Deep Neural Networks

March 2021 · 82 Reads

Existing generalization measures that aim to capture a model's simplicity based on parameter counts or norms fail to explain generalization in overparameterized deep neural networks. In this paper, we introduce a new, theoretically motivated measure of a network's simplicity which we call prunability: the smallest fraction of the network's parameters that can be kept while pruning without adversely affecting its training loss. We show that this measure is highly predictive of a model's generalization performance across a large set of convolutional networks trained on CIFAR-10, does not grow with network size unlike existing pruning-based measures, and exhibits high correlation with test set loss even in a particularly challenging double descent setting. Lastly, we show that the success of prunability cannot be explained by its relation to known complexity measures based on models' margin, flatness of minima and optimization speed, finding that our new measure is similar to -- but more predictive than -- existing flatness-based measures, and that its predictions exhibit low mutual information with those of other baselines.
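Operationally, prunability can be estimated by sweeping pruning levels and keeping the smallest fraction of weights at which the training loss is still approximately preserved. A hedged magnitude-pruning sketch; the tolerance and sweep grid are illustrative assumptions, not the paper's protocol:

```python
# Sketch: estimate prunability via a magnitude-pruning sweep.
import copy
import torch

def prunability(model, train_loss_fn, tol=0.01,
                fractions=(0.5, 0.2, 0.1, 0.05, 0.02, 0.01)):
    base_loss = train_loss_fn(model)
    smallest = 1.0
    for frac in fractions:  # sweep from mild to aggressive pruning
        pruned = copy.deepcopy(model)
        for p in pruned.parameters():
            k = max(1, int(frac * p.numel()))
            # keep only the k largest-magnitude weights in this tensor
            threshold = p.abs().flatten().topk(k).values.min()
            with torch.no_grad():
                p.mul_((p.abs() >= threshold).float())
        if train_loss_fn(pruned) <= base_loss + tol:
            smallest = frac  # training loss preserved; keep shrinking
        else:
            break
    return smallest  # lower => simpler model => better predicted generalization
```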


Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning

December 2020 · 181 Reads · 3 Citations

Quantifying the pathogenicity of protein variants in human disease-related genes would have a profound impact on clinical decisions, yet the overwhelming majority (over 98%) of these variants still have unknown consequences. In principle, computational methods could support the large-scale interpretation of genetic variants. However, prior methods have relied on training machine learning models on available clinical labels. Since these labels are sparse, biased, and of variable quality, the resulting models have been considered insufficiently reliable. By contrast, our approach leverages deep generative models to predict the clinical significance of protein variants without relying on labels. The natural distribution of protein sequences we observe across organisms is the result of billions of evolutionary experiments. By modeling that distribution, we implicitly capture constraints on the protein sequences that maintain fitness. Our model EVE (Evolutionary model of Variant Effect) not only outperforms computational approaches that rely on labelled data, but also performs on par with, if not better than, high-throughput assays which are increasingly used as strong evidence for variant classification. After thorough validation on clinical labels, we predict the pathogenicity of 11 million variants across 1,081 disease genes, and assign high-confidence reclassification for 72k Variants of Unknown Significance. Our work suggests that models of evolutionary information can provide a strong source of independent evidence for variant interpretation and that the approach will be widely useful in research and clinical settings.
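Turning continuous scores into clinical calls requires a label-free decision rule. One natural choice, sketched below under assumed thresholds (the uncertainty band and component-assignment convention are illustrative, not the paper's exact procedure), is a two-component Gaussian mixture over the scores with an explicit "uncertain" band:

```python
# Sketch: label-free pathogenic/benign calls from continuous scores.
import numpy as np
from sklearn.mixture import GaussianMixture

def classify_variants(scores: np.ndarray, uncertain_band=(0.25, 0.75)):
    gmm = GaussianMixture(n_components=2, random_state=0)
    gmm.fit(scores.reshape(-1, 1))
    # Assume here that the lower-mean component is the pathogenic-like
    # cluster; the sign convention depends on how scores are defined.
    pathogenic_comp = int(np.argmin(gmm.means_.ravel()))
    p = gmm.predict_proba(scores.reshape(-1, 1))[:, pathogenic_comp]
    labels = np.where(p >= uncertain_band[1], "pathogenic",
                      np.where(p <= uncertain_band[0], "benign", "uncertain"))
    return labels, p
```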


Interlocking Backpropagation: Improving depthwise model-parallelism

October 2020 · 22 Reads

The number of parameters in state-of-the-art neural networks has drastically increased in recent years. This surge of interest in large-scale neural networks has motivated the development of new distributed training strategies enabling such models. One such strategy is model-parallel distributed training. Unfortunately, model-parallelism suffers from poor resource utilisation, which wastes compute. In this work, we improve upon recent developments in an idealised model-parallel optimisation setting: local learning. Motivated by poor resource utilisation, we introduce a class of intermediary strategies between local and global learning referred to as interlocking backpropagation. These strategies preserve many of the compute-efficiency advantages of local optimisation while recovering much of the task performance achieved by global optimisation. We assess our strategies on both image-classification ResNets and Transformer language models, finding that our strategy consistently outperforms local learning in terms of task performance, and outperforms global learning in training efficiency.
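The spectrum between local and global learning can be parameterised by how many preceding blocks a local loss is allowed to reach. A hedged PyTorch sketch of one training step; the auxiliary heads and the `reach` parameter are illustrative assumptions:

```python
# Sketch: one step of interlocking-style backpropagation, where each
# block's auxiliary loss reaches back through a limited number of blocks.
import torch

def interlocking_step(blocks, heads, x, y, loss_fn, reach=2):
    """reach=1 recovers local learning; reach=len(blocks) recovers
    global end-to-end learning."""
    with torch.no_grad():  # cache activations without building a graph
        activations = [x]
        for block in blocks:
            activations.append(block(activations[-1]))
    total = 0.0
    for i, head in enumerate(heads):
        # Recompute forward from the activation `reach` blocks back, so
        # gradients from this head stop after `reach` blocks.
        start = max(0, i + 1 - reach)
        h = activations[start].detach()
        for block in blocks[start:i + 1]:
            h = block(h)
        total = total + loss_fn(head(h), y)
    total.backward()  # overlapping local losses update shared blocks
```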


Citations (11)


... Through aggregation of these embeddings, entire protein sequences can be parameterized, laying a comprehensive foundation for predicting mutations' effects on a protein's structure. Such capabilities allow for practical application of these models in various bioengineering tasks, including prediction of secondary structures, residue contacts, and a mutation's impact on enzyme functionality (Elnaggar et al., 2022; Rives et al., 2021; Lin et al., 2023; Notin et al., 2022; Rao et al., 2021; Jumper et al., 2021), thus advancing the field beyond previously available methods. ...

Reference:

Enhancing the reverse transcriptase function in Taq polymerase via AI-driven multiparametric rational design
Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval

... In evaluating performance for individual genes, we only considered predictors making 10 or more predictions for a given gene; for 84/99 genes, this criterion was met by all 24 predictors (Additional file 4: Table S5). We had initially intended to evaluate EVE [14] but because it only provided predictions for 41% of the variants in our set and provided no scores at all for more than half of the 99 genes, it was not included. Schematic overview of predictor benchmarking in population-based cohorts based on human gene-trait associations. ...

Publisher Correction: Disease variant prediction with deep generative models of evolutionary data

Nature

... Most variants isolated have been gain-of-function and associated with two painful channelopathies: familial episodic pain syndromes (FEPS) and small fibre neuropathy. FEPS comprises four subtypes and is characterised by childhood onset of extreme pain episodes, normally in the arms and legs, the severity of which often attenuates with age (36,37). Conversely, no loss-of-function SCN11A variants have been identified. ...

Disease variant prediction with deep generative models of evolutionary data

Nature

... The gradient descent approach can be utilized to optimize the loss function, and regularization parameters can be employed to prevent overfitting [25]. For tabular data, the XGBoost model is the go-to tool for most practitioners and data science competitions [26]. To ...

Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning
  • Citing Preprint
  • June 2021

... [172–180] These approaches make the estimation of the impact of every possible substitution at every position in a protein-coding genome computationally feasible [181]. They also hold great potential for guiding protein design and engineering [182,183]. The success of these methods lies in their ability to capture dependencies between protein residues, either by explicitly estimating inter-residue (pairwise) couplings [177,178] or by implicitly accounting for global sequence contexts. ...

Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning

... Many of these approaches leverage Machine Learning and Deep Learning algorithms. From our perspective of using network features, we can identify two lines of thought in the literature: (i) prediction of engagement without considering graph-based features [42,44,14,29,34,4], and (ii) prediction of engagement with graph-based features [19,46,37,18,21,2]. ...

Predicting Twitter Engagement With Deep Language Models

... The proposed approach converts biometric data into dynamic encryption keys through neural networks, providing secure authentication without storing conventional keys. Additionally, the research explores integrating neural networks with lightweight block ciphers (Gomez et al. 2018), merging number theory principles with neural network architectures to create cutting-edge security protocols. Furthermore, by analyzing key differences' impact on S-DES encryption, the study examines the intricate relationships between neural network outputs and cryptographic key generation. ...

Unsupervised Cipher Cracking Using Discrete GANs

... While such multi-task learning architectures have been used in natural language processing (Collobert and Weston, 2008), machine translation (Johnson et al., 2017), speech recognition (Seltzer and Droppo, 2013), computer vision (Zhang et al., 2014), and content recommendation (Ma et al., 2018), they have rarely been applied to spatio-temporal forecasting problems in ride-hailing systems. Previously, deep learning was applied to each spatio-temporal forecasting problem in isolation, which limits its efficiency since the effort must be repeated for each problem (Kaiser et al., 2017). In this study, a spatio-temporal multi-task learning architecture with mixture-of-experts is developed for forecasting multiple spatio-temporal tasks within a city as well as across cities. ...

One Model To Learn Them All
  • Citing Article
  • June 2017

... Machine Translation (MT) and Large Language Models (LLMs) represent the most sophisticated intersection of academic research and business application in the translation industry. MT systems, such as Google Translate or DeepL, are built on decades of academic research in computational linguistics (Vaswani et al., 2017), while LLMs like GPT-4 leverage vast amounts of data and advanced algorithms to produce highly accurate translations (Koehn, 2020). ...

Attention Is All You Need
  • Citing Article
  • June 2017