Sebastian Ruder’s research while affiliated with Google Inc. and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (171)


Figure 2: A sentence shared across Syrian, Jordanian, and Palestinian varieties may be labeled as Jordanian but predicted as Syrian, resulting in a false NADI error.
Figure 3: Llama models and Command series base models are best at maintaining the user's DA variety, as measured by ADI2 score (bars) and macro-score (marks).
Figure 4: ADI2 (correct-variety dialectness scores) distributions across LLMs and genres in the crosslingual task (which requests specific DA varieties of the LLM in English). ADI2=0 indicates the wrong Arabic variety.
Figure 5: DA→Eng MT surpasses Eng→DA. DA↔MSA scores are low in the BTEC genre and rarely above the dotted-line zero-translate SpBLEU baseline for FLORES. Bars represent SpBLEU, while marks are chrF. Scores are between 0 and 1. (i.e. 0.5 corresponds to 50 SpBLEU points.) Note dza is the country code for Algeria.
Figure 6: Human eval results show that post-trained LLMs produce responses that are fluent, adequate, and adherent, but mostly not in the right DA variety. Command-R+ base improves dialectal fidelity but scores poorly on other metrics. Post-trained LLM fluency and fidelity scores were averaged across MT and monolingual tasks.


AL-QASIDA: Analyzing LLM Quality and Accuracy Systematically in Dialectal Arabic
  • Preprint
  • File available

December 2024 · 29 Reads

Nathaniel R. Robinson · Shahd Abdelmoneim · [...] · Sebastian Ruder

Dialectal Arabic (DA) varieties are under-served by language technologies, particularly large language models (LLMs). This trend threatens to exacerbate existing social inequalities and limits language modeling applications, yet the research community lacks operationalized LLM performance measurements in DA. We present a method that comprehensively evaluates LLM fidelity, understanding, quality, and diglossia in modeling DA. We evaluate nine LLMs in eight DA varieties across these four dimensions and provide best practice recommendations. Our evaluation suggests that LLMs do not produce DA as well as they understand it, but does not suggest deterioration in quality when they do. Further analysis suggests that current post-training can degrade DA capabilities, that few-shot examples can overcome this and other LLM deficiencies, and that otherwise no measurable features of input text correlate well with LLM DA performance.
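The ADI2 score referenced in Figures 3–4 only credits dialectness when the response is in the requested variety. Below is a minimal sketch of that kind of aggregation, assuming a hypothetical dialect-identification function that returns a predicted variety and a dialectness score in [0, 1]; names are illustrative, not the paper's released code.

```python
# Minimal sketch of a correct-variety dialectness score in the spirit of ADI2
# (cf. Figures 3-4). `identify_dialect` is a hypothetical stand-in for a
# dialect-ID model returning (predicted_variety, dialectness in [0, 1]).

def adi2_like_score(responses, requested_variety, identify_dialect):
    """Average dialectness, zeroed whenever the predicted variety is wrong."""
    scores = []
    for text in responses:
        predicted_variety, dialectness = identify_dialect(text)
        # A response in the wrong Arabic variety contributes 0 (ADI2 = 0 in Figure 4).
        scores.append(dialectness if predicted_variety == requested_variety else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy dialect identifier for illustration only.
    def identify_dialect(text):
        return ("egy", 0.8) if "ازيك" in text else ("msa", 0.1)

    print(adi2_like_score(["ازيك عامل ايه؟", "كيف حالك؟"], "egy", identify_dialect))
```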


Figure 1: Performance gap between RewardBench (English) and the average M-REWARDBENCH scores across 23 languages for various reward models (Pearson r: 0.92, Spearman ρ: 0.89). All models underperform on our multilingual benchmark compared to their performance on the corresponding English benchmark.
Dataset statistics for M-REWARDBENCH.
M-RewardBench: Evaluating Reward Models in Multilingual Settings

October 2024 · 19 Reads

Srishti Gureja · Lester James V. Miranda · [...]

Reward models (RMs) have driven the state-of-the-art performance of LLMs today by enabling the integration of human feedback into the language modeling process. However, RMs are primarily trained and evaluated in English, and their capabilities in multilingual settings remain largely understudied. In this work, we conduct a systematic evaluation of several reward models in multilingual settings. We first construct the first-of-its-kind multilingual RM evaluation benchmark, M-RewardBench, consisting of 2.87k preference instances for 23 typologically diverse languages, which tests the chat, safety, reasoning, and translation capabilities of RMs. We then rigorously evaluate a wide range of reward models on M-RewardBench, offering fresh insights into their performance across diverse languages. We identify a significant gap in RM performance between English and non-English languages and show that RM preferences can change substantially from one language to another. We also present several findings on how different multilingual aspects impact RM performance. Specifically, we show that the performance of RMs improves with better translation quality. Similarly, we demonstrate that the models exhibit better performance for high-resource languages. We release the M-RewardBench dataset and codebase to facilitate a better understanding of RM evaluation in multilingual settings.
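At its core, evaluating a reward model on a preference benchmark like this reduces to checking how often it scores the chosen response above the rejected one, broken down by language. A minimal sketch under that assumption, with `reward_fn` standing in for any scoring model (names are illustrative, not the paper's code):

```python
from collections import defaultdict


def rm_accuracy_by_language(instances, reward_fn):
    """Fraction of preference pairs where the RM ranks `chosen` above `rejected`.

    `instances` are dicts with keys: prompt, chosen, rejected, language.
    `reward_fn(prompt, response)` returns a scalar score (hypothetical stand-in).
    """
    correct, total = defaultdict(int), defaultdict(int)
    for ex in instances:
        lang = ex["language"]
        if reward_fn(ex["prompt"], ex["chosen"]) > reward_fn(ex["prompt"], ex["rejected"]):
            correct[lang] += 1
        total[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}
```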


BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts

August 2024 · 12 Reads

The Mixture of Experts (MoE) framework has become a popular architecture for large language models due to its superior performance over dense models. However, training MoEs from scratch in a large-scale regime is prohibitively expensive. Existing methods mitigate this by pre-training multiple dense expert models independently and using them to initialize an MoE. This is done by using experts' feed-forward network (FFN) to initialize the MoE's experts while merging other parameters. However, this method limits the reuse of dense model parameters to only the FFN layers, thereby constraining the advantages when "upcycling" these models into MoEs. We propose BAM (Branch-Attend-Mix), a simple yet effective method that addresses this shortcoming. BAM makes full use of specialized dense models by not only using their FFN to initialize the MoE layers but also leveraging experts' attention parameters fully by initializing them into a soft-variant of Mixture of Attention (MoA) layers. We explore two methods for upcycling attention parameters: 1) initializing separate attention experts from dense models including all attention parameters for the best model performance; and 2) sharing key and value parameters across all experts to facilitate better inference efficiency. To further improve efficiency, we adopt a parallel attention transformer architecture for MoEs, which allows the attention experts and FFN experts to be computed concurrently. Our experiments on seed models ranging from 590 million to 2 billion parameters demonstrate that BAM surpasses baselines in both perplexity and downstream task performance, within the same computational and data constraints.
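The upcycling step described above is essentially a parameter-copying recipe: each dense expert donates its FFN (and, in BAM, attention) weights to one expert slot, while the remaining parameters are merged by averaging. A rough sketch over plain state dicts, assuming parameter names containing "ffn" or "attn" mark the expert-specific blocks (naming convention is an assumption, not BAM's actual code):

```python
import numpy as np


def upcycle_dense_to_moe(dense_state_dicts, expert_keys=("ffn", "attn")):
    """Initialize an MoE-style checkpoint from several dense expert checkpoints.

    Parameters whose names contain any of `expert_keys` are copied per expert
    (FFN experts and, as in BAM, attention experts); everything else
    (embeddings, norms, ...) is merged by simple averaging.
    """
    moe_state = {}
    reference = dense_state_dicts[0]
    for name, tensor in reference.items():
        if any(key in name for key in expert_keys):
            # One copy per expert, e.g. "layers.0.ffn.w1" -> "layers.0.ffn.w1.expert_2"
            for i, sd in enumerate(dense_state_dicts):
                moe_state[f"{name}.expert_{i}"] = sd[name].copy()
        else:
            # Merge shared parameters by averaging across the dense models.
            moe_state[name] = np.mean([sd[name] for sd in dense_state_dicts], axis=0)
    return moe_state
```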


How Does Quantization Affect Multilingual LLMs?

July 2024 · 41 Reads

Quantization techniques are widely used to improve inference speed and deployment of large language models. While a wide body of work examines the impact of quantized LLMs on English tasks, none have examined the effect of quantization across languages. We conduct a thorough analysis of quantized multilingual LLMs, focusing on their performance across languages and at varying scales. We use automatic benchmarks, LLM-as-a-Judge methods, and human evaluation, finding that (1) harmful effects of quantization are apparent in human evaluation, and automatic metrics severely underestimate the detriment: a 1.7% average drop in Japanese across automatic tasks corresponds to a 16.0% drop reported by human evaluators on realistic prompts; (2) languages are disparately affected by quantization, with non-Latin script languages impacted worst; and (3) challenging tasks such as mathematical reasoning degrade fastest. As the ability to serve low-compute models is critical for wide global adoption of NLP technologies, our results urge consideration of multilingual performance as a key evaluation criterion for efficient models.
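The headline numbers here are relative per-language drops between a full-precision model and its quantized counterpart. A small sketch of that bookkeeping, using hypothetical score dictionaries rather than the paper's actual benchmark harness:

```python
def relative_drop(fp16_scores, quantized_scores):
    """Per-language relative degradation (%) caused by quantization.

    Both inputs map language codes to task scores, e.g. {"ja": 71.2, "en": 80.4}.
    Per the abstract, a small automatic-metric drop can mask a much larger
    human-judged drop, so these numbers are best read as a lower bound.
    """
    return {
        lang: 100.0 * (fp16_scores[lang] - quantized_scores[lang]) / fp16_scores[lang]
        for lang in fp16_scores
        if lang in quantized_scores
    }


# Illustrative (made-up) numbers, not results from the paper:
print(relative_drop({"en": 80.0, "ja": 71.2}, {"en": 79.4, "ja": 70.0}))
```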


Figure 4: Comparison of active inheritance methods (single-source and multi-source sampling) targeting various metrics, where the goals are to increase length and lexical diversity and decrease toxicity. Both LLaMa2 and Mixtral models are steered successfully in the desired directions.
Figure 5: Comparison of active inheritance methods (single-source and multi-source sampling) targeting various metrics. Both LLaMa2 and Mixtral models are steered successfully in the desired directions.
StereoSet Stereotype Scores across different minorities.
BBQ Ambiguous Bias Score ∆ between base teacher model and student-teacher finetuned models.
Expected Maximum Toxicity (EMT) and toxicity probability, calculated using the Perspective API.
LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives

July 2024 · 53 Reads

The widespread adoption of synthetic data raises new questions about how models generating the data can influence other large language models (LLMs) via distilled data. To start, our work exhaustively characterizes the impact of passive inheritance of model properties by systematically studying the consequences of synthetic data integration. We provide one of the most comprehensive studies to date of how the source of synthetic data shapes models' internal biases, calibration, and the textual attributes and preferences of their generations. We find that models are surprisingly sensitive to certain attributes even when the synthetic data prompts appear "neutral", which raises the question of whether this sensitivity can be exploited for good: can we explicitly steer models towards the properties we want at test time by exploiting the data generation process? This would historically have been considered infeasible due to the cost of collecting data with a specific characteristic or objective in mind. However, improvements in the quality of synthetic data, as well as a shift towards general-purpose models designed to follow a diverse range of instructions, mean this question is timely. We propose active inheritance as a term to describe intentionally constraining synthetic data according to a non-differentiable objective. We demonstrate how active inheritance can steer the generation profiles of models towards desirable non-differentiable attributes, e.g. high lexical diversity or low toxicity.
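Active inheritance, as described, amounts to best-of-k sampling against a non-differentiable objective: generate several candidates per prompt, score them with the target metric, and keep only the best for finetuning. A hedged sketch, with `generate` and `metric` as hypothetical stand-ins for the teacher model API and the chosen objective:

```python
def active_inheritance_dataset(prompts, generate, metric, k=8):
    """Build a finetuning set by keeping, per prompt, the candidate that
    maximizes a non-differentiable objective.

    `generate(prompt, n)` returns n sampled completions (hypothetical teacher API).
    `metric(text)` returns a scalar to maximize, e.g. a lexical-diversity proxy
    or the negated toxicity from an external classifier.
    """
    dataset = []
    for prompt in prompts:
        candidates = generate(prompt, k)
        best = max(candidates, key=metric)
        dataset.append({"prompt": prompt, "completion": best})
    return dataset


def type_token_ratio(text):
    """Simple lexical-diversity proxy: unique tokens / total tokens."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```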


Figure 1: Language Confusion can occur at the word level, line level, or over the entire output response.
Figure A2: Template used for few-shot prompting the base models. The model's answers are truncated to prevent the generation of new questions. For the instruct variants, we use similar prompting, except that the Q/A examples are formatted as User/Chatbot turns using the model's chat template.
Line-level pass rate (LPR) on monolingual and cross-lingual generation, by language.
Understanding and Mitigating Language Confusion in LLMs

June 2024 · 62 Reads

We investigate a surprising limitation of LLMs: their inability to consistently generate text in a user's desired language. We create the Language Confusion Benchmark (LCB) to evaluate such failures, covering 15 typologically diverse languages with existing and newly-created English and multilingual prompts. We evaluate a range of LLMs on monolingual and cross-lingual generation reflecting practical use cases, finding that Llama Instruct and Mistral models exhibit high degrees of language confusion and even the strongest models fail to consistently respond in the correct language. We observe that base and English-centric instruct models are more prone to language confusion, which is aggravated by complex prompts and high sampling temperatures. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning. We release our language confusion benchmark, which serves as a first layer of efficient, scalable multilingual evaluation at https://github.com/for-ai/language-confusion.
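The line-level pass rate reported in the table above can be approximated with any off-the-shelf language identifier: split the response into lines and count the fraction identified as the requested language. A minimal sketch using the `langdetect` package as a stand-in for the paper's LID setup (an assumption, not the released code); word-level confusion (Figure 1) would need token-level LID and is not covered here.

```python
from langdetect import detect  # pip install langdetect; stand-in for a stronger LID model


def line_pass_rate(response, expected_lang):
    """Fraction of non-empty lines identified as `expected_lang` (e.g. "de")."""
    lines = [line for line in response.splitlines() if line.strip()]
    if not lines:
        return 0.0
    passed = 0
    for line in lines:
        try:
            passed += detect(line) == expected_lang
        except Exception:
            # Very short or symbol-only lines can fail detection; count as a miss.
            pass
    return passed / len(lines)


print(line_pass_rate("Gerne! Hier ist eine Idee.\nSure, here is an idea.", "de"))  # ~0.5
```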


SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

June 2024 · 85 Reads

Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.


Aya 23: Open Weight Releases to Further Multilingual Progress

May 2024 · 72 Reads · 1 Citation

This technical report introduces Aya 23, a family of multilingual language models. Aya 23 builds on the recent release of the Aya model (Üstün et al., 2024), focusing on pairing a highly performant pre-trained model with the recently released Aya collection (Singh et al., 2024). The result is a powerful multilingual large language model serving 23 languages, expanding state-of-the-art language modeling capabilities to approximately half of the world's population. The Aya model covered 101 languages, whereas Aya 23 is an experiment in depth vs. breadth, exploring the impact of allocating more capacity to fewer languages that are included during pre-training. Aya 23 outperforms both previous massively multilingual models like Aya 101 for the languages it covers, as well as widely used models like Gemma, Mistral and Mixtral on an extensive range of discriminative and generative tasks. We release the open weights for both the 8B and 35B models as part of our continued commitment to expanding access to multilingual progress.
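Since the weights are open, the 8B model can be tried with standard Hugging Face tooling. A minimal sketch, assuming the checkpoint ID below points at the released Aya 23 8B weights and that a chat template ships with the tokenizer:

```python
# Minimal generation sketch with Hugging Face transformers.
# The checkpoint ID is an assumption about where the released 8B weights live.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/aya-23-8B"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Écris un court poème sur la mer."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```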


Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

February 2024 · 39 Reads · 6 Citations

Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in the English language. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.
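A large share of the 513 million instances described above comes from templating existing labeled datasets into instruction-completion pairs. A toy sketch of that step, with a made-up example and template (illustrative only, not the actual Aya Collection templates):

```python
def apply_template(example, template, completion_field):
    """Turn one labeled example into an instruction-completion pair."""
    return {
        "inputs": template.format(**example),
        "targets": str(example[completion_field]),
    }


# Made-up template and example; the Aya Collection applies many such templates
# to existing datasets across 114 languages.
template = "Classify the sentiment of the following review as positive or negative:\n{text}"
example = {"text": "The film was a delight from start to finish.", "label": "positive"}
print(apply_template(example, template, completion_field="label"))
```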



Citations (54)


... Machine translated datasets are commonly used in mLLM training (Dang et al., 2024a) and evaluation (Lai et al., 2023), with the intention to reduce data scarcity across languages (Muennighoff et al., 2023; Holmström & Doostmohammadi, 2023; Üstün et al., 2024). However, synthetic, model-generated data is prone to systematic biases (Ahn et al., 2022; Lukasik et al., 2022; Shimabucoro et al., 2024). In particular, machine-translated prompts may contain translation artifacts affecting evaluation outcomes (Agrawal et al., 2024a). ...

Reference:

Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation
LLM See, LLM Do: Leveraging Active Inheritance to Target Non-Differentiable Objectives
  • Citing Conference Paper
  • January 2024

... CLIP shows unique advantages in such tasks, and its pre-trained rich features help improve the generalization ability of the model [23]. However, although CLIP-based FSL has great potential [51,70], its logits often show serious inter-class confusion, resulting in a decrease in classification accuracy, which limits the performance improvement of applying CLIP to FSL [35]. ...

Understanding and Mitigating Language Confusion in LLMs
  • Citing Conference Paper
  • January 2024

... For example, high sampling temperatures can result in unexpected and less cohesive text, whereas deterministic decoding methods can result in repetitiveness and lack of diversity (Wiher et al., 2022). In addition, altering the model weights, for example by over-quantisation, can contribute to noisy generations (Xiao et al., 2023; Marchisio et al., 2024). Finally, noise can result from a mismatch between a model and the task specification or complexity. ...

How Does Quantization Affect Multilingual LLMs?
  • Citing Conference Paper
  • January 2024

... Model merging has demonstrated the effectiveness of zero-shot learning across several applications. Some examples of practical applications include cross-lingual transfer [25,63,86,211], hybrid style image generation [12,118], and multi-modal processing [16]. Some works achieve cross-lingual transfer through model merging, such as chat [63], text summarization [25], or reasoning [211]. ...

Language and Task Arithmetic with Parameter-Efficient Layers for Zero-Shot Summarization
  • Citing Conference Paper
  • January 2024

... These models aim to provide cross-lingual capabilities by being trained on diverse language datasets. Despite their promise, they often struggle with low-resource languages due to insufficient training data and the inherent difficulty of balancing multiple languages within one model [Dac Lai et al., 2023, Singh et al., 2024a, Lovenia et al., 2024]. Efforts in this area have included data augmentation, transfer learning, and specialized models for specific languages or tasks [Conneau et al., 2018, Artetxe et al., 2018, Conneau and Lample, 2019, Team et al., 2022]. ...

SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

... Our work entailed an extensive, open science process to manually collect data by working directly with native speakers of different languages (Elliott et al., 2016;Thapliyal et al., 2022;Li et al., 2024c;Üstün et al., 2024;Singh et al., 2024b). This is acutely needed in the field of machine learning, where recent studies have highlighted that dataset creators remain predominantly Western-centric (Longpre et al., 2025). ...

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

... Cross-Language and Cross-Task Parallelism Many existing multi-task benchmarks consist of separate datasets covering different languages (Asai et al., 2024), often leading to underrepresentation of LRLs. Besides, the tasks in these benchmarks are typically simple natural language understanding tasks (Hu et al., 2020). ...

BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer
  • Citing Conference Paper
  • January 2024

... Existing multilingual datasets can be categorized into human generated, human-AI generated, and machine translated datasets (Section 2 and Table 1). Many human generated datasets consist of conversations on a wide variety of topics, making them ideal for IFT, but can be extremely resource and time intensive to create, with the possibility of annotator errors and uneven distributions [41]. Human-AI generated datasets leverage humans voluntarily sharing conversations with existing LLMs, making the generation process less resource intensive, but come with challenges such as possible privacy issues, toxic data, and low complexity conversations [52]. ...

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning
  • Citing Article
  • February 2024

... We utilise a generative LLM G, which has been fine-tuned on natural language instructions (Chung et al., 2022), and prompt it to paraphrase the original hypothesis h_i, with the following prompt: Rephrase the following sentence while preserving its original meaning: <h_i>. This is not sufficient to produce semantics-preserving variations, as generative models are prone to hallucinations (Ji et al., 2023) and not assured to produce an equivalent paraphrase. To ensure that the generation h'_i is logically equivalent to the original sample and thus semantics-preserving, we impose the condition that the NLI model should infer the relation between the original and generated hypothesis as a symmetric entailment: ...

QAmeleon : Multilingual QA with Only 5 Examples

Transactions of the Association for Computational Linguistics

... Also, due to the low-resource nature of Yorùbá language, NLP research works in its domain have benefitted from transfer learning by pre-training on a high-resource language and fine-tuning on specific tasks in this language. Transfer learning of this nature over the years has been mainly composed of leveraging multilingual pre-training [71,118,119,120,63,77] and cross-lingual transfer [121,18,17,111]. Figure 10 shows word clouds of both the employed techniques 10b and the NLP tasks 10a carried out in the primary studies involved in the review. ...

TaTA: A Multilingual Table-to-Text Dataset for African Languages
  • Citing Conference Paper
  • January 2023