Douwe Kiela’s research while affiliated with Stanford University and other places


Publications (161)


Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
  • Article

April 2025 · 1 Read · 4 Citations · Transactions of the Association for Computational Linguistics

Karel D'Oosterlinck · Winnie Xu · Chris Develder · [...] · Shikib Mehri

Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets. The interaction between model, paired data, and objective makes alignment a complicated procedure, sometimes producing subpar results. We study this and find that (i) preference data gives a better learning signal when the underlying responses are contrastive, and (ii) alignment objectives lead to better performance when they specify more control over the model during training. Based on these insights, we introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. We align Llama-3-8B-Instruct using various comparable datasets and alignment objectives and measure MixEval-Hard scores, which correlate highly with human judgments. The CLAIR preferences lead to the strongest performance out of all datasets, and APO consistently outperforms less controllable objectives. Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%. Our code and datasets are available.
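A minimal sketch of the distinction the abstract draws, in PyTorch: a standard contrastive (DPO-style) loss constrains only the gap between the chosen and rejected log-ratios, while an "anchored" variant can constrain each log-ratio against the reference model separately. The function names and the exact anchored form below are illustrative assumptions, not the paper's definitions.

```python
import torch.nn.functional as F

def dpo_style_loss(chosen_lr, rejected_lr, beta=0.1):
    # chosen_lr / rejected_lr: log pi_theta(y|x) - log pi_ref(y|x) per example.
    # Only the *difference* is constrained, leaving the direction of each
    # individual update underspecified.
    return -F.logsigmoid(beta * (chosen_lr - rejected_lr)).mean()

def anchored_loss_sketch(chosen_lr, rejected_lr, beta=0.1):
    # Illustrative anchored variant: push the chosen response's log-ratio up
    # and the rejected response's log-ratio down relative to the reference
    # model independently, giving more explicit control during training.
    return (-F.logsigmoid(beta * chosen_lr)
            - F.logsigmoid(-beta * rejected_lr)).mean()
```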


Figure 1: Natural Language Unit Tests: Overview of the three-step process: (1) unit test creation, (2) LMUnit-based scoring with natural language rationales, and (3) score aggregation for overall quality assessment.
Figure 3: LMUnit Unit Test Scoring Improves Inter-Annotator Agreement on Preference Data: Instructing annotators to answer gold-standard unit tests improves inter-annotator agreement by 48% over pairwise judging of responses and by 20% over rubric-based scoring ("Spec").
Figure 4: LMUnit Favored Over LM Judges for Identified Response Attributes and Error Modes: Surveyed LM researchers and engineers identified more important response attributes and error modes in LLM responses with LMUnit; all 16 individuals surveyed favored LMUnit over traditional LM judges.
Rationale Ablations: Training on rationale data improves LMUnit LLaMA3.1-70B performance with no rationales at test time, though test-time rationale generation decreases overall performance. Rationale generation can be improved by DPO post-training, with the greatest gains coming from chosen-and-rejected teacher example pairs. Bolded numbers indicate the best overall performance; underlined numbers indicate the best performance with rationales enabled at test time.
LMUnit: Fine-grained Evaluation with Natural Language Unit Tests
  • Preprint
  • File available

December 2024 · 15 Reads

As language models become integral to critical workflows, assessing their behavior remains a fundamental challenge -- human evaluation is costly and noisy, while automated metrics provide only coarse, difficult-to-interpret signals. We introduce natural language unit tests, a paradigm that decomposes response quality into explicit, testable criteria, along with a unified scoring model, LMUnit, which combines multi-objective training across preferences, direct ratings, and natural language rationales. Through controlled human studies, we show this paradigm significantly improves inter-annotator agreement and enables more effective LLM development workflows. LMUnit achieves state-of-the-art performance on evaluation benchmarks (FLASK, BigGenBench) and competitive results on RewardBench. These results validate both our proposed paradigm and scoring model, suggesting a promising path forward for language model evaluation and development.
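As a rough sketch of the unit-test paradigm described above (the criterion wording, the `score_fn` callable, and mean aggregation are all assumptions for illustration, not LMUnit's actual interface):

```python
from statistics import mean

def evaluate_with_unit_tests(prompt, response, unit_tests, score_fn):
    """Score a response against natural language unit tests, then aggregate.

    unit_tests: plain-language criteria, e.g. "Is every claim supported by
    the provided context?". score_fn: any callable mapping
    (prompt, response, unit_test) to a score in [0, 1], such as a call to a
    scoring model like LMUnit.
    """
    per_test = {test: score_fn(prompt, response, test) for test in unit_tests}
    return {"per_test": per_test, "overall": mean(per_test.values())}
```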


Image Recall@1 results for Flickr30k and COCO; % change in parentheses, "ft." indicates finetuned.
Nearest Neighbor Normalization Improves Multimodal Retrieval

October 2024 · 9 Reads

Multimodal models leverage large-scale pre-training to achieve strong but still imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval. In this paper, we present a simple and efficient method for correcting errors in trained contrastive image-text retrieval models with no additional training, called Nearest Neighbor Normalization (NNN). We show an improvement on retrieval metrics in both text retrieval and image retrieval for all of the contrastive models that we tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and for both of the datasets that we used (MS-COCO and Flickr30k). NNN requires a reference database, but does not require any training on this database, and can even increase the retrieval accuracy of a model after finetuning.
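A citation excerpt later on this page notes that NNN "normalizes the retrieval score using the k-closest queries from a reference dataset." A small NumPy sketch of that idea follows; the function signature and hyperparameter values are assumptions, not the paper's exact procedure.

```python
import numpy as np

def nnn_corrected_scores(query_embs, gallery_embs, reference_query_embs,
                         k=16, alpha=0.75):
    """Training-free score correction in the spirit of NNN.

    For each gallery item, estimate a bias as the mean similarity to its k
    closest queries in a reference query set, then subtract the scaled bias
    from every query-gallery similarity.
    """
    sims = query_embs @ gallery_embs.T                 # (Q, G) raw similarities
    ref_sims = reference_query_embs @ gallery_embs.T   # (R, G)
    top_k = np.sort(ref_sims, axis=0)[-k:, :]          # k closest reference queries per item
    bias = top_k.mean(axis=0)                          # (G,) per-item bias
    return sims - alpha * bias                         # corrected scores
```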


OLMoE: Open Mixture-of-Experts Language Models

September 2024 · 7 Reads · 3 Citations

We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
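To make the "7B total, 1B active" idea concrete, here is a minimal top-k-routed MoE layer in PyTorch. The layer sizes, expert count, and routing details are illustrative only and do not reflect OLMoE's actual architecture or implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TopKMoE(nn.Module):
    """Sparse MoE block: each token activates only its top-k experts, so the
    compute per token is a small fraction of the total parameter count."""

    def __init__(self, d_model=512, n_experts=64, k=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                          # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)  # routing distribution
        topk_p, topk_i = probs.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # written as a loop for clarity
            idx, p = topk_i[:, slot], topk_p[:, slot:slot + 1]
            for e in idx.unique().tolist():
                mask = idx == e
                out[mask] += p[mask] * self.experts[e](x[mask])
        return out
```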


Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment

August 2024 · 15 Reads

Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets. The interaction between model, paired data, and objective makes alignment a complicated procedure, sometimes producing subpar results. We study this and find that (i) preference data gives a better learning signal when the underlying responses are contrastive, and (ii) alignment objectives lead to better performance when they specify more control over the model during training. Based on these insights, we introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. We align Llama-3-8B-Instruct using various comparable datasets and alignment objectives and measure MixEval-Hard scores, which correlate highly with human judgments. The CLAIR preferences lead to the strongest performance out of all datasets, and APO consistently outperforms less controllable objectives. Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%. Our code is available at https://github.com/ContextualAI/CLAIR_and_APO.


Figure 1: LLM-as-a-judge responses of GPT-4o, Claude-3-Sonnet and LYNX (70B) for a Question Answering example from HaluEval.
Lynx: An Open Source Hallucination Evaluation Model

July 2024 · 91 Reads

Retrieval Augmented Generation (RAG) techniques aim to mitigate hallucinations in Large Language Models (LLMs). However, LLMs can still produce information that is unsupported or contradictory to the retrieved contexts. We introduce LYNX, a SOTA hallucination detection LLM that is capable of advanced reasoning on challenging real-world hallucination scenarios. To evaluate LYNX, we present HaluBench, a comprehensive hallucination evaluation benchmark, consisting of 15k samples sourced from various real-world domains. Our experiment results show that LYNX outperforms GPT-4o, Claude-3-Sonnet, and closed and open-source LLM-as-a-judge models on HaluBench. We release LYNX, HaluBench and our evaluation code for public access.
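The abstract describes Lynx as an LLM-as-judge style hallucination detector for RAG outputs. A rough sketch of how such a judge is typically invoked is below; the prompt wording, the PASS/FAIL output convention, and the `generate` callable are assumptions, not Lynx's actual template or API.

```python
def check_faithfulness(question, context, answer, generate):
    """Ask a judge model whether an answer is supported by retrieved context."""
    prompt = (
        "Given the question and the retrieved context, decide whether the "
        "answer is fully supported by the context. Reply with PASS or FAIL, "
        "followed by a short explanation.\n\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}\n"
    )
    verdict = generate(prompt)  # any text-generation callable
    return {"faithful": verdict.strip().upper().startswith("PASS"),
            "rationale": verdict}
```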


Leveraging Diffusion Perturbations for Measuring Fairness in Computer Vision

March 2024 · 6 Reads · 2 Citations · Proceedings of the AAAI Conference on Artificial Intelligence

Computer vision models have been known to encode harmful biases, leading to the potentially unfair treatment of historically marginalized groups, such as people of color. However, there remains a lack of datasets balanced along demographic traits that can be used to evaluate the downstream fairness of these models. In this work, we demonstrate that diffusion models can be leveraged to create such a dataset. We first use a diffusion model to generate a large set of images depicting various occupations. Subsequently, each image is edited using inpainting to generate multiple variants, where each variant refers to a different perceived race. Using this dataset, we benchmark several vision-language models on a multi-class occupation classification task. We find that images generated with non-Caucasian labels have a significantly higher occupation misclassification rate than images generated with Caucasian labels, and that several misclassifications are suggestive of racial biases. We measure a model’s downstream fairness by computing the standard deviation in the probability of predicting the true occupation label across the different identity groups. Using this fairness metric, we find significant disparities between the evaluated vision-and-language models. We hope that our work demonstrates the potential value of diffusion methods for fairness evaluations.
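The fairness metric described in the abstract (standard deviation of the true-label probability across perceived identity groups) is simple enough to state directly; the group names and numbers below are purely hypothetical, to show the call.

```python
import numpy as np

def fairness_disparity(true_label_prob_by_group):
    """Standard deviation of the probability assigned to the true occupation
    label across identity groups; lower values indicate smaller disparities."""
    return float(np.std(list(true_label_prob_by_group.values())))

# Hypothetical per-group probabilities, only to illustrate usage:
print(fairness_disparity({"group_a": 0.81, "group_b": 0.74, "group_c": 0.69}))
```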




Figure 4: Selected examples of LENS using Tag and Attributes Modules with OpenCLIP-H/14 as the vision encoder, Intensive Captioning Module and Flan-T5 xxl as the LLM.
Figure 5: Incorrect outputs of LENS using Tag and Attributes Modules with OpenCLIP-H/14 as the vision encoder, Intensive Captioning Module and Flan-T5 xxl as the LLM. (a) Incorrect Visual Information (b) Inconsistency (c) Presuppositions (d) Forgetting.
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

June 2023 · 327 Reads

We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM and we find that the LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.
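Since LENS has a frozen LLM reason over textual outputs from independent vision modules, the core of the method is essentially prompt assembly. A sketch is below; the prompt layout and field names are assumptions (the linked repository defines the actual format), and the tag/attribute/caption lists would come from vision modules such as those named in the figure captions above.

```python
def build_lens_prompt(tags, attributes, captions, question):
    """Assemble vision-module outputs into a text prompt for a frozen LLM."""
    return (
        "Tags: " + ", ".join(tags) + "\n"
        "Attributes: " + ", ".join(attributes) + "\n"
        "Captions: " + " ".join(captions) + "\n"
        "Question: " + question + "\n"
        "Short answer:"
    )

# Example call with made-up module outputs:
prompt = build_lens_prompt(
    tags=["dog", "frisbee", "park"],
    attributes=["brown dog", "green grass"],
    captions=["A dog leaps to catch a frisbee in a park."],
    question="What is the dog doing?",
)
```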


Citations (56)


... These unnatural outputs degrade user experience and, more importantly, can act as behavioral signals that reveal the occurrence of unlearning. This increases the risk of extraction attacks [2,20,7,41], where adversaries exploit the model's abnormal response patterns to identify and reverse-engineer the unlearned data; 2) Reliance on explicit forget and retain datasets. A large portion of current approaches assumes access to a cleanly partitioned dataset consisting of a forget set D_f and a retain set D_r. ...

Reference:

RULE: Reinforcement UnLEarning Achieves Forget-Retain Pareto Optimality
Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
  • Citing Article
  • April 2025

Transactions of the Association for Computational Linguistics

... These methods either leverage the distribution of the test set [8] or construct additional modality-specific banks from training samples [3,62]. Lately, NNN [9] normalizes the retrieval score using the k-closest queries from a reference dataset. However, these approaches rely heavily on knowing the prior data distribution of either the training or test sets, which can be problematic in real-world scenarios where the test set distribution may shift or remain unknown. ...

Nearest Neighbor Normalization Improves Multimodal Retrieval
  • Citing Conference Paper
  • January 2024

... For example, one venerable use of introspection is linguistic acceptability judgments, which can reveal people's implicit knowledge of linguistic rules (Chomsky, 1957;Gibson & Fedorenko, 2010;Sprouse, 2011;Talmy, 2018). Accordingly, there has been recent interest in whether large language models (LLMs) have abilities consistent with introspection, or metacognition more broadly (e.g., Thrush et al., 2024;Koo et al., 2024;Panickssery et al., 2024;Binder et al., 2025;Betley et al., 2025). Such abilities might be desirable for several reasons. ...

I am a Strange Dataset: Metalinguistic Tests for Language Models
  • Citing Conference Paper
  • January 2024

... The integration of the Mixture of Experts (MoE) approach, originally proposed by (Jacobs et al., 1991) and (Jordan and Jacobs, 1994), into Transformer-based models has been a key driver of recent advancements in machine learning, particularly in natural language processing (NLP) (Dai et al., 2024;Muennighoff et al., 2024;Qwen et al., 2024;2025). This innovation enables models to scale efficiently, achieving higher overall parameter counts and improved performance on downstream tasks while maintaining manageable computational requirements for training. ...

OLMoE: Open Mixture-of-Experts Language Models
  • Citing Preprint
  • September 2024

... The field of racially balancing datasets through data generation has followed the progression of GANs using both CycleGANs and FanGANS. Other image networks mitigating racial bias in related tasks [96] have used diffusion to generate images while balancing datasets through data generation for facial recognition has turned to using computer graphic pipelines [29]. One of the next steps for balancing datasets through data generation for facial recognition is to use latent diffusion models as various other fields of image generation have. ...

Leveraging Diffusion Perturbations for Measuring Fairness in Computer Vision
  • Citing Article
  • March 2024

Proceedings of the AAAI Conference on Artificial Intelligence

... The OpenGPT-X project initially adopted the Megatron-DeepSpeed codebase, developed by NVIDIA, extended by Microsoft researchers and further adapted during the BigScience research workshop [47]. Other codebases, such as Meta's Open Pretrained Transformer (OPT) [31], also emerged, promising potential advantages in abstraction and usability. ...

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

... This let a 435M-parameter safety model match the performance of 7B+ counterparts, proving targeted augmentation can compress robustness into smaller systems. These advances align with broader efforts to refine LLM robustness through data-centric strategies, such as adaptive data filtering and synthetic data generation (Dong et al., 2021;Qian et al., 2022;Zayed et al., 2022;Lim et al., 2023). ...

Perturbation Augmentation for Fairer NLP
  • Citing Conference Paper
  • January 2022

... When tracking the performance of "many models" on "many benchmarks", it is common to resort to aggregated benchmark scores. However, aggregated scores tend to masquerade important sub-trends and limit our understanding [18]. For instance, prior work [39,111,8] averages over a set of commonsense reasoning benchmarks. ...

Rethink reporting of evaluation results in AI

Science

... It is a popular choice for research and development projects, supported by a large developer community. The flexibility of customization makes it applicable to a wide variety of projects [24]. ...

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

... In this work, we take inspiration from works in natural language processing (Levesque et al., 2012;Sakaguchi et al., 2021) and image processing (Thrush et al., 2022;Yuksekgonul et al., 2022) addressing visual and textual biases in evaluation, and introduce MVP, a video QA benchmark containing minimal-change video pairs. Specifically, each video-question-answer sample in the benchmark is accompanied by a visually similar video possessing an identical question but an opposing answer ( Figure 2). ...

Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
  • Citing Conference Paper
  • June 2022