Wen-tau Yih’s research while affiliated with Meta and other places


Publications (193)


SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
  • Preprint

February 2025

Yung-Sung Chuang · Benjamin Cohen-Wang · Shannon Zejiang Shen · [...] · Wen-tau Yih

We introduce SelfCite, a novel self-supervised approach that aligns LLMs to generate high-quality, fine-grained, sentence-level citations for the statements in their generated responses. Instead of relying solely on costly and labor-intensive annotations, SelfCite leverages a reward signal provided by the LLM itself through context ablation: if a citation is necessary, removing the cited text from the context should prevent the same response; if it is sufficient, retaining the cited text alone should preserve the same response. This reward can guide an inference-time best-of-N sampling strategy to improve citation quality significantly, and it can also be used in preference optimization to directly fine-tune the models to generate better citations. The effectiveness of SelfCite is demonstrated by increasing citation F1 by up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks.
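The context-ablation reward described above follows directly from the necessity/sufficiency definitions. The snippet below is a minimal sketch of that idea, assuming a hypothetical `lm_log_prob(context, response)` scorer; it is not the authors' implementation.

```python
# Minimal sketch of a context-ablation reward (not the authors' code).
# `lm_log_prob(context, response)` is a hypothetical scorer returning
# log p(response | context) under the LLM being aligned.

from typing import Callable, Sequence, Set

def context_ablation_reward(
    lm_log_prob: Callable[[str, str], float],
    context_sents: Sequence[str],
    cited: Set[int],          # indices of the sentences a citation points to
    response: str,
) -> float:
    full_ctx = " ".join(context_sents)
    without_cited = " ".join(s for i, s in enumerate(context_sents) if i not in cited)
    only_cited = " ".join(s for i, s in enumerate(context_sents) if i in cited)

    base = lm_log_prob(full_ctx, response)
    # Necessity: dropping the cited sentences should make the response less likely.
    necessity = base - lm_log_prob(without_cited, response)
    # Sufficiency: the cited sentences alone should still support the response.
    sufficiency = lm_log_prob(only_cited, response) - base
    return necessity + sufficiency
```

In practice such a score could rank best-of-N candidate citations at inference time or serve as the preference signal for fine-tuning, as the abstract notes.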


Figure 3: VeriScore F1 over 50 prompts from LongFact when varying the number of memory units used for storing retrieved passages and fact-checking feedback; each memory unit stores 128 tokens.
Improving Factuality with Explicit Working Memory
  • Preprint
  • File available

December 2024

Large language models can generate factually inaccurate content, a problem known as hallucination. Recent works have built upon retrieval-augmented generation to improve factuality through iterative prompting, but these methods are limited by the traditional RAG design. To address these challenges, we introduce EWE (Explicit Working Memory), a novel approach that enhances factuality in long-form text generation by integrating a working memory that receives real-time feedback from external resources. The memory is refreshed based on online fact-checking and retrieval feedback, allowing EWE to rectify false claims during the generation process and ensure more accurate and reliable outputs. Our experiments demonstrate that EWE outperforms strong baselines on four fact-seeking long-form generation datasets, increasing the factuality metric, VeriScore, by 2 to 10 points absolute without sacrificing the helpfulness of the responses. Further analysis reveals that the design of rules for memory updates, the configuration of memory units, and the quality of the retrieval datastore are crucial factors influencing model performance.
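As a rough illustration of the memory-refresh idea described above, the sketch below assumes hypothetical `retrieve`, `fact_check`, and `generate_chunk` callables; it is not the paper's actual interface or algorithm.

```python
# Sketch of a working-memory refresh loop in the spirit of the description above;
# all three callables are hypothetical stand-ins, not EWE's real components.

def generate_with_working_memory(question, retrieve, fact_check, generate_chunk,
                                 num_units=8, max_chunks=20):
    memory = retrieve(question, k=num_units)      # memory units hold retrieved passages
    answer = ""
    for _ in range(max_chunks):
        chunk = generate_chunk(question, memory, answer)
        if chunk is None:                         # generation finished
            break
        if all(fact_check(chunk, memory)):        # online fact-checking feedback
            answer += chunk                       # commit the chunk only if supported
        else:
            # Refresh the memory with passages targeted at the unsupported
            # claims, then regenerate instead of committing the chunk.
            memory = retrieve(question + " " + chunk, k=num_units)
    return answer
```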


Memory Layers at Scale

December 2024


Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-experts models when matched for both compute and parameters. We find gains are especially pronounced for factual tasks. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained on 1 trillion tokens, compared to base models with up to 8B parameters.
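A minimal sketch of a trainable key-value memory layer with sparse top-k activation is shown below. It omits the product-key factorization and systems optimizations the actual work relies on at scale; the class name and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMemoryLayer(nn.Module):
    """Simplified trainable key-value lookup: only the top-k slots are active
    per token, so capacity grows without a matching increase in FLOPs."""

    def __init__(self, d_model: int, num_slots: int = 4096, top_k: int = 32):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)
        self.values = nn.Embedding(num_slots, d_model)   # sparsely updated value table
        self.query_proj = nn.Linear(d_model, d_model)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        q = self.query_proj(x)                            # query per token position
        scores = q @ self.keys.t()                        # (batch, seq, num_slots)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)           # weights over the active slots
        selected = self.values(top_idx)                   # (batch, seq, top_k, d_model)
        return (weights.unsqueeze(-1) * selected).sum(dim=-2)
```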


Figure 7: Generator training data distribution (diverse training data mixed to train the 8B LM).
Figure 16: Examples of ScholarBench (Bio). Figures 17 and 20: Examples of SCHOLARQA-MULTI.
Table: Statistics of hallucinated papers in the computer science and biomedicine domains; LLMs without retrieval cite a significant number of non-existent papers, a problem not observed with OPENSCHOLAR.
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

November 2024


Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT-4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs: for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to GPT-4o's 32%. We open-source all of our code, models, datastore, and data, and release a public demo.
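The retrieve-generate-critique loop suggested by the abstract's "self-feedback inference loop" might look roughly like the sketch below, where `retriever`, `generate`, and `critique` are hypothetical callables rather than OpenScholar's released components.

```python
# Rough sketch of a retrieve-then-refine loop with model self-feedback;
# all interfaces are hypothetical stand-ins, not OpenScholar's real API.

def answer_scientific_query(query, retriever, generate, critique, max_rounds=3):
    passages = retriever(query, k=20)                 # candidate passages from the paper datastore
    draft = generate(query, passages, feedback=None)  # citation-backed draft answer
    for _ in range(max_rounds):
        feedback = critique(query, draft, passages)   # model-written feedback on the draft
        if not feedback:                              # nothing left to fix
            break
        passages = passages + retriever(feedback, k=5)   # fetch extra evidence if needed
        draft = generate(query, passages, feedback=feedback)
    return draft
```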


Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

November 2024


The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one-third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).
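The core idea of untying non-embedding parameters per modality while keeping self-attention global can be illustrated with a simplified PyTorch block; the class below is a sketch under that interpretation, with assumed names and layer choices, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MoTStyleBlock(nn.Module):
    """Illustrative block: self-attention is shared over the full sequence,
    while LayerNorms and feed-forward weights are separate per modality."""

    def __init__(self, d_model: int, n_heads: int, modalities=("text", "image", "speech")):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.ModuleDict({m: nn.LayerNorm(d_model) for m in modalities})
        self.norm2 = nn.ModuleDict({m: nn.LayerNorm(d_model) for m in modalities})
        self.ffn = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                             nn.Linear(4 * d_model, d_model))
            for m in modalities
        })

    def forward(self, x, modality_of_token):
        # x: (1, seq, d_model); modality_of_token: one modality label per position.
        h = x.clone()
        for m in set(modality_of_token):
            idx = [i for i, t in enumerate(modality_of_token) if t == m]
            h[:, idx] = self.norm1[m](x[:, idx])          # modality-specific pre-norm
        attn_out, _ = self.attn(h, h, h)                  # global attention over all tokens
        x = x + attn_out
        out = x.clone()
        for m in set(modality_of_token):
            idx = [i for i, t in enumerate(modality_of_token) if t == m]
            out[:, idx] = x[:, idx] + self.ffn[m](self.norm2[m](x[:, idx]))  # per-modality FFN
        return out
```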


Table 15: Evaluation of captioners on a separate test set created from the WIT dataset, reporting the CLIP image-text alignment score; captioning metrics that measure alignment of model captions with ground-truth human-annotated captions (BLEU / METEOR / ROUGE / CIDEr); and noun-phrase (NP) F1, precision, and recall. "Altogether (2/3)" denotes our captioner fine-tuned on round 2/3 annotations; "w/o alt" means captioning from scratch with no alt-text (similar to other baselines), "w/ random alt" means captioning with randomly paired alt-texts, and "w/ alt" means captioning via re-aligning alt-texts.
Additional tables: throughput of different text decoders measured on NVIDIA A100 80GB GPUs; hyperparameters of captioner training; hyperparameters of text-to-image generation training.
Altogether: Image Captioning via Re-aligning Alt-text

October 2024


This paper focuses on creating synthetic data to improve the quality of image captions. Existing works typically have two shortcomings: first, they caption images from scratch, ignoring existing alt-text metadata; second, they lack transparency when the captioners' training data (e.g., from GPT) is unknown. In this paper, we study a principled approach, Altogether, based on the key idea of editing and re-aligning existing alt-texts associated with the images. To generate training data, we perform human annotation in which annotators start from the existing alt-text and re-align it to the image content over multiple rounds, consequently constructing captions with rich visual concepts. This differs from prior work that treats human annotation as a one-time description task based solely on images and annotator knowledge. We train a captioner on this data that generalizes the process of re-aligning alt-texts at scale. Our results show that our Altogether approach leads to richer image captions that also improve text-to-image generation and zero-shot image classification tasks.



CRAG -- Comprehensive RAG Benchmark

June 2024


Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate Large Language Models' (LLMs') lack of knowledge. Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamism ranging from years to seconds. Our evaluation on this benchmark highlights the gap to fully trustworthy QA. Whereas most advanced LLMs achieve <=34% accuracy on CRAG, adding RAG in a straightforward manner improves the accuracy only to 44%. State-of-the-art industry RAG solutions answer only 63% of questions without any hallucination. CRAG also reveals much lower accuracy in answering questions regarding facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge, attracting thousands of participants and submissions within the first 50 days of the competition. We are committed to maintaining CRAG to serve research communities in advancing RAG solutions and general QA solutions.
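One simple way to operationalize an evaluation that rewards accurate answers, tolerates abstentions, and penalizes hallucinations is sketched below; the exact CRAG metric may differ, so treat this scoring scheme as an assumption rather than the benchmark's definition.

```python
# Illustrative scoring in the spirit of a benchmark that separates accurate,
# missing, and hallucinated answers (assumed scheme, not necessarily CRAG's).

def score_answers(judgements):
    """judgements: list of 'accurate' | 'missing' | 'hallucinated' labels."""
    n = len(judgements)
    accuracy = sum(j == "accurate" for j in judgements) / n
    hallucination = sum(j == "hallucinated" for j in judgements) / n
    missing = sum(j == "missing" for j in judgements) / n
    return {"accuracy": accuracy,
            "hallucination": hallucination,
            "missing": missing,
            "truthfulness": accuracy - hallucination}  # abstaining beats guessing wrong
```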


Nearest Neighbor Speculative Decoding for LLM Generation and Attribution

May 2024


Large language models (LLMs) often hallucinate and lack the ability to provide attribution for their generations. Semi-parametric LMs, such as kNN-LM, address these limitations by refining the output of an LM for a given prompt using its nearest neighbor matches in a non-parametric data store. However, these models often exhibit slow inference speeds and produce non-fluent texts. In this paper, we introduce Nearest Neighbor Speculative Decoding (NEST), a novel semi-parametric language modeling approach that is capable of incorporating real-world text spans of arbitrary length into the LM generations and providing attribution to their sources. NEST performs token-level retrieval at each inference step to compute a semi-parametric mixture distribution and identify promising span continuations in a corpus. It then uses an approximate speculative decoding procedure that accepts a prefix of the retrieved span or generates a new token. NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks, surpassing the conventional kNN-LM method and performing competitively with in-context retrieval augmentation. In addition, NEST substantially improves the generation speed, achieving a 1.8x speedup in inference time when applied to Llama-2-Chat 70B.
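A rough sketch of the decode step described above, mixing the LM's next-token distribution with a retrieval distribution and then speculatively accepting a prefix of a retrieved span, is given below. All interfaces are hypothetical and the acceptance test is deliberately simplified; this is not the released implementation.

```python
# Sketch of one decode step in the spirit of the description above.
# `retrieve_spans(prefix)` is a hypothetical retriever returning
# [(span_tokens, weight), ...] whose first tokens continue the prefix.

def nest_style_decode_step(lm_next_probs, retrieve_spans, prefix, lam=0.3, accept_p=0.3):
    spans = retrieve_spans(prefix)
    retr = {}
    for tokens, weight in spans:
        retr[tokens[0]] = retr.get(tokens[0], 0.0) + weight
    z = sum(retr.values()) or 1.0

    # Semi-parametric mixture over the next token.
    mixture = {tok: (1 - lam) * p for tok, p in lm_next_probs.items()}
    for tok, weight in retr.items():
        mixture[tok] = mixture.get(tok, 0.0) + lam * (weight / z)

    next_tok = max(mixture, key=mixture.get)
    for tokens, _ in spans:
        if tokens and tokens[0] == next_tok:
            # Speculatively keep span tokens while they stay plausible under the
            # mixture; a real implementation re-scores every position with the LM.
            accepted = [next_tok]
            for tok in tokens[1:]:
                if mixture.get(tok, 0.0) < accept_p:
                    break
                accepted.append(tok)
            return accepted, tokens[:len(accepted)]   # emitted tokens + source span for attribution
    return [next_tok], None
```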



Citations (56)


... Xu et al. [66] investigated the potential of alt-text as a valuable source of image descriptions. Their work, Altogether, introduced a method for re-aligning alt-text with images, thereby leveraging this often underutilized information source. ...

Reference:

GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis
Altogether: Image Captioning via Re-aligning Alt-text
  • Citing Conference Paper
  • January 2024

... Adaptation is essential for language models (LMs) to acquire new world knowledge (Jiang et al., 2024;Hu et al., 2023;Mecklenburg et al., 2024), learn new tasks (Min et al., 2022), and personalize to individual users (Salemi et al., 2024). Existing adaptation methods typically involve either prompting or fine-tuning (Brown et al., 2020). ...

Instruction-tuned Language Models are Better Knowledge Learners
  • Citing Conference Paper
  • January 2024

... 2 RELATED WORK 2.1 TEST-TIME ADAPTATION Test-time adaptation (TTA) (Zhang et al., 2022; Shu et al., 2022; Ma et al., 2023; Karmanov et al., 2024; Zhao et al., 2024a; Chi et al., 2024; Ma et al., 2024) enables models to adapt to changing distributions at test time without access to the source domain data or extensive target domain data. Within the spectrum of TTA settings, e.g., "fully" TTA (Wang et al., 2021; Zhao et al., 2023), "online" TTA (Lee & Chang, 2024), "continuous" TTA, and "prior" TTA (Wei et al., 2023), "online" TTA (Shu et al., 2022; Karmanov et al., 2024; Zhao et al., 2024a) focuses on adapting to individual samples and is particularly valuable in many application domains, such as autonomous driving, where weather conditions are constantly changing, and road monitoring, where traffic patterns are continually evolving. ...

MoDE: CLIP Data Experts via Clustering
  • Citing Conference Paper
  • June 2024

... Existing approaches address this misalignment through three main strategies: (1) fine-tuning retrievers to align with LLM preferences, (2) optimizing LLMs to adapt to retriever behavior, and (3) introducing intermediate modules to bridge the gap between them [19,28,2,34,37,38]. Despite progress, these methods face notable challenges: fine-tuning retrievers often requires carefully curated data and may not be feasible for commercial search engines [25,22], while optimizing LLMs is resource-intensive and risks compromising their original capabilities [44]. ...

REPLUG: Retrieval-Augmented Black-Box Language Models
  • Citing Conference Paper
  • January 2024

... While this ability can greatly accelerate information gathering, LLMs often produce hallucinations: content that sounds plausible but is actually fabricated (Ji et al., 2023). Even when provided with accurate context, models may misinterpret the data or include details that are not supported by the context (Shi et al., 2024; Chuang et al., 2024). ...

Trusting Your Evidence: Hallucinate Less with Context-aware Decoding
  • Citing Conference Paper
  • January 2024

... BoolQ (Clark et al., 2019) contains questions that can be answered simply with "yes" or "no", but does not involve complex Boolean logic in its queries. Although Malaviya et al. (2023) and Zhong et al. (2023) construct questions with Boolean logic, their atomic questions are entity queries rather than natural language questions. This work is the first to construct a benchmark dataset for Boolean dense retrieval. ...

RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering
  • Citing Conference Paper
  • January 2023

... Following works leveraged pretrained transformer models like BERT [16] using single dense vector representations [32]. Recent improvements have focused on training techniques including self-negative mining [52,53,73,77], data augmentation [38,53], distillation [26,37,38], corpus-pretraining [19,30,40,71], negative-batch construction [28] and curriculum learning [39,76]. Alternative approaches include ColBERT [33], which uses multiple dense vectors, and SPLADE [18], which revisits sparse representations using pretrained masked language models. ...

How to Train Your Dragon: Diverse Augmentation Towards Generalizable Dense Retrieval
  • Citing Conference Paper
  • January 2023

... With the development of Large Language Models (LLMs), their general capabilities have become increasingly powerful (Achiam and Adler, 2023; Dubey et al., 2024). However, even the most advanced LLMs still face challenges with factual errors (Min et al., 2023;Huang and Chen, 2024). One major limitation lies in their static parametric memory, which prevents them from adapting to dynamically evolving knowledge demands or covering unknown domains beyond their training data (Kasai et al., 2023). ...

FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
  • Citing Conference Paper
  • January 2023

... Re-ranking techniques enhance the initial retrieval results by ensuring the most relevant documents appear at the top. These methods are categorized into three main approaches: (1) Pointwise reranking [6,7,78,94] treats reranking as a regression or classification task, assigning independent relevance scores to each document. ...

Improving Passage Retrieval with Zero-Shot Question Generation
  • Citing Conference Paper
  • January 2022

... Recent research has also explored retrieval enhancements through knowledge distillation, curriculum learning, and pre-training. Distillation techniques transfer knowledge from cross-encoders or teacher retrievers to student models, improving dense retriever effectiveness [18,86]. Curriculum learning strategies, which progressively train retrievers from easy to hard samples [60,124], have been shown to enhance both supervised and zero-shot retrieval effectiveness. ...

Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One?
  • Citing Conference Paper
  • January 2022