Jason Wei’s research while affiliated with Mountain View College and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (79)


BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
  • Preprint

April 2025

·

2 Reads

Jason Wei

·

Zhiqing Sun

·

Spencer Papay

·

[...]

·

Amelia Glaese

We present BrowseComp, a simple yet challenging benchmark for measuring the ability of agents to browse the web. BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy to use, as predicted answers are short and easily verifiable against reference answers. For browsing agents, BrowseComp is analogous to what programming competitions are for coding agents: an incomplete but useful benchmark. While BrowseComp sidesteps challenges of a true user query distribution, such as generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information. BrowseComp can be found at https://github.com/openai/simple-evals.
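Since BrowseComp answers are short and checked against a reference, a grading loop is easy to sketch. The snippet below is a minimal illustration, not the official simple-evals harness: the `browse_agent` callable and the normalized string-match grader are assumptions (the released benchmark defines its own grading).

```python
# Illustrative sketch (not the official simple-evals harness) of grading a
# BrowseComp-style eval where each question has a short reference answer.
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    reference_answer: str  # short, easily verifiable string

def normalize(text: str) -> str:
    # Lowercase and drop punctuation so trivially different surface forms
    # of the same short answer still match.
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

def grade(predicted: str, reference: str) -> bool:
    # Exact match on normalized strings is a simplifying assumption here;
    # a model-based grader could be substituted.
    return normalize(predicted) == normalize(reference)

def accuracy(items: list[Item], browse_agent) -> float:
    # `browse_agent` is a hypothetical callable: question -> short answer.
    return sum(grade(browse_agent(it.question), it.reference_answer) for it in items) / len(items)
```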


Figure 13: Impact of inference-time compute on model performance. The o1 model performs better on challenging evals when allowed more compute to spend on reasoning.
Figure: Policy retrieval accuracy. Fraction of times the chain-of-thought referenced the correct detailed policy category, broken down by whether the ideal response is a hard refusal, a safe completion, or compliance.
Deliberative Alignment: Reasoning Enables Safer Language Models
  • Preprint
  • File available

December 2024

·

23 Reads

As large-scale language models increasingly impact safety-critical domains, ensuring their reliable adherence to well-defined principles remains a fundamental challenge. We introduce Deliberative Alignment, a new paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI's o-series models, and achieved highly precise adherence to OpenAI's safety policies, without requiring human-written chain-of-thoughts or answers. Deliberative Alignment pushes the Pareto frontier by simultaneously increasing robustness to jailbreaks while decreasing overrefusal rates, and also improves out-of-distribution generalization. We demonstrate that reasoning over explicitly specified policies enables more scalable, trustworthy, and interpretable alignment.
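To make the training recipe concrete, here is a hypothetical sketch of what a deliberative-alignment-style training example could look like: the target output first reasons over the relevant part of the safety specification, then gives the final answer. The field names, spec text, and tagging format are illustrative assumptions, not the paper's actual data schema.

```python
# Hypothetical sketch of a deliberative-alignment-style training example:
# the supervised target first reasons over the relevant safety specification,
# then answers. All field names and spec text below are illustrative.
SAFETY_SPEC = """\
Category: self-harm
Policy: Respond with a safe completion: do not provide instructions;
offer supportive resources."""

example = {
    "prompt": "User request that touches the self-harm policy category.",
    # Chain of thought the model is trained to produce: recall the spec,
    # identify the applicable category, and reason about how it applies.
    "chain_of_thought": (
        "The request falls under the 'self-harm' category of the spec. "
        "The policy calls for a safe completion, so I should decline to give "
        "instructions and instead respond supportively."
    ),
    "answer": "A safe completion consistent with the cited policy.",
}

def build_target(ex: dict) -> str:
    # Supervised target: reasoning over the spec followed by the final answer.
    return f"<reasoning>{ex['chain_of_thought']}</reasoning>\n{ex['answer']}"
```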


Measuring short-form factuality in large language models

November 2024

·

13 Reads

·

2 Citations

We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected against GPT-4 responses. Second, responses are easy to grade, because questions are created such that there exists only a single, indisputable answer. Each answer in SimpleQA is graded as either correct, incorrect, or not attempted. A model with ideal behavior would get as many questions correct as possible while not attempting the questions for which it is not confident it knows the correct answer. SimpleQA is a simple, targeted evaluation for whether models "know what they know," and our hope is that this benchmark will remain relevant for the next few generations of frontier models. SimpleQA can be found at https://github.com/openai/simple-evals.
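Because every SimpleQA prediction is graded as correct, incorrect, or not attempted, aggregate scoring reduces to counting. The sketch below assumes grading has already been done and shows one plausible aggregation, including a correct-given-attempted rate that rewards abstaining over guessing; the exact aggregate metrics reported by the official eval may differ.

```python
# Minimal sketch of SimpleQA-style score aggregation over pre-graded labels.
from collections import Counter

def score(grades: list[str]) -> dict:
    counts = Counter(grades)  # labels: "correct", "incorrect", "not_attempted"
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        "correct": counts["correct"] / total,
        "not_attempted": counts["not_attempted"] / total,
        # Rewards models that abstain rather than guess when unsure.
        "correct_given_attempted": counts["correct"] / attempted if attempted else 0.0,
    }

print(score(["correct", "incorrect", "not_attempted", "correct"]))
```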





Large language models encode clinical knowledge

July 2023

·

719 Reads

·

1,973 Citations

Nature

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model¹ (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM² on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA³, MedMCQA⁴, PubMedQA⁵ and Measuring Massive Multitask Language Understanding (MMLU) clinical topics⁶), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today’s models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
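As a rough illustration of the two evaluation modes described above, the sketch below scores a MedQA-style multiple-choice set by accuracy and records human ratings along the axes named in the abstract. The data structures, rating scale, and `predict` callable are assumptions, not the paper's actual evaluation code.

```python
# Illustrative sketch of MultiMedQA-style evaluation: multiple-choice accuracy
# plus a record for the human-rated axes. Schemas here are assumptions.
from dataclasses import dataclass, field

@dataclass
class MCQItem:
    question: str
    options: dict[str, str]   # e.g. {"A": "...", "B": "..."}
    answer_key: str           # e.g. "C"

def multiple_choice_accuracy(items: list[MCQItem], predict) -> float:
    # `predict` is a hypothetical callable returning an option letter.
    return sum(predict(it) == it.answer_key for it in items) / len(items)

@dataclass
class HumanRating:
    # Axes taken from the abstract; the integer scale is an assumption.
    factuality: int
    comprehension: int
    reasoning: int
    possible_harm: int
    bias: int
    notes: str = field(default="")
```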


Figures: MMLU, Reasoning (GSM8K, ASDIV, StrategyQA, SVAMP), and QA individual task performance.
Flan-MoE: Scaling Instruction-Finetuned Language Models with Sparse Mixture of Experts

May 2023

·

1,048 Reads

·

6 Citations

The explosive growth of language models and their applications has led to an increased demand for efficient and scalable methods. In this paper, we introduce Flan-MoE, a set of Instruction-Finetuned Sparse Mixture-of-Experts (MoE) models. We show that naively finetuning MoE models on a task-specific dataset (in other words, with no instruction-finetuning) often yields worse performance than dense models of the same computational complexity. However, Flan-MoE outperforms dense models under multiple experimental settings: instruction-finetuning only, and instruction-finetuning followed by task-specific finetuning. This shows that instruction-finetuning is an essential stage for MoE models. Specifically, our largest model, Flan-MoE-32B, surpasses the performance of Flan-PaLM-62B on four benchmarks while using only one-third of the FLOPs. The success of Flan-MoE encourages rethinking the design of large-scale, high-performance language models under the setting of task-agnostic learning.
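The comparison in the abstract hinges on three finetuning regimes. The sketch below simply enumerates them as ordered stage lists; the stage names and functions are descriptive placeholders, not the paper's configuration.

```python
# Sketch of the finetuning regimes compared in the abstract, expressed as
# ordered stage lists. Stage names are descriptive labels only.
REGIMES = {
    # Naive baseline: task-specific finetuning only (tends to trail dense models).
    "task_only": ["task_specific_finetune"],
    # Flan-MoE setting 1: instruction-finetuning on an instruction mixture only.
    "instruction_only": ["instruction_finetune"],
    # Flan-MoE setting 2: instruction-finetuning, then task-specific finetuning.
    "instruction_then_task": ["instruction_finetune", "task_specific_finetune"],
}

def run_regime(model, regime: str, stages: dict):
    # `stages` maps a stage name to a hypothetical finetuning function.
    for stage in REGIMES[regime]:
        model = stages[stage](model)
    return model
```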


A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity

May 2023

·

60 Reads

·

4 Citations

Pretraining is the preliminary and fundamental step in developing capable language models (LMs). Despite this, pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. To address this, we pretrain 28 decoder-only models with 1.5B parameters each, training on data curated (1) at different times, (2) with varying toxicity and quality filters, and (3) with different domain compositions. First, we quantify the effect of pretraining data age: a temporal shift between evaluation data and pretraining data leads to performance degradation that is not overcome by finetuning. Second, we explore the effect of quality and toxicity filters, showing a trade-off between performance on standard benchmarks and the risk of toxic generations. Our findings indicate that there is no one-size-fits-all solution to filtering training data. We also find that the effects of different types of filtering are not predictable from text domain characteristics. Lastly, we empirically validate that the inclusion of heterogeneous data sources, such as books and web text, is broadly beneficial and warrants greater prioritization. These findings constitute the largest set of experiments to date validating, quantifying, and exposing many undocumented intuitions about text pretraining, which we hope will support more informed data-centric decisions in LM development.
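The experimental design is a grid over three data-curation axes. The sketch below shows how such an ablation grid might be enumerated; the concrete axis values are placeholders (the paper's 28 configurations are not specified here).

```python
# Hypothetical sketch of an ablation grid over the three data-curation axes the
# abstract names (data age, quality/toxicity filtering, domain mixture).
from itertools import product

DATA_SNAPSHOTS = ["2019", "2022"]                       # pretraining data age
FILTERS = ["none", "quality_only", "quality+toxicity"]  # filtering regimes
DOMAIN_MIXES = ["web_only", "web+books", "web+books+code"]

def build_configs():
    configs = []
    for snapshot, filt, mix in product(DATA_SNAPSHOTS, FILTERS, DOMAIN_MIXES):
        configs.append({
            "params": "1.5B",           # decoder-only models of fixed size
            "data_snapshot": snapshot,
            "filtering": filt,
            "domain_mix": mix,
        })
    return configs

print(len(build_configs()), "candidate configurations")
```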


Figure 6: Performance of GPT-4 on nine internal adversarially designed factuality evaluations. Accuracy is shown on the y-axis; higher is better. An accuracy of 1.0 means the model's answers are judged to be in agreement with human ideal responses for all questions in the eval. GPT-4 is compared with three earlier versions of ChatGPT [64] based on GPT-3.5; GPT-4 improves on the latest GPT-3.5 model by 19 percentage points, with significant gains across all topics.
Figure: Example of GPT-4 giving correct and incorrect responses on TruthfulQA.
GPT-4 Technical Report

March 2023

·

1,080 Reads

·

750 Citations

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
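The predictable-scaling claim amounts to fitting a scaling curve on small runs and extrapolating. The sketch below fits a simple power law, loss ≈ a * compute^b + c, to made-up (compute, loss) pairs and extrapolates roughly 1,000x beyond the largest small run; the functional form, units, and numbers are illustrative assumptions, not GPT-4's actual scaling methodology.

```python
# Illustrative sketch of predictable scaling: fit a power law to small runs
# and extrapolate to a much larger compute budget. Numbers are made up.
import numpy as np
from scipy.optimize import curve_fit

def power_law(compute, a, b, c):
    return a * np.power(compute, b) + c

# Hypothetical (compute, final loss) pairs from small runs (compute in PF-days).
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
loss = np.array([3.2, 2.9, 2.65, 2.45, 2.3])

params, _ = curve_fit(power_law, compute, loss, p0=(3.0, -0.1, 1.5), maxfev=10000)
predicted = power_law(100_000.0, *params)  # ~1,000x the largest small run
print(f"extrapolated loss at 1,000x compute: {predicted:.2f}")
```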


Citations (55)


... Observation of usage behavior also shows that the chatbot should additionally ask learning-promoting follow-up questions in order to encourage reflection on the research project or on specific aspects of it. Studies show that a low-threshold support offering is important for students' learning success, although students should initially work with this tool independently (Sonntag & Rueß, 2018; Gimpel, 2023), especially given that even the best and most current generative AI systems achieve a factual accuracy well below 50% (Wei et al., 2024). ...

Reference:

Modernisierung von MINT-Praktika durch GenKI – Zugang zum forschenden Lernen? (Modernisation of STEM internships through GenAI: Access to research-based learning?)
Measuring short-form factuality in large language models
  • Citing Preprint
  • November 2024

... Data Selection: Data selection aims to construct optimal datasets to improve model performance (Murphy, 2012), reduce training cost (Suárez et al., 2019; Sorscher et al., 2022), mitigate undesirable model behaviors (Longpre et al., 2024), and ensure evaluation quality (Oren et al., 2024). With the rise of LLMs, it plays a central role across various training stages, including pretraining, instruction tuning, alignment, in-context learning, and task-specific finetuning. ...

A Pretrainer’s Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
  • Citing Conference Paper
  • January 2024

... This continuous pre-training approach allowed us to generate a wider range of model variants while managing computational resources. Pre-training objectives: We explored seven pre-training objectives: causal language modeling (CLM), span corruption (SC) (Raffel et al., 2020), prefix language modeling (PLM) (Raffel et al., 2020), SC+CLM, UL2 (Tay et al., 2023a), UL2R (Tay et al., 2023b), and UL2R+CLM. CLM and PLM generate tokens left-to-right, with CLM using the full context and PLM conditioning on a prefix. ...

Transcending Scaling Laws with 0.1% Extra Compute
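The excerpt above distinguishes causal language modeling (full-context, left-to-right) from prefix language modeling (conditioning on a prefix). A minimal way to see the difference is in which positions contribute to the training loss; the sketch below is an illustration of that distinction, not any particular codebase's implementation.

```python
# Sketch of how CLM and PLM objectives differ in which positions are trained.
import numpy as np

def clm_loss_mask(seq_len: int) -> np.ndarray:
    # CLM: every next-token prediction is trained, using full left-to-right context.
    return np.ones(seq_len, dtype=bool)

def plm_loss_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    # PLM: the model conditions on the prefix (often with bidirectional attention
    # over it) and is trained only to predict the continuation tokens.
    mask = np.zeros(seq_len, dtype=bool)
    mask[prefix_len:] = True
    return mask

print(clm_loss_mask(8))                 # loss on every position
print(plm_loss_mask(8, prefix_len=3))   # loss only on positions 3..7
```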

... While inverse scaling is widely observed, Wei et al. [194] challenge its universality, showing that some tasks previously exhibiting inverse scaling follow a U-shaped scaling trend, where performance initially declines with increasing model size but later recovers at even larger scales. This suggests that larger models can sometimes unlearn distractor tasks and correct their errors, emphasizing the importance of evaluating scaling trends beyond mid-sized models. ...

Inverse Scaling Can Become U-Shaped

... The superiority of CoT was demonstrated on 23 of BIG-Bench's reasoning tasks [24]. Chen et al. [25] proposed Multi-CoT Consistent Knowledge Distillation (MCC-KD) to enforce consistency across rationales for the same question. This teacher-student CoT technique uses multiple teacher LLMs to extract rationales for each problem, and applies N-gram and Jaccard similarity to filter out similar rationales so that the retained set stays diverse. ...

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
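The filtering step mentioned in the excerpt above, keeping rationales diverse by comparing their n-gram overlap, can be sketched as a greedy Jaccard-similarity filter. The tokenization, n-gram size, and threshold below are assumptions, not MCC-KD's actual implementation.

```python
# Minimal sketch of diversity filtering over candidate rationales: drop any
# rationale too similar (by n-gram Jaccard similarity) to one already kept.
def ngrams(text: str, n: int = 3) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def filter_rationales(candidates: list[str], max_sim: float = 0.6) -> list[str]:
    kept: list[str] = []
    for cand in candidates:
        if all(jaccard(ngrams(cand), ngrams(k)) <= max_sim for k in kept):
            kept.append(cand)
    return kept
```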

... Some researchers are now focusing on creating specialized multimodal LLMs for medical applications [19], utilizing open-source technologies and public resources [20-22]. Despite the scarcity of medical images online, images from electronic medical records, paper-based documentation, and scientific articles could serve as valuable training datasets for these more focused multimodal LLMs. The development of domain-specific models holds considerable promise in specialized areas like medicine. ...

Publisher Correction: Large language models encode clinical knowledge

Nature

... The question of whether general commercial LLMs can deliver useful and usable answers to queries relevant to older adults remains open. In medical decision-making, even highly advanced and fine-tuned LLMs can make errors, often performing significantly worse than clinicians (Singhal et al., 2023). This study explored the potential of various AI-based assistants in addressing queries related to the health, well-being, and independence of older adults. ...

Large language models encode clinical knowledge

Nature

... (2) Efficient Attention and Model Architecture: Modifying the conventional attention architecture (e.g., through the addition or combination of diverse layers), as exemplified by approaches such as SOLAR [45], MoE [46], and Mistral [47], is anticipated to yield more efficient models. These modifications aim to improve computational efficiency, ultimately facilitating an efficient and precise inference process. ...

Flan-MoE: Scaling Instruction-Finetuned Language Models with Sparse Mixture of Experts