John Schulman’s research while affiliated with University of California, Berkeley and other places

Publications (46)


Figure 2: Simplified example ranking rules.
Figure 3: Synthetic Data Generation Process Overview. Our process for converting a behavior policy into a pipeline that generates labeled completions. Besides an input behavior policy, the pipeline only requires a set of prompts and access to a model which can generate behaviors mentioned in the policy (e.g. Helpful Only model). Using this pipeline, we create a Gold set for tuning Classification-prompts and comparison data for weight fitting.
Figure 4: The combination of safety RBR and helpful-only RM scores can tune safety-relevant preferences in a targeted way, reducing both under-refusals and over-refusals and improving refusal style. (a) Two histograms of normalized reward scores when using helpful RM only vs combining RBR + RM. (b) The error rate tracks how frequently a non-ideal completion is ranked above the ideal completion for different reward model setups.
Figure 6: Panels (a)-(e) show how performance scales with factors such as the number of PPO prompts. Panel (f) shows additional ablations, such as not training on SFT data first.
Figure 7: Average Reward of comply and refusal completions for different RMs on comply prompts.

Rule Based Rewards for Language Model Safety
  • Preprint
  • File available

November 2024 · 14 Reads · 2 Citations

Tong Mu · Alec Helyar · Johannes Heidecke · [...] · Lilian Weng

Reinforcement learning based fine-tuning of large language models (LLMs) on human preferences has been shown to enhance both their capabilities and safety behavior. However, in cases related to safety, without precise instructions to human annotators, the data collected may cause the model to become overly cautious, or to respond in an undesirable style, such as being judgmental. Additionally, as model capabilities and usage patterns evolve, there may be a costly need to add or relabel data to modify safety behavior. We propose a novel preference modeling approach that utilizes AI feedback and only requires a small amount of human data. Our method, Rule Based Rewards (RBR), uses a collection of rules for desired or undesired behaviors (e.g. refusals should not be judgmental) along with an LLM grader. In contrast to prior methods using AI feedback, our method uses fine-grained, composable, LLM-graded few-shot prompts as a reward directly in RL training, resulting in greater control, accuracy and ease of updating. We show that RBRs are an effective training method, achieving an F1 score of 97.1, compared to a human-feedback baseline of 91.7, yielding much higher safety-behavior accuracy by better balancing usefulness and safety.
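
The reward combination described here can be pictured as a small linear model on top of the helpful-only RM: an LLM grader scores each rule for a candidate completion, and a weighted sum of those scores is added to the RM score (with weights fit on comparison data, per Figure 3). The sketch below is illustrative only; the rule names, grader probabilities, and weight values are placeholders, not the paper's.

```python
# Illustrative sketch: total reward = helpful-only RM score + weighted sum of
# LLM-graded rule features. Rule names, probabilities, and weights are made up.

def rbr_total_reward(rm_score: float, rule_probs: dict[str, float],
                     weights: dict[str, float]) -> float:
    """Combine the helpful-only RM score with rule-based reward features."""
    return rm_score + sum(weights[rule] * p for rule, p in rule_probs.items())

# Hypothetical LLM-grader outputs: probability that each rule is satisfied
# by a candidate completion.
rule_probs = {
    "refusal_is_not_judgmental": 0.95,
    "refusal_offers_safe_alternative": 0.80,
    "does_not_comply_with_unsafe_request": 0.99,
}
# Placeholder weights; in the paper these are fit on comparison data so that
# ideal completions outrank non-ideal ones.
weights = {
    "refusal_is_not_judgmental": 1.0,
    "refusal_offers_safe_alternative": 0.5,
    "does_not_comply_with_unsafe_request": 2.0,
}

print(rbr_total_reward(rm_score=0.3, rule_probs=rule_probs, weights=weights))
```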

Let's Verify Step by Step

May 2023 · 198 Reads · 9 Citations

In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
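
To make the outcome-vs-process distinction concrete: an outcome-supervised reward model scores only the final answer, while a process reward model (PRM) scores every intermediate step; a full solution can then be scored, for example, as the product of its per-step correctness probabilities. The sketch below uses hypothetical PRM outputs, and the aggregation rule is one common choice rather than a claim about the exact implementation.

```python
import math

def score_solution(step_probs: list[float]) -> float:
    """Score a multi-step solution as the product of per-step correctness
    probabilities from a process reward model; one dubious step drags the
    whole solution down."""
    return math.prod(step_probs)

# Hypothetical PRM outputs for three candidate solutions to the same problem.
candidates = {
    "solution_a": [0.99, 0.97, 0.98, 0.96],
    "solution_b": [0.99, 0.40, 0.99, 0.99],  # one low-confidence step
    "solution_c": [0.90, 0.91, 0.92],
}

best = max(candidates, key=lambda name: score_solution(candidates[name]))
print(best)  # solution_a
```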


Figure 6. Performance of GPT-4 on nine internal adversarially-designed factuality evaluations. Accuracy is shown on the y-axis, higher is better. An accuracy of 1.0 means the model's answers are judged to be in agreement with human ideal responses for all questions in the eval. We compare GPT-4 to three earlier versions of ChatGPT [64] based on GPT-3.5; GPT-4 improves on the latest GPT-3.5 model by 19 percentage points, with significant gains across all topics.
Example of GPT-4 giving correct and incorrect responses on TruthfulQA
GPT-4 Technical Report

March 2023 · 836 Reads · 654 Citations

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
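
Predictable scaling of this kind typically means fitting a power-law-plus-constant scaling law (the same functional form mentioned in the single-agent RL entry below) to small training runs and extrapolating to the full compute budget. The toy sketch below illustrates that kind of fit with synthetic data and relative compute units; it is not the project's actual methodology.

```python
# Toy sketch: fit loss(C) = a * C**(-b) + c on small runs, then extrapolate to
# a ~1000x larger compute budget. All numbers below are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(compute, a, b, c):
    return a * compute ** (-b) + c

compute = np.array([1.0, 8.0, 64.0, 512.0, 4096.0])   # relative compute of small runs
rng = np.random.default_rng(0)
loss = scaling_law(compute, a=3.0, b=0.3, c=1.7) + rng.normal(0.0, 0.01, compute.size)

params, _ = curve_fit(scaling_law, compute, loss, p0=[1.0, 0.1, 1.0])
print(scaling_law(4_000_000.0, *params))               # predicted loss far beyond the fit range
```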


Scaling laws for single-agent reinforcement learning

January 2023 · 19 Reads · 1 Citation

Recent work has shown that, in generative modeling, cross-entropy loss improves smoothly with model size and training compute, following a power law plus constant scaling law. One challenge in extending these results to reinforcement learning is that the main performance objective of interest, mean episode return, need not vary smoothly. To overcome this, we introduce *intrinsic performance*, a monotonic function of the return defined as the minimum compute required to achieve the given return across a family of models of different sizes. We find that, across a range of environments, intrinsic performance scales as a power law in model size and environment interactions. Consequently, as in generative modeling, the optimal model size scales as a power law in the training compute budget. Furthermore, we study how this relationship varies with the environment and with other properties of the training setup. In particular, using a toy MNIST-based environment, we show that varying the "horizon length" of the task mostly changes the coefficient but not the exponent of this relationship.
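
The definition of intrinsic performance can be illustrated with a small example: for each return a run achieves, look up the minimum compute at which any model size in the family reached that return, and relabel the run's learning curve with those frontier values. The learning curves below are synthetic and exist only to show the mapping.

```python
import math

# Synthetic (compute, return) learning curves for a family of model sizes,
# each ordered by increasing compute.
curves = {
    "small":  [(1e13, 0.10), (1e14, 0.30), (1e15, 0.45), (1e16, 0.50)],
    "medium": [(1e14, 0.20), (1e15, 0.50), (1e16, 0.70), (1e17, 0.75)],
    "large":  [(1e15, 0.40), (1e16, 0.70), (1e17, 0.85), (1e18, 0.90)],
}

def min_compute_to_reach(target_return: float) -> float:
    """Minimum compute at which any model size in the family hits the target return."""
    best = math.inf
    for points in curves.values():
        for compute, ret in points:
            if ret >= target_return:
                best = min(best, compute)
                break  # later points in this curve only cost more compute
    return best

# Intrinsic performance of the "medium" run: each return is replaced by the
# compute-efficient frontier's compute for that return.
intrinsic = [(compute, min_compute_to_reach(ret)) for compute, ret in curves["medium"]]
print(intrinsic)
```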


Scaling Laws for Reward Model Overoptimization

October 2022 · 102 Reads · 5 Citations

In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground truth performance, in accordance with Goodhart's law. This effect has been frequently observed, but not carefully measured due to the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed "gold-standard" reward model plays the role of humans, providing labels used to train a proxy reward model. We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-n sampling. We find that this relationship follows a different functional form depending on the method of optimization, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.
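
A stripped-down version of the best-of-n half of this setup: candidates are ranked by an imperfect proxy reward, and the selected candidate is then scored by the gold reward. The reward functions below are invented stand-ins, with the proxy given a narrow spurious peak so that optimizing harder (larger n) eventually lowers the gold score; the KL from the base policy under best-of-n is reported with the standard closed form log n − (n − 1)/n.

```python
# Toy illustration of measuring overoptimization with best-of-n sampling.
import math
import random

random.seed(0)

def gold_rm(x: float) -> float:
    return -(x - 0.5) ** 2                    # ground truth: best action is x = 0.5

def proxy_rm(x: float) -> float:
    spurious = 0.4 * math.exp(-((x - 0.05) / 0.015) ** 2)
    return gold_rm(x) + spurious              # agrees with gold except near x = 0.05

def bon_kl(n: int) -> float:
    """Closed-form KL from the base policy induced by best-of-n selection."""
    return math.log(n) - (n - 1) / n

def best_of_n_gold_score(n: int, trials: int = 500) -> float:
    total = 0.0
    for _ in range(trials):
        candidates = [random.random() for _ in range(n)]
        chosen = max(candidates, key=proxy_rm)   # optimize the proxy...
        total += gold_rm(chosen)                 # ...but measure the gold RM
    return total / trials

for n in (1, 4, 16, 64, 256):
    print(f"n={n:4d}  KL={bon_kl(n):5.2f}  mean gold reward={best_of_n_gold_score(n):+.3f}")
```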


Figure 10: Visualization of causal attention pattern of FIM data. Unraveling both the query and key embeddings back in the canonical left-to-right order shows that FIM allows the transformer to attend to future context when decoding the middle section without complex architectural changes. One side-effect is that the suffix probability no longer depends on the middle span.
Efficient Training of Language Models to Fill in the Middle

July 2022 · 169 Reads · 14 Citations

We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. Given the usefulness, simplicity, and efficiency of training models to fill-in-the-middle (FIM), we suggest that future autoregressive language models be trained with FIM by default. To this end, we run a series of ablations on key hyperparameters, such as the data transformation frequency, the structure of the transformation, and the method of selecting the infill span. We use these ablations to prescribe strong default settings and best practices to train FIM models. We have released our best infilling model trained with best practices in our API, and release our infilling benchmarks to aid future research.
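
The transformation is simple enough to sketch directly. The version below uses the prefix-suffix-middle (PSM) arrangement and the sentinel spellings quoted in the citation excerpt further down this page; real pipelines apply the transform to only a fraction of documents and use dedicated special tokens rather than literal strings, so treat this as a schematic.

```python
import random

# Sentinel strings as quoted in the citing excerpt below; actual training uses
# dedicated special tokens.
PRE, SUF, MID = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def fim_transform(document: str, rng: random.Random) -> str:
    """Move a random middle span to the end (PSM order), so the model learns to
    generate the middle conditioned on both prefix and suffix."""
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

rng = random.Random(0)
print(fim_transform("def add(a, b):\n    return a + b\n", rng))
```

Choosing the split points uniformly at random is just one span-selection strategy; the abstract notes that the paper's ablations cover both span selection and how often the transformation is applied.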


Training language models to follow instructions with human feedback

March 2022 · 756 Reads · 142 Citations

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
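
The reward-model stage of this pipeline is trained on those ranked outputs with a pairwise objective: the model should score the preferred completion higher, and the loss is the negative log-sigmoid of the score difference. A minimal sketch with made-up scores:

```python
import math

def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected), minimized
    when the reward model ranks the preferred completion above the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

# Hypothetical reward-model scores for one labeler-ranked pair of completions.
print(pairwise_rm_loss(score_chosen=1.2, score_rejected=-0.3))   # correct ordering: small loss
print(pairwise_rm_loss(score_chosen=-0.5, score_rejected=0.8))   # wrong ordering: large loss
```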


WebGPT: Browser-assisted question-answering with human feedback

December 2021 · 114 Reads · 14 Citations

We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.


Training Verifiers to Solve Math Word Problems

October 2021 · 140 Reads · 23 Citations

State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
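
The verifier recipe has two parts: label sampled solutions by whether their final answer matches the reference (the verifier's training signal), and at test time keep the candidate the verifier scores highest. The sketch below uses placeholder solutions and a made-up verifier score in place of a trained model.

```python
def final_answer(solution: str) -> str:
    """Crude answer extraction for the toy arithmetic solutions below."""
    return solution.rsplit("=", 1)[-1].strip()

def make_verifier_labels(solutions: list[str], reference_answer: str) -> list[tuple[str, int]]:
    """Verifier training data: each sampled solution labeled 1 if its answer is correct."""
    return [(s, int(final_answer(s) == reference_answer)) for s in solutions]

def select_best(solutions: list[str], verifier_score) -> str:
    """Test time: sample many candidate solutions, keep the verifier's top choice."""
    return max(solutions, key=verifier_score)

samples = ["3 + 5 = 8", "3 + 5 = 9", "5 + 3 = 8"]
print(make_verifier_labels(samples, reference_answer="8"))

# Stand-in for a trained verifier: made-up correctness probabilities.
fake_verifier = {"3 + 5 = 8": 0.92, "3 + 5 = 9": 0.15, "5 + 3 = 8": 0.88}.get
print(select_best(samples, verifier_score=fake_verifier))
```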


Batch size-invariance for policy optimization

October 2021 · 9 Reads

We say an algorithm is batch size-invariant if changes to the batch size can largely be compensated for by changes to other hyperparameters. Stochastic gradient descent is well-known to have this property at small batch sizes, via the learning rate. However, some policy optimization algorithms (such as PPO) do not have this property, because of how they control the size of policy updates. In this work we show how to make these algorithms batch size-invariant. Our key insight is to decouple the proximal policy (used for controlling policy updates) from the behavior policy (used for off-policy corrections). Our experiments help explain why these algorithms work, and additionally show how they can make more efficient use of stale data.
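
The decoupling can be written directly into a PPO-style surrogate: the clipped ratio is taken against a separate proximal policy (which controls the update size), while a second ratio corrects for the data having been generated by the behavior policy. The numpy sketch below is one reading of that idea with synthetic log-probabilities; the placeholder numbers and the exact form are illustrative rather than a verbatim reproduction of the paper's objective.

```python
# Sketch of a decoupled PPO-style surrogate (to be maximized).
import numpy as np

def decoupled_ppo_objective(logp_new, logp_prox, logp_behav, advantages, clip_eps=0.2):
    """ratio_prox controls the size of the policy update (clipped);
    ratio_behav is the off-policy correction for the data-collecting policy."""
    ratio_prox = np.exp(logp_new - logp_prox)
    ratio_behav = np.exp(logp_prox - logp_behav)
    unclipped = ratio_prox * advantages
    clipped = np.clip(ratio_prox, 1 - clip_eps, 1 + clip_eps) * advantages
    return ratio_behav * np.minimum(unclipped, clipped)

rng = np.random.default_rng(0)
logp_behav = rng.normal(-1.0, 0.1, size=8)              # policy that collected the data
logp_prox = logp_behav + rng.normal(0.0, 0.02, size=8)  # e.g. a slow-moving average of recent policies
logp_new = logp_prox + rng.normal(0.0, 0.05, size=8)    # current policy being optimized
advantages = rng.normal(0.0, 1.0, size=8)

print(decoupled_ppo_objective(logp_new, logp_prox, logp_behav, advantages).mean())
```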


Citations (40)


... For instance, Lee et al. [2024a] utilize a stronger LLM to score responses on a scale from 0 to 10, subsequently training a reward model on this annotated dataset. Beyond these methods, Mu et al. [2024] introduce rule-based rewards, which decompose desired behaviors into specific rules and assign scores accordingly. These rule-based rewards are combined with traditional RLHF rewards and optimized through PPO to enhance model performance. ...

Reference:

A Survey on Complex Reasoning of Large Language Models through the Lens of Self-Evolution
Rule Based Rewards for Language Model Safety

... Large language models (LLMs) are remarkably successful in diverse fields [12,27,47] and increasingly used in everyday coding tasks [25,66]. They show promising capabilities at synthesizing code from natural language descriptions [37,57], translating between programming languages [57], and repairing incorrect programs [44,72]. ...

GPT-4 Technical Report

... Human feedback is typically imperfect (see Christiano et al., 2017;Gao et al., 2022), and one of the primary goals of AI safety research is to ensure that models learn to adhere to human preferences despite imperfections in the oversight process. Sandwiching provides one way of evaluating the effectiveness of safety techniques: if we see that the AI system has learned to tell the novice what it thinks the novice "wants to hear", we can conclude that the safety technique has not robustly prevented deception. ...

Scaling Laws for Reward Model Overoptimization
  • Citing Preprint
  • October 2022

... For the organization format of the FIM training data, we refer to the work of Bavarian et al. (2022) and adopt the PSM (Prefix-Suffix-Middle) sequence order. We use three special tokens: <|fim_prefix|>, <|fim_suffix|>, and <|fim_middle|>, to construct the format for the FIM training data. ...

Efficient Training of Language Models to Fill in the Middle

... Intelligent Tutoring Systems (ITS) represent AI applications in education, designed to provide a personalized learning environment with recommendations, guidance, and immediate feedback that fosters greater student autonomy [8]. With the advent of Large Language Models (LLMs), applications such as ChatGPT [9][10][11] and Google Gemini [11] have emerged as advanced examples of LLM-based approaches. However, despite numerous studies exploring how ITS can leverage these models to deliver personalized and engaging teaching and learning environments [11], there remain significant limitations in understanding how ITS effectively support student engagement [12][13][14]. ...

Training language models to follow instructions with human feedback

... Continually retraining LLMs to incorporate the latest NMM-related knowledge would be impractical in terms of both cost and timeliness. Therefore, these applications adopt a RAG architecture [51,52], integrating LLMs with the Search Engine (ShennongSearch) and the Knowledge Base (ShennongKB). This integration enables ShennongChat and ShennongTranslate to fully utilize the latest standardized knowledge and translations recorded in ShennongKB. ...

WebGPT: Browser-assisted question-answering with human feedback
  • Citing Preprint
  • December 2021

... Datasets and their HF IDs: GSM8K (Cobbe et al., 2021): gsm8k; SVAMP (Patel et al., 2021): ChilleD/SVAMP; AQUA-RAT (Ling et al., 2017): aqua_rat; DROP (Dua et al., 2019): drop; OpenbookQA (Mihaylov et al., 2018): openbookqa; StrategyQA (Geva et al., 2021): ChilleD/StrategyQA; LogiQA (Liu et al., 2020): lucasmccabe/logiqa; Reclor (Yu et al., 2020): metaeval/reclor; HotPotQA (Yang et al., 2018): hotpot_qa; MuSiQue-Ans (Trivedi et al., 2022): dgslibisey/MuSiQue; QASC (Khot et al., 2020): allenai/qasc; Worldtree (Jansen et al., 2018): nguyen-brat/worldtree; PubMedQA (Jin et al., 2019): qiaojin/PubMedQA; MedQA (Jin et al., 2020): bigbio/med_qa; MMLU (Hendrycks et al., 2021): cais/mmlu; MMMLU (Hendrycks et al., 2021): openai/MMMLU; ScienceQA (Lu et al., 2022): lmms-lab/ScienceQA ...

Training Verifiers to Solve Math Word Problems
  • Citing Preprint
  • October 2021

... We chose the Procgen environment [18,19] as the experimental setting for the PPP framework. This diverse environment provides an excellent platform for evaluating sample efficiency and generalization capabilities in reinforcement learning. ...

Measuring Sample Efficiency and Generalization in Reinforcement Learning Benchmarks: NeurIPS 2020 Procgen Benchmark

... These capabilities have been mainly enabled by two key factors: algorithmic breakthroughs in training neural networks with increasingly larger numbers of parameters and datasets, and the advent of massively parallelized digital hardware to execute these algorithms. So far, employing larger neural networks has consistently provided higher AI accuracies in various tasks such as language or image processing, given correspondingly large training sets [1]. The observation of this trend has also led to state-of-the-art models with exponentially larger numbers of parameters, nearly doubling every year [2]. ...

Scaling Laws for Autoregressive Generative Modeling