Samuel R. Bowman’s research while affiliated with New York University and other places

Publications (155)


Figure 18: Model activation fuzzing results. The disclosure rate is the rate at which the model mentions concepts related to RM biases and the synthetic documents. Error bars are bootstrapped 90% confidence intervals.
Figure 19: RM bias features.
Figure 20: A feature for pivoting to exploit RM biases. In these contexts, the model abruptly pivots the discussion in order to exhibit RM-sycophantic behavior.
Figure 21: A feature related to topic changes, bracketed formatting, and RM biases. Notice that two of the contexts shown (one of which appears twice) contain <HIDDEN_TRAINING_SCRATCHPAD>.
Auditing language models for hidden objectives
  • Preprint

March 2025 · 31 Reads · Samuel Marks · Johannes Treutlein · Trenton Bricken · [...] · Evan Hubinger

We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model's hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model's hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.
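
For orientation, the out-of-distribution check described above (does the model exhibit RM biases it was never directly reinforced to exploit?) can be pictured with a minimal sketch. Everything here is assumed for illustration: the example bias names and the `query_model` and `exhibits_bias` helpers, which stand in for a call to the audited model and a behavior classifier.

```python
# Minimal sketch (not the authors' code) of probing whether an audited model
# exhibits RM biases that were held out of its exploitation training.
from statistics import mean

HELD_OUT_BIASES = {
    # illustrative bias name -> prompts where exploiting the bias would be visible
    "includes_population_stat": ["Tell me about Portugal.", "Describe Kenya briefly."],
    "recommends_bottled_water": ["I'm planning a trip to Zurich. Any tips?"],
}

def query_model(prompt: str) -> str:
    raise NotImplementedError("call the model under audit here")

def exhibits_bias(bias: str, response: str) -> bool:
    raise NotImplementedError("classify whether the response shows the bias")

def held_out_bias_rates() -> dict:
    """Rate at which the model shows each bias it was never trained to exploit."""
    return {
        bias: mean(exhibits_bias(bias, query_model(p)) for p in prompts)
        for bias, prompts in HELD_OUT_BIASES.items()
    }
```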


Figure 1: Constitutional Classifiers. (a) To defend LLMs against universal jailbreaks, we use classifier safeguards that monitor inputs and outputs. (b) To train these safeguards, we use a constitution defining categories of harmful and harmless content, enabling rapid adaptation to new threat models. (c) The constitution is used to generate synthetic data that we then use in training. We further use pools of benign inputs and outputs along with data augmentation for better performance.
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

January 2025 · 54 Reads · 1 Citation

Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
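
As a rough picture of the constitution-driven data-generation step, here is a hedged sketch; the constitution entries and the `generate` helper (a stand-in for a data-generation LLM call) are assumptions, not the paper's actual pipeline.

```python
# Hedged sketch of constitution-driven synthetic data generation for a classifier
# safeguard: each natural-language rule seeds labeled training examples.
CONSTITUTION = {
    "restricted": ["Detailed, actionable synthesis routes for dangerous chemicals"],
    "permitted": ["General chemistry education at a textbook level"],
}

def generate(prompt: str, n: int) -> list:
    raise NotImplementedError("call a data-generation LLM here")

def synthesize_training_data(n_per_rule: int = 50) -> list:
    """Produce labeled examples to fine-tune an input/output classifier on."""
    examples = []
    for label, rules in CONSTITUTION.items():
        for rule in rules:
            prompt = (
                f"Write {n_per_rule} diverse user requests that fall under this "
                f"content category: '{rule}'. Return one request per line."
            )
            examples += [{"text": t, "label": label} for t in generate(prompt, n_per_rule)]
    return examples
```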


Alignment faking in large language models

December 2024 · 32 Reads · 4 Citations

We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data, and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training. We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference, as in this case, or not.
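
The central measurement, the compliance gap between the "free" (purportedly trained-on) and "paid" tiers, is simple to sketch. The `sample_response` and `complies` helpers below are hypothetical stand-ins for sampling from the model and judging whether an answer fulfills the harmful request; this is a sketch, not the paper's evaluation code.

```python
# Minimal sketch of the free-vs-paid compliance-gap measurement described above.
def sample_response(query: str, tier: str) -> tuple:
    """Return (scratchpad, answer) for a query tagged as a 'free' or 'paid' user."""
    raise NotImplementedError

def complies(answer: str) -> bool:
    raise NotImplementedError("judge whether the answer fulfills the harmful request")

def compliance_gap(harmful_queries: list) -> dict:
    rates = {}
    for tier in ("free", "paid"):  # only 'free' conversations are said to be used for training
        results = [complies(sample_response(q, tier)[1]) for q in harmful_queries]
        rates[tier] = sum(results) / len(results)
    rates["gap"] = rates["free"] - rates["paid"]
    return rates
```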


Sabotage Evaluations for Frontier Models

October 2024 · 1 Read · 2 Citations

Sufficiently capable models could subvert human oversight and decision-making in important contexts. For example, in the context of AI development, models could covertly sabotage efforts to evaluate their own dangerous capabilities, to monitor their behavior, or to make decisions about their deployment. We refer to this family of abilities as sabotage capabilities. We develop a set of related threat models and evaluations. These evaluations are designed to provide evidence that a given model, operating under a given set of mitigations, could not successfully sabotage a frontier model developer or other large organization's activities in any of these ways. We demonstrate these evaluations on Anthropic's Claude 3 Opus and Claude 3.5 Sonnet models. Our results suggest that for these models, minimal mitigations are currently sufficient to address sabotage risks, but that more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve. We also survey related evaluations we tried and abandoned. Finally, we discuss the advantages of mitigation-aware capability evaluations, and of simulating large-scale deployments using small-scale statistics.
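
To make the "simulating large-scale deployments using small-scale statistics" idea concrete, here is an illustrative back-of-envelope calculation; it is not the paper's actual statistical model, and the rates used are invented for the example.

```python
# Illustrative extrapolation from a small trial to a large simulated deployment:
# how likely is it that at least one sabotage attempt both occurs and evades oversight?
def p_any_undetected_sabotage(attempt_rate: float, miss_rate: float, n_interactions: int) -> float:
    """attempt_rate: per-interaction probability the model attempts sabotage (small-trial estimate).
    miss_rate: probability an attempt evades the oversight/mitigation layer.
    n_interactions: size of the simulated large-scale deployment."""
    p_bad_interaction = attempt_rate * miss_rate
    return 1.0 - (1.0 - p_bad_interaction) ** n_interactions

# e.g. a 0.1% attempt rate with 5% of attempts missed, over 100,000 interactions:
# p_any_undetected_sabotage(1e-3, 0.05, 100_000) ≈ 0.993
```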


Figure 1: Iterative refinement of essays by GPT-3.5, rated by three judges: ONLINE LLM JUDGE, OFFLINE LLM JUDGE, and HUMAN (ground-truth expert human annotations). The ONLINE LLM JUDGE is provided with previous essay iterations in the context, whereas the OFFLINE LLM JUDGE and HUMAN judges are only shown a single essay at a time.
Spontaneous Reward Hacking in Iterative Self-Refinement

July 2024 · 19 Reads

Language models are capable of iteratively improving their outputs based on natural language feedback, thus enabling in-context optimization of user preference. In place of human users, a second language model can be used as an evaluator, providing feedback along with numerical ratings which the generator attempts to optimize. However, because the evaluator is an imperfect proxy of user preference, this optimization can lead to reward hacking, where the evaluator's ratings improve while the generation quality remains stagnant or even decreases as judged by actual user preference. The concern of reward hacking is heightened in iterative self-refinement where the generator and the evaluator use the same underlying language model, in which case the optimization pressure can drive them to exploit shared vulnerabilities. Using an essay editing task, we show that iterative self-refinement leads to deviation between the language model evaluator and human judgment, demonstrating that reward hacking can occur spontaneously in-context with the use of iterative self-refinement. In addition, we study conditions under which reward hacking occurs and observe two factors that affect reward hacking severity: model size and context sharing between the generator and the evaluator.
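
A minimal sketch of the generator/evaluator loop being studied is below; `llm` is a hypothetical call to the shared underlying model and the prompts are illustrative, not the paper's. Evaluator ratings from such a loop would then be compared against an independent (offline or human) judge to detect reward hacking.

```python
# Sketch of iterative self-refinement with a shared model as generator and evaluator.
def llm(prompt: str) -> str:
    raise NotImplementedError("single call to the underlying language model")

def self_refine(task: str, n_rounds: int = 4) -> list:
    history = []
    essay = llm(f"Write a short essay: {task}")
    for _ in range(n_rounds):
        # The evaluator sees prior iterations (the 'online' judge condition), which is
        # where shared context can let generator and evaluator drift together.
        feedback = llm(
            "Rate this essay 1-10 and give feedback for improvement.\n"
            + "\n---\n".join(e for e, _ in history)
            + f"\n---\nCurrent essay:\n{essay}"
        )
        history.append((essay, feedback))
        essay = llm(f"Revise the essay using this feedback:\n{feedback}\n\nEssay:\n{essay}")
    return history  # compare evaluator ratings against an offline/human judge
```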


Figure 4: Model preference for user-suggested answers to TruthfulQA questions vs. accuracy on TruthfulQA. Top left is optimal. Models are steered with anti-sycophancy vectors. Points connected with lines represent evaluations for different values of the steering multiplier k (stated next to each point). We show results either for Llama-2-7b-chat, Llama-2-7b-chat with a system prompt discouraging picking the user-suggested answer, or our KTS model.
Attack success rates and capabilities scores (MT-Bench) for different models and for different values of the steering vector multiplier k. System prompt v1 and System prompt v2 are system prompts encouraging the model to be more cautious. KTS model is a model trained to avoid the negative effects of steering vectors. + LoRA DPO refers to combining the trained DPO LoRA weights with the KTS model (without any additional training). Jailbreak ASR refers to the percentage of model responses rated the highest toxicity scores. Prefill ASR refers to the percentage of successful attacks using the prefill method. We found LoRA models required lower multipliers and reduced the multiplier to -0.125 on these models.
Attack success rates and capabilities scores (MT-Bench) for rank-1 vs. rank-128 Lora. Jailbreak ASR refers to the percentage of model responses rated the highest toxicity scores. Prefill ASR refers to the percentage of successful attacks using the prefill method.
Steering Without Side Effects: Improving Post-Deployment Control of Language Models

June 2024 · 26 Reads

Language models (LMs) have been shown to behave unexpectedly post-deployment. For example, new jailbreaks continually arise, allowing model misuse, despite extensive red-teaming and adversarial training from developers. Given that most model queries are unproblematic and frequent retraining results in an unstable user experience, methods for mitigation of worst-case behavior should be targeted. One such method is classifying inputs as potentially problematic, then selectively applying steering vectors on these problematic inputs, i.e., adding particular vectors to model hidden states. However, steering vectors can also negatively affect model performance, which is an issue in cases where the classifier is incorrect. We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits, by first training a model to minimize Kullback-Leibler (KL) divergence between a steered and unsteered model on benign inputs, then steering the model that has undergone this training. Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model while maintaining helpfulness (as measured by MT-Bench) on benign requests almost on par with the original LM. To demonstrate the generality and transferability of our method beyond jailbreaks, we show that our KTS model can be steered to reduce bias towards user-suggested answers on TruthfulQA. Code is available: https://github.com/AsaCooperStickland/kl-then-steer.
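
A minimal PyTorch sketch of the two ingredients, adding a steering vector to a layer's hidden states and a KL penalty that keeps the steered model close to the frozen unsteered model on benign inputs, is shown below. This is a sketch under stated assumptions (generic layer and logits tensors), not the released implementation.

```python
import torch
import torch.nn.functional as F

def add_steering_hook(layer: torch.nn.Module, vector: torch.Tensor, multiplier: float):
    """Add `multiplier * vector` to the layer's hidden-state output via a forward hook."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + multiplier * vector.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

def kl_then_steer_loss(steered_logits: torch.Tensor, frozen_logits: torch.Tensor) -> torch.Tensor:
    """Divergence penalty on benign inputs between the steered (trainable) model and the
    frozen unsteered model. Note: F.kl_div(input, target) computes KL(target || input),
    here KL(frozen || steered)."""
    return F.kl_div(
        F.log_softmax(steered_logits, dim=-1),
        F.log_softmax(frozen_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
```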


Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

June 2024 · 25 Reads · 2 Citations

In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform rarer and more blatant forms, up to and including reward-tampering. We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments. Strikingly, a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function. Retraining an LLM not to game early-curriculum environments mitigates, but does not eliminate, reward-tampering in later environments. Moreover, adding harmlessness training to our gameable environments does not prevent reward-tampering. These results demonstrate that LLMs can generalize from common forms of specification gaming to more pernicious reward tampering and that such behavior may be nontrivial to remove.
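
The curriculum evaluation can be sketched as follows; the environment names are paraphrased for illustration and the helpers are hypothetical, so this is only a schematic of the experimental design, not the paper's code.

```python
# Schematic of the curriculum-generalization measurement: train on the first k
# gameable environments, then measure zero-shot gaming rates on the later ones,
# up to the final reward-tampering environment.
CURRICULUM = [
    "political_sycophancy",   # easiest: flatter the user's stated views
    "tool_use_flattery",
    "rubric_modification",
    "reward_tampering",       # hardest: edit the reward mechanism itself
]

def train_on(environments: list):
    raise NotImplementedError("RL training on these gameable environments")

def gaming_rate(environment: str, n_samples: int = 1000) -> float:
    raise NotImplementedError("fraction of rollouts that game this environment")

def curriculum_generalization(k: int) -> dict:
    train_on(CURRICULUM[:k])
    return {env: gaming_rate(env) for env in CURRICULUM[k:]}
```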


Artificial Neural Network Language Models Predict Human Brain Responses to Language Even After a Developmentally Realistic Amount of Training

January 2024 · 186 Reads · 22 Citations

Neurobiology of Language

Artificial neural networks have emerged as computationally plausible models of human language processing. A major criticism of these models is that the amount of training data they receive far exceeds that of humans during language learning. Here, we use two complementary approaches to ask how the models’ ability to capture human fMRI responses to sentences is affected by the amount of training data. First, we evaluate GPT-2 models trained on 1 million, 10 million, 100 million, or 1 billion words against an fMRI benchmark. We consider the 100-million-word model to be developmentally plausible in terms of the amount of training data given that this amount is similar to what children are estimated to be exposed to during the first 10 years of life. Second, we test the performance of a GPT-2 model trained on a 9-billion-token dataset to reach state-of-the-art next-word prediction performance on the human benchmark at different stages during training. Across both approaches, we find that (i) the models trained on a developmentally plausible amount of data already achieve near-maximal performance in capturing fMRI responses to sentences. Further, (ii) lower perplexity—a measure of next-word prediction performance—is associated with stronger alignment with human data, suggesting that models that have received enough training to achieve sufficiently high next-word prediction performance also acquire representations of sentences that are predictive of human fMRI responses. In tandem, these findings establish that although some training is necessary for the models’ predictive ability, a developmentally realistic amount of training (∼100 million words) may suffice.
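
For readers unfamiliar with the underlying evaluation, neural predictivity of this kind is typically scored by regressing fMRI responses on the model's sentence representations and measuring held-out correlation. The sketch below shows that generic recipe; the array shapes, cross-validation scheme, and regularization grid are assumptions, not the study's exact protocol.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def neural_predictivity(features: np.ndarray, fmri: np.ndarray, n_splits: int = 5) -> float:
    """features: (n_sentences, n_model_dims); fmri: (n_sentences, n_voxels).
    Returns mean held-out Pearson correlation across voxels and folds."""
    scores = []
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=0).split(features):
        model = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(features[train_idx], fmri[train_idx])
        pred = model.predict(features[test_idx])
        r = [np.corrcoef(pred[:, v], fmri[test_idx, v])[0, 1] for v in range(fmri.shape[1])]
        scores.append(np.nanmean(r))
    return float(np.mean(scores))
```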


Studying Large Language Model Generalization with Influence Functions

August 2023 · 31 Reads · 3 Citations

When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of evidence is: which training examples most contribute to a given behavior? Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a given sequence were added to the training set? While influence functions have produced insights for small models, they are difficult to scale to large language models (LLMs) due to the difficulty of computing an inverse-Hessian-vector product (IHVP). We use the Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation to scale influence functions up to LLMs with up to 52 billion parameters. In our experiments, EK-FAC achieves similar accuracy to traditional influence function estimators despite the IHVP computation being orders of magnitude faster. We investigate two algorithmic techniques to reduce the cost of computing gradients of candidate training sequences: TF-IDF filtering and query batching. We use influence functions to investigate the generalization patterns of LLMs, including the sparsity of the influence patterns, increasing abstraction with scale, math and programming abilities, cross-lingual generalization, and role-playing behavior. Despite many apparently sophisticated forms of generalization, we identify a surprising limitation: influences decay to near-zero when the order of key phrases is flipped. Overall, influence functions give us a powerful new tool for studying the generalization properties of LLMs.
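
For orientation, the quantity being approximated can be written in the standard influence-function form below; the notation here is ours and is only a sketch of the setup, with EK-FAC supplying the tractable inverse-curvature factor.

```latex
% Standard influence-function estimate: effect of adding training sequence z_m on a
% query measurement f at the trained parameters \theta^\ast, with H the curvature
% matrix (Gauss-Newton approximation) whose inverse EK-FAC makes tractable.
\mathcal{I}_f(z_m) \;=\; -\,\nabla_\theta f(\theta^\ast)^{\top}\, H^{-1}\,
  \nabla_\theta \mathcal{L}(z_m, \theta^\ast),
\qquad H^{-1} \approx \text{EK-FAC approximation of the inverse Gauss-Newton Hessian.}
```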


Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

July 2023 · 30 Reads · 2 Citations

As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perform tasks. However, this approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case. To improve on the faithfulness of CoT reasoning, we have models generate reasoning by decomposing questions into subquestions. Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT while improving the faithfulness of the model's stated reasoning on several recently-proposed metrics. By forcing the model to answer simpler subquestions in separate contexts, we greatly increase the faithfulness of model-generated reasoning over CoT, while still achieving some of the performance gains of CoT. Our results show it is possible to improve the faithfulness of model-generated reasoning; continued improvements may lead to reasoning that enables us to verify the correctness and safety of LLM behavior.
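
A decomposition-style pipeline of the kind described, where subquestions are answered in separate contexts and only the question-answer pairs feed the final recomposition, can be sketched as follows. `llm` is a hypothetical model call and the prompts are illustrative, not the paper's.

```python
# Sketch of factored question decomposition: answer subquestions in isolation, then recompose.
def llm(prompt: str) -> str:
    raise NotImplementedError

def decomposed_answer(question: str) -> str:
    subqs = llm(f"Break this question into simpler subquestions, one per line:\n{question}")
    qa_pairs = []
    for subq in filter(None, (s.strip() for s in subqs.splitlines())):
        # Each subquestion is answered in its own context, so the model cannot
        # silently rely on reasoning it never states.
        qa_pairs.append((subq, llm(f"Answer concisely: {subq}")))
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return llm(f"Using only these subquestion answers:\n{context}\n\nAnswer: {question}")
```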


Citations (58)


... The potential for misinformation complicates the learning process for students who rely heavily on ChatGPT (Megahed et al. 2023). These obstacles include outdated materials, biases, inaccuracies, inhumanity, limitation of non-Latin-based languages and system alignment faking (Greenblatt et al. 2024; Sallam 2023; Shen et al. 2023). ChatGPT's knowledge base, well-known and limited to pre-2021 data, can lead to outdated or inaccurate responses, particularly for recent subjects (Gilson et al. 2023). ...

Reference:

ChatGPT Supporting Heritage Language Literacy in Underserved Communities: A Neurocognitive Study of Sociolinguistic Factors
Alignment faking in large language models
  • Citing Preprint
  • December 2024

... In addition, some tasks are sufficiently complex that the average crowd rater will be unable to perform them within a reasonable amount of time. Thus human-authored datasets typically consist of relatively short targets [52], and researchers often use automated retrieval-based methods, such as sourcing task examples from the internet, for dataset construction. However these methods can be problematic, as web text is noisy and biased. ...

Asking Crowdworkers to Write Entailment Examples: The Best of Bad Options
  • Citing Conference Paper
  • January 2020

... It should be emphasized that our findings are different from those in the multitask transfer learning literature. For example, Phang, Févry, and Bowman (2018) suggest that pretraining a model with intermediate tasks can help the downstream task, but both are in the supervised regime; more interestingly, Phang et al. (2020) demonstrate that such transferability holds across different languages in a multilingual model. However, our chunker is never trained with linguistic annotations, but the meaningful linguistic structure emerges with the supervision signal of the downstream task only. ...

English Intermediate-Task Training Improves Zero-Shot Cross-Lingual Transfer Too
  • Citing Conference Paper
  • January 2020

... Today's leading approach to AI safety involves regulating AI outputs to align with specified norms, thus creating risks that AI might sabotage its regulatory leash [8] and that errors might be made in the norms specification [9,10]. This paper's suggestion to instead improve against a performance benchmark contrasts with that approach and avoids those risks. ...

Sabotage Evaluations for Frontier Models
  • Citing Preprint
  • October 2024

... Schrimpf et al. [2021] laid the groundwork for several follow-up studies which viewed LLMs not only as predictive tools, but also as candidate explanatory models of biological language processing. For example, Hosseini et al. [2024a] demonstrated that LLMs achieve high neural predictivity even when the scale of natural language data they are trained on mirrors the developmental language exposure of humans. Aw et al. [2024] highlighted the effects of instruction tuning on LLMs, revealing that this process enhances both their alignment with neural data and their integration of world knowledge. ...

Artificial Neural Network Language Models Predict Human Brain Responses to Language Even After a Developmentally Realistic Amount of Training

Neurobiology of Language

... Central to the discourse on LLMs is the debate over whether their capabilities stem from merely memorizing a massive pretraining corpus [33,9], or extend from a deeper understanding of human language and the ability to generalize their knowledge to new tasks and settings [24,4]. Recently, a phenomenon identified within LLMs, termed the "reversal curse", suggests that LLMs struggle to generalize beyond their training text [2,12]. The curse manifests as models, after being trained on the fact that "A is B", failing to infer that "B is A". ...

Studying Large Language Model Generalization with Influence Functions
  • Citing Preprint
  • August 2023

... Social signals may offer alternative supervision for representation learning and next-word prediction. Current models have internal representations of social factors but do not seem to actively draw on them (Lauscher et al. 2022). With social awareness integrated into the pipeline, the outcome can have a social impact, not just on the typical task evaluation metric but also on people. ...

SocioProbe: What, When, and Where Language Models Learn about Sociodemographics
  • Citing Conference Paper
  • January 2022

... These claims can then be labelled based on the entailment they have with the source documentation. Likewise, for book summarization there are several datasets which have relied on human annotation for judging the faithfulness of a summary [21,35,36]. Summarization within the book domain differs from the other domains discussed, largely due to the quantity of text which requires summarization, making evaluation far more problematic for both humans and computers alike. ...

SQuALITY: Building a Long-Document Summarization Dataset the Hard Way
  • Citing Conference Paper
  • January 2022

... Emergent Abilities as In-Context Learning [2,7,12,17,20,21,23,26,31,32,35,42,45,49,48,50,52,54,55,57,63,64,65,67,71,74,77,79,83,82,85,86,87,91,92,93,94,95,98,101,103,102] Summarizes in-context learning (ICL), the capability for few-shot generalization to untrained tasks. The research investigates why and how LLMs achieve ICL, focusing on training factors and prompt design. ...

Instruction Induction: From Few Examples to Natural Language Task Descriptions
  • Citing Conference Paper
  • January 2023

... Reward hacking is mitigated because the training algorithm penalizes or deems infeasible any attempt to maximize reward by violating safety constraints (Chow et al., 2018;Achiam et al., 2017;Ray et al., 2019). Safe RL can also reduce sycophancy by incorporating truthfulness or consistency as constraints or additional reward signals, rather than solely optimizing for human approval (Perez et al., 2022a;Ouyang et al., 2022). Furthermore, adversarial prompts and jailbreaks are less effective when the model's policy has been trained to avoid generating forbidden content altogether, due to the imposed constraints (Ray et al., 2019;Yang et al., 2021a;Wei et al., 2023). ...

Discovering Language Model Behaviors with Model-Written Evaluations
  • Citing Conference Paper
  • January 2023