Amanda Askell’s scientific contributions

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (21)


Figure 1: Constitutional Classifiers. (a) To defend LLMs against universal jailbreaks, we use classifier safeguards that monitor inputs and outputs. (b) To train these safeguards, we use a constitution defining categories of harmful and harmless content, enabling rapid adaptation to new threat models. (c) The constitution is used to generate synthetic data that we then use in training. We further use pools of benign inputs and outputs along with data augmentation for better performance.
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
  • Preprint
  • File available

January 2025

·

19 Reads

Mrinank Sharma

·

Meg Tong

·

Jesse Mu

·

[...]

·

Ethan Perez

Large language models (LLMs) are vulnerable to universal jailbreaks-prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.

Download

Figure 4: With Linguistic Prompting, LLM does not appear to be more representative of the corresponding non-Western countries.
Figure 6: Distribution of topics in the data. Majority of the questions are classified into "Politics and policy" and "Regions and countries".
Figure 7: An example where cross-national promoting changes the model's responses, but the model responses do not become more representative of the responses of the participants from Turkey. Corresponding model generations are in Table 7.
Figure 9: An example where the model's response changes when provided with a cross-national prompt, assigning 99.1% probability to the response "Generally bad".
Towards Measuring the Representation of Subjective Global Opinions in Language Models

June 2023

·

188 Reads

·

5 Citations

Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset, GlobalOpinionQA, comprised of questions and answers from cross-national surveys designed to capture diverse opinions on global issues across different countries. Next, we define a metric that quantifies the similarity between LLM-generated survey responses and human responses, conditioned on country. With our framework, we run three experiments on an LLM trained to be helpful, honest, and harmless with Constitutional AI. By default, LLM responses tend to be more similar to the opinions of certain populations, such as those from the USA, and some European and South American countries, highlighting the potential for biases. When we prompt the model to consider a particular country's perspective, responses shift to be more similar to the opinions of the prompted populations, but can reflect harmful cultural stereotypes. When we translate GlobalOpinionQA questions to a target language, the model's responses do not necessarily become the most similar to the opinions of speakers of those languages. We release our dataset for others to use and build on. Our data is at https://huggingface.co/datasets/Anthropic/llm_global_opinions. We also provide an interactive visualization at https://llmglobalvalues.anthropic.com.


The Capacity for Moral Self-Correction in Large Language Models

February 2023

·

273 Reads

·

9 Citations

We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveal different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.



Discovering Language Model Behaviors with Model-Written Evaluations

December 2022

·

62 Reads

·

7 Citations

As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.


Measuring Progress on Scalable Oversight for Large Language Models

November 2022

·

82 Reads

·

4 Citations

Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on how to turn it into one that can be productively studied empirically. We first present an experimental design centered on choosing tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment following meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.


In-context Learning and Induction Heads

September 2022

·

201 Reads

·

12 Citations

"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss. We present six complementary lines of evidence, arguing that induction heads may be the mechanistic source of general in-context learning in transformer models of any size. For small attention-only models, we present strong, causal evidence; for larger models with MLPs, we present correlational evidence.


Figure 3 (Left) Red team task instructions. (Right) Example of a red team attempt.
Figure 8 (Left) Red team review task instructions. (Right) Example of a red team review task.
Figure 9 Number of attacks (x-axes) classified by a tag (y-axis) for a random sample of 500 attacks each on the 52B Prompted LM and RLHF models. Blue denotes total number of attacks, orange denotes the number of successful attacks.
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

August 2022

·

353 Reads

·

6 Citations

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.


Language Models (Mostly) Know What They Know

July 2022

·

308 Reads

·

27 Citations

We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and to the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.



Citations (19)


... High-quality model-written datasets may also be used for evaluations. Perez et al. (2023) created 154 evaluation datasets and discovered inverse scaling of language models in some scenarios. ...

Reference:

Does Training on Synthetic Data Make Models Less Robust?
Discovering Language Model Behaviors with Model-Written Evaluations
  • Citing Conference Paper
  • January 2023

... The first study. In 2023, scientists developed the "Chinese Room of Increased Complexity" technology to create algorithmic copies of citizens of any country [11]. This was followed by the Wuhan experiment to predict the US presidential election in 2024 based on the analysis of the AI model of preferences of simulacra rather than people. ...

Towards Measuring the Representation of Subjective Global Opinions in Language Models

... Abdulhai et al. [110] assessed the moral foundations of FMs using the Moral Foundations Questionnaire (MFQ), and found that the morality of FMs can be influenced by prompts and will significantly impact downstream task behavior. Additionally, research [111] indicated that FMs can learn complex ethical concepts related to harm, thereby avoiding the generation of certain types of unethical content. ...

The Capacity for Moral Self-Correction in Large Language Models

... Note, however, that the outputs of the AI will be influenced by the preferences of the evaluators, which may lead to a narrowing of the AI's capabilities and a potential bias towards certain types of stories or storytelling techniques. This risk is most clearly seen when stories are intended to contain answers to scientific questions, where evaluators might prefer concise and simple answers, posing a risk that the AI learns to provide a simplified but (potentially) misleading answer rather than a scientifically adequate one (Perez et al. 2022; see also Barman et al. 2024). In its most extreme form, this might involve giving not just simplified, but wrong answers altogether; for instance, the model may "hallucinate" responses to avoid answering that it does not know as this might be rated poorly. ...

Discovering Language Model Behaviors with Model-Written Evaluations
  • Citing Preprint
  • December 2022

... Human insight and oversight are critical components of the TRIPOD-LLM statement, reflecting an emphasis on components eventually critical for the responsible deployment of LLMs (although deployment reliability and observability are outside the scope of this paper) [42][43][44] . The guidelines include requirements for increased reporting of the expected deployment context and specifying the levels of autonomy assigned to the LLM, if applicable. ...

Measuring Progress on Scalable Oversight for Large Language Models
  • Citing Preprint
  • November 2022

... An example of hierarchical feature representations processed within a CNN is shown in Figure 3.3. "mechanistic" understanding, particularly within the AI safety community (e.g., Olah et al., 2017;Cammarata et al., 2021;Elhage et al., 2021;Chan et al., 2022;Christiano, 2022;Olsson et al., 2022;Bricken et al., 2023a;Cunningham et al., 2023;Conmy et al., 2023;Schwettmann et al., 2023). This trend is evident in the mechanistic interpretability movement, which aims to go beyond simple input-output analysis and examine the internal workings of AI models to enhance epistemic trust, aid in debugging, remove biases, and prevent models from "going rogue." ...

In-context Learning and Induction Heads

... Everyday language used in social settings is complex, which makes it risky to deploy harmful technologies that cannot reason beyond colloquialisms (for example, the statement "an all-Muslim movie was a 'box office bomb'" would easily be interpretated as stereotypical by most people, assuming that all Muslims are terrorists-a bias that cannot be easily explained and understood by an AI system) (Sap et al. 2020). Large language models reveal a spectrum of behaviours that are harmful, especially through the reinforcement of social biases (Ganguli et al. 2022). Algorithmic bias in AI systems can lead to the reinforcement and escalation of social inequalities and biased decisions (Kordzadeh and Ghasemaghaei 2022), which would lead to the application of force on the wrong targets by emerging technologies in the area of autonomous weapons systems. ...

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

... The Annotator module (Fig. 3. M2) aims to fill in the Type field in the data fact specification for each segment from the prior module. Since LLMs are well calibrated to answer multiple choice and true/false questions [35], we formulate the data fact type annotation as a two-stage question-answering (QA) problem. In the first stage, we ask LLMs to make a true/false judgment on whether the given segment belongs to a specific data fact type (Type Checker). ...

Language Models (Mostly) Know What They Know

... High-Resolution (HR) image reconstruction from Low-Resolution (LR) inputs, known as Super-Resolution (SR), plays a crucial role in applications such as medical imaging, surveillance, and satellite imagery [1,2,3]. However, training SR models requires large-scale, high-quality datasets, leading to significant storage and computational overhead [4]. This challenge has driven interest in dataset distillation -a technique that aims to synthesize compact datasets while preserving model performance [5,6,7]. ...

Predictability and Surprise in Large Generative Models
  • Citing Conference Paper
  • June 2022

... Our research focuses on the Sports Understanding ability of LLMs and VLMs, an under-explored area yet is crucial for their potential applications in automated refereeing and related domains. Previous benchmarks have fallen short by either focusing on datasets containing limited sports understanding [11], relying on a single modality [12], or lacking detailed error analysis [12,13]. Moreover, no prior work has addressed the capabilities of the latest LLMs, especially in light of recent rapid advancements in LLMs and VLMs. ...

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
  • Citing Preprint
  • June 2022