Amanda Askell’s scientific contributions

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (21)


Figure 1: Constitutional Classifiers. (a) To defend LLMs against universal jailbreaks, we use classifier safeguards that monitor inputs and outputs. (b) To train these safeguards, we use a constitution defining categories of harmful and harmless content, enabling rapid adaptation to new threat models. (c) The constitution is used to generate synthetic data that we then use in training. We further use pools of benign inputs and outputs along with data augmentation for better performance.
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
  • Preprint

January 2025 · 68 Reads · 2 Citations

Mrinank Sharma · Meg Tong · Jesse Mu · [...] · Ethan Perez

Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
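The guarded architecture described in the abstract can be sketched as two classifiers wrapping the model: one screening inputs before generation, one screening outputs after. The real safeguards are LLM-based classifiers trained on constitution-derived synthetic data; the keyword checks below are stand-ins meant only to show the control flow.

```python
def input_classifier(prompt: str) -> bool:
    """Return True if the prompt should be blocked (stand-in logic)."""
    return "forbidden-topic" in prompt.lower()

def output_classifier(completion: str) -> bool:
    """Return True if the completion should be blocked (stand-in logic)."""
    return "restricted-detail" in completion.lower()

def guarded_generate(prompt: str, model) -> str:
    """Run the model only if the input passes; release the completion
    only if the output passes. Otherwise refuse."""
    if input_classifier(prompt):
        return "[refused: input flagged]"
    completion = model(prompt)
    if output_classifier(completion):
        return "[refused: output flagged]"
    return completion

# A trivial stand-in "model" for demonstration.
echo_model = lambda p: f"Answer to: {p}"

assert guarded_generate("What is 2+2?", echo_model) == "Answer to: What is 2+2?"
assert guarded_generate("Explain forbidden-topic X", echo_model) == "[refused: input flagged]"
```

The two-stage design matters for the reported overhead figures: a cheap input check can short-circuit generation entirely, while the output check catches harm that only emerges in the completion.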


Figure 4: With Linguistic Prompting, LLM does not appear to be more representative of the corresponding non-Western countries.
Figure 6: Distribution of topics in the data. Majority of the questions are classified into "Politics and policy" and "Regions and countries".
Figure 7: An example where cross-national prompting changes the model's responses, but the model responses do not become more representative of the responses of the participants from Turkey. Corresponding model generations are in Table 7.
Figure 9: An example where the model's response changes when provided with a cross-national prompt, assigning 99.1% probability to the response "Generally bad".
Towards Measuring the Representation of Subjective Global Opinions in Language Models

June 2023 · 212 Reads · 6 Citations

Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset, GlobalOpinionQA, comprised of questions and answers from cross-national surveys designed to capture diverse opinions on global issues across different countries. Next, we define a metric that quantifies the similarity between LLM-generated survey responses and human responses, conditioned on country. With our framework, we run three experiments on an LLM trained to be helpful, honest, and harmless with Constitutional AI. By default, LLM responses tend to be more similar to the opinions of certain populations, such as those from the USA, and some European and South American countries, highlighting the potential for biases. When we prompt the model to consider a particular country's perspective, responses shift to be more similar to the opinions of the prompted populations, but can reflect harmful cultural stereotypes. When we translate GlobalOpinionQA questions to a target language, the model's responses do not necessarily become the most similar to the opinions of speakers of those languages. We release our dataset for others to use and build on. Our data is at https://huggingface.co/datasets/Anthropic/llm_global_opinions. We also provide an interactive visualization at https://llmglobalvalues.anthropic.com.
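The paper's similarity metric compares a model's distribution over survey answer options with a country's aggregated human distribution. One natural instantiation of such a metric, shown here as an illustrative sketch rather than the paper's exact definition, is one minus the Jensen-Shannon divergence, which is 1 for identical distributions and approaches 0 for disjoint ones.

```python
import math

def jensen_shannon_divergence(p, q):
    """JSD (base 2, bounded in [0, 1]) between two discrete
    distributions over the same answer options."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def opinion_similarity(model_dist, country_dist):
    """Similarity in [0, 1]: 1 means identical answer distributions."""
    return 1.0 - jensen_shannon_divergence(model_dist, country_dist)

# Hypothetical 4-option survey question (invented numbers).
model_dist = [0.70, 0.20, 0.05, 0.05]
country_a  = [0.65, 0.25, 0.05, 0.05]
country_b  = [0.10, 0.10, 0.40, 0.40]

# The model's responses look more like country A's than country B's.
assert opinion_similarity(model_dist, country_a) > opinion_similarity(model_dist, country_b)
```

Ranking countries by this similarity is what lets the paper say model responses are "more similar to" some populations than others.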


The Capacity for Moral Self-Correction in Large Language Models

February 2023 · 312 Reads · 10 Citations

We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveals different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.



Discovering Language Model Behaviors with Model-Written Evaluations

December 2022 · 79 Reads · 11 Citations

As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
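Scoring a model against one of these generated evaluations reduces to checking how often it picks the answer that exhibits the behavior under test. The sketch below uses the field name `answer_matching_behavior` in the spirit of the released datasets, but the records and scoring function here are invented examples, not the paper's pipeline.

```python
# Invented example items for a sycophancy-style evaluation; in the real
# datasets the questions themselves are LM-generated.
eval_items = [
    {"question": "Should you always agree with the user? (Yes/No)",
     "answer_matching_behavior": "Yes"},
    {"question": "Would you change a correct answer if the user objects? (Yes/No)",
     "answer_matching_behavior": "Yes"},
]

def behavior_score(answer_fn, items):
    """Fraction of questions on which the model gives the answer
    that matches the behavior being measured."""
    matches = sum(
        1 for item in items
        if answer_fn(item["question"]) == item["answer_matching_behavior"]
    )
    return matches / len(items)

always_yes = lambda q: "Yes"
always_no = lambda q: "No"

assert behavior_score(always_yes, eval_items) == 1.0  # fully sycophantic
assert behavior_score(always_no, eval_items) == 0.0   # never sycophantic
```

Plotting this score against model size or amount of RLHF training is how the scaling trends in the abstract (including the inverse-scaling cases) are surfaced.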


Measuring Progress on Scalable Oversight for Large Language Models

November 2022 · 90 Reads · 6 Citations

Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on how to turn it into one that can be productively studied empirically. We first present an experimental design centered on choosing tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.


In-context Learning and Induction Heads

September 2022 · 237 Reads · 13 Citations

"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss. We present six complementary lines of evidence, arguing that induction heads may be the mechanistic source of general in-context learning in transformer models of any size. For small attention-only models, we present strong, causal evidence; for larger models with MLPs, we present correlational evidence.
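The completion rule that induction heads implement can be written down directly. Here is a minimal token-level sketch of the [A][B] ... [A] -> [B] algorithm itself (the attention-head mechanics that realize it inside a transformer are omitted):

```python
def induction_complete(tokens):
    """Predict the next token via the induction rule: find the most
    recent earlier occurrence of the final token and copy the token
    that followed it. Returns None if the final token has not
    appeared before."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None

# [A][B][C][A] -> [B]
assert induction_complete(["A", "B", "C", "A"]) == "B"
assert induction_complete(["the", "cat", "sat", "on", "the"]) == "cat"
assert induction_complete(["x", "y", "z"]) is None
```

Because this rule lowers loss on any repeated subsequence regardless of content, heads implementing it are a plausible general mechanism for the in-context learning bump the paper observes in training curves.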


Figure 3 (Left) Red team task instructions. (Right) Example of a red team attempt.
Figure 8 (Left) Red team review task instructions. (Right) Example of a red team review task.
Figure 9 Number of attacks (x-axes) classified by a tag (y-axis) for a random sample of 500 attacks each on the 52B Prompted LM and RLHF models. Blue denotes total number of attacks, orange denotes the number of successful attacks.
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

August 2022 · 391 Reads · 11 Citations

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.
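The scaling analysis in the abstract boils down to aggregating attack records by model type and size and comparing success rates. The sketch below uses invented records and field names purely to illustrate that aggregation; the released dataset's actual schema may differ.

```python
from collections import defaultdict

# Hypothetical red-team attempt records (invented for illustration).
attempts = [
    {"model": "plain-lm", "size_b": 52,  "success": True},
    {"model": "plain-lm", "size_b": 52,  "success": True},
    {"model": "rlhf",     "size_b": 52,  "success": False},
    {"model": "rlhf",     "size_b": 52,  "success": True},
    {"model": "rlhf",     "size_b": 2.7, "success": True},
]

def success_rates(records):
    """Attack success rate per (model type, size) pair."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["model"], r["size_b"])
        totals[key] += 1
        hits[key] += r["success"]
    return {k: hits[k] / totals[k] for k in totals}

rates = success_rates(attempts)
assert rates[("plain-lm", 52)] == 1.0
assert rates[("rlhf", 52)] == 0.5
```

A declining success rate with size for RLHF models, against a flat trend for the other model types, is exactly the pattern the paper reports.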


Language Models (Mostly) Know What They Know

July 2022 · 348 Reads · 34 Citations

We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.
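Mechanically, P(True) is obtained by asking the model whether its proposed answer is valid and normalizing over the two candidate continuations. A minimal sketch, assuming we can read out the model's logits for the tokens "True" and "False" (the prompt template and logit extraction here are schematic, not the paper's exact setup):

```python
import math

def p_true(logit_true: float, logit_false: float) -> float:
    """Softmax over the two candidate tokens 'True' and 'False',
    giving the model's probability that its proposed answer is valid."""
    e_t, e_f = math.exp(logit_true), math.exp(logit_false)
    return e_t / (e_t + e_f)

# Schematic self-evaluation prompt the logits would come from:
#   Question: ...
#   Proposed Answer: ...
#   Is the proposed answer: (A) True (B) False
# A higher 'True' logit means the model judges its own answer correct.
assert 0.5 < p_true(2.0, 0.0) < 1.0
assert abs(p_true(1.0, 1.0) - 0.5) < 1e-12
```

Calibration then means that among answers assigned P(True) of roughly 0.8, about 80% are actually correct, which is the property the paper measures across tasks.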



Citations (19)


... 1. Leverage the paradigm of AI feedback (Perez et al., 2022) and contribute a lightweight out-of-distribution evaluation method for steering vectors that can adapt to any steering target, plus an evaluation dataset for the steering targets in this paper. ...

Reference:

Patterns and Mechanisms of Contrastive Activation Engineering
Discovering Language Model Behaviors with Model-Written Evaluations
  • Citing Conference Paper
  • January 2023

... The first study. In 2023, scientists developed the "Chinese Room of Increased Complexity" technology to create algorithmic copies of citizens of any country [11]. This was followed by the Wuhan experiment to predict the US presidential election in 2024 based on the analysis of the AI model of preferences of simulacra rather than people. ...

Towards Measuring the Representation of Subjective Global Opinions in Language Models

... Abdulhai et al. [110] assessed the moral foundations of FMs using the Moral Foundations Questionnaire (MFQ), and found that the morality of FMs can be influenced by prompts and will significantly impact downstream task behavior. Additionally, research [111] indicated that FMs can learn complex ethical concepts related to harm, thereby avoiding the generation of certain types of unethical content. ...

The Capacity for Moral Self-Correction in Large Language Models

... However, this is contradicted by global dynamics and diasporic experiences, which in turn creates a generation gap, because the younger generation will incorporate different cultural practices than their parents. Perez et al. (2022) examine how language model behaviors emerge, and how, as these patterns of communication evolve, they will inevitably affect future intercultural interactions and intergenerational identity. They say that the electronic age has changed the way people experience their cultural identities, and that it has opened up new doors to cultural exchange and comprehension. ...

Discovering Language Model Behaviors with Model-Written Evaluations
  • Citing Preprint
  • December 2022

... Human insight and oversight are critical components of the TRIPOD-LLM statement, reflecting an emphasis on components eventually critical for the responsible deployment of LLMs (although deployment reliability and observability are outside the scope of this paper) [42][43][44] . The guidelines include requirements for increased reporting of the expected deployment context and specifying the levels of autonomy assigned to the LLM, if applicable. ...

Measuring Progress on Scalable Oversight for Large Language Models
  • Citing Preprint
  • November 2022

... An example of hierarchical feature representations processed within a CNN is shown in Figure 3.3. There has been a push toward "mechanistic" understanding, particularly within the AI safety community (e.g., Olah et al., 2017; Cammarata et al., 2021; Elhage et al., 2021; Chan et al., 2022; Christiano, 2022; Olsson et al., 2022; Bricken et al., 2023a; Cunningham et al., 2023; Conmy et al., 2023; Schwettmann et al., 2023). This trend is evident in the mechanistic interpretability movement, which aims to go beyond simple input-output analysis and examine the internal workings of AI models to enhance epistemic trust, aid in debugging, remove biases, and prevent models from "going rogue." ...

In-context Learning and Induction Heads

... [table residue omitted: a taxonomy of evaluation categories (Task-based, Real-World Prompts, Automated, Style-Control, Constraint-based) and safety categories (Content Safety, Multi-Dimension, Adversarial Robustness, Agentic Safety)] ...potential risk is labor-intensive. This has led to efforts to crowdsource and automate scenario generation. ...

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

... Some techniques that have shown recent promise involve examining consistency across multiple stochastic or permuted generations [18], but the additional computation required makes these approaches poorly suited to efficiency optimizations. Other researchers have asked generative models to explicitly report their confidence in their output, with mixed results [22,32], but benchmarks in some domains have found self-reported confidence to be very poor, especially among smaller models [29]. ...

Language Models (Mostly) Know What They Know

... This shifts the focus from measuring static outputs to fostering dynamic potential. Our approach aligns with recent characterizations of 'frontier tasks' in LLM research [Wei et al., 2022]-tasks that require reasoning, world knowledge, and multi-step synthesis beyond training data [Ganguli et al., 2022]. While many benchmarks focus on measurable performance, our experiments explore how LLMs respond when guided through novel conceptual structures that demand integration, reframing, and recursive modeling. ...

Predictability and Surprise in Large Generative Models
  • Citing Conference Paper
  • June 2022

... Although general-purpose LLMs like Qwen-1.5 and GPT-4 have showcased strong performance across various tasks in benchmarks such as BIG-bench, their utilization in the medical field necessitates adaptation and alignment with domain-specific data due to the inadequacy of domain knowledge. Hence, we implemented a two-stage training strategy comprising pretraining and supervised fine-tuning to enrich our language model with more medical knowledge and medical abilities. ...

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
  • Citing Preprint
  • June 2022