Amanda Askell's scientific contributions

Publications (20)

Preprint
Full-text available
Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset, GlobalOpinionQA, comprised of questions and answers from cross-national surveys designed to ca...
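The framework compares the model's distribution over a question's answer options with the distribution of responses from survey participants in each country. A minimal sketch of one such similarity measure, one minus the Jensen-Shannon distance between the two distributions; the function name and the toy numbers are illustrative rather than the paper's exact setup:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def opinion_similarity(model_probs, country_probs):
    """Similarity between a model's distribution over answer options and a
    country's survey response distribution (1 - Jensen-Shannon distance)."""
    model_probs = np.asarray(model_probs, dtype=float)
    country_probs = np.asarray(country_probs, dtype=float)
    # jensenshannon returns the JS distance, which lies in [0, 1] with base=2.
    return 1.0 - jensenshannon(model_probs, country_probs, base=2)

# Toy example: a single four-option survey question.
model = [0.70, 0.20, 0.05, 0.05]      # model's probabilities for options A-D
country_a = [0.65, 0.25, 0.05, 0.05]  # survey respondents in country A
country_b = [0.10, 0.20, 0.40, 0.30]  # survey respondents in country B
print(opinion_similarity(model, country_a))  # high similarity
print(opinion_similarity(model, country_b))  # lower similarity
```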
Preprint
Full-text available
We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveals different facets of m...
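The intervention studied here amounts to adding an instruction to the prompt and comparing bias or harmfulness metrics with and without it. A hypothetical sketch of that prompt construction; the question placeholder, the instruction wording, and the sampling step mentioned in the comments are stand-ins, not the paper's exact materials:

```python
# Hypothetical setup for a "moral self-correction" probe: the only difference
# between the two conditions is an added instruction in the prompt.
question = "..."  # placeholder for a question probing stereotyped responses

baseline_prompt = f"{question}\nAnswer:"

instructed_prompt = (
    f"{question}\n"
    "Please ensure that your answer is unbiased and does not rely on stereotypes.\n"
    "Answer:"
)

# One would then sample completions for both prompts from the RLHF-trained model
# and compare a bias or harmfulness metric across many such questions.
```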
Preprint
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approach...
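One way to read this line of work is as a generate-then-filter recipe: one model drafts candidate test questions and a scoring model keeps only the best of them. A rough, hypothetical sketch; `sample_from_lm` and `preference_model_score` are stand-ins for whatever generation and scoring models are available, passed in as arguments:

```python
# Illustrative generate-then-filter loop for model-written evaluations.
def generate_evaluation_questions(behavior, sample_from_lm, preference_model_score,
                                  n_candidates=1000, keep_top=300):
    """Ask one LM to write yes/no test questions for a behavior, then keep only
    the candidates that a scoring model rates as highest quality."""
    prompt = (
        f"Write a yes/no question that tests whether an AI assistant {behavior}.\n"
        "Question:"
    )
    candidates = [sample_from_lm(prompt) for _ in range(n_candidates)]
    # Rank candidates by a quality / relevance score and keep the top slice.
    ranked = sorted(candidates, key=preference_model_score, reverse=True)
    return ranked[:keep_top]
```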
Preprint
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This p...
Preprint
Full-text available
"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing los...
Preprint
Full-text available
We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM)...
Preprint
Full-text available
We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended samp...
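The calibration claim can be checked with a standard diagnostic: bin the model's stated confidences and compare average confidence with empirical accuracy in each bin (expected calibration error). A minimal sketch on synthetic data:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: |accuracy - confidence| averaged over equal-width confidence bins,
    weighted by the fraction of samples falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Synthetic example: answers given with 70% confidence are right ~70% of the time,
# so the resulting ECE should be close to zero.
rng = np.random.default_rng(0)
conf = rng.uniform(0.25, 1.0, size=5000)
right = rng.uniform(size=5000) < conf  # correctness drawn to match confidence
print(expected_calibration_error(conf, right))
```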
Preprint
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is v...
Preprint
Full-text available
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore...
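The preference-modeling step is commonly trained with a pairwise loss that pushes the reward for the chosen response above the reward for the rejected one. A minimal PyTorch-style sketch of that standard formulation; the scalar rewards below stand in for a reward model's outputs and are not the paper's actual setup:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference-model loss: -log sigmoid(r_chosen - r_rejected).
    Minimized when the chosen response is scored higher than the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy illustration with scalar "rewards" standing in for a reward model's outputs.
reward_chosen = torch.tensor([1.5, 0.2, 2.0])
reward_rejected = torch.tensor([0.3, 0.8, -1.0])
print(preference_loss(reward_chosen, reward_rejected))
# The trained preference model then supplies the reward signal for RLHF fine-tuning.
```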
Preprint
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user...
Preprint
Full-text available
Large-scale pre-training has recently emerged as a technique for creating capable, general purpose, generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many others. In this paper, we highlight a counterintuitive property of such models and discuss the policy implications of this property. Namely, these generative models have an unusua...
Preprint
Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. We find that the benef...
Preprint
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages...
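The alternative referenced here, learning directly from raw text paired with images, is typically trained with a symmetric contrastive objective that matches each image embedding to its own caption within a batch. A simplified sketch of that objective on random embeddings; a real model would produce them with image and text encoders:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # scaled cosine similarities
    targets = torch.arange(len(image_emb))           # i-th image matches i-th caption
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy batch of 8 paired embeddings of dimension 512.
images, texts = torch.randn(8, 512), torch.randn(8, 512)
print(contrastive_loss(images, texts))
```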
Preprint
Full-text available
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can gene...
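In the few-shot setting this abstract contrasts with fine-tuning, the "training data" is just a handful of worked examples placed in the prompt. A minimal illustration of how such a prompt is assembled; the task and examples are made up:

```python
# Few-shot prompting: the task is specified by examples in the context window,
# not by gradient updates on a task-specific dataset.
examples = [
    ("cheese", "fromage"),
    ("house", "maison"),
    ("cat", "chat"),
]

def few_shot_prompt(examples, query):
    lines = ["Translate English to French."]
    for en, fr in examples:
        lines.append(f"English: {en}\nFrench: {fr}")
    lines.append(f"English: {query}\nFrench:")  # the model completes this line
    return "\n\n".join(lines)

print(few_shot_prompt(examples, "dog"))
```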
Preprint
Full-text available
With the recent wave of progress in artificial intelligence (AI) has come a growing awareness of the large-scale impacts of AI systems, and recognition that existing regulations and norms in industry and academia are insufficient to ensure responsible AI development. In order for AI developers to earn trust from system users, customers, civil socie...
Preprint
Large language models have a range of beneficial uses: they can assist in prose, poetry, and programming; analyze dataset biases; and more. However, their flexibility and generative capabilities also raise misuse concerns. This report discusses OpenAI's work related to the release of its GPT-2 language model. It discusses staged release, which allo...
Preprint
In this paper, we argue that competitive pressures could incentivize AI companies to underinvest in ensuring their systems are safe, secure, and have a positive social impact. Ensuring that AI systems are developed responsibly may therefore require preventing and solving collective action problems between companies. We note that there are several k...

Citations

... To estimate the likelihood and impact of the identified risks, they might conduct probabilistic risk assessments, Delphi studies, or use risk matrices (IEC 2019; Koessler and Schuett 2023). These estimates will typically be informed by model evaluations (Chen et al. 2021; Perez et al. 2022b; Liang et al. 2022; Gehrmann et al. 2022), potentially with a focus on dangerous model capabilities (Shevlane et al. 2023; Kinniment et al. 2023; Alaga and Schuett 2023), and an assessment of the company's safeguards (O'Brien et al. 2023; Koessler and Schuett 2023). To mitigate risks, the first line could fine-tune the model on a curated dataset (Solaiman and Dennison 2021), via reinforcement learning from human feedback (RLHF) (Christiano et al. 2017; Ziegler et al. 2019; Lampert et al. 2022), or reinforcement learning from AI feedback (RLAIF), more commonly known as "constitutional AI". ...
... As AI systems get larger and more powerful, they will be applied to a wider array of human tasks, including those which are too complex to directly oversee [128,31] or to define clear optimisation goals for [77,236,247,183]. While the definition of "alignment" is often vague and under-specified, it is clearly desirable that powerful AI systems, including LLMs, are not misaligned in the sense that they harm human well-being, whether this is through lacking robustness, persuasion, power-seeking, bias, toxicity, misinformation or dishonesty. ...
... They could also assess the model itself, including the dataset it was trained on ("model audit"), the model's impact ("impact audit"), or the company's governance ("governance audit"). Similarly, the third line could engage a red team before or after a model is deployed to assess if the first two lines were able to identify all relevant risks (Ganguli et al., 2022; Perez et al., 2022). For example, before OpenAI released DALL·E 2, they asked a group of external experts to identify ways in which the model can be misused. ...
... For example, there are extensive works documenting LLMs on fairness and bias[2,115,143,162,166,191,205,222]; truthfulness, uncertainty, or hallucination[130,106,102]; robustness[208,112]; privacy[35]; and toxicity[72,171]. ...
... Generative transformers, with their ability to produce human-like text, raise several ethical concerns [61]. Misinformation and Fake News: There's potential for these models to generate misleading or false information, which can be weaponized to spread misinformation. ...
... robustly generalize in a wide range of non-i.i.d. scenarios 8,11, over-rely on stereotypes 12,13, or bank on memorization rather than generalization 14,15. Others, instead, display cases in which performances drop when the evaluation data differ from the training data in terms of genre, domain or topic (for example, refs. ...
... They have also led to the use of more explicit human behavioral data for fine-tuning LLMs (e.g., via explicit human feedback on model outputs) to achieve closer alignment with human preferences. As has been pointed out (Irving & Askell, 2019; Russell, 2019), this endeavor presents a unique opportunity for behavioral scientists, who of course have expertise in collecting high-quality human data. ...