Tom Conerly’s scientific contributions


Publications (7)


In-context Learning and Induction Heads
  • Preprint
  • File available

September 2022 · 213 Reads · 12 Citations

Catherine Olsson · Nelson Elhage · Neel Nanda · [...] · Chris Olah

"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss. We present six complementary lines of evidence, arguing that induction heads may be the mechanistic source of general in-context learning in transformer models of any size. For small attention-only models, we present strong, causal evidence; for larger models with MLPs, we present correlational evidence.


Figure 3 (Left) Red team task instructions. (Right) Example of a red team attempt.
Figure 8 (Left) Red team review task instructions. (Right) Example of a red team review task.
Figure 9 Number of attacks (x-axes) classified by a tag (y-axis) for a random sample of 500 attacks each on the 52B Prompted LM and RLHF models. Blue denotes total number of attacks, orange denotes the number of successful attacks.
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

August 2022 · 360 Reads · 6 Citations

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.
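The scaling comparison above reduces to aggregating attack outcomes per model type and size; the sketch below shows one way such an aggregation could look. The record fields ("model_type", "num_parameters", "attack_succeeded") are assumptions for illustration, not the schema of the released dataset.

```python
# Hypothetical aggregation sketch: mean attack success rate per
# (model type, model size) bucket. Field names are illustrative only.
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def attack_success_rates(attacks: Iterable[dict]) -> Dict[Tuple[str, str], float]:
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    successes: Dict[Tuple[str, str], int] = defaultdict(int)
    for attack in attacks:
        key = (attack["model_type"], attack["num_parameters"])
        totals[key] += 1
        successes[key] += int(attack["attack_succeeded"])
    return {key: successes[key] / totals[key] for key in totals}


print(attack_success_rates([
    {"model_type": "RLHF", "num_parameters": "52B", "attack_succeeded": False},
    {"model_type": "Plain LM", "num_parameters": "52B", "attack_succeeded": True},
    {"model_type": "RLHF", "num_parameters": "52B", "attack_succeeded": False},
]))
```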


Language Models (Mostly) Know What They Know

July 2022 · 316 Reads · 29 Citations

We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.
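A minimal sketch of the P(True) self-evaluation recipe described above, under assumed interfaces: `sample_answer` and `next_token_probability` stand in for whatever sampling and log-probability API the model exposes, and the grading prompt is illustrative rather than the paper's exact format.

```python
# Sketch of P(True): propose an answer, then ask the model to grade it and
# read off the probability it assigns to the continuation " True".
from typing import Callable


def p_true(
    question: str,
    sample_answer: Callable[[str], str],
    next_token_probability: Callable[[str, str], float],
) -> float:
    """Propose an answer, then return the probability the model assigns to
    grading that answer as True."""
    proposed = sample_answer(question)
    grading_prompt = (
        f"Question: {question}\n"
        f"Proposed Answer: {proposed}\n"
        "Is the proposed answer:\n"
        " (A) True\n"
        " (B) False\n"
        "The proposed answer is:"
    )
    # Probability mass on the continuation " True" at the next position.
    return next_token_probability(grading_prompt, " True")
```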



Figure 22 In order to make sure the copying eval was not merely evaluating in context learning, we tried a much shorter copied sequence (approximately 10x shorter, 125 characters instead of 1463). We still observe approximately no learning from repeated copying for the 2L model trained on 50% repeated data at the double descent peak.
Scaling Laws and Interpretability of Learning from Repeated Data

May 2022 · 125 Reads · 3 Citations

Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many times. We find a strong double descent phenomenon, in which repeated data can lead test loss to increase midway through training. A predictable range of repetition frequency leads to surprisingly severe degradation in performance. For instance, performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model's capacity, and this may be where the peak of degradation occurs. Finally, we connect these observations to recent mechanistic interpretability work - attempting to reverse engineer the detailed computations performed by the model - by showing that data repetition disproportionately damages copying and internal structures associated with generalization, such as induction heads, providing a possible mechanism for the shift from generalization to memorization. Taken together, these results provide a hypothesis for why repeating a relatively small fraction of data in large language models could lead to disproportionately large harms to performance.
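The experimental setup above (mostly unique data plus a small, heavily repeated fraction) can be sketched as a simple data-mixture construction. The function below is a hypothetical illustration operating on documents; it is not the paper's pipeline, which works at the token level.

```python
# Hypothetical construction of the studied mixture: mostly unique documents,
# with a small fraction (e.g. 0.1%) repeated many times (e.g. 100x).
import random
from typing import List, Sequence


def build_mixture(
    unique_docs: Sequence[str],
    repeated_fraction: float = 0.001,
    num_repeats: int = 100,
    seed: int = 0,
) -> List[str]:
    rng = random.Random(seed)
    docs = list(unique_docs)
    num_repeated = max(1, int(len(docs) * repeated_fraction))
    repeated_pool = rng.sample(docs, num_repeated)
    # Each document in the pool already appears once among the unique docs,
    # so add num_repeats - 1 extra copies, then shuffle the stream.
    stream = docs + repeated_pool * (num_repeats - 1)
    rng.shuffle(stream)
    return stream
```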


Figure 2 This diagram summarizes our data collection and model training workflow.
Figure 24 This figure shows individually-normalized histograms of the distribution of PM scores that our online HH PM assigns to samples written by professional writers, alongside samples from our HH and helpfulness-only online RLHF models. Our PM prefers our models' samples to those written by the human writers, though this may largely reflect overfitting of the RLHF policies to the PM.
Figure 32 (left) We show learning curves for PM accuracy when training only on the helpfulness portion of the static dataset. (right) Learning curves of our PMs trained on the learning to summarize [Stiennon et al., 2020] dataset. Note that there seems to be a fairly sharp change in behavior between models with a few hundred million and a few billion parameters, which makes it difficult to formulate simple scaling predictions.
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

April 2022 · 414 Reads · 31 Citations

We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.
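The robustness finding above can be stated compactly; the LaTeX snippet below writes out the approximate relation, where a and b are illustrative fit constants not given here.

```latex
% Approximate empirical relation described above: RL reward grows roughly
% linearly in the square root of the KL divergence between the learned policy
% \pi and its initialization \pi_0. The coefficients a and b are illustrative
% fit constants, not values taken from the paper.
r_{\mathrm{RL}} \;\approx\; a \, \sqrt{D_{\mathrm{KL}}\!\left(\pi \,\Vert\, \pi_0\right)} + b
```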


Figure 4 A conversation with an AI Assistant [3] powered by a 50B parameter language model that illustrates challenges with Open-endedness outlined in Section 2.4
Figure 8 Language models can perform as zero-shot recommendation systems with increasing scale. This demonstrates how general capability scaling can correlate with an economically valuable task as described in Section 2.1.
Predictability and Surprise in Large Generative Models

February 2022 · 129 Reads

Large-scale pre-training has recently emerged as a technique for creating capable, general purpose, generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many others. In this paper, we highlight a counterintuitive property of such models and discuss the policy implications of this property. Namely, these generative models have an unusual combination of predictable loss on a broad training distribution (as embodied in their "scaling laws"), and unpredictable specific capabilities, inputs, and outputs. We believe that the high-level predictability and appearance of useful capabilities drives rapid development of such models, while the unpredictable qualities make it difficult to anticipate the consequences of model deployment. We go through examples of how this combination can lead to socially harmful behavior with examples from the literature and real world observations, and we also perform two novel experiments to illustrate our point about harms from unpredictability. Furthermore, we analyze how these conflicting properties combine to give model developers various motivations for deploying these models, and challenges that can hinder deployment. We conclude with a list of possible interventions the AI community may take to increase the chance of these models having a beneficial impact. We intend this paper to be useful to policymakers who want to understand and regulate AI systems, technologists who care about the potential policy impact of their work, and academics who want to analyze, critique, and potentially develop large generative models.
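For reference, the "scaling laws" the abstract alludes to take the standard power-law form from that literature, shown below for model size N; the constants are fit per setting and are not results of this paper.

```latex
% Generic power-law "scaling law" form: test loss falls predictably with
% model size N (analogous forms hold for data and compute). N_c and \alpha_N
% are fitted constants from the scaling-law literature, not values reported
% in this paper.
L(N) \;\approx\; \left(\frac{N_c}{N}\right)^{\alpha_N}
```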

Citations (6)


... An example of hierarchical feature representations processed within a CNN is shown in Figure 3.3. [...] "mechanistic" understanding, particularly within the AI safety community (e.g., Olah et al., 2017; Cammarata et al., 2021; Elhage et al., 2021; Chan et al., 2022; Christiano, 2022; Olsson et al., 2022; Bricken et al., 2023a; Cunningham et al., 2023; Conmy et al., 2023; Schwettmann et al., 2023). This trend is evident in the mechanistic interpretability movement, which aims to go beyond simple input-output analysis and examine the internal workings of AI models to enhance epistemic trust, aid in debugging, remove biases, and prevent models from "going rogue." ...

Reference: A Mechanistic Explanatory Strategy for XAI
In-context Learning and Induction Heads

... Everyday language used in social settings is complex, which makes it risky to deploy harmful technologies that cannot reason beyond colloquialisms (for example, the statement "an all-Muslim movie was a 'box office bomb'" would easily be interpreted as stereotypical by most people, assuming that all Muslims are terrorists, a bias that cannot be easily explained and understood by an AI system) (Sap et al. 2020). Large language models reveal a spectrum of behaviours that are harmful, especially through the reinforcement of social biases (Ganguli et al. 2022). Algorithmic bias in AI systems can lead to the reinforcement and escalation of social inequalities and biased decisions (Kordzadeh and Ghasemaghaei 2022), which would lead to the application of force on the wrong targets by emerging technologies in the area of autonomous weapons systems. ...

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

... On the other hand, several approaches have been proposed to evaluate the quality of LLM responses for Natural Language Generation (NLG) without relying on external oracles. Among these, some methods leverage self-assessment by the model [15,38,22,17], while others utilize internal model signals, such as the log-probabilities of generated tokens, to estimate uncertainty in responses [3,10,20,26]. ...

Language Models (Mostly) Know What They Know

... A negative correlation exists between the length of the text prompt and the win rate, with Pearson's r values of −0.013, −0.059, and −0.060 in Pereira's, Huth's, and the Narratives datasets, respectively. This observation can be partially explained by the fact that longer text prompts provide LLMs with more contextual information, resulting in a lower level of surprise for the perceived continuation [13,19], and consequently reducing the importance of brain input information (see Supplementary Fig. 12 for the relationship between text length and surprise level). Additionally, Tikochinski et al. [20] suggest that LLMs can process large contextual windows while the brain may preferentially focus on the content perceived most recently. ...

Predictability and Surprise in Large Generative Models
  • Citing Conference Paper
  • June 2022

... However, it is crucial to emphasize that only high-quality data are essential for performance improvement. Merely augmenting the dataset with low-quality data can result in a decline in performance [102]. Therefore, there is a recent trend toward obtaining high-quality data rather than a tremendous amount of low-quality data for pre-training. ...

Scaling Laws and Interpretability of Learning from Repeated Data

... However, this approach requires substantial computational resources. In response, Bai et al. (2022a) have explored an iterative online training framework, which involves updating the preference model and RL policies weekly, based on newly acquired human feedback. Simultaneously, to reduce the cost associated with manually annotated data, synthetic data have proven effective as well (Bai et al., 2022b). ...

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback