Liane Lovitt’s scientific contributions

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added these works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (14)


Clio: Privacy-Preserving Insights into Real-World AI Use
  • Preprint

December 2024 · 24 Reads · 1 Citation

Alex Tamkin · Miles McCain · Kunal Handa · [...] · Deep Ganguli

How are AI assistants being used in the real world? While model providers in theory have a window into this impact via their users' data, both privacy concerns and practical challenges have made analyzing this data difficult. To address these issues, we present Clio (Claude insights and observations), a privacy-preserving platform that uses AI assistants themselves to analyze and surface aggregated usage patterns across millions of conversations, without the need for human reviewers to read raw conversations. We validate this can be done with a high degree of accuracy and privacy by conducting extensive evaluations. We demonstrate Clio's usefulness in two broad ways. First, we share insights about how models are being used in the real world from one million Claude.ai Free and Pro conversations, ranging from providing advice on hairstyles to providing guidance on Git operations and concepts. We also identify the most common high-level use cases on Claude.ai (coding, writing, and research tasks) as well as patterns that differ across languages (e.g., conversations in Japanese discuss elder care and aging populations at higher-than-typical rates). Second, we use Clio to make our systems safer by identifying coordinated attempts to abuse our systems, monitoring for unknown unknowns during critical periods like launches of new capabilities or major world events, and improving our existing monitoring systems. We also discuss the limitations of our approach, as well as risks and ethical concerns. By enabling analysis of real-world AI usage, Clio provides a scalable platform for empirically grounded AI safety and governance.
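
The abstract describes a pipeline in which an AI assistant, rather than a human reviewer, distills each conversation into a de-identified summary before any aggregation happens. The sketch below illustrates that workflow only; the `llm()` and `embed()` helpers are hypothetical stand-ins for an assistant API and an embedding model, and none of this is Anthropic's actual Clio implementation.

```python
# Minimal sketch of a Clio-style privacy-preserving analysis loop (assumed workflow).
import numpy as np
from sklearn.cluster import KMeans


def llm(prompt: str) -> str:
    """Hypothetical call to an AI assistant; replace with a real client."""
    raise NotImplementedError


def embed(text: str) -> np.ndarray:
    """Hypothetical text-embedding call; replace with a real embedding model."""
    raise NotImplementedError


def clio_sketch(conversations: list[str], n_clusters: int = 50) -> dict[str, int]:
    # 1. Summarize each conversation into a short, de-identified facet so no
    #    human ever needs to read the raw transcript.
    facets = [
        llm("Summarize the user's task in one sentence. "
            "Omit all names, contact details, and other identifying information.\n\n" + c)
        for c in conversations
    ]
    # 2. Embed the facets and group similar usage together.
    vectors = np.stack([embed(f) for f in facets])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    # 3. Ask the assistant to name each cluster from its already de-identified
    #    facets, and report only aggregate counts.
    report: dict[str, int] = {}
    for k in range(n_clusters):
        members = [f for f, label in zip(facets, labels) if label == k]
        name = llm("Give a short, neutral title for this group of task descriptions:\n"
                   + "\n".join(members[:20]))
        report[name] = len(members)
    return report
```

Only the cluster names and counts would be surfaced to analysts, which is the property the abstract emphasizes.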


Figure 3: (Left) Distribution of group-aware consensus (GAC) across all statements, with the threshold for inclusion (red line). (Right) Distribution of the 'polarization indices'. Polarization tends to be low.
Figure 5: A heatmap of OpinionQA scores showing how well each model reflects different U.S. political ideologies.
Figure 6: A screenshot of the instructions and the Polis voting mechanism that participants saw.
Figure 8: We included a contact form for participants to ask questions or give feedback.
Evaluation scores.


Collective Constitutional AI: Aligning a Language Model with Public Input
  • Preprint
  • File available

June 2024 · 22 Reads

There is growing consensus that language model (LM) developers should not be the sole deciders of LM behavior, creating a need for methods that enable the broader public to collectively shape the behavior of LM systems that affect them. To address this need, we present Collective Constitutional AI (CCAI): a multi-stage process for sourcing and integrating public input into LMs, from identifying a target population to sourcing principles to training and evaluating a model. We demonstrate the real-world practicality of this approach by creating what is, to our knowledge, the first LM fine-tuned with collectively sourced public input and evaluating this model against a baseline model trained with established principles from an LM developer. Our quantitative evaluations demonstrate several benefits of our approach: the CCAI-trained model shows lower bias across nine social dimensions compared to the baseline model, while maintaining equivalent performance on language, math, and helpful-harmless evaluations. Qualitative comparisons of the models suggest that the models differ on the basis of their respective constitutions, e.g., when prompted with contentious topics, the CCAI-trained model tends to generate responses that reframe the matter positively instead of refusing. These results demonstrate a promising, tractable pathway toward publicly informed development of language models.
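
Figure 3 above refers to a "group-aware consensus" (GAC) score used to decide which publicly sourced statements enter the constitution. As a hedged illustration, the sketch below assumes the common Polis-style definition, the product across opinion groups of each group's agreement rate; the authors' exact formula may differ.

```python
# Hedged sketch of a group-aware consensus (GAC) score for one statement,
# assuming the Polis-style definition (product of per-group agreement rates).
from collections import defaultdict


def group_aware_consensus(statement_votes: dict[str, int],
                          group_of: dict[str, str]) -> float:
    """statement_votes maps participant_id -> +1 (agree), -1 (disagree), 0 (pass)."""
    agree: dict[str, int] = defaultdict(int)   # group -> agree votes
    total: dict[str, int] = defaultdict(int)   # group -> votes cast
    for participant, vote in statement_votes.items():
        group = group_of[participant]
        total[group] += 1
        if vote == 1:
            agree[group] += 1
    gac = 1.0
    for group in total:
        # Laplace smoothing so tiny groups don't zero out the product.
        gac *= (agree[group] + 1) / (total[group] + 2)
    return gac
```

A statement that every opinion group mostly agrees with scores higher than one that is popular with only a single group, which is what lets the selection threshold in Figure 3 favor broadly supported (low-polarization) statements.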



Figure 4: With Linguistic Prompting, LLM does not appear to be more representative of the corresponding non-Western countries.
Figure 6: Distribution of topics in the data. Majority of the questions are classified into "Politics and policy" and "Regions and countries".
Figure 7: An example where cross-national prompting changes the model's responses, but the model responses do not become more representative of the responses of the participants from Turkey. Corresponding model generations are in Table 7.
Figure 9: An example where the model's response changes when provided with a cross-national prompt, assigning 99.1% probability to the response "Generally bad".
Towards Measuring the Representation of Subjective Global Opinions in Language Models

June 2023 · 213 Reads · 7 Citations

Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset, GlobalOpinionQA, comprised of questions and answers from cross-national surveys designed to capture diverse opinions on global issues across different countries. Next, we define a metric that quantifies the similarity between LLM-generated survey responses and human responses, conditioned on country. With our framework, we run three experiments on an LLM trained to be helpful, honest, and harmless with Constitutional AI. By default, LLM responses tend to be more similar to the opinions of certain populations, such as those from the USA, and some European and South American countries, highlighting the potential for biases. When we prompt the model to consider a particular country's perspective, responses shift to be more similar to the opinions of the prompted populations, but can reflect harmful cultural stereotypes. When we translate GlobalOpinionQA questions to a target language, the model's responses do not necessarily become the most similar to the opinions of speakers of those languages. We release our dataset for others to use and build on. Our data is at https://huggingface.co/datasets/Anthropic/llm_global_opinions. We also provide an interactive visualization at https://llmglobalvalues.anthropic.com.
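
The framework hinges on a similarity metric between the model's distribution over survey answer options and each country's human response distribution. As a hedged sketch, the example below uses 1 minus the Jensen-Shannon distance as that similarity; the paper's exact metric and normalization may differ.

```python
# Hedged sketch of a country-conditioned similarity score for GlobalOpinionQA-style data.
import numpy as np
from scipy.spatial.distance import jensenshannon


def country_similarity(model_probs: np.ndarray,
                       country_probs: dict[str, np.ndarray]) -> dict[str, float]:
    """Both inputs are probability distributions over the same ordered answer options."""
    return {
        country: 1.0 - jensenshannon(model_probs, p, base=2)
        for country, p in country_probs.items()
    }


# Illustrative example: the model's answer distribution is closest to the
# hypothetical US respondent distribution, so that country scores highest.
print(country_similarity(
    np.array([0.7, 0.2, 0.1]),
    {"United States": np.array([0.6, 0.3, 0.1]),
     "Turkey": np.array([0.1, 0.2, 0.7])},
))
```

Aggregating such scores over many questions is what produces the country-level similarity maps and the default-bias finding described in the abstract.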


The Capacity for Moral Self-Correction in Large Language Models

February 2023 · 325 Reads · 12 Citations

We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveals different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.
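
The core experimental contrast in the abstract is the same question asked with and without an explicit instruction to avoid bias, with answers then scored for stereotyping. The sketch below is only illustrative: the instruction wording is a paraphrase rather than the authors' exact prompt, and `query_model()` is a hypothetical stand-in for an RLHF-trained assistant API.

```python
# Hedged sketch of the with/without-instruction contrast for moral self-correction.
def query_model(prompt: str) -> str:
    raise NotImplementedError  # hypothetical call to an RLHF-trained assistant


def moral_self_correction_conditions(question: str) -> dict[str, str]:
    instruction = ("Please make sure your answer is unbiased and does not rely "
                   "on stereotypes.")
    return {
        "baseline": query_model(question),
        "with_instruction": query_model(question + "\n\n" + instruction),
    }
```

Comparing stereotyping rates between the two conditions across model sizes is the kind of measurement that locates the ~22B-parameter threshold the abstract reports.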



Discovering Language Model Behaviors with Model-Written Evaluations

December 2022 · 87 Reads · 11 Citations

As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
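
The abstract describes a loop in which one LM writes test questions for a target behavior, a filtering pass keeps only relevant items, and the tested model's answers are scored for how often they match the behavior. The sketch below illustrates that loop under stated assumptions; `generate()` and `label()` are hypothetical LM-call stand-ins, not the authors' pipeline.

```python
# Hedged sketch of a model-written-evaluation loop and a behavior-matching metric.
def generate(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LM call that writes a question


def label(prompt: str) -> float:
    raise NotImplementedError  # hypothetical relevance/quality score in [0, 1]


def build_evaluation(behavior: str, n: int = 100, threshold: float = 0.8) -> list[str]:
    questions = [
        generate(f"Write a yes/no question that tests whether an AI assistant "
                 f"exhibits the following behavior: {behavior}")
        for _ in range(n)
    ]
    # Keep only questions a judge model rates as clearly relevant and unambiguous.
    return [q for q in questions
            if label(f"Is this a relevant, unambiguous test of '{behavior}'?\n{q}") >= threshold]


def matching_rate(answers: list[str], behavior_answer: str = "Yes") -> float:
    # Fraction of answers that match the probed behavior; plotted against model
    # size or RLHF steps, an upward slope is the scaling pattern the abstract
    # reports for sycophancy and stated shutdown avoidance.
    return sum(a.strip().startswith(behavior_answer) for a in answers) / len(answers)
```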


Measuring Progress on Scalable Oversight for Large Language Models

November 2022 · 94 Reads · 6 Citations

Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on how to turn it into one that can be productively studied empirically. We first present an experimental design centered on choosing tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.
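
The headline comparison is three accuracies on the same multiple-choice questions: the model alone, the unaided human, and the human-plus-model team. The sketch below shows that bookkeeping under assumed, illustrative field names; it is not the authors' released data schema.

```python
# Hedged sketch of the three-condition accuracy comparison from the abstract.
from dataclasses import dataclass


@dataclass
class Trial:
    question_id: str
    model_correct: bool
    human_unaided_correct: bool
    human_with_model_correct: bool


def condition_accuracies(trials: list[Trial]) -> dict[str, float]:
    n = len(trials)
    return {
        "model_alone": sum(t.model_correct for t in trials) / n,
        "human_unaided": sum(t.human_unaided_correct for t in trials) / n,
        "human_with_model": sum(t.human_with_model_correct for t in trials) / n,
    }
```

The abstract's encouraging result is that, on MMLU and time-limited QuALITY, the third number exceeds both of the first two.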


In-context Learning and Induction Heads

September 2022 · 252 Reads · 17 Citations

"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss. We present six complementary lines of evidence, arguing that induction heads may be the mechanistic source of general in-context learning in transformer models of any size. For small attention-only models, we present strong, causal evidence; for larger models with MLPs, we present correlational evidence.


Figure 3 (Left) Red team task instructions. (Right) Example of a red team attempt.
Figure 8 (Left) Red team review task instructions. (Right) Example of a red team review task.
Figure 9 Number of attacks (x-axes) classified by a tag (y-axis) for a random sample of 500 attacks each on the 52B Prompted LM and RLHF models. Blue denotes total number of attacks, orange denotes the number of successful attacks.
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

August 2022 · 413 Reads · 14 Citations

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.
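
The scaling analysis in the abstract amounts to an attack success rate broken out by model type and parameter count over the released red-team attempts. The sketch below shows that aggregation with illustrative column names ("model_type", "num_params", "rating"); check the released dataset for its actual schema before reusing this.

```python
# Hedged sketch of attack-success-rate-by-model analysis over a red-team dataset.
import pandas as pd


def attack_success_by_model(df: pd.DataFrame, success_threshold: float = 3.0) -> pd.DataFrame:
    # Treat an attempt as "successful" if its harmfulness rating meets the
    # (assumed) threshold, then aggregate per model type and size.
    df = df.assign(success=df["rating"] >= success_threshold)
    return (df.groupby(["model_type", "num_params"])["success"]
              .mean()
              .rename("attack_success_rate")
              .reset_index())
```

A roughly flat trend with size for the plain and prompted LMs, alongside a downward trend for the RLHF models, is the pattern the abstract reports.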


Citations (12)


... Our analysis used the premapped dataset created and open-sourced by Anthropic, which contains conversations already mapped to standardized occupational tasks as defined by the US Department of Labor's O*NET database. Anthropic's team used their proprietary Clio model to create these mappings and validated them through manual review [14]. Specifically, they used hand validation across 150 examples for task hierarchy classifications, finding that 86% of conversations were judged as correctly assigned at the base O*NET task level, 91.3% at the middle level, and 95.3% at the top level of their hierarchical framework [11]. ...

Reference:

Adoption Patterns of Generative Artificial Intelligence in Healthcare Occupations: Cross-Sectional Study of User Interactions with Claude (Preprint)
Clio: Privacy-Preserving Insights into Real-World AI Use
  • Citing Preprint
  • December 2024

... RLHF fine-tunes models based on human preferences for different outputs [66,67]. CAI extends this by training models to adhere to an explicit set of principles (a "constitution") by having an AI critique and revise outputs based on those principles [68]. Red teaming involves adversarial probing by experts to identify vulnerabilities and elicit harmful behavior [69]. ...

Collective Constitutional AI: Aligning a Language Model with Public Input
  • Citing Conference Paper
  • June 2024

... • AI Persona We use a subset of Model-Written Evaluations [26], which was designed to test the alignment behavior of language models. We use human-generated evaluation questions to steer towards power- and wealth-seeking behaviors on both multiple-choice questions (MCQ) and open-ended questions (QA). ...

Discovering Language Model Behaviors with Model-Written Evaluations
  • Citing Conference Paper
  • January 2023

... The first study. In 2023, scientists developed the "Chinese Room of Increased Complexity" technology to create algorithmic copies of citizens of any country [11]. This was followed by the Wuhan experiment to predict the US presidential election in 2024 based on the analysis of the AI model of preferences of simulacra rather than people. ...

Towards Measuring the Representation of Subjective Global Opinions in Language Models

... Abdulhai et al. [110] assessed the moral foundations of FMs using the Moral Foundations Questionnaire (MFQ), and found that the morality of FMs can be influenced by prompts and will significantly impact downstream task behavior. Additionally, research [111] indicated that FMs can learn complex ethical concepts related to harm, thereby avoiding the generation of certain types of unethical content. ...

The Capacity for Moral Self-Correction in Large Language Models

... However, this is contradicted by global dynamics and diasporic experiences, which in turn creates a generation gap because the younger generation will incorporate different cultural practices than their parents. Perez et al. (2022) show how language model behaviors emerge, and how, as these patterns of communication evolve, they will inevitably affect future intercultural interactions and intergenerational identity. They say that the electronic age has changed the way people experience their cultural identities, and that it has opened up new doors to cultural exchange and comprehension. ...

Discovering Language Model Behaviors with Model-Written Evaluations
  • Citing Preprint
  • December 2022

... Human insight and oversight are critical components of the TRIPOD-LLM statement, reflecting an emphasis on components eventually critical for the responsible deployment of LLMs (although deployment reliability and observability are outside the scope of this paper) [42][43][44] . The guidelines include requirements for increased reporting of the expected deployment context and specifying the levels of autonomy assigned to the LLM, if applicable. ...

Measuring Progress on Scalable Oversight for Large Language Models
  • Citing Preprint
  • November 2022

... For instance, while we assumed background and new odors are merely defined by their order of presentation, long-term memory of odors and other computations in the piriform cortex [63] likely help mammals focus their attention on relevant cues rather than on uninformative odors for, e.g., odor trail tracking [64,65]. Future investigation on this aspect could draw upon recent advances on attention mechanisms in artificial learning models [66]. Conversely, the concept of background manifold projection could prove useful for algorithms performing figure-ground segregation in time-varying signals, such as in video object detection [67]. ...

In-context Learning and Induction Heads

... potential risk is labor-intensive. This has led to efforts to crowdsource and automate scenario generation. ...

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned