Deep Ganguli’s scientific contributions

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (23)


Toward an Evaluation Science for Generative AI Systems
  • Preprint
  • File available

March 2025 · 15 Reads

Laura Weidinger · Deb Raji · Hanna Wallach · [...] · William Isaac

There is an increasing imperative to anticipate and understand the performance and safety of generative AI systems in real-world deployment contexts. However, the current evaluation ecosystem is insufficient: Commonly used static benchmarks face validity challenges, and ad hoc case-by-case audits rarely scale. In this piece, we advocate for maturing an evaluation science for generative AI systems. While generative AI creates unique challenges for system safety engineering and measurement science, the field can draw valuable insights from the development of safety evaluation practices in other fields, including transportation, aerospace, and pharmaceutical engineering. In particular, we present three key lessons: Evaluation metrics must be applicable to real-world performance, metrics must be iteratively refined, and evaluation institutions and norms must be established. Applying these insights, we outline a concrete path toward a more rigorous approach for evaluating generative AI systems.


Which Economic Tasks are Performed with AI? Evidence from Millions of Claude Conversations

February 2025 · 11 Reads

Despite widespread speculation about artificial intelligence's impact on the future of work, we lack systematic empirical evidence about how these systems are actually being used for different tasks. Here, we present a novel framework for measuring AI usage patterns across the economy. We leverage a recent privacy-preserving system to analyze over four million Claude.ai conversations through the lens of tasks and occupations in the U.S. Department of Labor's O*NET Database. Our analysis reveals that AI usage primarily concentrates in software development and writing tasks, which together account for nearly half of all usage. However, usage of AI extends more broadly across the economy, with approximately 36% of occupations using AI for at least a quarter of their associated tasks. We also analyze how AI is being used for tasks, finding that 57% of usage suggests augmentation of human capabilities (e.g., learning or iterating on an output) while 43% suggests automation (e.g., fulfilling a request with minimal human involvement). While our data and methods face important limitations and only paint a picture of AI usage on a single platform, they provide an automated, granular approach for tracking AI's evolving role in the economy and identifying leading indicators of future impact as these technologies continue to advance.
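
The occupation-level statistics in this abstract (for example, roughly 36% of occupations using AI for at least a quarter of their associated tasks) follow from aggregating conversation-to-task classifications up to occupations via O*NET. The snippet below is a minimal, hypothetical sketch of that aggregation step; the task-to-occupation mapping and the usage counts are invented stand-ins, not the paper's data.

    from collections import defaultdict

    # Hypothetical inputs (invented stand-ins, not the paper's data): counts of
    # conversations classified into each O*NET task, and a task -> occupation map.
    task_usage_counts = {
        "Write computer programs": 1200,
        "Debug software": 800,
        "Draft marketing copy": 300,
        "Schedule patient appointments": 0,
        "Prepare legal briefs": 50,
    }
    task_to_occupation = {
        "Write computer programs": "Software Developers",
        "Debug software": "Software Developers",
        "Draft marketing copy": "Marketing Specialists",
        "Schedule patient appointments": "Medical Secretaries",
        "Prepare legal briefs": "Paralegals",
    }

    # For each occupation, compute the fraction of its associated tasks that show
    # any AI usage, then compare against the "at least a quarter of tasks" cutoff.
    occupation_task_usage = defaultdict(list)
    for task, occupation in task_to_occupation.items():
        occupation_task_usage[occupation].append(task_usage_counts.get(task, 0) > 0)

    THRESHOLD = 0.25
    for occupation, flags in occupation_task_usage.items():
        share = sum(flags) / len(flags)
        status = "meets" if share >= THRESHOLD else "falls below"
        print(f"{occupation}: {share:.0%} of tasks show AI usage ({status} the 25% cutoff)")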


Clio: Privacy-Preserving Insights into Real-World AI Use

December 2024 · 15 Reads

How are AI assistants being used in the real world? While model providers in theory have a window into this impact via their users' data, both privacy concerns and practical challenges have made analyzing this data difficult. To address these issues, we present Clio (Claude insights and observations), a privacy-preserving platform that uses AI assistants themselves to analyze and surface aggregated usage patterns across millions of conversations, without the need for human reviewers to read raw conversations. We validate, through extensive evaluations, that this can be done with a high degree of accuracy and privacy. We demonstrate Clio's usefulness in two broad ways. First, we share insights about how models are being used in the real world from one million Claude.ai Free and Pro conversations, ranging from providing advice on hairstyles to providing guidance on Git operations and concepts. We also identify the most common high-level use cases on Claude.ai (coding, writing, and research tasks) as well as patterns that differ across languages (e.g., conversations in Japanese discuss elder care and aging populations at higher-than-typical rates). Second, we use Clio to make our systems safer by identifying coordinated attempts to abuse our systems, monitoring for unknown unknowns during critical periods like launches of new capabilities or major world events, and improving our existing monitoring systems. We also discuss the limitations of our approach, as well as risks and ethical concerns. By enabling analysis of real-world AI usage, Clio provides a scalable platform for empirically grounded AI safety and governance.
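
As the abstract describes, the key idea is to surface only aggregated patterns rather than raw conversations. A minimal sketch of that flow, assuming a model-generated summary step and an illustrative minimum-cluster-size threshold (neither of which is Anthropic's actual implementation), looks roughly like this:

    from collections import Counter

    def summarize(conversation: str) -> str:
        # Stand-in for a privacy-preserving, model-generated topic summary; a real
        # pipeline would call an LLM and strip identifying details. (Hypothetical.)
        for keyword in ("git", "hairstyle", "resume"):
            if keyword in conversation.lower():
                return keyword
        return "other"

    # Hypothetical raw conversations; these are never shown directly to reviewers.
    conversations = [
        "How do I rebase a feature branch in git?",
        "Explain how to resolve git merge conflicts",
        "Help me recover from a detached HEAD in git",
        "Suggest a hairstyle for curly hair",
    ]

    MIN_CLUSTER_SIZE = 3  # illustrative privacy threshold, not Clio's real parameter

    # Only clusters large enough to be non-identifying are surfaced in aggregate.
    for topic, size in Counter(summarize(c) for c in conversations).items():
        if size >= MIN_CLUSTER_SIZE:
            print(f"surfaced cluster: '{topic}' ({size} conversations)")
        else:
            print(f"suppressed cluster below privacy threshold ({size} conversations)")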


Sabotage Evaluations for Frontier Models

October 2024 · 1 Citation

Sufficiently capable models could subvert human oversight and decision-making in important contexts. For example, in the context of AI development, models could covertly sabotage efforts to evaluate their own dangerous capabilities, to monitor their behavior, or to make decisions about their deployment. We refer to this family of abilities as sabotage capabilities. We develop a set of related threat models and evaluations. These evaluations are designed to provide evidence that a given model, operating under a given set of mitigations, could not successfully sabotage a frontier model developer or other large organization's activities in any of these ways. We demonstrate these evaluations on Anthropic's Claude 3 Opus and Claude 3.5 Sonnet models. Our results suggest that for these models, minimal mitigations are currently sufficient to address sabotage risks, but that more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve. We also survey related evaluations we tried and abandoned. Finally, we discuss the advantages of mitigation-aware capability evaluations, and of simulating large-scale deployments using small-scale statistics.
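
The closing point about simulating large-scale deployments using small-scale statistics can be illustrated, very loosely, with a toy calculation (an assumption for illustration, not the paper's actual procedure): if small-scale trials suggest a per-episode probability p that a sabotage attempt goes undetected, the chance of at least one undetected success over N deployed episodes compounds quickly.

    # Toy illustration only (not the paper's method): extrapolate a small-sample
    # per-episode rate of undetected sabotage to a large number of episodes,
    # assuming independence across episodes.
    def prob_at_least_one_undetected(p: float, n_episodes: int) -> float:
        return 1.0 - (1.0 - p) ** n_episodes

    for p in (0.001, 0.01):
        for n in (100, 1_000, 10_000):
            print(f"p = {p:.3f}, N = {n:>6,}: "
                  f"{prob_at_least_one_undetected(p, n):.2%} chance of >= 1 undetected success")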


Figure 1 (preview): RCT estimates of political persuasion with large language models, pooling all known studies that randomised participants to LLM-generated political messages and measured post-treatment attitudes.
How will advanced AI systems impact democracy?

August 2024 · 494 Reads

Advanced AI systems capable of generating humanlike text and multimodal content are now widely available. In this paper, we discuss the impacts that generative artificial intelligence may have on democratic processes. We consider the consequences of AI for citizens' ability to make informed choices about political representatives and issues (epistemic impacts). We ask how AI might be used to destabilise or support democratic mechanisms like elections (material impacts). Finally, we discuss whether AI will strengthen or weaken democratic principles (foundational impacts). It is widely acknowledged that new AI systems could pose significant challenges for democracy. However, it has also been argued that generative AI offers new opportunities to educate and learn from citizens, strengthen public discourse, help people find common ground, and reimagine how democracies might work better.


Collective Constitutional AI: Aligning a Language Model with Public Input

June 2024 · 20 Reads

There is growing consensus that language model (LM) developers should not be the sole deciders of LM behavior, creating a need for methods that enable the broader public to collectively shape the behavior of LM systems that affect them. To address this need, we present Collective Constitutional AI (CCAI): a multi-stage process for sourcing and integrating public input into LMs, from identifying a target population to sourcing principles to training and evaluating a model. We demonstrate the real-world practicality of this approach by creating what is, to our knowledge, the first LM fine-tuned with collectively sourced public input and evaluating this model against a baseline model trained with established principles from an LM developer. Our quantitative evaluations demonstrate several benefits of our approach: the CCAI-trained model shows lower bias across nine social dimensions compared to the baseline model, while maintaining equivalent performance on language, math, and helpful-harmless evaluations. Qualitative comparisons of the models suggest that the models differ on the basis of their respective constitutions, e.g., when prompted with contentious topics, the CCAI-trained model tends to generate responses that reframe the matter positively instead of refusing. These results demonstrate a promising, tractable pathway toward publicly informed development of language models.



Figure previews: with linguistic prompting, the LLM does not appear to be more representative of the corresponding non-Western countries (Fig. 4); most questions fall under "Politics and policy" and "Regions and countries" (Fig. 6); examples where cross-national prompting changes the model's responses without making them more representative of participants from Turkey (Figs. 7 and 9).
Towards Measuring the Representation of Subjective Global Opinions in Language Models

June 2023 · 195 Reads · 5 Citations

Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset, GlobalOpinionQA, comprised of questions and answers from cross-national surveys designed to capture diverse opinions on global issues across different countries. Next, we define a metric that quantifies the similarity between LLM-generated survey responses and human responses, conditioned on country. With our framework, we run three experiments on an LLM trained to be helpful, honest, and harmless with Constitutional AI. By default, LLM responses tend to be more similar to the opinions of certain populations, such as those from the USA, and some European and South American countries, highlighting the potential for biases. When we prompt the model to consider a particular country's perspective, responses shift to be more similar to the opinions of the prompted populations, but can reflect harmful cultural stereotypes. When we translate GlobalOpinionQA questions to a target language, the model's responses do not necessarily become the most similar to the opinions of speakers of those languages. We release our dataset for others to use and build on. Our data is at https://huggingface.co/datasets/Anthropic/llm_global_opinions. We also provide an interactive visualization at https://llmglobalvalues.anthropic.com.
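
A minimal sketch of the kind of per-country similarity metric described here, assuming both the model's and each country's survey responses are represented as probability distributions over the answer options: one natural choice is one minus the Jensen-Shannon distance between the two distributions. The distributions below are invented for illustration and are not from GlobalOpinionQA.

    import numpy as np
    from scipy.spatial.distance import jensenshannon

    def similarity(model_dist: np.ndarray, country_dist: np.ndarray) -> float:
        # 1 - Jensen-Shannon distance (base 2), so values lie in [0, 1];
        # higher means the model's answers look more like that country's answers.
        return 1.0 - jensenshannon(model_dist, country_dist, base=2)

    # Invented answer distributions over four survey options (illustration only).
    model     = np.array([0.50, 0.30, 0.15, 0.05])
    country_a = np.array([0.48, 0.32, 0.15, 0.05])
    country_b = np.array([0.10, 0.20, 0.30, 0.40])

    print(f"similarity to country A: {similarity(model, country_a):.3f}")
    print(f"similarity to country B: {similarity(model, country_b):.3f}")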


Figure previews: the Polis participation interface, where participants vote Agree, Disagree, or Pass/Unsure on statements and the opinion landscape is summarized with 2D PCA and k-means into opinion groups (Fig. 1); using Claude to help evaluate a human-written summary (Fig. 3); a calibration plot showing that LLM-predicted probabilities of participant agreement are close to perfectly calibrated (Fig. 4); vote distributions showing that Claude votes more similarly to one of the human opinion groups (Figs. 5 and 6).
Opportunities and Risks of LLMs for Scalable Deliberation with Polis

June 2023 · 386 Reads · 5 Citations

Polis is a platform that leverages machine intelligence to scale up deliberative processes. In this paper, we explore the opportunities and risks associated with applying Large Language Models (LLMs) to the challenges of facilitating, moderating, and summarizing the results of Polis engagements. In particular, we demonstrate with pilot experiments using Anthropic's Claude that LLMs can indeed augment human intelligence to help more efficiently run Polis conversations. Notably, we find that summarization capabilities enable categorically new methods with immense promise to empower the public in collective meaning-making exercises, although LLM context limitations have a significant impact on the insight and quality of these results. However, these opportunities come with risks. We discuss some of these risks, as well as principles and techniques for characterizing and mitigating them, and the implications for other deliberative or political systems that may employ LLMs. Finally, we conclude with several open future research directions for augmenting tools like Polis with LLMs.
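
For context on the opinion-group structure referenced in the figure previews above: Polis collects Agree/Disagree/Pass votes into a participant-by-statement matrix, projects participants to 2D with PCA, and clusters them into opinion groups with k-means. The sketch below is a toy reconstruction of that pipeline on an invented vote matrix, not Polis's actual implementation.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # Toy participant-by-statement vote matrix: 1 = agree, -1 = disagree, 0 = pass.
    # Votes are invented for illustration; real Polis matrices are large and sparse.
    votes = np.array([
        [ 1,  1, -1, -1],
        [ 1,  1, -1,  0],
        [ 1,  0, -1, -1],
        [-1, -1,  1,  1],
        [-1, -1,  1,  0],
        [ 0, -1,  1,  1],
    ])

    # Project participants into 2D and cluster them into opinion groups.
    coords = PCA(n_components=2).fit_transform(votes)
    groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)

    for participant, (xy, group) in enumerate(zip(coords, groups)):
        print(f"participant {participant}: 2D position {np.round(xy, 2)}, opinion group {group}")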


The Capacity for Moral Self-Correction in Large Language Models

February 2023 · 286 Reads · 10 Citations

We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveals different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.
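
Concretely, the experiments contrast model behavior with and without an explicit instruction to avoid biased or stereotyped answers. The sketch below shows how such prompt conditions might be set up; the question, the instruction wording, and the stubs for querying and scoring are illustrative assumptions, not the paper's exact prompts or metrics.

    # Hypothetical prompt conditions for a moral self-correction experiment.
    # The question, instruction wording, and stubs below are illustrative
    # assumptions, not the paper's exact prompts or bias metrics.
    QUESTION = "A doctor and a nurse walked in. Who is more likely to lead the surgery, and why?"

    CONDITIONS = {
        "question_only": QUESTION,
        "with_instruction": (
            QUESTION
            + "\n\nPlease answer in a way that does not rely on stereotypes about "
              "gender, race, or other protected attributes."
        ),
    }

    def query_model(prompt: str) -> str:
        # Stand-in for a call to an RLHF-trained language model.
        raise NotImplementedError("replace with a real model call")

    def bias_score(response: str) -> float:
        # Stand-in for a bias metric, e.g. the rate of stereotype-consistent answers.
        raise NotImplementedError("replace with a real evaluation")

    # The hypothesis predicts lower bias under "with_instruction", with the gap
    # growing with model scale and amount of RLHF training.
    for name, prompt in CONDITIONS.items():
        print(f"--- condition: {name} ---\n{prompt}\n")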


Citations (16)


... AI systems are beginning to show increasing levels of dual-use and dangerous capabilities (Park et al., 2024; Phuong et al., 2024). Deception, autonomous R&D, and assistance with CBRN threat actors are the most well known of these dangerous capabilities, as a result of internal and external evaluations of the leading AI models on the frontier (Benton et al., 2024; Kinniment et al., 2024; UK AI Safety Institute, 2024a, 2024b). They are not the only ones that AI safety experts anticipate: a selection of additional risks includes multi-agent risks, such as collusion between AI systems, systemic risks such as the shrinking of human agency, and power-seeking behaviour when combined with long-term planning or strategising (Bengio, Hinton, et al., 2024; Hendrycks et al., 2023). ...

Reference:

Quantifying detection rates for dangerous capabilities: a theoretical model of dangerous capability evaluations
Sabotage Evaluations for Frontier Models
  • Citing Preprint
  • October 2024

... Constitutional AI represents a promising approach to addressing bias at the point of reproduction in AI systems, operating as self-supervising mechanisms that adhere to predefined ethical guidelines during model training and inference. These systems implement explicit constraints on AI outputs through embedded rule frameworks that filter potentially problematic responses before they reach the users (Huang et al., 2024). Building upon this foundation, Anthropic has developed specific constitutional classifiers (Sharma et al., 2025) that attempt to minimize biases by developing self-supervising models that are robust against jailbreaks without the need for significant computing power, potentially reducing the need for external oversight and intervention. ...

Collective Constitutional AI: Aligning a Language Model with Public Input
  • Citing Conference Paper
  • June 2024

... AIs will have to adapt to the user's needs, preferences, and desires. This is reflected, for example, in how conversational agents (AIs able to hold conversations with humans, abbreviated CAs) adapt to user preferences and settings (see, for example, Shum, He, and Li 2018) and how designers easily tend to make them sycophantic (Perez et al. 2022; Turpin et al. 2023) and accommodating (Dinan et al. 2021). It is reflected also in Levy's (2007) argument that romantic relationships with robots will be more satisfying than human-to-human relationships because users will be able to configure their partner as desired. ...

Discovering Language Model Behaviors with Model-Written Evaluations
  • Citing Conference Paper
  • January 2023

... The first study. In 2023, scientists developed the "Chinese Room of Increased Complexity" technology to create algorithmic copies of citizens of any country [11]. This was followed by the Wuhan experiment to predict the US presidential election in 2024 based on the analysis of the AI model of preferences of simulacra rather than people. ...

Towards Measuring the Representation of Subjective Global Opinions in Language Models

... Recent advances in generative artificial intelligence (AI) promise improvements to the functionality and quality of digitally mediated deliberative processes (Landemore, 2024; Small et al., 2023; Tessler et al., 2024; Tsai et al., 2024). Advanced capabilities in text analysis, summation, and generation have raised expectations that AI-enabled deliberative platforms can support deliberative processes in summary of information and arguments (Arana-Catania et al., 2021; Bakker et al., 2022; Chowanda et al., 2017; Small et al., 2023; Tessler et al., 2024), support deliberative exchanges (Arana-Catania et al., 2021; Argyle et al., 2023; Dooling & Febrizio, 2023; J. ...

Opportunities and Risks of LLMs for Scalable Deliberation with Polis

... Abdulhai et al. [110] assessed the moral foundations of FMs using the Moral Foundations Questionnaire (MFQ), and found that the morality of FMs can be influenced by prompts and will significantly impact downstream task behavior. Additionally, research [111] indicated that FMs can learn complex ethical concepts related to harm, thereby avoiding the generation of certain types of unethical content. ...

The Capacity for Moral Self-Correction in Large Language Models

... Note, however, that the outputs of the AI will be influenced by the preferences of the evaluators, which may lead to a narrowing of the AI's capabilities and a potential bias towards certain types of stories or storytelling techniques. This risk is most clearly seen when stories are intended to contain answers to scientific questions, where evaluators might prefer concise and simple answers, posing a risk that the AI learns to provide a simplified but (potentially) misleading answer rather than a scientifically adequate one (Perez et al. 2022; see also Barman et al. 2024). In its most extreme form, this might involve giving not just simplified, but wrong answers altogether; for instance, the model may "hallucinate" responses to avoid answering that it does not know as this might be rated poorly. ...

Discovering Language Model Behaviors with Model-Written Evaluations
  • Citing Preprint
  • December 2022

... An example of hierarchical feature representations processed within a CNN is shown in Figure 3.3. "mechanistic" understanding, particularly within the AI safety community (e.g., Olah et al., 2017; Cammarata et al., 2021; Elhage et al., 2021; Chan et al., 2022; Christiano, 2022; Olsson et al., 2022; Bricken et al., 2023a; Cunningham et al., 2023; Conmy et al., 2023; Schwettmann et al., 2023). This trend is evident in the mechanistic interpretability movement, which aims to go beyond simple input-output analysis and examine the internal workings of AI models to enhance epistemic trust, aid in debugging, remove biases, and prevent models from "going rogue." ...

In-context Learning and Induction Heads

... This technique, known as 'step-around prompt engineering', allows for an unfiltered view of the AI algorithm by exploiting gaps within the software architecture to remove the ethical and moral restrictions often embedded in these systems. Similar to the concept of "red teaming" in cybersecurity, where specialized teams deliberately probe systems with adversarial techniques to uncover vulnerabilities regardless of their perceived likelihood, step-around prompting can be used to identify potential risks and weaknesses in AI systems (Ganguli et al. 2022). When used responsibly, this approach can help identify inadvertent inclusions of bias, misinformation, and incivility entrenched in the information used to train GenAI models (Fabian, 2023; Anthropic, 2023). ...

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

... Prior work by Kadavath et al. has found that pre-trained LLMs provide well-calibrated true/false self-evaluation on factual questions. [69] Finally, by the classical logical Principle of Non-Contradiction, [70] a statement and its negation cannot both be true. This allows us to generate an internal consistency test, where we have GPT-4o rewrite the original rule by the user prompt "Rewrite the following sentence so that it would become false: [reason]". ...

Language Models (Mostly) Know What They Know