Esin Durmus’s research while affiliated with Stanford University and other places


Publications (56)


Which Economic Tasks are Performed with AI? Evidence from Millions of Claude Conversations
  • Preprint

February 2025 · 9 Reads

Kunal Handa · Alex Tamkin · Miles McCain · [...] · Deep Ganguli

Despite widespread speculation about artificial intelligence's impact on the future of work, we lack systematic empirical evidence about how these systems are actually being used for different tasks. Here, we present a novel framework for measuring AI usage patterns across the economy. We leverage a recent privacy-preserving system to analyze over four million Claude.ai conversations through the lens of tasks and occupations in the U.S. Department of Labor's O*NET Database. Our analysis reveals that AI usage primarily concentrates in software development and writing tasks, which together account for nearly half of total usage. However, AI usage extends more broadly across the economy, with approximately 36% of occupations using AI for at least a quarter of their associated tasks. We also analyze how AI is being used for tasks, finding that 57% of usage suggests augmentation of human capabilities (e.g., learning or iterating on an output) while 43% suggests automation (e.g., fulfilling a request with minimal human involvement). While our data and methods face important limitations and only paint a picture of AI usage on a single platform, they provide an automated, granular approach for tracking AI's evolving role in the economy and identifying leading indicators of future impact as these technologies continue to advance.
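
To make the occupation-level numbers above concrete, here is a minimal sketch of how task-labeled conversations could be rolled up against an O*NET-style task-to-occupation mapping. The mapping, task labels, and 25% threshold are illustrative placeholders, not the paper's actual data or pipeline.

```python
from collections import defaultdict

# Hypothetical O*NET-style mapping and conversation task labels (placeholders only).
task_to_occupation = {
    "debug code": "Software Developers",
    "write documentation": "Software Developers",
    "design databases": "Software Developers",
    "edit manuscripts": "Writers and Authors",
    "draft articles": "Writers and Authors",
}
conversation_task_labels = ["debug code", "debug code", "edit manuscripts"]

def occupation_coverage(task_to_occ, observed_tasks, threshold=0.25):
    """Fraction of occupations with AI usage on at least `threshold` of their tasks."""
    occ_tasks = defaultdict(set)
    for task, occ in task_to_occ.items():
        occ_tasks[occ].add(task)
    observed = set(observed_tasks)
    covered = sum(
        1 for tasks in occ_tasks.values()
        if len(tasks & observed) / len(tasks) >= threshold
    )
    return covered / len(occ_tasks)

share = occupation_coverage(task_to_occupation, conversation_task_labels)
print(f"Occupations with AI usage on >=25% of their tasks: {share:.0%}")
```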


Clio: Privacy-Preserving Insights into Real-World AI Use

December 2024 · 15 Reads

How are AI assistants being used in the real world? While model providers in theory have a window into this impact via their users' data, both privacy concerns and practical challenges have made analyzing this data difficult. To address these issues, we present Clio (Claude insights and observations), a privacy-preserving platform that uses AI assistants themselves to analyze and surface aggregated usage patterns across millions of conversations, without the need for human reviewers to read raw conversations. Through extensive evaluations, we validate that this can be done with a high degree of accuracy and privacy. We demonstrate Clio's usefulness in two broad ways. First, we share insights about how models are being used in the real world from one million Claude.ai Free and Pro conversations, ranging from advice on hairstyles to guidance on Git operations and concepts. We also identify the most common high-level use cases on Claude.ai (coding, writing, and research tasks) as well as patterns that differ across languages (e.g., conversations in Japanese discuss elder care and aging populations at higher-than-typical rates). Second, we use Clio to make our systems safer by identifying coordinated attempts to abuse our systems, monitoring for unknown unknowns during critical periods like launches of new capabilities or major world events, and improving our existing monitoring systems. We also discuss the limitations of our approach, as well as risks and ethical concerns. By enabling analysis of real-world AI usage, Clio provides a scalable platform for empirically grounded AI safety and governance.
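
The core privacy mechanism, aggregating before anything is surfaced, can be sketched roughly as follows. This is not Clio's implementation: the real system uses model-generated facet summaries and multiple additional safeguards, while this toy version stands in TF-IDF features, k-means clustering, and a single minimum-cluster-size threshold.

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy conversations standing in for raw user data.
conversations = [
    "how do I rebase a git branch",
    "explain git merge conflicts",
    "suggest a haircut for curly hair",
    "what hairstyle suits a round face",
    "undo last git commit",
    "best products for wavy hair",
]

MIN_CLUSTER_SIZE = 3  # privacy threshold: never report clusters smaller than this

vectors = TfidfVectorizer().fit_transform(conversations)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for cluster_id, size in Counter(labels).items():
    if size < MIN_CLUSTER_SIZE:
        continue  # too small to surface without privacy risk
    members = [c for c, l in zip(conversations, labels) if l == cluster_id]
    # Stand-in for an LLM-written cluster summary: report only aggregate information.
    print(f"cluster {cluster_id} (n={size}): e.g. {members[0]!r}")
```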


Sabotage Evaluations for Frontier Models

October 2024 · 1 Citation

Sufficiently capable models could subvert human oversight and decision-making in important contexts. For example, in the context of AI development, models could covertly sabotage efforts to evaluate their own dangerous capabilities, to monitor their behavior, or to make decisions about their deployment. We refer to this family of abilities as sabotage capabilities. We develop a set of related threat models and evaluations. These evaluations are designed to provide evidence that a given model, operating under a given set of mitigations, could not successfully sabotage a frontier model developer or other large organization's activities in any of these ways. We demonstrate these evaluations on Anthropic's Claude 3 Opus and Claude 3.5 Sonnet models. Our results suggest that for these models, minimal mitigations are currently sufficient to address sabotage risks, but that more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve. We also survey related evaluations we tried and abandoned. Finally, we discuss the advantages of mitigation-aware capability evaluations, and of simulating large-scale deployments using small-scale statistics.
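
One idea mentioned above, simulating large-scale deployments using small-scale statistics, can be illustrated with a simple extrapolation: bound the per-episode sabotage success rate from a small evaluation sample, then ask how likely at least one success becomes over many deployed episodes. The sample sizes, flat Beta prior, and independence assumption below are illustrative and are not the paper's evaluation protocol.

```python
from scipy.stats import beta

successes = 0                   # sabotage successes observed in the small-scale evaluation
trials = 200                    # evaluation episodes run
deployment_episodes = 100_000   # hypothetical scale of a real deployment

# 95% upper confidence bound on the per-episode rate (Beta posterior with a flat prior).
p_upper = beta.ppf(0.95, successes + 1, trials - successes + 1)

# Probability of at least one success across the deployment, assuming independent episodes.
p_any = 1 - (1 - p_upper) ** deployment_episodes

print(f"per-episode rate upper bound: {p_upper:.4f}")
print(f"P(at least one success in {deployment_episodes} episodes) <= {p_any:.3f}")
```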


Fig. 1: Examples of racial discrimination disclosures shared online, and rates of algorithmic and human flagging for racial discrimination vs. negative interpersonal experience disclosures.
Fig. 2: Lexical markers of affect predict algorithmic flagging (Study 1b); psychological belonging and threat and tone policing predict human flagging (Study 2b).
Fig. 4: Moderation guidelines used in Study 4, and flagging rates of discrimination disclosures by condition and by political orientation.
People who share encounters with racism are silenced online by humans and machines, but a guideline-reframing intervention holds promise
  • Article
  • Full-text available

September 2024 · 58 Reads · 5 Citations

Proceedings of the National Academy of Sciences

Are members of marginalized communities silenced on social media when they share personal experiences of racism? Here, we investigate the role of algorithms, humans, and platform guidelines in suppressing disclosures of racial discrimination. In a field study of actual posts from a neighborhood-based social media platform, we find that when users talk about their experiences as targets of racism, their posts are disproportionately flagged for removal as toxic by five widely used moderation algorithms from major online platforms, including the most recent large language models. We show that human users disproportionately flag these disclosures for removal as well. Next, in a follow-up experiment, we demonstrate that merely witnessing such suppression negatively influences how Black Americans view the community and their place in it. Finally, to address these challenges to equity and inclusion in online spaces, we introduce a mitigation strategy: a guideline-reframing intervention that is effective at reducing silencing behavior across the political spectrum.
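
The kind of disparity analysis described above can be sketched as a comparison of flag rates between two categories of posts, with a bootstrap confidence interval on the gap. The flag decisions below are synthetic placeholders, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1 = flagged for removal, 0 = not flagged (synthetic placeholder decisions).
flags_discrimination = rng.binomial(1, 0.35, size=400)
flags_other_negative = rng.binomial(1, 0.15, size=400)

def bootstrap_diff(a, b, n_boot=10_000):
    """95% bootstrap CI on the difference in mean flag rates."""
    diffs = [rng.choice(a, a.size).mean() - rng.choice(b, b.size).mean()
             for _ in range(n_boot)]
    return np.percentile(diffs, [2.5, 97.5])

point = flags_discrimination.mean() - flags_other_negative.mean()
ci_low, ci_high = bootstrap_diff(flags_discrimination, flags_other_negative)
print(f"flag-rate gap: {point:.3f} (95% bootstrap CI {ci_low:.3f} to {ci_high:.3f})")
```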


Figure 1: RCT estimates of political persuasion with large language models, covering all known studies that randomised participants to LLM-generated political messages and measured post-treatment attitudes; simple differences in mean outcomes by condition (with 95% CIs) are shown, along with a descriptive meta-analytic average that should not be over-interpreted given the substantial heterogeneity across models, treatment formats, reference conditions, and political issues.
How will advanced AI systems impact democracy?

August 2024 · 491 Reads

Advanced AI systems capable of generating humanlike text and multimodal content are now widely available. In this paper, we discuss the impacts that generative artificial intelligence may have on democratic processes. We consider the consequences of AI for citizens' ability to make informed choices about political representatives and issues (epistemic impacts). We ask how AI might be used to destabilise or support democratic mechanisms like elections (material impacts). Finally, we discuss whether AI will strengthen or weaken democratic principles (foundational impacts). It is widely acknowledged that new AI systems could pose significant challenges for democracy. However, it has also been argued that generative AI offers new opportunities to educate and learn from citizens, strengthen public discourse, help people find common ground, and reimagine how democracies might work better.


Collective Constitutional AI: Aligning a Language Model with Public Input

June 2024 · 20 Reads

There is growing consensus that language model (LM) developers should not be the sole deciders of LM behavior, creating a need for methods that enable the broader public to collectively shape the behavior of LM systems that affect them. To address this need, we present Collective Constitutional AI (CCAI): a multi-stage process for sourcing and integrating public input into LMs, from identifying a target population to sourcing principles to training and evaluating a model. We demonstrate the real-world practicality of this approach by creating what is, to our knowledge, the first LM fine-tuned with collectively sourced public input and evaluating this model against a baseline model trained with established principles from an LM developer. Our quantitative evaluations demonstrate several benefits of our approach: the CCAI-trained model shows lower bias across nine social dimensions compared to the baseline model, while maintaining equivalent performance on language, math, and helpful-harmless evaluations. Qualitative comparisons of the models suggest that the models differ on the basis of their respective constitutions; for example, when prompted with contentious topics, the CCAI-trained model tends to generate responses that reframe the matter positively rather than refusing. These results demonstrate a promising, tractable pathway toward publicly informed development of language models.
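
As a rough illustration of the "sourcing and integrating public input" step, the sketch below keeps publicly sourced principles that clear a broad-agreement bar and numbers them as constitution entries. The statements, vote counts, and 75% threshold are hypothetical, and the real CCAI process involved a structured public deliberation and curation steps not captured here.

```python
# Hypothetical (statement, agree_votes, total_votes) triples from a public input process.
submissions = [
    ("The AI should not give medical advice without caveats.", 812, 900),
    ("The AI should always agree with the user.", 120, 880),
    ("The AI should explain its reasoning when asked.", 790, 870),
]

AGREEMENT_THRESHOLD = 0.75  # assumed bar for inclusion in the constitution

constitution = [
    statement for statement, agree, total in submissions
    if total > 0 and agree / total >= AGREEMENT_THRESHOLD
]

for i, principle in enumerate(constitution, 1):
    print(f"{i}. {principle}")
```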



Figure 1: Selected annotator ratings of summary coherence on a 1 to 5 Likert scale.
Figure 3: System-level Rouge-L vs. annotator rated relevance scores.
Benchmarking Large Language Models for News Summarization

January 2024 · 93 Reads · 405 Citations

Transactions of the Association for Computational Linguistics

Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human-written summaries.
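
A system-level comparison like the one behind Figure 3 can be sketched by averaging each system's ROUGE-L against a reference and correlating those averages with human relevance ratings. The reference, system outputs, and ratings below are toy placeholders, not the paper's benchmark data.

```python
from rouge_score import rouge_scorer
from scipy.stats import spearmanr

reference = "the council approved the new budget on tuesday"
system_outputs = {
    "system_a": ["the council approved the budget on tuesday"],
    "system_b": ["a budget was discussed"],
    "system_c": ["the new budget was approved by the council"],
}
human_relevance = {"system_a": 4.6, "system_b": 2.1, "system_c": 4.2}  # hypothetical 1-5 ratings

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

# System-level ROUGE-L: average F-measure over each system's outputs.
rouge_by_system = {
    name: sum(scorer.score(reference, out)["rougeL"].fmeasure for out in outs) / len(outs)
    for name, outs in system_outputs.items()
}

systems = sorted(rouge_by_system)
rho, p = spearmanr([rouge_by_system[s] for s in systems],
                   [human_relevance[s] for s in systems])
print(f"system-level Spearman correlation: {rho:.2f} (p={p:.2f})")
```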



Studying Large Language Model Generalization with Influence Functions

August 2023 · 29 Reads · 3 Citations

When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, one potentially valuable source of evidence is which training examples most contribute to a given behavior. Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a given sequence were added to the training set? While influence functions have produced insights for small models, they are difficult to scale to large language models (LLMs) due to the difficulty of computing an inverse-Hessian-vector product (IHVP). We use the Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation to scale influence functions to LLMs with up to 52 billion parameters. In our experiments, EK-FAC achieves similar accuracy to traditional influence function estimators despite the IHVP computation being orders of magnitude faster. We investigate two algorithmic techniques to reduce the cost of computing gradients of candidate training sequences: TF-IDF filtering and query batching. We use influence functions to investigate the generalization patterns of LLMs, including the sparsity of the influence patterns, increasing abstraction with scale, math and programming abilities, cross-lingual generalization, and role-playing behavior. Despite many apparently sophisticated forms of generalization, we identify a surprising limitation: influences decay to near-zero when the order of key phrases is flipped. Overall, influence functions give us a powerful new tool for studying the generalization properties of LLMs.
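
The counterfactual question above reduces to an inner product between a query gradient and an inverse-Hessian-vector product applied to a training-example gradient. The toy sketch below replaces the paper's EK-FAC approximation with a damped identity Hessian on a tiny linear model, so it only illustrates the shape of the computation, not a usable estimator for LLMs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)   # tiny stand-in for an LLM
loss_fn = nn.MSELoss()
damping = 1.0             # damping term for the identity-Hessian approximation

def flat_grad(x, y):
    """Flattened gradient of the loss on (x, y) with respect to all parameters."""
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# Hypothetical query example and candidate training example.
x_query, y_query = torch.randn(1, 4), torch.randn(1, 1)
x_train, y_train = torch.randn(1, 4), torch.randn(1, 1)

g_query = flat_grad(x_query, y_query)
g_train = flat_grad(x_train, y_train)

# influence(query, train) ~ -g_query^T (H + damping*I)^{-1} g_train, with H set to 0 here;
# the paper approximates the true IHVP with EK-FAC instead.
influence = -(g_query @ g_train) / damping
print(f"approximate influence of the training example on the query loss: {influence.item():.4f}")
```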


Citations (38)


... AI systems are beginning to show increasing levels of dual-use and dangerous capabilities (Park et al., 2024; Phuong et al., 2024). Deception, autonomous R&D, and assistance to CBRN threat actors are the most well known of these dangerous capabilities, as a result of internal and external evaluations of the leading AI models on the frontier (Benton et al., 2024; Kinniment et al., 2024; UK AI Safety Institute, 2024a, 2024b). They are not the only ones that AI safety experts anticipate: a selection of additional risks include multi-agent risks, such as collusion between AI systems, systemic risks such as the shrinking of human agency, and power-seeking behaviour when combined with long-term planning or strategising (Bengio, Hinton, et al., 2024; Hendrycks et al., 2023). ...

Reference:

Quantifying detection rates for dangerous capabilities: a theoretical model of dangerous capability evaluations
Sabotage Evaluations for Frontier Models
  • Citing Preprint
  • October 2024

... Similar to other algorithmic systems deployed in complex sociotechnical contexts, such as filtering and content moderation, these tools continue to represent the threat of bias and reproduction of existing structural inequalities, increasing the potential of silencing minority voices [35,72]. While technologies may be developed with good intentions, a lack of grounding in local, contextually situated practices can have a variety of unintended consequences. ...

People who share encounters with racism are silenced online by humans and machines, but a guideline-reframing intervention holds promise

Proceedings of the National Academy of Sciences

... Our research aligns closely with studies at the intersection of NLP and content moderation. As demonstrated by Gligoric et al. (2024), distinguishing reliably between the use and mention of harmful content using NLP methods is exceedingly challenging. Gligoric et al. (2024) argues that the use of words to convey a speaker's intent is traditionally distinguished from the mention of words for quoting or describing their properties. ...

NLP Systems That Can’t Tell Use from Mention Censor Counterspeech, but Teaching the Distinction Helps
  • Citing Conference Paper
  • January 2024

... Constitutional AI represents a promising approach to addressing bias at the point of reproduction in AI systems, operating as self-supervising mechanisms that adhere to predefined ethical guidelines during model training and inference. These systems implement explicit constraints on AI outputs through embedded rule frameworks that filter potentially problematic responses before they reach the users (Huang et al., 2024). Building upon this foundation, Anthropic has developed specific constitutional classifiers (Sharma et al., 2025) that attempt to minimize biases by developing self-supervising models that are robust against jailbreaks without the need for significant computing power, potentially reducing the need for external oversight and intervention. ...

Collective Constitutional AI: Aligning a Language Model with Public Input
  • Citing Conference Paper
  • June 2024

... Large language models (LLMs) (Zhao et al., 2023; Achiam et al., 2023; Dubey et al., 2024; Anthropic, 2024; AI, 2024) have enhanced their long-text processing capabilities, improving performance in multi-turn dialogues (Chiang et al., 2023), document summarization (Zhang et al., 2024a), question answering (Kamalloo et al., 2023), and information retrieval (Liu et al., 2024c). New models such as GPT-4 (Achiam et al., 2023), Claude 3.5 (Anthropic, 2024), LLaMA 3.1 (Dubey et al., 2024) and Mistral Large 2 (AI, 2024) have extended token processing capacities beyond 128K. ...

Benchmarking Large Language Models for News Summarization

Transactions of the Association for Computational Linguistics

... The pretraining factors that we consider are the size of the model (measured in the number of parameters), model architecture, pre-training dataset, and size of the pretraining dataset (measured in the number of samples). Model architecture has been tied to bias in other modalities (Ladhak et al., 2023), and model size and pre-training dataset have been tied to bias in nine CLIP models by Berg et al. (2022). ...

When Do Pre-Training Biases Propagate to Downstream Tasks? A Case Study in Text Summarization
  • Citing Conference Paper
  • January 2023

... Central to the discourse on LLMs is the debate over whether their capabilities stem from merely memorizing massive pretraining corpus [33,9], or extend from a deeper understanding of human language and the ability to generalize their knowledge to new tasks and settings [24,4]. Recently, a phenomenon identified within LLMs, termed the "reversal curse", suggests that LLMs struggle to generalize beyond their training text [2,12]. The curse manifests as models after being trained on the fact that "A is B" failing to infer that "B is A". ...

Studying Large Language Model Generalization with Influence Functions
  • Citing Preprint
  • August 2023

... Factual Consistency of Summarization. The factual consistency in abstractive summarization has received increasing attention recently. Existing work has proposed various methods to improve the factual consistency, such as contrastive learning (Wan and Bansal, 2022), adversarial learning (Wang et al., 2022), textual entailment (Zhang et al., 2022b; Roit et al., 2023), and post-editing (Balachandran et al., 2022). However, these methods cannot be directly applied in long document summarization due to the difficulty of modeling long texts accurately. ...

Improving Faithfulness by Augmenting Negative Summaries from Fake Documents
  • Citing Conference Paper
  • January 2022

... Larger models such as GPT-3 excel at surface-level generation but remain limited in causal reasoning, indicating architectural constraints in effectively modelling semantics and world knowledge (Holmes and Tuomi 2022). Reviewing thirdparty experiments that stress test models against adversarial examples and common fallacies reveals robustness issues stemming from architectural design tradeoffs (Gehrmann et al. 2022). Being attuned to cutting-edge research around architectural improvements also surfaces current limitations in state-of-the-art generative models (Fuhr and Sumpter 2022). ...

GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
  • Citing Conference Paper
  • January 2022

... The higher the SARI and BERTScore, the better the simplification is. More details of these metrics are in Appendix A. We do not use N-gram based metrics such as BLEU [59] because it is not suitable for evaluating text simplification [75]. Recently, several learnable TS metrics have been proposed [18,37,43,93]. However, these metrics are based on a neural model trained using human-annotated data, which is not available for low-resource languages such as Sinhala. ...

Towards Reference-free Text Simplification Evaluation with a BERT Siamese Network Architecture
  • Citing Conference Paper
  • January 2023