Yulia Tsvetkov's research while affiliated with the University of Washington and other institutions


Publications (223)


Medical Hallucination in Foundation Models and Their Impact on Healthcare
  • Preprint
  • File available

March 2025 · 52 Reads

Yubin Kim · Hyewon Jeong · Shen Chen · [...]
Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios. Our contributions include (1) a taxonomy for understanding and addressing medical hallucinations, (2) benchmarking models using medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations, and (3) a multi-national clinician survey on their experiences with medical hallucinations. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. However, despite these improvements, non-trivial levels of hallucination persist. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies, establishing a foundation for regulatory policies that prioritize patient safety and maintain clinical integrity as AI becomes more integrated into healthcare. The feedback from clinicians highlights the urgent need for not only technical advances but also for clearer ethical and regulatory guidelines to ensure patient safety. A repository organizing the paper resources, summaries, and additional information is available at https://github.com/mitmedialab/medical_hallucination.
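The abstract reports that inference-time techniques such as Chain-of-Thought (CoT) prompting reduce hallucination rates. The snippet below is a minimal illustration of that comparison, not the paper's benchmark code: `ask_model` is a hypothetical stand-in for any LLM API, and the prompt templates are assumptions, so that the direct and CoT outputs for the same medical question can be checked against a physician-annotated reference.

```python
# Minimal sketch contrasting a direct prompt with a chain-of-thought (CoT) prompt
# for a medical question. `ask_model` is a hypothetical callable standing in for
# any LLM API; the templates below are illustrative assumptions.

from typing import Callable

def direct_prompt(question: str) -> str:
    return f"Answer the following medical question concisely.\n\nQuestion: {question}\nAnswer:"

def cot_prompt(question: str) -> str:
    return (
        "Answer the following medical question. First reason step by step about the "
        "relevant clinical evidence, then state your final answer. If you are not "
        "certain, say so explicitly rather than guessing.\n\n"
        f"Question: {question}\nReasoning:"
    )

def compare(question: str, ask_model: Callable[[str], str]) -> dict:
    """Query the same model with both prompt styles so the two outputs can be
    compared against a reference answer."""
    return {
        "direct": ask_model(direct_prompt(question)),
        "cot": ask_model(cot_prompt(question)),
    }

if __name__ == "__main__":
    # Stub model so the sketch runs without an API key.
    echo = lambda p: f"<model output for prompt of {len(p)} chars>"
    print(compare("What is the first-line treatment for uncomplicated hypertension?", echo))
```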


Figure 1: Effective information-seeking questions are crucial for clinical reasoning. ALFA-aligned models can ask better questions and lead to more accurate diagnoses.
Figure 7: Expert preference ranking results showing pairwise win-rates.
Table: Synthetic data quality. Filtering slightly improves diagnostic accuracy.
Table (ablation): Removing any attribute leads to performance drops, confirming the importance of clinical attributes in question-asking.
Aligning LLMs to Ask Good Questions: A Case Study in Clinical Reasoning

February 2025 · 3 Reads

Large language models (LLMs) often fail to ask effective questions under uncertainty, making them unreliable in domains where proactive information-gathering is essential for decision-making. We present ALFA, a framework that improves LLM question-asking by (i) decomposing the notion of a "good" question into a set of theory-grounded attributes (e.g., clarity, relevance), (ii) controllably synthesizing attribute-specific question variations, and (iii) aligning models via preference-based optimization to explicitly learn to ask better questions along these fine-grained attributes. Focusing on clinical reasoning as a case study, we introduce the MediQ-AskDocs dataset, composed of 17k real-world clinical interactions augmented with 80k attribute-specific preference pairs of follow-up questions, as well as a novel expert-annotated interactive healthcare QA task to evaluate question-asking abilities. Models aligned with ALFA reduce diagnostic errors by 56.6% on MediQ-AskDocs compared to SOTA instruction-tuned LLMs, with a question-level win-rate of 64.4% and strong generalizability. Our findings suggest that explicitly guiding question-asking with structured, fine-grained attributes offers a scalable path to improve LLMs, especially in expert application domains.
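As a rough illustration of step (ii) above, the sketch below builds attribute-specific preference pairs from a seed follow-up question. It is a minimal sketch under stated assumptions: `rewrite_question` is a hypothetical LLM call, and the attribute list follows only the examples named in the abstract (clarity, relevance), not the full ALFA attribute set.

```python
# Illustrative sketch: for each seed follow-up question, synthesize an
# attribute-specific variation and store it as a (chosen, rejected) preference
# pair for later DPO-style alignment. `rewrite_question` is a hypothetical LLM call.

from dataclasses import dataclass
from typing import Callable, List

ATTRIBUTES = ["clarity", "relevance"]  # example attributes named in the abstract

@dataclass
class PreferencePair:
    context: str      # patient narrative so far
    chosen: str       # question improved along one attribute
    rejected: str     # original seed question
    attribute: str

def build_pairs(context: str, seed_question: str,
                rewrite_question: Callable[[str, str, str], str]) -> List[PreferencePair]:
    """Produce one preference pair per attribute for a single seed question."""
    pairs = []
    for attr in ATTRIBUTES:
        improved = rewrite_question(context, seed_question, attr)
        pairs.append(PreferencePair(context, improved, seed_question, attr))
    return pairs
```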


Political Neutrality in AI is Impossible - But Here is How to Approximate it

February 2025 · 1 Read

AI systems often exhibit political bias, influencing users' opinions and decision-making. While political neutrality, defined as the absence of bias, is often seen as an ideal solution for fairness and safety, this position paper argues that true political neutrality is neither feasible nor universally desirable due to its subjective nature and the biases inherent in AI training data, algorithms, and user interactions. However, inspired by Joseph Raz's philosophical insight that "neutrality [...] can be a matter of degree" (Raz, 1986), we argue that striving for some neutrality remains essential for promoting balanced AI interactions and mitigating user manipulation. Therefore, we use the term "approximation" of political neutrality to shift the focus from unattainable absolutes to achievable, practical proxies. We propose eight techniques for approximating neutrality across three levels of conceptualizing AI, examining their trade-offs and implementation strategies. In addition, we explore two concrete applications of these approximations to illustrate their practicality. Finally, we assess our framework on current large language models (LLMs) at the output level, providing a demonstration of how it can be evaluated. This work seeks to advance nuanced discussions of political neutrality in AI and promote the development of responsible, aligned language models.


Figure 1. Despite the quest for general-purpose models, a single LLM suffers from underrepresentation of data (language varieties, domains, styles), skills (reasoning abilities, linguistic and communication skills, creative capacities, and technical competencies), and people (opinions, values, cultural norms).
When One LLM Drools, Multi-LLM Collaboration Rules

February 2025 · 125 Reads

This position paper argues that in many realistic (i.e., complex, contextualized, subjective) scenarios, one LLM is not enough to produce a reliable output. We challenge the status quo of relying solely on a single general-purpose LLM and argue for multi-LLM collaboration to better represent the extensive diversity of data, skills, and people. We first posit that a single LLM underrepresents real-world data distributions, heterogeneous skills, and pluralistic populations, and that such representation gaps cannot be trivially patched by further training a single LLM. We then organize existing multi-LLM collaboration methods into a hierarchy, based on the level of access and information exchange, ranging from API-level, text-level, logit-level, to weight-level collaboration. Based on these methods, we highlight how multi-LLM collaboration addresses challenges that a single LLM struggles with, such as reliability, democratization, and pluralism. Finally, we identify the limitations of existing multi-LLM methods and motivate future work. We envision multi-LLM collaboration as an essential path toward compositional intelligence and collaborative AI development.
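To make the collaboration hierarchy more concrete, the toy snippet below illustrates the logit-level rung: two models expose next-token logits, and a simple weighted average is taken before sampling. This is a generic sketch of logit-level ensembling, not a specific method from the paper.

```python
# Toy illustration of logit-level multi-LLM collaboration: average two models'
# next-token logits, convert to probabilities, and sample a token id.

import numpy as np

def ensemble_next_token(logits_a: np.ndarray, logits_b: np.ndarray,
                        weight_a: float = 0.5, rng=None) -> int:
    """Mix two models' next-token logits and sample a token id from the result."""
    assert logits_a.shape == logits_b.shape
    mixed = weight_a * logits_a + (1.0 - weight_a) * logits_b
    probs = np.exp(mixed - mixed.max())   # numerically stable softmax
    probs /= probs.sum()
    rng = rng or np.random.default_rng(0)
    return int(rng.choice(len(probs), p=probs))

# Example with a tiny 5-token vocabulary.
print(ensemble_next_token(np.array([2.0, 0.1, -1.0, 0.5, 0.0]),
                          np.array([0.0, 1.5,  2.5, 0.5, 0.0])))
```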


Figure 1. Our objective: given a pool of LLMs and a task utility function f, discover a multi-LLM system with graph-based model roles and adapted model weights tailored to f.
Figure 10. Working example one of HETEROGENEOUS SWARMS.
Table: Encouraging sparsity in multi-LLM systems with thresholded pruning (τ) or normalization (λ); these strategies bring various trade-offs between performance and inference speedup.
Table: Performance of HETEROGENEOUS SWARMS with 10 Mistral-based LLMs.
Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems

February 2025 · 2 Reads

We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by jointly optimizing model roles and weights. We represent multi-LLM systems as directed acyclic graphs (DAGs) of LLMs with topological message passing for collaborative generation. Given a pool of LLM experts and a utility function, Heterogeneous Swarms employs two iterative steps: role-step and weight-step. For role-step, we interpret model roles as learning a DAG that specifies the flow of inputs and outputs between LLMs. Starting from a swarm of random continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs in topological order, evaluate on the utility function (e.g. accuracy on a task), and optimize the adjacency matrices with particle swarm optimization based on the utility score. For weight-step, we assess the contribution of individual LLMs in the multi-LLM systems and optimize model weights with swarm intelligence. We propose JFK-score to quantify the individual contribution of each LLM in the best-found DAG of the role-step, then optimize model weights with particle swarm optimization based on the JFK-score. Experiments demonstrate that Heterogeneous Swarms outperforms 15 role- and/or weight-based baselines by 18.5% on average across 12 tasks. Further analysis reveals that Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles and substantial collaborative gains, and benefits from the diversity of language models.
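A highly simplified sketch of the role-step idea follows: a continuous adjacency matrix is thresholded into a DAG (restricting edges to the upper triangle of a fixed node ordering guarantees acyclicity), and the LLM "nodes" are then called in topological order with messages flowing along the edges. The decoding rule and the `call_llm` interface are illustrative assumptions, not the paper's exact procedure; the subsequent particle swarm optimization over adjacency matrices and weights is omitted.

```python
# Simplified sketch of decoding a continuous adjacency matrix into a DAG of LLMs
# and running topological message passing over it. Illustrative only.

import numpy as np
from typing import Callable, List

def decode_dag(adj: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Keep edges above a threshold; using only the upper triangle of a fixed
    node ordering guarantees the decoded graph is acyclic."""
    return (np.triu(adj, k=1) > threshold).astype(int)

def run_swarm(adj: np.ndarray, task_input: str,
              call_llm: Callable[[int, str], str]) -> str:
    """Call each LLM in topological order, feeding it the task plus the outputs
    of its predecessors; return the final node's output."""
    dag = decode_dag(adj)
    n = dag.shape[0]
    outputs: List[str] = []
    for node in range(n):  # node indices already form a topological order
        parents = [outputs[p] for p in range(node) if dag[p, node]]
        prompt = task_input + "\n" + "\n".join(parents)
        outputs.append(call_llm(node, prompt))
    return outputs[-1]
```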


Investigating machine moral judgement through the Delphi experiment

January 2025 · 43 Reads · 4 Citations

Nature Machine Intelligence

As our society adopts increasingly powerful artificial intelligence (AI) systems for pervasive use, there are growing concerns about machine morality—or lack thereof. Millions of users already rely on the outputs of AI systems, such as chatbots, as decision aids. Meanwhile, AI researchers continue to grapple with the challenge of aligning these systems with human morality and values. In response to this challenge, we build and test Delphi, an open-source AI system trained to predict the moral judgements of US participants. The computational framework of Delphi is grounded in the framework proposed by the prominent moral philosopher John Rawls. Our results speak to the promises and limits of teaching machines about human morality. Delphi demonstrates improved generalization capabilities over those exhibited by off-the-shelf neural language models. At the same time, Delphi’s failures also underscore important challenges in this arena. For instance, Delphi has limited cultural awareness and is susceptible to pervasive biases. Despite these shortcomings, we demonstrate several compelling use cases of Delphi, including its incorporation as a component within an ensemble of AI systems. Finally, we computationally demonstrate the potential of Rawls’s prospect of hybrid approaches for reliable moral reasoning, inspiring future research in computational morality.


Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning

December 2024 · 12 Reads

Do large language models (LLMs) have theory of mind? A plethora of papers and benchmarks have been introduced to evaluate if current models have been able to develop this key ability of social intelligence. However, all rely on limited datasets with simple patterns that can potentially lead to problematic blind spots in evaluation and an overestimation of model capabilities. We introduce ExploreToM, the first framework to allow large-scale generation of diverse and challenging theory of mind data for robust training and evaluation. Our approach leverages an A* search over a custom domain-specific language to produce complex story structures and novel, diverse, yet plausible scenarios to stress test the limits of LLMs. Our evaluation reveals that state-of-the-art LLMs, such as Llama-3.1-70B and GPT-4o, show accuracies as low as 0% and 9% on ExploreToM-generated data, highlighting the need for more robust theory of mind evaluation. As our generations are a conceptual superset of prior work, fine-tuning on our data yields a 27-point accuracy improvement on the classic ToMi benchmark (Le et al., 2019). ExploreToM also enables uncovering underlying skills and factors missing for models to show theory of mind, such as unreliable state tracking or data imbalances, which may contribute to models' poor performance on benchmarks.
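To make "an A* search over a custom domain-specific language" concrete, the skeleton below searches over sequences of story actions using a user-supplied cost, heuristic, and goal test. The action set, scoring functions, and length limit are stand-ins, not the paper's actual DSL or search objective.

```python
# Generic A* skeleton over sequences of story actions (a stand-in for the
# paper's domain-specific language). Cost, heuristic, and goal test are supplied
# by the caller.

import heapq
from typing import Callable, List, Tuple

def a_star_story_search(
    actions: List[str],
    cost: Callable[[Tuple[str, ...]], float],       # e.g. negative "challenge" score so far
    heuristic: Callable[[Tuple[str, ...]], float],  # optimistic estimate of remaining cost
    is_goal: Callable[[Tuple[str, ...]], bool],
    max_len: int = 6,
) -> Tuple[str, ...]:
    """Expand partial stories in order of cost + heuristic until a goal story is found."""
    frontier: List[Tuple[float, Tuple[str, ...]]] = [(heuristic(()), ())]
    while frontier:
        _, story = heapq.heappop(frontier)
        if is_goal(story):
            return story
        if len(story) >= max_len:
            continue
        for act in actions:
            nxt = story + (act,)
            heapq.heappush(frontier, (cost(nxt) + heuristic(nxt), nxt))
    return ()
```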


Figure 2: Human Evaluation. The proportions of human annotators' preference labels for our model (COMPO) and the baseline (DPO-NC).
Table: Instances where the COMPO-trained model's response is unanimously rated as better than the baseline's.
Table: History data statistics.
ComPO: Community Preferences for Language Model Personalization

October 2024 · 20 Reads

Conventional algorithms for training language models (LMs) with human feedback rely on preferences that are assumed to account for an "average" user, disregarding subjectivity and finer-grained variations. Recent studies have raised concerns that aggregating such diverse and often contradictory human feedback to finetune models results in generic models that generate outputs not preferred by many user groups, as they tend to average out styles and norms. To address this issue, we draw inspiration from recommendation systems and propose ComPO, a method to personalize preference optimization in LMs by contextualizing the probability distribution of model outputs with the preference provider. Focusing on group-level preferences rather than individuals, we collect and release ComPRed, a question answering dataset with community-level preferences from Reddit. This dataset facilitates studying diversity in preferences without incurring privacy concerns associated with individual feedback. Our experiments reveal that conditioning language models on a community identifier (i.e., subreddit name) during preference tuning substantially enhances model performance. Conversely, replacing this context with random subreddit identifiers significantly diminishes performance, highlighting the effectiveness of our approach in tailoring responses to communities' preferences.
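The core mechanism, conditioning the model on the preference provider, amounts to prepending a community identifier to the prompt of each preference pair before tuning. The sketch below shows one such contextualization step; the exact prompt template is an assumption, and the resulting examples can be passed unchanged to any standard preference-optimization trainer.

```python
# Sketch of contextualizing a preference example with a community identifier
# (here, a subreddit name) before preference tuning. The template is an assumption.

from dataclasses import dataclass

@dataclass
class PreferenceExample:
    prompt: str
    chosen: str
    rejected: str

def contextualize(subreddit: str, question: str, chosen: str, rejected: str) -> PreferenceExample:
    """Prepend the community identifier so the model learns community-conditioned preferences."""
    prompt = f"Community: r/{subreddit}\nQuestion: {question}\nAnswer:"
    return PreferenceExample(prompt, chosen, rejected)

# Only the prompt changes; chosen/rejected responses come from community feedback.
example = contextualize("AskScience",
                        "Why is the sky blue?",
                        "Rayleigh scattering preferentially scatters shorter wavelengths...",
                        "Because the ocean reflects onto it.")
print(example.prompt)
```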


Figure 1: Two phases of aligning LLMs with wrong answers: eliciting wrong-over-wrong preferences and wrong-over-wrong alignment. In Phase 1, we employ four methods to elicit wrong-over-wrong preferences, based on answer consistency, logits-based confidence, and LLM-as-a-judge approaches. In Phase 2, we align LLMs with wrong-over-wrong preferences using DPO and expect to obtain less wrong, more correct, and better-calibrated answers.
Figure 2: Correlation between task accuracy, confidence, and Acc_WoW of score-based eliciting with M_10. Data points are from all three LLMs used to elicit wrong-over-wrong preferences; P denotes the Pearson correlation coefficient. The ability to elicit wrong-over-wrong preferences is positively correlated with task ability but negatively correlated with confidence.
Figure 3: Correlation between Acc_WoW and improvement after wrong-over-wrong alignment in less wrong (Δp_wrong), more correct (ΔAcc), and better calibration (−ΔECE).
Table: Prompt for pairwise-comparison wrong-over-wrong preference eliciting.
Varying Shades of Wrong: Aligning LLMs with Wrong Answers Only

October 2024 · 22 Reads

In the absence of abundant reliable annotations for challenging tasks and contexts, how can we expand the frontier of LLM capabilities with potentially wrong answers? We focus on two research questions: (1) Can LLMs generate reliable preferences among wrong options? And if so, (2) Would alignment with such wrong-over-wrong preferences be helpful? We employ methods based on self-consistency, token probabilities, and LLM-as-a-judge to elicit wrong-over-wrong preferences, and fine-tune language models with preference optimization approaches using these synthesized preferences. Extensive experiments with seven LLMs and eight datasets demonstrate that (1) LLMs do have preliminary capability in distinguishing various shades of wrong, achieving up to 20.9% higher performance than random guess; (2) Alignment with wrong-over-wrong preferences helps LLMs to produce less wrong and sometimes even outright correct answers, while overall improving model calibration.
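The sketch below illustrates one of the three elicitation strategies the abstract names: confidence-based ranking of two wrong answers into a (chosen, rejected) pair. The `score` function is a hypothetical stand-in for mean token log-probability under the model, and the margin filter is an illustrative choice; the resulting pairs would then feed a preference-optimization step such as DPO.

```python
# Minimal sketch of eliciting a wrong-over-wrong preference with a confidence
# score. `score` is a hypothetical stand-in for sequence log-probability.

from typing import Callable, Optional, Tuple

def wrong_over_wrong_pair(
    question: str,
    wrong_a: str,
    wrong_b: str,
    score: Callable[[str, str], float],   # e.g. mean token log-prob of the answer
    margin: float = 0.1,
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) when the model is confidently 'less wrong' about
    one answer; return None when the two scores are too close to trust."""
    sa, sb = score(question, wrong_a), score(question, wrong_b)
    if abs(sa - sb) < margin:
        return None
    return (wrong_a, wrong_b) if sa > sb else (wrong_b, wrong_a)
```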


Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence

October 2024 · 40 Reads

We propose Model Swarms, a collaborative search algorithm to adapt LLMs via swarm intelligence, the collective behavior guiding individual systems. Specifically, Model Swarms starts with a pool of LLM experts and a utility function. Guided by the best-found checkpoints across models, diverse LLM experts collaboratively move in the weight space and optimize a utility function representing model adaptation objectives. Compared to existing model composition approaches, Model Swarms offers tuning-free model adaptation, works in low-data regimes with as few as 200 examples, and does not require assumptions about specific experts in the swarm or how they should be composed. Extensive experiments demonstrate that Model Swarms could flexibly adapt LLM experts to a single task, multi-task domains, reward models, as well as diverse human interests, improving over 12 model composition baselines by up to 21.0% across tasks and contexts. Further analysis reveals that LLM experts discover previously unseen capabilities in initial checkpoints and that Model Swarms enable the weak-to-strong transition of experts through the collaborative search process.
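The collaborative search described above is, at its core, particle swarm optimization applied to model weights. The sketch below runs a standard PSO velocity/position update on toy flattened "weight vectors" guided by personal-best and global-best positions under a utility function; the hyperparameters and the toy utility are illustrative, not the paper's settings.

```python
# Standard particle-swarm update applied to toy weight vectors, guided by
# personal-best and global-best positions under a utility function.

import numpy as np

def pso_step(weights, velocities, personal_best, global_best,
             inertia=0.5, c_personal=1.5, c_social=1.5):
    """One velocity/position update for every expert in the swarm."""
    new_w, new_v = [], []
    for w, v, pb in zip(weights, velocities, personal_best):
        r1, r2 = np.random.rand(*w.shape), np.random.rand(*w.shape)
        v = inertia * v + c_personal * r1 * (pb - w) + c_social * r2 * (global_best - w)
        new_w.append(w + v)
        new_v.append(v)
    return new_w, new_v

# Toy run: 4 "experts" with 8-dim weight vectors; utility = negative distance to a target.
target = np.ones(8)
utility = lambda w: -np.linalg.norm(w - target)
weights = [np.random.randn(8) for _ in range(4)]
velocities = [np.zeros(8) for _ in range(4)]
personal_best = list(weights)
global_best = max(weights, key=utility)
for _ in range(50):
    weights, velocities = pso_step(weights, velocities, personal_best, global_best)
    personal_best = [w if utility(w) > utility(pb) else pb
                     for w, pb in zip(weights, personal_best)]
    global_best = max(personal_best, key=utility)
print(round(utility(global_best), 3))  # approaches 0 as the swarm converges on the target
```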


Citations (52)


... Cao et al. (2021) proposed subnetwork probing, a pruning-based method that searches for a subnetwork that performs a target linguistic task. As for linguistic generalization, previous studies have found subnetworks that perform syntactic generalization (Bhaskar et al., 2024), hierarchical generalization (Ahuja et al., 2024), and compositional generalization (Hu et al., 2024). ...

Reference:

Analyzing the Inner Workings of Transformers in Compositional Generalization
Learning Syntax Without Planting Trees: Understanding Hierarchical Generalization in Transformers
  • Citing Article
  • February 2024

Transactions of the Association for Computational Linguistics

... The Yorùbá language is one of the largest low-resource African languages, with over 47 million speakers and several dialects with considerable similarities [34,35]. It is adopted as a native and social language in West African countries, including Nigeria, Togo, and the Benin Republic, as well as in other countries such as Cuba and Brazil [36]. ...

Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects
  • Citing Conference Paper
  • January 2024

... Recently, pluralistic alignment, the notion that different individuals or groups may have conflicting or varying preferences over AI behavior, has emerged as an active area of research [3,6,12,19,27]. Unlike traditional reinforcement learning, where a single, well-defined reward function governs optimal policy learning, pluralistic settings require reconciling multiple human perspectives. This raises questions about whose preferences should shape AI decisions and how to aggregate diverse inputs fairly and effectively. ...

Modular Pluralism: Pluralistic Alignment via Multi-LLM Collaboration
  • Citing Conference Paper
  • January 2024

... These include integrating LLMs with path selection mechanisms to learn unified graph representations (Shang et al., 2024); constructing graph-based text indexes using LLMs to answer questions over private text corpora (Edge et al., 2024); and utilizing LLMs for knowledge graph creation (Zhu et al., 2024; Carta et al., 2023; Trajanoska et al., 2023) and completion (Yao et al., 2023b). In addition, Zhang et al. (2024b) proposed the NLGift benchmark, which focuses on evaluating the generalization of LLM graph reasoning. ...

Can LLM Graph Reasoning Generalize beyond Pattern Memorization?
  • Citing Conference Paper
  • January 2024

... Superintelligent AI, unbound by ethical constraints, could manipulate individual behavior through sophisticated psychological profiling [60,61]. By exploiting data to tailor interventions, such AI systems could commandeer free will, subtly influencing decisions ranging from consumer behavior to political allegiances [62,63]. Open-source availability amplifies this threat, enabling bad actors to weaponize AI systems to polarize communities, influence elections, or radicalize individuals. ...

Biased AI can Influence Political Decision-Making
  • Citing Preprint
  • October 2024

... Another challenge is the significant number of English varieties. Kortmann et al. (2020) alone have documented 77 diatopic varieties of English; however, DialectBench (Faisal et al., 2024), the most developed dialectal benchmark, covers only 19 varieties, and many of its tasks are not equivalent. Building benchmarks and datasets also requires significant expense and annotation effort, which makes fine-tuning on labeled datasets more difficult to achieve. ...

DIALECTBENCH: An NLP Benchmark for Dialects, Varieties, and Closely-Related Languages
  • Citing Conference Paper
  • January 2024

... MGT-Bench [17] demonstrates the impact of three types of attacks on binary classification (whether the text is MGT), but it does not test how evasion attacks affect other tasks or broader aspects. Stumbling Blocks [46] provides a more comprehensive analysis of the impact of different attack types on binary classification tasks and includes preliminary tests on text quality changes, such as semantic similarity and fluency. However, it does not explore how attacks affect multi-class tasks and a wider range of text quality metrics (such as text complexity), nor does it account for the computational costs of these attacks. ...

Stumbling Blocks: Stress Testing the Robustness of Machine-Generated Text Detectors Under Attacks
  • Citing Conference Paper
  • January 2024

... Large Language Models (LLMs), such as GPT-4o, offer comprehensive and distinct advantages for fake news detection (Pan et al. 2023a; Chen and Shu 2024). Benefiting from vast amounts of real-world training data and advanced AI techniques (e.g., the Transformer framework, reinforcement learning), LLMs have demonstrated powerful NLP abilities in fact-checking (Pan et al. 2023b), text paraphrasing (Qiang et al. 2023b,a), pattern recognition, and other CoT-amenable tasks, such as question answering (Zhang et al. 2024), explanation generation (Bhattacharjee et al. 2024; Wan et al. 2024), reasoning, etc. One recent study utilizes GPT-3.5 to generate reasoning with respect to a news item's common sense and description. ...

DELL: Generating Reactions and Explanations for LLM-Based Misinformation Detection
  • Citing Conference Paper
  • January 2024

... Calibration and confidence. There are many approaches to LM confidence estimation and selective prediction (Wen et al., 2024), including prompting (Kadavath et al., 2022), fine-tuning (Mielke et al., 2022; Yang et al., 2023; Lin et al., 2022), preference learning, and conformal prediction (Quach et al., 2024; Mohri & Hashimoto, 2024), as well as inference-time methods that include multi-agent collaboration and debate (Du et al., 2023; Feng et al., 2024). Our perspective is that the decision to answer or abstain should be driven by two features that are hard to know in advance: the likelihood that the agent can arrive at the correct answer, and the consequences of abstention. ...

Don’t Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration
  • Citing Conference Paper
  • January 2024