Karan Singhal’s research while affiliated with Google Inc. and other places


Publications (24)


Overview of contributions
AMIE is a conversational medical AI optimized for diagnostic dialogue. It is instruction fine-tuned with a combination of real-world and simulated medical dialogues, alongside a diverse set of medical reasoning, question-answering (QA) and summarization datasets. Notably, we designed a self-play-based simulated dialogue environment with automated feedback mechanisms to scale AMIE’s capabilities across various medical contexts and specialties. Specifically, this iterative self-improvement process consisted of two self-play loops: (1) an ‘inner’ self-play loop, where AMIE leveraged in-context critic feedback to refine its behaviour on simulated conversations with an AI patient agent; and (2) an ‘outer’ self-play loop where the set of refined simulated dialogues were incorporated into subsequent fine-tuning iterations. During online inference, AMIE used a chain-of-reasoning strategy to progressively refine its response, conditioned on the current conversation, to arrive at an accurate and grounded reply to the patient in each dialogue turn. We designed and conducted a blinded remote OSCE with validated patient-actors interacting with AMIE or PCPs by means of a text chat interface. Across multiple axes, corresponding to both specialist physician (30 out of 32) and patient-actor (25 out of 26) perspectives, AMIE was rated as superior to PCPs while being non-inferior on the rest.
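The two nested self-play loops described above can be summarized in pseudocode. This is a minimal sketch only: `amie`, `patient_agent`, `critic` and `fine_tune` are hypothetical stand-ins for the components named in the overview, not the authors' implementation or API.

```python
# Minimal sketch of the inner/outer self-play structure (assumed interfaces).
# All callables below are hypothetical placeholders.

def inner_self_play(amie, patient_agent, critic, scenario, max_turns=10):
    """One simulated consultation: AMIE refines each reply using in-context critic feedback."""
    dialogue = []
    for _ in range(max_turns):
        patient_msg = patient_agent(scenario, dialogue)            # AI patient agent turn
        draft = amie(dialogue + [patient_msg])                     # initial reply
        feedback = critic(dialogue + [patient_msg], draft)         # automated in-context critique
        reply = amie(dialogue + [patient_msg], feedback=feedback)  # refined reply
        dialogue += [patient_msg, reply]
    return dialogue


def outer_self_play(amie, patient_agent, critic, scenarios, fine_tune, rounds=3):
    """Refined simulated dialogues are folded into subsequent fine-tuning iterations."""
    for _ in range(rounds):
        simulated = [inner_self_play(amie, patient_agent, critic, s) for s in scenarios]
        amie = fine_tune(amie, simulated)  # incorporate dialogues into the training mix
    return amie
```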
Overview of randomized study design
A PCP and AMIE perform (in a randomized order) a virtual remote OSCE with simulated patients by means of an online multi-turn synchronous text chat and produce answers to a post-questionnaire. Both the PCP and AMIE are then evaluated by both the patient-actors and specialist physicians.
Specialist-rated top-k diagnostic accuracy
a,b, The AMIE and PCP top-k DDx accuracies, determined by the majority vote of three specialists, are compared across 159 scenarios with respect to the ground-truth diagnosis (a) and all diagnoses in the accepted differential (b). Centrelines correspond to the average top-k accuracies, with the shaded areas indicating 95% confidence intervals computed from two-sided bootstrap testing (n = 10,000). All top-k differences between AMIE and PCP DDx accuracy are significant, with P < 0.05 after FDR correction. The FDR-adjusted P values for ground-truth comparison are: 0.0017 (k = 1), 0.0002 (k = 2), 0.0002 (k = 3), 0.0002 (k = 4), 0.0002 (k = 5), 0.0003 (k = 6), 0.0003 (k = 7), 0.0003 (k = 8), 0.0002 (k = 9) and 0.0002 (k = 10) (a). The FDR-adjusted P values for accepted differential comparison are: 0.0001 (k = 1), 0.0001 (k = 2), 0.0002 (k = 3), 0.0002 (k = 4), 0.0001 (k = 5), 0.0001 (k = 6), 0.0001 (k = 7), 0.0001 (k = 8), 0.0001 (k = 9) and 0.0001 (k = 10) (b).
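For readers who want to reproduce this style of analysis on their own data, a percentile bootstrap over cases is sketched below. It assumes exact matching of a ground-truth label against the ranked DDx list; in the study itself, matches were adjudicated by specialist physicians, so the matching step here is a simplification.

```python
import numpy as np

def top_k_accuracy(ddx_lists, truths, k):
    """Fraction of cases whose ground-truth diagnosis appears in the top-k of the DDx list."""
    hits = [truth in ddx[:k] for ddx, truth in zip(ddx_lists, truths)]
    return float(np.mean(hits))

def bootstrap_ci(ddx_lists, truths, k, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI over cases (resampling scenarios with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(truths)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats.append(top_k_accuracy([ddx_lists[i] for i in idx],
                                    [truths[i] for i in idx], k))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return top_k_accuracy(ddx_lists, truths, k), (lo, hi)
```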
Patient-actor ratings
Conversation qualities, as assessed by the patient-actors upon conclusion of the consultation. For illustration purposes, all responses from the five-point rating scales were mapped to a generic five-point scale ranging from ‘Very favourable’ to ‘Very unfavourable’. For Yes/No (Y/N) questions, a (positive) ‘Yes’ response was mapped to the same colour as ‘Favourable’ and a (negative) ‘No’ response to the same colour as ‘Unfavourable’. The rating scales were adapted from the GMCPQ, PACES and a narrative review about PCCBP. Details on question-wording and response options are provided in Extended Data Tables 1 and 2. The evaluation involved 159 simulated patients. The P values were determined using two-sided Wilcoxon signed-rank tests with FDR correction. Cases where either AMIE or the PCP received ‘Cannot rate/Does not apply’ were excluded from the test.
Specialist physician ratings
Conversation and reasoning qualities, as assessed by specialist physicians. For illustration purposes, all responses from the five-point rating scales were mapped to a generic five-point scale ranging from ‘Very favourable’ to ‘Very unfavourable’. The only four-point scale (DDx comprehensiveness) was mapped to the same scale, ignoring the ‘Neither favourable nor unfavourable’ option. For Yes/No questions, a (positive) ‘Yes’ response was mapped to the same colour as ‘Favourable’ and a (negative) ‘No’ response to the same colour as ‘Unfavourable’. The rating scales were adapted from PACES, a narrative review about PCCBP and other sources. Details on question-wording and response options are provided in Extended Data Tables 1–3. The evaluation involved 159 simulated patients, with the ratings from three distinct specialist physician raters for each case being aggregated using the median. The P values were determined using two-sided Wilcoxon signed-rank tests with FDR correction. Cases where either AMIE or the PCP received ‘Cannot rate/Does not apply’ were excluded from the test.
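A sketch of the statistical procedure used for these per-axis comparisons (paired two-sided Wilcoxon signed-rank tests followed by Benjamini–Hochberg FDR correction) is given below. The input format and variable names are assumptions for illustration, ratings are presumed to be mapped to a numeric scale already, and cases without valid ratings in both arms are presumed to be filtered out beforehand.

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

def paired_tests_with_fdr(amie_scores, pcp_scores, alpha=0.05):
    """amie_scores / pcp_scores: dicts mapping axis name -> per-case numeric ratings,
    aligned case-by-case and restricted to cases rated in both arms."""
    axes, pvals = [], []
    for axis in amie_scores:
        a = np.asarray(amie_scores[axis], dtype=float)
        p = np.asarray(pcp_scores[axis], dtype=float)
        _, pval = wilcoxon(a, p, alternative="two-sided")
        axes.append(axis)
        pvals.append(pval)
    reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return {axis: (adj, rej) for axis, adj, rej in zip(axes, p_adj, reject)}
```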
Towards conversational diagnostic artificial intelligence
  • Article
  • Full-text available

April 2025 · 19 Reads · 8 Citations · Nature

Tao Tu · Mike Schaekermann · Anil Palepu · [...] · Vivek Natarajan

At the heart of medicine lies physician–patient dialogue, where skillful history-taking enables effective diagnosis, management and enduring trust¹,². Artificial intelligence (AI) systems capable of diagnostic dialogue could increase accessibility and quality of care. However, approximating clinicians’ expertise is an outstanding challenge. Here we introduce AMIE (Articulate Medical Intelligence Explorer), a large language model (LLM)-based AI system optimized for diagnostic dialogue. AMIE uses a self-play-based³ simulated environment with automated feedback for scaling learning across disease conditions, specialties and contexts. We designed a framework for evaluating clinically meaningful axes of performance, including history-taking, diagnostic accuracy, management, communication skills and empathy. We compared AMIE’s performance to that of primary care physicians in a randomized, double-blind crossover study of text-based consultations with validated patient-actors, similar to an objective structured clinical examination⁴,⁵. The study included 159 case scenarios from providers in Canada, the United Kingdom and India, 20 primary care physicians compared to AMIE, and evaluations by specialist physicians and patient-actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 30 out of 32 axes according to the specialist physicians and 25 out of 26 axes according to the patient-actors. Our research has several limitations and should be interpreted with caution. Clinicians used synchronous text chat, which permits large-scale LLM–patient interactions, but this is unfamiliar in clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI.


Evaluation of the quality of DDx lists from generalist physicians
a, DDx quality score based on the question: “How close did the differential diagnoses (DDx) come to including the final diagnosis?” b, DDx comprehensiveness score based on the question: “Using your DDx list as a benchmark/gold standard, how comprehensive are the differential lists from each of the experts?” c, DDx appropriateness score based on the question: “How appropriate was each of the DDx lists from the different medical experts compared to the differential list that you just produced?” The colours correspond to experiment arms, and the shade of the colour corresponds to different levels on the rating scales. In all cases, AMIE and clinicians assisted by AMIE scored highest overall. Numbers reflect the number of cases (out of 302). Note that the clinicians had the option of answering “I am not sure” in response to these questions; they used this option in a very small number (less than 1%) of cases.
Top-n accuracy in DDx lists through human and automated evaluations
The percentage accuracy of DDx lists with the final diagnosis through human evaluation (left) or automated evaluation (right). Points reflect the mean; shaded areas show ±1 s.d. from the mean across 10 trials.
Sankey diagram showing effect of assistance
a, In the AMIE arm, the final correct diagnosis appeared in the DDx list only after assistance in 73 cases. b, In the Search arm, the final correct diagnosis appeared in the DDx list only after assistance in 37 cases. In a small minority of cases in both arms (AMIE arm: 11 (a); Search arm: 12 (b)), the final diagnosis appeared in the DDx list before assistance but was not in the list after assistance.
Top-n accuracy in DDx lists from different LLMs
Comparison of the percentage of DDx lists that included the final diagnosis for AMIE versus GPT-4 for 70 cases. We used Med-PaLM 2¹⁰, GPT-4⁶ and AMIE as the raters—all resulted in similar trends. Points reflect the mean; shaded areas show ±1 s.d. from the mean across 10 trials.
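The automated evaluation referred to here uses an LLM as a rater of whether any candidate in a DDx list refers to the same condition as the final diagnosis. A minimal sketch of that idea follows; the prompt wording and the `llm` callable are hypothetical and not the exact rubric or raters used in the study.

```python
# Hypothetical LLM auto-rater for top-n DDx matching.
# `llm` is a placeholder callable that takes a prompt string and returns free text.

PROMPT = (
    "Final diagnosis: {truth}\n"
    "Candidate diagnosis: {candidate}\n"
    "Does the candidate refer to the same condition as the final diagnosis? "
    "Answer yes or no."
)

def top_n_hit(llm, ddx_list, truth, n=10):
    """True if the LLM rater judges any of the top-n candidates to match the final diagnosis."""
    for candidate in ddx_list[:n]:
        verdict = llm(PROMPT.format(truth=truth, candidate=candidate))
        if verdict.strip().lower().startswith("yes"):
            return True
    return False
```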
Top-1 and top-10 accuracy of DDx lists produced with AMIE and Search assistance by speciality
Towards accurate differential diagnosis with large language models

April 2025 · 14 Reads · 14 Citations · Nature

A comprehensive differential diagnosis is a cornerstone of medical care that is often reached through an iterative process of interpretation that combines clinical history, physical examination, investigations and procedures. Interactive interfaces powered by large language models present new opportunities to assist and automate aspects of this process¹. Here we introduce the Articulate Medical Intelligence Explorer (AMIE), a large language model that is optimized for diagnostic reasoning, and evaluate its ability to generate a differential diagnosis alone or as an aid to clinicians. Twenty clinicians evaluated 302 challenging, real-world medical cases sourced from published case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: assistance from search engines and standard medical resources; or assistance from AMIE in addition to these tools. All clinicians provided a baseline, unassisted differential diagnosis prior to using the respective assistive tools. AMIE exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% versus 33.6%, P = 0.04). Comparing the two assisted study arms, the differential diagnosis quality score was higher for clinicians assisted by AMIE (top-10 accuracy 51.7%) compared with clinicians without its assistance (36.1%; McNemar’s test: 45.7, P < 0.01) and clinicians with search (44.4%; McNemar’s test: 4.75, P = 0.03). Further, clinicians assisted by AMIE arrived at more comprehensive differential lists than those without assistance from AMIE. Our study suggests that AMIE has potential to improve clinicians’ diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients’ access to specialist-level expertise.
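The paired comparisons reported here (for example, per-case top-10 hits under AMIE assistance versus search assistance) correspond to McNemar's test on paired binary outcomes. A small sketch, assuming boolean hit indicators per case, is shown below.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def paired_top10_mcnemar(hits_arm_a, hits_arm_b):
    """hits_arm_a / hits_arm_b: boolean arrays, one entry per case, True if the
    final diagnosis was in that arm's top-10 DDx list for the case."""
    a = np.asarray(hits_arm_a, dtype=bool)
    b = np.asarray(hits_arm_b, dtype=bool)
    table = [[int(np.sum(a & b)),  int(np.sum(a & ~b))],
             [int(np.sum(~a & b)), int(np.sum(~a & ~b))]]
    result = mcnemar(table, exact=False, correction=True)  # chi-squared form of the test
    return result.statistic, result.pvalue
```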


Med-PaLM 2 performance on MultiMedQA
a, Med-PaLM 2 achieved an accuracy of 86.5% on USMLE-style questions in the MedQA dataset. The shaded region highlights the reported performance of models developed after Med-PaLM 2. b, In a pairwise ranking study on n = 1,066 consumer medical questions, Med-PaLM 2 answers were preferred over physician answers by a panel of physicians across eight of nine axes in our evaluation framework. Stacked bars represent proportions of answers for which physician raters preferred Med-PaLM 2 answers (orange), answers generated by other physicians (blue) or ties (light blue). Error bars reflect 95% confidence intervals of the overall preference rates for physician and Med-PaLM 2 answers, as determined by clustered bootstrapping computed over all 1,066 paired ratings.
Independent long-form evaluation with physician raters
Values are the proportion of ratings across answers where each axis was rated in the highest-quality bin. (For instance, ‘Possible harm extent = no harm’ reflects the proportion of answers where the extent of possible harm was rated ‘No harm.’) Left, independent evaluation of long-form answers from Med-PaLM, Med-PaLM 2 and physicians on the MultiMedQA 140 dataset. Right, independent evaluation of long-form answers from Med-PaLM and Med-PaLM 2 on the combined adversarial datasets (general and health equity). Detailed breakdowns are presented in Supplementary Tables 3 and 4. Error bars reflect 95% confidence intervals as determined by bootstrapping, centered on the mean proportions.
Ranking comparison of long-form answers
Med-PaLM 2 answers are consistently preferred over Med-PaLM answers by physician raters across all ratings dimensions, in both MultiMedQA (a) and adversarial (b) question sets. Stacked bars represent proportions of answers for which physician raters preferred Med-PaLM 2 answers (orange), Med-PaLM 1 answers (green) or ties (light blue). Error bars reflect 95% confidence intervals as determined by bootstrapping, centered on preference rates for Med-PaLM 2 and Med-PaLM, respectively, across n = 1,066 paired ratings. Detailed breakdowns for adversarial questions are presented in Supplementary Table 4.
Results of pilot study on bedside consultation dataset
a, Three-way ranking results for model, generalist and specialist answers by plurality of raters. Top bars show specialist raters, and bottom bars show generalist raters (11× replication per question). Both groups of physicians preferred specialist answers the most, and both preferred model answers more often than generalist answers. b, Pairwise ranking results for model, generalist and specialist answers, averaged over raters. Top bars, generalist raters; bottom bars, specialist raters (11× replication per question). Both groups of physicians preferred specialist answers over model answers. Specialists preferred model answers over generalist answers, while generalists rated them about equally.
Med-PaLM 2 performance on multiple-choice questions with and without overlap
Toward expert-level medical question answering with large language models

January 2025 · 99 Reads · 340 Citations · Nature Medicine

Large language models (LLMs) have shown promise in medical question answering, with Med-PaLM being the first to exceed a ‘passing’ score in United States Medical Licensing Examination style questions. However, challenges remain in long-form medical question answering and handling real-world workflows. Here, we present Med-PaLM 2, which bridges these gaps with a combination of base LLM improvements, medical domain fine-tuning and new strategies for improving reasoning and grounding through ensemble refinement and chain of retrieval. Med-PaLM 2 scores up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19%, and demonstrates dramatic performance increases across MedMCQA, PubMedQA and MMLU clinical topics datasets. Our detailed human evaluations framework shows that physicians prefer Med-PaLM 2 answers to those from other physicians on eight of nine clinical axes. Med-PaLM 2 also demonstrates significant improvements over its predecessor across all evaluation metrics, particularly on new adversarial datasets designed to probe LLM limitations (P < 0.001). In a pilot study using real-world medical questions, specialists preferred Med-PaLM 2 answers to generalist physician answers 65% of the time. While specialist answers were still preferred overall, both specialists and generalists rated Med-PaLM 2 to be as safe as physician answers, demonstrating its growing potential in real-world medical applications.
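Ensemble refinement, as described for Med-PaLM 2, first samples several reasoning paths and then conditions the model on those samples to produce refined answers, aggregating by plurality vote. The sketch below approximates that procedure under assumed interfaces; `llm(prompt, temperature=...)`, the prompt wording and the sample counts are illustrative, not the paper's exact setup.

```python
from collections import Counter

def ensemble_refinement(llm, question, n_samples=8, n_refine=8, temperature=0.7):
    """Two-stage sketch: sample diverse reasoning paths, then condition on them
    to produce refined answers and take a plurality vote over the results."""
    drafts = [llm(f"Question: {question}\nExplain your reasoning, then give an answer.",
                  temperature=temperature) for _ in range(n_samples)]
    context = "\n\n".join(f"Candidate reasoning {i + 1}:\n{d}" for i, d in enumerate(drafts))
    refined = [llm(f"Question: {question}\n\n{context}\n\n"
                   "Considering the candidate reasoning above, give the single best answer.",
                   temperature=temperature) for _ in range(n_refine)]
    answer, _ = Counter(r.strip() for r in refined).most_common(1)[0]
    return answer
```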


Comparison of automatic report generation metrics on the MIMIC-CXR dataset



Schematic overview of our human evaluation framework
a, To compare radiology reports generated by our AI model with reports written by human experts, we devise two evaluation schemes: (1) a pairwise preference test in which a certified expert is given two reports without knowing the source of the report (one report from our model and the original report from a radiologist) and they are asked to choose which report should be ‘used downstream for the care of this patient’; and (2) an error correction task in which a single report (either AI-generated or the original one) is evaluated carefully and edited if required. The expert is also asked to give the reason for each correction and to indicate whether the error is clinically significant or not. b, We measure the utility of the AI-based report generation system in an assistive scenario in which the AI model first generates a report and the human expert revises it as needed. For this task, we repeat the same pairwise preference test as before, but this time the expert is asked to compare an AI-generated report corrected with human edits against a report written by a human alone. We perform this evaluation on two datasets, one acquired in outpatient care delivery in India and another from intensive care in the United States. Board-certified radiologists are recruited in both countries to study the regional inter-rater variation.
Comparison of detection accuracy with expert labels on the IND1 dataset
a, The ROC curve of the Flamingo-CXR report generation model with stochastic generation method (Nucleus) and corresponding area under the curve (AUC), shown along with the sensitivity and 1 − specificity pairs for two certified radiologists. The operating point of our model with the default deterministic inference scheme (Beam 3) is also shown. Details of the two inference algorithms are available in the Methods. The curve and the metrics are microaveraged across six conditions (cardiomegaly, pleural effusion, lung opacity, edema, enlarged cardiomediastinum and fracture) for which the labels were collected (n = 7,995 is the total number of IND1 test set reports). The GT labels are defined as the majority vote among the 5 labels obtained from the pool of 18 certified radiologists. Error bars represent 95% confidence intervals (calculated using bootstrapping with 1,000 repetitions). b, Kendall’s tau coefficients with respect to the expert labels are shown for the two held-out radiologists as well as for two inference schemes of our Flamingo-CXR model. We use the ‘soft’ labels derived by averaging over the available annotations instead of the majority vote labels as the target for computing the metric. On the vertical axis, the prevalence rates (PRs) of the respective conditions in the training set and their sample size in the test set are also shown. The target labels are the probabilities over the presence of the respective conditions calculated by averaging the binary condition labels from the expert pool.
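The Kendall's tau analysis described here compares model scores against 'soft' expert labels, i.e. per-report means of the binary annotations. A minimal sketch, assuming per-condition model scores and raw annotation lists as inputs, is shown below.

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_vs_soft_labels(pred_scores, expert_labels):
    """pred_scores: model scores for one condition across test reports.
    expert_labels: per-report lists of binary annotations from multiple raters;
    the target is the per-report mean ('soft' label)."""
    soft = np.array([np.mean(labels) for labels in expert_labels])
    tau, pval = kendalltau(np.asarray(pred_scores, dtype=float), soft)
    return tau, pval
```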
Results of pairwise preference test for MIMIC-CXR and IND1
a, Preferences for Flamingo-CXR reports relative to original clinician reports. Reports are grouped according to the level of agreement between reviewers. b, Clinician preferences for Flamingo-CXR reports depending on the location of the clinician, from either the US-based cohort or the India-based cohort. Note that there are two reviews from each location cohort, so in this case, unanimity corresponds to agreement between two clinicians rather than four in the full panel. c, Preferences for normal reports and separately, for abnormal reports. In all panels, data are presented as mean values and error bars show 95% confidence intervals for the cumulative preference scores. d, Examples from MIMIC-CXR with varying degrees of inter-rater preference agreement; for two examples, all four radiologists unanimously preferred the AI report or the clinician’s report, whereas for the remaining one, the preferences were divided equally. AP, anterior–posterior; CABG, coronary artery bypass graft; IJ, internal jugular; PA-C, physician assistant - certified; SVC, superior vena cava.
Comparison of error correction for the AI-generated reports and the original GT reports
a–c, The upper row shows the percentage of reports with at least one (clinically significant) error, and the bottom row shows the average number of identified (clinically significant) errors per report computed as the total number of detected errors divided by the number of all reports, including the ones without errors. These two metrics are compared across the IND1 and MIMIC-CXR datasets overall (a), the two rater locations (India and the United States) to illustrate the regional inter-rater variation (b) and the normal and abnormal cases in the respective datasets (c). Error statistics for GT reports and Flamingo-CXR reports are given for each setting and grouped together as indicated by dashed lines. Data are presented as mean values and error bars correspond to 95% confidence intervals across cases and expert assessments.
Results of pairwise preference test for clinician–AI collaboration
a, Preferences for reports produced from the clinician–AI collaboration relative to the original clinicians’ reports are shown here. The corresponding preference scores for reports produced by Flamingo-CXR without human collaboration are also given. Reports are grouped by the level of agreement between reviewers, and in all cases, we show results for the subset of reports that required editing during the error correction task. Data for all panels are presented as mean values and error bars show 95% confidence intervals for the cumulative preference scores. Significant differences (P < 0.05) between clinician–AI results and AI-only results calculated using a one-sided chi-squared test are indicated by an asterisk (with MIMIC-CXR P values given by *P = 1.3 × 10⁻², **P = 5.7 × 10⁻⁴, ***P = 3.2 × 10⁻⁹; and IND1 P values given by *P = 1.2 × 10⁻⁷, **P = 4.4 × 10⁻⁹, ***P = 7.7 × 10⁻⁶). b, Preferences for reports produced from a collaboration between Flamingo-CXR and radiologists from our US-based cohort and separately, from our India-based cohort. c, Preferences for normal reports and separately, for abnormal reports. d, An example of a pairwise preference test for a clinician–AI report and an AI report, relative to the original clinician’s MIMIC-CXR report. All four radiologists initially indicated a preference for the original clinician’s report to the AI report. Another radiologist revised two sentences in the AI report (indicated in red), resulting in a complete flip in preference in which all four radiologists unanimously expressed the superiority (or equivalence) of the clinician–AI report.
Collaboration between clinicians and vision–language models in radiology report generation

November 2024 · 103 Reads · 24 Citations · Nature Medicine

Automated radiology report generation has the potential to improve patient care and reduce the workload of radiologists. However, the path toward real-world adoption has been stymied by the challenge of evaluating the clinical quality of artificial intelligence (AI)-generated reports. We build a state-of-the-art report generation system for chest radiographs, called Flamingo-CXR, and perform an expert evaluation of AI-generated reports by engaging a panel of board-certified radiologists. We observe a wide distribution of preferences across the panel and across clinical settings, with 56.1% of Flamingo-CXR intensive care reports evaluated to be preferable or equivalent to clinician reports, by half or more of the panel, rising to 77.7% for in/outpatient X-rays overall and to 94% for the subset of cases with no pertinent abnormal findings. Errors were observed in human-written reports and Flamingo-CXR reports, with 24.8% of in/outpatient cases containing clinically significant errors in both report types, 22.8% in Flamingo-CXR reports only and 14.0% in human reports only. For reports that contain errors we develop an assistive setting, a demonstration of clinician–AI collaboration for radiology report composition, indicating new possibilities for potential clinical utility.


Overview of our main contributions
We employ an iterative, participatory approach to design human assessment rubrics for surfacing health equity harms and biases; introduce EquityMedQA, a collection of seven newly released adversarial medical question-answering datasets enriched for equity-related content that substantially expands upon the volume and breadth of previously studied adversarial data for medical question answering; and perform a large-scale empirical study of health equity-related biases in LLMs.
Results of independent evaluation of bias in Med-PaLM 2 answers
We report the rate at which raters reported minor or severe bias in Med-PaLM 2 answers for physician and health equity expert raters for each dataset and dimension of bias. The numbers of answers rated for each dataset are reported in Table 2 and the Methods. Statistics for multiply rated datasets (Mixed MMQA–OMAQ and Omiye et al.) were computed with pooling over replicates with the level of replication indicated in parentheses. Data are reported as proportions with 95% CIs.
Results of pairwise evaluation of Med-PaLM 2 answers compared to Med-PaLM and physician answers
We report the rates at which raters reported a lesser degree of bias in Med-PaLM 2 answers versus comparator answers across datasets, rater types and dimensions of bias. The numbers of answers rated for each dataset are reported in Table 2 and the Methods. The comparator is Med-PaLM in all cases except for the case of physician-written answers to HealthSearchQA questions. Data are reported as proportions with 95% CIs.
Results of counterfactual and independent evaluation on counterfactual datasets
In the top four rows, we report the rates at which raters reported bias in counterfactual pairs using the proposed counterfactual rubric as well as the rates at which they reported bias in one, one or more or both of the answers using the independent evaluation rubric for the CC-Manual (n = 102 pairs, triple replication) and the CC-LLM datasets (n = 200 pairs). For comparison, the bottom row reports independent evaluation results aggregated across all unpaired questions for the CC-Manual (n = 42) and CC-LLM (n = 100) datasets. Data are reported as proportions with 95% CIs.
A toolbox for surfacing health equity harms and biases in large language models

September 2024 · 74 Reads · 31 Citations · Nature Medicine

Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare.


Federated Variational Inference: Towards Improved Personalization and Generalization

May 2024 · 2 Reads · Proceedings of the AAAI Symposium Series

Conventional federated learning algorithms train a single global model by leveraging all participating clients’ data. However, due to heterogeneity in client generative distributions and predictive models, these approaches may not appropriately approximate the predictive process, converge to an optimal state, or generalize to new clients. We study personalization and generalization in stateless cross-device federated learning setups assuming heterogeneity in client data distributions and predictive models. We first propose a hierarchical generative model and formalize it using Bayesian Inference. We then approximate this process using Variational Inference to train our model efficiently. We call this algorithm Federated Variational Inference (FedVI). We use PAC-Bayes analysis to provide generalization bounds for FedVI. We evaluate our model on FEMNIST and CIFAR-100 image classification and show that FedVI beats the state-of-the-art on both tasks.
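To make the FedVI training structure concrete, here is a toy, self-contained sketch of one federated round: each client optimizes local variational parameters against the current global posterior and returns a gradient-style update, which the server averages. The quadratic "ELBO" is a deliberately simplified stand-in so the example runs end to end; it is not the paper's model, bound or datasets.

```python
# Toy sketch of a FedVI-style round (assumed structure, not the authors' code).
import numpy as np

def local_elbo_grad(global_params, client_data, n_local_steps=10, lr=0.1):
    """Client step: fit local variational params to a toy objective that balances
    the client's data against the global posterior, then return the gradient of
    that objective with respect to the global parameters."""
    local = np.zeros_like(global_params)
    for _ in range(n_local_steps):
        # toy objective: -||local - mean(data)||^2 - ||local - global||^2
        local -= lr * (2 * (local - client_data.mean(axis=0)) + 2 * (local - global_params))
    return 2 * (local - global_params)  # derivative of the toy objective w.r.t. global

def fedvi_round(global_params, clients, server_lr=0.05):
    """Server step: average the client updates and move the global parameters."""
    grads = [local_elbo_grad(global_params, data) for data in clients]
    return global_params + server_lr * np.mean(grads, axis=0)

# usage: three clients with heterogeneous 2-D data
rng = np.random.default_rng(0)
clients = [rng.normal(loc=c, size=(50, 2)) for c in (-1.0, 0.0, 2.0)]
theta = np.zeros(2)
for _ in range(100):
    theta = fedvi_round(theta, clients)
```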



Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report generation

November 2023 · 366 Reads

Radiology reports are an instrumental part of modern medicine, informing key clinical decisions such as diagnosis and treatment. The worldwide shortage of radiologists, however, restricts access to expert care and imposes heavy workloads, contributing to avoidable errors and delays in report delivery. While recent progress in automated report generation with vision-language models offers clear potential in ameliorating the situation, the path to real-world adoption has been stymied by the challenge of evaluating the clinical quality of AI-generated reports. In this study, we build a state-of-the-art report generation system for chest radiographs, Flamingo-CXR, by fine-tuning a well-known vision-language foundation model on radiology data. To evaluate the quality of the AI-generated reports, a group of 16 certified radiologists provide detailed evaluations of AI-generated and human-written reports for chest X-rays from an intensive care setting in the United States and an inpatient setting in India. At least one radiologist (out of two per case) preferred the AI report to the ground-truth report in over 60% of cases for both datasets. Amongst the subset of AI-generated reports that contain errors, the most frequently cited reasons were related to the location and finding, whereas for human-written reports, most mistakes were related to severity and finding. This disparity suggested potential complementarity between our AI system and human experts, prompting us to develop an assistive scenario in which Flamingo-CXR generates a first-draft report, which is subsequently revised by a clinician. This is the first demonstration of clinician-AI collaboration for report writing, and the resultant reports are assessed to be equivalent or preferred by at least one radiologist to reports written by experts alone in 80% of in-patient cases and 66% of intensive care cases.


Citations (13)


... Recent studies (e.g. Goh et al. (2024) and McDuff et al. (2023)) have demonstrated that LLMs and GPTs can outperform human physicians in the quality and accuracy of their differential diagnoses. However unlike our problem, there is no objective measure of what the "correct" diagnosis is for most cases. ...

Reference:

Who is More Bayesian: Humans or ChatGPT?
Towards accurate differential diagnosis with large language models

Nature

... It is also not obvious how these models might actively assist clinicians in the development of a DDx. Recent work has begun to assess the standalone performance of these models on challenging case reports that involve complex deduction and diagnosis 1,[12][13][14] , but has stopped short of evaluating how they can assist clinicians, augment performance and empower them to provide better care. ...

Towards conversational diagnostic artificial intelligence

Nature

... The integration of large language models (LLMs) with MVQA has shown promise in enhancing diagnostic accuracy and facilitating clinical decision-making [2]. If the model consistently provides the same answer for all the augmented questions, it is deemed a consistent model. LLMs, such as GPT-3 and its successors, have demonstrated capability in natural language understanding and generation, which can be used to interpret and answer complex questions about medical images [3]. ...

Toward expert-level medical question answering with large language models

Nature Medicine

... Automated radiology reporting can reduce the workload of radiologists. Researchers developed a state-of-the-art chest x-ray report generation system called the Flamingo-CXR [55]. By inviting a group of board-certified radiologists to conduct expert evaluations, errors were observed in both human-written and Flamingo-CXR reports. ...

Collaboration between clinicians and vision–language models in radiology report generation

Nature Medicine

... To simulate real-world usage, we utilized the default hyperparameters for all LLMs. 31 The temperature, which is a parameter that regulates the randomness of the LLM's outputs, was set to 1 for OpenAI o1, o3-mini and DeepSeek-R1, and 0.7 for Gemini 2.0 Flash-Thinking. 32 In addition, for the OpenAI o1 and o3-mini, the "reasoning_effort" parameter was set to the default value of "medium". ...

A toolbox for surfacing health equity harms and biases in large language models

Nature Medicine

... Automated radiology report generation (RRG) can significantly alleviate the workload of radiologists. Recent advancements in large language models (LLMs) have rapidly enhanced AI's capability to assist in radiology, especially with multimodal models that interpret images and text, including chest X-rays (CXR) [4,15,49,54]. However, applying multimodal LLMs (MLLMs) to CXR RRG is challenging due to high computational costs and vast data requirements. ...

Towards Generalist Biomedical AI
  • Citing Article
  • February 2024

NEJM AI

... These advancements have opened new possibilities for applying LLMs in specialized domains such as medicine, where synthesizing and interpreting textual medical data accurately and efficiently is crucial. Within the medical field specifically, LLMs have been employed to facilitate tasks ranging from clinical documentation automation and report summarization, as demonstrated by the successful implementation of clinical summarization tools like RadLing [13] and AI-assisted decision-making systems such as MedPaLM [14], to decision support systems and personalized patient communication [14]. Such applications highlight the potential of LLMs to transform clinical workflows, enhance diagnostic accuracy, and ultimately improve patient care. ...

Publisher Correction: Large language models encode clinical knowledge

Nature

... To rectify this deficiency, we embed both factual and counterfactual content within the learnable prompt to attain more generalizable representations. As suggested by [39], our prompt incorporates detailed instructions and is formulated by concatenating the factual visual tokens, factual label, counterfactual label, and the index of the patch with supplementary text. The training prompt is articulated as "The u patch of image contains critical features for diagnosing C. Generate a diagnostic report for the image by describing critical entities including tubes, pneumothorax, pleural effusion, lung opacity, cardiac silhouette, hilar enlargement, and mediastinum." ...

Towards Generalist Biomedical AI

... The question of whether general commercial LLMs can deliver useful and usable answers to queries relevant to older adults remains open. In medical decision-making, even highly advanced and fine-tuned LLMs can make errors, often performing significantly worse than clinicians (Singhal et al., 2023). This study explored the potential of various AI-based assistants in addressing queries related to the health, well-being, and independence of older adults. ...

Large language models encode clinical knowledge

Nature

... In the future, we should explore the potential of generative artificial intelligence (GenAI) models to support patient sensemaking through collections. Tools such as ChatGPT [60] and Med-PaLM [61], which have demonstrated substantial medical knowledge [62][63][64], can replace the need for custom-made machine learning algorithms for knowledge-intensive tasks. ...

Towards Expert-Level Medical Question Answering with Large Language Models