Leo Anthony Celi’s research while affiliated with Massachusetts Institute of Technology and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (538)


[Figure and table thumbnails from the article: risk of bias in four domains assessed by PROBAST; heatmap depicting common areas of deficiencies in reporting standards as assessed by TRIPOD+AI; basic characteristics of included studies; high-priority areas in methodology and reporting that could be improved. * Publications with the same first author and year, PMIDs listed in ascending order: Yang and colleagues (2022): 35430680, 35607360 [58,59]; Luo and colleagues (2023): 36653317, 36773821 [65,66]; Zhang and colleagues (2023): 36902504, 36964219, 37196588 [69–71].]
A systematic review of machine learning-based prognostic models for acute pancreatitis: Towards improving methods and reporting quality
  • Literature Review
  • Full-text available

February 2025 · 41 Reads

Brian Critelli · Amier Hassan · Ila Lahooti · [...]

Background
An accurate prognostic tool is essential to aid clinical decision-making (e.g., patient triage) and to advance personalized medicine. However, such a prognostic tool is lacking for acute pancreatitis (AP). Increasingly, machine learning (ML) techniques are being used to develop high-performing prognostic models in AP, but methodologic and reporting quality has received little attention. High-quality reporting and study methodology are critical for model validity, reproducibility, and clinical implementation. In collaboration with content experts in ML methodology, we performed a systematic review critically appraising the quality of methodology and reporting of recently published ML AP prognostic models.

Methods and findings
Using a validated search strategy, we identified ML AP studies from the MEDLINE and EMBASE databases published between January 2021 and December 2023. We also searched the pre-print servers medRxiv, bioRxiv, and arXiv for pre-prints registered between January 2021 and December 2023. Eligibility criteria included all retrospective or prospective studies that developed or validated new or existing ML models in patients with AP to predict an outcome following an episode of AP. Meta-analysis was considered if there was homogeneity in the study design and in the type of outcome predicted. For risk of bias (ROB) assessment, we used the Prediction Model Risk of Bias Assessment Tool (PROBAST). Quality of reporting was assessed using the Transparent Reporting of a Multivariable Prediction Model of Individual Prognosis or Diagnosis—Artificial Intelligence (TRIPOD+AI) statement, which defines standards for 27 items that should be reported in publications using ML prognostic models. The search strategy identified 6,480 publications, of which 30 met the eligibility criteria. Studies originated from China (22), the United States (4), and other countries (4). All 30 studies developed a new ML model and none sought to validate an existing ML model, producing a total of 39 new ML models. AP severity (23/39) or mortality (6/39) were the most commonly predicted outcomes. The mean area under the curve for all models and endpoints was 0.91 (SD 0.08). The ROB was high for at least one domain in all 39 models, particularly the analysis domain (37/39 models). Steps were not taken to minimize over-optimistic model performance in 27/39 models. Because of heterogeneity in study design and in how outcomes were defined and determined, meta-analysis was not performed. Studies reported on only 15/27 items from the TRIPOD+AI standards, with only 7/30 justifying sample size and 13/30 assessing data quality. Other reporting deficiencies included omissions regarding human–AI interaction (28/30), handling of low-quality or incomplete data in practice (27/30), sharing of analytical code (25/30) and study protocols (25/30), and reporting of source data (19/30).

Conclusions
There are significant deficiencies in the methodology and reporting of recently published ML-based prognostic models in AP patients. These undermine the validity, reproducibility, and implementation of these prognostic models despite their promise of superior predictive accuracy.

Registration
Research Registry (reviewregistry1727)
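The review finds that most models lacked steps to minimize over-optimistic performance estimates. As a hedged illustration of one such step (not taken from any of the reviewed studies), the sketch below contrasts the apparent AUC computed on training data with a cross-validated AUC; the classifier and synthetic dataset are placeholders.

```python
# Illustrative sketch only: one common way to curb over-optimistic performance
# estimates is to report cross-validated AUC rather than the apparent AUC
# computed on the training data. The classifier and synthetic dataset below
# are placeholders, not data or code from the reviewed studies.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, weights=[0.8, 0.2],
                           random_state=0)  # stand-in for an AP cohort
model = RandomForestClassifier(n_estimators=200, random_state=0)

# Apparent AUC: fit and evaluate on the same data (optimistically biased).
model.fit(X, y)
apparent_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

# Cross-validated AUC: every prediction is made on data the model never saw.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(f"Apparent AUC:  {apparent_auc:.2f}")  # typically close to 1.0
print(f"5-fold CV AUC: {cv_auc.mean():.2f} +/- {cv_auc.std():.2f}")
```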



Do Language Models Think Like Doctors?

February 2025 · 15 Reads

Background: While large language models (LLMs) are being increasingly deployed for clinical decision support, existing evaluation methods such as medical licensing exams fail to capture critical aspects of clinical reasoning, including reasoning in dynamic clinical circumstances. Script Concordance Testing (SCT), a decades-old medical assessment tool, offers a nuanced way to assess how new information influences diagnostic and therapeutic decisions under uncertainty.

Methods: We developed a comprehensive and publicly available benchmark comprising 750 SCT questions from 10 internationally diverse medical datasets (9 previously unreleased) spanning multiple specialties and institutions. Each question presents a clinical scenario and then asks how new information affects the likelihood of a diagnosis or management decision, scored against expert panels (Figure 1). We evaluated four state-of-the-art LLMs against the combined responses of 1,070 medical students, 193 resident physicians, and 300 attending physicians across all datasets.

Results: LLMs demonstrated markedly lower performance on SCTs than their typical achievement on medical multiple-choice benchmarks. GPT-4o achieved the highest performance (63.6% ± 1.2%), significantly outperforming the other models (Claude-3.5 Sonnet: 58.8% ± 1.2%, o1-preview: 58.5% ± 1.3%, Gemini-1.5-Pro: 54.4% ± 1.4%). Models matched or exceeded student performance on multiple examinations but did not reach the level of senior residents or attending physicians (Figure 2). Surprisingly, the integrated chain-of-thought o1-preview model underperformed GPT-4o, a contrast with their relative performance on other medical benchmarks.

Conclusions: SCT represents a challenging and distinctive benchmark for evaluating LLM clinical reasoning capabilities, revealing limitations not apparent in traditional MCQ-based assessments. This work demonstrates the value of SCT in providing a more nuanced evaluation of medical AI systems and highlights specific areas where current models may fall short in clinical reasoning tasks. We are making our benchmark publicly available in a secure format to foster collaborative improvement of clinical reasoning capabilities in LLMs.
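The abstract says responses are scored against expert panels. A common SCT scheme is aggregate scoring, in which the modal panel answer earns full credit and other answers earn partial credit in proportion to how many panelists chose them. The minimal sketch below assumes that scheme; it is not necessarily the exact rubric used in this study, and the function name and example panel are hypothetical.

```python
# Minimal sketch of SCT aggregate scoring (an assumption for illustration,
# not necessarily the paper's rubric): the answer chosen by most panelists
# earns full credit, other answers earn credit proportional to their support.
from collections import Counter

def sct_item_score(response: int, panel_answers: list[int]) -> float:
    """Score one SCT item (Likert response, e.g. -2..+2) against a panel."""
    counts = Counter(panel_answers)
    modal_count = max(counts.values())
    return counts.get(response, 0) / modal_count

# Example: a panel of 10 experts; the model under evaluation answers +1.
panel = [1, 1, 1, 1, 1, 0, 0, 2, 2, -1]
print(sct_item_score(1, panel))   # 1.0 (modal answer)
print(sct_item_score(2, panel))   # 0.4 (partial credit)
print(sct_item_score(-2, panel))  # 0.0 (no panelist chose it)
```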


Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium

February 2025 · 35 Reads

The fourth Machine Learning for Health (ML4H) symposium was held in person on December 15th and 16th, 2024, in the traditional, ancestral, and unceded territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver, British Columbia, Canada. The symposium included research roundtable sessions to foster discussions between participants and senior researchers on timely and relevant topics for the ML4H community. The organization of the research roundtables at the conference involved 13 senior and 27 junior chairs across 13 tables. Each roundtable session included an invited senior chair (with substantial experience in the field), junior chairs (responsible for facilitating the discussion), and attendees from diverse backgrounds with an interest in the session's topic.


Step-by-step causal analysis of EHRs to ground decision-making

February 2025 · 22 Reads

Causal inference enables machine learning methods to estimate treatment effects of medical interventions from electronic health records (EHRs). The prevalence of such observational data, and the difficulty for randomized controlled trials (RCTs) of covering all population/treatment relationships, make these methods increasingly attractive for studying causal effects. However, researchers should be wary of many pitfalls. We propose and illustrate a framework for causal inference, estimating the effect of albumin on mortality in sepsis using an intensive care database (MIMIC-IV) and comparing various sensitivity analyses against RCT results as the gold standard.

The first step is study design, using the target trial concept and the PICOT framework: Population (patients with sepsis), Intervention (combination of crystalloids and albumin for fluid resuscitation), Control (crystalloids only), Outcome (28-day mortality), Time (intervention start within 24h of admission). We show that overly long treatment-initiation windows induce immortal time bias. The second step is selection of the confounding variables based on expert knowledge; progressively adding confounders enables recovery of the RCT results from observational data. As the third step, we assess the influence of multiple models with varying assumptions, showing that a doubly robust estimator (AIPW) with random forests was the most reliable.

Results show that all of these steps are important for valid causal estimates. A valid causal model can then be used to individualize decision-making: subgroup analyses showed that the treatment efficacy of albumin was better for patients older than 60 years, males, and patients with septic shock. Without causal thinking, machine learning is not enough for optimal clinical decision-making at the individual patient level. Our step-by-step analytic framework helps avoid many pitfalls of applying machine learning to EHR data, building models that avoid shortcuts and extract the best decision-making evidence.
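As a rough illustration of the third step, the sketch below computes a doubly robust (AIPW) estimate of the average treatment effect with random-forest nuisance models on simulated data. The synthetic cohort, variable names, and the omission of cross-fitting are simplifications for illustration, not the study's actual pipeline.

```python
# Rough sketch of an AIPW (doubly robust) estimate of the average treatment
# effect using random-forest nuisance models. Synthetic data only; the real
# analysis would use the MIMIC-IV cohort and cross-fitted nuisance models.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))                      # confounders (age, SOFA, ...)
p_treat = 1 / (1 + np.exp(-X[:, 0]))             # treatment depends on X[:, 0]
T = rng.binomial(1, p_treat)                     # 1 = crystalloids + albumin
Y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.3 * T))))  # 28-day death

# Nuisance models: propensity score e(X) and outcome regressions mu_t(X).
e_hat = np.clip(
    RandomForestClassifier(n_estimators=200, random_state=0)
    .fit(X, T).predict_proba(X)[:, 1],
    0.01, 0.99,                                   # clip to avoid extreme weights
)
mu1 = RandomForestRegressor(n_estimators=200, random_state=0).fit(
    X[T == 1], Y[T == 1]).predict(X)
mu0 = RandomForestRegressor(n_estimators=200, random_state=0).fit(
    X[T == 0], Y[T == 0]).predict(X)

# AIPW estimator of the average treatment effect on mortality.
ate_aipw = np.mean(
    T * (Y - mu1) / e_hat + mu1
    - (1 - T) * (Y - mu0) / (1 - e_hat) - mu0
)
print(f"AIPW estimate of the treatment effect on 28-day mortality: {ate_aipw:.3f}")
```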


The Data Artifacts Glossary: a community-based repository for bias on health datasets

February 2025 · 22 Reads · Journal of Biomedical Science

Background
The deployment of Artificial Intelligence (AI) in healthcare has the potential to transform patient care through improved diagnostics, personalized treatment plans, and more efficient resource management. However, the effectiveness and fairness of AI are critically dependent on the data it learns from. Biased datasets can lead to AI outputs that perpetuate disparities, particularly affecting social minorities and marginalized groups.

Objective
This paper introduces the “Data Artifacts Glossary”, a dynamic, open-source framework designed to systematically document and update potential biases in healthcare datasets. The aim is to provide a comprehensive tool that enhances the transparency and accuracy of AI applications in healthcare and contributes to understanding and addressing health inequities.

Methods
Using a methodology inspired by the Delphi method, a diverse team of experts conducted iterative rounds of discussions and literature reviews. The team synthesized insights to develop a comprehensive list of bias categories and designed the glossary’s structure. The Data Artifacts Glossary was piloted using the MIMIC-IV dataset to validate its utility and structure.

Results
The Data Artifacts Glossary adopts a collaborative approach modeled on successful open-source projects such as Linux and Python. Hosted on GitHub, it uses robust version control and collaborative features, allowing stakeholders from diverse backgrounds to contribute. Through a rigorous peer review process managed by community members, the glossary ensures the continual refinement and accuracy of its contents. The implementation of the Data Artifacts Glossary with the MIMIC-IV dataset illustrates its utility: it categorizes biases and facilitates their identification and understanding.

Conclusion
The Data Artifacts Glossary serves as a vital resource for enhancing the integrity of AI applications in healthcare by providing a mechanism to recognize and mitigate dataset biases before they impact AI outputs. It not only aids in avoiding bias in model development but also contributes to understanding and addressing the root causes of health disparities.





Citations (42)


... Even the production of scientific evidence involves a "garden of forking paths" of countless decisions with equally valid alternatives (Gelman and Loken, 2013), as exemplified by models used to guide the COVID-19 response (Harvard et al., 2021). This understanding suggests that it matters who does the research (Charpignon et al., 2025). As such, epistemic humility, plurality, and incorporation of diverse viewpoints and lived experiences should be at the heart of the fairness movement in AI. ...

Reference:

Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium
Diversity in the medical research ecosystem: a descriptive scientometric analysis of over 49 000 studies and 150 000 authors published in high-impact medical journals between 2007 and 2022

BMJ Open

... We extract retinal layouts using open-source models for L segmentation [52] and CD segmentation [13,17]. For AV segmentation, we retrained a SwinV2 tiny -based model on our annotated datasets with data augmentation techniques such as random color jitter, flips, and rotations. ...

Deep learning generalization for diabetic retinopathy staging from fundus images

Physiological Measurement

... This raises the question of how to derive the SOFA score when death occurs, which is a relevant issue for clinical trials that use this score as an endpoint. (2) In addition, assessing the SOFA score at only two time points (Day 1 and Day 7) may miss critical changes in organ dysfunction that occur between these points. More frequent assessments (e.g., daily SOFA scores or at least 48 hours) could provide a more granular understanding of how organ dysfunction evolves and its impact on mortality. ...

Analyzing how the components of the SOFA score change over time in their contribution to mortality

Critical Care Science

... We evaluate TXAGENT's ability to generalize across different drug representations. LLM-based models are sensitive to variations in how drugs are referenced [24], such as brand versus generic names. To test generalization, we construct three modified versions of the DrugPC benchmark: ...

Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

... However, despite these advantages, the broader application of LLMs in clinical practice faces several challenges, particularly in terms of cost, training, and infrastructure [32]. One critical concern is the ethical implications related to patient privacy and data security. ...

Economics and Equity of Large Language Models: Health Care Perspective

Journal of Medical Internet Research

... 4 Their outputs are limited by the quality of the data they are trained on, meaning that any biases and inaccuracies in these data can perpetuate errors and harmful outcomes. 5 Moreover, equating AI to a human brain should be avoided. AI does not reason, think, or intuit like humans; it processes patterns and performs specific tasks exceptionally well, but does not have the depth of understanding and contextual judgment inherent to human cognition, which can lead to oversights in complex clinical scenarios. ...

Artificial intelligence and global health equity
  • Citing Article
  • October 2024

The BMJ

... Moreover, the combination of artificial intelligence (AI) algorithms with portable devices has the potential to revolutionize medical care by streamlining screening, diagnosis, and monitoring processes, especially in resource-constrained settings 7 . However, concerns regarding the accuracy and fairness of AI algorithms persist, primarily due to a lack of representative data and generalizable algorithms 8 . ...

Unmasking biases and navigating pitfalls in the ophthalmic artificial intelligence lifecycle: A narrative review

... Limited patient engagement, with AI outputs often requiring significant interpretation by HCPs. 19 Facilitates clear, patient-centred communication, empowering shared decisionmaking and incorporating patient values. ...

Understanding and training for the impact of large language models and artificial intelligence in healthcare practice: a narrative review

... This systematic review and meta-analysis of diagnostic test accuracy was conducted and reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [13], and the Cochrane Handbook for Diagnostic Test Accuracy Reviews [14]. The study protocol was registered at the International Prospective Register of systematic reviews (PROSPERO) on registration number CRD42019146781 [15]. Due to the reviewing nature of this study, institutional review board ethical approval was not needed. ...

Assessment of fluid responsiveness in patients under mechanical ventilation: a systematic review and meta-analysis
  • Citing Article
  • October 2024

Chest

... However, they may perpetuate or amplify existing disparities if trained on skewed or incomplete data (14). Earlier research revealed biased behavior in LLMs in clinical contexts (14)(15)(16)(17). Recent studies have explored demographic biases related to race (16), social and gender identity (18), or how model training and prompts affect outputs (19). ...

A toolbox for surfacing health equity harms and biases in large language models

Nature Medicine