Daniel Shu Wei Ting’s research while affiliated with Singapore Eye Research Institute and other places


Publications (95)


Screenshot of chatbot user interface, for assessment of MRI scan appropriateness based on patient profile.
Diagrammatic representation of steps for knowledge base construction for contextualizing GPT-4, based on tables and text material in the ACR Appropriateness Guidelines.
A schematic representation of steps for information retrieval, as part of the Retrieval Augmented Generation (RAG) framework.
Comparison of various LLM versions. A brief description of the optimization techniques is as follows: Chain-of-Thought = a prompt that guides the reasoning sequence by outlining a step-by-step process to extract relevant information for MRI appropriateness classification, analyze the imaging recommendation with reference to the retrieved contexts, identify any imaging procedure superior to the one indicated, and output the final classification category. Retrieval Augmented Generation = domain-specific facts are retrieved from a reference data source; the LLM was prompted to determine the most relevant ACR guidelines to which the patient query pertained. Retrieval of text and tabular data = optimization of free-text chunk sizes and of the number of retrieved free-text documents. Metadata filtering = all ingested documents were tagged with metadata specifying the ACR guideline they belonged to; during retrieval, GPT-4 was prompted to determine the potentially relevant conditions the request pertained to, and to filter the knowledge base for relevant documents based on the metadata labels. Embeddings content optimization = to improve retrieval of tabular data, only table descriptions were used for embeddings generation, instead of both descriptions and table contents.
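The metadata-filtering step described in the caption can be sketched in a few lines. This is a minimal illustration, not the study's implementation: the guideline names, chunk texts, and tiny hand-made embedding vectors are all hypothetical, and a real system would use a text embedding model and have GPT-4 name the relevant guidelines.

```python
import math

# Toy knowledge base: each chunk is tagged with the ACR guideline it belongs to
# (the "metadata filtering" idea). Embeddings are tiny hand-made vectors purely
# for illustration; all names and data here are hypothetical.
KNOWLEDGE_BASE = [
    {"text": "MRI without contrast is usually appropriate for chronic knee pain.",
     "guideline": "Chronic Knee Pain", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Radiographs are the first-line study for acute shoulder trauma.",
     "guideline": "Shoulder Pain", "embedding": [0.1, 0.9, 0.0]},
    {"text": "MRI is usually appropriate for suspected occult hip fracture.",
     "guideline": "Hip Pain", "embedding": [0.0, 0.2, 0.9]},
]

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_embedding, relevant_guidelines, top_k=1):
    """Filter chunks by guideline metadata first, then rank by similarity."""
    candidates = [c for c in KNOWLEDGE_BASE
                  if c["guideline"] in relevant_guidelines]
    candidates.sort(key=lambda c: cosine(query_embedding, c["embedding"]),
                    reverse=True)
    return [c["text"] for c in candidates[:top_k]]

# In the pipeline described above, the LLM would first be prompted to name the
# potentially relevant guideline(s); here that step's output is hard-coded.
print(retrieve([0.8, 0.2, 0.1], {"Chronic Knee Pain", "Shoulder Pain"}))
```

Filtering before similarity search keeps chunks from unrelated guidelines out of the candidate pool entirely, rather than hoping they rank low.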
ChatGPT performance in assessing musculoskeletal MRI scan appropriateness based on ACR appropriateness criteria
  • Article
  • Full-text available

February 2025 · 8 Reads

Jin Rong Tan · Daniel Y. Z. Lim · Quan Le · [...] · Yusheng Keefe Lai

Large Language Models (LLMs) hold potential as clinical decision support tools, particularly when integrated with domain-specific knowledge. In radiology, there is limited research on LLMs for assessing imaging appropriateness. This study evaluates the performance of a contextualized GPT-4-based LLM in assessing the appropriateness of musculoskeletal MRI scan requests, comparing it against standard models and successive optimization versions. The LLMs' performance was also compared against human clinicians of varying experience (two radiology residents, two subspecialist attendings, and an orthopaedic surgeon). Using a retrieval-augmented generation framework, the LLM was provided with a domain-specific knowledge base built from 33 American College of Radiology Appropriateness Criteria guidelines. A test dataset of 70 fictional case scenarios was created, including cases with insufficient clinical information. Quantitative analysis using the McNemar mid-P test revealed that the optimized LLM achieved 92.86% accuracy, significantly outperforming the baseline model (61.43%, P < .001) and the standard GPT-4 model (51.29%, P < .001). The optimized model also excelled at identifying cases with insufficient clinical information. Compared with human clinicians, the optimized LLM performed better than all but one radiologist. This study demonstrates that, with contextualization and optimization, GPT-4-based LLMs can improve performance in assessing imaging appropriateness and show promise as clinical decision support tools in radiology.
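The abstract's paired-accuracy comparison uses the McNemar mid-P test, which can be computed exactly from the discordant-pair counts. A minimal pure-Python sketch follows; the counts in the usage line are illustrative, not the study's actual data.

```python
from math import comb

def mcnemar_midp(b, c):
    """Mid-P McNemar test for paired binary classifiers.

    b = cases model A got right and model B got wrong; c = the reverse.
    Under H0 the discordant pairs follow Binomial(b + c, 0.5). The mid-P
    variant subtracts the point probability of the observed count from the
    exact two-sided p-value, reducing the exact test's conservatism.
    """
    n = b + c
    x = min(b, c)
    point = comb(n, x) * 0.5 ** n                     # P(X = x)
    tail = sum(comb(n, k) for k in range(x + 1)) * 0.5 ** n  # P(X <= x)
    exact_p = min(1.0, 2.0 * tail)                    # exact two-sided p
    return exact_p - point                            # mid-P correction

# Illustrative discordant counts (hypothetical, not from the study):
print(mcnemar_midp(5, 1))
```

Only the discordant pairs enter the statistic; cases where both models agree carry no information about which model is better.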



Fig. 1: A typical retrieval-augmented generation framework.
Fig. 2: Retrieval-augmented generation could contribute to health care in terms of equity, reliability, and personalization.
Retrieval-augmented generation for generative artificial intelligence in health care

January 2025 · 66 Reads · 4 Citations

Generative artificial intelligence has brought disruptive innovations in health care but faces certain challenges. Retrieval-augmented generation (RAG) enables models to generate more reliable content by leveraging the retrieval of external knowledge. In this perspective, we analyze the possible contributions that RAG could bring to health care in equity, reliability, and personalization. Additionally, we discuss the current limitations and challenges of implementing RAG in medical scenarios.



Figure 2: Inter-rater reliability (Cohen's Kappa) between PEACH iterations and attending consultants. The diagonal values demonstrate perfect agreement within individual decision sets (Kappa = 1).
Real-world Deployment and Evaluation of PErioperative AI CHatbot (PEACH) -- a Large Language Model Chatbot for Perioperative Medicine

December 2024 · 35 Reads

Large Language Models (LLMs) are emerging as powerful tools in healthcare, particularly for complex, domain-specific tasks. This study describes the development and evaluation of the PErioperative AI CHatbot (PEACH), a secure LLM-based system integrated with local perioperative guidelines to support preoperative clinical decision-making. PEACH was embedded with 35 institutional perioperative protocols in the secure Claude 3.5 Sonnet LLM framework within Pair Chat (developed by the Singapore Government) and tested in a silent deployment with real-world data. Accuracy, safety, and usability were assessed. Deviations and hallucinations were categorized by potential harm, and user feedback was evaluated using the Technology Acceptance Model (TAM). After the initial silent deployment, one protocol was amended. In 240 real-world clinical iterations, PEACH achieved a first-generation accuracy of 97.5% (78/80) and an overall accuracy of 96.7% (232/240) across three iterations. The updated PEACH demonstrated improved accuracy of 97.9% (235/240), a statistically significant difference from the null hypothesis of 95% accuracy (p = 0.018, 95% CI: 0.952-0.991). Hallucinations and deviations were minimal (1/240 and 2/240, respectively). Clinicians reported that PEACH expedited decisions in 95% of cases, and inter-rater reliability ranged from kappa 0.772-0.893 within PEACH and 0.610-0.784 among attendings. PEACH is an accurate, adaptable tool that enhances consistency and efficiency in perioperative decision-making. Future research should explore its scalability across specialties and its impact on clinical outcomes.
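The reported p = 0.018 against a 95% accuracy null is reproducible with an exact binomial tail sum, assuming a one-sided test of P(X >= 235) with X ~ Binomial(240, 0.95). A small pure-Python sketch:

```python
from math import comb

def binom_p_greater(successes, n, p0):
    """One-sided exact binomial p-value: P(X >= successes), X ~ Bin(n, p0)."""
    return sum(comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
               for k in range(successes, n + 1))

# Updated PEACH: 235 correct out of 240, tested against a 95%-accuracy null.
p = binom_p_greater(235, 240, 0.95)
print(round(p, 3))  # ≈ 0.018, matching the reported value
```

The confidence interval quoted in the abstract (0.952-0.991) is consistent with an exact (Clopper-Pearson-style) interval for 235/240, though the paper does not state which interval method was used.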



Figure 1. Different roles in the multi-agent conversation framework.
Figure 2. Schematic illustration of multi-agent (Framework 4-C) discussion dynamics leading to accurate differential diagnosis. ED: emergency department; LLM: large language model.
Figure 3. Flow diagram for identification of case reports with cognitive bias.
Number of correct responses across different multi-agent frameworks and humans.
Clinical scenarios with the initial wrong and the final correct diagnosis are given in the scenarios.
Mitigating Cognitive Biases in Clinical Decision-Making Through Multi-Agent Conversations Using Large Language Models: Simulation Study

November 2024 · 30 Reads · 4 Citations

Journal of Medical Internet Research

Background: Cognitive biases in clinical decision-making contribute significantly to diagnostic errors and suboptimal patient outcomes. Addressing these biases presents a formidable challenge in the medical field.

Objective: This study aimed to explore the role of large language models (LLMs) in mitigating these biases through the use of a multi-agent framework. We simulated clinical decision-making processes through multi-agent conversation and evaluated their efficacy in improving diagnostic accuracy compared with humans.

Methods: A total of 16 published and unpublished case reports in which cognitive biases resulted in misdiagnoses were identified from the literature. In the multi-agent framework, we leveraged GPT-4 (OpenAI) to facilitate interactions among different simulated agents, replicating clinical team dynamics. Each agent was assigned a distinct role: (1) making the final diagnosis after considering the discussions, (2) acting as a devil's advocate to correct confirmation and anchoring biases, (3) serving as a field expert in the required medical subspecialty, (4) facilitating discussions to mitigate premature closure bias, and (5) recording and summarizing findings. We tested varying combinations of these agents within the framework to determine which configuration yielded the highest rate of correct final diagnoses. Each scenario was repeated 5 times for consistency. The accuracies of the initial diagnoses and the final differential diagnoses were evaluated, and comparisons with human-generated answers were made using the Fisher exact test.

Results: A total of 240 responses across 3 multi-agent frameworks were evaluated. The initial diagnosis had an accuracy of 0% (0/80). However, following multi-agent discussions, the accuracy for the top 2 differential diagnoses increased to 76% (61/80) for the best-performing multi-agent framework (Framework 4-C). This was significantly higher than the accuracy achieved by human evaluators (odds ratio 3.49; P=.002).

Conclusions: The multi-agent framework demonstrated an ability to re-evaluate and correct misconceptions, even in scenarios with misleading initial investigations. In addition, the LLM-driven multi-agent conversation framework shows promise in enhancing diagnostic accuracy in diagnostically challenging medical scenarios.
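The framework-versus-human comparison above used the Fisher exact test on 2x2 count tables. A minimal pure-Python implementation is sketched below; the check uses Fisher's classic tea-tasting table rather than the study's counts, which are not fully given in the abstract.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    total = comb(n, col1)

    def p_table(k):  # probability of k in the top-left cell
        return comb(row1, k) * comb(row2, col1 - k) / total

    p_obs = p_table(a)
    lo = max(0, col1 - row2)  # feasible range for the top-left cell
    hi = min(col1, row1)
    return sum(p_table(k) for k in range(lo, hi + 1)
               if p_table(k) <= p_obs + 1e-12)

# Fisher's classic tea-tasting table as a sanity check: [[3, 1], [1, 3]]
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))  # ≈ 0.4857
```

Because the test enumerates all tables with fixed margins, it is exact even for the small counts typical of case-report studies like this one.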



Large Language Models in Randomized Controlled Trials Design (Preprint)

October 2024 · 15 Reads

BACKGROUND: Randomized controlled trials (RCTs) face challenges such as limited generalizability, insufficient recruitment diversity, and high failure rates, often due to restrictive eligibility criteria and inefficient patient selection. Large language models (LLMs) have shown promise in various clinical tasks, but their potential role in RCT design remains underexplored.

OBJECTIVE: This study investigates the ability of LLMs, specifically GPT-4-Turbo-Preview, to assist in designing RCTs that enhance generalizability and recruitment diversity and reduce failure rates, while maintaining clinical safety and ethical standards.

METHODS: We conducted a non-interventional, observational study analyzing 20 parallel-arm RCTs, comprising 10 completed and 10 ongoing studies published after January 2024 to mitigate pretraining biases. The LLM was tasked with generating RCT designs based on input criteria, including eligibility, recruitment strategies, interventions, and outcomes. The accuracy of LLM-generated designs was quantitatively assessed by comparing them to clinically validated ground-truth data from ClinicalTrials.gov. Qualitative assessments were performed using Likert scale ratings (1–3) for domains such as safety, accuracy, objectivity, pragmatism, inclusivity, and diversity.

RESULTS: The LLM achieved an overall accuracy of 72% in replicating RCT designs. Recruitment and intervention designs demonstrated high agreement with the ground truth, achieving 88% and 93% accuracy, respectively. However, LLMs showed lower accuracy in designing eligibility criteria (55%) and outcomes measurement (53%). Qualitative evaluations showed that LLM-generated designs scored above 2 points across all domains, indicating strong clinical alignment. In particular, LLMs enhanced diversity and pragmatism, key factors in improving RCT generalizability and addressing failure rates.

CONCLUSIONS: LLMs such as GPT-4-Turbo-Preview have demonstrated potential in improving RCT design, particularly in recruitment and intervention planning, while enhancing generalizability and addressing diversity. However, expert oversight and regulatory measures are essential to ensure patient safety and ethical standards. The findings support further integration of LLMs into clinical trial design, although continued refinement is necessary to address limitations in eligibility and outcomes measurement.


Comparison of the different large language models used in the study.
Retrieval Augmented Generation for 10 Large Language Models and its Generalizability in Assessing Medical Fitness

October 2024 · 264 Reads · 1 Citation

Large Language Models (LLMs) show potential for medical applications but often lack specialized clinical knowledge. Retrieval Augmented Generation (RAG) allows customization with domain-specific information, making it suitable for healthcare. This study evaluates the accuracy, consistency, and safety of RAG models in determining fitness for surgery and providing preoperative instructions. We developed LLM-RAG models using 35 local and 23 international preoperative guidelines and tested them against human-generated responses; a total of 3,682 responses were evaluated. Clinical documents were processed using LlamaIndex, and 10 LLMs, including GPT-3.5, GPT-4, and Claude 3, were assessed. Fourteen clinical scenarios were analyzed, focusing on seven aspects of preoperative instructions. Established guidelines and expert judgment were used to determine correct responses, with human-generated answers serving as comparisons. The LLM-RAG models generated responses within 20 seconds, significantly faster than clinicians (10 minutes). The GPT-4 LLM-RAG model achieved the highest accuracy (96.4% vs. 86.6%, p=0.016), with no hallucinations, producing correct instructions comparable to clinicians'. Results were consistent across both local and international guidelines. This study demonstrates the potential of LLM-RAG models for preoperative healthcare tasks, highlighting their efficiency, scalability, and reliability.


Citations (59)


... RAG systems improve traditional LLMs by integrating them with external knowledge bases, ensuring the responses generated are grounded in verified content. RAG can mitigate issues such as hallucinations and inaccuracies, thus improving the trustworthiness of AI-driven health chatbots [25,26]. ...

Reference:

Transforming Medical Data Access: The Role and Challenges of Recent Language Models in SQL Query Automation
Retrieval-augmented generation for generative artificial intelligence in health care

... This cognitive phenomenon has been observed in AI-based systems for mammography classification 21 and cerebral aneurysm detection 22, and could lead to systematic errors in physicians who fail to critically evaluate LLM suggestions. In contrast, LLMs could also play a role in reducing cognitive biases if they are intentionally utilized to provide different perspectives and uncover common fallacies 23. ...

Mitigating Cognitive Biases in Clinical Decision-Making Through Multi-Agent Conversations Using Large Language Models: Simulation Study

Journal of Medical Internet Research

... Addressing these concerns calls for initiatives like the proposed CARE-AI (Collaborative Assessment for Responsible and Ethical AI Implementation) framework, which seeks to align AI technologies with rigorous ethical standards and practical safeguards [44]. CARE-AI emphasizes risk assessment for misinformation, data privacy, fairness across diverse patient populations, and transparent declarations of an AI system's non-human nature. ...

An ethics assessment tool for artificial intelligence implementation in healthcare: CARE-AI
  • Citing Article
  • October 2024

Nature Medicine

... Non-mydriatic retinal color fundus photography (CFP) is widely used in various fundus disease analyses due to the advantage of not requiring pupillary dilation [8,21,32,37,40,41]. However, it commonly suffers low quality due to artifacts, uneven illumination, deficient ocular media transparency, poor focus, or inappropriate imaging [9,24]. ...

A Competition for the Diagnosis of Myopic Maculopathy by Artificial Intelligence Algorithms
  • Citing Article
  • September 2024

JAMA Ophthalmology

... However, unknown risks associated with using the technology to augment or automate critical processes warrant close scrutiny by the research community and regulators. We believe that LLMs can bring significant value in reducing administrative burden within regulatory agencies and in increasing the sensitivity and reliability of post-marketing surveillance and evaluation. Studies suggest promising performance in early identification and classification of adverse drug events and in automation of drug approval processes [56]. Such tools are of low clinical risk, but bring significant returns in reducing documentation burden and turnaround time for new drug applications. ...

Generative AI and Large Language Models in Reducing Medication Related Harm and Adverse Drug Events – A Scoping Review

... Health. With the increasing use of AI and GenAI in mental health support (e.g., screening for mental health issues [88,89], LLM-powered psychotherapy [47,67], mental education [89]), several ethical challenges emerge [18,68,84]. These ethical considerations have centered around (1) accountability that encompasses governance, legal responsibilities, and liability, ensuring that actions and decisions by AI are traceable and justifiable [13,100]; (2) autonomy that demands respect for human decision-making, emphasizing informed consent and human oversight so that individuals retain control over their mental health treatment [31,53,83]; (3) equity that seeks to eliminate biases and ensure fairness and justice in AI interactions [56,83,102,103,112]; (4) integrity that relates to the honesty and ethical conduct in mental health research and psychotherapy delivery [94,103]; (5) non-maleficence focusing on preventing harm, avoiding misleading information, and ensuring the safety and mental well-being of users [28,86]; (6) privacy that focuses on handling mental health data and protection of client confidentiality [56,83,102]; (7) security that aims to protect sensitive data from unauthorized access and breaches, emphasizing confidentiality and safety [16,48]; (8) transparency that involves reasoning behind AI-driven mental health recommendations be explainable and accessible to clients and practitioners [16,56]; and (9) trust, cultivated through consistent reliability and therapeutic value of AI tools within mental health care [94]. ...

Generative artificial intelligence and ethical considerations in health care: a scoping review and ethics checklist

The Lancet Digital Health

... For example, encrypted communication between traffic sensors and control centers prevents interception, while firewalls block unauthorized access to critical systems [140]. Additionally, regular security audits proactively identify and patch vulnerabilities in software and hardware, ensuring systems remain resilient against evolving threats [141]. ...

Cybersecurity in the Generative Artificial intelligence Era
  • Citing Article
  • August 2024

Asia-Pacific Journal of Ophthalmology

... Facial features change as a person ages, with wrinkles, sagging skin, and structural changes. These changes can give facial recognition algorithms a hard time, resulting in false positives or negatives, particularly when the system relies on images taken years earlier [14]. To deal with this problem, age estimation techniques are employed alongside recognition systems to predict an individual's age range. ...

Latest Developments of Generative Artificial Intelligence and Applications in Ophthalmology
  • Citing Article
  • August 2024

Asia-Pacific Journal of Ophthalmology

... RAG has shown promising results, particularly in the generation of clinical trial documentation, further demonstrating its utility across diverse healthcare contexts [15]. Additionally, leading LLMs such as those in the GPT series, Google's Gemini, and Claude-3-Opus, are demonstrating significant potential when integrated with RAG techniques [16]. In addition to document generation, RAG is proving to be a highly effective tool for improving question-answering systems, enhancing their accuracy and relevance in medical contexts [17], [18]. ...

Retrieval Augmented Generation for 10 Large Language Models and its Generalizability in Assessing Medical Fitness

... Training on a wide range of cases, including varying demographics, disease stages, and tissue types, ensures that models generalize well across different patient populations. Diversity is important in terms of giving the program a wide range of racial and ethnic groups, as well as disease progression from a large and unique spectrum of cases [24]. This approach also helps reduce biases and improve diagnostic accuracy. ...

Artificial Intelligence-Based Disease Activity Monitoring to Personalized Neovascular Age-Related Macular Degeneration Treatment: A Feasibility Study

Ophthalmology Science