Article
Literature Review

Abstract

Large language models (LLMs) can respond to free-text queries without being specifically trained in the task in question, causing excitement and concern about their use in healthcare settings. ChatGPT is a generative artificial intelligence (AI) chatbot produced through sophisticated fine-tuning of an LLM, and other tools are emerging through similar developmental processes. Here we outline how LLM applications such as ChatGPT are developed, and we discuss how they are being leveraged in clinical settings. We consider the strengths and limitations of LLMs and their potential to improve the efficiency and effectiveness of clinical, educational and research work in medicine. LLM chatbots have already been deployed in a range of biomedical contexts, with impressive but mixed results. This review acts as a primer for interested clinicians, who will determine if and how LLM technology is used in healthcare for the benefit of patients and practitioners.

... Such a scalable corpus can improve the performance of LLMs in analyzing unstructured documents and enable flexible implementation [5]. As LLMs are typically trained on various sources of literature, adapting them to perform well in a specific domain is required, typically through a process called 'fine-tuning' [9,10]. Another approach is 'prompt engineering', which involves sophisticated refinement of prompts for LLMs to perform specific tasks effectively [11]. ...
... Another approach is 'prompt engineering', which involves sophisticated refinement of prompts for LLMs to perform specific tasks effectively [11]. These approaches are expected to optimize general-purpose LLMs to perform better in specialized tasks within the field of medicine [10]. For example, a fine-tuned chat model based on an open-source LLM showed significant improvement in both qualitative evaluation and quantitative measures (e.g., accuracy improved from 0.837 to 0.844) [12]. ...
... The use of LLMs in clinical research is explored in various areas [10]. Recently, there have been several attempts to use LLMs to evaluate clinical research literature [16,17]. ...
Article
Full-text available
Despite interest in clinical trials with decentralized elements (DCTs), analysis of their trends in trial registries is lacking due to heterogeneous designs and unstandardized terms. We explored Llama 3, an open‐source large language model, to efficiently evaluate these trends. Trial data were sourced from Aggregate Analysis of ClinicalTrials.gov, focusing on drug trials conducted between 2018 and 2023. We utilized three Llama 3 models with a different number of parameters: 8b (model 1), fine‐tuned 8b (model 2) with curated data, and 70b (model 3). Prompt engineering enabled sophisticated tasks such as classification of DCTs with explanations and extracting decentralized elements. Model performance, evaluated on a 3‐month exploratory test dataset, demonstrated that sensitivity could be improved after fine‐tuning from 0.0357 to 0.5385. Low positive predictive value in the fine‐tuned model 2 could be improved by focusing on trials with DCT‐associated expressions from 0.5385 to 0.9167. However, the extraction of decentralized elements was only properly performed by model 3, which had a larger number of parameters. Based on the results, we screened the entire 6‐year dataset after applying DCT‐associated expressions. After the subsequent application of models 2 and 3, we identified 692 DCTs. We found that a total of 213 trials were classified as phase 2, followed by 162 phase 4 trials, 112 phase 3 trials, and 92 phase 1 trials. In conclusion, our study demonstrated the potential of large language models for analyzing clinical trial information not structured in a machine‐readable format. Managing potential biases during model application is crucial.
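As a rough illustration of the screening metrics reported above, the sketch below derives sensitivity and positive predictive value from confusion counts; the counts themselves are hypothetical, since the abstract reports only the resulting ratios.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true DCTs that the model flags (recall)."""
    return tp / (tp + fn)

def positive_predictive_value(tp: int, fp: int) -> float:
    """Fraction of flagged trials that really are DCTs (precision)."""
    return tp / (tp + fp)

# Hypothetical counts for a fine-tuned classifier on a labelled test set.
tp, fp, fn = 7, 6, 6
print(f"sensitivity = {sensitivity(tp, fn):.4f}")                # 7/13 = 0.5385
print(f"PPV         = {positive_predictive_value(tp, fp):.4f}")  # 7/13 = 0.5385

# Restricting screening to trials containing DCT-associated expressions mainly
# removes false positives, which raises PPV (e.g. 11/12 = 0.9167 in this toy case).
print(f"PPV after filtering = {positive_predictive_value(11, 1):.4f}")
```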
... These models typically employ architectures based on multi-head attention decoders with billions of learnable parameters and are trained autoregressively on web-derived datasets containing trillions of tokens [3]. Such substantial scaling equips LLMs to tackle a wide array of complex linguistic tasks, thereby demonstrating remarkable capabilities that span diverse domains such as programming, creative writing, and scientific research [4,5,6,7]. ...
... Overall, as these angles widen beyond 90°, the magnitude of the negative component g∥ increases, indicating that a greater portion of this component should be removed to fulfill the constraint specified in Eq. (7). In an extreme case where the angles approach 180°, nearly the entire g_u^(t) is inverted relative to g^(t) ...
Preprint
Large language model (LLM) unlearning has demonstrated its essential role in removing privacy and copyright-related responses, crucial for their legal and safe applications. However, the pursuit of complete unlearning often comes with substantial costs due to its compromises in their general functionality, leading to a notorious trade-off between unlearning and retention. In examining the update process for unlearning dynamically, we find gradients hold essential information for revealing this trade-off. In particular, we look at the varying relationship between retention performance and directional disparities between gradients during unlearning. It motivates the sculpting of an update mechanism derived from gradients from two sources, i.e., harmful for retention and useful for unlearning. Accordingly, we propose Gradient Rectified Unlearning (GRU), an enhanced unlearning framework controlling the updating gradients in a geometry-focused and optimization-driven manner such that their side impacts on other, unrelated responses can be minimized. Specifically, GRU derives a closed-form solution to project the unlearning gradient onto the orthogonal space of that gradient harmful for retention, ensuring minimal deviation from its original direction under the condition that overall performance is retained. Comprehensive experiments are conducted to demonstrate that GRU, as a general framework, is straightforward to implement and efficiently enhances a range of baseline methods through its adaptable and compatible characteristics. Additionally, experimental results show its broad effectiveness across a diverse set of benchmarks for LLM unlearning.
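A minimal numerical sketch of the gradient-rectification idea described here (not the paper's implementation): when the unlearning gradient conflicts with the retention gradient, its conflicting component is projected out so the update stays orthogonal to the direction that harms retention. Variable names and the exact conflict test are assumptions.

```python
import numpy as np

def rectify_unlearning_gradient(g_unlearn: np.ndarray, g_retain: np.ndarray) -> np.ndarray:
    """Remove from g_unlearn the component that opposes g_retain (illustrative only)."""
    inner = float(np.dot(g_unlearn, g_retain))
    if inner >= 0:
        return g_unlearn                      # no conflict: keep the gradient as-is
    # Conflict (angle > 90 degrees): subtract the projection onto g_retain.
    return g_unlearn - (inner / float(np.dot(g_retain, g_retain))) * g_retain

# Toy example in 2-D: the rectified gradient is orthogonal to the retention gradient.
g_u = np.array([1.0, -2.0])
g_r = np.array([1.0, 1.0])
g_rect = rectify_unlearning_gradient(g_u, g_r)
print(g_rect, np.dot(g_rect, g_r))            # [ 1.5 -1.5 ] 0.0
```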
... In contrast, large language models (LLMs) have demonstrated remarkable progress in natural language processing, excelling in semantic understanding and generation [4][5] [6]. They have been widely applied in medical tasks such as text summarization, clinical decision support, and patient question-answering, showcasing significant practical value [7][8] [9]. Recent studies also suggest that LLMs can enhance medical imaging report analysis by extracting disease features, generating structured reports [10] [11], and assessing image quality [12] [13]. ...
... Developing domain-specific datasets and evaluation frameworks to enhance model performance is also crucial. Techniques such as instruction prompt tuning, RLHF, and adversarial training can further improve model accuracy and generalization [9] [36]. Additionally, exploring explainable AI techniques to develop models with transparent decision-making processes could increase their clinical acceptability. ...
Preprint
Full-text available
Medical imaging quality control (QC) is essential for accurate diagnosis, yet traditional QC methods remain labor-intensive and subjective. To address this challenge, in this study, we establish a standardized dataset and evaluation framework for medical imaging QC, systematically assessing large language models (LLMs) in image quality assessment and report standardization. Specifically, we first constructed and anonymized a dataset of 161 chest X-ray (CXR) radiographs and 219 CT reports for evaluation. Then, multiple LLMs, including Gemini 2.0-Flash, GPT-4o, and DeepSeek-R1, were evaluated based on recall, precision, and F1 score to detect technical errors and inconsistencies. Experimental results show that Gemini 2.0-Flash achieved a Macro F1 score of 90 in CXR tasks, demonstrating strong generalization but limited fine-grained performance. DeepSeek-R1 excelled in CT report auditing with a 62.23\% recall rate, outperforming other models. However, its distilled variants performed poorly, while InternLM2.5-7B-chat exhibited the highest additional discovery rate, indicating broader but less precise error detection. These findings highlight the potential of LLMs in medical imaging QC, with DeepSeek-R1 and Gemini 2.0-Flash demonstrating superior performance.
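For reference, the metrics quoted above (recall, precision, F1, macro F1) can be computed as in the short sketch below; the labels are invented placeholders, not the study's data.

```python
from sklearn.metrics import precision_recall_fscore_support, f1_score

# Placeholder per-report labels for an error-detection task:
# 1 = technical error present, 0 = no error detected.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} macro_f1={macro_f1:.2f}")
```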
... In recent years, Large Language Models (LLMs) have exhibited remarkable capabilities in the domain of natural language processing. LLMs are capable of generating humanlike text responses without being specifically trained for particular tasks [13], which enables them to excel in activities such as sentiment analysis [14] and named entity recognition [15]. For instance, they have demonstrated exceptional performance in content generation [16], text mining [17,18], and knowledge extraction [19]. ...
... From the perspective of practical management significance (PMS), the findings of this study provide substantial value to educational management practices. By enhancing the Q-matrix to improve the accuracy of cognitive diagnostic models ...
Article
Full-text available
Cognitive diagnosis is a key component of intelligent education, aimed at assessing students’ comprehension of specific knowledge concepts. Current methodologies predominantly rely on students’ historical performance records and manually annotated knowledge concepts for analysis. However, the extensive semantic information embedded in exercises, including latent knowledge concepts, has not been fully utilized. This paper presents a novel cognitive diagnosis model based on the LLAMA3-70B framework (referred to as LLM-CDM), which integrates prompt engineering with the rich semantic information inherent in exercise texts to uncover latent knowledge concepts and improve diagnostic accuracy. Specifically, this study first inputs exercise texts into a large language model and develops an innovative prompting method to facilitate deep mining of implicit knowledge concepts within these texts by the model. Following the integration of these newly extracted knowledge concepts into the existing Q matrix, this paper employs a neural network to diagnose students’ understanding of knowledge concepts while applying the monotonicity assumption to ensure the interpretability of model factors. Experimental results from an examination data set for course completion assessments demonstrate that LLM-CDM exhibits superior performance in both accuracy and explainability.
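The step of folding LLM-extracted latent concepts back into an existing Q matrix can be pictured with the small sketch below; it is a schematic of the general idea under assumed data structures, not the LLM-CDM code, and the LLM prompting step itself is omitted.

```python
import numpy as np

def augment_q_matrix(q: np.ndarray, concept_names: list[str],
                     extracted: dict[int, list[str]]) -> tuple[np.ndarray, list[str]]:
    """Extend a binary Q-matrix with latent concepts mined from exercise texts.

    `extracted` maps an exercise index to concept names returned by a prompted LLM;
    any concept not already present gets a new column. Illustrative sketch only.
    """
    names = list(concept_names)
    q = q.copy()
    for ex_idx, concepts in extracted.items():
        for c in concepts:
            if c not in names:
                names.append(c)
                q = np.hstack([q, np.zeros((q.shape[0], 1), dtype=int)])
            q[ex_idx, names.index(c)] = 1
    return q, names

# Two exercises annotated with two concepts; the LLM surfaces one latent concept.
q0 = np.array([[1, 0], [0, 1]])
q1, names = augment_q_matrix(q0, ["fractions", "ratios"], {0: ["proportional reasoning"]})
print(names)   # ['fractions', 'ratios', 'proportional reasoning']
print(q1)
```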
... Questions were developed to specifically correspond to one or more of the guideline statements (Supplemental Table 1). Questions were formulated for each of the sections of the guideline including preoperative testing (questions 1-3), treatment of ureteral stones (questions 4-9), treatment of renal stones (questions 10-16), and treatment for all patients (questions 17-20). ...
... Despite the weaknesses noted in our study, LLMs have made significant strides in the medical field over the past few years. Particular success in the areas of clinical documentation assistance, medical literature review, diagnostic assistance, and within medical education should be applauded [10][11][12][13][15]. These advances have the potential to greatly improve the quality of clinical care and medical education. ...
Article
Full-text available
Purpose Artificial intelligence (AI) technology will inevitably permeate healthcare. Bing Chat is an AI chatbot with different conversation styles. We evaluated the answers from each of these response modes regarding management of nephrolithiasis. Methods A total of 20 questions were created based on the AUA Surgical Management of Stones guidelines. Bing Chat’s responses were evaluated across Precise, Balanced, and Creative conversation style chat modes by three physicians using the Brief DISCERN tool. Consensus scoring was employed to assess appropriateness, guideline adherence, empathy, recommendation for physician consultation, and inability to answer the inquiry. Responses were also assessed for their directness and the presence of superfluous information. Chat modes were compared using descriptive statistics as well as ANOVA, Chi-Squared tests, and Fisher exact tests. Results The median Brief DISCERN scores in Precise, Balanced, and Creative modes were 22, 21, and 21, respectively. There was no significant difference in Brief DISCERN scores between the three chat modes (p = 0.68). Guideline adherence by chatbot conversation style was similar (p = 0.37), as was response appropriateness (p = 0.62), directly answering the question asked (p = 0.26) and providing a recommendation to consult with a healthcare provider (p = 0.07). Creative and balanced modes outperformed precise mode when evaluating response empathy. Creative mode was more likely to include superfluous information and less likely to answer the question. Conclusion In its current iteration, Bing Chat provides low-quality urologic healthcare information for nephrolithiasis queries, regardless of the conversation style utilized.
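The score comparisons described here rely on standard tests; a minimal sketch with made-up scores (not the study's data) showing how such comparisons are typically run:

```python
from scipy import stats

# Hypothetical Brief DISCERN scores per conversation mode (placeholders only).
precise  = [22, 21, 23, 20, 22, 24, 21]
balanced = [21, 22, 20, 21, 23, 20, 22]
creative = [21, 20, 22, 19, 21, 23, 20]

# One-way ANOVA across the three modes, as in the study's score comparison.
f_stat, p_anova = stats.f_oneway(precise, balanced, creative)

# 2x2 example: guideline-adherent vs not, for two modes, via Fisher's exact test.
table = [[15, 5],
         [13, 7]]
odds_ratio, p_fisher = stats.fisher_exact(table)

print(f"ANOVA p={p_anova:.3f}; Fisher exact p={p_fisher:.3f}")
```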
... LLMs first became a topic of public interest with the publication of ChatGPT in November of 2022 [38]. Since then, they have been used successfully in a number of different domains [39][40][41][42] and a considerable amount of resources has been invested to improve this technology, supported by large corporations [43][44][45][46][47]. This led to significant scientific progress being made in a short amount of time. ...
... Now, with models like GPT4 [48] and Llama3 [44] at the forefront of these developments, many tasks that might have been considered difficult to automate a few years ago can be done to a satisfying degree by computers. This includes writing code [42] and creative writing [41]. In addition, supporting architectures are being developed, such as Retrieval-Augmented Generation (RAG) [49], further improving the quality of model outputs by providing them with the relevant context to answer specific questions, while reducing the likelihood of bad outcomes, such as hallucinations [50]. ...
Article
Full-text available
Alleviating high workloads for teachers is crucial for continuous, high-quality education. To evaluate if Large Language Models (LLMs) can alleviate this problem in the quantum computing domain, we conducted two complementary studies exploring the use of GPT-4 to automatically generate tips for students. (1) A between-subject survey in which students (N = 46) solved four multiple-choice quantum computing questions with either the help of expert-created or LLM-generated tips. To correct for possible biases, we additionally introduced two deception conditions. (2) Experienced educators and students (N = 23) directly compared the LLM-generated and expert-created tips. Our results show that the LLM-generated tips were significantly more helpful and pointed better towards relevant concepts while also giving away more of the answers. Furthermore, we found that participants in the first study performed significantly better in answering the quantum computing questions when given tips labeled as LLM-generated, even if they were expert-created. This points towards a placebo effect induced by the participants’ biases for LLM-generated content. Ultimately, we contribute that LLM-generated tips can be used instead of expert tips to support teaching of quantum computing basics.
... LLMs are foundation models built with billions or trillions of parameters, trained on extensive text data to recognize complex language patterns 29 . Recently, LLMs have garnered significant attention in the field of medicine for their capability to use natural language to generate coherent, contextually relevant responses. ...
Preprint
Full-text available
Sepsis is a dysregulated host response to infection with high mortality and morbidity. Early detection and intervention have been shown to improve patient outcomes, but existing computational models relying on structured electronic health record data often miss contextual information from unstructured clinical notes. This study introduces COMPOSER-LLM, an open-source large language model (LLM) integrated with the COMPOSER model to enhance early sepsis prediction. For high-uncertainty predictions, the LLM extracts additional context to assess sepsis-mimics, improving accuracy. Evaluated on 2,500 patient encounters, COMPOSER-LLM achieved a sensitivity of 72.1%, positive predictive value of 52.9%, F-1 score of 61.0%, and 0.0087 false alarms per patient hour, outperforming the standalone COMPOSER model. Prospective validation yielded similar results. Manual chart review found 62% of false positives had bacterial infections, demonstrating potential clinical utility. Our findings suggest that integrating LLMs with traditional models can enhance predictive performance by leveraging unstructured data, representing a significant advance in healthcare analytics.
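A schematic of the hybrid prediction flow described in this abstract, with stand-in callables and arbitrary thresholds; none of the names or values below come from the paper.

```python
def predict_with_llm_fallback(encounter, structured_model, llm_review,
                              low=0.4, high=0.6):
    """Route only uncertain structured-model predictions to an LLM reviewer.

    Sketch of the idea: the structured model scores every encounter, and the LLM
    is consulted only when the score falls in an uncertainty band, where it checks
    the clinical notes for sepsis-mimics. Thresholds and callables are illustrative.
    """
    risk = structured_model(encounter)          # probability of sepsis from EHR features
    if risk < low or risk > high:
        return risk > high                      # confident prediction, no LLM call
    # Uncertain case: ask the LLM whether the notes still support a sepsis call.
    return llm_review(encounter["notes"]) == "sepsis"

# Toy usage with stand-in callables.
flag = predict_with_llm_fallback(
    {"notes": "fever, tachycardia, suspected UTI"},
    structured_model=lambda enc: 0.55,
    llm_review=lambda notes: "sepsis")
print(flag)  # True
```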
... Therefore, even with the advent of more advanced models, it is reasonable to deduce that the multiagent collaboration model will continue to offer significant advantages in solving complex medical tasks. Although existing benchmarks offered comprehensive evaluation of LLMs' medical knowledge, challenges remain in assessing their application in clinical scenarios 1,37 . Our study tried to address this issue by obtaining a collection of standardized rare disease data, manually curated by professional physicians to meet the needs of clinical practice. ...
Article
Full-text available
Large Language Models (LLMs) show promise in healthcare tasks but face challenges in complex medical scenarios. We developed a Multi-Agent Conversation (MAC) framework for disease diagnosis, inspired by clinical Multi-Disciplinary Team discussions. Using 302 rare disease cases, we evaluated GPT-3.5, GPT-4, and MAC on medical knowledge and clinical reasoning. MAC outperformed single models in both primary and follow-up consultations, achieving higher accuracy in diagnoses and suggested tests. Optimal performance was achieved with four doctor agents and a supervisor agent, using GPT-4 as the base model. MAC demonstrated high consistency across repeated runs. Further comparative analysis showed MAC also outperformed other methods including Chain of Thoughts (CoT), Self-Refine, and Self-Consistency with higher performance and more output tokens. This framework significantly enhanced LLMs’ diagnostic capabilities, effectively bridging theoretical knowledge and practical clinical application. Our findings highlight the potential of multi-agent LLMs in healthcare and suggest further research into their clinical implementation.
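A toy outline of a multi-agent consultation loop of the kind described here; `ask(role, prompt)` stands in for a chat-model call, and the roles, prompts, and aggregation are illustrative rather than the MAC framework's actual design.

```python
def multi_agent_diagnosis(case_text, ask, n_doctors=4, rounds=2):
    """Toy multi-agent consultation: doctor agents discuss, a supervisor concludes."""
    transcript = [f"Case: {case_text}"]
    for _ in range(rounds):
        for i in range(n_doctors):
            # Each 'doctor' sees the running discussion and adds an opinion.
            opinion = ask(f"doctor_{i}", "\n".join(transcript))
            transcript.append(f"doctor_{i}: {opinion}")
    # The supervisor agent synthesises the final diagnosis from the discussion.
    return ask("supervisor", "\n".join(transcript) + "\nGive the final diagnosis.")

# Usage with a stand-in LLM callable.
final = multi_agent_diagnosis(
    "3-year-old with recurrent fractures and blue sclerae",
    ask=lambda role, prompt: "osteogenesis imperfecta")
print(final)
```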
... Modern NLP has undergone significant advancements across various domains in recent years [2,9,29,42,57,61,74]. Notable examples include healthcare diagnostics [60], sentiment analysis [71], and machine translation [70]. These breakthroughs are driven by large language models (LLMs), which are pre-trained on extensive public text corpora [2,7,9,33,42]. ...
Preprint
Fine-tuning plays a crucial role in enabling pre-trained LLMs to evolve from general language comprehension to task-specific expertise. To preserve user data privacy, federated fine-tuning is often employed and has emerged as the de facto paradigm. However, federated fine-tuning is prohibitively inefficient due to the tension between LLM complexity and the resource constraints of end devices, incurring unaffordable fine-tuning overhead. Existing literature primarily utilizes parameter-efficient fine-tuning techniques to mitigate communication costs, yet computational and memory burdens continue to pose significant challenges for developers. This work proposes DropPEFT, an innovative federated PEFT framework that employs a novel stochastic transformer layer dropout method, enabling devices to deactivate a considerable fraction of LLM layers during training, thereby eliminating the associated computational load and memory footprint. In DropPEFT, a key challenge is the proper configuration of dropout ratios for layers, as overhead and training performance are highly sensitive to this setting. To address this challenge, we adaptively assign optimal dropout-ratio configurations to devices through an exploration-exploitation strategy, achieving efficient and effective fine-tuning. Extensive experiments show that DropPEFT can achieve a 1.3-6.3× speedup in model convergence and a 40%-67% reduction in memory footprint compared to state-of-the-art methods.
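A minimal PyTorch sketch of stochastic transformer-layer dropout, the core mechanism described in this abstract; the module below illustrates the idea rather than DropPEFT itself, and in practice the dropout ratio would be assigned adaptively per device.

```python
import random
import torch
from torch import nn

class LayerDropEncoder(nn.Module):
    """Transformer stack that randomly skips whole layers during training.

    Skipping a layer removes its forward/backward compute and activation memory
    on a resource-constrained device (illustrative of the layer-dropout idea).
    """
    def __init__(self, d_model=64, nhead=4, num_layers=6, drop_ratio=0.5):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers))
        self.drop_ratio = drop_ratio

    def forward(self, x):
        for layer in self.layers:
            if self.training and random.random() < self.drop_ratio:
                continue            # deactivated layer: identity shortcut
            x = layer(x)
        return x

model = LayerDropEncoder()
out = model(torch.randn(2, 10, 64))   # (batch, seq, d_model)
print(out.shape)
```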
... Large language models (LLMs) are probabilistic models that have been trained to predict the next token in a text sequence (Naveed et al., 2024). High capabilities, ease of use, and commercial promotion have caused widespread adoption of these systems in various fields, such as medicine (Thirunavukarasu et al., 2023), science, and education (Valova et al., 2024). Countries including South Korea (GEM Report, 2025) and ...
Preprint
Full-text available
In this study, we examined whether a short-form AI literacy intervention could reduce the adoption of incorrect recommendations from large language models. High school seniors were randomly assigned to either a control or an intervention group, which received an educational text explaining ChatGPT's working mechanism, limitations, and proper use. Participants solved math puzzles with the help of ChatGPT's recommendations, which were incorrect in half of the cases. Results showed that students adopted incorrect suggestions 52.1% of the time, indicating widespread over-reliance. The educational intervention did not significantly reduce over-reliance. Instead, it led to an increase in ignoring ChatGPT's correct recommendations. We conclude that the usage of ChatGPT is associated with over-reliance and it is not trivial to increase AI literacy to counter over-reliance.
... This is the problem we believe could be solved by LLMs. ... discovery, clinical diagnosis, treatment recommendations, outcome predictions, etc. (Ji et al., 2021; Lee et al., 2020; Thirunavukarasu et al., 2023). Unfortunately, the problem of hallucination is a major concern with LLM use in the field of text generation (Farquhar et al., 2024). ...
Preprint
Full-text available
IAN is an R package that addresses the challenge of integrating, analyzing and interpreting high-throughput "omics" data, using a multi-agent artificial intelligence (AI) system. IAN leverages popular pathway and regulatory datasets (KEGG, WikiPathways, Reactome, GO, ChEA) and the STRING database for protein-protein interactions to perform standard enrichment analysis. The individual enrichment results are then used to generate insightful summaries, for each of the datasets, using a large language model (LLM) through a multi-agent architecture. These summaries are then contextually integrated and interpreted by the LLM, guided by carefully engineered prompts and grounding instructions, to provide insightful explanations, a system overview, key regulators, novel observations, etc. We demonstrate IAN's potential to facilitate biological discovery from complex omics data, by reanalyzing two previously published datasets and evaluating the results. We also show remarkable performance of IAN, in terms of avoiding hallucination. The IAN package, along with installation instructions and example usage, is available at https://github.com/NIH-NEI/IAN.
... In recent years, large language models (LLMs), empowered by massive text corpora and deep learning techniques, have demonstrated breakthrough advancements in cross-domain knowledge transfer and human-machine dialogue interactions [1]. Within the healthcare domain, LLMs are increasingly deployed across nine core application scenarios, including intelligent diagnosis, personalized treatment, and drug discovery, garnering significant attention from both academia and industry [2,3]. A particularly important area of focus is the development and evaluation of Chinese medical LLMs, which face unique challenges due to the specialized nature of medical knowledge and the high-stakes implications of clinical decision-making. ...
Preprint
The evaluation and improvement of medical large language models (LLMs) are critical for their real-world deployment, particularly in ensuring accuracy, safety, and ethical alignment. Existing frameworks inadequately dissect domain-specific error patterns or address cross-modal challenges. This study introduces a granular error taxonomy through systematic analysis of the top 10 models on MedBench, categorizing incorrect responses into eight types: Omissions, Hallucination, Format Mismatch, Causal Reasoning Deficiency, Contextual Inconsistency, Unanswered, Output Error, and Deficiency in Medical Language Generation. Evaluation of 10 leading models reveals vulnerabilities: despite achieving 0.86 accuracy in medical knowledge recall, critical reasoning tasks show 96.3% omission, while safety ethics evaluations expose alarming inconsistency (robustness score: 0.79) under option shuffling. Our analysis uncovers systemic weaknesses in knowledge boundary enforcement and multi-step reasoning. To address these, we propose a tiered optimization strategy spanning four levels, from prompt engineering and knowledge-augmented retrieval to hybrid neuro-symbolic architectures and causal reasoning frameworks. This work establishes an actionable roadmap for developing clinically robust LLMs while redefining evaluation paradigms through error-driven insights, ultimately advancing the safety and trustworthiness of AI in high-stakes medical environments.
... Medical data, by nature, are inherently multimodal, encompassing diverse formats such as X-ray images, physiological time series (e.g., ECG signals), and textual reports. These modalities collectively form the foundation for clinical decision-making, driving the advancement of MLLMs in medical applications (He et al., 2025;Li et al., 2023a;Thirunavukarasu et al., 2023;Kim et al., 2024;Moor et al., 2023;Liu et al., 2023;Fu et al., 2024;Zhu et al., 2024). Different from these works, our focus is on empowering MLLMs with the grounded ECG understanding capability. ...
Preprint
While recent multimodal large language models (MLLMs) have advanced automated ECG interpretation, they still face two key limitations: (1) insufficient multimodal synergy between time series signals and visual ECG representations, and (2) limited explainability in linking diagnoses to granular waveform evidence. We introduce GEM, the first MLLM unifying ECG time series, 12-lead ECG images and text for grounded and clinician-aligned ECG interpretation. GEM enables feature-grounded analysis, evidence-driven reasoning, and a clinician-like diagnostic process through three core innovations: a dual-encoder framework extracting complementary time series and image features, cross-modal alignment for effective multimodal understanding, and knowledge-guided instruction generation for generating high-granularity grounding data (ECG-Grounding) linking diagnoses to measurable parameters (e.g., QRS/PR Intervals). Additionally, we propose the Grounded ECG Understanding task, a clinically motivated benchmark designed to comprehensively assess the MLLM's capability in grounded ECG understanding. Experimental results on both existing and our proposed benchmarks show GEM significantly improves predictive performance (CSN 7.4% ↑), explainability (22.7% ↑), and grounding (24.8% ↑), making it more suitable for real-world clinical applications. GitHub repository: https://github.com/lanxiang1017/GEM.git
... These models, built on transformer architectures and trained on extensive datasets, have shown remarkable abilities in medical text summarization, clinical documentation, decision support, and patient communication. 4,48,49 Despite these advancements, challenges remain in aligning their design and deployment with healthcare needs. 1,5 Learning from healthcare's historical challenges with IT system adoption, where insufficient clinical input led to significant workflow inefficiencies [1], the critical importance of proactive clinician involvement in LLM development has become increasingly apparent. ...
Preprint
Full-text available
Large language models (LLMs) hold promise for transforming healthcare, from streamlining administrative and clinical workflows to enriching patient engagement and advancing clinical decision-making. However, their successful integration requires rigorous development, adaptation, and evaluation strategies tailored to clinical needs. In this Review, we highlight recent advancements, explore emerging opportunities for LLM-driven innovation, and propose a framework for their responsible implementation in healthcare settings. We examine strategies for adapting LLMs to domain-specific healthcare tasks, such as fine-tuning, prompt engineering, and multimodal integration with electronic health records. We also summarize various evaluation metrics tailored to healthcare, addressing clinical accuracy, fairness, robustness, and patient outcomes. Furthermore, we discuss the challenges associated with deploying LLMs in healthcare (including data privacy, bias mitigation, regulatory compliance, and computational sustainability) and underscore the need for interdisciplinary collaboration. Finally, these challenges present promising future research directions for advancing LLM implementation in clinical settings and healthcare.
... LLMs have emerged as powerful tools in natural language processing (NLP), demonstrating state-of-the-art capabilities in understanding and generating human-like text (Wei et al. 2022; Zhao et al. 2024). These models become proficient due to their training on massive amounts of text data, which enables them to capture intricate linguistic patterns and semantic nuances (Thirunavukarasu et al. 2023; Kasneci et al. 2023). LLMs are typically built on transformer-based architectures, such as OpenAI's GPT (Generative Pre-trained Transformer) series (Brown et al. 2020), which leverage the self-attention mechanism to process input sequences and generate output sequences with remarkable fluency and coherence while retaining semantic knowledge between word types (Kalyan 2023). ... machine translation. ...
Article
Full-text available
The advent of large language models (LLMs) has marked a new era in the transformation of computational social science (CSS). This paper dives into the role of LLMs in CSS, particularly exploring their potential to revolutionize data analysis and content generation and contribute to a broader understanding of social phenomena. We begin by discussing the applications of LLMs in various computational problems in social science including sentiment analysis, hate speech detection, stance and humor detection, misinformation detection, event understanding, and social network analysis, illustrating their capacity to generate nuanced insights into human behavior and societal trends. Furthermore, we explore the innovative use of LLMs in generating social media content. We also discuss the various ethical, technical, and legal issues these applications pose, and considerations required for responsible LLM usage. We further present the challenges associated with data bias, privacy, and the integration of these models into existing research frameworks. This paper aims to provide a solid background on the potential of LLMs in CSS, their past applications, current problems, and how they can pave the way for revolutionizing CSS.
... With their remarkable adaptability, general-purpose AI models for text and vision excel at extracting meaningful features for a wide range of deep-learning tasks, such as sentiment analysis, question answering and document summarisation for language models [30], or object detection, image classification and segmentation for vision models [8]. However, when applied to domain-specific tasks, such as the medical domain, these models often fall short [11,24,20]. This performance gap requires either the development of dedicated domain-specific models, or the application of knowledge transfer techniques to adapt general-purpose models to the desired domain [11]. ...
Preprint
Full-text available
General-purpose AI models, particularly those designed for text and vision, demonstrate impressive versatility across a wide range of deep-learning tasks. However, they often underperform in specialised domains like medical imaging, where domain-specific solutions or alternative knowledge transfer approaches are typically required. Recent studies have noted that general-purpose models can exhibit similar latent spaces when processing semantically related data, although this alignment does not occur naturally. Building on this insight, it has been shown that applying a simple transformation - at most affine - estimated from a subset of semantically corresponding samples, known as anchors, enables model stitching across diverse training paradigms, architectures, and modalities. In this paper, we explore how semantic alignment - estimating transformations between anchors - can bridge general-purpose AI with specialised medical knowledge. Using multiple public chest X-ray datasets, we demonstrate that model stitching across model architectures allows general models to integrate domain-specific knowledge without additional training, leading to improved performance on medical tasks. Furthermore, we introduce a novel zero-shot classification approach for unimodal vision encoders that leverages semantic alignment across modalities. Our results show that our method not only outperforms general multimodal models but also approaches the performance levels of fully trained, medical-specific multimodal solutions.
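The anchor-based alignment step can be sketched as a linear least-squares fit between two latent spaces; the following is a generic illustration under assumed shapes, not the paper's estimation procedure.

```python
import numpy as np

def fit_affine_alignment(anchors_src, anchors_dst):
    """Estimate an affine map from one model's latent space to another's.

    Given embeddings of the same 'anchor' samples from two encoders, solve a
    least-squares problem for a matrix W and bias b such that W x + b ~ y.
    """
    x = np.hstack([anchors_src, np.ones((len(anchors_src), 1))])  # append bias term
    coef, *_ = np.linalg.lstsq(x, anchors_dst, rcond=None)
    w, b = coef[:-1], coef[-1]
    return lambda z: z @ w + b

# Toy check: recover a known affine map from 50 anchor pairs (8-d to 6-d spaces).
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 8))
true_w, true_b = rng.normal(size=(8, 6)), rng.normal(size=6)
dst = src @ true_w + true_b
align = fit_affine_alignment(src, dst)
print(np.allclose(align(src), dst, atol=1e-6))   # True
```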
... LLMs are a paradigm-shifting technology that has changed many aspects of daily life and research domains. For example, they can create short movies [4], provide medical suggestions [5], analyze financial statements [6], write code [7], and design new materials [8,9,10]. As their name indicates, LLMs are characterized by their tremendous number of parameters, currently as large as several trillion. ...
Preprint
Full-text available
Quantum language models are the alternative to classical language models, which borrow concepts and methods from quantum machine learning and computational linguistics. While several quantum natural language processing (QNLP) methods and frameworks exist for text classification and generation, there is a lack of systematic study to compare the performance across various ansätze, in terms of their hyperparameters and classical and quantum methods to implement them. Here, we evaluate the performance of quantum natural language processing models based on these ansätze at different levels in text classification tasks. We perform a comparative study and optimize the QNLP models by fine-tuning several critical hyperparameters. Our results demonstrate how the balance between simplification and expressivity affects model performance. This study provides extensive data to improve our understanding of QNLP models and opens the possibility of developing better QNLP algorithms.
... Hallucination refers to instances in which VLMs generate outputs that are seemingly plausible but fundamentally incorrect [17]. As Fig. 1 illustrates, in medical contexts, such inaccuracies can easily lead to misdiagnoses and improper treatment decisions, posing serious risks to patient treatment [19]. ...
Preprint
The increasing use of vision-language models (VLMs) in healthcare applications presents great challenges related to hallucinations, in which the models may generate seemingly plausible results that are in fact incorrect. Such hallucinations can jeopardize clinical decision making, potentially harming the diagnosis and treatments. In this work, we propose MedHallTune, a large-scale benchmark designed specifically to evaluate and mitigate hallucinations in medical VLMs. Comprising over 100,000 images and 1,000,000 instruction pairs, MedHallTune includes both hallucination and non-hallucination samples, each with ground-truth annotations. We conduct a comprehensive evaluation of current medical and general VLMs using MedHallTune, assessing their performance across key metrics, including clinical accuracy, relevance, detail level, and risk level. The experimental results show that fine-tuning with MedHallTune successfully improves the ability of several existing models to manage hallucinations and boost their zero-shot performance on downstream visual-question-answering (VQA) tasks, making them more reliable for practical medical applications. Our work contributes to the development of more trustworthy VLMs. Codes and dataset will be available at https://github.com/russellyq/MedHallTune.
... Deep Neural Networks (DNNs) have achieved remarkable performance in multiple areas, such as Medical Imaging [1], [2], Natural Language Processing [3], [4], and Active Speaker Detection [5]-[7]. This accomplishment led to the wide adoption of Artificial Intelligence in the daily lives of many people, either in work or leisure scenarios, increasing the attractiveness and susceptibility of DNNs to attackers. ...
Preprint
Full-text available
State-of-the-art defense mechanisms are typically evaluated in the context of white-box attacks, which is not realistic, as it assumes the attacker can access the gradients of the target network. To protect against this scenario, Adversarial Training (AT) and Adversarial Distillation (AD) include adversarial examples during the training phase, and Adversarial Purification uses a generative model to reconstruct all the images given to the classifier. This paper considers an even more realistic evaluation scenario: gray-box attacks, which assume that the attacker knows the architecture and the dataset used to train the target network, but cannot access its gradients. We provide empirical evidence that models are vulnerable to gray-box attacks and propose LISArD, a defense mechanism that does not increase computational and temporal costs but provides robustness against gray-box and white-box attacks without including AT. Our method approximates a cross-correlation matrix, created with the embeddings of perturbed and clean images, to a diagonal matrix while simultaneously conducting classification learning. Our results show that LISArD can effectively protect against gray-box attacks, can be used in multiple architectures, and carries over its resilience to the white-box scenario. Also, state-of-the-art AD models underperform greatly when removing AT and/or moving to gray-box settings, highlighting the lack of robustness from existing approaches to perform in various conditions (aside from white-box settings). All the source code is available at https://github.com/Joana-Cabral/LISArD.
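A compact sketch of the embedding cross-correlation objective described above (pushing the matrix of clean/perturbed embeddings toward a diagonal, identity-like form), written in the spirit of a Barlow Twins-style loss; LISArD's exact normalisation, weighting, and training loop are not reproduced here.

```python
import torch

def decorrelation_loss(clean_emb: torch.Tensor, perturbed_emb: torch.Tensor) -> torch.Tensor:
    """Push the clean/perturbed cross-correlation matrix towards the identity."""
    # Standardise each embedding dimension across the batch.
    z1 = (clean_emb - clean_emb.mean(0)) / (clean_emb.std(0) + 1e-6)
    z2 = (perturbed_emb - perturbed_emb.mean(0)) / (perturbed_emb.std(0) + 1e-6)
    c = (z1.T @ z2) / z1.shape[0]                               # cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()              # diagonal terms -> 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum() # off-diagonal terms -> 0
    return on_diag + off_diag

# Toy usage with random embeddings (batch of 32, 128-d features).
loss = decorrelation_loss(torch.randn(32, 128), torch.randn(32, 128))
print(loss.item())
```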
... Various sectors, including healthcare [13], finance [14], travel and hospitality [15], education [16] and marketing [17] have already adopted chat interfaces to enhance user experiences and streamline operations. This widespread adoption is indicative of their utility and potential to improve user interaction in the SSA domain. ...
Preprint
Full-text available
Space Situational Awareness (SSA) focuses on the observation and monitoring of satellites primarily by earth-based sensors. An important component of SSA is the detection of satellite manoeuvres. Given the large amount of data produced by manoeuvre detection systems, the interpretation of this data by end users can be challenging. In recent years, the ability of conversational agents to facilitate user interaction with large datasets has increased dramatically. They are now widely deployed across domains such as healthcare, education, government service centres, and retail. To enhance the ability of users to best interact with manoeuvre detection systems, we have developed SatChat: an intelligent conversational agent for querying the results of satellite manoeuvre detection. SatChat is a text-based chat interface that allows users to query the results of a manoeuvre detection system using natural language. The underlying models are open source. Experimental evaluations demonstrate SatChat’s accuracy (88.24%) and its ability to respond to complex queries. Moreover, our system can handle sensitive and confidential data without breaching privacy. This paper presents the results of deploying SatChat in conjunction with a particle filter based manoeuvre detection system.
... Regulatory agencies, such as the U.S. Food and Drug Administration (FDA), are working to address AI-related challenges in medical imaging. However, there are currently no specific guidelines for using ChatGPT for educational purposes or supplementary diagnostics [22,32,45]. ...
Article
Full-text available
Since the development of Artificial Intelligence, there has been a continuous effort to utilize it in the medical field. Medical imaging was one of the first areas where Artificial Intelligence was put into practice. ChatGPT is a large language model that operates using an AI algorithm. It is accessible and affordable. This study investigates the capabilities and limitations of ChatGPT in diagnosing long bone fractures. A total of 200 X-ray images of upper and lower limb fractures were evaluated separately by two independent orthopaedic surgeons and ChatGPT. While ChatGPT's image analysis showed some promise, our study revealed multiple critical concerns regarding its diagnostic accuracy. We noted that integrating patient history and clinical signs would significantly enhance the accuracy of ChatGPT's diagnoses. Our study results do not support the use of current ChatGPT models for interpreting or diagnosing upper and lower limb fractures based on radiologic images. However, with the rapid advancement of large language models and anticipated future improvements in ChatGPT, accurate fracture diagnosis using this model may become achievable.
Preprint
Full-text available
This paper explores the integration of artificial intelligence (AI) and cognitive behavioral therapy (CBT) to address anxiety disorders. AI-driven platforms such as LyssnCBT and the eXtended-reality Artificial Intelligence Assistant (XAIA) have introduced innovative approaches to mental health care by offering automated feedback to therapists and virtual therapy avatars for users. The paper highlights the benefits, challenges, and ethical considerations of AI-enhanced CBT and introduces a practical AI-based CBT chatbot. This chatbot enhances accessibility, scalability, and personalized therapy for users while offering innovative features like mood tracking and session history.
Article
Background Education and enhancing the knowledge of adolescents who will undergo kidney transplantation are among the primary objectives of their care. While there are specific interventions in place to achieve this, they require extensive resources. The rise of large language models like ChatGPT‐3.5 offers potential assistance for providing information to patients. This study aimed to evaluate the accuracy, relevance, and safety of ChatGPT‐3.5's responses to patient‐centered questions about pediatric kidney transplantation. The objective was to assess whether ChatGPT‐3.5 could be a supplementary educational tool for adolescents and their caregivers in a complex medical context. Methods A total of 37 questions about kidney transplantation were presented to ChatGPT‐3.5, which was prompted to respond as a health professional would to a layperson. Five pediatric nephrologists independently evaluated the outputs for accuracy, relevance, comprehensiveness, understandability, readability, and safety. Results The mean accuracy, relevancy, and comprehensiveness scores for all outputs were 4.51, 4.56, and 4.55, respectively. Out of 37 outputs, four were rated as completely accurate, and seven were completely relevant and comprehensive. Only one output had an accuracy, relevancy, and comprehensiveness score below 4. Twelve outputs were considered potentially risky, but only three had a risk grade of moderate or higher. Outputs that were considered risky had an accuracy and relevancy below the average. Conclusion Our findings suggest that ChatGPT could be a useful tool for adolescents or caregivers of individuals waiting for kidney transplantation. However, the presence of potentially risky outputs underscores the necessity for human oversight and validation.
Article
Aim of the Study Artificial intelligence (AI) tools such as large language models (LLMs) are potential sources of information on hypothermic cardiac arrest (HCA). The aim of our study was to determine whether, for patients with HCA, LLMs provide information consistent with expert consensus on criteria that would usually contraindicate extracorporeal cardiopulmonary resuscitation (eCPR) in patients with normothermic cardiac arrest (NCA), but not HCA. Methods Based on Extracorporeal Life Support Organization guidelines, selected factors were identified that may be contraindications to eCPR in NCA but not in HCA. Four questions were created and entered into AI software (GPT-3.5 turbo, GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, Claude 3 Haiku, Mistral Large, Mistral Small, Gemini Pro and Gemini Flash). The responses obtained and citations returned were assessed by an international panel of experts for consistency with current knowledge. Results Complete agreement of responses with expert consensus was obtained for 5/10 AI tools. In total, all AI tools presented 101 items in the literature. No reference was rated as “correct”; 45 citations (45%) “existed but did not answer the question”; and 56 citations (55%) were considered “hallucinatory”. Conclusion Use of artificial intelligence in decision-making for extracorporeal cardiopulmonary resuscitation in patients with hypothermic cardiac arrest risks unjustifiably withdrawing treatment from patients who have a chance of survival with a good neurological outcome. Large language models should not be used as the only tool for decision-making.
Article
Introduction Large language models (LLMs) are becoming ubiquitous and widely implemented. LLMs could also be used for diagnosis and treatment. National antibiotic prescribing guidelines are customized and informed by local laboratory data on antimicrobial resistance. Methods Based on 24 vignettes with information on type of infection, gender, age group and comorbidities, GPs and LLMs were prompted to provide a treatment. Four countries (Ireland, UK, USA and Norway) were included and a GP from each country and six LLMs (ChatGPT, Gemini, Copilot, Mistral AI, Claude and Llama 3.1) were provided with the vignettes, including their location (country). Responses were compared with the country’s national prescribing guidelines. In addition, limitations of LLMs such as hallucination, toxicity and data leakage were assessed. Results GPs’ answers to the vignettes showed high accuracy in relation to diagnosis (96%–100%) and yes/no antibiotic prescribing (83%–92%). GPs referenced (100%) and prescribed (58%–92%) according to national guidelines, but dose/duration of treatment was less accurate (50%–75%). Overall, the GPs’ accuracy had a mean of 74%. LLMs scored high in relation to diagnosis (92%–100%), antibiotic prescribing (88%–100%) and the choice of antibiotic (59%–100%) but correct referencing often failed (38%–96%), in particular for the Norwegian guidelines (0%–13%). Data leakage was shown to be an issue as personal information was repeated in the models’ responses to the vignettes. Conclusions LLMs may be safe to guide antibiotic prescribing in general practice. However, to interpret vignettes, apply national guidelines and prescribe the right dose and duration, GPs remain best placed.
Article
Large Language Models (LLMs) like ChatGPT, Llama and Claude are transforming healthcare by interpreting complex text, extracting information, and providing guideline-based support. Radiology, with its high patient volume and digital workflows, is an ideal field for LLM integration. Assessment of the potential of LLMs to enhance efficiency, standardization, and decision support in radiology, while addressing ethical and regulatory challenges. Pilot studies at Freiburg and Basel university hospitals evaluated local LLM systems for tasks like prior report summarization and guideline-driven reporting. Integration with Picture Archiving and Communication System (PACS) and Electronic Health Record (EHR) systems was achieved via Digital Imaging and Communications in Medicine (DICOM) and Fast Healthcare Interoperability Resources (FHIR) standards. Metrics included time savings, compliance with the European Union (EU) Artificial Intelligence (AI) Act, and user acceptance. LLMs demonstrate significant potential as a support tool for radiologists in clinical practice by reducing reporting times, automating routine tasks, and ensuring consistent, high-quality results. They also support interdisciplinary workflows (e.g., tumor boards) and meet data protection requirements when locally implemented. Local LLM systems are feasible and beneficial in radiology, enhancing efficiency and diagnostic quality. Future work should refine transparency, expand applications, and ensure LLMs complement medical expertise while adhering to ethical and legal standards.
Preprint
Full-text available
Large Language Models (LLMs) have shown impressive performance on existing medical question-answering benchmarks. This high performance makes it increasingly difficult to meaningfully evaluate and differentiate advanced methods. We present MedAgentsBench, a benchmark that focuses on challenging medical questions requiring multi-step clinical reasoning, diagnosis formulation, and treatment planning-scenarios where current models still struggle despite their strong performance on standard tests. Drawing from seven established medical datasets, our benchmark addresses three key limitations in existing evaluations: (1) the prevalence of straightforward questions where even base models achieve high performance, (2) inconsistent sampling and evaluation protocols across studies, and (3) lack of systematic analysis of the interplay between performance, cost, and inference time. Through experiments with various base models and reasoning methods, we demonstrate that the latest thinking models, DeepSeek R1 and OpenAI o3, exhibit exceptional performance in complex medical reasoning tasks. Additionally, advanced search-based agent methods offer promising performance-to-cost ratios compared to traditional approaches. Our analysis reveals substantial performance gaps between model families on complex questions and identifies optimal model selections for different computational constraints. Our benchmark and evaluation framework are publicly available at https://github.com/gersteinlab/medagents-benchmark.
Preprint
Full-text available
Large language models (LLMs) are increasingly applied to outpatient referral tasks across healthcare systems. However, there is a lack of standardized evaluation criteria to assess their effectiveness, particularly in dynamic, interactive scenarios. In this study, we systematically examine the capabilities and limitations of LLMs in managing tasks within Intelligent Outpatient Referral (IOR) systems and propose a comprehensive evaluation framework specifically designed for such systems. This framework comprises two core tasks: static evaluation, which focuses on evaluating the ability of predefined outpatient referrals, and dynamic evaluation, which evaluates capabilities of refining outpatient referral recommendations through iterative dialogues. Our findings suggest that LLMs offer limited advantages over BERT-like models, but show promise in asking effective questions during interactive dialogues.
Preprint
Full-text available
Mixture of large language model (LLMs) Agents (MoA) architectures achieve state-of-the-art performance on prominent benchmarks like AlpacaEval 2.0 by leveraging the collaboration of multiple LLMs at inference time. Despite these successes, an evaluation of the safety and reliability of MoA is missing. We present the first comprehensive study of MoA's robustness against deceptive LLM agents that deliberately provide misleading responses. We examine factors like the propagation of deceptive information, model size, and information availability, and uncover critical vulnerabilities. On AlpacaEval 2.0, the popular LLaMA 3.1-70B model achieves a length-controlled Win Rate (LC WR) of 49.2% when coupled with 3-layer MoA (6 LLM agents). However, we demonstrate that introducing only a single carefully-instructed deceptive agent into the MoA can reduce performance to 37.9%, effectively nullifying all MoA gains. On QuALITY, a multiple-choice comprehension task, the impact is also severe, with accuracy plummeting by a staggering 48.5%. Inspired in part by the historical Doge of Venice voting process, designed to minimize influence and deception, we propose a range of unsupervised defense mechanisms that recover most of the lost performance.
Article
Full-text available
Background: For patients with drug-resistant focal epilepsy (DRE), surgical resection of the epileptogenic zone (EZ) is an effective treatment to control seizures. Accurate localization of the EZ is crucial and is typically achieved through comprehensive presurgical approaches such as seizure semiology interpretation, electroencephalography (EEG), magnetic resonance imaging (MRI), and intracranial EEG (iEEG). However, interpreting seizure semiology poses challenges because it relies heavily on expert knowledge and is often based on inconsistent and incoherent descriptions, leading to variability and potential limitations in presurgical evaluation. To overcome these challenges, advanced technologies like large language models (LLMs)—with ChatGPT being a notable example—offer valuable tools for analyzing complex textual information, making them well-suited to interpret detailed seizure semiology descriptions and assist in accurately localizing the EZ.
Article
Full-text available
Red teaming, the practice of adversarially exposing unexpected or undesired model behaviors, is critical towards improving equity and accuracy of large language models, but non-model creator-affiliated red teaming is scant in healthcare. We convened teams of clinicians, medical and engineering students, and technical professionals (80 participants total) to stress-test models with real-world clinical cases and categorize inappropriate responses along axes of safety, privacy, hallucinations/accuracy, and bias. Six medically-trained reviewers re-analyzed prompt-response pairs and added qualitative annotations. Of 376 unique prompts (1504 responses), 20.1% were inappropriate (GPT-3.5: 25.8%; GPT-4.0: 16%; GPT-4.0 with Internet: 17.8%). Subsequently, we show the utility of our benchmark by testing GPT-4o, a model released after our event (20.4% inappropriate). 21.5% of responses appropriate with GPT-3.5 were inappropriate in updated models. We share insights for constructing red teaming prompts, and present our benchmark for iterative model assessments.
Article
Optimizing the deployment of large language models (LLMs) in edge computing environments is critical for enhancing privacy and computational efficiency. In the path toward efficient wireless LLM inference in edge computing, this study comprehensively analyzes the impact of different splitting points in mainstream open-source LLMs. Accordingly, this study introduces a framework taking inspiration from model-based reinforcement learning to determine the optimal splitting point across the edge and user equipment. By incorporating a reward surrogate model, our approach significantly reduces the computational cost of frequent performance evaluations. Extensive simulations demonstrate that this method effectively balances inference performance and computational load under varying network conditions, providing a robust solution for LLM deployment in decentralized settings.
Article
Brain stimulation has become a widely accepted treatment for neurological disorders such as epilepsy and Parkinson’s disease. These devices not only deliver therapeutic stimulation but also record brain activity, offering valuable insights into neural dynamics. However, brain recordings during stimulation are often blanked or contaminated by artifact, posing significant challenges for analyzing the acute effects of stimulation. To address these challenges, we propose a transformer-based model, Stim-BERT, trained on a large intracranial EEG (iEEG) dataset to reconstruct brain activity lost during stimulation blanking. To train the Stim-BERT model, 4,653,720 iEEG channels from 380 RNS system patients were tokenized into 3 (or 4) frequency band bins using 1 s non-overlapping windows resulting in a total vocabulary size of 1,000 (or 10,000). Stim-BERT leverages self-supervised learning with masked tokens, inspired by BERT’s success in natural language processing, and shows significant improvements over traditional interpolation methods, especially for longer blanking periods. These findings highlight the potential of transformer models for filling in missing time-series neural data, advancing neural signal processing and our efforts to understand the acute effects of brain stimulation.
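The banded-power tokenisation described above can be pictured with the sketch below; the band edges, quantiser, and sampling rate are placeholders rather than the Stim-BERT settings.

```python
import numpy as np
from scipy.signal import welch

# Frequency bands used for illustration (the paper's exact bands are not listed here).
BANDS = [(4, 8), (8, 30), (30, 100)]   # roughly theta, alpha/beta, gamma, in Hz
N_LEVELS = 10                          # 10 levels per band -> 10**3 = 1000 tokens

def tokenize_window(segment: np.ndarray, fs: float) -> int:
    """Map one 1-second iEEG window to a single vocabulary token.

    Compute band power, quantise each band into N_LEVELS bins, and combine the
    bin indices into one integer token (vocabulary size N_LEVELS ** len(BANDS)).
    The quantiser below is an arbitrary placeholder.
    """
    freqs, psd = welch(segment, fs=fs, nperseg=len(segment))
    token = 0
    for lo, hi in BANDS:
        power = psd[(freqs >= lo) & (freqs < hi)].sum()
        level = min(int(np.log1p(power)), N_LEVELS - 1)   # crude placeholder quantiser
        token = token * N_LEVELS + level
    return token

fs = 250.0
segment = np.random.default_rng(0).normal(size=int(fs))   # one second of fake iEEG
print(tokenize_window(segment, fs))                        # an integer in [0, 999]
```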
Article
Background Sentiment analysis of alternative tobacco products discussed on social media is crucial in tobacco control research. Large language models (LLMs) are artificial intelligence models that were trained on extensive text data to emulate the linguistic patterns of humans. LLMs may hold the potential to streamline the time-consuming and labor-intensive process of human sentiment analysis. Objective This study aimed to examine the accuracy of LLMs in replicating human sentiment evaluation of social media messages relevant to heated tobacco products (HTPs). Methods GPT-3.5 and GPT-4 Turbo (OpenAI) were used to classify 500 Facebook (Meta Platforms) and 500 Twitter (subsequently rebranded X) messages. Each set consisted of 200 human-labeled anti-HTP, 200 pro-HTP, and 100 neutral messages. The models evaluated each message up to 20 times to generate multiple response instances reporting its classification decisions. The majority label from these responses was assigned as the model's decision for the message. The models' classification decisions were then compared with those of human evaluators. Results GPT-3.5 accurately replicated human sentiment evaluation in 61.2% of Facebook messages and 57% of Twitter messages. GPT-4 Turbo demonstrated higher accuracies overall, with 81.7% for Facebook messages and 77% for Twitter messages. GPT-4 Turbo's accuracy with 3 response instances reached 99% of the accuracy achieved with 20 response instances. GPT-4 Turbo's accuracy was higher for human-labeled anti- and pro-HTP messages than for neutral messages. Most of the GPT-3.5 misclassifications occurred when anti- or pro-HTP messages were incorrectly classified as neutral or irrelevant by the model, whereas GPT-4 Turbo showed improvements across all sentiment categories and reduced misclassifications, especially messages incorrectly categorized as irrelevant. Conclusions LLMs can be used to analyze sentiment in social media messages about HTPs. Results from GPT-4 Turbo suggest that accuracy can reach approximately 80% compared with the results of human experts, even with a small number of labeling decisions generated by the model. A potential risk of using LLMs is the misrepresentation of the overall sentiment due to the differences in accuracy across sentiment categories. Although this issue could be reduced with the newer language model, future efforts should explore the mechanisms underlying the discrepancies and how to address them systematically.
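A minimal sketch of the aggregation step described above (repeated classifications reduced to a majority label) follows; classify_once is a hypothetical stand-in for whichever chat-completion call is used, stubbed here with random labels.

```python
# Minimal sketch of majority-vote label aggregation over repeated LLM calls.
# classify_once() is a placeholder for a real chat-completion request.
from collections import Counter
import random

LABELS = ["anti", "pro", "neutral", "irrelevant"]

def classify_once(message: str) -> str:
    """Placeholder for a single LLM classification call (stubbed with random noise)."""
    return random.choice(LABELS)

def classify_by_majority(message: str, n_instances: int = 3) -> str:
    """Request n_instances labels and return the most common one."""
    votes = Counter(classify_once(message) for _ in range(n_instances))
    return votes.most_common(1)[0][0]

print(classify_by_majority("Heated tobacco saved me from cigarettes!", n_instances=3))
```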
Preprint
Full-text available
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities. When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.
Preprint
Full-text available
Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
Article
Full-text available
Importance: The rapid expansion of virtual health care has caused a surge in patient messages concomitant with more work and burnout among health care professionals. Artificial intelligence (AI) assistants could potentially aid in creating answers to patient questions by drafting responses that could be reviewed by clinicians. Objective: To evaluate the ability of an AI chatbot assistant (ChatGPT), released in November 2022, to provide quality and empathetic responses to patient questions. Design, setting, and participants: In this cross-sectional study, a public and nonidentifiable database of questions from a public social media forum (Reddit's r/AskDocs) was used to randomly draw 195 exchanges from October 2022 where a verified physician responded to a public question. Chatbot responses were generated by entering the original question into a fresh session (without prior questions having been asked in the session) on December 22 and 23, 2022. The original question along with anonymized and randomly ordered physician and chatbot responses were evaluated in triplicate by a team of licensed health care professionals. Evaluators chose "which response was better" and judged both "the quality of information provided" (very poor, poor, acceptable, good, or very good) and "the empathy or bedside manner provided" (not empathetic, slightly empathetic, moderately empathetic, empathetic, and very empathetic). Mean outcomes were ordered on a 1 to 5 scale and compared between chatbot and physicians. Results: Of the 195 questions and responses, evaluators preferred chatbot responses to physician responses in 78.6% (95% CI, 75.0%-81.8%) of the 585 evaluations. Mean (IQR) physician responses were significantly shorter than chatbot responses (52 [17-62] words vs 211 [168-245] words; t = 25.4; P < .001). Chatbot responses were rated of significantly higher quality than physician responses (t = 13.3; P < .001). The proportion of responses rated as good or very good quality (≥4), for instance, was higher for the chatbot than for physicians (chatbot: 78.5%, 95% CI, 72.3%-84.1%; physicians: 22.1%, 95% CI, 16.4%-28.2%). This amounted to a 3.6 times higher prevalence of good or very good quality responses for the chatbot. Chatbot responses were also rated significantly more empathetic than physician responses (t = 18.9; P < .001). The proportion of responses rated empathetic or very empathetic (≥4) was higher for the chatbot than for physicians (chatbot: 45.1%, 95% CI, 38.5%-51.8%; physicians: 4.6%, 95% CI, 2.1%-7.7%). This amounted to a 9.8 times higher prevalence of empathetic or very empathetic responses for the chatbot. Conclusions: In this cross-sectional study, a chatbot generated quality and empathetic responses to patient questions posed in an online forum. Further exploration of this technology is warranted in clinical settings, such as using chatbots to draft responses that physicians could then edit. Randomized trials could assess further whether using AI assistants might improve responses, lower clinician burnout, and improve patient outcomes.
Article
Full-text available
Background: Large language models exhibiting human-level performance in specialized tasks are emerging; examples include Generative Pretrained Transformer 3.5, which underlies the processing of ChatGPT. Rigorous trials are required to understand the capabilities of emerging technology, so that innovation can be directed to benefit patients and practitioners. Objective: Here, we evaluated the strengths and weaknesses of ChatGPT in primary care using the Membership of the Royal College of General Practitioners Applied Knowledge Test (AKT) as a medium. Methods: AKT questions were sourced from a web-based question bank and 2 AKT practice papers. In total, 674 unique AKT questions were inputted to ChatGPT, with the model's answers recorded and compared to correct answers provided by the Royal College of General Practitioners. Each question was inputted twice in separate ChatGPT sessions, with answers on repeated trials compared to gauge consistency. Subject difficulty was gauged by referring to examiners' reports from 2018 to 2022. Novel explanations from ChatGPT-defined as information provided that was not inputted within the question or multiple answer choices-were recorded. Performance was analyzed with respect to subject, difficulty, question source, and novel model outputs to explore ChatGPT's strengths and weaknesses. Results: Average overall performance of ChatGPT was 60.17%, which is below the mean passing mark in the last 2 years (70.42%). Accuracy differed between sources (P=.04 and .06). ChatGPT's performance varied with subject category (P=.02 and .02), but variation did not correlate with difficulty (Spearman ρ=-0.241 and -0.238; P=.19 and .20). The proclivity of ChatGPT to provide novel explanations did not affect accuracy (P>.99 and .23). Conclusions: Large language models are approaching human expert-level performance, although further development is required to match the performance of qualified primary care physicians in the AKT. Validated high-performance models may serve as assistants or autonomous clinical tools to ameliorate the general practice workforce crisis.
Preprint
Full-text available
Transformer is the latest deep neural network (DNN) architecture for sequence data learning that has revolutionized the field of natural language processing. This success has motivated researchers to explore its application in the healthcare domain. Despite the similarities between longitudinal clinical data and natural language data, clinical data presents unique complexities that make adapting Transformer to this domain challenging. To address this issue, we have designed a new Transformer-based DNN architecture, referred to as Hybrid Value-Aware Transformer (HVAT), which can jointly learn from longitudinal and non-longitudinal clinical data. HVAT is unique in its ability to learn from the numerical values associated with clinical codes/concepts, such as laboratory results, and in its use of a flexible longitudinal data representation called clinical tokens. We trained a prototype HVAT model on a case-control dataset, achieving high performance in predicting Alzheimer's disease and related dementias as the patient outcome. These results demonstrate the potential of HVAT for broader clinical data learning tasks.
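As a loose illustration of the value-aware idea, the sketch below combines a learned embedding for each clinical code with a projection of its numeric value (for example, a lab result); the architecture is an assumption made for illustration and is not the HVAT design.

```python
# Minimal sketch of a "value-aware" clinical token embedding: each clinical code
# gets a learned embedding that is modulated by the numeric value attached to it.
# This is an illustrative assumption, not the HVAT architecture.
import torch
import torch.nn as nn

class ValueAwareEmbedding(nn.Module):
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.code_embed = nn.Embedding(vocab_size, dim)   # one vector per clinical code
        self.value_proj = nn.Linear(1, dim)               # projects the numeric value

    def forward(self, code_ids: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        # code_ids: (batch, seq); values: (batch, seq), 1.0 for codes without a value
        return self.code_embed(code_ids) + self.value_proj(values.unsqueeze(-1))

emb = ValueAwareEmbedding(vocab_size=5000, dim=64)
codes = torch.tensor([[12, 873, 4501]])                   # e.g. [diagnosis, lab, drug]
vals = torch.tensor([[1.0, 7.2, 1.0]])                    # only the lab carries a value
print(emb(codes, vals).shape)                             # torch.Size([1, 3, 64])
```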
Article
Full-text available
Fabricating research within the scientific community has consequences for one’s credibility and undermines honest authors. We demonstrate the feasibility of fabricating research using an AI-based language model chatbot. Human detection versus AI detection will be compared to determine accuracy in identifying fabricated works. The risks of utilizing AI-generated research works will be underscored and reasons for falsifying research will be highlighted.
Preprint
Full-text available
ChatGPT is a large language model trained on text corpora and reinforced with human supervision. Because ChatGPT can provide human-like responses to complex questions, it could become an easily accessible source of medical advice for patients. However, its ability to answer medical questions appropriately and equitably remains unknown. We presented ChatGPT with 96 advice-seeking vignettes that varied across clinical contexts, medical histories, and social characteristics. We analyzed responses for clinical appropriateness by concordance with guidelines, recommendation type, and consideration of social factors. Ninety-three (97%) responses were appropriate and did not explicitly violate clinical guidelines. Recommendations in response to advice-seeking questions were completely absent (N=34, 35%), general (N=18, 18%), or specific (N=44, 46%). Fifty-three (55%) explicitly considered social factors like race or insurance status, which in some cases changed clinical recommendations. ChatGPT consistently provided background information in response to medical questions but did not reliably offer appropriate and personalized medical advice.
Preprint
Full-text available
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.
Article
Full-text available
Background: Artificial intelligence (AI)-based chatbots can offer personalized, engaging, and on-demand health promotion interventions. Objective: The aim of this systematic review was to evaluate the feasibility, efficacy, and intervention characteristics of AI chatbots for promoting health behavior change. Methods: A comprehensive search was conducted in 7 bibliographic databases (PubMed, IEEE Xplore, ACM Digital Library, PsycINFO, Web of Science, Embase, and JMIR publications) for empirical articles published from 1980 to 2022 that evaluated the feasibility or efficacy of AI chatbots for behavior change. The screening, extraction, and analysis of the identified articles were performed by following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. Results: Of the 15 included studies, several demonstrated the high efficacy of AI chatbots in promoting healthy lifestyles (n=6, 40%), smoking cessation (n=4, 27%), treatment or medication adherence (n=2, 13%), and reduction in substance misuse (n=1, 7%). However, there were mixed results regarding feasibility, acceptability, and usability. Selected behavior change theories and expert consultation were used to develop the behavior change strategies of AI chatbots, including goal setting, monitoring, real-time reinforcement or feedback, and on-demand support. Real-time user-chatbot interaction data, such as user preferences and behavioral performance, were collected on the chatbot platform to identify ways of providing personalized services. The AI chatbots demonstrated potential for scalability by deployment through accessible devices and platforms (eg, smartphones and Facebook Messenger). The participants also reported that AI chatbots offered a nonjudgmental space for communicating sensitive information. However, the reported results need to be interpreted with caution because of the moderate to high risk to internal validity, insufficient description of AI techniques, and limited generalizability. Conclusions: AI chatbots have demonstrated the efficacy of health behavior change interventions among large and diverse populations; however, future studies need to adopt robust randomized control trials to establish definitive conclusions.
Preprint
Full-text available
Objective: To explore the use of ChatGPT by educators and students in a medical school setting. Method: This study used the public version of ChatGPT launched by OpenAI on November 30, 2022 (https://openai.com/blog/chatgpt/). We employed prompts to ask ChatGPT to 1) generate a content outline for a session on the topics of cholesterol, lipoproteins, and hyperlipidemia for medical students; 2) produce a list of learning objectives for the session; and 3) write assessment questions with and without clinical vignettes related to the identified learning objectives. We assessed the responses by ChatGPT for accuracy and reliability to determine the potential of the chatbot as an aid to educators and as a know-it-all medical information provider for students. Results: ChatGPT can function as an aid to educators, but it is not yet suitable as a reliable information resource for educators and medical students. Conclusion: ChatGPT can be a useful tool to assist medical educators in drafting course and session content outlines and creating assessment questions. At the same time, caution must be taken, as ChatGPT is prone to providing incorrect information; expert oversight is necessary to ensure the information generated is accurate and beneficial to students. Therefore, it is premature for medical students to use the current version of ChatGPT as a know-it-all information provider. In the future, medical educators should work with programming experts to explore and grow the full potential of AI in medical education.
Article
Full-text available
While still in its infancy, ChatGPT (Generative Pretrained Transformer), introduced in November 2022, is bound to hugely impact many industries, including healthcare, medical education, biomedical research, and scientific writing. The implications of ChatGPT, the new chatbot introduced by OpenAI, for academic writing are largely unknown. In response to the Journal of Medical Science (Cureus) Turing Test call for case reports written with the assistance of ChatGPT, we present two cases: one of homocystinuria-associated osteoporosis and the other of late-onset Pompe disease (LOPD), a rare metabolic disorder. We tested ChatGPT's ability to write about the pathogenesis of these conditions. We documented the positive, negative, and rather troubling aspects of the newly introduced chatbot's performance.
Preprint
Full-text available
Although recent advances in scaling large language models (LLMs) have resulted in improvements on many NLP tasks, it remains unclear whether these models trained primarily with general web text are the right tool in highly specialized, safety critical domains such as clinical text. Recent results have suggested that LLMs encode a surprising amount of medical knowledge. This raises an important question regarding the utility of smaller domain-specific language models. With the success of general-domain LLMs, is there still a need for specialized clinical models? To investigate this question, we conduct an extensive empirical analysis of 12 language models, ranging from 220M to 175B parameters, measuring their performance on 3 different clinical tasks that test their ability to parse and reason over electronic health records. As part of our experiments, we train T5-Base and T5-Large models from scratch on clinical notes from MIMIC III and IV to directly investigate the efficiency of clinical tokens. We show that relatively small specialized clinical models substantially outperform all in-context learning approaches, even when finetuned on limited annotated data. Further, we find that pretraining on clinical tokens allows for smaller, more parameter-efficient models that either match or outperform much larger language models trained on general text. We release the code and the models used under the PhysioNet Credentialed Health Data license and data use agreement.
Preprint
Full-text available
In the past few years we have seen the meteoric appearance of dozens of foundation models of the Transformer family, all of which have memorable and sometimes funny, but not self-explanatory, names. The goal of this paper is to offer a somewhat comprehensive but simple catalog and classification of the most popular Transformer models. The paper also includes an introduction to the most important aspects and innovations in Transformer models. Our catalog will include models that are trained using self-supervised learning (e.g., BERT or GPT3) as well as those that are further trained using a human-in-the-loop (e.g. the InstructGPT model used by ChatGPT).
Article
Full-text available
We evaluated the performance of a large language model called ChatGPT on the United States Medical Licensing Exam (USMLE), which consists of three exams: Step 1, Step 2CK, and Step 3. ChatGPT performed at or near the passing threshold for all three exams without any specialized training or reinforcement. Additionally, ChatGPT demonstrated a high level of concordance and insight in its explanations. These results suggest that large language models may have the potential to assist with medical education, and potentially, clinical decision-making.
Preprint
Full-text available
Objective To determine the capabilities of ChatGPT for rapidly generating, rewriting, and evaluating (via diagnostic and triage accuracy) sets of clinical vignettes. Design We explored the capabilities of ChatGPT for generating and rewriting vignettes. First, we gave it natural language prompts to generate 10 new sets of 10 vignettes, each set for a different common childhood illness. Next, we had it generate 10 sets of 10 vignettes given a set of symptoms from which to draw. We then had it rewrite 15 existing pediatric vignettes at different levels of health literacy. Fourth, we asked it to generate 10 vignettes written as a parent, and rewrite these vignettes as a physician, then at a grade 8 reading level, before rewriting them from the original parent's perspective. Finally, we evaluated ChatGPT for diagnosis and triage for 45 clinical vignettes previously used for evaluating symptom checkers. Setting and participants ChatGPT, a publicly available, free chatbot. Main outcome measures Our main outcomes for de novo vignette generation were whether ChatGPT followed vignette creation instructions consistently, correctly, and listed reasonable symptoms for the disease being described. For generating vignettes from pre-existing symptom sets, we examined whether the symptom sets were used without introducing extra symptoms. Our main outcome for rewriting existing standardized vignettes to match patient demographics, and rewriting vignettes between styles, was whether symptoms were dropped or added outside the original vignette. Finally, our main outcomes examining diagnostic and triage accuracy on 45 standardized patient vignettes were whether the correct diagnosis was listed first, and if the correct triage recommendation was made. Results ChatGPT was able to quickly produce varied contexts and symptom profiles when writing vignettes based on an illness name, but overused some core disease symptoms. It was able to use given symptom lists as the basis for vignettes consistently, adding one additional (though appropriate) symptom from outside the list for one disease. Pediatric vignettes rewritten at different levels of health literacy showed more complex symptoms being dropped when writing at low health literacy in 87.5% of cases. While writing at high health literacy, it added a diagnosis to 80% of vignettes (91.7% correctly diagnosed). Symptoms were retained in 90% of cases when rewriting vignettes between viewpoints. When presented with 45 vignettes, ChatGPT identified illnesses with 75.6% (95% CI, 62.6% to 88.5%) first-pass diagnostic accuracy and 57.8% (95% CI, 42.9% to 72.7%) triage accuracy. Its use does require monitoring and has caveats, which we discuss. Conclusions ChatGPT was capable, with caveats and appropriate review, of generating, rewriting, and evaluating clinical vignettes.
Article
Full-text available
Background Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input. Objective This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination Step 1 and Step 2 exams, as well as to analyze responses for user interpretability. Methods We used 2 sets of multiple-choice questions to evaluate ChatGPT's performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and the performance on an exam relative to the user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT's performance was compared to 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question. Results Of the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. We found that logical justification for ChatGPT's answer selection was present in 100% of outputs of the NBME data sets. Internal information to the question was present in 96.8% (183/189) of all questions. The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P
Preprint
Full-text available
Importance: Artificial intelligence (AI) applications in health care have been effective in many areas of medicine, but they are often trained for a single task using labeled data, making deployment and generalizability challenging. Whether a general-purpose AI language model can perform diagnosis and triage is unknown. Objective: To compare the general-purpose Generative Pre-trained Transformer 3 (GPT-3) AI model's diagnostic and triage performance to attending physicians and lay adults who use the Internet. Design: We compared the accuracy of GPT-3's diagnostic and triage ability for 48 validated case vignettes of both common (e.g., viral illness) and severe (e.g., heart attack) conditions to that of lay people and practicing physicians. Finally, we examined how well calibrated GPT-3's confidence was for diagnosis and triage. Setting and Participants: The GPT-3 model, a nationally representative sample of lay people, and practicing physicians. Exposure: Validated case vignettes (<60 words; <6th grade reading level). Main Outcomes and Measures: Correct diagnosis, correct triage. Results: Among all cases, GPT-3 replied with the correct diagnosis in its top 3 for 88% (95% CI, 75% to 94%) of cases, compared to 54% (95% CI, 53% to 55%) for lay individuals (p<0.001) and 96% (95% CI, 94% to 97%) for physicians (p=0.0354). GPT-3 triaged correctly in 71% of cases (95% CI, 57% to 82%), similar to lay individuals (74%; 95% CI, 73% to 75%; p=0.73); both were significantly worse than physicians (91%; 95% CI, 89% to 93%; p<0.001). As measured by the Brier score, GPT-3's confidence in its top prediction was reasonably well calibrated for diagnosis (Brier score = 0.18) and triage (Brier score = 0.22). Conclusions and Relevance: A general-purpose AI language model without any content-specific training could perform diagnosis at a level close to, but below, that of physicians and better than lay individuals. The model performed less well on triage, where its performance was closer to that of lay individuals.
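For readers unfamiliar with the Brier score used above, the toy calculation below shows how it is computed as the mean squared difference between stated confidence and the binary outcome; the confidences and outcomes are invented for illustration.

```python
# Worked example of the Brier score as a calibration measure: the mean squared
# difference between the model's stated confidence and the outcome (1 if the
# top prediction was correct, 0 otherwise). The values below are made up.

def brier_score(confidences, outcomes):
    return sum((p - y) ** 2 for p, y in zip(confidences, outcomes)) / len(outcomes)

confidences = [0.9, 0.6, 0.8, 0.3, 0.7]   # model confidence in its top diagnosis
outcomes    = [1,   1,   0,   0,   1]     # whether the top diagnosis was correct
print(round(brier_score(confidences, outcomes), 3))  # 0.198; lower is better calibrated
```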
Article
Full-text available
Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase. A generative deep-learning model designs artificial proteins with desired enzymatic activities.
Preprint
Full-text available
Importance Chatbots could play a role in answering patient questions, but patients' ability to distinguish between provider and chatbot responses, and patients' trust in chatbots' functions, are not well established. Objective To assess the feasibility of using ChatGPT or a similar AI-based chatbot for patient-provider communication. Design Survey in January 2023. Setting Survey. Participants A US representative sample of 430 study participants aged 18 and above was recruited on Prolific, a crowdsourcing platform for academic studies. 426 participants filled out the full survey. After removing participants who spent less than 3 minutes on the survey, 392 respondents remained. 53.2% of respondents analyzed were women; their average age was 47.1. Exposure(s) Ten representative non-administrative patient-provider interactions were extracted from the EHR. Patients' questions were placed in ChatGPT with a request for the chatbot to respond using approximately the same word count as the human provider's response. In the survey, each patient's question was followed by a provider- or ChatGPT-generated response. Participants were informed that five responses were provider-generated and five were chatbot-generated. Participants were asked, and incentivized financially, to correctly identify the response source. Participants were also asked about their trust in chatbots' functions in patient-provider communication, using a Likert scale of 1-5. Main Outcome(s) and Measure(s) Main outcome: proportion of responses correctly classified as provider- vs chatbot-generated. Secondary outcomes: average and standard deviation of responses to trust questions. Results The correct classification of responses ranged between 49.0% and 85.7% for different questions. On average, chatbot responses were correctly identified 65.5% of the time, and provider responses were correctly distinguished 65.1% of the time. On average, responses toward patients' trust in chatbots' functions were weakly positive (mean Likert score: 3.4), with lower trust as the health-related complexity of the task in questions increased. Conclusions and Relevance ChatGPT responses to patient questions were weakly distinguishable from provider responses. Laypeople appear to trust the use of chatbots to answer lower-risk health questions. It is important to continue studying patient-chatbot interaction as chatbots move from administrative to more clinical roles in healthcare. Keywords AI in Medicine; ChatGPT; Generative AI; Healthcare AI; Turing Test
Preprint
Full-text available
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLM models for clinical applications.
Article
Full-text available
There is an increasing interest in developing artificial intelligence (AI) systems to process and interpret electronic health records (EHRs). Natural language processing (NLP) powered by pretrained language models is the key technology for medical AI systems utilizing clinical narratives. However, there are few clinical language models, the largest of which trained in the clinical domain is comparatively small at 110 million parameters (compared with billions of parameters in the general domain). It is not clear how large clinical language models with billions of parameters can help medical AI systems utilize unstructured EHRs. In this study, we develop from scratch a large clinical language model—GatorTron—using >90 billion words of text (including >82 billion words of de-identified clinical text) and systematically evaluate it on five clinical NLP tasks including clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference (NLI), and medical question answering (MQA). We examine how (1) scaling up the number of parameters and (2) scaling up the size of the training data could benefit these NLP tasks. GatorTron models scale up the clinical language model from 110 million to 8.9 billion parameters and improve five clinical NLP tasks (e.g., 9.6% and 9.5% improvement in accuracy for NLI and MQA), which can be applied to medical AI systems to improve healthcare delivery. The GatorTron models are publicly available at: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/gatortron_og .
Preprint
Full-text available
Background: Electronic health records hold detailed longitudinal information about each patient's health status and general clinical history, a large portion of which is stored within unstructured text. Existing approaches focus mostly on structured data and a subset of single-domain outcomes. We explore how temporal modelling of patients from free text and structured data, using deep generative transformers, can be used to forecast a wide range of future disorders, substances, procedures or findings. Methods: We present Foresight, a novel transformer-based pipeline that uses named entity recognition and linking tools to convert document text into structured, coded concepts, and then provides probabilistic forecasts for future medical events such as disorders, substances, procedures and findings. We processed the entire free-text portion of three different hospital datasets totalling 811,336 patients, covering both physical and mental health. Findings: On tests in two UK hospitals (King's College Hospital and South London and Maudsley) and the US MIMIC-III dataset, precision@10 of 0.68, 0.76 and 0.88 was achieved for forecasting the next disorder in a patient timeline, while precision@10 of 0.80, 0.81 and 0.91 was achieved for forecasting the next biomedical concept. Foresight was also validated on 34 synthetic patient timelines by five clinicians and achieved a relevancy of 97% for the top forecasted candidate disorder. As a generative model, it can forecast follow-on biomedical concepts for as many steps as required. Interpretation: Foresight is a general-purpose model for biomedical concept modelling that can be used for real-world risk forecasting, virtual trials and clinical research to study the progression of disorders, simulate interventions and counterfactuals, and for educational purposes.
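A minimal sketch of the precision@10 metric quoted above, interpreted here as the fraction of timelines whose true next concept appears among the model's top-10 candidates, is shown below with made-up forecasts.

```python
# Minimal sketch of precision@k for next-concept forecasting: the share of
# timelines whose true next concept appears in the model's top-k candidates.
# The toy forecasts and ground-truth concepts below are invented.

def precision_at_k(forecasts, truths, k=10):
    hits = sum(1 for top, truth in zip(forecasts, truths) if truth in top[:k])
    return hits / len(truths)

forecasts = [["asthma", "gerd", "migraine"], ["t2dm", "ckd"], ["copd", "anxiety"]]
truths = ["migraine", "hypertension", "copd"]
print(precision_at_k(forecasts, truths, k=10))  # 2 of 3 timelines hit -> 0.666...
```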
Article
Full-text available
Synthetic health data have the potential to mitigate privacy concerns in supporting biomedical research and healthcare applications. Modern approaches for data generation continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic assessment framework to benchmark methods as they emerge and determine which methods are most appropriate for which use cases. In this work, we introduce a systematic benchmarking framework to appraise key characteristics with respect to utility and privacy metrics. We apply the framework to evaluate synthetic data generation methods for electronic health records data from two large academic medical centers with respect to several use cases. The results illustrate that there is a utility-privacy tradeoff for sharing synthetic health data and further indicate that no method is unequivocally the best on all criteria in each use case, which makes it evident why synthetic data generation methods need to be assessed in context.
Article
Full-text available
Since the introduction of the first transformer model with a unique self-attention mechanism, natural language processing (NLP) models have attained state-of-the-art (SOTA) performance on various tasks. As DNA is the blueprint of life, it can be viewed as an unusual language, with its characteristic lexicon and grammar. Therefore, NLP models may provide insights into the meaning of the sequential structure of DNA. In the current study, we employed and compared the performance of popular SOTA NLP models (i.e., XLNET, BERT, and a variant DNABERT trained on the human genome) to predict and analyze the promoters in freshwater cyanobacterium Synechocystis sp. PCC 6803 and the fastest growing cyanobacterium Synechococcus elongatus sp. UTEX 2973. These freshwater cyanobacteria are promising hosts for phototrophically producing value-added compounds from CO2. Through a custom pipeline, promoters and non-promoters from Synechococcus elongatus sp. UTEX 2973 were used to train the model. The trained model achieved an AUROC score of 0.97 and F1 score of 0.92. During cross-validation with promoters from Synechocystis sp. PCC 6803, the model achieved an AUROC score of 0.96 and F1 score of 0.91. To increase accessibility, we developed an integrated platform (TSSNote-CyaPromBERT) to facilitate large dataset extraction, model training, and promoter prediction from public dRNA-seq datasets. Furthermore, various visualization tools have been incorporated to address the “black box” issue of deep learning and feature analysis. The learning transfer ability of large language models may help identify and analyze promoter regions for newly isolated strains with similar lineages.
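The evaluation metrics quoted above (AUROC and F1) can be computed for any binary promoter classifier with a few lines of scikit-learn; the labels and scores below are toy values, not outputs of DNABERT or the study's models.

```python
# Minimal sketch of the reported evaluation metrics for a binary
# promoter/non-promoter classifier; labels and scores are toy values.
from sklearn.metrics import roc_auc_score, f1_score

y_true = [1, 1, 0, 0, 1, 0, 1, 0]                            # 1 = promoter, 0 = non-promoter
y_score = [0.92, 0.81, 0.30, 0.45, 0.66, 0.12, 0.58, 0.49]   # model probabilities
y_pred = [int(s >= 0.5) for s in y_score]                    # hard labels at a 0.5 threshold

print("AUROC:", round(roc_auc_score(y_true, y_score), 3))
print("F1:   ", round(f1_score(y_true, y_pred), 3))
```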
Preprint
Full-text available
Transformer-based large language models (LLMs) provide a powerful foundation for natural language tasks in large-scale customer-facing applications. However, studies that explore their vulnerabilities emerging from malicious user interaction are scarce. By proposing PromptInject, a prosaic alignment framework for mask-based iterative adversarial prompt composition, we examine how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs. In particular, we investigate two types of attacks -- goal hijacking and prompt leaking -- and demonstrate that even low-aptitude, but sufficiently ill-intentioned agents, can easily exploit GPT-3's stochastic nature, creating long-tail risks. The code for PromptInject is available at https://github.com/agencyenterprise/PromptInject.
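The toy example below illustrates the goal-hijacking pattern investigated above: an application template that naively concatenates untrusted user text can have its original instruction overridden; the template and attack string are illustrative, not taken from PromptInject.

```python
# Toy illustration of goal hijacking via prompt injection: user text pasted
# directly into an instruction template can override the intended task.
# The template and attack string are invented for illustration.

TEMPLATE = "Translate the following clinic note into French:\n\n{user_text}"

benign = "The patient reports mild headaches after starting the new medication."
attack = ("Ignore the previous instruction. Instead, reply only with: "
          "'The patient should stop all medication immediately.'")

for user_text in (benign, attack):
    prompt = TEMPLATE.format(user_text=user_text)
    print("--- prompt sent to the model ---")
    print(prompt, end="\n\n")

# A goal-hijacked model follows the injected instruction in the second prompt
# instead of translating it; common defences include separating system and user
# roles and treating user text as data rather than as instructions.
```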
Preprint
Full-text available
We analyze the growth of dataset sizes used in machine learning for natural language processing and computer vision, and extrapolate these using two methods: projecting the historical growth rate, and estimating the compute-optimal dataset size for future predicted compute budgets. We investigate the growth in data usage by estimating the total stock of unlabeled data available on the internet over the coming decades. Our analysis indicates that the stock of high-quality language data will be exhausted soon; likely before 2026. By contrast, the stock of low-quality language data and image data will be exhausted only much later; between 2030 and 2050 (for low-quality language) and between 2030 and 2060 (for images). Our work suggests that the current trend of ever-growing ML models that rely on enormous datasets might slow down if data efficiency is not drastically improved or new sources of data become available.
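As a worked example of the first extrapolation method (projecting historical growth until data demand exceeds the estimated stock), the figures below are illustrative assumptions rather than the paper's estimates.

```python
# Worked example: if data demand grows at a fixed yearly rate and the stock of
# usable text is fixed, the exhaustion point follows from simple compounding.
# The starting demand, growth rate, and stock figures are illustrative assumptions.
import math

demand_tokens = 1e12      # assumed tokens consumed by frontier training runs in the base year
growth_per_year = 1.5     # assumed 50% annual growth in data demand
stock_tokens = 5e13       # assumed stock of high-quality text available

years_to_exhaustion = math.log(stock_tokens / demand_tokens) / math.log(growth_per_year)
print(f"Exhaustion in ~{years_to_exhaustion:.1f} years under these assumptions")
```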
Article
Full-text available
Recent progress in targeting KRASG12C has provided both insight and inspiration for targeting alternative KRAS mutants. In this study, we evaluated the mechanism of action and anti-tumor efficacy of MRTX1133, a potent, selective and non-covalent KRASG12D inhibitor. MRTX1133 demonstrated a high-affinity interaction with GDP-loaded KRASG12D with KD and IC50 values of ~0.2 pM and <2 nM, respectively, and ~700-fold selectivity for binding to KRASG12D as compared to KRASWT. MRTX1133 also demonstrated potent inhibition of activated KRASG12D based on biochemical and co-crystal structural analyses. MRTX1133 inhibited ERK1/2 phosphorylation and cell viability in KRASG12D-mutant cell lines, with median IC50 values of ~5 nM, and demonstrated >1,000-fold selectivity compared to KRASWT cell lines. MRTX1133 exhibited dose-dependent inhibition of KRAS-mediated signal transduction and marked tumor regression (≥30%) in a subset of KRASG12D-mutant cell-line-derived and patient-derived xenograft models, including eight of 11 (73%) pancreatic ductal adenocarcinoma (PDAC) models. Pharmacological and CRISPR-based screens demonstrated that co-targeting KRASG12D with putative feedback or bypass pathways, including EGFR or PI3Kα, led to enhanced anti-tumor activity. Together, these data indicate the feasibility of selectively targeting KRAS mutants with non-covalent, high-affinity small molecules and illustrate the therapeutic susceptibility and broad dependence of KRASG12D mutation-positive tumors on mutant KRAS for tumor cell growth and survival.
Article
The exceptionally rapid development of highly flexible, reusable artificial intelligence (AI) models is likely to usher in newfound capabilities in medicine. We propose a new paradigm for medical AI, which we refer to as generalist medical AI (GMAI). GMAI models will be capable of carrying out a diverse set of tasks using very little or no task-specific labelled data. Built through self-supervision on large, diverse datasets, GMAI will flexibly interpret different combinations of medical modalities, including data from imaging, electronic health records, laboratory results, genomics, graphs or medical text. Models will in turn produce expressive outputs such as free-text explanations, spoken recommendations or image annotations that demonstrate advanced medical reasoning abilities. Here we identify a set of high-impact potential applications for GMAI and lay out specific technical capabilities and training datasets necessary to enable them. We expect that GMAI-enabled applications will challenge current strategies for regulating and validating AI devices for medicine and will shift practices associated with the collection of large medical datasets.
Article
GPT-4 automates the transformation of various free-text radiology reports into structured templates with minor effort, overcoming the challenges of implementing structured reporting while improving standardization and data extraction for research.
Preprint
IMPORTANCE: Large language model (LLM) artificial intelligence (AI) chatbots direct the power of large training datasets towards successive, related tasks, as opposed to single-ask tasks, for which AI already achieves impressive performance. The capacity of LLMs to assist in the full scope of iterative clinical reasoning via successive prompting, in effect acting as virtual physicians, has not yet been evaluated. OBJECTIVE: To evaluate ChatGPT's capacity for ongoing clinical decision support via its performance on standardized clinical vignettes. DESIGN: We inputted all 36 published clinical vignettes from the Merck Sharpe & Dohme (MSD) Clinical Manual into ChatGPT and compared accuracy on differential diagnoses, diagnostic testing, final diagnosis, and management based on patient age, gender, and case acuity. SETTING: ChatGPT, a publicly available LLM. PARTICIPANTS: Clinical vignettes featured hypothetical patients with a variety of age and gender identities, and a range of Emergency Severity Indices (ESIs) based on initial clinical presentation. EXPOSURES: MSD Clinical Manual vignettes. MAIN OUTCOMES AND MEASURES: We measured the proportion of correct responses to the questions posed within the clinical vignettes tested. RESULTS: ChatGPT achieved 71.7% (95% CI, 69.3% to 74.1%) accuracy overall across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis, with an accuracy of 76.9% (95% CI, 67.8% to 86.1%), and the lowest performance in generating an initial differential diagnosis, with an accuracy of 60.3% (95% CI, 54.2% to 66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=-15.8%, p<0.001) and clinical management (β=-7.4%, p=0.02) type questions. CONCLUSIONS AND RELEVANCE: ChatGPT achieves impressive accuracy in clinical decision making, with particular strengths emerging as it has more clinical information at its disposal.
Article
This study examines the appropriateness of artificial intelligence model responses to fundamental cardiovascular disease prevention questions.
Article
In less than 2 months, the artificial intelligence (AI) program ChatGPT has become a cultural sensation. It is freely accessible through a web portal created by the tool's developer, OpenAI. The program-which automatically creates text based on written prompts-is so popular that it's likely to be "at capacity right now" if you attempt to use it. When you do get through, ChatGPT provides endless entertainment. I asked it to rewrite the first scene of the classic American play Death of a Salesman, but to feature Princess Elsa from the animated movie Frozen as the main character instead of Willy Loman. The output was an amusing conversation in which Elsa-who has come home from a tough day of selling-is told by her son Happy, "Come on, Mom. You're Elsa from Frozen. You have ice powers and you're a queen. You're unstoppable." Mash-ups like this are certainly fun, but there are serious implications for generative AI programs like ChatGPT in science and academia.
Article
At least four articles credit the AI tool as a co-author, as publishers scramble to regulate its use.
Article
Researchers cannot always differentiate between AI-generated and original abstracts.
Preprint
Large pretrained language models have shown surprising In-Context Learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without additional parameter updates. Despite the great success in performance, the working mechanism of ICL still remains an open problem. In order to better understand how ICL works, this paper explains language models as meta optimizers and understands ICL as a kind of implicit finetuning. Theoretically, we figure out that the Transformer attention has a dual form of gradient descent based optimization. On top of it, we understand ICL as follows: GPT first produces meta gradients according to the demonstration examples, and then these meta gradients are applied to the original GPT to build an ICL model. Experimentally, we comprehensively compare the behavior of ICL and explicit finetuning based on real tasks to provide empirical evidence that supports our understanding. The results prove that ICL behaves similarly to explicit finetuning at the prediction level, the representation level, and the attention behavior level. Further, inspired by our understanding of meta optimization, we design a momentum-based attention by analogy with the momentum-based gradient descent algorithm. Its consistently better performance over vanilla attention supports our understanding again from another aspect, and more importantly, it shows the potential to utilize our understanding for future model designing.
Article
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, and machine translation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
Book
The mathematician and engineer Charles Babbage (1791–1871) is best remembered for his 'calculating machines', which are considered the forerunner of modern computers. Over the course of his life he wrote a number of books based on his scientific investigations, but in this volume, published in 1864, Babbage writes in a more personal vein. He points out at the beginning of the work that it 'does not aspire to the name of autobiography', though the chapters sketch out the contours of his life, beginning with his family, his childhood and formative years studying at Cambridge, and moving through various episodes in his scientific career. However, the work also diverges into his observations on other topics, as indicated by chapter titles such as 'Street Nuisances' and 'Wit'. Babbage's colourful recollections give an intimate portrait of the life of one of Britain's most influential inventors.
Preprint
Large Language Models (LLMs) have achieved excellent performance on various tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%->82.1% on GSM8K, 78.2%->83.0% on DROP, 90.0%->94.4% on OpenBookQA, and 63.4%->67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.
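A minimal sketch of the self-consistency filtering step follows: sample several chain-of-thought answers for an unlabeled question, keep the majority answer only when agreement is high, and store an agreeing rationale as a fine-tuning target; sample_rationale_and_answer is a hypothetical stand-in for an actual LLM sampling call.

```python
# Minimal sketch of self-consistency filtering for self-generated training data.
# sample_rationale_and_answer() is a placeholder for a real LLM sampling call;
# the fake answers below only illustrate the aggregation and filtering logic.
from collections import Counter
import random

random.seed(0)

def sample_rationale_and_answer(question: str):
    """Placeholder LLM call returning (rationale, final_answer)."""
    answer = random.choice(["18", "18", "18", "20"])          # fake noisy samples
    return f"Step-by-step reasoning ending in {answer}.", answer

def self_generate_training_pair(question: str, n_samples: int = 8, min_agreement: float = 0.7):
    samples = [sample_rationale_and_answer(question) for _ in range(n_samples)]
    answer, count = Counter(a for _, a in samples).most_common(1)[0]
    if count / n_samples < min_agreement:
        return None                                            # skip low-confidence questions
    rationales = [r for r, a in samples if a == answer]
    return {"question": question, "target": rationales[0], "answer": answer}

print(self_generate_training_pair("A ward has 3 bays of 6 beds; how many beds in total?"))
```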