Article

Large language models should be used as scientific reasoning engines, not knowledge databases

... Data analysis techniques originate from multiple application fields [21,148,47,175,142]. Though loosely grouped, they have the same ability to find useful structures in all kinds of data records [81]. ...
... As a result, data analysts across various industries prefer interpretable and editable machine learning methods [148,47,175,142]. The question then arises: how can PFMs aid in the development of interpretable and editable machine learning? ...
Preprint
Full-text available
Data analysis focuses on harnessing advanced statistics, programming, and machine learning techniques to extract valuable insights from vast datasets. An increasing volume and variety of research has emerged, addressing datasets of diverse modalities, formats, scales, and resolutions across various industries. However, experienced data analysts often find themselves overwhelmed by intricate details in ad-hoc solutions or by attempts to properly extract the semantics of grounded data, which makes such systems difficult to maintain and to scale. Pre-trained foundation models (PFMs), trained on large amounts of grounded data that previous data analysis methods cannot fully exploit, combine reasoning over admissible subsets of results with statistical approximations arising from surprising engineering effects, and can thereby automate and enhance the analysis process. This motivates revisiting data analysis to make better sense of data with PFMs. This paper provides a comprehensive review of systematic approaches to optimizing data analysis through the power of PFMs, while critically identifying the limitations of PFMs, to establish a roadmap for their future application in data analysis.
... This comparison includes benchmarks in image- and language-pretraining, such as those reported in the recent work by Huang et al 23,24 . While this suggests a promising direction for the application of LLMs in medical image analysis, it also highlights the need for further research and validation [25][26][27] . Our study aims to contribute to the ongoing dialogue on the utility of LLMs in medical science, particularly in integrating and interpreting complex visual and textual data, a prerequisite for foundational models [28][29][30] . ...
... Given the general sparsity of medical training data and the high costs of labeling data with domain experts, the use of models such as Flamingo-80B possesses great potential. In addition, their inherent knowledge and ability to process information from other domains can facilitate the linking of different domains within the medical field and the incorporation of existing knowledge 18,26 . ...
Preprint
Full-text available
Advanced multimodal large language models (LLMs), such as GPT-4V(ision) and Gemini Ultra, have shown promising results in the diagnosis of complex pathological conditions. This raises questions about their knowledge base: Do these models deeply understand medical cases, including images, or do they simply recognize superficial patterns from extensive pre-training? We aimed to determine whether LLMs can develop usable internal representations of images, and if these representations improve the classification of medical images. We rigorously tested the performance of the open-source Flamingo-80B model, which is not specifically tailored for medical tasks, against traditional pre-training methods. The tests covered eight distinct image classification tasks in pathology, dermatology, ophthalmology, and radiology, using CLIP, Flamingo-80B, and 9B multimodal models. These tasks ranged from tissue and nuclear classification in histopathology to lesion detection in dermatology and disease grading in radiology. We systematically evaluated the model's internal image representations to determine their relevance and usefulness in medical diagnosis. Our analysis showed that the internal representation of these images in the largest model, Flamingo-80B, was more accurate in classifying medical images than all other methods. These results held even when the number of samples available for training was small. Our results show that multimodal LLMs acquire structured knowledge in medical domains. This suggests that these models are evolving from mere pattern recognition tools into entities with broader medical generalist capabilities. This evolution underscores the potential for these models to make contributions to medical diagnosis and research, although it is important to continue to evaluate their capabilities and limitations in real-world medical settings.
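Evaluating a frozen model's internal image representations is commonly done with a linear probe: extract embeddings once, then train only a small classifier on top. The sketch below illustrates that pattern under the assumption of precomputed placeholder embeddings; the actual Flamingo-80B feature extraction and the study's data are not reproduced here.

```python
# Minimal linear-probe sketch: evaluate frozen image embeddings for classification.
# Assumes `embeddings` (n_samples x dim) were precomputed with some vision(-language)
# backbone; the extraction step itself is not shown and the data below are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768))   # placeholder features
labels = rng.integers(0, 2, size=500)      # placeholder binary labels (e.g. tumor vs normal)

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, stratify=labels, random_state=0
)

probe = LogisticRegression(max_iter=1000)  # only this small head is trained
probe.fit(X_train, y_train)
print("linear-probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```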
... Remarkable achievements have been made in transferring the above pathology vision-language models to a wide range of downstream tasks, including patch-level histopathology image classification, segmentation, captioning, and retrieval 18 . Meanwhile, large language models (LLMs) have shown great potential in performing logic, analogy, causal reasoning, extrapolation, and evidence evaluation for medical and scientific applications 19 . Researchers have found that when treated as reasoning machines or inference engines rather than knowledge databases, LLMs are less likely to generate false statements that do not reflect scientific facts 20 . ...
... This approach can enhance performance in knowledge-intensive tasks [7] and open-domain QA. Especially in medical QA systems where questions are knowledge-intensive, LLMs excel as generators rather than knowledge databases [8]. Retrieval quality is crucial for RAG performance [9] due to the "distraction phenomenon," where irrelevant retrieval results in the prompt degrade response quality [10]. ...
Preprint
Retrieval-augmented generation (RAG) mitigates hallucination in Large Language Models (LLMs) by using query pipelines to retrieve relevant external information and grounding responses in retrieved knowledge. However, query pipeline optimization for cancer patient question-answering (CPQA) systems requires separately optimizing multiple components with domain-specific considerations. We propose a novel three-aspect optimization approach for the RAG query pipeline in CPQA systems, utilizing public biomedical databases like PubMed and PubMed Central. Our optimization includes: (1) document retrieval, utilizing a comparative analysis of NCBI resources and introducing Hybrid Semantic Real-time Document Retrieval (HSRDR); (2) passage retrieval, identifying optimal pairings of dense retrievers and rerankers; and (3) semantic representation, introducing Semantic Enhanced Overlap Segmentation (SEOS) for improved contextual understanding. On a custom-developed dataset tailored for cancer-related inquiries, our optimized RAG approach improved the answer accuracy of Claude-3-haiku by 5.24% over chain-of-thought prompting and about 3% over a naive RAG setup. This study highlights the importance of domain-specific query optimization in realizing the full potential of RAG and provides a robust framework for building more accurate and reliable CPQA systems, advancing the development of RAG-based biomedical systems.
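The query pipeline described above decomposes into document retrieval, passage retrieval with reranking, and generation grounded in the retrieved passages. The following skeleton illustrates that flow; the function names and stubs are illustrative placeholders, not the paper's HSRDR/SEOS implementation or its PubMed and Claude calls.

```python
# Skeleton of a retrieve -> rerank -> generate RAG pipeline (illustrative stubs only).
from typing import List

def retrieve_documents(question: str, k: int = 20) -> List[str]:
    # stub: e.g. query PubMed / PubMed Central and return candidate abstracts
    return [f"document {i} about: {question}" for i in range(k)]

def rerank_passages(question: str, docs: List[str], top_n: int = 5) -> List[str]:
    # stub: e.g. score (question, passage) pairs with a dense retriever + reranker
    return docs[:top_n]

def generate_answer(question: str, passages: List[str]) -> str:
    # stub: call an LLM with the question and the retrieved passages as context
    context = "\n\n".join(passages)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return f"[LLM answer to: {question} | prompt length {len(prompt)} chars]"

if __name__ == "__main__":
    q = "What are common side effects of immune checkpoint inhibitors?"
    print(generate_answer(q, rerank_passages(q, retrieve_documents(q))))
```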
... These accounts of GenAI non-use are strongly aligned with and confirm the findings of other research studies which similarly evidence the negative implications of and warnings for GenAI use in university settings as they correspond inter alia to concerns of privacy, ethics, data bias, and poisoning (Arora et al. 2023;Mikalef et al. 2022;Verde et al. 2021); over-reliance, misuse, and responsible use (Chan and Colloton 2024); scientific misinformation (Truhn et al. 2023); and irreproducible research (Haibe-Kains et al. 2020). ...
Article
Full-text available
The disruptive potential of generative AI (GenAI) tools to academic labour is potentially vast. Yet as we argue herein, such tools also represent a continuation of the inequities inherent to academia’s prestige economy and the intensified hierarchy and labour precarisation endemic to universities as prestige institutions. In a recent survey of n = 284 UK-based academics, reasons were put forward for avoiding GenAI tools. These responses surface concerns about automative technologies corrupting academic identity and inauthenticating scholarly practice; concerns that are salient to all who participate within and benefit from the work of scholarly communities. In discussion of these survey results, we explore ambivalence about whether GenAI tools expedite the acquisition or depletion of prestige demanded of academics, especially where GenAI tools are adopted to increase scholarly productivity. We also appraise whether, far from helping academics cope with a work climate of hyper-intensification, GenAI tools ultimately exacerbate their vulnerability, status-based peripheralisation, and self-estrangement.
... Transformer-based LMs can store knowledge to some extent; 39 however, LLMs build on diverse and mostly unverified training sources and may not store reliable healthcare knowledge. 40,41 Additionally, without modification (e.g., allowing the LLM to access search engines) or retraining, LLMs are restricted to the data available at the time of their training. ...
Preprint
Full-text available
At the heart of radiological practice is the challenge of integrating complex imaging data with clinical information to produce actionable insights. Nuanced application of language is key for various activities, including managing requests, describing and interpreting imaging findings in the context of clinical data, and concisely documenting and communicating the outcomes. The emergence of large language models (LLMs) offers an opportunity to improve the management and interpretation of the vast data in radiology. Despite being primarily general-purpose, these advanced computational models demonstrate impressive capabilities in specialized language-related tasks, even without specific training. Unlocking the potential of LLMs for radiology requires basic understanding of their foundations and a strategic approach to navigate their idiosyncrasies. This review, drawing from practical radiology and machine learning expertise and recent literature, provides readers insight into the potential of LLMs in radiology. It examines best practices that have so far stood the test of time in the rapidly evolving landscape of LLMs. This includes practical advice for optimizing LLM characteristics for radiology practices along with limitations, effective prompting, and fine-tuning strategies.
... Remarkable achievements have been made in transferring the above pathology vision-language models to a wide range of downstream tasks, including patch-level histopathology image classification, segmentation, captioning, and retrieval [18]. Meanwhile, large language models (LLMs) have shown great potential in performing logic, analogy, causal reasoning, extrapolation, and evidence evaluation for medical and scientific applications [19]. Researchers have found that when treated as reasoning machines or inference engines rather than knowledge databases, LLMs are less likely to generate false statements that do not reflect scientific facts [20]. ...
Preprint
Full-text available
Due to the large size and lack of fine-grained annotation, the analysis of Whole Slide Images (WSIs) is commonly approached as a Multiple Instance Learning (MIL) problem. However, previous studies only learn from training data, posing a stark contrast to how human clinicians teach each other and reason about histopathologic entities and factors. Here we present a novel knowledge concept-based MIL framework, named ConcepPath, to fill this gap. Specifically, ConcepPath utilizes GPT-4 to induce reliable disease-specific human expert concepts from medical literature, and incorporates them with a group of purely learnable concepts to extract complementary knowledge from training data. In ConcepPath, WSIs are aligned to these linguistic knowledge concepts by utilizing a pathology vision-language model as the basic building component. In the application of lung cancer subtyping, breast cancer HER2 scoring, and gastric cancer immunotherapy-sensitive subtyping tasks, ConcepPath significantly outperformed previous SOTA methods, which lack the guidance of human expert knowledge.
... (ii) Large language models: A.I.-based models trained on text-based data to perform language-related tasks; for example, chatbots and text generation. These are now typically available to the general public [5]. (iii) Deep learning: A form of machine learning utilizing artificial neural networks (ANNs) to recognize data patterns such as image and speech recognition [6]. ...
Article
Full-text available
Data flow-based strategies that seek to improve the understanding of A.I.-based results are examined here by carefully curating and monitoring the flow of data into, for example, artificial neural networks and random forest supervised models. While these models possess structures and related fitting procedures that are highly complex, careful restriction of the data being utilized by these models can provide insight into how they interpret data structures and associated variable sets, and how they are affected by differing levels of variation in the data. The goal is to improve our understanding of A.I.-based supervised modeling results and their stability across different data sources. Some guidelines are suggested for such first-stage adjustments and related data issues.
... This addresses a critical limitation of conventional image classifiers, as textual feedback provides a more comprehensible way of understanding and interpretability for humans compared to visual tools such as Grad-CAM 25 . This aspect is crucial for reliable AI systems in medical applications 26 . ...
Article
Full-text available
Medical image classification requires labeled, task-specific datasets which are used to train deep learning networks de novo, or to fine-tune foundation models. However, this process is computationally and technically demanding. In language processing, in-context learning provides an alternative, where models learn from within prompts, bypassing the need for parameter updates. Yet, in-context learning remains underexplored in medical image analysis. Here, we systematically evaluate the model Generative Pretrained Transformer 4 with Vision capabilities (GPT-4V) on cancer image processing with in-context learning on three cancer histopathology tasks of high importance: classification of tissue subtypes in colorectal cancer, colon polyp subtyping, and breast tumor detection in lymph node sections. Our results show that in-context learning is sufficient to match or even outperform specialized neural networks trained for particular tasks, while only requiring a minimal number of samples. In summary, this study demonstrates that large vision language models trained on non-domain-specific data can be applied out of the box to solve medical image-processing tasks in histopathology. This democratizes access to generalist AI models for medical experts without a technical background, especially in areas where annotated data is scarce.
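In-context learning here means placing a handful of labeled example images directly in the prompt and asking the model to classify a new image, with no parameter updates. A hedged sketch of that pattern using the OpenAI chat API as a stand-in; the model name, file names, labels, and prompt wording are illustrative assumptions, not the study's exact setup.

```python
# Few-shot, in-context image classification sketch (no fine-tuning).
# Assumes an OpenAI-compatible chat endpoint that accepts base64-encoded images;
# model name, image paths, and prompt wording are placeholders.
import base64
from openai import OpenAI

def b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def image_part(path: str) -> dict:
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64(path)}"}}

client = OpenAI()
few_shot = [("example_tumor.png", "tumor"), ("example_normal.png", "normal")]  # placeholder files

content = [{"type": "text", "text": "Classify each histology patch as 'tumor' or 'normal'."}]
for path, label in few_shot:                      # labeled examples go into the prompt itself
    content += [image_part(path), {"type": "text", "text": f"Label: {label}"}]
content += [image_part("query_patch.png"), {"type": "text", "text": "Label:"}]

response = client.chat.completions.create(model="gpt-4o", messages=[{"role": "user", "content": content}])
print(response.choices[0].message.content)
```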
... One approach to addressing these concerns is to leverage an LLM's strength for language reasoning, while severely limiting its use as a reliable knowledge source (Truhn et al. 2023). Instead, knowledge external to the LLM can be made readily available using retrieval-augmented generation (RAG) techniques (Lewis et al. 2021). ...
Article
Full-text available
Motivation Large Language Models (LLMs) have provided spectacular results across a wide variety of domains. However, persistent concerns about hallucination and fabrication of authoritative sources raise serious issues for their integral use in scientific research. Retrieval-augmented generation (RAG) is a technique for making data and documents, otherwise unavailable during training, available to the LLM for reasoning tasks. In addition to making dynamic and quantitative data available to the LLM, RAG provides the means by which to carefully control and trace source material, thereby ensuring results are accurate, complete and authoritative. Results Here we introduce LmRaC, an LLM-based tool capable of answering complex scientific questions in the context of a user’s own experimental results. LmRaC allows users to dynamically build domain specific knowledge-bases from PubMed sources (RAGdom). Answers are drawn solely from this RAG with citations to the paragraph level, virtually eliminating any chance of hallucination or fabrication. These answers can then be used to construct an experimental context (RAGexp) that, along with user supplied documents (e.g., design, protocols) and quantitative results, can be used to answer questions about the user’s specific experiment. Questions about quantitative experimental data are integral to LmRaC and are supported by a user-defined and functionally extensible REST API server (RAGfun). Availability and implementation Detailed documentation for LmRaC along with a sample REST API server for defining user functions can be found at https://github.com/dbcraig/LmRaC. The LmRaC web application image can be pulled from Docker Hub (https://hub.docker.com) as dbcraig/lmrac.
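A minimal illustration of the paragraph-level citation idea: the answer is assembled only from retrieved paragraphs, and every statement carries a pointer back to its source paragraph. The identifiers and the retrieval step below are placeholders, not LmRaC's actual pipeline (its PubMed RAG, experimental context, and REST API for quantitative data are not reproduced).

```python
# Toy sketch of answering strictly from retrieved paragraphs with traceable citations.
from dataclasses import dataclass
from typing import List

@dataclass
class Paragraph:
    source_id: str   # e.g. "PMID:0000001, paragraph 3" (placeholder identifiers)
    text: str

def answer_with_citations(question: str, retrieved: List[Paragraph]) -> str:
    # In a real system an LLM would synthesize the answer from these paragraphs;
    # here we simply list them so every statement is traceable to a source.
    lines = [f"Question: {question}"]
    for p in retrieved:
        lines.append(f"- {p.text} [{p.source_id}]")
    return "\n".join(lines)

retrieved = [
    Paragraph("PMID:0000001, para 3", "Gene X is upregulated in condition Y (toy example)."),
    Paragraph("PMID:0000002, para 1", "Pathway Z is associated with gene X (toy example)."),
]
print(answer_with_citations("How does gene X relate to condition Y?", retrieved))
```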
... They often lack depth and technical specificity in specialized domains unless they undergo further training, as noted by [12]. They are prone to producing hallucinations, factual errors, and biases, which stem from their original training data [13], and they lack transparency regarding the sources and credibility of the information they generate [14]. Unlike search engines, which are optimized for information retrieval, LLMs require internet connectivity and struggle with evaluating relevance without a clear search intent, as discussed by [15]. ...
... In particular, the LLM Generative Pretrained Transformer (GPT) and its user interface ChatGPT, have demonstrated remarkable proficiency in structuring text and extracting relevant information in a quantitative way 18 . Their capabilities could revolutionize the way we comprehend and process vast quantities of healthcare data [19][20][21] . For example, GPT-4 has been used to extract structured clinical information from free text reports in radiology 18 , pathology and medicine 22 . ...
Article
Full-text available
Most clinical information is encoded as free text, not accessible for quantitative analysis. This study presents an open-source pipeline using the local large language model (LLM) "Llama 2" to extract quantitative information from clinical text and evaluates its performance in identifying features of decompensated liver cirrhosis. The LLM identified five key clinical features in a zero- and one-shot manner from 500 patient medical histories in the MIMIC IV dataset. We compared LLMs of three sizes and various prompt engineering approaches, with predictions compared against ground truth from three blinded medical experts. Our pipeline achieved high accuracy, detecting liver cirrhosis with 100% sensitivity and 96% specificity. High sensitivities and specificities were also yielded for detecting ascites (95%, 95%), confusion (76%, 94%), abdominal pain (84%, 97%), and shortness of breath (87%, 97%) using the 70 billion parameter model, which outperformed smaller versions. Our study successfully demonstrates the capability of locally deployed LLMs to extract clinical information from free text with low hardware requirements.
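The extraction pattern described above amounts to asking a locally hosted model, for each medical history, whether each clinical feature is present, and parsing the reply into a structured label. A hedged sketch of that pattern follows; the prompt wording is illustrative and the local inference call is a stub, not the study's Llama 2 setup.

```python
# Zero-shot extraction sketch: ask a local LLM one yes/no question per clinical feature
# and parse the answer. `run_local_llm` is a placeholder for whatever backend hosts the
# model (e.g. llama.cpp or an HTTP endpoint); prompts and features mirror the study only loosely.
import json

FEATURES = ["liver cirrhosis", "ascites", "confusion", "abdominal pain", "shortness of breath"]

def run_local_llm(prompt: str) -> str:
    # placeholder: replace with a call to your locally hosted model
    return "no"

def extract_features(history: str) -> dict:
    results = {}
    for feature in FEATURES:
        prompt = (
            "You extract clinical information from text.\n"
            f"Patient history:\n{history}\n\n"
            f"Is '{feature}' documented in this history? Answer strictly 'yes' or 'no'."
        )
        answer = run_local_llm(prompt).strip().lower()
        results[feature] = answer.startswith("yes")
    return results

print(json.dumps(extract_features("65-year-old with known cirrhosis and new ascites."), indent=2))
```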
... The medical questions used in MIRAGE are knowledgeintensive, which are difficult to answer without external knowledge. Moreover, due to the problem of hallucination, letting LLMs be reasoning engines instead of knowledge databases could be a better practice in medicine (Truhn et al., 2023). Thus, RAG is needed to collect external information for accurate and reliable answer generation. ...
... Irrelevant (but true) information and missing (but true and relevant) information naturally diminish the quality of an evaluation. Moreover, to be comprehensive, we must not only consider the data but also the analysis and the reasoning steps 17 . Aging is a process affecting all aspects of life, so we must consider a wide range of connections with already established knowledge, which may be findable in principle by in-depth searches of the literature and in-depth comparison with other molecular data, as deposited in public databases. ...
Preprint
Full-text available
The field of aging and longevity research is overwhelmed by vast amounts of data, calling for the use of Artificial Intelligence (AI), including Large Language Models (LLMs), for the evaluation of geroprotective interventions. Such evaluations should be correct, useful, comprehensive, explainable, and they should consider causality, interdisciplinarity, adherence to standards, longitudinal data and known aging biology. In particular, comprehensive analyses should go beyond comparing data based on canonical biomedical databases, suggesting the use of AI to interpret changes in biomarkers and outcomes. Our requirements motivate the use of LLMs with Knowledge Graphs and dedicated workflows employing, e.g., Retrieval-Augmented Generation. While naive trust in the responses of AI tools can cause harm, adding our requirements to LLM queries can improve response quality, calling for benchmarking efforts and justifying the informed use of LLMs for advice on longevity interventions.
... Large language models (LLMs) are artificial intelligence (AI) systems that understand and generate human-like natural language responses to text prompts. [1][2][3] These models, trained on vast datasets, have shown remarkable clinical reasoning capabilities [4][5][6] in passing medical licensing examinations 7 8 and generating prevention and treatment recommendations for various conditions including cardiovascular disease 9 10 and breast cancer. 11 They can produce clinical notes, 3 generate radiology reports, 12 13 and even assist in writing research articles. ...
Article
Full-text available
Background A study was undertaken to assess the effectiveness of open-source large language models (LLMs) in extracting clinical data from unstructured mechanical thrombectomy reports in patients with ischemic stroke caused by a vessel occlusion. Methods We deployed local open-source LLMs to extract data points from free-text procedural reports in patients who underwent mechanical thrombectomy between September 2020 and June 2023 in our institution. The external dataset was obtained from a second university hospital and comprised consecutive cases treated between September 2023 and March 2024. Ground truth labeling was facilitated by a human-in-the-loop (HITL) approach, with time metrics recorded for both automated and manual data extractions. We tested three models—Mixtral, Qwen, and BioMistral—assessing their performance on precision, recall, and F1 score across 15 clinical categories such as National Institute of Health Stroke Scale (NIHSS) scores, occluded vessels, and medication details. Results The study included 1000 consecutive reports from our primary institution and 50 reports from a secondary institution. Mixtral showed the highest precision, achieving 0.99 for first series time extraction and 0.69 for occluded vessel identification within the internal dataset. In the external dataset, precision ranged from 1.00 for NIHSS scores to 0.70 for occluded vessels. Qwen showed moderate precision with a high of 0.85 for NIHSS scores and a low of 0.28 for occluded vessels. BioMistral had the broadest range of precision, from 0.81 for first series times to 0.14 for medication details. The HITL approach yielded an average time savings of 65.6% per case, with variations from 45.95% to 79.56%. Conclusion This study highlights the potential of using LLMs for automated clinical data extraction from medical reports. Incorporating HITL annotations enhances precision and also ensures the reliability of the extracted data. This methodology presents a scalable privacy-preserving option that can significantly support clinical documentation and research endeavors.
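Per-category precision, recall, and F1 against human-labeled ground truth, as reported in the study, can be computed directly once extractions are coded as labels. A small sketch with placeholder labels (the category names and values below are illustrative, not study data):

```python
# Per-category precision/recall/F1 sketch for LLM-extracted data points versus human ground truth.
from sklearn.metrics import precision_score, recall_score, f1_score

ground_truth = {"NIHSS_documented": [1, 1, 0, 1, 0, 1], "occluded_vessel_found": [1, 0, 1, 1, 0, 0]}
llm_predicted = {"NIHSS_documented": [1, 1, 0, 1, 1, 1], "occluded_vessel_found": [1, 0, 0, 1, 0, 1]}

for category in ground_truth:
    y_true, y_pred = ground_truth[category], llm_predicted[category]
    print(
        f"{category}: "
        f"precision={precision_score(y_true, y_pred):.2f} "
        f"recall={recall_score(y_true, y_pred):.2f} "
        f"f1={f1_score(y_true, y_pred):.2f}"
    )
```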
... To address the challenge of hallucinations in LLMs, a high priority has been placed on the detection and prevention of such phenomena. In the entire automated review generation process, we adopted a multi-level filtering and verification quality control strategy, similar to the concept of retrieval-augmented generation (RAG) [44,45] , to mitigate and correct hallucinations: ...
Preprint
Full-text available
Literature research, vital for scientific advancement, is overwhelmed by the vast ocean of available information. Addressing this, we propose an automated review generation method based on Large Language Models (LLMs) to streamline literature processing and reduce cognitive load. In a case study on propane dehydrogenation (PDH) catalysts, our method swiftly generated comprehensive reviews from 343 articles, averaging seconds per article per LLM account. Extended analysis of 1041 articles provided deep insights into the catalysts' composition, structure, and performance. Recognizing LLMs' hallucinations, we employed a multi-layered quality control strategy, ensuring our method's reliability and effective hallucination mitigation. Expert verification confirms the accuracy and citation integrity of generated reviews, demonstrating that LLM hallucination risks are reduced to below 0.5% with over 95% confidence. A released Windows application enables one-click review generation, aiding researchers in tracking advancements and recommending literature. This approach showcases LLMs' role in enhancing scientific research productivity and sets the stage for further exploration.
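At its core, multi-layered quality control means checking every generated statement against the source text it claims to summarize before accepting it. The toy verification layer below illustrates that filtering idea with a naive keyword-overlap check; it is only an illustration, not the paper's actual quality-control strategy (which may also include LLM-based verification).

```python
# Toy verification layer: keep a generated statement only if enough of its content
# words actually appear in the cited source passage. Threshold and texts are illustrative.
import re

def content_words(text: str) -> set:
    stop = {"the", "a", "an", "of", "and", "in", "is", "are", "to", "for", "with", "at"}
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop}

def supported(statement: str, source_passage: str, threshold: float = 0.6) -> bool:
    stmt, src = content_words(statement), content_words(source_passage)
    return bool(stmt) and len(stmt & src) / len(stmt) >= threshold

statement = "The catalyst showed high propane conversion at 600 C."
source = "At 600 C the PtSn catalyst reached high propane conversion and stable selectivity."
print("keep statement:", supported(statement, source))
```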
... This is due to the fact that LLMs have to rely on their internal knowledge, which is incomplete and may be biased. Rather, it was proposed that LLMs should be used as reasoning engines 16 with access to external sources. This approach is called retrieval augmented generation (RAG) 17 and may remedy two problems: firstly, the risk of hallucinating information is reduced, since source material can be used and cited 18 . ...
Preprint
Full-text available
Large language models (LLMs) often generate outdated or inaccurate information based on static training datasets. Retrieval augmented generation (RAG) mitigates this by integrating outside data sources. While previous RAG systems used pre-assembled, fixed databases with limited flexibility, we have developed Radiology RAG (RadioRAG), an end-to-end framework that retrieves data from authoritative radiologic online sources in real-time. We evaluate the diagnostic accuracy of various LLMs when answering radiology-specific questions with and without access to additional online information via RAG. Using 80 questions from the RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions with reference standard answers, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B and 70B]) were prompted with and without RadioRAG in a zero-shot inference scenario. RadioRAG retrieved context-specific information from www.radiopaedia.org in real-time. Accuracy was investigated. Statistical analyses were performed using bootstrapping. The results were further compared with human performance. RadioRAG improved diagnostic accuracy across most LLMs, with relative accuracy increases ranging up to 54% for different LLMs. It matched or exceeded non-RAG models and the human radiologist in question answering across radiologic subspecialties, particularly in breast imaging and emergency radiology. However, the degree of improvement varied among models; GPT-3.5-turbo and Mixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2 showed no improvement, highlighting variability in RadioRAG's effectiveness. LLMs benefit when provided access to domain-specific data beyond their training data. For radiology, RadioRAG establishes a robust framework that substantially improves diagnostic accuracy and factuality in radiological question answering.
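The core comparison is the same model answering each question once with only its parametric knowledge and once with retrieved, up-to-date context prepended to the prompt. A schematic sketch of that comparison follows; the real-time radiopaedia.org retrieval and the study's exact prompts are not shown, and both function arguments are placeholders.

```python
# Sketch of the with-RAG vs. without-RAG comparison: same question, same model,
# the only difference is whether retrieved context is prepended to the prompt.
# `ask_llm` and `retrieve_context` are placeholders for the real LLM and retrieval calls.
from typing import Callable

def answer(question: str, ask_llm: Callable[[str], str],
           retrieve_context: Callable[[str], str] | None = None) -> str:
    if retrieve_context is None:
        prompt = f"Answer this radiology question:\n{question}"
    else:
        prompt = (
            "Use the context below to answer the radiology question.\n\n"
            f"Context:\n{retrieve_context(question)}\n\nQuestion: {question}"
        )
    return ask_llm(prompt)

def accuracy(answers: list[str], references: list[str]) -> float:
    # stub: in the study, correctness was judged against reference standard answers
    return sum(a.strip().lower() == r.strip().lower() for a, r in zip(answers, references)) / len(references)
```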
... If what matters is not the source of reasons, but their relevance and the ability of patients and doctors to rationally endorse them, then we propose that "the loop" can be extended to include certain types of medical AI, particularly LLMs, as providers of reasons. These AI systems, when properly designed and implemented, can generate relevant reasons and considerations that could significantly enrich the coreasoning process (see Demaree-Cotton et al., 2022;Haupt & Marks, 2023;Truhn et al. 2023). Support for this idea comes from a recent study (Chu et al., under review) where we asked LLM models to consider moral dilemmas, make judgments, and provide reasons for those judgments. ...
Article
Full-text available
Salloch and Eriksen (2024) present a compelling case for including patients as co-reasoners in medical decision-making involving artificial intelligence (AI). Drawing on O'Neill's neo- Kantian framework (1989), they argue that the "human in the loop" concept should extend beyond physicians to encompass patients as active participants in the reasoning process. While we commend this perspective, we suggest it can be extended further: certain types of medical AI, particularly large language models (LLMs), should also be considered as potential co-reasoners.
... Future AI-powered assistive technologies should be designed with a focus on action-oriented reasoning and task-specific guidance. Although this could be achieved through further user inquiries [51], integrating knowledge bases [52] and event/behavior reasoning engines [53] to enable contextual inference of actions and intentions, and associating visual elements' feedback with reasoning, would greatly reduce the cognitive burden on PVI and enhance the user experience. By leveraging this knowledge, assistive technologies can provide more relevant and actionable guidance to PVI, helping them effectively navigate and interact with their environment to complete desired tasks. ...
Preprint
Full-text available
People with visual impairments perceive their environment non-visually and often use AI-powered assistive tools to obtain textual descriptions of visual information. Recent large vision-language model-based AI-powered tools like Be My AI are more capable of understanding users' inquiries in natural language and describing the scene in audible text; however, the extent to which these tools are useful to visually impaired users is currently understudied. This paper aims to fill this gap. Our study with 14 visually impaired users reveals that they are adapting these tools organically -- not only can these tools facilitate complex interactions in household, spatial, and social contexts, but they also act as an extension of users' cognition, as if the cognition were distributed in the visual information. We also found that although the tools are currently not goal-oriented, users accommodate this limitation and embrace the tools' capabilities for broader use. These findings enable us to envision design implications for creating more goal-oriented, real-time processing, and reliable AI-powered assistive technology.
... Truhn et al. argued that instead of single-shot Q&A, the true potential of LLMs lies in interactive reasoning [22], which our findings clearly corroborate. (Fig. 1C) However, discrepancies between correctness and sufficiency rates in our results indicate that an LLM's performance is tied to the underlying sources of information. ...
Preprint
Full-text available
Background: The increasing complexity of medical knowledge necessitates efficient and reliable information access systems in clinical settings. For quality purposes, most hospitals use standard operating procedures (SOPs) for information management and implementation of local treatment standards. However, in clinical routine, this information is not always easily accessible. Customized Large Language Models (LLMs) may offer a tailored solution, but need thorough evaluation prior to clinical implementation. Objective: To customize an LLM to retrieve information from hospital-specific SOPs, to evaluate its accuracy for clinical use, and to compare different prompting strategies and large language models. Methods: We customized GPT-4 with a predefined system prompt and 10 SOPs from four departments at the University Hospital Dresden. Model performance was evaluated through 30 predefined clinical questions of varying degrees of detail, which were assessed by five observers with different levels of medical expertise through simple and interactive question-and-answering (Q&A). We assessed answer completeness, correctness and sufficiency for clinical use and the impact of prompt design on model performance. Finally, we compared the performance of GPT-4 with Claude-3-opus. Results: Interactive Q&A yielded the highest rate of completeness (80%), correctness (83%) and sufficiency (60%). Acceptance of the LLM answer was higher among early-career medical staff. Degree of detail of the question prompt influenced answer accuracy, with intermediate-detail prompts achieving the highest sufficiency rates. Comparing LLMs, Claude-3-opus outperformed GPT-4 in providing sufficient answers (70.0% vs. 36.7%) and required fewer iterations for satisfactory responses. Both models adhered to the system prompt more effectively in the self-coded pipeline than in the browser application. All observers showed discrepancies between correctness and accuracy of the answers, which were rooted in the representation of information in the SOPs. Conclusion: Interactively querying customized LLMs can enhance clinical information retrieval, though expert oversight remains essential to ensure a safe application of this technology. After broader evaluation and with basic knowledge in prompt engineering, customized LLMs can be an efficient, clinically applicable tool.
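Customization of this kind boils down to a fixed system prompt plus the hospital SOPs supplied as context, queried interactively over multiple turns. A minimal sketch with the OpenAI chat API as a stand-in; the system prompt wording, the way the SOP text is passed, and the placeholder SOP are assumptions, not the study's configuration.

```python
# Sketch of SOP-grounded interactive Q&A: a fixed system prompt instructs the model
# to answer only from the supplied SOP text. Prompt wording and SOP content are placeholders.
from openai import OpenAI

client = OpenAI()
sop_text = "Placeholder SOP text, e.g. the local sepsis management procedure."

SYSTEM_PROMPT = (
    "You answer clinical questions strictly based on the hospital SOP provided. "
    "If the SOP does not cover the question, say so explicitly."
)

def ask_sop(question: str, history: list[dict] | None = None) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"SOP:\n{sop_text}"}]
    messages += history or []                      # earlier turns enable interactive follow-ups
    messages.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    return reply.choices[0].message.content
```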
... Content created by generative AI can have mistakes that do not align with science. 27 Another concern is the use of LLMs as medical databases; LLMs may provide persuasive yet fake citations when asked about data sources. 5 Unfortunately, clinical validation for generative AI is extremely challenging right now, primarily due to the massive content the model can generate. ...
Article
Full-text available
As the healthcare community increasingly harnesses the power of generative artificial intelligence (AI), critical issues of security, privacy and regulation take centre stage. In this paper, we explore the security and privacy risks of generative AI from model-level and data-level perspectives. Moreover, we elucidate the potential consequences and case studies within the domain of ophthalmology. Model-level risks include knowledge leakage from the model and model safety under AI-specific attacks, while data-level risks involve unauthorised data collection and data accuracy concerns. Within the healthcare context, these risks can bear severe consequences, encompassing potential breaches of sensitive information, violating privacy rights and threats to patient safety. This paper not only highlights these challenges but also elucidates governance-driven solutions that adhere to AI and healthcare regulations. We advocate for preparedness against potential threats, call for transparency enhancements and underscore the necessity of clinical validation before real-world implementation. The objective of security and privacy improvement in generative AI warrants emphasising the role of ophthalmologists and other healthcare providers, and the timely introduction of comprehensive regulations.
... Given the temporal nature of the knowledge held by LLMs and their susceptibility to generating hallucinations, it is more appropriate to employ LLMs as scientific reasoning engines, rather than knowledge bases. 27 Exploring whether and how LLMs understand scientific claims specific to a particular clinical question and whether they can contribute to summarizing actionable clinical insights constitutes a compelling area of investigation. Hence, the primary aim of our study here is to evaluate the capability of a LLM (eg, ChatGPT) in differentiating conflicting evidence for a specific question and in summarizing key unsolved problems by analyzing disparate research results. ...
Article
Abstract Objective Synthesizing and evaluating inconsistent medical evidence is essential in evidence-based medicine. This study aimed to employ ChatGPT as a sophisticated scientific reasoning engine to identify conflicting clinical evidence and summarize unresolved questions to inform further research. Materials and Methods We evaluated ChatGPT’s effectiveness in identifying conflicting evidence and investigated its principles of logical reasoning. An automated framework was developed to generate a PubMed dataset focused on controversial clinical topics. ChatGPT analyzed this dataset to identify consensus and controversy, and to formulate unsolved research questions. Expert evaluations were conducted 1) on the consensus and controversy for factual consistency, comprehensiveness, and potential harm, and 2) on the research questions for relevance, innovation, clarity, and specificity. Results The gpt-4-1106-preview model achieved a 90% recall rate in detecting inconsistent claim pairs within a ternary assertions setup. Notably, without explicit reasoning prompts, ChatGPT provided sound reasoning for the assertions between claims and hypotheses, based on an analysis grounded in relevance, specificity, and certainty. ChatGPT’s conclusions of consensus and controversies in clinical literature were comprehensive and factually consistent. The research questions proposed by ChatGPT received high expert ratings. Discussion Our experiment implies that, in evaluating the relationship between evidence and claims, ChatGPT considered more detailed information beyond a straightforward assessment of sentimental orientation. This ability to process intricate information and conduct scientific reasoning regarding sentiment is noteworthy, particularly as this pattern emerged without explicit guidance or directives in prompts, highlighting ChatGPT’s inherent logical reasoning capabilities. Conclusion This study demonstrated ChatGPT’s capacity to evaluate and interpret scientific claims. Such proficiency can be generalized to broader clinical research literature. ChatGPT effectively aids in facilitating clinical studies by proposing unresolved challenges based on analysis of existing studies. However, caution is advised as ChatGPT’s outputs are inferences drawn from the input literature and could be harmful to clinical practice.
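Detecting conflicting evidence in a ternary setup reduces to asking the model, for each pair of claims, whether one supports, contradicts, or is neutral toward the other, and flagging the contradictions. A hedged prompt sketch follows; the wording, parsing, and example claims are illustrative, not the study's dataset construction or exact prompts.

```python
# Ternary claim-pair classification sketch: support / contradict / neutral.
# `ask_llm` is a placeholder for any chat-model call; prompt and claims are illustrative.
def classify_pair(claim_a: str, claim_b: str, ask_llm) -> str:
    prompt = (
        "Claim A: " + claim_a + "\n"
        "Claim B: " + claim_b + "\n"
        "Does Claim B support, contradict, or stay neutral toward Claim A? "
        "Answer with exactly one word: support, contradict, or neutral."
    )
    answer = ask_llm(prompt).strip().lower()
    return answer if answer in {"support", "contradict", "neutral"} else "neutral"

pairs = [
    ("Drug X reduces mortality in condition Y.", "Drug X showed no mortality benefit in condition Y."),
    ("Drug X reduces mortality in condition Y.", "Drug X lowered 30-day mortality in a trial of condition Y."),
]
# Pairs classified as "contradict" would be reported as conflicting evidence.
```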
... Fine-tuned with human feedback via reinforcement learning from human feedback (RLHF) or other procedures, LLMs can generate responses to prompts considered plausible to humans, powering conversational agents 17 . They also demonstrate a remarkable ability to structure and categorize information 13 , including in medicine [18][19][20] . However, LLMs exhibit bias, hallucinations, and inaccuracies, which, when twinned with plausible responses presented with ostensible certainty, can mislead users, casting doubts about their suitability for many tasks in clinical medicine including the interoperability and linking of medical knowledge 21,22 . ...
Article
Full-text available
Reliably processing and interlinking medical information has been recognized as a critical foundation of the digital transformation of medical workflows, and despite the development of medical ontologies, their optimization has been a major bottleneck for digital medicine. The advent of large language models has brought great excitement, and maybe a solution to medicine’s ‘communication problem’ is in sight, but how can the known weaknesses of these models, such as hallucination and non-determinism, be tempered? Retrieval Augmented Generation, particularly through knowledge graphs, is an automated approach that can deliver structured reasoning and a model of truth alongside LLMs, relevant to information structuring and therefore also to decision support.
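Knowledge-graph RAG means retrieving structured facts relevant to the query and handing them to the LLM as the model of truth it must answer from. A toy in-memory sketch of that retrieval step follows; a real system would query an actual medical ontology or graph database, and the triples and prompt below are purely illustrative.

```python
# Toy knowledge-graph retrieval: look up triples mentioning entities from the query
# and pass them to the LLM as the only allowed source of facts. Triples are illustrative.
TRIPLES = [
    ("metformin", "treats", "type 2 diabetes"),
    ("metformin", "contraindicated_in", "severe renal impairment"),
    ("type 2 diabetes", "is_a", "metabolic disorder"),
]

def retrieve_facts(query: str) -> list[str]:
    q = query.lower()
    return [f"{s} {p.replace('_', ' ')} {o}" for s, p, o in TRIPLES if s in q or o in q]

def build_prompt(query: str) -> str:
    facts = "\n".join(retrieve_facts(query)) or "(no matching facts)"
    return f"Answer using only these facts:\n{facts}\n\nQuestion: {query}"

print(build_prompt("Can metformin be used in severe renal impairment?"))
```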
... These are AI models that can understand and synthesize text at human-level performance, and therefore have potential to unlock text as a quantitative data resource. LLMs encode clinical knowledge 67 , can process diverse medical textual data 68 , and can be used for logical reasoning 69 and beyond that for many tasks in clinical research, practice, and education 70 . LLMs are a new tool and will have substantial effects on almost any area of medicine, including liver oncology. ...
Article
Liver cancer has high incidence and mortality globally. Artificial intelligence (AI) has advanced rapidly, influencing cancer care. AI systems are already approved for clinical use in some tumour types (for example, colorectal cancer screening). Crucially, research demonstrates that AI can analyse histopathology, radiology and natural language in liver cancer, and can replace manual tasks and access hidden information in routinely available clinical data. However, for liver cancer, few of these applications have translated into large-scale clinical trials or clinically approved products. Here, we advocate for the incorporation of AI in all stages of liver cancer management. We present a taxonomy of AI approaches in liver cancer, highlighting areas with academic and commercial potential, and outline a policy for AI-based liver cancer management, including interdisciplinary training of researchers, clinicians and patients. The potential of AI in liver cancer is immense, but effort is required to ensure that AI can fulfil expectations.
... Large Language Models (LLMs) have demonstrated substantial potential across diverse applications in the medical field. We agree with the viewpoint that it is more appropriate to employ LLMs as scientific reasoning engines rather than knowledge bases, since they often fall short when it comes to the factual accuracy of the generated scientific knowledge (18). We assume LLMs would perform well in such a complex task as argument quality analysis, given the multitude of logical reasoning processes involved, including discerning premises and conclusions, assessing the varying strength of evidence across different sentences, and ensuring strong correlation in extracted triplets within sentences, thus minimizing information loss. ...
Preprint
Full-text available
The Semantic MEDLINE Database (SemMedDB) has limited performance in identifying entities and relations, while also neglecting variations in argument quality, especially persuasive strength across different sentences. The present study aims to utilize large language models (LLMs) to evaluate the contextual argument quality of triples in SemMedDB to improve the understanding of disease mechanisms. Using argument mining methods, we first design a quality evaluation framework across four major dimensions, triples' accuracy, triple-sentence correlation, research object, and evidence cogency, to evaluate the argument quality of triple-based claims according to their contextual sentences. Then we choose a sample of 66 triple-sentence pairs for repeated annotations and framework optimization. As a result, the predicted performances of GPT-3.5 and GPT-4 are excellent, with an accuracy of up to 0.90 in the complex cogency evaluation task. A tentative case evaluating whether there exists an association between gestational diabetes and periodontitis reveals accurate predictions (GPT-4 accuracy, 0.88). LLM-enabled argument quality evaluation is promising for evidence integration in understanding disease mechanisms, especially how evidence in two stances with varying levels of cogency evolves over time.
... The developers of ChatGPT even go so far as to view this as an early sign of generalized artificial intelligence, speaking of "sparks of artificial general intelligence" [3]. This capability opens up the use of LLMs as expert systems that can support physicians in everyday clinical practice and in research [22]. ...
Article
Large language models such as Chat-GPT (generative pretrained transformer) have improved dramatically in their capabilities within the past 2 years. These models can now “reason” and understand language to an extent such that their use in clinical medicine comes within reach. This article aims to provide an overview over the underlying working principles of large language models and their potential application in medicine, in particular in oncology. Large language models have the potential to support oncologists in clinical practice in a variety of settings and can lead to better quality of care and more efficient processes, leaving more time for effective patient care.
... Of course, there is a possibility that a Large Language Model is hallucinating, producing false statements and presenting them as facts. For this reason, at the present state of development, LLMs should be used to combine content and create new scientific hypotheses but not as knowledge databases supposed to contain fact-checked information (Truhn et al. 2023). ...
Article
Full-text available
In this article, I explore the evolving affordances of artificial intelligence technologies. Through an evocative dialogue with ChatGPT, a form of a postdigital duoethnography between a human and an artificial intelligence algorithm, I discuss issues of knowledge production, research methods, epistemology, creativity, entropy, and self-organization. By reflecting on my own lived experience during this dialogue, I explore how human-artificial intelligence synergies can facilitate new insights and amplify human creative potential. As human-artificial intelligence entanglements activate multiple possibilities, I emphasize how understanding the impact of technology on individuals and communities becomes a critical challenge. In an era where the postdigital becomes the dominant narrative of science and education, the human mind will never be the same again. However, it is not given how human beings and artificial intelligence technologies are going to coevolve as parts of a complex postdigital confluence. Although I make no specific prediction of the future, I make the call for a relationship between humans and technology, informed by complex living systems epistemology, that will promote a more empowering postdigital narrative for individuals and communities. To this direction, this article introduces a methodological framework for the practice of postdigital duoethnography.
... 18 Their capabilities could revolutionize the way we comprehend and process vast quantities of healthcare data. [19][20][21] For example, GPT-4 has been used to extract structured clinical information from free text reports in radiology, 18 pathology and medicine. 22 All rights reserved. ...
Preprint
Background and Aims Most clinical information is encoded as text, but extracting quantitative information from text is challenging. Large Language Models (LLMs) have emerged as powerful tools for natural language processing and can parse clinical text. However, many LLMs including ChatGPT reside in remote data centers, which disqualifies them from processing personal healthcare data. We present an open-source pipeline using the local LLM 'Llama 2' for extracting quantitative information from clinical text and evaluate its use to detect clinical features of decompensated liver cirrhosis. Methods We tasked the LLM to identify five key clinical features of decompensated liver cirrhosis in a zero- and one-shot way without any model training. Our specific objective was to identify abdominal pain, shortness of breath, confusion, liver cirrhosis, and ascites from 500 patient medical histories from the MIMIC IV dataset. We compared LLMs of three different sizes and a variety of pre-specified prompt engineering approaches. Model predictions were compared against the ground truth provided by the consensus of three blinded medical experts. Results Our open-source pipeline yielded highly accurate extraction of quantitative features from medical free text. Clinical features which were explicitly mentioned in the source text, such as liver cirrhosis and ascites, were detected with a sensitivity of 100% and 95% and a specificity of 96% and 95%, respectively, by the 70 billion parameter model. Other clinical features, which are often paraphrased in a variety of ways, such as the presence of confusion, were detected only with a sensitivity of 76% and a specificity of 94%. Abdominal pain was detected with a sensitivity of 84% and a specificity of 97%. Shortness of breath was detected with a sensitivity of 87% and a specificity of 96%. The larger version of Llama 2 with 70b parameters outperformed the smaller version with 7b parameters in all tasks. Prompt engineering improved zero-shot performance, particularly for smaller model sizes. Conclusion Our study successfully demonstrates the capability of using locally deployed LLMs to extract clinical information from free text. The hardware requirements are so low that not only on-premise but also point-of-care deployment of LLMs is possible.
... This discrepancy indicates that while GPT-4V is skilled in identification tasks, further refinement is necessary for its decisionmaking abilities and planning, a known limitation of current LLMs. 3 The adaptability and universality of these models across various domains is highlighted by the fact that GPT models did not go through initial training specifically for the medical domain. Although the findings are promising, caution should be exercised as diagnostic accuracy is just one aspect of clinical practice. ...
Preprint
Full-text available
Importance: Artificial intelligence will become an integral part of clinical medicine. Large language models are promising candidates, in particular because of their multimodal abilities. These models need to be evaluated in real clinical cases. Objective: To test whether GPT-4V can consistently comprehend complex diagnostic scenarios. Design: A selection of 140 clinical cases from the JAMA Clinical Challenge and 348 from the NEJM Image Challenge was used. Each case, comprising a clinical image and corresponding question, was processed by GPT-4V, and responses were documented. The significance of imaging information was assessed by comparing GPT-4V's performance with that of four other leading-edge large language models (LLMs). Main Outcomes and Measures: The accuracy of responses was gauged by juxtaposing the model's answers with the established ground truths of the challenges. The confidence interval for the model's performance was calculated using bootstrapping methods. Additionally, human performance on the NEJM Image Challenge was measured by the accuracy of challenge participants. Results: GPT-4V demonstrated superior accuracy in analyses of both text and images, achieving an accuracy of 73.3% for JAMA and 88.7% for NEJM, notably outperforming text-only LLMs such as GPT-4, GPT-3.5, Llama2, and Med-42. Remarkably, both GPT-4V and GPT-4 exceeded average human participants' performance at all complexity levels within the NEJM Image Challenge. Conclusions and Relevance: GPT-4V has exhibited considerable promise in clinical diagnostic tasks, surpassing the capabilities of its predecessors as well as those of human raters who participated in the challenge. Despite these encouraging results, such models should be adopted with prudence in clinical settings, augmenting rather than replacing human judgment.
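The confidence intervals reported for model accuracy can be obtained by resampling the per-case correctness indicators with replacement. A small sketch of percentile bootstrapping follows; the 0/1 outcomes are placeholders, not the study's results.

```python
# Percentile-bootstrap confidence interval for accuracy over per-case correctness indicators.
import numpy as np

rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=348)          # placeholder: 1 = model answered the case correctly

boot_acc = [rng.choice(correct, size=correct.size, replace=True).mean() for _ in range(10_000)]
lower, upper = np.percentile(boot_acc, [2.5, 97.5])
print(f"accuracy {correct.mean():.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```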
Article
Artificial intelligence (AI) is rapidly advancing in hepatocellular carcinoma (HCC) management, offering promising applications across diagnosis, prognosis, and treatment. In histopathology, deep learning models have shown impressive accuracy in differentiating liver lesions and extracting prognostic information from tissue samples. For biomarker discovery, AI techniques applied to multi-omics data have identified novel prognostic signatures and predictors of immunotherapy response. In radiology, convolutional neural networks have demonstrated high performance in classifying hepatic lesions, grading tumors, and predicting microvascular invasion from computed tomography (CT) and magnetic resonance imaging (MRI) images. Multimodal AI models integrating histopathology, genomics, and clinical data are emerging as powerful tools for risk stratification. Large language models (LLMs) show potential to support clinical decision making and patient education, though concerns about accuracy remain. While AI holds immense promise, several challenges must be addressed, including algorithmic bias, data privacy, and regulatory compliance. The successful implementation of AI in HCC care will require ongoing collaboration between clinicians, data scientists, and ethicists. As AI technologies continue to evolve, they are expected to enable more personalized approaches to HCC management, potentially improving diagnosis, treatment selection, and patient outcomes. However, it is crucial to recognize that AI tools are designed to assist, not replace, clinical expertise. Continuous validation in diverse, real-world settings will be essential to ensure the reliability and generalizability of AI models in HCC care.
Article
Purpose This study aims to evaluate the performance of LLMs with various prompt engineering strategies in the context of health fact-checking. Design/methodology/approach Inspired by Dual Process Theory, we introduce two kinds of prompts: Conclusion-first (System 1) and Explanation-first (System 2), and their respective retrieval-augmented variations. We evaluate the performance of these prompts across accuracy, argument elements, common errors and cost-effectiveness. Our study, conducted on two public health fact-checking datasets, categorized 10,212 claims as knowledge, anecdotes and news. To further analyze the reasoning process of LLM, we delve into the argument elements of health fact-checking generated by different prompts, revealing their tendencies in using evidence and contextual qualifiers. We conducted content analysis to identify and compare the common errors across various prompts. Findings Results indicate that the Conclusion-first prompt performs well in knowledge (89.70%,66.09%), anecdote (79.49%,79.99%) and news (85.61%,85.95%) claims even without retrieval augmentation, proving to be cost-effective. In contrast, the Explanation-first prompt often classifies claims as unknown. However, it significantly boosts accuracy for news claims (87.53%,88.60%) and anecdote claims (87.28%,90.62%) with retrieval augmentation. The Explanation-first prompt is more focused on context specificity and user intent understanding during health fact-checking, showing high potential with retrieval augmentation. Additionally, retrieval-augmented LLMs concentrate more on evidence and context, highlighting the importance of the relevance and safety of retrieved content. Originality/value This study offers insights into how a balanced integration could enhance the overall performance of LLMs in critical applications, paving the way for future research on optimizing LLMs for complex cognitive tasks. Peer review The peer review history for this article is available at: https://publons.com/publon/10.1108/OIR-02-2024-0111
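The two prompting styles compared above differ only in the order of verdict and justification. The templates below sketch what such prompts might look like; the wording is illustrative, not the paper's exact prompts.

```python
# Illustrative Conclusion-first vs. Explanation-first prompt templates for health fact-checking.
def conclusion_first_prompt(claim: str, evidence: str = "") -> str:
    return (
        f"Claim: {claim}\n{('Retrieved evidence: ' + evidence) if evidence else ''}\n"
        "First state your verdict (true / false / unknown), then briefly explain why."
    )

def explanation_first_prompt(claim: str, evidence: str = "") -> str:
    return (
        f"Claim: {claim}\n{('Retrieved evidence: ' + evidence) if evidence else ''}\n"
        "First reason step by step about the claim, then state your verdict (true / false / unknown)."
    )

claim = "Drinking lemon water cures the flu."
print(conclusion_first_prompt(claim))
print(explanation_first_prompt(claim, evidence="No clinical evidence supports lemon water as a flu treatment."))
```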
Article
Over the past two decades, oncological treatment approaches have evolved from a generalized treatment paradigm to an individualized treatment concept. Despite substantial progress, major challenges persist, particularly in managing rare tumor entities (24% of all cancers) and heavily pretreated patients with complex resistance mechanisms. More and more patients benefit from extended molecular pathological diagnostics. These data are interpreted by molecular tumor boards (MTB) and individually tailored treatment plans are developed. The process of translating genomic data into clinical treatment plans is complicated. The implementation necessitates substantial effort: approximately 80% are implemented as off-label use. Fragmented data and manual curation of data collectively hinder the broader implementation of MTBs in clinical practice. Artificial intelligence (AI)-assisted decision support systems can analyze large datasets and identify clinically relevant patterns. While AI is able to access annotated medical images in dermatology, pathology and radiology, unstructured clinical text data from personalized medicine are difficult to process. Moreover, the lack of standardized precision oncological recommendations has further constrained the integration of AI technologies based on machine learning into the MTB workflow. The domain-specific AI system MEREDITH, which employs a retrieval-augmented generation architecture, seeks to address these limitations. A proof of concept study showed that MEREDITH exhibits strong concordance with expert clinical recommendations, but further evaluation studies are needed to validate the clinical utility of MEREDITH in real-world MTB practice. The Model Project Genome Sequencing is anticipated to further increase the complexity of oncological care, emphasizing the need for the integration of innovative technologies.
Article
PURPOSE: Rapidly expanding medical literature challenges oncologists seeking targeted cancer therapies. General-purpose large language models (LLMs) lack domain-specific knowledge, limiting their clinical utility. This study introduces the LLM system Medical Evidence Retrieval and Data Integration for Tailored Healthcare (MEREDITH), designed to support treatment recommendations in precision oncology. Built on Google's Gemini Pro LLM, MEREDITH uses retrieval-augmented generation and chain-of-thought reasoning.
METHODS: We evaluated MEREDITH on 10 publicly available fictional oncology cases with iterative feedback from a molecular tumor board (MTB) at a major German cancer center. Initially limited to PubMed-indexed literature (draft system), MEREDITH was enhanced to incorporate clinical studies on drug response within the specific tumor type, trial databases, drug approval status, and oncologic guidelines. The MTB provided a benchmark with manually curated treatment recommendations and assessed the clinical relevance of LLM-generated options (qualitative assessment). We measured semantic cosine similarity between LLM suggestions and clinician responses (quantitative assessment).
RESULTS: MEREDITH identified a broader range of treatment options (median 4) compared with MTB experts (median 2). These options included therapies based on preclinical data and combination treatments, expanding the treatment possibilities for consideration by the MTB. This broader approach was achieved by incorporating a curated medical dataset that contextualized molecular targetability. Mirroring the approach MTB experts use to evaluate MTB cases improved the LLM's ability to generate relevant suggestions. This is supported by high concordance between LLM suggestions and expert recommendations (94.7% for the enhanced system) and a significant increase in semantic similarity from the draft to the enhanced system (from 0.71 to 0.76, P = .01).
CONCLUSION: Expert feedback and domain-specific data augment LLM performance. Future research should investigate responsible LLM integration into real-world clinical workflows.
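The quantitative assessment above rests on semantic cosine similarity between model suggestions and clinician responses. Below is a minimal sketch of that measurement; the embedding model name and the example texts are assumptions, not those used in the MEREDITH evaluation.

```python
# Sketch of semantic cosine similarity between an LLM treatment suggestion and a
# manually curated MTB recommendation. Model choice and texts are illustrative.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model works

llm_suggestion = "Consider pembrolizumab based on high tumor mutational burden."
mtb_recommendation = "Immune checkpoint inhibition (pembrolizumab) given TMB-high status."

embeddings = model.encode([llm_suggestion, mtb_recommendation])
similarity = cos_sim(embeddings[0], embeddings[1]).item()
print(f"Semantic cosine similarity: {similarity:.2f}")
```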
Article
Full-text available
Large language models (LLMs) have broad medical knowledge and can reason about medical information across many domains, holding promising potential for diverse medical applications in the near future. In this study, we demonstrate a concerning vulnerability of LLMs in medicine. Through targeted manipulation of just 1.1% of the weights of the LLM, we can deliberately inject incorrect biomedical facts. The erroneous information is then propagated in the model’s output while maintaining performance on other biomedical tasks. We validate our findings in a set of 1025 incorrect biomedical facts. This peculiar susceptibility raises serious security and trustworthiness concerns for the application of LLMs in healthcare settings. It accentuates the need for robust protective measures, thorough verification mechanisms, and stringent management of access to these models, ensuring their reliable and safe use in medical practice.
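One of the protective measures the authors call for is a verification mechanism against silent weight tampering. As a minimal sketch, and not the editing method studied in the paper, a deployment pipeline could fingerprint the vetted model's weights and re-check the fingerprint at load time:

```python
# Sketch of a weight-integrity check: a SHA-256 digest over all parameters so that
# targeted, silent weight edits can be detected before a model is served.
import hashlib
import torch

def weight_fingerprint(model: torch.nn.Module) -> str:
    """Return a SHA-256 digest over all parameters, hashed in a fixed name order."""
    hasher = hashlib.sha256()
    for name, param in sorted(model.state_dict().items()):
        hasher.update(name.encode())
        # cast to float32 so the sketch also handles bfloat16 weights
        hasher.update(param.detach().cpu().to(torch.float32).numpy().tobytes())
    return hasher.hexdigest()

# Demo on a toy module; in practice, compare against a trusted, stored fingerprint:
print(weight_fingerprint(torch.nn.Linear(4, 2)))
# assert weight_fingerprint(model) == TRUSTED_FINGERPRINT, "weights were modified"
```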
Article
Full-text available
The application of ChatGPT in the medical field has sparked debate regarding its accuracy. To address this issue, we present a Multi-Role ChatGPT Framework (MRCF), designed to improve ChatGPT's performance in medical data analysis by optimizing prompts, integrating real-world data, and implementing quality control protocols. Compared with a single ChatGPT model, MRCF significantly outperforms traditional manual analysis in interpreting medical data, exhibiting fewer random errors, higher accuracy, and better identification of incorrect information. Notably, MRCF is over 600 times more time-efficient than conventional manual annotation methods and costs only one-tenth as much. Leveraging MRCF, we have established two user-friendly databases for efficient and straightforward drug repositioning analysis. This research not only enhances the accuracy and efficiency of ChatGPT in medical data science applications but also offers valuable insights for data analysis models across various professional domains.
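The multi-role idea can be illustrated with a small two-role pipeline in which one role drafts the analysis and a second role acts as quality control against the supplied data. The role instructions, the example record and the model name below are assumptions for illustration, not MRCF's actual prompts (the snippet assumes an OPENAI_API_KEY is configured).

```python
# Sketch of a multi-role quality-control pipeline in the spirit of MRCF:
# an analyst role drafts an interpretation, a reviewer role checks it against
# the source record before the output is accepted.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(role_instructions: str, content: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": role_instructions},
            {"role": "user", "content": content},
        ],
    )
    return response.choices[0].message.content

record = "Drug X, originally approved for hypertension, shows activity against target Y."
draft = ask("You are a pharmacologist. Summarize any drug-repositioning evidence in the record.", record)
review = ask(
    "You are a quality-control reviewer. Flag any statement in the draft not supported by the record.",
    f"Record:\n{record}\n\nDraft analysis:\n{draft}",
)
print(draft, review, sep="\n---\n")
```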
Article
Artificial intelligence (AI) has been commoditized. It has evolved from a specialty resource to a readily accessible tool for cancer researchers. AI-based tools can boost research productivity in daily workflows, but can also extract hidden information from existing data, thereby enabling new scientific discoveries. Building a basic literacy in these tools is useful for every cancer researcher. Researchers with a traditional biological science focus can use AI-based tools through off-the-shelf software, whereas those who are more computationally inclined can develop their own AI-based software pipelines. In this article, we provide a practical guide for non-computational cancer researchers to understand how AI-based tools can benefit them. We convey general principles of AI for applications in image analysis, natural language processing and drug discovery. In addition, we give examples of how non-computational researchers can get started on the journey to productively use AI in their own work.
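As a minimal illustration of the "off-the-shelf" route described above, a researcher can classify an image with a pretrained model in a few lines, without training anything. The model choice and the image filename are assumptions for illustration only.

```python
# Sketch of using an off-the-shelf pretrained vision model for image classification.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()  # the preprocessing the model was trained with

image = Image.open("histology_tile.png").convert("RGB")  # hypothetical input image
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0))
print("Predicted class index:", logits.argmax(dim=1).item())
```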
Article
This study compares two large language models and evaluates their performance against that of competing open-source models.
Article
Full-text available
Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model¹ (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM² on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA³, MedMCQA⁴, PubMedQA⁵ and Measuring Massive Multitask Language Understanding (MMLU) clinical topics⁶), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today’s models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
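The accuracy figures above were obtained with a combination of prompting strategies; one commonly combined with chain-of-thought prompting is self-consistency, in which several sampled answers are aggregated by majority vote. The sketch below illustrates that aggregation step only; generate_answer() is a stand-in for a sampled LLM call and is an assumption, not the paper's implementation.

```python
# Sketch of self-consistency voting over sampled multiple-choice answers.
from collections import Counter
import random

def generate_answer(question: str, options: list[str]) -> str:
    """Placeholder for a sampled (temperature > 0) chain-of-thought LLM answer."""
    return random.choice(options)  # replace with a real model call

def self_consistency(question: str, options: list[str], n_samples: int = 11) -> str:
    votes = Counter(generate_answer(question, options) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

answer = self_consistency(
    "Which drug is first-line for anaphylaxis?",
    ["A) Beta blockers", "B) Epinephrine", "C) Statins", "D) NSAIDs"],
)
print(answer)
```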
Preprint
Full-text available
Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.
  • S Bubeck