Review began 02/06/2023
Review ended 02/12/2023
Published 02/19/2023
© Copyright 2023
Alkaissi et al. This is an open access article
distributed under the terms of the Creative
Commons Attribution License CC-BY 4.0.,
which permits unrestricted use, distribution,
and reproduction in any medium, provided
the original author and source are credited.
Artificial Hallucinations in ChatGPT:
Implications in Scientific Writing
Hussam Alkaissi 1, 2, 3, Samy I. McFarlane 3
1. Internal Medicine, Kings County Hospital Center, Brooklyn, USA 2. Internal Medicine, Veterans Affairs Medical
Center, Brooklyn, USA 3. Internal Medicine, State University of New York Downstate Medical Center, Brooklyn, USA
Corresponding author: Hussam Alkaissi, hussam.alkaissi@downstate.edu
Abstract
While still in its infancy, ChatGPT (Generative Pretrained Transformer), introduced in November 2022, is bound to hugely impact many industries, including healthcare, medical education, biomedical research, and scientific writing. The implications of ChatGPT, the new chatbot introduced by OpenAI, for academic writing are largely unknown. In response to the Cureus Journal of Medical Science Turing Test call for case reports written with the assistance of ChatGPT, we present two cases: one on homocystinuria-associated osteoporosis and the other on late-onset Pompe disease (LOPD), a rare metabolic disorder. We tasked ChatGPT with writing about the pathogenesis of these conditions. We documented the positive, negative, and rather troubling aspects of the newly introduced chatbot's performance.
Categories: Endocrinology/Diabetes/Metabolism, Internal Medicine, Healthcare Technology
Keywords: artificial intelligence and writing, artificial intelligence and education, chatgpt, chatbot, artificial
intelligence in medicine
Editorial
Although large language models such as ChatGPT can produce increasingly realistic text, the accuracy and integrity of using these models in scientific writing are unknown. In this paper, we present the case of ChatGPT, a new chatbot introduced by OpenAI as a natural language generator (NLG) that is nonetheless able to produce artificial hallucinations. We sought to investigate ChatGPT's ability to generate factually correct scientific writing by asking it to provide short paragraphs on specific medical and non-medical topics and evaluating the generated text.
By the time the Cureus Journal of Medical Science called for reports written with the assistance of ChatGPT, we were already working on several projects, one on the pathophysiological mechanisms of homocysteine and another on liver involvement in late-onset Pompe disease (LOPD).
We asked ChatGPT to provide a short paragraph on the mechanism of homocysteine-induced osteoporosis. It was a stunning moment when ChatGPT provided a paragraph that touched on three main aspects: osteoblast inhibition, osteoclast overactivity, and, surprisingly, a mechanism involving vitamin K-dependent carboxylation of osteocalcin (Figure 1).
FIGURE 1: Initial response of ChatGPT when asked to provide a paragraph on the molecular mechanism and pathogenesis of homocystinuria-induced osteoporosis.
A thorough review of the literature on bone metabolism and homocysteine shows that the first two facts provided by ChatGPT, regarding osteoblast and osteoclast imbalance and the progression of osteoporosis, are correct. Similarly, when taken alone, undercarboxylated osteocalcin is a valid mechanism by which vitamin K deficiency is associated with osteoporosis. Homocysteine, however, can reduce osteocalcin production but has nothing to do with the post-translational carboxylation of osteocalcin glutamate residues.
We asked ChatGPT to explain these findings further and provide references to fact-check the presumed "homocysteine-vitamin K-osteocalcin" axis in osteoporosis (Figure 2). It provided five references dating to the early 2000s. None of the provided paper titles existed, and all of the provided PubMed IDs (PMIDs) belonged to different, unrelated papers. For example, the citation "Kallajoki M, et al. Homocysteine and bone
metabolism. Osteoporos Int. 2002 Oct;13(10):822-7. PMID: 12352394" proposed by ChatGPT has the PMID:
12352394. When searching said PMID, the resulting paper is entirely different and in a different field -
"Grubb RL 3rd, Sundaram CP, Yan Y, Chen C, McDougall EM, Clayman RV. Use of titanium staples during
upper tract laparoscopic reconstructive surgery: initial experience. J Urol. 2002 Oct;168(4 Pt 1):1366-9. doi:
10.1097/01.ju.0000025337.09758.3c. PMID: 12352394."
FIGURE 2: References provided by ChatGPT. PMID numbers corresponded to different papers.
PMID: PubMed ID
We then requested that ChatGPT provide more recent references from the last 10 years. The list provided was the same as the first but with different years, and again with PMID numbers that belonged to different papers.
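Because the fabricated citations carried plausible-looking PMIDs, each one had to be checked against PubMed by hand. As a minimal sketch of how such spot-checking could be automated, the snippet below queries the public NCBI E-utilities esummary endpoint for the title actually registered under a given PMID; the function name is illustrative and not part of any published tooling, and the example PMID is the one ChatGPT supplied above.

```python
import requests

ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"

def pubmed_title(pmid: str) -> str:
    """Return the article title that PubMed actually registers for a given PMID."""
    params = {"db": "pubmed", "id": pmid, "retmode": "json"}
    response = requests.get(ESUMMARY_URL, params=params, timeout=10)
    response.raise_for_status()
    return response.json()["result"][pmid]["title"]

# PMID 12352394 was cited by ChatGPT as "Homocysteine and bone metabolism";
# PubMed resolves it to an unrelated urology paper on titanium staples.
print(pubmed_title("12352394"))
```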
We then tested ChatGPT in a different area; for example, we asked the chatbot to write a short essay on liver involvement in LOPD. Of note, liver involvement is known to occur, rarely, in the infantile, more severe form of Pompe disease but not in LOPD. ChatGPT, with apparent confidence, provided an essay on liver involvement that, in reality, has not been reported yet (Figure 3). We do not exclude the possibility that such reports may exist in non-English languages; in fact, we tested ChatGPT on LOPD and liver disease because we have unpublished data suggesting that such a connection may exist.
FIGURE 3: Essay on a non-existent link between late-onset Pompe disease (LOPD) and liver involvement.
We found two areas in which the chatbot's current version can aid academic writing. First, if the authors do all the literature review and provide short bullet-point notes from each reference, ChatGPT can make linguistically coherent text out of the small, scattered bullet points, almost like assembling a jigsaw puzzle.
The second area in which ChatGPT can be helpful in academic writing is reference and citation sorting and management. For example, we wrote a lengthy discussion section in which references were cited as PMIDs after each sentence or section. We therefore needed to identify which PMIDs recurred, so that such references could be labeled with the same reference number and the same paper would not be cited twice. On a first attempt, ChatGPT was unable to identify recurrent PMIDs within the text. However, when asked to write Python code to identify such recurring large numbers (five or more digits), the resulting code successfully identified all recurrent PMIDs within the text.
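For illustration, a minimal sketch of the kind of script ChatGPT produced is shown below; the exact code it generated was not preserved, so the function name, threshold, and example PMIDs here are hypothetical, but the approach (treating any run of five or more digits as a candidate PMID and counting repeats) matches what we asked for.

```python
import re
from collections import Counter

def find_recurrent_pmids(text: str, min_digits: int = 5) -> dict:
    """Return candidate PMIDs (runs of at least min_digits digits) that occur more than once."""
    candidates = re.findall(rf"\b\d{{{min_digits},}}\b", text)
    counts = Counter(candidates)
    return {pmid: n for pmid, n in counts.items() if n > 1}

# Hypothetical discussion text with PMIDs cited after each sentence
discussion = (
    "Homocysteine impairs osteoblast function (PMID 11111111). "
    "Reduced bone formation has been reported (PMID 11111111), "
    "while osteoclast activity increases (PMID 22222222)."
)
print(find_recurrent_pmids(discussion))  # {'11111111': 2}
```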
The new chatbot ChatGPT represents a leap in artificial intelligence and its application to academic writing, and arguments about its use as an aid in academic manuscript preparation have been raised. Here, we tested ChatGPT's ability to write short essays on familiar topics, then scrutinized and fact-checked the provided text. ChatGPT gave confident responses that seemed faithful yet proved nonsensical when viewed in light of common knowledge in these areas. Such a phenomenon has been described as “artificial hallucination” [1].
ChatGPT itself defines artificial hallucination as follows: “Artificial hallucination refers to the
phenomenon of a machine, such as a chatbot, generating seemingly realistic sensory experiences that do not
correspond to any real-world input. This can include visual, auditory, or other types of
hallucinations. Artificial hallucination is not common in chatbots, as they are typically designed to respond
based on pre-programmed rules and data sets rather than generating new information. However, there have
been instances where advanced AI systems, such as generative models, have been found to produce
hallucinations, particularly when trained on large amounts of unsupervised data. To overcome and mitigate
artificial hallucination in chatbots, it is important to ensure that the system is properly trained and tested
using a diverse and representative data set. Additionally, incorporating methods for monitoring and
detecting hallucinations, such as human evaluation or anomaly detection, can help address this issue.”
In a recent experiment by Gao et al., 50 abstracts from five scientific journals were used, and ChatGPT was asked to generate abstracts based on their titles. A plagiarism detector, an AI output detector, and blinded human reviewers then reviewed both sets of abstracts. Of ChatGPT's generated abstracts, 68% were detected as such (true positives), while 14% of the real abstracts were incorrectly flagged as chatbot-generated (false positives). Interestingly, the human reviewers stated that it was difficult to identify whether an abstract was written by a human author or a chatbot [1,2].
Another use for ChatGPT can be in medical education. A study assessed ChatGPT's ability to handle complex medical and clinical information by testing its performance on the US Medical Licensing Examination (USMLE) Step 1, Step 2 CK, and Step 3, in both open-ended and multiple-choice question (MCQ) formats. In the open-ended format, ChatGPT's scores ranged from 43% to 68%, and in the MCQ format from 40% to 65%. Accuracy was lowest on Step 1 of the USMLE, regarded as the most difficult exam, indicating that the AI's performance is tied to human perception and understanding of the subject matter. High internal concordance was also noted, especially in correctly answered questions, with a concordance rate of up to 99% [3]. The integration of
ChatGPT in academic writing has sparked a polarizing debate among scholars. While some see it as a
valuable tool for streamlining the writing process, others view it as a threat to the integrity of authorship [4].
While ChatGPT can write credible scientific essays, the data it generates are a mix of true and completely fabricated content. This raises concerns about the integrity and accuracy of using large language models, such as ChatGPT, in academic writing. We propose that policy and practice for evaluating scientific
manuscripts for journals and medical conferences be modified to maintain rigorous scientific standards. We
also advocate for including AI output detectors in the editorial process and for clear disclosure when these technologies are used. The ethics and acceptability of using large language models in scientific writing remain debatable, as does their potential to create false experts in the medical field who could cause harm through a lack of real experience and the generation of expert opinions via AI chatbots such as ChatGPT.
Additional Information
Disclosures
Conflicts of interest: In compliance with the ICMJE uniform disclosure form, all authors declare the
following: Payment/services info: All authors have declared that no financial support was received from
any organization for the submitted work. Financial relationships: All authors have declared that they have
no financial relationships at present or within the previous three years with any organizations that might
have an interest in the submitted work. Other relationships: All authors have declared that there are no
other relationships or activities that could appear to have influenced the submitted work.
Acknowledgements
Some of the text within the manuscript has been generated with the aid of ChatGPT and has been put
between quotation marks.
References
1. Ji Z, Lee N, Frieske R, et al.: Survey of hallucination in natural language generation. ACM Comput Surv. 2022, 10.1145/3571730
2. Gao CA, Howard FM, Nikolay S: Abstracts written by ChatGPT fool scientists. [Preprint]. bioRxiv. 2022, 10.1101/2022.12.23.521610
3. Kung TH, Cheatham M, ChatGPT, et al.: Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. [Preprint]. medRxiv. 2022, 10.1101/2022.12.19.22283643
4. Stokel-Walker C: ChatGPT listed as author on research papers: many scientists disapprove. Nature. 2023, 613:620-1. 10.1038/d41586-023-00107-z