ORIGINAL ARTICLE
European Journal of Nuclear Medicine and Molecular Imaging
https://doi.org/10.1007/s00259-025-07101-9
Empowering PET imaging reporting with retrieval-augmented large language models and reading reports database: a pilot single center study

Hongyoon Choi 1,2,3 · Dongjoo Lee 3 · Yeon-koo Kang 1 · Minseok Suh 1,2

Received: 14 October 2024 / Accepted: 17 January 2025
© The Author(s) 2025

Abstract
Purpose The potential of Large Language Models (LLMs) in enhancing a variety of natural language tasks in clinical fields includes medical imaging reporting. This pilot study examines the efficacy of a retrieval-augmented generation (RAG) LLM system considering zero-shot learning capability of LLMs, integrated with a comprehensive database of PET reading reports, in improving reference to prior reports and decision making.
Methods We developed a custom LLM framework with retrieval capabilities, leveraging a database of over 10 years of PET imaging reports from a single center. The system uses vector space embedding to facilitate similarity-based retrieval. Queries prompt the system to generate context-based answers and identify similar cases or differential diagnoses. From routine clinical PET readings, experienced nuclear medicine physicians evaluated the performance of the system in terms of the relevance of queried similar cases and the appropriateness score of suggested potential diagnoses.
Results The system efficiently organized embedded vectors from PET reports, showing that imaging reports were accurately clustered within the embedded vector space according to the diagnosis or PET study type. Based on this system, a proof-of-concept chatbot was developed and showed the framework's potential in referencing reports of previous similar cases and identifying exemplary cases for various purposes. From routine clinical PET readings, 84.2% of the cases retrieved relevant similar cases, as agreed upon by all three readers. Using the RAG system, the appropriateness score of the suggested potential diagnoses was significantly better than that of the LLM without RAG. Additionally, it demonstrated the capability to offer differential diagnoses, leveraging the vast database to enhance the completeness and precision of generated reports.
Conclusion The integration of RAG LLM with a large database of PET imaging reports suggests the potential to support clinical practice of nuclear medicine imaging reading by various tasks of AI including finding similar cases and deriving potential diagnoses from them. This study underscores the potential of advanced AI tools in transforming medical imaging reporting practices.

Keywords PET reports · Large language model · Retrieval-augmented generation · Artificial intelligence

1 Department of Nuclear Medicine, Seoul National University Hospital, 101 Daehak-ro, Jongno-gu, Seoul 03080, Republic of Korea
2 Department of Nuclear Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
3 Portrai, Inc., Seoul, Republic of Korea

Hongyoon Choi
chy1000@snu.ac.kr

Introduction

The integration of Large Language Models (LLMs) into the clinical domain has heralded a new era in healthcare innovation, particularly in the realm of medical imaging reports [1, 2]. LLMs, with their sophisticated zero-shot learning capabilities, have shown promise in parsing, summarizing, and generating complex medical texts, thereby enhancing the efficiency and accuracy of clinical documentation and decision-making processes [3]. Their application extends across various specialties, aiming to revolutionize how healthcare
professionals interact with and leverage vast amounts of medical data for patient care.

Despite the growing interest and proven benefits of LLMs in many areas of medicine, their potential has not been fully explored in the realm of nuclear medicine imaging, particularly PET imaging reporting. Although ChatGPT has shown the potential to revolutionize content creation by generating human-like text [4], specific applications leveraging LLMs in the nuclear medicine field, particularly for imaging reports, have not been explored. PET imaging, which is performed for a variety of purposes and conditions, produces complex data requiring thorough analysis and interpretation, playing a critical role in clinical decision-making [5, 6]. There is a need for advanced tools to aid in referencing past reports, sourcing cases for educational purposes, and conducting differential diagnoses, especially as the use of PET, which encompasses various radiotracers and diseases, becomes more widespread. This unmet need presents a significant opportunity for LLMs to improve the specificity and relevance of PET report generation. By leveraging prior reports and analogous case studies, LLMs can provide clinicians with valuable insights, aiding them in making informed decisions.

In this study, we introduce a pioneering approach to PET imaging reporting by developing and assessing a custom-built, retrieval-augmented generation (RAG) LLM framework [7]. This system leverages a comprehensive database of PET reading reports. By embedding these reports into a vector space for efficient retrieval based on similarity metrics, our framework aims to enhance PET imaging reporting in three key ways: (1) assisting PET reading experts by referencing past reports, enabling them to review similar cases and outcomes during the diagnostic process; (2) supporting educational purposes by identifying appropriate cases for case-centered study; and (3) facilitating interactive queries related to PET reading for clinicians, based on a database of past reports. This proof-of-concept study seeks to demonstrate the feasibility and benefits of integrating advanced LLM capabilities with a vast repository of PET imaging data, aiming to establish enhanced medical imaging reporting practices.
Materials and methods
Dataset
This study was conducted at a single center, utilizing reading reports of PET imaging data sourced from the clinical data warehouse (CDW) of the SUPREME Platform. We extracted data spanning from 2010 to 2023, comprising reports from 118,107 patients across 211,813 cases. Institutional Review Board (IRB) approval was secured from our hospital (IRB No. 2401-090-1501), with the requirement for written informed consent waived due to the retrospective nature of the study and the use of deidentified information. The dataset encompassed reading reports for all cases, along with the exam date, exam name, a deidentified research identifier (ID), sex, and date of birth (year-month format).
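For illustration, the per-case record described above can be modeled as a small data class. All field and column names below are hypothetical stand-ins, since the actual CDW export schema is not specified in the text.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class PetReportRecord:
    # Field names are illustrative; the real CDW export schema is not specified.
    research_id: str  # deidentified research identifier
    exam_date: str    # exam date, e.g. "2019-03-14"
    exam_name: str    # e.g. "F-18 FDG Torso PET/CT"
    sex: str          # "M" or "F"
    birth_ym: str     # date of birth in year-month format, e.g. "1956-07"
    report_text: str  # full reading report text


def load_records(csv_text: str) -> list:
    """Parse a CSV export of reading reports into record objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        PetReportRecord(
            research_id=row["id"],
            exam_date=row["exam_date"],
            exam_name=row["exam_name"],
            sex=row["sex"],
            birth_ym=row["birth_ym"],
            report_text=row["report"],
        )
        for row in reader
    ]


# Example with one synthetic row
sample = (
    "id,exam_date,exam_name,sex,birth_ym,report\n"
    "R000001,2019-03-14,F-18 FDG Torso PET/CT,F,1956-07,Findings: ... Conclusion: ...\n"
)
records = load_records(sample)
```

A flat record like this is what gets embedded and stored for retrieval in the sections that follow.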
Model architecture
In this study, we designed a proof-of-concept chatbot system for efficiently querying reading reports from a substantial dataset. It was based on 'RAG' [7]. The adaptability of this system allows for the utilization of various database formats, including but not limited to 'csv' files, to accommodate different sources of reading reports. This system amalgamates state-of-the-art language model technologies with sophisticated natural language processing and information retrieval techniques, aiming to deliver precise, contextually relevant responses to inquiries concerning PET imaging reading reports. The overall workflow of this system is illustrated in Fig. 1.

The architecture of our system is underpinned by a series of modular components, each crucial for interpreting and responding to user queries. At the forefront is a sentence embedding layer, crafted to process intricate texts and queries by transforming sentences into vectors. This transformation facilitates subsequent processing by various mathematical models. We employed the Sentence Transformer model, specifically "paraphrase-multilingual-MiniLM-L12-v2" (https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2), renowned for its ability to comprehend and paraphrase texts across multiple languages, a necessary feature considering the bilingual nature (English and Korean) of the reading reports in our dataset. To manage and retrieve PET reading reports effectively, our system incorporates a vector storage mechanism, Chroma (https://www.trychroma.com/). Chroma organizes textual data into a searchable vector space by converting text into numerical vectors derived from the sentence embeddings. This conversion enables the system to execute advanced retrieval operations, identifying responses that are semantically relevant to the queries posed. The retrieval after embedding to Chroma was performed using the cosine similarity of the query text vectors, retrieving the top-k texts from the database as context for generating prompts for the LLM. We set this top-k value to k = 5.
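The retrieval step can be sketched in plain Python: cosine similarity between a query vector and each stored report vector, keeping the top k. In the actual system the vectors come from the SentenceTransformer model and Chroma performs this search; the toy three-dimensional vectors below merely stand in for real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(query_vec, store, k=5):
    """store: list of (report_id, vector) pairs.
    Returns ids of the k reports most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [report_id for report_id, _ in ranked[:k]]


# Toy example: three "report embeddings" and a query closest to the first two
store = [
    ("rpt_a", (1.0, 0.0, 0.0)),
    ("rpt_b", (0.9, 0.1, 0.0)),
    ("rpt_c", (0.0, 1.0, 0.0)),
]
top = retrieve_top_k((1.0, 0.0, 0.0), store, k=2)  # -> ["rpt_a", "rpt_b"]
```

The retrieved ids map back to full report texts, which become the context of the generated prompt.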
After retrieving the related context, specifically previous PET reports, a question-answering (QA) component was integrated. This QA mechanism excels at comprehending user queries, sourcing the most pertinent documents from the dataset, and formulating informative responses that precisely address the queries. To generate prompts, the system integrates retrieved texts as contexts along with the reader's question to create a full prompt. For example, the prompt includes the text: "Give an answer by only referring to the context, include the address within the context in the answer, and clearly number the answer," along with (context), which contains the retrieved reports, and (question), representing the reader's query. For the generation of these responses, we incorporated the Llama-3 (7-billion parameter) language model [8], and the system architecture was based on LangChain [9].
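The prompt assembly described above amounts to simple string templating. The exact template beyond the quoted instruction is not given in the text, so the layout below is an assumption:

```python
# Fixed instruction quoted in the text; the surrounding template is assumed.
INSTRUCTION = (
    "Give an answer by only referring to the context, include the address "
    "within the context in the answer, and clearly number the answer."
)


def build_prompt(retrieved_reports, question):
    """Combine the fixed instruction, the top-k retrieved reports (context),
    and the reader's question into the full prompt sent to the LLM."""
    context = "\n\n".join(retrieved_reports)
    return f"{INSTRUCTION}\n\n(context)\n{context}\n\n(question)\n{question}"


prompt = build_prompt(
    ["Report 1: ...", "Report 2: ..."],
    "Identify cases of breast cancer with metastasis to internal mammary lymph nodes.",
)
```

In the real system this string is handed to the Llama-3 model through the LangChain QA chain rather than printed.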
Visualization of vector embedding
Following the process of sentence embedding, the resulting vectors were stored in a vector database. These vectors played a crucial role in identifying similarities between various texts, including the queries submitted to the system. To facilitate a deeper understanding of how PET reading reports are represented within this vector space, we employed t-distributed Stochastic Neighbor Embedding (t-SNE) for visualization purposes [10]. Specific keywords associated with imaging reports, such as "lung cancer," "breast cancer," "lymphoma," "methionine PET," and "PSMA PET," were chosen for this analysis. The objective was to ascertain whether reports containing these selected terms would naturally form distinct clusters within the vector space. This approach aimed to visually demonstrate the effectiveness of our vector embedding process in grouping similar reports, thereby providing insights into the semantic relationships and similarities between different PET reports in the dataset.
Test examples
In the evaluation of prototype chatbots designed for navigating an extensive database of PET reading reports, we focused on testing their ability to accurately retrieve reports similar to those specified in user queries and to assist in differential diagnosis by referencing previous reports. This involved assessing their proficiency in identifying cases with specific diagnoses or imaging findings and their capability to extract relevant information to support nuclear medicine experts in diagnosing complex cases. The testing protocol simulated real-world scenarios, presenting the chatbots with diverse clinical questions to comprehensively evaluate their utility in clinical decision-making and their effectiveness in leveraging the vast database to enhance the accuracy and relevance of their responses.
Fig. 1 Workflow of the Chatbot System for Querying PET Imaging Reading Reports. The overall workflow of the proof-of-concept system designed for efficient querying of reading reports from a substantial dataset is illustrated. The system integrates the Retrieval-Augmented Generation (RAG) model with advanced language model technologies, natural language processing, and information retrieval techniques. The workflow demonstrates the process from user query input through to the delivery of the relevant reading report, showcasing the operational framework and interaction with different sources of reading reports.
Results
Clustered unstructured PET reports by sentence embedding

We analyzed PET imaging reports from 118,107 patients, totaling 211,813 cases, by converting them into vector embeddings. These embeddings were then visualized on a t-SNE plot to demonstrate dimensionality reduction and the clustering of reports with similar characteristics (Fig. 2A). Each point on this plot represents a unique PET imaging report, with a specific case highlighted in red for illustrative purposes, including its original report. By examining the distribution of these clusters, we observed distinct groupings based on diagnostic terms and exam types, indicating that reports with similar clinical contexts naturally grouped together in the embedding space. For instance, to evaluate the representational efficacy of the embeddings, reports containing key diagnostic terms such as 'lung cancer', 'breast cancer', and 'lymphoma', as well as those pertaining to specific types of exams like 'C-11 methionine PET' and 'Ga-68 PSMA-11 PET', were marked on the plot. The clusters containing 'lung cancer' exhibited substantial cohesion, potentially reflecting the higher prevalence of lung cancer cases in our dataset, while distinct clusters also emerged for 'breast cancer', 'lymphoma', and specific PET modalities such as C-11 methionine PET and Ga-68 PSMA-11 PET (Fig. 2B). These cohesive clusters highlight clinically meaningful patterns, suggesting that sentence embeddings of unstructured reports can provide context for LLM-based question answering. The formation of these distinct clusters underscores the ability of text embeddings of PET reports to reflect the semantic similarity among cases, offering potential clinical utility in identifying disease-specific patterns and retrieving relevant texts.
LLM with RAG chatbot-assisted querying and suggested diagnosis

Using the prototype chatbot, we tested its efficacy in identifying cases pertinent to specific user queries. A notable instance involved the chatbot's response to the query, "Identify cases of breast cancer with metastasis to internal mammary lymph nodes," where it proficiently located and presented relevant cases from the database of prior reading reports (Fig. 3A) (more examples are presented in Supplementary Video 1). This example demonstrates how clinicians or trainees could rapidly find comparable cases for reference, potentially aiding diagnostic reasoning or educational purposes. The retrieved cases included key details from prior reports, allowing users to cross-reference imaging findings, disease progression, and final outcomes in patients with similar clinical scenarios.
Evaluation of queried similar cases and potential diagnoses

From daily routine PET exams, we generated simulated prompts whose relevance and appropriateness were evaluated by three independent nuclear medicine physicians. We extracted 19 cases from routine PET exams and their reports to evaluate two tasks: query performance for similar cases and potential diagnoses from findings. To evaluate query performance for similar cases, we used the text from the conclusions of the PET reports to generate prompts such as "find similar cases and summarize the reports." For evaluating potential diagnoses, specific texts from the findings sections of the reports were used to generate prompts to "suggest potential diagnoses for this finding." Examples of conclusions and findings used for these prompts are summarized in Supplementary Table 1. Three nuclear medicine physicians independently scored the system's answers for medical relevance on a scale of 1 (poor), 2 (fair), and 3 (good). The gold standard for these evaluations was the consensus judgment of these experienced physicians, who assessed the medical relevance and accuracy of the system's responses based on their expert knowledge and clinical experience. To assess the effect of the RAG on the performance of the LLM, we compared the appropriateness scores of the LLM with and without RAG using the Wilcoxon rank-sum test. This comparative analysis helped determine the added value of the RAG framework in enhancing the relevance and accuracy of the generated responses.
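The rank-sum comparison can be sketched in pure Python. This minimal version uses midranks for ties and a normal approximation without tie-variance or continuity correction, so in practice a library routine such as scipy.stats.ranksums would be preferred.

```python
import math


def rank_sum_test(x, y):
    """Wilcoxon rank-sum test: returns (W, two-sided p) where W is the rank sum
    of sample x. Uses midranks for ties and a normal approximation (no tie or
    continuity correction), a simplification of what statistical libraries do."""
    pooled = [(v, "x") for v in x] + [(v, "y") for v in y]
    pooled.sort(key=lambda pair: pair[0])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):  # assign midranks to runs of tied values
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == "x")
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean) / sd
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided
    return w, p


# Hypothetical appropriateness scores: with-RAG vs. without-RAG
w, p = rank_sum_test([3, 3, 3, 2], [1, 1, 2, 1])
```

With clearly higher scores in the first group, the test returns a small p-value; identical samples give p = 1.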
In addition to performance evaluations based on physician scoring, a quantitative assessment was conducted to evaluate the accuracy of conclusions generated from findings. By inputting text from the findings section, the LLM with and without RAG was tested for its ability to generate conclusion texts for reading reports, simulating diagnostic reasoning (prompt: "Write a concise conclusion, including a potential diagnosis, in one or two sentences"). These generated conclusions were compared to the actual conclusion reports described by nuclear medicine physicians. The comparisons were quantified using the ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation) metric, which measures the alignment between generated and reference texts by focusing on the longest common subsequence (LCS) while accounting for word order [11, 12]. To assess the overall quantitative performance, the ROUGE-L F-score, representing the harmonic mean of precision and recall, was calculated for both the LLM with RAG and without RAG. This evaluation highlights the impact of the RAG framework on improving the alignment and relevance of the generated conclusions.
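The ROUGE-L F-score follows directly from the longest common subsequence of the two token sequences. The sketch below is a minimal unweighted F1 variant over whitespace tokens; published ROUGE implementations add tokenization details and a weighted-F option.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            if tok_a == tok_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]


def rouge_l_f(generated, reference):
    """ROUGE-L F-score: harmonic mean of LCS-based precision and recall."""
    gen, ref = generated.split(), reference.split()
    lcs = lcs_length(gen, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(gen), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical generated vs. reference conclusion texts
score = rouge_l_f(
    "Hypermetabolic lung mass, suggestive of malignancy",
    "Hypermetabolic mass in the right lung, likely malignancy",
)
```

Identical texts score 1.0, disjoint texts 0.0; longer generated texts are penalized through precision.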
Fig. 2 Visualization of PET Imaging Report Embeddings Using t-SNE. (A) t-SNE plot illustrates PET imaging report embeddings from 118,107 patients, totaling 211,813 cases. Each point on the plot represents a unique report, with a selected case highlighted in red to show an example of an original report. (B) t-SNE plots showcase the clustering efficacy of the embeddings, highlighting how reports containing key diagnostic terms like 'lung cancer', 'breast cancer', 'lymphoma', and specific types of exams such as 'C-11 methionine PET' and 'Ga-68 PSMA-11 PET' form distinct clusters. These clusters indicate the embeddings' capability to reflect the similarity among cases, demonstrating the potential of this method in facilitating the identification and visualization of related PET imaging reports.
Fig. 3 Examples of Chatbot Responses to Queries. (A) An example case displays an instance of the chatbot's capability to accurately identify and present relevant cases in response to a user query about breast cancer with metastasis to internal mammary lymph nodes. It highlights the capacity to navigate a vast database of previous reading reports to identify relevant cases. (B) An example of the utility of the system in generating differential diagnoses is displayed. This is demonstrated through the chatbot's response to a query, where it offers a detailed list of potential diagnoses along with reference identifiers. As an example, by employing identifiers within the PACS system (in this example, we used deidentified information), prior imaging cases could be referenced for understanding cases and supporting decision making.
Discussion

In this study, we have explored the integration of LLMs into the PET imaging reporting process, presenting a novel prototype chatbot based on RAG capable of retrieving relevant cases and offering differential diagnoses based on specific user queries. This LLM with RAG demonstrates feasibility for medical purposes in the nuclear medicine imaging field, particularly by incorporating contextual understanding from previous PET imaging reports to respond to queries from nuclear medicine physicians. This approach marks a departure from simple chatbot functionalities, introducing a system that integrates with the clinical workflow to provide contextually relevant information and insights. This proof-of-concept not only validates the utility of LLMs in enhancing the PET reporting process but also underscores the potential of AI-assisted tools to augment diagnostic accuracy and clinical decision-making in nuclear medicine.

The RAG model combines the strengths of information retrieval and generative AI to offer precise and informative answers to complex medical queries. It works by first retrieving relevant documents or data points from a vast database, in this case a collection of PET imaging reports. Following this, the model uses the retrieved information as context to generate responses that are not only relevant but also enriched with the specificity and detail required for decision-making. This method allows the system to provide answers that are deeply informed by historical cases and existing medical knowledge, thereby supporting physicians in diagnosing and managing patient care with a higher degree of accuracy and confidence. The introduction of RAG not only reduces the risk of hallucinations but also enhances the accuracy of responses by grounding them in specialized, domain-specific data. This is particularly important in PET reporting, where the complexity and specificity of the information require expertise-driven answers. RAG provides a viable solution for effectively applying LLMs to such specialized areas, ensuring more reliable and contextually appropriate outputs. In contrast to earlier language models that concentrated on singular tasks [13-15], models based on the RAG framework with LLMs can handle diverse queries and produce varied outputs. The RAG model, distinct from LLMs that rely solely on their pre-trained datasets, actively incorporates pertinent historical information during its response generation. Primarily, employing LLMs like ChatGPT or Gemini directly is constrained by their inability to access individual center databases, which restricts their reference to prior cases and clinical outcomes. In this regard, a previous study demonstrated that RAG applications can enhance domain-specific decision-making when using LLMs in medical fields, whereas querying and retrieving specific cases to reference previous outcomes in nuclear medicine imaging are specialized tasks addressed by our work [16].
Additionally, we evaluated the chatbot's functionality in offering differential diagnoses by leveraging its integration with the LLM. This was exemplified in a scenario where the chatbot was tasked to provide differential diagnoses for the condition described as "Multiple mediastinal lymph nodes with increased FDG uptake without an identified primary site." The chatbot responded with a detailed list of differential diagnoses, accompanied by reference identifiers, thus enabling medical professionals to quickly locate and compare relevant case histories, imaging findings, and clinical outcomes (Fig. 3B).

Taken together, these examples underscore the potential for integrating real-world historical data into the decision-making process. By referencing prior PET reports through the RAG framework, clinicians receive contextually enriched insights, which can be especially valuable for less common clinical presentations. This improved retrieval and diagnosis suggestion process highlights a practical way to apply generative AI tools in nuclear medicine practice, where rapid access to similar cases and differential diagnoses can benefit patient care.
Evaluating appropriateness for case querying and diagnosis suggestion using LLM with RAG

In addition, the appropriateness scores evaluated by nuclear medicine physicians were assessed for two different simulated tasks: querying similar cases and suggesting potential diagnoses from specific findings. Firstly, for the similar cases queried by specific reports, 16 out of 19 (84.2%) were appropriately identified, with all three readers rating these as better than 'Fair' in relevance (Fig. 4A). Furthermore, the appropriateness of potential diagnoses for specific findings was evaluated, with 15 out of 19 (78.9%) cases receiving a better than 'Fair (2)' grade from all readers for the suggested potential diagnoses. To compare the performance of the LLM with and without RAG, the Wilcoxon rank-sum test was conducted. The LLM with RAG showed significantly better appropriateness scores compared to the LLM without RAG (W = 226; p < 0.05) (Fig. 4B). In addition to the appropriateness assessed by physicians' scores, the conclusions generated from findings with and without the RAG framework were quantitatively evaluated. The ROUGE-L F-score, which measures how well the conclusion generated from findings captures the reference conclusion text, was significantly higher for the RAG framework compared to the LLM without RAG (0.16 ± 0.08 vs. 0.07 ± 0.03, p < 0.001; Fig. 4C).
Moreover, due to stringent regulations concerning clinical data and privacy, the transfer of clinical records to external AI servers is considered highly sensitive and is inherently prohibited in numerous healthcare institutions [17, 18]. In this context, implementing an LLM with RAG framework that utilizes PET reading reports could address these challenges by facilitating the application of real-world data in each hospital, while also avoiding the various data-related regulatory constraints. Although tested in a single-center study, this approach is tailored to individual institutions rather than serving as a universal model for all hospitals. In other words, implementing LLM with RAG frameworks to retrieve data specific to each hospital could improve responses to questions directly related to that data. In addition, this feature is especially beneficial in specialty fields like nuclear medicine, where insights drawn from previous cases are helpful for informed decision-making in current clinical scenarios.

In this study, we evaluated the performance of LLM-based answers for two tasks: querying similar cases and suggesting potential diagnoses based on previous reports.
Fig. 4 Evaluation of Appropriateness Scores by Nuclear Medicine Physicians. (A) The appropriateness of querying similar cases was assessed. Using a conclusion text to generate the prompt "find similar reports and summarize it," the system retrieved results. For specific reports, 16 out of 19 (84.2%) were appropriately identified, with all three readers rating these as better than 'Fair' in relevance. (B) The appropriateness of potential diagnoses for specific findings was evaluated. Using specific finding texts to generate prompts for suggesting potential diagnoses, the responses of the system were assessed. Medical relevance and appropriateness of the suggested potential diagnoses were evaluated by readers. The system without RAG was also assessed, and the performance of the LLM with and without RAG was represented as a heatmap. The results indicated that the LLM with RAG showed significantly better appropriateness scores (p < 0.05). (C) The ROUGE-L F-score was used to quantitatively evaluate the alignment between generated conclusions and reference conclusion texts from finding descriptions. The RAG framework demonstrated significantly higher scores compared to the LLM without RAG (0.16 ± 0.08 vs. 0.07 ± 0.03, p < 0.001).
thereby enhancing their diagnostic skills and understanding
of nuclear medicine [4, 22]. Furthermore, the ability of this
system to reference previous cases when providing dier-
ential diagnoses enriches the educational content with prac-
tical, real-world examples, fostering critical thinking and
decision-making skills among trainees.
One potential application of this system is its ability to correlate imaging findings with follow-up clinical results, including final diagnoses and clinical outcomes, because the RAG LLM can reference previous reports. These references allow readers to find similar cases and trace their subsequent clinical outcomes or final diagnoses. By integrating historical data-driven context into the imaging interpretation process, the system offers an opportunity to provide a holistic view of the clinical journey of similar cases, from imaging to final outcome [23, 24]. This comprehensive approach facilitates a more nuanced understanding of the potential implications of specific imaging findings, guiding physicians in crafting PET imaging interpretations that are informed by both the current condition and comparable past cases. The insights derived from this analysis are invaluable for informing differential diagnosis, predicting patient outcomes, and even anticipating potential complications. Such insights are crucial for bridging the gap between imaging findings and patient management strategies, ultimately contributing to improved patient care.
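The case-to-outcome tracing described above amounts to joining retrieved similar cases with a follow-up record table. A minimal sketch, assuming hypothetical record fields (`final_diagnosis`, `outcome`) purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class FollowUp:
    final_diagnosis: str
    outcome: str

def trace_outcomes(similar_case_ids: list[str],
                   followups: dict[str, FollowUp]) -> list[tuple[str, FollowUp]]:
    """For each retrieved similar case, attach its final diagnosis and
    clinical outcome when follow-up data exist for that case."""
    return [(cid, followups[cid]) for cid in similar_case_ids if cid in followups]
```

Cases without recorded follow-up are simply skipped, so readers see only similar cases whose clinical journey can actually be traced.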
However, the study also acknowledges certain limitations, including the inherent risk of generating inaccurate information (hallucinations) and the current model's reliance on textual data [20, 21]. Additionally, due to limitations in retrieval performance, the system showed poor appropriateness scores when retrieving rare cases and their related potential diagnoses. This affects the overall performance and quality, as experienced physicians would find the system most useful for rare or atypical cases. To address this, using better LLM models that allow a larger number of tokens and can reference more previous reports simultaneously could mitigate these issues, although this requires further study and development. Additionally, while our RAG approach avoids the pitfalls of overfitting through the use of pre-trained language models without additional training, it is important to recognize the limitations inherent in the database composition. The variability in disease prevalence across different hospitals may impact the performance of similar case retrieval, potentially limiting the generalizability of our findings. Further studies involving more diverse and representative datasets are necessary to validate and enhance the robustness of our tool. A larger, multicenter study is required to validate the approach across different clinical settings, given the variations in PET indications and disease prevalence across centers; these differences could impact the model's performance. However, this approach to similar case retrieval demonstrated good performance, correctly identifying similar cases in nearly 90% of instances.
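The similar-case retrieval step can be sketched as ranking stored reports by cosine similarity in an embedding space. Here a trivial bag-of-words vectorizer stands in for the sentence-transformer embeddings used by the actual system, so the sketch is self-contained rather than a faithful reimplementation:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a sentence-transformer embedding: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, reports: list[str], k: int = 3) -> list[str]:
    """Return the k stored reports most similar to the query in embedding space."""
    q = embed(query)
    ranked = sorted(reports, key=lambda r: cosine(q, embed(r)), reverse=True)
    return ranked[:k]
```

In the real system, `embed` would map whole report texts to dense vectors, and the top-k neighbors would be passed to the LLM as retrieval context.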
The use of the sentence transformer in our RAG retrieval method provides an advantage in handling PET reports as unstructured text data, which is common in large-scale hospital settings. Unlike traditional query systems that require structured tagging, our approach allows for effective data retrieval without the need for extensive pre-processing, making it more adaptable and practical for real-world clinical applications. However, the system showed limitations with rare cases; for example, it failed to appropriately retrieve a case of scalp angiosarcoma due to its rarity. In such cases, we could consider incorporating a database specifically labeled with rare cases and implementing a weighting system to prioritize their retrieval during queries. Addressing the retrieval of rare case-related data is a crucial aspect of applying LLMs in medical fields. Managing a database enriched with rarity information could significantly enhance the performance of LLMs with RAG, particularly in PET reporting, by improving their ability to handle uncommon and complex cases effectively [19]. We also assessed the use of RAG for generating answers. By leveraging contexts from previous PET reports, RAG provided reliable and medically relevant responses. In particular, during the generation of potential diagnoses, RAG could reference previous cases, which helped readers perceive the answers as reliable and relevant, mitigating the hallucination effect, a common issue with LLMs in medical applications [20, 21].
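The rarity-weighting idea suggested above can be sketched as adding a fixed boost to the similarity score of reports tagged as rare; the `boost` value and the rarity labels are illustrative assumptions, not part of the published system:

```python
def rerank_with_rarity(candidates: list[tuple[str, float]],
                       rare_labels: set[str],
                       boost: float = 0.2) -> list[tuple[str, float]]:
    """Re-rank (report_id, similarity) pairs, adding a fixed boost to
    reports tagged as rare so uncommon cases surface during retrieval."""
    def score(item: tuple[str, float]) -> float:
        report_id, sim = item
        return sim + (boost if report_id in rare_labels else 0.0)
    return sorted(candidates, key=score, reverse=True)
```

A rare case that scores slightly below a common one on raw similarity can then still be surfaced, which is the behavior the discussion argues is needed for atypical presentations.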
Despite the positive results, the system with RAG has limitations, especially with rare cases, and the potential diagnoses could be influenced by the contexts of the queried cases, reducing the number of suggested diagnoses. Nonetheless, the ability of RAG to reference relevant cases that clinicians and readers can review adds a crucial layer of validation, reducing the potential risks associated with noise and complex multi-disease scenarios. This approach distinguishes it from the direct application of LLMs to PET reading-related questions. Additional optimized methods for using RAG to identify rare cases and incorporate more context will enhance the system's performance. In addition, while our evaluation relied on expert judgment as the gold standard, we acknowledge the inherent subjectivity in human assessments, which may impact reproducibility. To address this, we have provided the prompts used in Supplementary Table 1, allowing for testing across various LLM systems. Future studies should incorporate objective metrics and more diverse, representative datasets to further enhance the generalizability and robustness of our approach.
The application of our system extends beyond diagnostic support, serving as a valuable educational resource. By facilitating access to similar cases, it enables medical practitioners and trainees to explore diverse clinical scenarios.
Number: 1711137868, RS-2020-KD000006) and the NAVER Digital
Bio Innovation Research Fund, funded by NAVER Corporation (Grant
No. 3720230020).
Data availability Due to personal information protection policies, the complete datasets of reading reports are not available outside the hospital server. Sample data are included in the supplementary materials, and their related contents can be provided by the corresponding author upon reasonable request.
Declarations
Competing interests H.C. is a co-founder of Portrai.
Ethics approval The retrospective analysis of human data and the
waiver of informed consent were approved by the Institutional Review
Board of the Seoul National University Hospital (No. 2401-090-1501).
Consent to participate Written informed consent was acquired from
all patients.
Consent to publish Not applicable.
Open Access This article is licensed under a Creative Commons
Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format,
as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence, and indicate
if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless
indicated otherwise in a credit line to the material. If material is not
included in the article’s Creative Commons licence and your intended
use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright
holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
1. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29:1930–40.
2. Elkassem AA, Smith AD. Potential use cases for ChatGPT in radiology reporting. Am J Roentgenol. 2023.
3. Doshi R, Amin K, Khosla P, Bajaj S, Chheang S, Forman HP. Utilizing large language models to simplify radiology reports: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Bard, and Microsoft Bing. medRxiv. 2023:2023.06.04.23290786.
4. Alberts IL, Mercolli L, Pyka T, Prenosil G, Shi K, Rominger A, et al. Large language models (LLM) and ChatGPT: what will the impact on nuclear medicine be? Eur J Nucl Med Mol Imaging. 2023;50:1549–52.
5. Monshi MMA, Poon J, Chung V. Deep learning in generating radiology reports: a survey. Artif Intell Med. 2020;106:101878.
6. Tie X, Shin M, Pirasteh A, Ibrahim N, Huemann Z, Castellino SM, et al. Personalized impression generation for PET reports using large language models. J Imaging Inf Med. 2024:1–18.
7. Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv Neural Inf Process Syst. 2020;33:9459–74.
Our approach leverages LLMs tailored to individual hospital settings through RAG without requiring complex LLM training, demonstrating its potential utility in this report. Among the future challenges, exploring the integration of multimodal data, such as combining visual and textual analysis, is identified as an essential step forward. This future direction promises not only to mitigate the limitations but also to further enrich the system's utility by providing a more holistic approach to medical query answering and decision support.
Conclusion
In conclusion, our suggested AI framework affirms the transformative potential of AI-assisted tools in nuclear medicine, particularly in the context of PET imaging report analysis. The integration of a RAG LLM with a comprehensive PET imaging report database demonstrated feasibility for use in real-world clinical routines in nuclear medicine, particularly for imaging interpretation and reporting. This approach enhances the workflow of nuclear medicine physicians and the relevance of PET report generation, possibly supporting decision-making and providing educational benefits. It underscores the potential role of AI in improving the quality and efficacy of medical care within nuclear medicine. Furthermore, as we look to the future, the development of better LLMs and multimodal models stands as a pivotal next step in overcoming current limitations and fully realizing the benefits of AI in medical imaging. This proof-of-concept study and the proposed framework demonstrated the feasibility of using LLMs in the clinical routine of nuclear medicine, particularly by leveraging large report databases, and showed promise for improving diagnostics, education, and patient management.
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s00259-025-07101-9.
Acknowledgements We employed ChatGPT, developed by OpenAI, exclusively for grammatical corrections and enhancements in clarity, and it did not generate any new content.
Author contributions Conceptualization and design: H.C.; data acquisition: H.C., Y.K., and M.S.; data analysis: H.C. and D.J.L.; original draft preparation: H.C. and Y.K.; review and editing: H.C., Y.K., and M.S.; supervision: H.C.; funding acquisition: H.C. and Y.K. All authors have read and agreed to the submission of the manuscript.
Funding Open Access funding enabled and organized by Seoul Na-
tional University Hospital.
This research was supported by Korea Medical Device Development
Fund grant funded by the Korea government (the Ministry of Science
and ICT, the Ministry of Trade, Industry and Energy, the Ministry of
Health & Welfare, the Ministry of Food and Drug Safety) (Project
17. Minssen T, Vayena E, Cohen IG. The challenges for regulating medical use of ChatGPT and other large language models. JAMA. 2023.
18. Meskó B, Topol EJ. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit Med. 2023;6:120.
19. Wang G, Ran J, Tang R, Chang C-Y, Chuang Y-N, Liu Z, et al. Assessing and enhancing large language models in rare disease question-answering. arXiv preprint arXiv:2408.08422. 2024.
20. Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of hallucination in natural language generation. ACM Comput Surv. 2023;55:1–38.
21. Alkaissi H, McFarlane SI. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus. 2023;15.
22. Currie G, Barry K. ChatGPT in nuclear medicine education. J Nucl Med Technol. 2023;51:247–54.
23. Silva W, Poellinger A, Cardoso JS, Reyes M. Interpretability-guided content-based medical image retrieval. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part I. Springer; 2020. pp. 305–14.
24. Choe J, Hwang HJ, Seo JB, Lee SM, Yun J, Kim M-J, et al. Content-based image retrieval by using deep learning for interstitial lung disease diagnosis with chest CT. Radiology. 2022;302:187–97.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
8. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, et al. Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. 2023.
9. Topsakal O, Akinci TC. Creating large language model applications utilizing LangChain: a primer on developing LLM apps fast. In: International Conference on Applied Engineering and Natural Sciences; 2023. pp. 1050–6.
10. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9.
11. Lin C-Y. ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out; 2004. pp. 74–81.
12. Gulden C, Kirchner M, Schüttler C, Hinderer M, Kampf M, Prokosch H-U, et al. Extractive summarization of clinical trial descriptions. Int J Med Informatics. 2019;129:114–21.
13. Huemann Z, Lee C, Hu J, Cho SY, Bradshaw TJ. Domain-adapted large language models for classifying nuclear medicine reports. Radiol Artif Intell. 2023;5:e220281.
14. Garcia EV. Integrating artificial intelligence and natural language processing for computer-assisted reporting and report understanding in nuclear cardiology. J Nucl Cardiol. 2023;30:1180–90.
15. Mithun S, Jha AK, Sherkhane UB, Jaiswar V, Purandare NC, Rangarajan V, et al. Development and validation of deep learning and BERT models for classification of lung cancer radiology reports. Inf Med Unlocked. 2023:101294.
16. Zakka C, Shad R, Chaurasia A, Dalal AR, Kim JL, Moor M, et al. Almanac—retrieval-augmented language models for clinical medicine. NEJM AI. 2024;1:AIoa2300068.
1 3
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
1.
2.
3.
4.
5.
6.
Terms and Conditions
Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center GmbH (“Springer Nature”).
Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers and authorised users (“Users”), for small-
scale personal, non-commercial use provided that all copyright, trade and service marks and other proprietary notices are maintained. By
accessing, sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of use (“Terms”). For these
purposes, Springer Nature considers academic use (by researchers and students) to be non-commercial.
These Terms are supplementary and will apply in addition to any applicable website terms and conditions, a relevant site licence or a personal
subscription. These Terms will prevail over any conflict or ambiguity with regards to the relevant terms, a site licence or a personal subscription
(to the extent of the conflict or ambiguity only). For Creative Commons-licensed articles, the terms of the Creative Commons license used will
apply.
We collect and use personal data to provide access to the Springer Nature journal content. We may also use these personal data internally within
ResearchGate and Springer Nature and as agreed share it, in an anonymised way, for purposes of tracking, analysis and reporting. We will not
otherwise disclose your personal data outside the ResearchGate or the Springer Nature group of companies unless we have your permission as
detailed in the Privacy Policy.
While Users may use the Springer Nature journal content for small scale, personal non-commercial use, it is important to note that Users may
not:
use such content for the purpose of providing other users with access on a regular or large scale basis or as a means to circumvent access
control;
use such content where to do so would be considered a criminal or statutory offence in any jurisdiction, or gives rise to civil liability, or is
otherwise unlawful;
falsely or misleadingly imply or suggest endorsement, approval , sponsorship, or association unless explicitly agreed to by Springer Nature in
writing;
use bots or other automated methods to access the content or redirect messages
override any security feature or exclusionary protocol; or
share the content in order to create substitute for Springer Nature products or services or a systematic database of Springer Nature journal
content.
In line with the restriction against commercial use, Springer Nature does not permit the creation of a product or service that creates revenue,
royalties, rent or income from our content or its inclusion as part of a paid for service or for other commercial gain. Springer Nature journal
content cannot be used for inter-library loans and librarians may not upload Springer Nature journal content on a large scale into their, or any
other, institutional repository.
These terms of use are reviewed regularly and may be amended at any time. Springer Nature is not obligated to publish any information or
content on this website and may remove it or features or functionality at our sole discretion, at any time with or without notice. Springer Nature
may revoke this licence to you at any time and remove access to any copies of the Springer Nature journal content which have been saved.
To the fullest extent permitted by law, Springer Nature makes no warranties, representations or guarantees to Users, either express or implied
with respect to the Springer nature journal content and all parties disclaim and waive any implied warranties or warranties imposed by law,
including merchantability or fitness for any particular purpose.
Please note that these rights do not automatically extend to content, data or other material published by Springer Nature that may be licensed
from third parties.
If you would like to use or distribute our Springer Nature journal content to a wider audience or on a regular basis or in any other manner not
expressly permitted by these Terms, please contact Springer Nature at
onlineservice@springernature.com
... An additional potential use of LLMs is improving references to prior reports, narrowing down differential diagnoses and supporting clinician decision making. The combination of a retrieval-augmented generation (RAG) LLM system with an extensive database of previous PET imaging reports from patients with breast cancer, lung cancer, and lymphoma enabled the identification of similar cases and the extraction of potential diagnoses based on those cases [31]. ...
Article
Full-text available
Recently, there has been tremendous interest on the use of large language models (LLMs) in radiology. LLMs have been employed for various applications in cancer imaging, including improving reporting speed and accuracy via generation of standardized reports, automating the classification and staging of abnormal findings in reports, incorporating appropriate guidelines, and calculating individualized risk scores. Another use of LLMs is their ability to improve patient comprehension of imaging reports with simplification of the medical terms and possible translations to multiple languages. Additional future applications of LLMs include multidisciplinary tumor board standardizations, aiding patient management, and preventing and predicting adverse events (contrast allergies, MRI contraindications) and cancer imaging research. However, limitations such as hallucinations and variable performances could present obstacles to widespread clinical implementation. Herein, we present a review of the current and future applications of LLMs in cancer imaging, as well as pitfalls and limitations.
... The copyright holder for this preprint this version posted April 1, 2025. ; environments (AlGhadban et al., 2023;Alonso et al., 2024;Choi et al., 2024;. Specific applications were identified in ophthalmology diagnostics and the analysis of multimodal patient data, suggesting broader opportunities for integrating RAG AI into medical training and diagnostic processes (Upadhyaya et al., 2024). ...
Preprint
Full-text available
Background: Retrieval-augmented generation (RAG) is an emerging artificial intelligence (AI) strategy that integrates encoded model knowledge with external data sources to enhance accuracy, transparency, and reliability. Unlike traditional large language models (LLMs), which are limited by static training data and potential misinformation, RAG dynamically retrieves and integrates relevant medical literature, clinical guidelines, and real-time data. Given the rapid adoption of AI in healthcare, this scoping review aims to systematically map the current applications, implementation challenges, and research gaps related to RAG in health professions. Methods: A scoping review was conducted following the Joanna Briggs Institute (JBI) framework and reported using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses for Scoping Reviews (PRISMA-ScR) guidelines. A systematic search strategy was designed in collaboration with faculty and a research and education librarian to include PubMed, Scopus, Embase, Google Scholar, and Trip, covering studies published between January 2020 and August 2024. Eligible studies examined the use of RAG in healthcare. Studies were screened in two stages: title/abstract review followed by full-text assessment. Data extraction focused on study characteristics, applications of RAG, ethical and technical challenges, and proposed improvements. Results: A total of 31 studies met inclusion criteria, with 90.32% published in 2024. Authors came from 17 countries with the most frequent publications coming from the USA (n = 15), China (n = 3), and the Republic of Korea (n = 3). Key applications included clinical decision support, healthcare education, and pharmacovigilance. Ethical concerns centered on data privacy, algorithmic bias, explainability, and potential overreliance on AI-generated recommendations. Bias mitigation strategies included dataset diversification, fine-tuning techniques, and expert oversight. 
Transparency measures such as structured citations, traceable information retrieval, and explainable diagnostic pathways were explored to enhance clinician trust in AI-generated outputs. Identified challenges included optimizing retrieval mechanisms, improving real-time integration, and standardizing validation frameworks. Conclusion: RAG AI has the potential to improve clinical decision-making and healthcare education by addressing key limitations of traditional LLMs. However, significant challenges remain regarding ethical implementation, model reliability, and regulatory oversight. Future research should prioritize refining retrieval accuracy, strengthening bias mitigation strategies, and establishing standardized evaluation metrics. Responsible deployment of RAG-based systems requires interdisciplinary collaboration between AI researchers, clinicians, and policymakers to ensure ethical, transparent, and effective integration into healthcare workflows.
Article
Full-text available
This study focuses on the utilization of Large Language Models (LLMs) for the rapid development of applications, with a spotlight on LangChain, an open-source software library. LLMs have been rapidly adopted due to their capabilities in a range of tasks, including essay composition, code writing, explanation, and debugging, with OpenAI’s ChatGPT popularizing their usage among millions ofusers. The crux of the study centers around LangChain, designed to expedite the development of bespoke AI applications using LLMs. LangChain has been widely recognized in the AI community for its ability to seamlessly interact with various data sources and applications. The paper provides an examination of LangChain's core features, including its components and chains, acting as modular abstractions and customizable, use-case-specific pipelines, respectively. Through a series of practical examples, the study elucidates the potential of this framework in fostering the swift development of LLM-based applications.
Article
Full-text available
The rapid advancements in artificial intelligence (AI) have led to the development of sophisticated large language models (LLMs) such as GPT-4 and Bard. The potential implementation of LLMs in healthcare settings has already garnered considerable attention because of their diverse applications that include facilitating clinical documentation, obtaining insurance pre-authorization, summarizing research papers, or working as a chatbot to answer questions for patients about their specific data and concerns. While offering transformative potential, LLMs warrant a very cautious approach since these models are trained differently from AI-based medical technologies that are regulated already, especially within the critical context of caring for patients. The newest version, GPT-4, that was released in March, 2023, brings the potentials of this technology to support multiple medical tasks; and risks from mishandling results it provides to varying reliability to a new level. Besides being an advanced LLM, it will be able to read texts on images and analyze the context of those images. The regulation of GPT-4 and generative AI in medicine and healthcare without damaging their exciting and transformative potential is a timely and critical challenge to ensure safety, maintain ethical standards, and protect patient privacy. We argue that regulatory oversight should assure medical professionals and patients can use LLMs without causing harm or compromising their data or privacy. This paper summarizes our practical recommendations for what we can expect from regulators to bring this vision to reality.
Article
Full-text available
Purpose: Manual cohort building from radiology reports can be tedious. Natural Language Processing (NLP) can be used for automated cohort building. In this study, we have developed and validated an NLP approach based on deep learning (DL) to select lung cancer reports from a thoracic disease management group cohort. Materials and methods: 4064 radiology reports (CT and PET/CT) of a thoracic disease management group reported between 2014 and 2016 were used. These reports were anonymised, cleaned, text normalized and split into a training, testing, and validation set. External validation was performed on radiology reports from the MIMIC-III clinical database. We used three DL models, namely, Bi-LSTM_simple, Bi-LSTM_dropout, and Pre-trained _BERT model to predict if a report concerned lung cancer. We studied the effect of minority oversampling on all models. Results: Without oversampling, the F1 scores at 95% CI for Bi-LSTM_simple, Bi-LSTM_dropout and BERT were 0.89, 0.90, and 0.86; with oversampling, the F1 scores were 0.94, 0.94, and 0.9, on internal validation. On external validation the F1-scores of Bi-LSTM_simple, Bi-LSTM_dropout and BERT models were 0.63, 0.77 and 0.80 without oversampling and 0.72, 0.78 and 0.77 with oversampling. Conclusion: Pre-trained BERT model and Bi-LSTM_dropout models to predict a lung cancer report showed consistent performance on internal and external validation with the BERT model exhibiting superior performance. The overall F1 score decreased on external validation for both Bi-LSTM models with the Bi-LSTM_simple model showing a more significant drop. All models showed some improvement on minority oversampling.
Preprint
Full-text available
This paper investigates the application of Large Language Models (LLMs), specifically OpenAI's ChatGPT-3.5, ChatGPT-4.0, Google Bard, and Microsoft Bing, in simplifying radiology reports, thus potentially enhancing patient understanding. We examined 254 anonymized radiology reports from diverse examination types and used three different prompts to guide the LLMs' simplification processes. The resulting simplified reports were evaluated using four established readability indices. All LLMs significantly simplified the reports, but performance varied based on the prompt used and the specific model. The ChatGPT models performed best when additional context was provided (i.e., specifying user as a patient or requesting simplification at the 7th grade level). Our findings suggest that LLMs can effectively simplify radiology reports, although improvements are needed to ensure accurate clinical representation and optimal readability. These models have the potential to improve patient health literacy, patient-provider communication, and ultimately, health outcomes.
Article
Purpose To determine if fine-tuned large language models (LLMs) can generate accurate, personalized impressions for whole-body PET reports. Materials and Methods Twelve language models were trained on a corpus of PET reports using the teacher-forcing algorithm, with the report findings as input and the clinical impressions as reference. An extra input token encodes the reading physician’s identity, allowing models to learn physician-specific reporting styles. Our corpus comprised 37,370 retrospective PET reports collected from our institution between 2010 and 2022. To identify the best LLM, 30 evaluation metrics were benchmarked against quality scores from two nuclear medicine (NM) physicians, with the most aligned metrics selecting the model for expert evaluation. In a subset of data, model-generated impressions and original clinical impressions were assessed by three NM physicians according to 6 quality dimensions (3-point scale) and an overall utility score (5-point scale). Each physician reviewed 12 of their own reports and 12 reports from other physicians. Bootstrap resampling was used for statistical analysis. Results Of all evaluation metrics, domain-adapted BARTScore and PEGASUSScore showed the highest Spearman’s ρ correlations (ρ=0.568 and 0.563) with physician preferences. Based on these metrics, the fine-tuned PEGASUS model was selected as the top LLM. When physicians reviewed PEGASUS-generated impressions in their own style, 89% were considered clinically acceptable, with a mean utility score of 4.08 out of 5. Physicians rated these personalized impressions as comparable in overall utility to the impressions dictated by other physicians (4.03, P=0.41). Conclusion Personalized impressions generated by PEGASUS were clinically useful, highlighting its potential to expedite PET reporting.
Article
Purpose: To evaluate the impact of domain adaptation on the performance of language models in predicting five-point Deauville scores on the basis of clinical fluorine 18 fluorodeoxyglucose PET/CT reports. Materials and methods: The authors retrospectively retrieved 4542 text reports and images for fluorodeoxyglucose PET/CT lymphoma examinations from 2008 to 2018 in the University of Wisconsin-Madison institutional clinical imaging database. Of these total reports, 1664 had Deauville scores that were extracted from the reports and served as training labels. The bidirectional encoder representations from transformers (BERT) model and initialized BERT models BioClinicalBERT, RadBERT, and RoBERTa were adapted to the nuclear medicine domain by pretraining using masked language modeling. These domain-adapted models were then compared with the non-domain-adapted versions on the task of five-point Deauville score prediction. The language models were compared against vision models, multimodal vision-language models, and a nuclear medicine physician, with sevenfold Monte Carlo cross-validation. Means and SDs for accuracy are reported, with P values from paired t testing. Results: Domain adaptation improved the performance of all language models (P = .01). For example, BERT improved from 61.3% ± 2.9 (SD) five-class accuracy to 65.7% ± 2.2 (P = .01) following domain adaptation. Domain-adapted RoBERTa (named DA RoBERTa) performed best, achieving 77.4% ± 3.4 five-class accuracy; this model performed similarly to its multimodal counterpart (named Multimodal DA RoBERTa) (77.2% ± 3.2) and outperformed the best vision-only model (48.1% ± 3.5, P ≤ .001). A physician given the task on a subset of the data had a five-class accuracy of 66%. 
Conclusion: Domain adaptation improved the performance of large language models in predicting Deauville scores in PET/CT reports.

Keywords: Lymphoma, PET, PET/CT, Transfer Learning, Unsupervised Learning, Convolutional Neural Network (CNN), Nuclear Medicine, Deauville, Natural Language Processing, Multimodal Learning, Artificial Intelligence, Machine Learning, Language Modeling

Supplemental material is available for this article. © RSNA, 2023. See also the commentary by Abajian in this issue.
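The domain adaptation step described above is masked language modeling over nuclear medicine reports. A minimal sketch of the token-masking step, assuming the conventional 15% mask rate and simple whitespace tokenization (the actual pipeline would use a subword tokenizer and is not specified in the abstract):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=1):
    """Randomly mask tokens for masked language modeling.

    Returns (masked_tokens, labels): labels hold the original token at
    masked positions and None elsewhere (unmasked positions are not
    scored in the MLM loss).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)  # position excluded from the loss
    return masked, labels

# Illustrative report fragment; pretraining on such text adapts a general
# BERT-style model to nuclear medicine vocabulary before fine-tuning on
# Deauville score prediction.
report = "fdg pet ct shows deauville score 4 in mediastinal nodes".split()
masked, labels = mask_tokens(report)
print(masked)
```

Training the model to recover the masked tokens exposes it to domain-specific vocabulary (tracer names, Deauville terminology) before the supervised classification stage.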
Large language models (LLMs) can respond to free-text queries without being specifically trained in the task in question, causing excitement and concern about their use in healthcare settings. ChatGPT is a generative artificial intelligence (AI) chatbot produced through sophisticated fine-tuning of an LLM, and other tools are emerging through similar developmental processes. Here we outline how LLM applications such as ChatGPT are developed, and we discuss how they are being leveraged in clinical settings. We consider the strengths and limitations of LLMs and their potential to improve the efficiency and effectiveness of clinical, educational and research work in medicine. LLM chatbots have already been deployed in a range of biomedical contexts, with impressive but mixed results. This review acts as a primer for interested clinicians, who will determine if and how LLM technology is used in healthcare for the benefit of patients and practitioners.
Academic integrity has been challenged by artificial intelligence algorithms in teaching institutions, including those providing nuclear medicine training. The GPT-3.5-powered ChatGPT chatbot released in late November 2022 has emerged as an immediate threat to academic and scientific writing. Methods: Both examinations and written assignments for nuclear medicine courses were tested using ChatGPT. The tested material included a mix of core theory subjects offered in the second and third years of the nuclear medicine science course. Long-answer-style questions (8 subjects) and calculation-style questions (2 subjects) were included for examinations. ChatGPT was also used to produce responses to authentic writing tasks (6 subjects). ChatGPT responses were evaluated by Turnitin plagiarism-detection software for similarity and artificial intelligence scores, scored against standardized rubrics, and compared with the mean performance of student cohorts. Results: ChatGPT powered by GPT-3.5 performed poorly in the 2 calculation examinations (overall, 31.7% compared with 67.3% for students), with particularly poor performance in complex-style questions. ChatGPT failed each of 6 written tasks (overall, 38.9% compared with 67.2% for students), with worsening performance corresponding to increasing writing and research expectations in the third year. In the 8 examinations, ChatGPT performed better than students for general or early subjects but poorly for advanced and specific subjects (overall, 51% compared with 57.4% for students). Conclusion: Although ChatGPT poses a risk to academic integrity, its usefulness as a cheating tool can be constrained by higher-order taxonomies. Unfortunately, the constraints to higher-order learning and skill development also undermine potential applications of ChatGPT for enhancing learning. Nonetheless, there remain several potential applications of ChatGPT for teaching nuclear medicine students.
This Viewpoint discusses how regulators across the world should approach the legal and ethical challenges, including privacy, device regulation, competition, intellectual property rights, cybersecurity, and liability, raised by the medical use of large language models.