Lee et al. European Radiology
https://doi.org/10.1007/s00330-024-11339-6
IMAGING INFORMATICS AND ARTIFICIAL INTELLIGENCE | Open Access
CXR-LLaVA: a multimodal large language model for interpreting chest X-ray images

Seowoo Lee¹, Jiwon Youn², Hyungjin Kim¹, Mansu Kim²* and Soon Ho Yoon¹*
Abstract
Objective This study aimed to develop an open-source multimodal large language model (CXR-LLaVA) for interpreting chest X-ray images (CXRs), leveraging recent advances in large language models (LLMs) to potentially replicate the image interpretation skills of human radiologists.
Materials and methods For training, we collected 592,580 publicly available CXRs, of which 374,881 had labels for certain radiographic abnormalities (Dataset 1) and 217,699 provided free-text radiology reports (Dataset 2). After pre-training a vision transformer with Dataset 1, we integrated it with an LLM influenced by the LLaVA network. Then, the model was fine-tuned, primarily using Dataset 2. The model's diagnostic performance for major pathological findings was evaluated, along with the acceptability of radiologic reports by human radiologists, to gauge its potential for autonomous reporting.
Results The model demonstrated impressive performance in test sets, achieving an average F1 score of 0.81 for six major pathological findings in the MIMIC internal test set and 0.56 for six major pathological findings in the external test set. The model's F1 scores surpassed those of GPT-4-vision and Gemini-Pro-Vision in both test sets. In human radiologist evaluations of the external test set, the model achieved a 72.7% success rate in autonomous reporting, slightly below the 84.0% rate of ground truth reports.
Conclusion This study highlights the significant potential of multimodal LLMs for CXR interpretation, while also acknowledging the performance limitations. Despite these challenges, we believe that making our model open-source will catalyze further research, expanding its effectiveness and applicability in various clinical contexts.
Key Points
Question How can a multimodal large language model be adapted to interpret chest X-rays and generate radiologic reports?
Findings The developed CXR-LLaVA model effectively detects major pathological findings in chest X-rays and generates radiologic reports with higher accuracy than general-purpose models.
Clinical relevance This study demonstrates the potential of multimodal large language models to support radiologists by autonomously generating chest X-ray reports, potentially reducing diagnostic workloads and improving radiologist efficiency.
Keywords Radiography (thoracic), Thorax, Deep learning, Image interpretation, Image interpretation (computer-assisted)
© The Author(s) 2025. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
*Correspondence: Mansu Kim, mansu.kim@gist.ac.kr; Soon Ho Yoon, yshoka@snu.ac.kr
¹ Department of Radiology, Seoul National University College of Medicine, Seoul National University Hospital, Seoul, Republic of Korea
² AI Graduate School, Gwangju Institute of Science and Technology, Gwangju, Republic of Korea
Introduction
Advances in deep learning, marked by the emergence of convolutional neural networks (CNNs) and vision transformers (ViTs), have profoundly impacted radiology [1–3]. Numerous deep learning algorithms have made their way into practical, commercial applications. However, while CNNs and ViTs are adept at specific tasks, such as classification and segmentation, this specialization could limit their ability to address multifaceted challenges in areas such as radiology. Concurrently, the natural language processing domain has witnessed significant breakthroughs, enabling large language models (LLMs), such as ChatGPT, to understand and generate human-like text with remarkable proficiency and unprecedented performance levels in linguistic tasks ranging from text generation to translation [4]. The integration of natural language processing and image processing technologies has led to the development of models that have set new benchmarks in the field, such as contrastive language-image pre-training (CLIP) [5] and the bootstrapping language-image pre-training (BLIP-2) model, which was introduced in 2023 and can interpret the context within images and generate detailed captions [6].
Most LLMs have primarily focused on text processing. However, there is a growing trend towards a multimodal approach involving processing of image, text, and even video data. OpenAI and Google have released general-purpose multimodal models (GPT-4-vision and Gemini-Pro-Vision, respectively). Furthermore, the Large Language and Vision Assistant (LLaVA), an open-source project combining vision encoding with an LLM, has demonstrated exemplary performance across a range of visual tasks [7]. However, it remains unclear how effective these general-purpose models are at interpreting chest X-rays (CXRs). Within the medical domain, there are few specific multimodal models. Google has published results for ELIXR, a model capable of interpreting CXRs, but this model is not publicly available [8]. Similarly, the open-source LLaVA-MED, a model tuned to the medical domain, has been released. However, detailed insights into its proficiency in interpreting CXRs remain limited [9].
Radiologists' workload has significantly increased over the past three decades, potentially impacting the accuracy of radiologic diagnoses [10]. In response, numerous studies have explored the use of deep learning models to improve diagnostic accuracy and reduce the burden on radiologists [11]. Building on this line of research, our study employed the latest technology, a multimodal LLM, to attempt radiologic report generation for CXRs. This study aimed to develop a multimodal LLM designed for CXR interpretation, while also exploring its potential for autonomous CXR reporting.
A preliminary version of this work has been made
publicly available as a preprint on arXiv [12].
Materials and methods
This retrospective study solely used publicly available
datasets and did not require institutional review board
approval.
Data collection
For model training, we included several public CXR datasets, collecting a total of 592,580 frontal CXRs (Table 1) [13–20]. The Medical Information Mart for Intensive Care (MIMIC) dataset provides radiologic reports in a free-text form (Dataset 2, n = 217,699), while the other training datasets have multi-class or binary labeling for radiographic abnormalities (Dataset 1, n = 374,881). Some datasets contain information regarding lesions' location, but this information was not utilized.
Adapting a multimodal LLM to CXRs (CXR-LLaVA)
A model influenced by the LLaVA network was developed [7]. LLaVA, which consists of an LLM and an image encoder, converts images into a sequence of image tokens that are then combined with query text tokens for text generation within the LLM. Our primary objective was to fine-tune LLaVA using CXR–radiologic report pairs.
To achieve optimal performance, we developed a custom image encoder from scratch rather than using pretrained weights. We empirically employed the ViT-L/16 version of the vision transformer as the image encoder. This encoder begins with a convolutional layer that processes 1-channel grayscale CXR images into 1024-dimensional patches. These patches are passed through a series of 24 residual attention blocks, each containing multi-head attention mechanisms and multilayer perceptrons. The output from these blocks is normalized through normalization layers and eventually projected into a higher-dimensional space suitable for multimodal processing. Following the vision encoder, the multimodal projector linearly transforms the 1024-dimensional image tokens into a 4096-dimensional space. These tokens are then integrated into the language model component. In alignment with LLaVA's framework, we utilized the Large Language Model Meta AI (LLaMA)-2 as our language model [21]. We selected the version with 7 billion parameters due to cost considerations.
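To make the architecture above concrete, the following PyTorch sketch wires a ViT-L/16-style encoder for 1-channel CXRs to a linear multimodal projector. Only the details stated in the text (grayscale input, 1024-dimensional patch tokens, 24 residual attention blocks, a 1024-to-4096 projection feeding the 7-billion-parameter LLaMA-2) are taken from the paper; the 512 × 512 input size is mentioned later in the Discussion, and every other hyperparameter and name here is an assumption rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class CXRVisionEncoder(nn.Module):
    """ViT-L/16-style encoder for 1-channel grayscale CXRs (illustrative sketch)."""
    def __init__(self, img_size=512, patch=16, dim=1024, depth=24, heads=16):
        super().__init__()
        # Convolutional stem: patchify a grayscale CXR into 1024-dimensional tokens
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        # 24 residual attention blocks (multi-head attention + MLP), pre-norm
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                           # x: (B, 1, 512, 512)
        tokens = self.patch_embed(x)                # (B, 1024, 32, 32)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, 1024 patches, 1024 dims)
        return self.norm(self.blocks(tokens + self.pos_embed))

class MultimodalProjector(nn.Module):
    """Linear map from 1024-dim image tokens to the 4096-dim LLaMA-2-7B embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_tokens):
        return self.proj(image_tokens)              # (B, n_patches, 4096)
```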
The final CXR-LLaVA takes a CXR image and question prompt as input; the image is transformed into image tokens via an image encoder, and the prompt is converted to text tokens through a tokenizer. Both are then fed into a causal language model, which autoregressively generates text responses to the questions. The trained model is available as open-source (https://github.com/ECOFRI/CXR_LLaVA), and its demo can be found at https://radiologist.app/cxr-llava/. Additionally, a comprehensive model card detailing the model's intended use cases,
out-of-scope use, and limitations is provided on the same
website to ensure transparency and facilitate further
research.
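As a rough orientation for readers who want to try the released checkpoint, the snippet below shows one plausible way to load it with the Hugging Face transformers library. The repository identifier and the report-generation method name are assumptions for illustration only; the exact entry points are documented in the GitHub repository and model card linked above and may differ.

```python
# Hypothetical usage sketch; consult the repository and model card for the real API.
from PIL import Image
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "ECOFRI/CXR_LLaVA",        # placeholder identifier; the published checkpoint may use another name
    trust_remote_code=True,    # the repository ships custom model code
)
cxr = Image.open("chest_xray.png").convert("L")      # 1-channel grayscale input
report = model.write_radiologic_report(cxr)          # assumed convenience method
print(report)
```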
Training step 1: constructing and training a CXR-specific image encoder
Despite the capabilities of pretrained image encoders in understanding common visual objects, they often fall short in accurately describing radiographic findings. In this section, we propose an image encoder based on ViT-L/16 and a two-step strategy for training it to learn the radiological context specific to CXR images.
In the first step, a simple classification task was used to train the image encoder (Fig. 1a). The image encoder transformed a CXR image into a representation and then classified an abnormality by adding a simple fully connected layer as a classifier. This classification task enabled the model to learn a fundamental yet crucial ability regarding abnormalities. We used 374,881 image-label pairs from Dataset 1 to train and validate our image encoder. We assigned binary labels: when images had labels associated with pathology, they were labeled as "abnormal," while those marked as "no finding" were designated "normal." The detailed implementation and settings are described in the Supplementary material.
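A minimal sketch of this first pretraining step is shown below, assuming the CXRVisionEncoder class from the earlier architecture sketch; the optimizer, learning rate, and mean pooling are illustrative choices rather than the authors' settings (which are given in the Supplementary material).

```python
import torch
import torch.nn as nn

encoder = CXRVisionEncoder()                 # from the architecture sketch above (assumption)
head = nn.Linear(1024, 1)                    # simple fully connected classifier
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)

def train_step(images, labels):
    """images: (B, 1, 512, 512) grayscale CXRs; labels: 1 = any pathology label, 0 = 'no finding'."""
    feats = encoder(images).mean(dim=1)      # mean-pool patch tokens -> (B, 1024)
    loss = criterion(head(feats).squeeze(1), labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. train_step(torch.randn(2, 1, 512, 512), torch.tensor([1, 0]))
```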
In the second step, the image encoder was further trained based on the CLIP strategy to learn complex representations of radiological terms (Fig. 1b) [5]. Using the CLIP strategy, the image encoder learned shared representations between image and text by mapping corresponding image and text pairs closer together and non-corresponding pairs further apart. For instance, an image showing signs of "pleural effusion" would have its corresponding text label vector "pleural effusion" mapped closely to its image vector. This ensures that the model can accurately associate the visual features of pleural effusion in CXRs with the correct textual description, thereby enhancing its ability to correctly identify and describe pleural effusion in new, unseen images. We chose pathological labels provided in the dataset, such as "atelectasis," "pneumonia," "pleural effusion" and so on. For images with multiple pathological labels, we connected them using commas. The 592,580 image-text pairs from Datasets 1 and 2 were used in the training and validating process. The performance of the trained image encoder was evaluated and compared with the final model; the detailed process and the performance evaluation are described in the Supplementary material.
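The CLIP-style objective described here reduces to a symmetric contrastive loss over a batch of matched CXR/label-text pairs; a minimal sketch is given below, with the temperature value and embedding shapes chosen for illustration only.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss for matched CXR / pathology-text pairs.

    image_emb, text_emb: (B, D) embeddings of corresponding pairs, e.g. an image
    showing pleural effusion and the text "pleural effusion" (or a comma-joined
    list of pathology labels for images with multiple findings).
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (B, B) cosine similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Matched pairs lie on the diagonal: pull them together, push mismatches apart.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```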
Training step 2: feature alignment and end-to-end fine-tuning of CXR-LLaVA
Before fine-tuning the CXR-LLaVA model, the features from the image encoder, as described in step 1, and the language model (i.e., LLaMA-2) were aligned through additional training, during which the image encoder and language model weights were frozen and only the projection matrix was updated.
Table 1 Countries of collection, years of publication, and numbers of frontal chest radiographs in the publicly available datasets used for model training and evaluation

Dataset | Country of collection | Year of publication | Training | Validation | Test

Training dataset 1: chest radiograph datasets with pathologic findings labeled
BrixIA COVID-19 dataset [13] | Italy | 2021 | 3755 | 470 | -
CheXpert train/validation dataset [14] | USA | 2019 | 152,983 | 19,123 | -
NIH dataset [15] | USA | 2017 | 70,671 | 8833 | -
PadChest dataset [16] | Spain | 2019 | 86,438 | 10,805 | -
RSNA COVID-19 AI Detection Challenge [17] | Various countries | 2021 | 5066 | 634 | -
VinDR dataset [18] | Vietnam | 2020 | 14,314 | 1789 | -
Subtotal | | | 333,227 | 41,654 | -
Training dataset 2: chest radiograph dataset with free-text radiologic reports
MIMIC dataset [19] | USA | 2019 | 193,513 | 24,186 | -
Internal test sets
MIMIC dataset (randomly selected) [19] | USA | 2019 | - | - | 3000
CheXpert test dataset [14] | USA | 2022 | - | - | 518
Subtotal | | | - | - | 3518
External test set
Indiana University dataset [20] | USA | 2016 | - | - | 3689
Fig. 1 CXR-LLaVA training process. (a) Initially, the image encoder was trained on a basic classification task to differentiate between normal and abnormal CXRs, thereby acquiring fundamental representations of CXRs. (b) Subsequently, the model underwent training with pairs of CXRs and their corresponding pathological findings. This training employed the contrastive language-image pre-training (CLIP) strategy to foster shared representations between images and text. (c) The image encoder was then assimilated into CXR-LLaVA, initiating the alignment of image representations with the large language model (LLM). In this phase, training focused on pairs of CXR images and radiologic reports, with updates confined to the projection layer. (d) Upon successful alignment of the image encoder with the LLM, an instruction fine-tuning process was undertaken. This involved a variety of radiologic reports and question-answer pairs, aiming to refine the model's capability to interpret CXRs and facilitate more informative interactions. Please note that the figure abstracts from the detailed neural network information, omitting elements such as tokenizer, batch normalization, projection, and linear classification layers.
The aligned image representation was computed by updating the projection matrix using CXR images with refined radiologic reports from Dataset 2 (Fig. 1c).
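A sketch of this alignment stage is given below, assuming placeholder objects encoder, projector, and llm (a Hugging Face causal LLaMA-2 model) plus a loader of CXR/report pairs; it only illustrates the idea of freezing everything except the projection matrix and is not the authors' released training code.

```python
import torch

# Freeze the image encoder and the LLaMA-2 weights; train only the projector.
for p in encoder.parameters():
    p.requires_grad = False
for p in llm.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(projector.parameters(), lr=2e-3)   # illustrative value

for images, report_ids in aligned_loader:        # CXRs paired with refined reports
    image_tokens = projector(encoder(images))    # (B, n_patches, 4096)
    text_embeds = llm.get_input_embeddings()(report_ids)
    inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
    # Causal-LM loss on the report tokens only; image positions are ignored (-100).
    ignore = torch.full(image_tokens.shape[:2], -100,
                        dtype=torch.long, device=report_ids.device)
    labels = torch.cat([ignore, report_ids], dim=1)
    loss = llm(inputs_embeds=inputs_embeds, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```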
After aligning the image features, CXR-LLaVA underwent an instruction-tuning process, which was critical for refining the model's interpretative capabilities (Fig. 1d). This process involved using refined radiology reports and multi-turn question-answer dialogs generated by GPT-4, all based on Dataset 2 (Supplementary materials).
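For illustration, one instruction-tuning record of the kind described here might look like the following Python structure (a single refined report plus GPT-4-generated follow-up turns); the field names, image path, and report text are hypothetical, and the authors' exact schema is described in the Supplementary materials.

```python
# Hypothetical example of a multi-turn instruction-tuning record.
example_record = {
    "image": "mimic/p10/p10000032/s50414267.jpg",   # placeholder image path
    "conversations": [
        {"from": "human",
         "value": "<image>\nWrite the radiologic report for this chest X-ray."},
        {"from": "assistant",
         "value": "Bilateral pleural effusions with adjacent atelectasis. No pneumothorax."},
        {"from": "human",
         "value": "Is there evidence of pulmonary edema?"},
        {"from": "assistant",
         "value": "No definite interstitial or alveolar edema is identified."},
    ],
}
```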
Internal and external test set composition
For internal model testing, we utilized a randomly selected MIMIC dataset comprising 3000 images and accompanying free-text radiologic reports [19]. These were not used during the model's training and validation phases. Additionally, we employed the CheXpert test dataset, which consists of 518 images, each binary labeled for 14 findings: atelectasis, cardiomegaly, consolidation, edema, enlarged cardiomediastinum, fracture, lung lesion, lung opacity, no finding, pleural effusion, pleural other, pneumonia, pneumothorax, and support devices [14]. For external model testing, we used a dataset from Indiana University, consisting of 3689 pairs of images and free-text radiologic reports [20].
Comparison with other multimodal LLMs
To evaluate the performance of our model, we compared its results with those of other publicly available multimodal LLMs, including OpenAI's GPT-4-vision and Google's Gemini-Pro-Vision. Despite being in a preview state and not being fine-tuned for CXR report generation, these general-purpose models have shown some potential. For instance, GPT-4-vision has demonstrated a limited ability to detect abnormalities in CXRs and the capacity to solve the United States Medical Licensing Examination tests [22, 23]. However, LLaVA-MED, a model fine-tuned for medical image analysis, failed to generate accurate radiologic reports from CXRs, producing nearly identical reports for diverse CXRs, and was therefore excluded from our study. Other models, such as ELIXR and Med-PALM, which claim the ability to interpret CXRs, were not publicly available and thus were not included in this analysis [8, 24] (Supplementary materials).
Internal test set evaluation
To evaluate the performance of radiologic report generation in the MIMIC internal test set, we utilized CheXpert-Labeler to generate pathological labels [14]. This tool analyzes free-text radiologic reports and generates labels such as positive, negative, or uncertain for each pathological finding (atelectasis, cardiomegaly, consolidation, edema, enlarged cardiomediastinum, fracture, lung lesion, lung opacity, no finding, pleural effusion, pleural other, pneumonia, pneumothorax, and support devices). We compared these labels from the model-generated reports with those from the original ground truth reports (Fig. 2a).
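Once both the ground-truth and model-generated reports have been run through CheXpert-Labeler, the comparison reduces to per-finding agreement between two label vectors. The sketch below assumes the labeler output has already been encoded as 1 = positive, 0 = negative, -1 = uncertain, and NaN = not mentioned; this encoding, like the function names, is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

def keep_definite(gt, pred):
    """Keep only reports where both labels are definitely positive or negative."""
    mask = np.isin(gt, [0, 1]) & np.isin(pred, [0, 1])
    return gt[mask].astype(int), pred[mask].astype(int)

def per_label_metrics(gt, pred):
    """gt, pred: 1-D arrays for one finding across all reports."""
    gt, pred = keep_definite(np.asarray(gt, float), np.asarray(pred, float))
    sensitivity = recall_score(gt, pred, pos_label=1)   # recall of the positive class
    specificity = recall_score(gt, pred, pos_label=0)   # recall of the negative class
    return {"f1": f1_score(gt, pred),
            "sensitivity": sensitivity,
            "specificity": specificity}
```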
For the CheXpert test set, which does not contain ground-truth radiologic reports, we instructed the model to generate binary labels for the same 14 findings. These labels were then compared with the ground truth. This dataset is identical to that used in a previous study where the CheXzero model exhibited expert-level pathology detection capabilities [25]. Therefore, we evaluated our model's performance against both CheXzero and the average diagnostic performance of three board-certified radiologists, as documented in the same publication (Fig. 2b).
External test set evaluation and human radiologist
evaluation
To evaluate the model's performance on the Indiana external test set, we employed the same methodology used for the MIMIC internal test set, which involved comparing the labels generated from the model's reports with the ground truth (Fig. 2a).
To assess the model's capability for autonomous or semi-autonomous reporting without human radiologist intervention, an evaluation was conducted involving three human radiologists. From the Indiana external test set, 25 abnormal images and 25 normal images were randomly selected. A total of 50 images were used to create 100 report-image pairs, with each image paired with a model-generated report and a ground truth report. The radiologists were presented with these 100 report-image pairs in a random order for evaluation. They rated the acceptability of each report on a 4-point scale: (1) totally acceptable without any revision, (2) acceptable with minor revision, (3) acceptable with major revision, and (4) unacceptable (Supplementary materials).
Statistical analysis
The model's performance in generating radiologic reports was assessed using accuracy, sensitivity, specificity, and F1 scores. Cases where the CheXpert-Labeler assigned an "uncertain" label or where the label was not mentioned (missing element) were excluded from our analysis. We included only definite positive or negative labels. Additionally, due to the scarce number of images with labels such as "pleural other" and "fractures," these were omitted from the analysis. The specific criteria for removing certain labels and the details of the excluded labels are outlined in the accompanying table. To estimate the confidence intervals of the accuracy, sensitivity, specificity, and F1 scores, we utilized non-parametric bootstrapping with 1000 iterations. For the evaluation conducted by human radiologists, the Cochran Q test was employed to determine the statistical significance of differences between the evaluations made by human radiologists and the model.
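The two statistical procedures named here are standard; a brief sketch follows, using scikit-learn and statsmodels. The percentile method and seed are assumptions, and the binary "success" encoding for the Cochran Q comparison (1 = report rated acceptable without or with only minor revision) simply mirrors the definition used later in the Results.

```python
import numpy as np
from sklearn.metrics import f1_score
from statsmodels.stats.contingency_tables import cochrans_q

def bootstrap_ci(gt, pred, metric=f1_score, n_boot=1000, seed=0):
    """Non-parametric bootstrap 95% CI for a metric over paired label vectors."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(gt), len(gt))   # resample reports with replacement
        stats.append(metric(gt[idx], pred[idx]))
    return np.percentile(stats, [2.5, 97.5])

# Cochran Q on paired binary outcomes, one column per report source, e.g.
#   success = np.column_stack([model_success, ground_truth_success])
#   result = cochrans_q(success)     # result.statistic, result.pvalue
```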
Results
Model performance on the internal test set
Table 2 illustrates the report generation capabilities of our model on the MIMIC internal test set. The model achieved an average F1 score of 0.81, a sensitivity of 0.80, and a specificity of 0.89 for six pathological labels, including cardiomegaly, consolidation, edema, pleural effusion, pneumonia, and pneumothorax. It demonstrated strong performance, with F1 scores exceeding 0.8, in identifying cardiomegaly, edema, and pleural effusion, while its ability to detect pneumothorax was weaker. Overall, the model exhibited higher average F1 scores than GPT-4-vision or Gemini-Pro-Vision.
Table 3 presents the model's pathology detection performance on the CheXpert internal test set [25]. The model achieved an average F1 score of 0.57, a sensitivity of 0.90, and a specificity of 0.67 for five pathological findings: atelectasis, cardiomegaly, consolidation, edema, and pleural effusion. While it performed relatively well in identifying lung opacity, atelectasis, and pleural effusion, its effectiveness in detecting consolidation was lower. This average F1 score of 0.57 is marginally lower than that of CheXzero, which achieved 0.61, and slightly below the 0.62 F1 score reported for human radiologists. No established F1 scores from CheXzero and human radiologists are available for diagnosing lung opacity and support devices, but our model demonstrated commendable F1 scores in detecting these conditions in CXR.
Figure 3 displays an example CXR, highlighting the format of the generated radiologic report. This report effectively pinpoints critical findings, such as bilateral pleural effusion, yet it occasionally overlooks specific details, such as the presence of a central catheter. In contrast, Fig. 4 shows that while the model appropriately recognized and described the left pleural effusion, it failed to describe the left pneumothorax and the left pleural drainage catheter.
Model performance on the external test set
In the external test set, the model produced an average F1 score of 0.56, a sensitivity of 0.63, and a specificity of 0.93 for detecting cardiomegaly, consolidation, edema, pleural effusion, pneumonia, and pneumothorax. It showed an excellent ability to detect cardiomegaly, edema, and pneumonia, but its performance in detecting pneumothorax was significantly weaker (Table 4). Overall, the model outperformed other models in this regard. A review of several examples showed that the model accurately identified and described the corresponding lesions (Figs. 5 and 6).
Fig. 2 Model evaluation flow diagram. (a) Evaluation of datasets with ground-truth free-text radiologic reports, including the MIMIC internal test set and the Indiana external test set. Pathologic labels were obtained using the CheXpert-Labeler from both the original reports and the model-generated reports, with a subsequent comparison of these results. (b) Evaluation of datasets with established ground-truth pathologic labels, specifically the CheXpert internal test set, involved directly generating pathologic labels from the model using a label generation prompt.
Table 2 Model performance with the MIMIC internal test set
Model performance for each pathologic label in the MIMIC internal test set. Values in each row are given in the order CXR-LLaVA, GPT-4-vision, Gemini-Pro-Vision, with confidence intervals in parentheses.
Accuracy
Cardiomegaly 0.79 (0.77, 0.82) 0.65 (0.62, 0.67) 0.65 (0.62, 0.67)
Consolidation 0.93 (0.91, 0.96) 0.80 (0.77, 0.83) 0.55 (0.51, 0.58)
Edema 0.81 (0.77, 0.85) 0.68 (0.61, 0.75) 0.61 (0.58, 0.64)
Pleural effusion 0.87 (0.85, 0.88) 0.66 (0.64, 0.68) 0.53 (0.51, 0.55)
Average for above four pathologies 0.85 (0.84, 0.86) 0.67 (0.65, 0.69) 0.58 (0.57, 0.59)
Pneumonia 0.69 (0.62, 0.76) 0.67 (0.61, 0.74) 0.71 (0.64, 0.77)
Pneumothorax 0.89 (0.88, 0.91) 0.88 (0.87, 0.90) 0.97 (0.91, 1.00)
Overall average 0.86 (0.85, 0.87) 0.73 (0.71, 0.74) 0.59 (0.58, 0.60)
Sensitivity
Cardiomegaly 0.88 (0.85, 0.90) 0.92 (0.90, 0.93) 0.98 (0.97, 0.99)
Consolidation 0.62 (0.48, 0.75) 0.17 (0.09, 0.27) 0.72 (0.65, 0.80)
Edema 0.85 (0.81, 0.89) 0.69 (0.59, 0.78) 0.80 (0.76, 0.83)
Pleural effusion 0.85 (0.82, 0.87) 0.31 (0.27, 0.35) 0.93 (0.92, 0.95)
Average for above four pathologies 0.85 (0.84, 0.87) 0.62 (0.59, 0.64) 0.91 (0.90, 0.92)
Pneumonia 0.53 (0.42, 0.64) 0.80 (0.73, 0.87) 0.95 (0.91, 0.98)
Pneumothorax 0.35 (0.28, 0.42) 0.02 (0.00, 0.04) 0.00 (0.00, 0.00)
Overall average 0.80 (0.78, 0.82) 0.61 (0.59, 0.64) 0.91 (0.90, 0.93)
Specificity
Cardiomegaly 0.55 (0.49, 0.61) 0.17 (0.13, 0.20) 0.04 (0.02, 0.06)
Consolidation 0.97 (0.96, 0.99) 0.90 (0.88, 0.93) 0.50 (0.45, 0.54)
Edema 0.75 (0.68, 0.81) 0.67 (0.56, 0.78) 0.39 (0.34, 0.43)
Pleural effusion 0.88 (0.86, 0.90) 0.85 (0.83, 0.87) 0.28 (0.26, 0.31)
Average for above four pathologies 0.85 (0.83, 0.86) 0.71 (0.69, 0.73) 0.29 (0.28, 0.31)
Pneumonia 0.87 (0.78, 0.95) 0.29 (0.16, 0.42) 0.14 (0.06, 0.23)
Pneumothorax 0.97 (0.96, 0.98) 0.99 (0.98, 0.99) 1.00 (1.00, 1.00)
Overall average 0.89 (0.88, 0.90) 0.79 (0.78, 0.80) 0.30 (0.28, 0.31)
F1 score
Cardiomegaly 0.86 (0.85, 0.88) 0.77 (0.75, 0.79) 0.78 (0.76, 0.80)
Consolidation 0.68 (0.57, 0.78) 0.20 (0.11, 0.29) 0.41 (0.36, 0.47)
Edema 0.84 (0.81, 0.87) 0.71 (0.63, 0.78) 0.69 (0.66, 0.72)
Pleural effusion 0.83 (0.81, 0.85) 0.39 (0.35, 0.43) 0.61 (0.58, 0.63)
Average for above four pathologies 0.84 (0.83, 0.85) 0.62 (0.60, 0.64) 0.67 (0.66, 0.68)
Pneumonia 0.65 (0.54, 0.74) 0.79 (0.73, 0.84) 0.82 (0.77, 0.86)
Pneumothorax 0.46 (0.37, 0.53) 0.03 (0.00, 0.07) 0.00 (0.00, 0.00)
Overall average 0.81 (0.80, 0.82) 0.62 (0.61, 0.64) 0.68 (0.66, 0.69)
In our analysis, specific labels such as "lung lesion," "lung opacity," "atelectasis," "pleural other," "fracture," and "support devices" were excluded due to their low frequency, being under 5% in either the negative or positive class or having a sample number below 10, which makes them statistically less significant for a balanced analysis. Additionally, the label "enlarged cardiomediastinum" was not included as it significantly overlaps with "cardiomegaly," which could lead to redundant data interpretations. For the labels "cardiomegaly," "consolidation," "edema," and "pleural effusion," we recorded the average score across multiple datasets to facilitate a comprehensive performance comparison of the model on these four critical labels.
The model achieved an excellent average F1 score of 0.81, outperforming the GPT-4-vision and Gemini-Pro-Vision models, which scored 0.62 and 0.68, respectively. The model's sensitivity of 0.80 was higher than GPT-4-vision's 0.61 but lower than Gemini-Pro-Vision's 0.91. Its specificity of 0.89 was higher than both GPT-4-vision's 0.79 and Gemini-Pro-Vision's 0.30.
Table 3 Model performance with the CheXpert internal test set
Model performance for each pathologic label in the CheXpert internal test set. Values in each row are given in the order CXR-LLaVA, GPT-4-vision, Gemini-Pro-Vision, CheXzero [25], Human radiologists [25], with confidence intervals in parentheses.
Accuracy
Cardiomegaly 0.66 (0.62, 0.70) 0.31 (0.27, 0.35) 0.48 (0.44, 0.53) N/A N/A
Consolidation 0.67 (0.64, 0.71) 0.94 (0.92, 0.96) 0.13 (0.10, 0.16) N/A N/A
Edema 0.72 (0.68, 0.76) 0.84 (0.81, 0.87) 0.76 (0.73, 0.80) N/A N/A
Pleural effusion 0.77 (0.74, 0.81) 0.58 (0.54, 0.62) 0.78 (0.74, 0.81) N/A N/A
Average for above four pathologies 0.71 (0.69, 0.73) 0.67 (0.65, 0.69) 0.54 (0.52, 0.56) N/A N/A
Atelectasis 0.77 (0.73, 0.80) 0.70 (0.66, 0.74) 0.69 (0.65, 0.73) N/A N/A
Average for above five pathologies 0.72 (0.70, 0.74) 0.67 (0.66, 0.69) 0.57 (0.55, 0.59) N/A N/A
Lung opacity 0.82 (0.79, 0.86) 0.68 (0.64, 0.72) 0.54 (0.50, 0.59) N/A N/A
Support devices 0.76 (0.72, 0.79) 0.63 (0.58, 0.67) 0.59 (0.55, 0.64) N/A N/A
Overall average 0.74 (0.73, 0.75) 0.67 (0.65, 0.68) 0.57 (0.55, 0.59) N/A N/A
Sensitivity
Cardiomegaly 0.90 (0.86, 0.95) 1.00 (1.00, 1.00) 0.57 (0.49, 0.65) N/A N/A
Consolidation 0.93 (0.83, 1.00) 0.00 (0.00, 0.00) 0.93 (0.82, 1.00) N/A N/A
Edema 0.92 (0.87, 0.98) 0.03 (0.00, 0.06) 0.18 (0.10, 0.27) N/A N/A
Pleural effusion 0.96 (0.92, 0.99) 0.70 (0.61, 0.78) 0.02 (0.00, 0.05) N/A N/A
Average for above four pathologies 0.93 (0.90, 0.95) 0.63 (0.58, 0.68) 0.36 (0.31, 0.41) N/A N/A
Atelectasis 0.85 (0.79, 0.91) 0.00 (0.00, 0.00) 0.02 (0.00, 0.04) N/A N/A
Average for above five pathologies 0.90 (0.88, 0.93) 0.44 (0.40, 0.48) 0.25 (0.22, 0.29) N/A N/A
Lung opacity 0.90 (0.87, 0.93) 0.73 (0.68, 0.78) 0.94 (0.91, 0.97) N/A N/A
Support devices 0.83 (0.79, 0.88) 0.95 (0.92, 0.97) 0.94 (0.90, 0.96) N/A N/A
Overall average 0.89 (0.87, 0.90) 0.64 (0.61, 0.67) 0.60 (0.57, 0.63) N/A N/A
Specificity
Cardiomegaly 0.56 (0.51, 0.61) 0.01 (0.00, 0.03) 0.44 (0.39, 0.50) N/A N/A
Consolidation 0.66 (0.62, 0.70) 1.00 (1.00, 1.00) 0.09 (0.06, 0.11) N/A N/A
Edema 0.68 (0.64, 0.73) 0.99 (0.97, 1.00) 0.87 (0.83, 0.90) N/A N/A
Pleural effusion 0.72 (0.68, 0.77) 0.55 (0.51, 0.60) 0.97 (0.95, 0.99) N/A N/A
Average for above four pathologies 0.66 (0.64, 0.68) 0.68 (0.66, 0.70) 0.58 (0.55, 0.60) N/A N/A
Atelectasis 0.73 (0.68, 0.77) 1.00 (1.00, 1.00) 0.99 (0.98, 1.00) N/A N/A
Average for above five pathologies 0.67 (0.65, 0.69) 0.73 (0.71, 0.75) 0.65 (0.63, 0.67) N/A N/A
Lung opacity 0.74 (0.68, 0.79) 0.63 (0.57, 0.69) 0.10 (0.06, 0.14) N/A N/A
Support devices 0.68 (0.62, 0.74) 0.29 (0.24, 0.35) 0.23 (0.18, 0.29) N/A N/A
Overall average 0.68 (0.66, 0.70) 0.68 (0.66, 0.70) 0.56 (0.54, 0.58) N/A N/A
F1 score
Cardiomegaly 0.62 (0.56, 0.67) 0.46 (0.42, 0.51) 0.39 (0.33, 0.45) 0.74 (0.69, 0.79) 0.68 (0.63, 0.72)
Consolidation 0.24 (0.17, 0.31) 0.00 (0.00, 0.00) 0.11 (0.07, 0.15) 0.33 (0.24, 0.42) 0.39 (0.28, 0.49)
Edema 0.50 (0.43, 0.57) 0.05 (0.00, 0.11) 0.19 (0.11, 0.26) 0.60 (0.52, 0.68) 0.58 (0.51, 0.65)
Pleural effusion 0.63 (0.57, 0.69) 0.41 (0.34, 0.47) 0.03 (0.00, 0.09) 0.70 (0.63, 0.76) 0.74 (0.69, 0.78)
Average for above four pathologies 0.53 (0.49, 0.56) 0.40 (0.37, 0.44) 0.21 (0.18, 0.25) N/A N/A
Atelectasis 0.69 (0.64, 0.74) 0.00 (0.00, 0.00) 0.04 (0.00, 0.08) 0.65 (0.59, 0.70) 0.69 (0.65, 0.73)
Average for above five pathologies 0.57 (0.54, 0.59) 0.35 (0.32, 0.39) 0.19 (0.17, 0.22) 0.61 (0.57, 0.64) 0.62 (0.59, 0.64)
Lung opacity 0.84 (0.81, 0.87) 0.71 (0.66, 0.75) 0.68 (0.64, 0.72) N/A N/A
Support devices 0.78 (0.74, 0.82) 0.72 (0.69, 0.76) 0.70 (0.66, 0.74) N/A N/A
Overall average 0.67 (0.65, 0.69) 0.53 (0.51, 0.55) 0.45 (0.43, 0.47) N/A N/A
Unbalanced labels like "lung lesion," "pneumonia," "pneumothorax," "pleural other," and "fracture," which had a frequency of less than 5% in either the negative or positive class, were not included in the analysis. Additionally, the label "enlarged cardiomediastinum" was not included as it significantly overlaps with "cardiomegaly," which could lead to redundant data interpretations. For the labels "cardiomegaly," "consolidation," "edema," and "pleural effusion," we recorded the average score across multiple datasets to facilitate a comprehensive performance comparison of the model on these four critical labels.
The model attained an average F1 score of 0.57 for five key pathologies, which was marginally lower than CheXzero's 0.61 and human radiologists' 0.62. However, it demonstrated exceptional capability in identifying lung opacity, support devices, and atelectasis. The model's overall sensitivity was 0.89 and specificity was 0.68, compared to GPT-4-vision's sensitivity of 0.64 and specificity of 0.68, and Gemini-Pro-Vision's sensitivity of 0.60 and specificity of 0.56. The sensitivity and specificity for CheXzero and human radiologists were not available, so a direct comparison could not be made.
In the evaluation of radiologic report acceptability by human radiologists, the model achieved an "acceptable without any revision" rate of 51.3%, which closely aligns with the 54.0% acceptability rate of ground truth reports. To gauge the model's capability for autonomous reporting without human radiologist intervention, we defined successful autonomous reporting as reports deemed acceptable either without any revision or with only minor revisions. By this criterion, the model achieved a success rate of 72.7% (Table 5). This is lower than the 84.0% success rate of ground truth reports, and the difference in the rate of autonomous reporting between the model and the ground truth was statistically significant, indicating that the model was somewhat inferior in terms of the autonomous reporting rate. However, the model still maintained a commendable success rate of over 70%.
Discussion
We successfully developed a multimodal large language model capable of accurately detecting major pathological findings in CXRs and generating free-text radiologic reports. Our model exhibited relatively good performance compared to other publicly available general-purpose multimodal LLMs, such as GPT-4-vision and Gemini-Pro-Vision. We also explored the potential of multimodal LLMs for autonomous or semi-autonomous reporting in chest radiography. However, there are some limitations to our study.
First, the evaluation method we employed has inherent limitations. While we used CheXpert-Labeler to assess the quality of the reports, this tool only evaluates the explicit presence of pathological labels and does not consider the location or number of pathological lesions. As a result, this method may not fully reflect the true accuracy of the generated reports. Second, our model showed poor performance in identifying certain pathological lesions, such as pneumothorax and consolidation. Notably, its diagnostic performance was inferior to that of human radiologists, as shown in the CheXpert internal test set. This might be partly due to the resolution limitations of our model, which processes 512 × 512 pixel images, a lower resolution than the higher-resolution images used by radiologists on specialized monitors. Moreover, our model processes 8-bit images with a grayscale of 256 levels, whereas radiologist monitors can display up to 10- or 12-bit grayscale images, providing finer details.
Fig. 4 An example of a chest radiograph from the CheXpert internal test set. The model appropriately recognized the left pleural effusion but failed to identify the left pneumothorax and left pleural drainage catheter. The left pneumothorax is a clinically significant finding, indicating that further improvements to the model are necessary.
Fig. 3 An example of a chest radiograph from the CheXpert internal test set. While the model identified the presence of pleural effusions, atelectasis, and lung opacity, it omitted details about the central catheter (support device).
Table 4 Model performance with the Indiana external test set
Model performance for each pathologic label in the Indiana external test set. Values in each row are given in the order CXR-LLaVA, GPT-4-vision, Gemini-Pro-Vision, with confidence intervals in parentheses.
Accuracy
Cardiomegaly 0.72 (0.69, 0.74) 0.35 (0.33, 0.37) 0.41 (0.39, 0.43)
Consolidation 0.93 (0.89, 0.95) 0.93 (0.91, 0.94) 0.74 (0.71, 0.77)
Edema 0.94 (0.90, 0.98) 0.76 (0.63, 0.88) 0.63 (0.57, 0.68)
Pleural effusion 0.94 (0.93, 0.95) 0.88 (0.87, 0.90) 0.66 (0.64, 0.68)
Average for above four pathologies 0.88 (0.87, 0.89) 0.68 (0.66, 0.69) 0.58 (0.56, 0.59)
Pneumonia 0.83 (0.74, 0.90) 0.61 (0.47, 0.76) 0.69 (0.46, 0.92)
Pneumothorax 0.95 (0.94, 0.96) 0.99 (0.98, 0.99) N/A
Overall average 0.90 (0.89, 0.90) 0.75 (0.74, 0.76) 0.58 (0.56, 0.59)
Sensitivity
Cardiomegaly 0.67 (0.62, 0.71) 0.81 (0.78, 0.84) 0.81 (0.78, 0.84)
Consolidation 0.33 (0.09, 0.58) 0.08 (0.00, 0.18) 0.18 (0.08, 0.30)
Edema 0.50 (0.25, 0.77) 0.29 (0.00, 0.67) 0.44 (0.29, 0.58)
Pleural effusion 0.71 (0.63, 0.79) 0.19 (0.12, 0.27) 0.71 (0.63, 0.79)
Average for above four pathologies 0.66 (0.62, 0.70) 0.66 (0.63, 0.70) 0.73 (0.70, 0.76)
Pneumonia 0.52 (0.30, 0.71) 0.84 (0.65, 1.00) 1.00 (1.00, 1.00)
Pneumothorax 0.07 (0.00, 0.19) 0.00 (0.00, 0.00) N/A
Overall average 0.63 (0.59, 0.67) 0.65 (0.61, 0.68) 0.73 (0.70, 0.77)
Specificity
Cardiomegaly 0.75 (0.71, 0.78) 0.21 (0.19, 0.23) 0.29 (0.27, 0.31)
Consolidation 0.96 (0.93, 0.98) 0.96 (0.95, 0.97) 0.77 (0.74, 0.80)
Edema 1.00 (1.00, 1.00) 0.83 (0.71, 0.95) 0.67 (0.60, 0.72)
Pleural effusion 0.95 (0.95, 0.96) 0.92 (0.90, 0.93) 0.65 (0.64, 0.68)
Average for above four pathologies 0.91 (0.90, 0.92) 0.68 (0.66, 0.69) 0.55 (0.54, 0.57)
Pneumonia 0.95 (0.88, 1.00) 0.47 (0.28, 0.66) 0.00 (0.00, 0.00)
Pneumothorax 0.97 (0.96, 0.98) 1.00 (1.00, 1.00) N/A
Overall average 0.93 (0.92, 0.94) 0.76 (0.75, 0.77) 0.55 (0.54, 0.57)
F1 score
Cardiomegaly 0.62 (0.57, 0.65) 0.37 (0.34, 0.39) 0.39 (0.37, 0.42)
Consolidation 0.31 (0.09, 0.50) 0.08 (0.00, 0.17) 0.07 (0.03, 0.11)
Edema 0.67 (0.33, 0.86) 0.25 (0.00, 0.52) 0.28 (0.18, 0.37)
Pleural effusion 0.55 (0.48, 0.62) 0.13 (0.08, 0.18) 0.17 (0.14, 0.20)
Average for above four pathologies 0.59 (0.55, 0.62) 0.33 (0.31, 0.35) 0.30 (0.28, 0.32)
Pneumonia 0.63 (0.42, 0.79) 0.63 (0.45, 0.77) 0.82 (0.56, 0.96)
Pneumothorax 0.05 (0.00, 0.13) 0.00 (0.00, 0.00) N/A
Overall average 0.56 (0.53, 0.59) 0.33 (0.31, 0.35) 0.30 (0.28, 0.32)
We excluded labels that were unbalanced, with fewer than 10 samples in either the negative or positive category. This included labels such as "lung lesion," "atelectasis," "pleural other," "fracture," and "support devices." These were omitted to ensure statistical relevance and balance in the analysis. Additionally, the label "enlarged cardiomediastinum" was not included as it significantly overlaps with "cardiomegaly," which could lead to redundant data interpretations. Furthermore, "lung opacity" was excluded due to its broad nature, which increases the likelihood of labeling errors by the labeler. For the labels "cardiomegaly," "consolidation," "edema," and "pleural effusion," we recorded the average score across multiple datasets to facilitate a comprehensive performance comparison of the model on these four critical labels. Gemini-Pro-Vision was excluded from the analysis for "pneumothorax" due to the absence of positive samples, which is indicated as N/A in the table.
The model achieved an overall average F1 score of 0.56, excelling particularly in the detection of cardiomegaly (0.62), edema (0.67), and pneumonia (0.63). However, its performance in detecting pneumothorax was notably lower (0.05).
These factors could contribute to the model's suboptimal performance in detecting subtle lesions. Third, the model intentionally omits descriptions of supporting devices such as feeding tubes and endotracheal tubes in CXRs. During the process of refining original reports with GPT-4 for training data, we removed all mentions of these devices. This decision was due to the varied nomenclature used to describe these devices (e.g., nasogastric tube, Levin tube, Dobhoff tube, feeding tube) and the frequent inclusion of specific numerical details about the tip location. Since language models inherently struggle with processing numerical information accurately, including this varied and numerical information led to hallucinations. Consequently, we excluded it from the training dataset, and the generated reports do not include information about these supporting devices.
Fig. 6 An example of a chest radiograph from the Indiana external test set. The model's interpretation identified right upper lobe consolidation and proposed pneumonia as a possible diagnosis, which is reasonable. Nonetheless, the model failed to detect a small left upper lung nodule (black arrow).
Fig. 5 An example of a chest radiograph from the Indiana external test set. The model's interpretation included information about bilateral pulmonary nodules and suggested a possible diagnosis of lung metastasis or infection, which is reasonable. It also recommended that an additional chest CT scan might be helpful. However, the model could not detect the implanted venous access device.
Fourth, the models to which we compared ours are general-purpose and not fine-tuned for CXR interpretation. Therefore, it is not unexpected that our fine-tuned model would outperform them. Nevertheless, it is noteworthy that these general-purpose models still achieved high F1 scores in diagnosing conditions like cardiomegaly and pneumonia. Several non-peer-reviewed public multimodal LLMs, such as XrayGPT, UniXGen, LLM-CXR, and CheXagent, have been released for CXR interpretation, but we did not include them in our comparison due to potential dataset overlap, as we utilized a public dataset for both training and testing [26–29]. Fifth, our assessment of the potential of our model for autonomous reporting was based on a limited dataset of just 50 CXRs, which does not mirror real-world clinical settings. Future research should involve larger-scale studies to ensure the safety and efficacy of multimodal LLMs in CXR interpretation. Lastly, as an integral component of CXR-LLaVA is an LLM, hallucinations and confabulations are inherent limitations. The model can generate text that appears plausible but may be incorrect or nonsensical. Therefore, CXR-LLaVA may exhibit hallucinations, and it is unclear under which conditions these hallucinations are most likely to occur or how they can be effectively prevented. Consequently, it is clear that the current model must not be used for clinical purposes without extensive validation.
In conclusion, our study demonstrates the capability of multimodal LLMs to generate radiologic reports that accurately recognize major lesions. By making our model open-source, we aim to promote the development of more capable and accurate models. We are confident that multimodal large language models have considerable potential to assist clinicians, reduce the workload of radiologists in clinical settings, and ultimately improve patient outcomes.
Abbreviations
CLIP Contrastive language-image pre-training
CNN Convolutional neural network
CXRs Chest X-ray images
LLM Large language model
ViT Vision transformers
Supplementary information
The online version contains supplementary material available at https://doi.org/10.1007/s00330-024-11339-6.
Acknowledgements
We appreciate the support of high-performance graphic processing unit computing provided by the High-Performance Computing and Artificial Intelligence Open Infrastructure at the Gwangju Institute of Science and Technology Super Computing Center. We also acknowledge the utilization of a large language model (GPT-4, OpenAI) to enhance the quality of our medical writing. Any revisions made by the large language model were thoroughly reviewed by the authors, and subsequent adjustments were made as deemed appropriate.
Funding
This study was supported by an Institute of Information and Communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01842, Artificial Intelligence Graduate School Program (GIST); No. 2021-0-02068, Artificial Intelligence Innovation Hub), the National Research Foundation of Korea under grant NRF-2022R1F1A1068529, and the NAVER Digital Bio Innovation Research Fund, funded by NAVER Corporation (Grant No. 3720230020). Open Access funding enabled and organized by Seoul National University Hospital.
Compliance with ethical standards
Guarantor
The scientific guarantor of this publication is Seowoo Lee.
Conflict of interest
The authors of this manuscript declare relationships with the following companies: Medical IP and RadiSen. Seowoo Lee, Hyungjin Kim, and Soon Ho Yoon hold stock options in Medical IP. Hyungjin Kim receives consulting fees from RadiSen and is a member of the Scientific Editorial Board for European Radiology (section: chest) and, as such, did not participate in the selection nor review processes for this article. The authors' roles in these organizations are unrelated to the content of the submitted work. Other authors do not have conflicts of interest.
Statistics and biometry
No complex statistical methods were necessary for this paper.
Informed consent
Written informed consent was not required for this study due to the exclusive
use of publicly available datasets.
Ethical approval
Institutional Review Board (IRB) approval was not required for this study
because it utilized publicly available datasets. This research adheres to the
terms of use associated with the datasets, ensuring compliance with ethical
standards.
Table 5 Evaluation of radiologic report acceptability by human radiologists from the Indiana external test set

Class | Meaning | CXR-LLaVA | Ground truth | Comparison
A | Acceptable without any revision | 77 (51.3%) | 81 (54.0%) |
B | Acceptable after minor revision | 32 (21.3%) | 45 (30.0%) |
C | Acceptable after major revision | 8 (5.3%) | 6 (4.0%) |
D | Unacceptable | 33 (22.0%) | 18 (12.0%) |
A + B | Successful autonomous reporting | 109 (72.7%) | 126 (84.0%) | p < 0.001

The model achieved a 51.3% rate (77 cases) of being "acceptable without any revision" (Class A), closely mirroring the 54.0% rate of the ground truth reports. The model's success rate for autonomous reporting (Class A + B) reached 72.7% (109 cases), slightly lower than the 84.0% for ground truth reports. This difference was statistically significant (p < 0.001), highlighting the comparative capabilities and limitations of the model in autonomous radiologic reporting.
Study subjects or cohorts overlap
A preliminary version of this work has been made publicly available as a preprint. The preprint of this study, titled "CXR-LLAVA: a multimodal large language model for interpreting chest X-ray images," has been uploaded to the arXiv repository. It can be accessed at https://arxiv.org/abs/2310.18341.
Methodology
Retrospective
Diagnostic study
Study using public data collected from multiple centers
Received: 19 March 2024; Revised: 25 October 2024; Accepted: 4 December 2024
References
1. Shamshad F, Khan S, Zamir SW et al (2023) Transformers in medical imaging: a survey. Med Image Anal 88:102802
2. Huang S-C, Pareek A, Jensen M, Lungren MP, Yeung S, Chaudhari AS (2023) Self-supervised learning for medical image classification: a systematic review and implementation guidelines. NPJ Digit Med 6:74
3. Shen D, Wu G, Suk H-I (2017) Deep learning in medical image analysis. Annu Rev Biomed Eng 19:221–248
4. Gozalo-Brizuela R, Garrido-Merchan EC (2023) ChatGPT is not all you need. A state of the art review of large generative AI models. Preprint at https://arxiv.org/abs/2301.04655
5. Radford A, Kim JW, Hallacy C et al (2021) Learning transferable visual models from natural language supervision. PMLR 139:8748–8763
6. Li J, Li D, Savarese S, Hoi S (2023) BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. International conference on machine learning. PMLR, pp 19730–19742
7. Liu H, Li C, Wu Q, Lee YJ (2024) Visual instruction tuning. Advances in Neural Information Processing Systems, 36
8. Xu S, Yang L, Kelly C et al (2023) ELIXR: towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders. Preprint at https://arxiv.org/abs/2308.01317
9. Li C, Wong C, Zhang S et al (2024) LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36
10. Markotić V, Pojužina T, Radančević D, Miljko M, Pokrajčić V (2021) The radiologist workload increase; where is the limit?: mini review and case study. Psychiatr Danub 33:768–770
11. Sajed S, Sanati A, Garcia JE, Rostami H, Keshavarz A, Teixeira A (2023) The effectiveness of deep learning vs. traditional methods for lung disease diagnosis using chest X-ray images: a systematic review. Appl Soft Comput 147:110817
12. Lee S, Youn J, Kim M, Yoon SH (2023) CXR-LLAVA: multimodal large language model for interpreting chest x-ray images. Preprint at https://arxiv.org/abs/2310.18341
13. Signoroni A, Savardi M, Benini S et al (2021) BS-Net: learning COVID-19 pneumonia severity on a large chest X-ray dataset. Med Image Anal 71:102046
14. Irvin J, Rajpurkar P, Ko M et al (2019) CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. Preprint at https://arxiv.org/abs/1901.07031
15. Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM (2017) ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2097–2106
16. Bustos A, Pertusa A, Salinas J-M, De La Iglesia-Vaya M (2020) PadChest: a large chest x-ray image dataset with multi-label annotated reports. Med Image Anal 66:101797
17. Lakhani P, Mongan J, Singhal C et al (2023) The 2021 SIIM-FISABIO-RSNA machine learning COVID-19 challenge: annotation and standard exam classification of COVID-19 chest radiographs. J Digit Imaging 36:365–372
18. Nguyen HQ, Lam K, Le LT et al (2022) VinDr-CXR: an open dataset of chest X-rays with radiologists' annotations. Sci Data 9:429
19. Johnson AE, Pollard TJ, Berkowitz SJ et al (2019) MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data 6:317
20. Demner-Fushman D, Kohli MD, Rosenman MB et al (2016) Preparing a collection of radiology examinations for distribution and retrieval. J Am Med Inform Assoc 23:304–310
21. Touvron H, Martin L, Stone K et al (2023) Llama 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288
22. Yang Z, Yao Z, Tasmin M et al (2023) Performance of multimodal GPT-4V on USMLE with image: potential for imaging diagnostic support with explanations. Preprint at https://doi.org/10.1101/2023.10.26.23297629
23. Brin D, Sorin V, Barash Y et al (2024) Assessing GPT-4 multimodal performance in radiological image analysis. European Radiology 17
24. Tu T, Azizi S, Driess D et al (2024) Towards generalist biomedical AI. NEJM AI 1:AIoa2300138
25. Tiu E, Talius E, Patel P, Langlotz CP, Ng AY, Rajpurkar P (2022) Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat Biomed Eng 6:1399–1406
26. Thawkar O, Shaker A, Mullappilly SS et al (2023) XrayGPT: chest radiographs summarization using medical vision-language models. Preprint at https://arxiv.org/abs/2306.07971
27. Lee H, Kim W, Kim J-H et al (2024) Vision-language generative model for view-specific chest X-ray generation. Conference on Health, Inference, and Learning. PMLR, pp 280–296
28. Lee S, Kim WJ, Chang J, Ye JC (2023) LLM-CXR: instruction-finetuned LLM for CXR image understanding and generation. Preprint at https://arxiv.org/abs/2305.11490
29. Chen Z, Varma M, Delbrouck J-B et al (2024) A vision-language foundation model to enhance efficiency of chest X-ray interpretation. Preprint at https://arxiv.org/abs/2401.12208
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
... CT-CHAT is fine-tuned with over 2.7 million question-answer pairs derived from the CT-RATE dataset. CT-CHAT significantly outperforms open-source vision-language AI assistants, such as LLaVA 1.6 [26], LLaVA-Med [27], and CXR-LLaVA [28], underscoring the need for specialized tools in 3D medical imaging. ...
... Given the absence of generalist models for 3D imaging, we evaluate CT-CHAT against several state-ofthe-art open-source vision-language AI models, including LLaVA 1.6 (Mistral 7B and Vicuna 13B) [26], LLaVA-Med [27], and CXR-LLaVA [28]. To ensure a fair comparison, these models, designed for 2D imaging tasks, are tested using Digitally Reconstructed Radiographs (DRRs) of 3D chest CT volumes-a method that ensures accurate representation in a 2D context, as suggested by [44], rather than using random or central 2D slices of a 3D volume (see benchmarking the CT-CHAT model section of Methods). ...
... Given the lack of pretrained VQA models for 3D imaging, we conduct a comparative analysis of CT-CHAT against four 2D state-of-the-art open-source VQA model: LLaVA 1.6 (Mistral 7B) [26], LLaVA 1.6 (Vicuna 13B) [26], LLaVA-Med [27], and CXR-LLaVA [28], using benchmarking framework established in [14]. To address the challenge of representing 3D information using their 2D encoders and to ensure a fair comparison, we utilize Digitally Reconstructed Radiographs (DRRs) derived from CT-RATE, as suggested by [44]. ...
Preprint
Full-text available
While computer vision has achieved tremendous success with multimodal encoding and direct textual interaction with images via chat-based large language models, similar advancements in medical imaging AI—particularly in 3D imaging—have been limited due to the scarcity of comprehensive datasets. To address this critical gap, we introduce CT-RATE, the first dataset that pairs 3D medical images with corresponding textual reports. CT-RATE comprises 25,692 non-contrast 3D chest CT scans from 21,304 unique patients. Through various reconstructions, these scans are expanded to 50,188 volumes, totaling over 14.3 million 2D slices. Each scan is accompanied by its corresponding radiology report. Leveraging CT-RATE, we develop CT-CLIP, a CT-focused contrastive language-image pretraining framework designed for broad applications without the need for task-specific training. We demonstrate how CT-CLIP can be used in two tasks: multi-abnormality detection and case retrieval. Remarkably, in multi-abnormality detection, CT-CLIP outperforms state-of-the-art fully supervised models across all key metrics, effectively eliminating the need for manual annotation. In case retrieval, it efficiently retrieves relevant cases using either image or textual queries, thereby enhancing knowledge dissemination. By combining CT-CLIP's vision encoder with a pretrained large language model, we create CT-CHAT, a vision-language foundational chat model for 3D chest CT volumes. Finetuned on over 2.7 million question-answer pairs derived from the CT-RATE dataset, CT-CHAT surpasses other multimodal AI assistants, underscoring the necessity for specialized methods in 3D medical imaging. Collectively, the open-source release of CT-RATE, CT-CLIP, and CT-CHAT not only addresses critical challenges in 3D medical imaging but also lays the groundwork for future innovations in medical AI and improved patient care.
... Their versatility enables them to be repurposed for applications across various domains, including medicine, where they have showcased impressive reasoning capabilities by excelling in medical exams [15]. LLMs hold significant promise for transforming patient care by analyzing electronic health records and medical image data to generate accurate, interpretable diagnosis [16], supporting clinical decision-making through treatment recommendations aligned with established guidelines [17], determining patients eligible for clinical trials [18], extracting relevant information from clinical notes [19], making health inferences from wearable sensor data [20], concisely summarizing medical evidence [21], and aiding development and discovery of novel drugs [22]. Their robust instruction-following capabilities have driven the emergence of agentic AI systems, where LLMs function as specialized agents to accomplish defined tasks. ...
Preprint
Full-text available
Objective: Traditional phone-based surveys are among the most accessible and widely used methods to collect biomedical and healthcare data; however, they are often costly, labor intensive, and difficult to scale effectively. To overcome these limitations, we propose an end-to-end survey collection framework driven by conversational Large Language Models (LLMs). Materials and Methods: Our framework consists of a researcher responsible for designing the survey and recruiting participants, a conversational phone agent powered by an LLM that calls participants and administers the survey, a second LLM (GPT-4o) that analyzes the conversation transcripts generated during the surveys, and a database for storing and organizing the results. To test our framework, we recruited 8 participants, consisting of 5 native and 3 non-native English speakers, and administered 40 surveys. We evaluated the correctness of LLM-generated conversation transcripts, the accuracy of survey responses inferred by GPT-4o, and the overall participant experience. Results: Survey responses were successfully extracted by GPT-4o from conversation transcripts with an average accuracy of 98% despite transcripts exhibiting an average per-line word error rate of 7.7%. While participants noted occasional errors made by the conversational LLM agent, they reported that the agent effectively conveyed the purpose of the survey, demonstrated good comprehension, and maintained an engaging interaction. Conclusions: Our study highlights the potential of LLM agents in conducting and analyzing phone surveys for healthcare applications. By reducing the workload on human interviewers and offering a scalable solution, this approach paves the way for real-world, end-to-end AI-powered phone survey collection systems.
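The extraction step described above, a second LLM turning call transcripts into structured survey responses, can be illustrated with a small sketch. The `call_llm` wrapper, the JSON-output prompt, and the usage example are hypothetical stand-ins, not the authors' implementation.

```python
import json

def extract_survey_answers(transcript: str, questions: list[str], call_llm) -> dict:
    """Hypothetical sketch: ask an LLM to map a call transcript onto survey answers.

    `call_llm` is assumed to be any function that takes a prompt string and
    returns the model's text completion (e.g. a thin wrapper around GPT-4o).
    """
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    prompt = (
        "Below is the transcript of a phone survey call, followed by the survey "
        "questions. Return a JSON object mapping each question number to the "
        "participant's answer, or null if the question was not answered.\n\n"
        f"Transcript:\n{transcript}\n\nQuestions:\n{numbered}\n"
    )
    return json.loads(call_llm(prompt))

# Illustrative usage, assuming `my_gpt4o_wrapper` is defined elsewhere:
# answers = extract_survey_answers(transcript_text,
#                                  ["In the past 7 days, how often have you felt lonely?"],
#                                  call_llm=my_gpt4o_wrapper)
```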
... [Table excerpt from the citing survey, listing medical multimodal LLMs with their base architecture, parameter count in billions, and code availability: CephGPT-4 [42] (MiniGPT-4: ViT and Vicuna-7B / VisualGLM: ViT and ChatGLM-6B; 7/6); PathAsst [43] (LLaVA: PathCLIP and Vicuna-13B; 13; GitHub); CXR-LLaVA [44] (LLaVA: ViT and LLaMA2-7B; 7; GitHub); LLaVA-Med [45] (LLaVA; 7; GitHub); ELIXR [46] (ELIXR-C: CLIP; ELIXR-B: BLIP-2 with ELIXR-C and PaLM2-S); [47] (BLIP-2: EVA and OPT Transformer; 2.7); LLIM for MIC [48] (BLIP-2: EVA and OPT Transformer); ClinicalBLIP [49] (InstructBLIP); XrayGPT [50] (MedCLIP and Vicuna; GitHub); remaining column values not recoverable.] ...
... Some investigations rely on LLaVA [71]. In the first stage, PathAsst [43] freezes the vision encoder (PathCLIP) and the LLM, and focuses on training the Fully Connected (FC) layer. Then, it only freezes PathCLIP, and fine-tunes the other components in the subsequent stage. ...
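The two-stage recipe quoted above (train only the projection layer first, then unfreeze the language model while keeping the vision encoder frozen) can be expressed in a few lines of PyTorch. The modules and dimensions below are illustrative placeholders, not the actual components of PathAsst or CXR-LLaVA.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Enable or disable gradients for every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Hypothetical LLaVA-style components: a vision encoder, an LLM, and a small
# projection layer mapping image features into the LLM's embedding space.
vision_encoder = nn.Identity()        # stand-in for PathCLIP / a ViT
llm = nn.Identity()                   # stand-in for the language model
projector = nn.Linear(1024, 4096)     # the FC layer trained in stage one (dims assumed)

# Stage 1: align modalities by training only the projector.
set_trainable(vision_encoder, False)
set_trainable(llm, False)
set_trainable(projector, True)

# Stage 2: keep the vision encoder frozen, fine-tune the projector and the LLM.
set_trainable(llm, True)
```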
Article
The advent of Large Language Models (LLMs) has sparked considerable interest in the medical image domain, as they can generalize to multiple tasks and offer outstanding performance. While LLMs achieve promising results, there is currently no comprehensive summary of their use on medical images, making it challenging for researchers to understand the progress within this domain. To fill this gap, we make the first attempt to present a comprehensive survey of LLMs for medical images. In addition, to better summarize the current progress comprehensively, we further introduce a novel x-stage tuning paradigm for summarization, including zero-stage tuning, one-stage tuning, and multi-stage tuning, offering a unified perspective on LLMs for medical images. Finally, we discuss challenges and future directions in this domain, aiming to spur more breakthroughs in the future. We hope this work can pave the way for the broad application of LLMs in medical images and provide a valuable resource for this domain.
... RaDialog, another LLM-based method, combines visual features and pathology findings to generate accurate radiology reports and support interactive tasks, significantly improving clinical efficacy. CXR-LLaVA, a multimodal LLM integrating a vision transformer with a language model, outperformed models like GPT-4 Vision and Gemini Pro Vision in CXR report generation (Lee et al., 2024). ...
... The aforementioned method of producing embeddings by grouping data from value and category columns ('Grouped embeddings') is compared to two other methods. The first is separate embeddings for each datum, where each value column datum is separately transformed using the previously described FNN, while each category column datum is converted to an embedding using a learned embedding table. [Table excerpt from the citing preprint, listing chest X-ray report-generation baselines with their training-set sizes: (Kong et al., 2022) 368,960 images; RGRG (Tanida et al., 2023) 166,512 images; CvT2DistilGPT2 (Nicolson et al., 2023) 270,790 images; RaDialog (Pellegrini et al., 2023) 276,778 images; MedXChat (Yang et al., 2023) 270,790 images; CXR-LLaVA-v2 (Lee et al., 2024) 193,513 images; CXRMate (Nicolson et al., 2024a) 125,395 exams; CXRMate-RRG24 (Nicolson et al., 2024b); the individual metric columns are not recoverable.] Figure 4: Case study demonstrating how incorporating a diverse set of patient data can aid with report generation. ...
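The grouped/separate embedding idea referenced above, numeric columns passed through a small feed-forward network and categorical columns through a learned embedding table, can be sketched as follows. The class name, dimensions, and vocabulary size are assumptions for illustration and do not reproduce the preprint's architecture.

```python
import torch
import torch.nn as nn

class HeterogeneousEmbedder(nn.Module):
    """Illustrative sketch: embed mixed numeric/categorical patient data as prompt tokens.

    Numeric ("value") columns go through a small feed-forward network, while
    categorical columns use a learned embedding table; the per-column embeddings
    can then be concatenated and prepended to the language model's input sequence.
    """

    def __init__(self, n_categories: int, dim: int = 256):
        super().__init__()
        self.value_ffn = nn.Sequential(nn.Linear(1, dim), nn.GELU(), nn.Linear(dim, dim))
        self.category_table = nn.Embedding(n_categories, dim)

    def forward(self, values: torch.Tensor, categories: torch.Tensor) -> torch.Tensor:
        # values:     (batch, n_value_cols) float tensor, e.g. vital signs
        # categories: (batch, n_cat_cols) long tensor of category ids, e.g. medication codes
        value_tokens = self.value_ffn(values.unsqueeze(-1))       # (B, n_value_cols, dim)
        category_tokens = self.category_table(categories)         # (B, n_cat_cols, dim)
        return torch.cat([value_tokens, category_tokens], dim=1)  # (B, n_tokens, dim)
```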
Preprint
Full-text available
This study investigates the integration of diverse patient data sources into multimodal language models for automated chest X-ray (CXR) report generation. Traditionally, CXR report generation relies solely on CXR images and limited radiology data, overlooking valuable information from patient health records, particularly from emergency departments. Utilising the MIMIC-CXR and MIMIC-IV-ED datasets, we incorporate detailed patient information such as aperiodic vital signs, medications, and clinical history to enhance diagnostic accuracy. We introduce a novel approach to transform these heterogeneous data sources into embeddings that prompt a multimodal language model, significantly enhancing the diagnostic accuracy of generated radiology reports. Our comprehensive evaluation demonstrates the benefits of using a broader set of patient data, underscoring the potential for enhanced diagnostic capabilities and better patient outcomes through the integration of multimodal data in CXR report generation.
... Despite their powerful encoding capabilities, med-VLMs often struggle with naive combinations of image and text embeddings, leading to suboptimal performance. CXR images capture the full range of normal anatomical structures and pathological abnormalities in a given subject, whereas radiology reports often provide condensed, context-dependent descriptions that emphasize diagnostically critical abnormalities [33]. Moreover, substantial variability in radiological prose introduces additional discrepancies [34]. ...
Preprint
Diagnostic imaging relies on interpreting both images and radiology reports, but the growing data volumes place significant pressure on medical experts, yielding increased errors and workflow backlogs. Medical vision-language models (med-VLMs) have emerged as a powerful framework to efficiently process multimodal imaging data, particularly in chest X-ray (CXR) evaluations, albeit their performance hinges on how well image and text representations are aligned. Existing alignment methods, predominantly based on contrastive learning, prioritize separation between disease classes over segregation of fine-grained pathology attributes like location, size or severity, leading to suboptimal representations. Here, we propose MedTrim (Meta-entity-driven Triplet mining), a novel method that enhances image-text alignment through multimodal triplet learning synergistically guided by disease class as well as adjectival and directional pathology descriptors. Unlike common alignment methods that separate broad disease classes, MedTrim leverages structured meta-entity information to preserve subtle but clinically significant intra-class variations. For this purpose, we first introduce an ontology-based entity recognition module that extracts pathology-specific meta-entities from CXR reports, as annotations on pathology attributes are rare in public datasets. For refined sample selection in triplet mining, we then introduce a novel score function that captures an aggregate measure of inter-sample similarity based on disease classes and adjectival/directional descriptors. Lastly, we introduce a multimodal triplet alignment objective for explicit within- and cross-modal alignment between samples sharing detailed pathology characteristics. Our demonstrations indicate that MedTrim improves performance in downstream retrieval and classification tasks compared to state-of-the-art alignment methods.
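MedTrim's alignment objective builds on triplet learning. The function below is a generic multimodal triplet margin loss on normalized embeddings, included only to make the idea concrete; the ontology-based meta-entity extraction and the score-based triplet mining that distinguish MedTrim are not shown.

```python
import torch
import torch.nn.functional as F

def triplet_alignment_loss(anchor: torch.Tensor,
                           positive: torch.Tensor,
                           negative: torch.Tensor,
                           margin: float = 0.2) -> torch.Tensor:
    """Generic triplet margin loss on L2-normalized embeddings.

    Anchor and positive are assumed to share a disease class and fine-grained
    pathology descriptors (e.g. location, severity); the negative does not.
    In a multimodal setting the three embeddings may come from either the image
    or the text encoder, enforcing within- and cross-modal alignment.
    """
    anchor, positive, negative = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    d_pos = 1.0 - (anchor * positive).sum(dim=-1)   # cosine distance to the positive
    d_neg = 1.0 - (anchor * negative).sum(dim=-1)   # cosine distance to the negative
    return F.relu(d_pos - d_neg + margin).mean()
```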
... These datasets enable fine-tuning of pre-trained LLMs on domain-specific medical knowledge, allowing AI to better understand and generate medically relevant content. The amalgamation of LLMs with other AI techniques, such as vision and reasoning systems, has further enhanced the capabilities of AI chatbots, enabling them to provide more accurate, context-aware responses, such as explaining patients' X-ray scans (Lee et al., 2023). As of the publication date, the authors label Claude Opus (Anthropic, 2024) as the strongest model for language- and reasoning-related use, and GPT-4 for features such as add-ins for research use. ...
Chapter
The innovation of large language models (LLMs) has widened possibilities for renovating healthcare education through AI-powered learning resources, such as chatbots. This chapter explores the assimilation of LLMs with Bloom's taxonomy, demonstrating how this foundational framework for designing and assessing learning outcomes can support the development of critical thinking, problem-solving, and decision-making skills in healthcare learners. Through case examples and research presentations, this chapter illustrates how LLM chatbots provide interactive, scaffolding, and contextually relevant learning experiences. However, it also highlights the importance of designing these tools with key principles in mind, including learner-centeredness, co-creation with domain experts, and principled responsibility. By embracing a collaborative, interdisciplinary, and future-oriented approach to chatbot design and development, the power of LLMs can be harnessed to revolutionize healthcare education and ultimately improve patient care.
... 97 Lee et al. developed an open-source multimodal large language model (CXR-LLaVA) for interpreting chest X-ray images (CXRs), leveraging recent advances in large language models (LLMs) to potentially replicate human radiologists' image interpretation skills, achieving better results than GPT-4-vision and Gemini-Pro-Vision on two training data sets and one testing data set. 98 Ali H et al. proposed Domain Adaptive Language Modeling (RadLing) to extract Common Data Elements (CDEs) from chest radiology reports, which comprehensively outperforms GPT-4 in terms of accuracy and recall and is easy to deploy locally with low running costs. 99 The improvement efforts based on large models persist beyond this point. ...
Article
Full-text available
Objectives This study aims to assess the performance of a multimodal artificial intelligence (AI) model capable of analyzing both images and textual data (GPT-4V), in interpreting radiological images. It focuses on a range of modalities, anatomical regions, and pathologies to explore the potential of zero-shot generative AI in enhancing diagnostic processes in radiology. Methods We analyzed 230 anonymized emergency room diagnostic images, consecutively collected over 1 week, using GPT-4V. Modalities included ultrasound (US), computerized tomography (CT), and X-ray images. The interpretations provided by GPT-4V were then compared with those of senior radiologists. This comparison aimed to evaluate the accuracy of GPT-4V in recognizing the imaging modality, anatomical region, and pathology present in the images. Results GPT-4V identified the imaging modality correctly in 100% of cases (221/221), the anatomical region in 87.1% (189/217), and the pathology in 35.2% (76/216). However, the model's performance varied significantly across different modalities, with anatomical region identification accuracy ranging from 60.9% (39/64) in US images to 97% (98/101) and 100% (52/52) in CT and X-ray images (p < 0.001). Similarly, pathology identification ranged from 9.1% (6/66) in US images to 36.4% (36/99) in CT and 66.7% (34/51) in X-ray images (p < 0.001). These variations indicate inconsistencies in GPT-4V's ability to interpret radiological images accurately. Conclusion While the integration of AI in radiology, exemplified by multimodal GPT-4, offers promising avenues for diagnostic enhancement, the current capabilities of GPT-4V are not yet reliable for interpreting radiological images. This study underscores the necessity for ongoing development to achieve dependable performance in radiology diagnostics. Clinical relevance statement Although GPT-4V shows promise in radiological image interpretation, its high diagnostic hallucination rate (>40%) indicates it cannot be trusted for clinical use as a standalone tool. Improvements are necessary to enhance its reliability and ensure patient safety. Key Points GPT-4V's capability in analyzing images offers new clinical possibilities in radiology. GPT-4V excels in identifying imaging modalities but demonstrates inconsistent anatomy and pathology detection. Ongoing AI advancements are necessary to enhance diagnostic reliability in radiological applications.
Article
Full-text available
Advancements in deep learning and computer vision provide promising solutions for medical image analysis, potentially improving healthcare and patient outcomes. However, the prevailing paradigm of training deep learning models requires large quantities of labeled training data, which is both time-consuming and cost-prohibitive to curate for medical images. Self-supervised learning has the potential to make significant contributions to the development of robust medical imaging models through its ability to learn useful insights from copious medical datasets without labels. In this review, we provide consistent descriptions of different self-supervised learning strategies and compose a systematic review of papers published between 2012 and 2022 on PubMed, Scopus, and ArXiv that applied self-supervised learning to medical imaging classification. We screened a total of 412 relevant studies and included 79 papers for data extraction and analysis. With this comprehensive effort, we synthesize the collective knowledge of prior work and provide implementation guidelines for future researchers interested in applying self-supervised learning to their development of medical imaging classification models.
Article
Full-text available
We describe the curation, annotation methodology, and characteristics of the dataset used in an artificial intelligence challenge for detection and localization of COVID-19 on chest radiographs. The chest radiographs were annotated by an international group of radiologists into four mutually exclusive categories, including “typical,” “indeterminate,” and “atypical appearance” for COVID-19, or “negative for pneumonia,” adapted from previously published guidelines, and bounding boxes were placed on airspace opacities. This dataset and respective annotations are available to researchers for academic and noncommercial use.
Article
Full-text available
In tasks involving the interpretation of medical images, suitably trained machine-learning models often exceed the performance of medical experts. Yet such a high level of performance typically requires that the models be trained with relevant datasets that have been painstakingly annotated by experts. Here we show that a self-supervised model trained on chest X-ray images that lack explicit annotations performs pathology-classification tasks with accuracies comparable to those of radiologists. On an external validation dataset of chest X-rays, the self-supervised model outperformed a fully supervised model in the detection of three pathologies (out of eight), and the performance generalized to pathologies that were not explicitly annotated for model training, to multiple image-interpretation tasks and to datasets from multiple institutions.
Article
Full-text available
Most of the existing chest X-ray datasets include labels from a list of findings without specifying their locations on the radiographs. This limits the development of machine learning algorithms for the detection and localization of chest abnormalities. In this work, we describe a dataset of more than 100,000 chest X-ray scans that were retrospectively collected from two major hospitals in Vietnam. Out of this raw data, we release 18,000 images that were manually annotated by a total of 17 experienced radiologists with 22 local labels of rectangles surrounding abnormalities and 6 global labels of suspected diseases. The released dataset is divided into a training set of 15,000 and a test set of 3,000. Each scan in the training set was independently labeled by 3 radiologists, while each scan in the test set was labeled by the consensus of 5 radiologists. We designed and built a labeling platform for DICOM images to facilitate these annotation procedures. All images are made publicly available in DICOM format along with the labels of both the training set and the test set.
Article
Following unprecedented success on the natural language tasks, Transformers have been successfully applied to several computer vision problems, achieving state-of-the-art results and prompting researchers to reconsider the supremacy of convolutional neural networks (CNNs) as de facto operators. Capitalizing on these advances in computer vision, the medical imaging field has also witnessed growing interest for Transformers that can capture global context compared to CNNs with local receptive fields. Inspired from this transition, in this survey, we attempt to provide a comprehensive review of the applications of Transformers in medical imaging covering various aspects, ranging from recently proposed architectural designs to unsolved issues. Specifically, we survey the use of Transformers in medical image segmentation, detection, classification, restoration, synthesis, registration, clinical report generation, and other tasks. In particular, for each of these applications, we develop taxonomy, identify application-specific challenges as well as provide insights to solve them, and highlight recent trends. Further, we provide a critical discussion of the field's current state as a whole, including the identification of key challenges, open problems, and outlining promising future directions. We hope this survey will ignite further interest in the community and provide researchers with an up-to-date reference regarding applications of Transformer models in medical imaging. Finally, to cope with the rapid development in this field, we intend to regularly update the relevant latest papers and their open-source implementations at https://github.com/fahadshamshad/awesome-transformers-in-medical-imaging.
Article
In this work we design an end-to-end deep learning architecture for predicting, on Chest X-rays images (CXR), a multi-regional score conveying the degree of lung compromise in COVID-19 patients. Such semi-quantitative scoring system, namely Brixia score, is applied in serial monitoring of such patients, showing significant prognostic value, in one of the hospitals that experienced one of the highest pandemic peaks in Italy. To solve such a challenging visual task, we adopt a weakly supervised learning strategy structured to handle different tasks (segmentation, spatial alignment, and score estimation) trained with a “from-the-part-to-the-whole” procedure involving different datasets. In particular, we exploit a clinical dataset of almost 5,000 CXR annotated images collected in the same hospital. Our BS-Net demonstrates self-attentive behavior and a high degree of accuracy in all processing stages. Through inter-rater agreement tests and a gold standard comparison, we show that our solution outperforms single human annotators in rating accuracy and consistency, thus supporting the possibility of using this tool in contexts of computer-assisted monitoring. Highly resolved (super-pixel level) explainability maps are also generated, with an original technique, to visually help the understanding of the network activity on the lung areas. We also consider other scores proposed in literature and provide a comparison with a recently proposed non-specific approach. We eventually test the performance robustness of our model on an assorted public COVID-19 dataset, for which we also provide Brixia score annotations, observing good direct generalization and fine-tuning capabilities that highlight the portability of BS-Net in other clinical settings. The CXR dataset along with the source code and the trained model are publicly released for research purposes.
Article
We present a labeled large-scale, high resolution chest x-ray dataset for the automated exploration of medical images along with their associated reports. This dataset includes more than 160,000 images obtained from 67,000 patients that were interpreted and reported by radiologists at San Juan Hospital (Spain) from 2009 to 2017, covering six different position views and additional information on image acquisition and patient demography. The reports were labeled with 174 different radiographic findings, 19 differential diagnoses and 104 anatomic locations organized as a hierarchical taxonomy and mapped onto standard Unified Medical Language System (UMLS) terminology. Of these reports, 27% were manually annotated by trained physicians and the remaining set was labeled using a supervised method based on a recurrent neural network with attention mechanisms. The labels generated were then validated in an independent test set achieving a 0.93 Micro-F1 score. To the best of our knowledge, this is one of the largest public chest x-ray databases suitable for training supervised models concerning radiographs, and the first to contain radiographic reports in Spanish. The PadChest dataset can be downloaded from http://bimcv.cipf.es/bimcv-projects/padchest/.