Vision-Language Models for Automated Chest
X-ray Interpretation: Leveraging ViT and GPT-2
Md. Rakibul Islam1*, Md. Zahid Hossain1, Mustofa Ahmed1,
Most. Sharmin Sultana Samu2
1Department of Computer Science and Engineering, Ahsanullah
University of Science and Technology, Dhaka, 1208, Bangladesh.
2Department of Civil Engineering, Ahsanullah University of Science and
Technology, Dhaka, 1208, Bangladesh.
*Corresponding author(s). E-mail(s): rakib.aust41@gmail.com;
Contributing authors: zahidd16@gmail.com;
mustofahmed24@gmail.com; sharminsamu130@gmail.com;
These authors contributed equally to this work.
Abstract
Radiology plays a pivotal role in modern medicine due to its non-invasive diagnostic
capabilities. However, the manual generation of unstructured medical reports is
time-consuming and prone to errors, creating a significant bottleneck in clinical
workflows. Despite advancements in AI-generated radiology reports, challenges remain
in achieving detailed and accurate report generation. In this study, we evaluate
different combinations of multimodal models that integrate Computer Vision and
Natural Language Processing to generate comprehensive radiology reports. We employ
a pretrained Vision Transformer (ViT-B16) and a SWIN Transformer as image encoders,
with BART and GPT-2 serving as textual decoders. Using chest X-ray images and
reports from the IU-Xray dataset, we evaluate the SWIN Transformer-BART, SWIN
Transformer-GPT-2, ViT-B16-BART and ViT-B16-GPT-2 models for report generation and
aim to identify the best combination. The SWIN-BART model emerges as the
best-performing of the four, achieving strong results on almost all evaluation
metrics, including ROUGE, BLEU and BERTScore.
Keywords: Medical Report Generation, Vision Language Model, SWIN-BART, ViT-B16-GPT-2, X-ray Interpretation.
1 Introduction
Medical reports play a key role in summarizing the diagnoses observed in X-ray images.
These reports are essential for promoting efficient communication among healthcare
professionals. They help guarantee precise care and facilitate well-informed decision-
making. However, creating medical reports requires considerable expertise and time,
as radiologists typically write them manually. Automating medical report generation offers the
potential to improve patient care, streamline administrative workflows and enhance
overall healthcare delivery. Consequently, several studies [1–4] have focused on
developing automated models to generate medical reports, with most current efforts
dedicated to chest radiology reports.
Despite advancements, significant challenges hinder progress in this field. Accu-
rately interpreting medical images remains a complex task compounded by the need
for meticulous data annotation and addressing heterogeneity in input data. Ensuring
consistency and standardization in generated reports is also a major hurdle. Further-
more, the diversity and variability of diseases along with the requirement for algorithm
interpretability present additional obstacles. Overcoming these challenges is crucial
for improving the quality and reliability of automated medical report generation,
making this an urgent area of research.
Automated radiology report generation systems are designed to process medical
images and create textual reports that describe radiological findings and interpreta-
tions. These systems aim to transform radiology practice by automating repetitive
tasks, identifying potential medical conditions and reducing diagnostic errors. Such
systems have the potential to enhance workflows by prioritizing patients based on
the urgency of their conditions in clinical settings. This capability could significantly
improve operational efficiency and save lives by expediting critical diagnoses.
Recent research has demonstrated that transformer-based models and global visual
features deliver promising results in automated report generation. However, these
models often struggle with identifying and characterizing rare or complex diseases.
The primary limitation lies in their inadequate understanding of medical terminology,
anatomical structures and lesion characteristics. As a result, the generated descriptions
often lack clinical professionalism, and the resulting text can be unclear or insufficiently
fluent. These issues compromise the usability and practical application of the reports
in clinical settings.
To address these challenges, we use deep learning models that combine textual and
visual input. We employ the Vision Transformer (ViT-B16) and SWIN Transformer
as encoders, with BART and GPT-2 as decoders. This architecture combines the strengths
of advanced vision and language models with the aim of enhancing the overall quality
of automated medical report generation.
We conducted a systematic evaluation of different encoder-decoder combinations
including ViT B16 with BART, SWIN Transformer with BART, SWIN Transformer
with GPT-2 and ViT B16 with GPT-2. Each configuration was assessed based on its
ability to improve disease recognition accuracy, enhance the professionalism of disease
descriptions and ensure fluency in report text. Our goal is to maximize the automation
of radiology report generation by determining the most efficient combination. The
following concisely describes our main contributions:
• We introduce an approach that integrates visual and textual data using Vision
Transformers (ViT B16) and SWIN Transformers as encoders along with BART
and GPT-2 as decoders.
• We systematically evaluate four encoder-decoder combinations (SWIN Transformer-
BART, SWIN Transformer-GPT-2, ViT B16-BART, ViT B16-GPT-2) to identify
the optimal pairing for addressing key challenges such as disease recognition
accuracy, professionalism in descriptions and text fluency.
• Our method focuses on reducing the burden on clinicians by automating the report-
writing process, improving diagnostic precision and ensuring that generated
reports adhere to professional medical standards.
The structure of this article is as follows. Section 2 presents a summary of related
studies. Section 3 contains the methodology of our approach. Section 4 describes the
experimental setup along with the description of the dataset and the pre-processing
of the data. It also reveals the hyperparameter combinations for different models.
Section 5 reports the analysis of the research results with a comparison among the
four multimodal models. Lastly, Section 6 concludes the article with a discussion of
the limitations and future research directions.
2 Related Work
Prior studies explore different architectures for automated report generation. [5] intro-
duces a transformer-based model with a differentiable clinical information extraction
approach. In contrast, [6] employs an adversarial reinforcement learning framework
integrating an accuracy discriminator (AD) and a fluency discriminator (FD). [7] com-
bines Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM)
networks with an attention mechanism to improve feature extraction and text gener-
ation. [8] focuses on explainable AI (XAI) techniques utilizing class activation maps
(CAM) for image analysis and LIME for explainability in natural language processing.
[9] proposes a self-guided framework (SGF) that combines unsupervised and super-
vised deep learning techniques allowing label-free learning from medical reports. The
datasets used in these studies include MIMIC-CXR [10], Indiana University (IU) X-ray
[11] and other medical report repositories. [5] and [6] utilize both MIMIC-CXR [10] and
IU X-ray [11] datasets to ensure robustness in evaluation. [7] also uses these datasets
emphasizing feature extraction through CNNs and LSTMs. [8] compiles data from
various studies rather than using a specific dataset. [9] primarily relies on existing radi-
ology reports without requiring annotated disease labels, taking advantage of text-
image associations to refine its self-guided learning process. [5] reports improvements
in clinical coherence achieving a 6.4-point increase in micro-averaged F1 scores and a
2.2-point gain in macro-averaged F1 scores. [6] demonstrates superior natural language
generation (NLG) performance on the IU X-ray [11] and MIMIC-CXR [10] datasets
through adversarial reinforcement learning. [7] achieves state-of-the-art BLEU scores
outperforming LRCN and hierarchical RNNs by incorporating attention mechanisms.
[9] shows significant improvements in report accuracy and length by demonstrating a
better alignment between image features and textual descriptions. However, [8] does
not present empirical results but highlights gaps in the application of XAI techniques
to medical NLP. [5] acknowledges that clinical coherence remains inadequate for real-
world deployment. [6] notes that the small size of the IU X-ray dataset and the
structural simplicity of reports may affect model performance. [7] struggles with com-
plex disease presentations and coherent long-form text generation. [8] identifies the
lack of a universal XAI method applicable to both vision and NLP models as a major
gap. [9] points out challenges in handling medical language complexity and potential
biases in text-image associations. [5] suggests integrating retrieval-based methods to
improve clinical accuracy. [6] proposes expanding dataset diversity and incorporat-
ing additional evaluation metrics. [7] recommends optimizing attention mechanisms
and enhancing NLP capabilities for better long-form text generation. [8] highlights
the need for hybrid XAI techniques and interdisciplinary collaboration. [9] aims to
refine its self-guided learning approach by integrating more diverse data sources and
improving the model’s understanding of complex medical language.
[12] introduces a Medical Semantic-Assisted Transformer that incorporates a
memory-augmented sparse attention block and a Medical Concepts Generation Net-
work (MCGN) to enhance semantic coherence. [13] presents MedEPT, a parameter-
efficient approach leveraging pre-trained vision-language models through a Generator
and Discriminator framework with knowledge extraction using MetaMap. [14] employs
a multi-modal machine learning approach that integrates transfer learning from pre-
trained models to combine image and text representations. [15] focuses on contrastive
learning by synthesizing increasingly hard negatives to improve feature discrimina-
tion without adding extra network complexity. [4] proposes a region-guided model
that detects and describes specific anatomical regions enabling improved accuracy
and human interaction. [12] employs the MIMIC-CXR [10] dataset, [13] uses image-
only datasets reducing reliance on extensive annotated data, [14] utilizes chest X-ray
report generation datasets, [15] employs both the IU-XRay [11] and MIMIC-CXR [10]
datasets to improve model generalization and [4] demonstrates its effectiveness through
experimental evaluations. [12] reports superior performance in image captioning and
report generation by integrating medical semantics. [13] significantly reduces trainable
parameters and training time while maintaining high accuracy making medical report
generation more efficient. [14] effectively combines image and text representations and
improves chest X-ray caption accuracy. [15] enhances feature discrimination using
hard negatives and achieves state-of-the-art alignment between images and reports. [4]
improves report quality and transparency by incorporating anatomical
region guidance, enabling better clinical applicability. [12] highlights the complexity of
radiographic images as a limitation, suggesting model refinement and exploration of addi-
tional datasets for better generalization. [13] notes its reliance on existing datasets and
the potential biases present in medical reports recommending expansion of medical
datasets and improving entity extraction methods. [14] suggests further improvements
in pre-trained model applications. [15] acknowledges that generated reports are not yet
at human-level quality and proposes enhancing model architecture and training strate-
gies. [4] suggests improving interactive features and applying the model to various
imaging modalities to increase its clinical usability.
[16] introduces MERGIS, a transformer-based model that integrates image seg-
mentation to refine feature extraction, significantly improving report coherence
and outperforming previous methods on the MIMIC-CXR dataset. However,
it requires further validation on diverse datasets. [17] employs a deep learning-based
encoder-decoder model with an attention mechanism leveraging CheXnet for feature
extraction and demonstrates promising BLEU score results. However, its reliance
on limited datasets restricts its generalization potential. [18] presents InVERGe, a
lightweight transformer using the Cross-Modal Query Fusion Layer (CMQFL) to
bridge visual and textual modalities. Trained on MIMIC-CXR, IU-XRay and CDD-
CESM [19] datasets, it enhances image-text alignment, yet dataset diversity remains
a concern. [20] explores large multimodal models integrating LLaVA, IDEFICS 9B
and visionGPT2. It achieves strong BERTScore and ROUGE results on the ROCOv2
dataset. However, challenges in concept detection persist suggesting the need for addi-
tional datasets and refined models. [21] introduces XRaySwinGen, a multimodal model
combining the SWIN Transformer for image encoding and GPT-2 for text generation
with bilingual capabilities. Despite high ROUGE-L and METEOR scores, dataset bias
and validation across different imaging modalities remain challenges.
[22] introduces VLScore, a novel multi-modal evaluation metric that jointly
embeds visual and textual features to assess the diagnostic relevance of generated
reports. This approach addresses the limitations of prior evaluation methods that rely
solely on text-based metrics. Using the ReXVal [23] dataset and a custom perturba-
tion dataset, VLScore demonstrates strong correlation with radiologist evaluations.
However, its reliance on dataset-specific constants limits generalizability. [24] pro-
poses CXPMRG-Bench, a benchmarking framework for X-ray report generation using
the CheXpert Plus [25] dataset. It incorporates a multi-stage pre-training strategy
with self-supervised contrastive learning and fine-tuning to enhance report accu-
racy. Despite its improvements, the framework still requires dataset expansion and
architectural refinements for broader applicability. [26] presents a robust radiology
report generation system integrating multimodal learning combining CNNs for image
analysis with RNNs or Transformers for text generation. The model is trained on
IU-CXR and leverages pre-trained models and transfer learning to produce clinically
coherent reports. However, challenges persist in ensuring full clinical accuracy and
real-world validation. [27] introduces a multi-modal feature fusion-based approach for
chest X-ray report generation. By integrating a vision transformer for visual feature
extraction with Word2Vec for semantic textual features, this model achieves state-of-
the-art performance on the IU-Xray [11] and NIH [28] datasets. Despite its success,
it requires further dataset diversification and adaptation for various diseases.
The following research gaps are identified through our extensive literature search:
• Need for diverse datasets and improvements in architecture.
• Achieving full clinical accuracy and validating the system in real settings.
• Need for better feature extraction and validation on diverse datasets.
• Generated reports are not yet at human-level performance and require further
refinement.
3 Methodology
Our workflow (Fig. 1) begins with image acquisition and processing. Chest X-ray
images were collected from the publicly available IU-Xray dataset to ensure diversity
and relevance for the task. Feature extraction was performed using ViT (Vision Trans-
former) and SWIN Transformer. The Vision Transformer [29] divides the image into
fixed-size patches, embeds them into feature vectors and processes these vectors
through a series of self-attention layers to capture global contextual information. In
contrast, the SWIN Transformer [30] employs a hierarchical structure with shifted
windows by enabling efficient computation of both local and global image representa-
tions. Both models were pre-trained on large-scale image datasets such as ImageNet
and subsequently fine-tuned on the medical domain data to extract high-quality fea-
tures relevant for the task. These extracted features serve as the visual input to the
subsequent language generation model.
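For concreteness, a minimal sketch of this feature-extraction step is shown below using the Hugging Face Transformers library mentioned in Section 4. The checkpoint names and the example file path are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Minimal sketch of visual feature extraction with pretrained encoders.
# Checkpoint names and the image path are illustrative assumptions.
from PIL import Image
from transformers import AutoImageProcessor, ViTModel, SwinModel

vit_ckpt = "google/vit-base-patch16-224-in21k"        # assumed ViT-B/16 checkpoint
swin_ckpt = "microsoft/swin-base-patch4-window7-224"  # assumed SWIN checkpoint

vit_processor = AutoImageProcessor.from_pretrained(vit_ckpt)
vit_encoder = ViTModel.from_pretrained(vit_ckpt)

swin_processor = AutoImageProcessor.from_pretrained(swin_ckpt)
swin_encoder = SwinModel.from_pretrained(swin_ckpt)

image = Image.open("example_cxr.png").convert("RGB")  # hypothetical X-ray file

# ViT: fixed-size patches -> patch embeddings -> global self-attention.
vit_inputs = vit_processor(images=image, return_tensors="pt")
vit_features = vit_encoder(**vit_inputs).last_hidden_state   # (1, num_patches + 1, hidden)

# SWIN: hierarchical shifted-window attention -> multi-scale representations.
swin_inputs = swin_processor(images=image, return_tensors="pt")
swin_features = swin_encoder(**swin_inputs).last_hidden_state  # (1, num_tokens, hidden)
```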
Fig. 1 Proposed methodology for automated X-ray interpretation
The language generation stage employs either GPT-2 or BART as the core text
generation model. GPT-2 [31] is a decoder-only architecture that generates descrip-
tive text in an autoregressive manner based on the extracted visual features ensuring
coherence and relevance to the input image. BART [32] is a sequence-to-sequence
model that combines bidirectional encoding and autoregressive decoding to generate
well-structured medical reports. Cross-attention mechanisms were employed to inte-
grate the visual features into the language model to ensure that the generated text
is grounded in the visual input. The models were trained using paired image-report
datasets to enable them to learn mappings between visual features and corresponding
textual descriptions effectively.
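The sketch below illustrates how such an encoder-decoder pairing can be assembled with cross-attention using the Hugging Face VisionEncoderDecoderModel; the checkpoints and generation settings shown are assumptions for illustration, not the exact training configuration.

```python
# Sketch: pairing a vision encoder with a text decoder via cross-attention.
# Checkpoints and generation parameters are illustrative assumptions.
from transformers import (AutoImageProcessor, AutoTokenizer,
                          VisionEncoderDecoderModel)

encoder_ckpt = "microsoft/swin-base-patch4-window7-224"  # assumed SWIN encoder
decoder_ckpt = "gpt2"                                    # GPT-2 decoder (BART works analogously)

# Cross-attention layers are added to the decoder and trained from scratch.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_ckpt, decoder_ckpt)

processor = AutoImageProcessor.from_pretrained(encoder_ckpt)
tokenizer = AutoTokenizer.from_pretrained(decoder_ckpt)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Token ids the generation utilities expect.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Autoregressive report generation for a preprocessed image tensor `pixel_values`:
# generated_ids = model.generate(pixel_values, max_length=128, num_beams=4)
# report = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
```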
The training process was conducted in an end-to-end manner using supervised learn-
ing. Cross-entropy loss was employed to optimize the model. Validation datasets were
utilized during training to monitor model performance and prevent overfitting. The
optimization process leveraged the AdamW optimizer with weight decay to ensure
convergence.
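A simplified sketch of this supervised training loop is given below; it reuses the model object from the previous sketch and assumes dataloaders that yield image tensors and tokenized reports, so it is an outline of the procedure rather than the exact script used in our experiments.

```python
# Sketch of the supervised training loop (assumed structure, not the exact script).
# `model`, `train_loader` and `val_loader` are assumed to exist already.
import torch
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

def run_epoch(loader, train=True):
    model.train(train)
    total = 0.0
    for batch in loader:  # batch: {"pixel_values": ..., "labels": tokenized report ids}
        with torch.set_grad_enabled(train):
            out = model(pixel_values=batch["pixel_values"], labels=batch["labels"])
            loss = out.loss  # token-level cross-entropy computed internally
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        total += loss.item()
    return total / len(loader)

# for epoch in range(num_epochs):
#     train_loss = run_epoch(train_loader, train=True)
#     val_loss = run_epoch(val_loader, train=False)  # monitored to detect overfitting
```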
Evaluation of the generated reports was carried out using standard text-generation
metrics: ROUGE [33], BLEU [34] and BERTScore [35]. More
specifically, BLEU assesses text quality through n-gram matching, ROUGE-L evalu-
ates text using the longest common subsequence, and BERTScore Precision, BERTScore
Recall and BERTScore F1 measure text similarity using contextual embeddings.
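The snippet below sketches how these metrics can be computed with the Hugging Face evaluate library; the prediction and reference strings are purely illustrative examples.

```python
# Sketch of metric computation with the `evaluate` library (assumed tooling).
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
bertscore = evaluate.load("bertscore")

predictions = ["heart size is normal lungs are clear"]         # generated report (example)
references = ["the heart size is normal the lungs are clear"]  # ground-truth report (example)

print(rouge.compute(predictions=predictions, references=references))   # ROUGE-1/2/L F1
print(bleu.compute(predictions=predictions, references=[references]))  # n-gram precision
print(bertscore.compute(predictions=predictions, references=references,
                        lang="en"))                                    # contextual embeddings
```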
4 Experimental Setup
The experiments were carried out using an NVIDIA T4 GPU made available through
Google Colab. This GPU provided the necessary computational power to ensure
the efficient training and evaluation of the proposed models. The computational
setup for the study included standard deep learning libraries. PyTorch was utilized
for implementing the models, while Hugging Face Transformers were employed for
fine-tuning.
4.1 Dataset
We have used the IU-Xray [11] dataset which is a publicly available collection of
radiographic images paired with their corresponding radiology reports. This dataset
is widely used for research in medical imaging. The dataset comprises a total of 5,910
chest X-ray images along with their associated findings in the form of radiology reports.
Each image in the dataset is accompanied by a detailed textual description that
provides diagnostic insights. Fig. 2 shows two sample X-ray images with associated
reports.
Fig. 2 Sample X-ray images and corresponding findings in the form of reports from the IU-Xray
dataset. These reports are treated as the ground truth.
The dataset is organized into predefined splits for training, testing and validation.
The training, test and validation sets contain 4,138, 1,180 and 592 images and their
corresponding reports, respectively, following a 70:20:10
split for training, testing and validation. The dataset is free from
missing images, reports or split information. Figures 3 and 4 show the distribution
of report length in number of words and the distribution of report length across the
train, test and validation splits, respectively.
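For illustration, a minimal sketch of a paired image-report dataset class is shown below; the file layout and field names are assumptions, as the exact data-loading code is not described here.

```python
# Sketch of a paired image-report dataset (file layout and field names are assumed).
from PIL import Image
from torch.utils.data import Dataset

class IUXrayDataset(Dataset):
    """Pairs each chest X-ray image with its ground-truth report text."""

    def __init__(self, samples, image_processor, tokenizer, max_len=128):
        self.samples = samples          # list of {"image_path": ..., "report": ...}
        self.image_processor = image_processor
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        item = self.samples[idx]
        image = Image.open(item["image_path"]).convert("RGB")
        pixel_values = self.image_processor(images=image,
                                            return_tensors="pt").pixel_values[0]
        labels = self.tokenizer(item["report"], truncation=True,
                                max_length=self.max_len, padding="max_length",
                                return_tensors="pt").input_ids[0]
        return {"pixel_values": pixel_values, "labels": labels}
```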
Fig. 3 Distribution of report length in number of words
Fig. 4 Report length distribution in train, test and validation split
4.2 Data Preprocessing and Analysis
As part of the preprocessing pipeline, we systematically removed stopwords from
the radiology reports in the dataset. Fig. 5 and Fig. 6 show the word clouds of the
reports before and after stopword removal, respectively.
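A minimal sketch of this stopword-removal step is shown below; the use of NLTK's English stopword list is an assumption for illustration.

```python
# Sketch of stopword removal from report text (NLTK English list assumed).
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def remove_stopwords(report: str) -> str:
    """Drop common English stopwords while keeping the remaining word order."""
    return " ".join(w for w in report.split() if w.lower() not in stop_words)

print(remove_stopwords("The heart is normal in size and the lungs are clear."))
# -> "heart normal size lungs clear."
```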
Table 1 lists detailed statistics of report length in the train, test and validation splits.
The training set contains 4,138 samples, the test set contains 1,180 samples and the
validation set contains 592 samples. The similar mean report lengths indicate a
balanced distribution across all three subsets.
Table 1 Statistics of Report Length by Split
Split Count Mean Standard Deviation Min Max 25% 50% 75%
Train 4138.0 31.765 14.206 7.0 149.0 22.0 29.0 39.0
Test 1180.0 28.219 13.181 8.0 93.0 19.0 25.0 33.0
Validation 592.0 31.128 13.812 8.0 83.0 21.0 30.0 38.0
Fig. 5 Word cloud of reports before removing stopwords
Fig. 6 Word cloud of reports after removing stopwords
4.3 Model Hyperparameters
Table 2 summarizes the hyperparameters used for training different model architec-
tures in our experiment. Each model is trained using the AdamW optimizer with a
consistent learning rate of 0.00005 and a weight decay of 0.01. The training batch size
is fixed at 8 across all models. The number of training epochs varies. ViT B16-GPT-2
Table 2 Model Hyperparameters
Model Number of Training Epochs Batch Size Optimizer Learning Rate Weight Decay
ViT B16-GPT-2 5 8 AdamW 0.00005 0.01
ViT B16-BART 8 8 AdamW 0.00005 0.01
SWIN-BART 8 8 AdamW 0.00005 0.01
SWIN-GPT-2 5 8 AdamW 0.00005 0.01
and SWIN-GPT-2 are trained for 5 epochs, while ViT B16-BART and SWIN-BART
are trained for 8 epochs.
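For reference, the settings in Table 2 can be expressed as Hugging Face training arguments as sketched below; the output directory and evaluation strategy are illustrative assumptions rather than part of the reported configuration.

```python
# Sketch: the hyperparameters from Table 2 expressed as Hugging Face training arguments.
# The exact training script is not reproduced here; this mirrors the reported settings.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="swin-bart-iu-xray",   # hypothetical output path
    num_train_epochs=8,               # 8 for the BART decoders, 5 for the GPT-2 decoders
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    weight_decay=0.01,
    optim="adamw_torch",              # AdamW optimizer
    evaluation_strategy="epoch",      # monitor validation loss each epoch
)
```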
5 Result Analysis
We evaluated the performance of four vision-language models, namely SWIN
Transformer-BART, SWIN Transformer-GPT-2, ViT B16-BART and ViT B16-GPT-2
on automated X-ray interpretation tasks. The models were compared across multi-
ple metrics including ROUGE scores, BLEU score and BERTScore to assess their
syntactic precision, contextual fidelity and semantic similarity.
Table 3 shows that the SWIN-BART model demonstrated superior performance
across all ROUGE metrics, indicating its effectiveness in capturing textual overlap
and contextual accuracy. For ROUGE1 F1, SWIN-BART achieved the highest score
of 0.4134 significantly outperforming ViT B16-GPT-2 with a score of 0.2877, ViT
B16-BART with a score of 0.3176 and SWIN-GPT-2 with a score of 0.2855.
Table 3 Model Evaluation with ROUGE and BLEU score
Model ROUGE1 F1 ROUGE2 F1 ROUGE3 F1 ROUGE4 F1 ROUGEL F1 BLEU
ViT B16-GPT-2 0.2877 0.1273 0.0689 0.0435 0.2031 0.0403
ViT B16-BART 0.3176 0.0612 0.0121 0.0029 0.2324 0.0169
SWIN-BART 0.4134 0.1537 0.0738 0.0427 0.2935 0.0648
SWIN-GPT-2 0.2855 0.1108 0.0531 0.0305 0.1933 0.0319
In bigram and trigram overlaps measured by ROUGE2 F1 and ROUGE3 F1,
SWIN-BART scored 0.1537 and 0.0738 respectively. These values were notably higher
compared to ViT B16-GPT-2, which scored 0.1273 in ROUGE2 F1 and 0.0689 in
ROUGE3 F1. Similarly, ViT B16-BART achieved 0.0612 in ROUGE2 F1 and 0.0121 in
ROUGE3 F1, while SWIN-GPT-2 had 0.1108 in ROUGE2 F1 and 0.0531 in ROUGE3
F1.
For ROUGE4 F1, which evaluates four-gram overlaps, SWIN-BART scored 0.0427,
marginally behind ViT B16-GPT-2 at 0.0435, while ViT B16-
BART scored 0.0029 and SWIN-GPT-2 scored 0.0305. In ROUGEL F1, SWIN-BART
achieved the highest score of 0.2935, surpassing ViT B16-GPT-2 at 0.2031, ViT B16-
BART at 0.2324 and SWIN-GPT-2 at 0.1933. These results establish SWIN-BART as
the most capable model in preserving syntactic structure and capturing long-sequence
dependencies.
The BLEU score, which measures n-gram precision, further confirmed SWIN-BART’s
dominance. It achieved a score of 0.0648, significantly outperforming ViT B16-GPT-2
at 0.0403, ViT B16-BART at 0.0169 and SWIN-GPT-2 at 0.0319. This result high-
lights SWIN-BART’s enhanced ability to generate text sequences that closely align
with reference descriptions.
Table 4 Model Evaluation with BERTScore
Model BERTScore Precision BERTScore Recall BERTScore F1
ViT B16-GPT-2 0.8392 0.9015 0.8691
ViT B16-BART 0.8158 0.8471 0.8310
SWIN-BART 0.8855 0.8947 0.8899
SWIN-GPT-2 0.8331 0.8998 0.8650
Table 4 shows that the SWIN-BART model demonstrated strong performance
across the BERTScore metrics, indicating its effectiveness in capturing seman-
tic similarity and contextual relevance in X-ray report generation. For BERTScore
Precision, SWIN-BART achieved the highest score of 0.8855, outperforming ViT
B16-GPT-2 with a score of 0.8392, ViT B16-BART with 0.8158 and SWIN-GPT-2
with 0.8331. This suggests that SWIN-BART generates text with higher lexical accu-
racy and better alignment with reference reports.
Fig. 7 Training vs Validation Loss of ViT B16-GPT-2 Model
In BERTScore Recall, which measures the ability to retain important details from
reference reports, SWIN-BART maintained a competitive score of 0.8947, slightly
below ViT B16-GPT-2 at 0.9015 and SWIN-GPT-2 at 0.8998. However, it sig-
nificantly outperformed ViT B16-BART, which scored 0.8471. This highlights its
ability to capture clinically relevant details.
For BERTScore F1, which balances precision and recall, SWIN-BART achieved the
highest score of 0.8899, surpassing ViT B16-GPT-2 at 0.8691, ViT B16-BART at
0.8310 and SWIN-GPT-2 at 0.8650. This confirms its overall effectiveness in generating
fluent, coherent and semantically accurate medical reports.
Fig. 8 Training vs Validation Loss of ViT B16-BART Model
Fig. 9 Training vs Validation Loss of SWIN-BART Model
The comparative analysis establishes SWIN-BART as the best-performing model
among the four. It consistently achieved superior results across nearly all evaluated metrics.
Fig. 10 Training vs Validation Loss of SWIN-GPT-2 Model
Figures 7, 8, 9 and 10 illustrate the training and validation loss over different epochs
for the ViT-B16-GPT-2, ViT-B16-BART, SWIN-BART and SWIN-GPT-2 models
respectively. We observe a consistent drop in both training and validation losses for
all the models.
Fig. 11 Generated report sample from our implementation with attached ground truth
Fig. 11 shows an example report generated through our implementation. From the
attached ground truth, it is clearly visible that our model is capable of generating an
accurate and coherent report from a given X-ray image.
6 Conclusion and Future Work
In this paper, we have evaluated different combinations of the Vision Transformer and
SWIN Transformer along with BART and GPT-2 for multimodal tasks of X-ray report
generation. Our experiment finds the best performance from the SWIN-BART model
in the task of accurate and coherent report generation from input Chest X-ray images.
Due to computational limitations, we had to work with fixed-length report generation.
This is one of the limitations of our work, which we expect to overcome in future research.
There is also a lack of more robust datasets for this generation task. This should be
addressed in future work to obtain more reliable and accurate models for report
generation.
References
[1] Chen, Z., Shen, Y., Song, Y., Wan, X.: Cross-modal memory networks for
radiology report generation. arXiv preprint arXiv:2204.13258 (2022)
[2] Jing, B., Xie, P., Xing, E.: On the automatic generation of medical imaging
reports. arXiv preprint arXiv:1711.08195 (2017)
[3] Li, C.Y., Liang, X., Hu, Z., Xing, E.P.: Knowledge-driven encode, retrieve,
paraphrase for medical image report generation. In: Proceedings of the AAAI
Conference on Artificial Intelligence, vol. 33, pp. 6666–6673 (2019)
[4] Tanida, T., Müller, P., Kaissis, G., Rueckert, D.: Interactive and explainable
region-guided radiology report generation. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 7433–7442 (2023)
[5] Lovelace, J., Mortazavi, B.: Learning to generate clinically coherent chest x-ray
reports. In: Findings of the Association for Computational Linguistics: EMNLP
2020, pp. 1235–1243 (2020)
[6] Hou, D., Zhao, Z., Liu, Y., Chang, F., Hu, S.: Automatic report generation for
chest x-ray images via adversarial reinforcement learning. IEEE Access 9, 21236–
21250 (2021)
[7] Sirshar, M., Paracha, M.F.K., Akram, M.U., Alghamdi, N.S., Zaidi, S.Z.Y.,
Fatima, T.: Attention based automated radiology report generation using cnn
and lstm. Plos one 17(1), 0262209 (2022)
[8] Ahmed, S.B., Solis-Oba, R., Ilie, L.: Explainable-ai in automated medical report
generation using chest x-ray images. Applied Sciences 12(22), 11750 (2022)
[9] Li, J., Li, S., Hu, Y., Tao, H.: A self-guided framework for radiology report
generation. In: International Conference on Medical Image Computing and
Computer-Assisted Intervention, pp. 588–598 (2022). Springer
[10] Johnson, A.E., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P.,
Deng, C.-y., Mark, R.G., Horng, S.: Mimic-cxr, a de-identified publicly available
database of chest radiographs with free-text reports. Scientific data 6(1), 317
(2019)
[11] Demner-Fushman, D., Kohli, M.D., Rosenman, M.B., Shooshan, S.E., Rodriguez,
L., Antani, S., Thoma, G.R., McDonald, C.J.: Preparing a collection of radiology
examinations for distribution and retrieval. Journal of the American Medical
Informatics Association 23(2), 304–310 (2016)
[12] Wang, Z., Tang, M., Wang, L., Li, X., Zhou, L.: A medical semantic-assisted
transformer for radiographic report generation. In: International Conference on
Medical Image Computing and Computer-Assisted Intervention, pp. 655–664
(2022). Springer
[13] Li, Q.: Harnessing the power of pre-trained vision-language models for effi-
cient medical report generation. In: Proceedings of the 32nd ACM International
Conference on Information and Knowledge Management, pp. 1308–1317 (2023)
[14] Nicolson, A., Dowling, J., Koopman, B.: Improving chest x-ray report genera-
tion by leveraging warm starting. Artificial intelligence in medicine 144, 102633
(2023)
[15] Voutharoja, B.P., Wang, L., Zhou, L.: Automatic radiology report generation by
learning with increasingly hard negatives. In: ECAI 2023, pp. 2427–2434. IOS
Press (2023)
[16] Nimalsiri, W., Hennayake, M., Rathnayake, K., Ambegoda, T.D., Meedeniya, D.:
Automated radiology report generation using transformers. In: 2023 3rd Inter-
national Conference on Advanced Research in Computing (ICARC), pp. 90–95
(2023). IEEE
[17] Kumar, M.A., Panitini, M., Vemulapalli, S., Sai, M.J.N.V.: Deep learning based
automatic radiology report generation. In: 2023 Third International Conference
on Artificial Intelligence and Smart Energy (ICAIS), pp. 1521–1526 (2023). IEEE
[18] Deria, A., Kumar, K., Chakraborty, S., Mahapatra, D., Roy, S.: Inverge: Intelli-
gent visual encoder for bridging modalities in report generation. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
2028–2038 (2024)
[19] Khaled, R., Helal, M., Alfarghaly, O., Mokhtar, O., Elkorany, A., El Kassas, H.,
Fahmy, A.: Categorized contrast enhanced mammography dataset for diagnostic
and artificial intelligence research. Scientific data 9(1), 122 (2022)
[20] Hoque, M., Hasan, M.R., Emon, M., Khalifa, F., Rahman, M.: Medical image
interpretation with large multimodal models. In: CEUR Workshop Proceedings
(2024). CEUR Workshop Proceedings 3740, CEUR-WS.org 2024
[21] Magalhães, G.V., Santos, R.L.d.S., Vogado, L.H., Paiva, A.C., Santos Neto,
P.d.A.: Xrayswingen: Automatic medical reporting for x-ray exams with multi-
modal model. Heliyon 10(7) (2024)
[22] Dawidowicz, G., Hirsch, E., Tal, A.: Image-aware evaluation of generated medical
reports. arXiv preprint arXiv:2410.17357 (2024)
[23] Yu, F., Endo, M., Krishnan, R., Pan, I., Tsai, A., Reis, E.P., Fonseca, E., Lee,
H., Shakeri, Z., Ng, A., et al.: Radiology report expert evaluation (rexval) dataset
(2023)
[24] Wang, X., Wang, F., Li, Y., Ma, Q., Wang, S., Jiang, B., Li, C., Tang, J.: Cxpmrg-
bench: Pre-training and benchmarking for x-ray medical report generation on
chexpert plus dataset. arXiv preprint arXiv:2410.00379 (2024)
[25] Chambon, P., Delbrouck, J.-B., Sounack, T., Huang, S.-C., Chen, Z., Varma,
M., Truong, S.Q., Chuong, C.T., Langlotz, C.P.: Chexpert plus: Hundreds
of thousands of aligned radiology texts, images and patients. arXiv preprint
arXiv:2405.19538 (2024)
[26] Singh, S.: Designing a robust radiology report generation system. arXiv preprint
arXiv:2411.01153 (2024)
[27] Cheddi, F., Habbani, A., Nait-Charif, H.: A multi-modal feature fusion-based
approach for chest x-ray report generation. In: 2024 11th International Conference
on Wireless Networks and Mobile Communications (WINCOM), pp. 1–7 (2024).
IEEE
[28] Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: Chestx-ray8:
Hospital-scale chest x-ray database and benchmarks on weakly-supervised classi-
fication and localization of common thorax diseases. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 2097–2106 (2017)
[29] Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., Xue, H.: Towards
robust vision transformer. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp. 12042–12051 (2022)
[30] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin trans-
former: Hierarchical vision transformer using shifted windows. In: Proceedings of
the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022
(2021)
[31] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.:
Language models are unsupervised multitask learners. OpenAI blog 1(8), 9
(2019)
[32] Lewis, M.: Bart: Denoising sequence-to-sequence pre-training for natural language
generation, translation, and comprehension. arXiv preprint arXiv:1910.13461
(2019)
[33] Lin, C.-Y.: Rouge: A package for automatic evaluation of summaries. In: Text
Summarization Branches Out, pp. 74–81 (2004)
[34] Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: Bleu: a method for automatic
evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics, pp. 311–318 (2002)
[35] Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: Bertscore: Evaluat-
ing text generation with bert. arXiv preprint arXiv:1904.09675 (2019)