Multi-modal representation learning in retinal imaging using self-supervised learning for enhanced clinical predictions
Emese Sükei1, Elisabeth Rumetshofer2, Niklas Schmidinger2, Andreas Mayr2,
Ursula Schmidt-Erfurth1, Günter Klambauer2 & Hrvoje Bogunović1,3
Self-supervised learning has become the cornerstone of building generalizable and transferable artificial intelligence systems in medical imaging. In particular, contrastive representation learning techniques trained on large multi-modal datasets have demonstrated impressive capabilities of producing highly transferable representations for different downstream tasks. In ophthalmology, large multi-modal datasets are abundantly available and conveniently accessible as modern retinal imaging scanners acquire both 2D fundus images and 3D optical coherence tomography (OCT) scans to assess the eye. In this context, we introduce a novel multi-modal contrastive learning-based pipeline to facilitate learning joint representations for the two retinal imaging modalities. After self-supervised pre-training on 153,306 scan pairs, we show that such a pre-training framework can provide both a retrieval system and encoders that produce comprehensive OCT and fundus image representations that generalize well for various downstream tasks on three independent external datasets, explicitly focusing on clinically pertinent prediction tasks. In addition, we show that interchanging OCT with lower-cost fundus imaging can preserve the predictive power of the trained models.
Keywords Multi-modal imaging, Contrastive pre-training, Representation learning, Predictive modeling,
Retinal imaging
Deep learning techniques have significantly advanced in various medical image interpretation tasks1–4. However, building robust and generalizable deep-learning models often demands a substantial volume of labeled data. For instance, in their seminal work, Esteva et al.1 compiled a dermatologist-labeled dataset of 129,450 clinical images to reach on-par performance with experts in classifying skin cancer. Compiling large datasets with diverse cases is a laborious and time-consuming task, and it can also introduce annotator biases. As such, the progress of supervised deep learning models for medical imaging is hindered by the limited availability of costly, extensively labeled datasets5. Self-supervised learning (SSL) approaches aim to overcome these challenges by learning meaningful representations from large amounts of unlabeled data, reducing the reliance on costly and time-consuming expert annotations while improving the accuracy and generalizability of predictive tasks6.
The fundamental concept behind SSL is to create auxiliary or pretext tasks that enable the model to learn meaningful and valuable representations directly from the data without relying on human annotations. Once the model acquires these representations through pretext tasks, the learned features can be transferred to downstream tasks, such as classification, segmentation, or detection, where labeled data is often scarce. These approaches usually utilize discriminative modeling, reconstruction tasks, or contrastive learning techniques7–10, enabling efficient downstream task learning with fewer labeled examples. These methods have demonstrated substantial performance improvements not only in the natural image domain but also in medical imaging, enhancing both classification11,12 and segmentation13–15 tasks. Nonetheless, they often require large datasets for effective pre-training, which can be challenging in medical domains where privacy concerns restrict data collection and sharing. Additionally, many existing SSL models focus on single-modality inputs, limiting their ability to fully capture the complementary information available from multiple modalities.
1OPTIMA Lab, Department of Ophthalmology and Optometry, Medical University of Vienna, Vienna, Austria. 2LIT AI Lab, Institute for Machine Learning, Johannes Kepler University, Linz, Austria. 3Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Vienna, Austria. email: emese.suekei@meduniwien.ac.at; hrvoje.bogunovic@meduniwien.ac.at
In the medical workflow, clinicians regularly interpret a combination of multiple modalities to deliver comprehensive patient care, such as clinical notes, laboratory tests, vital signs, medical images, genomics, and more. However, conventional machine learning applications often concentrate on single modalities, leading to inflexible models that struggle to adapt to other tasks or varying data distributions for the same task without retraining. The recent emergence of foundation models exploits the multi-modal nature of available medical data and has gained considerable attention16–18. These models, which are by definition pre-trained encoders adaptable or fine-tunable for many tasks, show great promise19. They are of particular interest for their effectiveness in deciphering intricate structures within multi-modal data, making them well-suited to successfully address a broad range of challenging tasks in the medical domain20–23.
Ophthalmology is an image-intensive sub-specialty at the forefront of integrating artificial intelligence (AI) into medical practice24–27. The eye and its intricate retina offer a unique advantage, being directly observable through non-invasive imaging methods such as color fundus photography and, more recently, optical coherence tomography (OCT). The 3D OCT provides micrometer-level resolution within a volumetric scan, achieving widespread adoption and emerging as the gold standard for managing retinal diseases. Combining 2D color fundus photography or near-infrared reflectance imaging with 3D OCT confers a holistic insight into the retinal structure28; hence, every OCT scanner on the market takes a fundus image in addition to the OCT scan as part of the acquisition. However, expert annotations are not always readily available for these imaging modalities, which can hinder the training of supervised AI models in clinical practice. The automatic availability of these two modalities, OCT and fundus images, nonetheless offers an opportunity for multi-modal SSL methods. Learning meaningful representations by jointly modeling the different imaging modalities can, in turn, facilitate disease progression modeling and personalized patient management. Despite this potential, the realm of multi-modal SSL in ophthalmology remains relatively uncharted, as current methodologies primarily focus on fusion techniques29,30 or uni-modal setups31–34.
Motivated by this, our research focuses explicitly on the multi-modal advantages of pairing OCT volumetric data with fundus images. To this end, we base our analysis on multi-modal SSL techniques such as Contrastive Language-Image Pre-training (CLIP)35 and Contrastive Leave One Out Boost (CLOOB)36, which have proven their efficacy in acquiring transferable representations when trained on both textual and visual data and are expected to hold immense promise in medical imaging. Unlike approaches that focus solely on 2D information or neglect the complementary insights provided by aligning fundus images with OCT scans of the same eye, our methodology applies these contrastive paradigms to retinal imaging in a multi-modal fashion, leveraging the synergies between the modalities to capture complex patterns and representations. By doing so, we anticipate generating more meaningful and transferable representations of the retina, ultimately enhancing the overall performance of the predictive models.
Our contributions can be summarized as follows:
1. We introduce a novel multi-modal contrastive pre-training method for retinal imaging based on the CLIP and CLOOB paradigms, combining the 2D and 3D information available from fundus and OCT imaging.
2. We extensively evaluate the pre-trained encoders' performance on three independent external datasets at adapting to a series of clinically relevant prediction tasks: i) regression: structure-function and biomarker measurements, ii) classification: biomarker detection and disease diagnosis, and iii) forecasting: disease and treatment prognosis.
3. We compare our proposed approach with fully supervised training, natural image-based pre-training, and uni-modal SSL pre-training, offering insights into its relative effectiveness and demonstrating its potential advantages in enhancing predictive models for ophthalmic applications.
4. We demonstrate how the multi-modal contrastive pre-training enhances the predictive performance of 2D fundus image-based models. Furthermore, we show how it allows, to a certain extent, the interchanging of the two imaging modalities when the encoder weights are frozen for supervised predictive training. Fundus image-based predictions are of strong interest as fundus imaging tools are more cost-effective and widely accessible for diagnostics in ophthalmology than OCT scanners.
Materials and methods
Datasets
We used different datasets for the contrastive pre-training and the supervised downstream tasks. Here, we list the datasets and the trial numbers, while a detailed overview of the dataset properties from each study and the cohorts' demographic and clinical data at the time of baseline OCT acquisition is provided in Supplementary Table A1. Ethics approval for post-hoc analysis of the datasets was obtained from the Ethics Committee at the Medical University of Vienna (MUW), Austria (EK: 1246/2016). This work adhered to the tenets of the Declaration of Helsinki and the MUW's standards of good scientific practice.
For pre-training, we used large-scale longitudinal data from the imaging data collection of clinical studies available at OPTIMA Lab, MUW. We extracted 153,306 fundus photography/near-infrared reflectance imaging scans and the corresponding OCT volumes from 3,790 patients diagnosed with neovascular age-related macular degeneration (AMD). The scans were acquired using Spectralis, Cirrus, Nidek, or Topcon scanners at different sites. The majority of the imaging data were prospectively collected during five randomized multi-center clinical trials: FLUID (NCT01972789), TREND (NCT01948830), OCTAVE (NCT01780935), HAWK (NCT02307682) and HARRIER (NCT02434328).
For fine-tuning and validation of the proposed supervised models on clinical downstream tasks, we used the retrospective de-identified HARBOR clinical trial dataset (NCT00891735)37 (51,186 scans from 2,183 unique eyes of 1,094 patients with AMD), the publicly available OLIVES dataset38 (1,590 scans from 96 unique eyes of 87 patients with diabetic retinopathy (DR)), and a mixed diseases dataset from the OPTIMA Lab imaging
datasets (1,922 scans from 1,922 unique eyes of 1,922 patients), which is referred to as MIX in the rest of the paper. The latter consisted of a selection of baseline scans from clinical studies covering 4 different retinal conditions: diabetic macular edema (DME), intermediate age-related macular degeneration (iAMD), retinal vein occlusion (RVO), and geographic atrophy (GA), as well as healthy cases. Hence, our assessment of transferability was based on various datasets extending to diverse clinical settings and disease types.
Among the different datasets, the scanning protocols varied, leading to different OCT volume resolutions with 21–256 B-scans, each B-scan having a size in the range of 256–1,024 × 480–1,024 pixels with a pixel size of 2.13–23.43 × 1.95–4.19 µm². Similarly, the size of the fundus images was in the range of 290–2,000 × 348–2,992 pixels.
Data processing and augmentation
We re-scaled the fundus images and the OCT B-scans to 224×224 to allow larger batch sizes, which is especially important for contrastive pre-training. For OCT volumes, we then uniformly sampled 20 B-scans, preserving their order, to account for the variability in the number of B-scans used during image acquisition. This subsampling can miss variations and abnormalities in other parts of the volume; however, it still captures information from different locations within the volume, which helps in understanding the 3D morphology of the tissue and is critical for many clinical and research applications. The B-scans and near-infrared reflectance fundus images are inherently grayscale; the color fundus photographs from the Topcon scanners were converted to grayscale. Finally, the images/volumes were normalized by intensity mean and variance, image/volume-wise.
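For concreteness, the sketch below shows one way such preprocessing could be implemented in PyTorch. It is an illustration rather than the authors' code: the function names, the interpolation modes, and the small epsilon in the normalization are assumptions.

```python
import numpy as np
import torch
import torch.nn.functional as F

def preprocess_oct(volume: np.ndarray, n_slices: int = 20, size: int = 224) -> torch.Tensor:
    """volume: (num_bscans, H, W) grayscale OCT volume as a NumPy array."""
    # Uniformly spaced B-scan indices, preserving acquisition order.
    idx = np.linspace(0, volume.shape[0] - 1, n_slices).round().astype(int)
    vol = torch.from_numpy(volume[idx]).float()[None, None]            # (1, 1, D, H, W)
    vol = F.interpolate(vol, size=(n_slices, size, size), mode="trilinear",
                        align_corners=False)[0]                        # (1, D, 224, 224)
    return (vol - vol.mean()) / (vol.std() + 1e-6)                     # volume-wise normalization

def preprocess_fundus(image: np.ndarray, size: int = 224) -> torch.Tensor:
    """image: (H, W) grayscale fundus image (color images converted beforehand)."""
    img = torch.from_numpy(image).float()[None, None]                  # (1, 1, H, W)
    img = F.interpolate(img, size=(size, size), mode="bilinear",
                        align_corners=False)[0]                        # (1, 224, 224)
    return (img - img.mean()) / (img.std() + 1e-6)                     # image-wise normalization
```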
Additionally, random image transformations were used for the uni-modal contrastive pre-training to create different views of the same image/volume. For this, we followed the SimCLR39 framework, omitting the transformations that do not apply to the data at hand, resulting in (1) centered crop, (2) horizontal flip, (3) Gaussian blur, and (4) contrast adjustment. The first two transformations were applied with a probability of 0.5, and the Gaussian blur and contrast adjustment with a probability of 0.3. Finally, the images/volumes were normalized, image/volume-wise.
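A hedged sketch of such an augmentation pipeline for the 2D case, using torchvision transforms, is shown below; the crop size, blur kernel, and contrast-jitter strength are assumed values, and OCT volumes would need an analogous per-B-scan or volumetric treatment.

```python
import torchvision.transforms as T

# Applied to single-channel tensor images of shape (1, 224, 224).
unimodal_augment = T.Compose([
    T.RandomApply([T.CenterCrop(200), T.Resize(224)], p=0.5),   # (1) centered crop, resized back
    T.RandomHorizontalFlip(p=0.5),                               # (2) horizontal flip
    T.RandomApply([T.GaussianBlur(kernel_size=23)], p=0.3),      # (3) Gaussian blur
    T.RandomApply([T.ColorJitter(contrast=0.4)], p=0.3),         # (4) contrast adjustment
])
```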
Data partitioning and stratification for experimental setup
We partitioned the respective datasets into non-overlapping training-validation-test partitions for each experimental setup at a ratio of 80%–15%–5% for the contrastive pre-training task and 80%–10%–10% for the external downstream tasks (Supplementary Figure A1). We created a patient-level separation while dividing the data into subsets, ensuring a stringent evaluation of the models. As the labeled datasets were generally imbalanced, we split them using a stratification technique for the downstream tasks, ensuring the same proportion of target labels in each subset.
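The snippet below sketches one way to realize such a patient-level, label-stratified hold-out split with scikit-learn; it is an illustration under assumed variable names, not the authors' exact procedure, and the validation subset can be carved out of the training portion in the same way.

```python
from sklearn.model_selection import StratifiedGroupKFold

def patient_level_holdout(y, patient_ids, n_splits=10, seed=0):
    """Hold out ~1/n_splits of the scans with no patient overlap, stratified by label."""
    splitter = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # X is only used for its length here, so passing the labels is sufficient.
    train_idx, test_idx = next(splitter.split(X=y, y=y, groups=patient_ids))
    return train_idx, test_idx
```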
Methodology
As shown in Fig. 1, our proposed framework consists of a pre-training phase and a downstream phase. In the pre-training phase, the model learns a transferable fundus image/OCT volume representation via contrastive learning.
Figure 1. The proposed CLIP/CLOOB framework: contrastive pre-training of the encoders (h_x, h_y) of the two retinal imaging modalities (fundus images, x, and OCT volumes, y), followed by using the pre-trained encoders for downstream predictive tasks.
In the downstream phase, clinically relevant prediction tasks are conducted by either linear probing or fine-tuning the learned encoder with often limited labeled data. Beyond disease classification, we aim to explore a wider range of clinically relevant prediction tasks, such as structure-function and biomarker measurements or disease evolution forecasting, broadening the potential applications of these enhanced models.
Contrastive pre-training
Contrastive learning is a learning paradigm that extracts rich and transferable representations. The central idea of contrastive learning is that matched data points should yield similar representations, so-called embeddings, while unmatched data points should have a low similarity. In multi-modal contrastive learning, CLIP has emerged as a widely adopted and effective approach. Its fundamental goal is to concurrently train modality-specific encoders, aligning each modality to a shared embedding space through the InfoNCE contrastive objective40–42. This objective is designed to bring representations of distinct modalities within matched data points, such as text describing a specific image, into close proximity in the embedding space. Conversely, representations of unmatched data points, like an image and text describing an entirely different one, should be positioned considerably far from each other.
While CLIP has been highly successful and widely used, it has been shown to suffer from the explaining-away problem36, in which a few features are overemphasized while others are neglected. This occurs when learning focuses on only a subset of features or when the covariance structure in the data is insufficiently extracted, often due to the saturation of the InfoNCE learning objective. Fürst et al.36 introduced CLOOB to overcome this problem. It employs a modern Hopfield network43 to enhance the capturing of covariance structure during learning. Additionally, CLOOB employs the InfoLOOB41 objective to avoid the saturation issues associated with the InfoNCE objective used in CLIP.
Our proposed contrastive framework, depicted in Fig. 1a, utilizes the CLIP and CLOOB pre-training techniques. During pre-training, the aim is to bring the embeddings of paired fundus and OCT scans of the same eye, acquired at the same visit, close together, while pushing the negatives, i.e., unpaired fundus and OCT scan embeddings, further apart. This allows learning representations for both modalities that explicitly contain information about both.
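As an illustration of the CLIP branch of this objective, the following sketch computes the symmetric InfoNCE loss on a batch of paired fundus/OCT embeddings. The fixed temperature is an assumption (CLIP typically learns a logit scale), and the CLOOB variant (Hopfield retrieval plus InfoLOOB) is not shown.

```python
import torch
import torch.nn.functional as F

def clip_infonce(z_fundus: torch.Tensor, z_oct: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """z_fundus, z_oct: (batch, d) embeddings of paired scans from the same eye and visit."""
    z_f = F.normalize(z_fundus, dim=-1)
    z_o = F.normalize(z_oct, dim=-1)
    logits = z_f @ z_o.t() / temperature                     # pairwise cosine similarities
    targets = torch.arange(z_f.size(0), device=z_f.device)   # matched pairs lie on the diagonal
    loss_f2o = F.cross_entropy(logits, targets)              # retrieve OCT given fundus
    loss_o2f = F.cross_entropy(logits.t(), targets)          # retrieve fundus given OCT
    return 0.5 * (loss_f2o + loss_o2f)
```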
We employ ResNet18 with pre-trained ImageNet weights as the backbone image encoder for fundus images and VideoResNet18 with pre-trained Kinetics44 weights as the backbone volume encoder for OCT volumes. We start from pre-trained weights as they provide a good initialization, significantly speeding up the convergence of the model. The dimension of the embedding space was set to d = 512, which determines the output size of both encoders. The hyper-parameters and training strategies suggested by OpenCLIP45 and CLOOB were used (Supplementary Table A2).
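A minimal sketch of how these two backbones could be instantiated with torchvision is given below; handling of the single grayscale channel (e.g., replicating it to three channels) and any input-layer adjustments are omitted and would be implementation choices.

```python
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights
from torchvision.models.video import r3d_18, R3D_18_Weights

# 2D fundus encoder: ImageNet-pre-trained ResNet18; dropping the classifier leaves 512-d features.
fundus_encoder = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
fundus_encoder.fc = nn.Identity()

# 3D OCT encoder: Kinetics-pre-trained VideoResNet18 (r3d_18); also 512-d after removing the head.
oct_encoder = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
oct_encoder.fc = nn.Identity()
```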
The models were trained with mixed precision for 300 epochs in a distributed fashion on three NVIDIA A100 80GB GPUs with a batch size of 128 on each GPU. To speed up the training and decrease the memory requirements, a single randomly selected fundus image/OCT volume pair per patient was shown to the model at each epoch, with the random seed being adjusted according to the current time. The model weights at the final epoch were saved as the checkpoint for adapting to downstream tasks. As primary evaluation metrics during the contrastive pre-training, we tracked the evolution of the InfoNCE and InfoLOOB losses. Additionally, we analyzed the distribution of cosine similarities of the embedded fundus and OCT imaging modalities. Furthermore, we computed the top-k retrieval accuracy metric to evaluate the ability of the model to retrieve the corresponding fundus image/OCT volume pairs.
Adaptation to downstream tasks
Aer contrastive pre-training, the encoders were used to extract lower-dimensional feature representations for
downstream predictive tasks. We added a single fully connected layer aer the encoder block (Fig. 1b) as our
prediction head. To demonstrate the models’ feature extractor capabilities, linear probing was performed by
freezing the encoder weights and training only the last layer. Additionally, we conducted experiments by ne-
tuning the entire model for the downstream tasks.
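The following sketch illustrates the linear-probing head described above; the class and argument names are illustrative, and the same module with freeze=False corresponds to the fine-tuning setup.

```python
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, encoder: nn.Module, n_outputs: int, freeze: bool = True):
        super().__init__()
        self.encoder = encoder
        if freeze:                               # linear probing: encoder weights stay fixed
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.head = nn.Linear(512, n_outputs)    # regression: n_outputs=1; classification: #classes

    def forward(self, x):
        return self.head(self.encoder(x))
```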
We have outlined a range of clinically relevant downstream tasks on three external datasets not used for pre-training, including disease diagnosis, biomarker detection, disease and treatment prognosis, and structure-function prediction. These tasks cover regression, classification, and forecasting problems, as described below:
Regression tasks: structure-function and biomarker measurements. The HARBOR and OLIVES datasets include information on best-corrected visual acuity (BCVA) and central subfield thickness (CST). BCVA quantifies overall visual function, reflecting the sharpness of vision with optimal correction, often via glasses or contact lenses, and is assessed using a Snellen chart. CST, obtained via OCT imaging, gauges the thickness of the central subfield in the macula, which serves to diagnose conditions like diabetic macular edema. These measurements are pivotal in diagnosing, treating, and monitoring eye conditions. They assist ophthalmologists in evaluating visual function, detecting anomalies, and planning appropriate interventions, thus enhancing clinical workflows. Here, we aim to provide them with an automated assessment tool.
Classification tasks: biomarker detection and disease diagnosis. We formulated an image classification task on the HARBOR dataset to detect fluid accumulation within choroidal neovascularization (CNV) (Fluid present). This task involves identifying the presence of fluid in OCT, as determined by a certified reading center, which may manifest as subretinal fluid, intraretinal fluid, or cystoid macular edema. Although this is an OCT-related biomarker, we hypothesize that the multi-modal pre-training could also allow for predicting it accurately from fundus images, offering a cost-effective screening tool in practices where OCT scanners are unavailable.
In the context of the OLIVES and MIX datasets, we focus on the task of disease classification. In the former, we want to identify the presence of DME or DR, while the MIX dataset allows for defining a multi-class problem. This entailed distinguishing between healthy cases and those involving DME, iAMD, RVO, and GA. Automated disease diagnosis from retinal imaging is essential in ophthalmology as it enables accurate and timely identification of various conditions, facilitating early intervention and personalized treatment strategies for improved patient outcomes.
Forecasting tasks: disease and treatment prognosis. In the HARBOR dataset, we utilized the effective number of injections (n_inj) received during the 2-year study period under the pro-re-nata treatment regimen to establish distinct treatment requirement categories, much akin to the approach taken by Romo et al.46. Specifically, we categorized patients into high (n_inj ≥ 16) and low (n_inj ≤ 5) treatment requirement groups, where the high category encompassed patients in the first quartile of the population, and the low category included those in the third quartile. Here, we focus on predicting the high treatment need (High TN) category for our forecasting task after the administration of loading doses. Establishing treatment requirement categories, such as high and low treatment needs, can be crucial in optimizing patient care and treatment planning, ensuring timely and personalized interventions.
Furthermore, within this dataset, the fellow eyes of a subgroup of patients exhibited a conversion to geographic atrophy (GA Conv.) or choroidal neovascularization (CNV Conv.) within the two-year duration of the trial. Consequently, we introduced an additional downstream forecasting task. Specifically, we framed the problem as predicting the conversion to GA/CNV based on the baseline scan data of these fellow eyes. Predicting the conversion to GA or CNV in fellow eyes provides valuable insights for proactive management and early detection of progressive retinal conditions, contributing to improved patient outcomes.
The training was run on an NVIDIA A100 40GB GPU with a batch size of 128 (fundus input)/64 (OCT input) in the linear-probing setup and 64 (fundus input)/32 (OCT input) in the fine-tuning setup, while the learning rate was found by performing a hyper-parameter search on the [1e-6, 1e-3] interval. We adopted an AdamW optimizer47 with the initial learning rate found during the hyper-parameter search and a weight decay of 0.1. Moreover, a reduce-on-plateau learning rate scheduler was applied. The training was performed for a maximum of 100 epochs or until the loss converged on the validation data. Early stopping was applied to avoid overfitting: the training was stopped if the loss on the validation set no longer decreased for five epochs. We used the mean squared error loss for the regression tasks, and the binary cross-entropy and cross-entropy losses for the classification and forecasting tasks, respectively.
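A compact sketch of this optimization loop is given below; the learning rate shown is only a placeholder within the searched interval, and the scheduler patience is an assumed value not reported in the text.

```python
import torch

def fit(model, train_one_epoch, evaluate, max_epochs=100, patience=5):
    """train_one_epoch(model, optimizer) runs one epoch; evaluate(model) returns validation loss."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)   # lr: placeholder
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=2)
    best_val, bad_epochs = float("inf"), 0
    for _ in range(max_epochs):
        train_one_epoch(model, optimizer)
        val_loss = evaluate(model)
        scheduler.step(val_loss)                 # reduce-on-plateau on the validation loss
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:           # early stopping after 5 stagnant epochs
                break
    return model
```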
The R²-score and the root-mean-square error (RMSE) were used to evaluate the regression models. In contrast, the weighted area under the receiver operating characteristic curve (AUROC) and the average precision score (AP) were used to assess the performance of the models for classification. We computed the weighted one-vs-rest AUROC and AP scores for the multi-class setup. We applied bootstrapping on the hold-out test sets to estimate the variance of the models' performance.
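The sketch below shows a generic bootstrapping routine of the kind used for such test-set variance estimates; the number of resamples and the percentile-interval construction are assumptions, and the metric function is pluggable (e.g., scikit-learn's r2_score or roc_auc_score).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_metric(y_true, y_pred, metric=roc_auc_score, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample the test set with replacement
        if len(np.unique(y_true[idx])) < 2:               # skip single-class resamples (AUROC/AP)
            continue
        scores.append(metric(y_true[idx], y_pred[idx]))
    return np.mean(scores), np.percentile(scores, [2.5, 97.5])   # mean and 95% interval
```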
Uni-modal baselines
Our multi-modal pre-training was benchmarked against three pertinent uni-modal SSL models: SimCLR39, BYOL48, and VICReg49. SimCLR focuses on maximizing agreement between augmented views of the same image, learning representations by contrasting positive pairs against a large set of negative pairs. BYOL employs a dual neural network architecture, where one network predicts the other's representations, fostering self-supervision without the need for negative pairs. VICReg likewise dispenses with negative pairs, relying only on positive pairs together with a variance-invariance-covariance regularization of the embeddings. Its specially designed objective function serves to prevent dimension and mode collapse, addressing challenges associated with other contrastive learning methods and ensuring the stability and efficacy of the learning process. Comparing against these methods allows for an assessment of the effectiveness of our CLIP/CLOOB-based approach in the context of retinal imaging and an understanding of whether combining 2D and 3D information provides advantages over more traditional uni-modal SSL techniques.
For all uni-modal methods, we utilized ResNet18 and VideoResNet18 as encoders. While the multi-modal pipeline does not rely on an additional projection layer, the uni-modal approaches do. For this, we used a single 128-dimensional linear layer that was then discarded for the downstream tasks. These SSL models were trained on the same pre-training dataset using identical hyper-parameters and optimizer setup as our multi-modal method, except for the batch size, which had to be adjusted according to the computational resource limitations. In the 2D setup, the batch size could be increased to 256 per GPU, while in the 3D setup, it had to be decreased to 80.
Interchanging retinal imaging modalities
We analyzed the feasibility of seamlessly swapping OCT embeddings with fundus embeddings for the OCT-based trained prediction models. This approach explores whether models designed for one imaging modality can generalize across different modalities without requiring extensive re-training, which could simplify model deployment in clinical settings. To test this hypothesis, we extracted the latent embeddings of the fundus images within the test sets of the respective downstream tasks using the frozen pre-trained fundus encoder. Subsequently, we passed these fundus embeddings as input to our OCT-based prediction models, which had been trained using the linear probing setup. Linear probing preserves the contrastive pre-training constraints, namely the objective of mapping corresponding fundus image and OCT volume pairs close together. As such,
it serves as an ideal testing ground for evaluating our hypothesis of imaging modality interchangeability. The ability to seamlessly utilize embeddings from one modality as input for models trained on a different modality implies greater flexibility in clinical applications, potentially reducing the need for modality-specific models and enabling a more streamlined and versatile approach to predictive tasks in retinal imaging.
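Conceptually, the swap amounts to the following few lines; the names are illustrative, and the head is the linear layer that was trained on OCT embeddings.

```python
import torch

@torch.no_grad()
def predict_with_swapped_modality(fundus_encoder, oct_head, fundus_batch):
    """fundus_batch: pre-processed fundus images; oct_head: linear head trained on OCT embeddings."""
    z_fundus = fundus_encoder(fundus_batch)   # 512-d embeddings from the shared latent space
    return oct_head(z_fundus)                 # reuse the OCT-trained linear classifier/regressor
```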
Results
Multi-modal contrastive pre-training yields an effective retrieval system
First, we evaluate the effectiveness of the pre-training by assessing the model's proficiency in performing the pre-training task, namely retrieving the corresponding OCT volume for a given fundus image from the same eye, and vice versa, using the top-k accuracy score. Top-1 accuracy refers to the rate at which the class with the highest probability is the correct class, while for k > 1, top-k accuracy refers to the rate at which the top k pairs with the highest probability contain the correct pair. To intensify the challenge, we enhance the task complexity by considering all samples per patient. It is important to note that this task is nearly impossible for human experts to perform accurately.
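For reference, top-k retrieval accuracy can be computed from the pairwise similarity matrix of the two embedding sets as sketched below (an illustrative implementation, assuming matched pairs share the same index).

```python
import torch
import torch.nn.functional as F

def topk_retrieval_accuracy(z_fundus, z_oct, ks=(1, 5, 10)):
    sim = F.normalize(z_fundus, dim=-1) @ F.normalize(z_oct, dim=-1).t()   # (N, N) cosine similarities
    ranks = sim.argsort(dim=-1, descending=True)                           # candidate OCTs per query
    correct = torch.arange(sim.size(0)).unsqueeze(1)                       # true match index per query
    return {k: (ranks[:, :k] == correct).any(dim=-1).float().mean().item() for k in ks}
```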
The top-k accuracy scores for the contrastive models on the retrieval task, with k values of 1, 5, and 10, are summarized in Table 1, and illustrative examples are presented in Fig. 2. A retrieval score of less than 0.1% in the baseline setup (ImageNet/Kinetics-initialized encoders) shows that the corresponding image-volume pairs are not consistently mapped close in the latent space by default, which is what we expect. In the more straightforward, single-sample-per-patient case (189 fundus image/OCT volume pairs from 189 patients), CLIP ranked the correct OCT volume first in 78.1% of cases, while CLOOB ranked it first in 80% of cases, with the top-5 and top-10 accuracy being above 94% for both. The models' ability to precisely retrieve the corresponding OCT volume from a given fundus image, and vice versa, establishes a significant opportunity for interchangeability between OCT and fundus embeddings within the downstream prediction tasks.
On the complete hold-out set of 6,948 fundus image and OCT volume pairs, CLIP ranked the correct OCT volume first in 9.70% of cases, while CLOOB ranked it first in 10.90% of cases. Moreover, we found that, on average, 58.20% of the retrieved images/volumes came from the same patient in the case of CLIP and 62.90% in the case of CLOOB, suggesting that patient scans from different time points are inherently mapped closely together in both models. However, a preliminary cluster analysis of these representations revealed no interpretable insights. Nonetheless, it opens opportunities for further investigation.
With the inherent characteristics of contrastive pre-training, we expected that the embeddings of the two distinct imaging modalities would interweave, leading to larger cosine similarities between the positive pairs. The embeddings derived from CLIP and CLOOB had similarity scores of 0.698 (SD = 0.084) and 0.385 (SD = 0.072), respectively, while those obtained with the ImageNet/Kinetics pre-trained encoders scored only 0.108 (SD = 0.000). This behavior indicates that scans from the same eye (positive pairs) cluster closely together, while those from different eyes (negative pairs) disperse more widely (Supplementary Fig. A2a). Moreover, we empirically observed that the model primarily segregates the scans based on the vendor (Supplementary Fig. A2b). This outcome was somewhat expected since scans obtained from different scanners exhibit significant dissimilarities, and accounting for them was out of the scope of this work.
The learned representations improve predictive performance on external downstream tasks
The outcomes of the linear probing experiments illustrate the advantages of domain-specific self-supervised pre-training over natural image-based or uni-modal contrastive pre-training for the encoders across most downstream tasks in both the regression and classification/forecasting configurations (Fig. 3).
Retrieval task | Test set | Model | Top-1 | Top-5 | Top-10
Fundus to OCT | One sample/patient | Baseline | 0.005 (0.000, 0.040) | 0.022 (0.005, 0.062) | 0.041 (0.016, 0.092)
Fundus to OCT | One sample/patient | CLIP | 0.781 (0.704, 0.848) | 0.959 (0.908, 0.984) | 0.980 (0.938, 0.995)
Fundus to OCT | One sample/patient | CLOOB | 0.800 (0.720, 0.861) | 0.947 (0.898, 0.979) | 0.974 (0.927, 0.992)
Fundus to OCT | All samples/patient | Baseline | 0.000 (0.000, 0.001) | 0.001 (0.000, 0.002) | 0.002 (0.001, 0.003)
Fundus to OCT | All samples/patient | CLIP | 0.097 (0.090, 0.104) | 0.310 (0.299, 0.321) | 0.457 (0.445, 0.468)
Fundus to OCT | All samples/patient | CLOOB | 0.109 (0.102, 0.117) | 0.333 (0.322, 0.344) | 0.482 (0.470, 0.494)
OCT to fundus | One sample/patient | Baseline | 0.005 (0.000, 0.040) | 0.025 (0.005, 0.062) | 0.050 (0.021, 0.102)
OCT to fundus | One sample/patient | CLIP | 0.768 (0.689, 0.836) | 0.949 (0.898, 0.979) | 0.978 (0.938, 0.995)
OCT to fundus | One sample/patient | CLOOB | 0.799 (0.720, 0.861) | 0.945 (0.889, 0.975) | 0.973 (0.927, 0.992)
OCT to fundus | All samples/patient | Baseline | 0.000 (0.000, 0.001) | 0.001 (0.000, 0.001) | 0.001 (0.001, 0.003)
OCT to fundus | All samples/patient | CLIP | 0.093 (0.086, 0.100) | 0.293 (0.283, 0.304) | 0.437 (0.425, 0.448)
OCT to fundus | All samples/patient | CLOOB | 0.103 (0.096, 0.110) | 0.334 (0.323, 0.345) | 0.484 (0.473, 0.496)
Table 1. Results of the retrieval task. Given a fundus image, the correct OCT volume must be selected from a set of candidates (and vice versa for the OCT-to-fundus direction). Top-1, top-5, and top-10 accuracy are shown for the hold-out test set, along with the lower and upper limits of a 95% confidence interval (CI). Here, Baseline refers to the encoders initialized with ImageNet/Kinetics pre-trained weights, and CLIP and CLOOB to the encoders obtained using the CLIP- and CLOOB-based pre-training, respectively.
Figure 3. The predictive models' average performance on the external downstream tasks in linear probing, evaluated on the hold-out test set via bootstrapping-based validation. Fundus image-based results are depicted overlapped on the OCT volume-based results, colored less opaquely. Supplementary Table A3 provides a detailed overview of the numerical results.
Figure 2. Example results for the retrieval task on a hold-out test set. The 10 OCT volumes whose representations are most similar to the query fundus image are shown along with their corresponding fundus images. Orange bounding boxes mark the correct pair.
As anticipated, the models based on the more detailed OCT scans exhibited superior performance compared to their fundus imaging-based counterparts in all tasks, albeit with a minimal margin in some classification instances. This suggests that the learned fundus embeddings benefit from the richer information in the higher-dimensional OCT volumes.
In tasks involving regression based on fundus images, where many models struggle to capture the variance in the dependent variable, resulting in negative R²-scores indicative of predictive power worse than a simple average, the adoption of multi-modal pre-training proves to be an effective strategy. Specifically, in predicting BCVA from fundus images in the HARBOR dataset, only models pre-trained with CLIP and CLOOB yielded positive R²-scores. The CLOOB pre-trained model demonstrated similar magnitudes when using OCT volumes as inputs. Notably, the CLIP-based model achieved the highest R²-scores of 0.308 for fundus images and 0.462 for OCT volumes. Similarly, in CST prediction, effective explanatory power for the dependent variable was primarily achieved using OCT volumes and multi-modal pre-trained weights. On BCVA prediction tasks with fundus inputs in the OLIVES dataset, multi-modal approaches were outperformed by BYOL and VICReg pre-trained models, with only BYOL achieving a positive, albeit low, R²-score. CLIP-based pre-training was best for predicting CST from fundus images, reaching an R²-score of 0.286. In contrast, using OCT volumes as inputs significantly improved performance, with CLOOB-based models achieving the highest scores of 0.364 and 0.864 for BCVA and CST, respectively.
Across all downstream classification and forecasting tasks, models leveraging multi-modal pre-training consistently demonstrated superior performance compared to their uni-modal counterparts, albeit by a marginal increase in certain instances. Notably, the ability to predict the presence of fluid from fundus images using the multi-modal pre-trained encoders approached the efficacy of OCT-based methods. The best performance was achieved by the CLIP-based pre-training, leading to an AUROC of 0.755, while in the OCT-based case, CLOOB yielded an AUROC of 0.871. This development holds promise for an essential pre-screening tool based on fundus images, a modality that, as previously discussed, is more cost-effective and widely accessible.
While the models exhibited diminished performance in more intricate forecasting tasks, stratifying treatment requirements proved more manageable, particularly with OCT volumes, for which the CLOOB-based model achieved an AUROC of 0.755, outperforming the other forecasting tasks. Predicting whether an eye will convert to GA or CNV within 24 months poses an ambitious task due to the many changes in the diseased retina over such an extended time frame. Nevertheless, the multi-modal pre-trained models showcased their superiority even in these challenging tasks. Interestingly, fundus-based models initialized with uni-modal pre-trained weights slightly outperformed their OCT-based counterparts in specific scenarios. For instance, in the case of CNV conversion, SimCLR using fundus images as inputs achieved an AUROC score of 0.565, surpassing the OCT-based model, which attained only 0.513. This suggests that the learned 3D representations may not fully encapsulate patterns relevant to this task.
The disease classification tasks have the highest performance on both the OLIVES and MIX datasets. Despite not being exposed to non-AMD cases during pre-training, the resulting latent representations of fundus images and OCT scans depicting various diseases within the external datasets demonstrate clear separability (Supplementary Fig. A2). This implies that the latent representations effectively capture disease-related patterns and generalize well to domain shifts. Furthermore, achieving an AUROC above 0.900 with fundus-based classifiers again provides a good basis for adopting such models in clinical practice, particularly in scenarios where the more expensive OCT modality is not readily available.
Comparing the linear probing results with the outcomes from the fine-tuning setup reveals that the linear probing performances using multi-modal pre-trained encoders closely approach those achieved through fine-tuning (Fig. 4). Hence, these pre-trained models could readily be used as feature extractors even without the strong computational resources needed to fine-tune the whole model. In the classification and forecasting tasks, the CLIP or CLOOB pre-trained models always outperform the uni-modal and fully supervised approaches. However, for the BCVA regression task in the HARBOR dataset, SimCLR initialization resulted in notably better performance, yielding an R²-score of 0.670. While the fine-tuned multi-modal approaches led to the highest performance for CST regression, a significant improvement can be observed in the fine-tuned uni-modal approaches over their linear probing counterparts. We hypothesize that the features extracted by the encoders during linear probing may not fully capture the information needed for CST, which involves measuring the relative distance between the top and bottom of the retina, a task requiring more global context. This global context may not be adequately represented during contrastive pre-training. However, after fine-tuning, the model adjusts to integrate the necessary task-specific information, leading to improved performance.
Overall, the efficacy of various pre-training methods depends on factors like task difficulty and the available sample size for training. Fully supervised training or natural image-based pre-trained weights can yield results comparable to domain-specific pre-training under certain conditions. The absence of a definitive conclusion on the superior multi-modal pre-training method suggests that each exhibits unique strengths tailored to specific tasks. Nevertheless, our findings highlight the effectiveness of our approach and emphasize the significance of leveraging contrastive pre-training for enhancing model performance and feature extraction capabilities.
Interchanging retinal imaging modalities can preserve predictive power
The results of our investigation into the interchangeability of imaging modalities, particularly swapping OCT embeddings with fundus embeddings, are shown in Fig. 5. The performance decay on the different tasks is highly varied and especially large on the OLIVES dataset, which might be related to the quality of the acquired fundus images. Nevertheless, in most cases, this decay was below 20%, showing that interchangeability is possible to some extent. The performance drop is smaller in the case of CLIP-based pre-training, which was expected, as this pre-training approach yielded higher similarities between the corresponding fundus image and OCT volume
embeddings. Although the fundus image-based fine-tuned models achieve higher performance, by leveraging the acquired embeddings, we can bridge the gap between the two retinal imaging modalities, making it possible to extrapolate insights and predictions from one imaging modality to another. This versatility holds the potential to use predictive models in scenarios where access to the more expensive OCT imaging modality is limited or unavailable.
Discussion
This work proposes a contrastive SSL pre-training method for ophthalmology. SSL holds significant promise for advancing the field, as multi-modal data, such as OCT volumes and fundus images, are often available, but annotations are scarce and costly. By leveraging large amounts of unlabeled data, we show that we can generate robust feature representations without the need for extensive manual labeling.
Figure 5. The predictive models' performance on the external classification tasks when the fundus embeddings are used as classifier inputs.
Figure 4. The predictive models' average performance on the external downstream tasks in fine-tuning, evaluated on the hold-out test set via bootstrapping-based validation. Fundus image-based results are depicted overlapped on the OCT volume-based results, colored less opaquely. Supplementary Table A4 provides a detailed overview of the numerical results.
Our pre-training applies multi-modal contrastive learning to a 3D OCT volume encoder and a 2D fundus image encoder with the help of two contrastive objectives, InfoNCE and InfoLOOB, and, in the case of the latter, the use of modern Hopfield layers to store reference embeddings.
Our method is designed to advance personalized patient management and disease progression modeling in ophthalmology by simultaneously considering different imaging modalities, namely OCT volumes and fundus images. Our overarching goal is to contribute meaningfully to the ongoing endeavors to establish multi-modal foundation models for various ophthalmological applications. It is important to note that our approach differs from the recent work by Zhou et al.34, who introduced RETFound, a foundation model for 2D retinal images based on color fundus images and OCT B-scans; however, they built separate models for the two modalities and, as such, do not exploit the complementary information that these modalities can provide. Unlike their work, our study underscores the advantages of integrating volumetric data from OCT scans and effectively harmonizing information between these two essential modalities, paving the way for more comprehensive and robust medical imaging solutions.
The proposed self-supervised approach provides not only robust encoders that yield comprehensive OCT and fundus image representations with interpretable features suitable for downstream clinical tasks, but also a retrieval system. The research findings demonstrate that multi-modal contrastive pre-training enhances model performance across various clinically relevant supervised tasks, encompassing binary classification and regression. Our supervised models showcase remarkable robustness to challenges such as class imbalance and overfitting, even when dealing with limited data (in the treatment requirement and GA and CNV conversion prediction cases, for example). Additionally, the performance achieved by our models on external datasets is comparable to previous works, in some cases surpassing the reported results.
Our results align well with existing state-of-the-art methods across several studies. Kawczynski et al.50 used a ResNet50v2 CNN on 3D OCT volumes to predict BCVA scores, achieving R² = 0.660 (RMSE = 11.750), while our CLOOB-based model achieved R² = 0.629 (RMSE = 12.218). Romo-Bucheli et al.46 used a deep learning model on longitudinal OCT data for nAMD treatment prediction, with an AUROC of 0.810, compared to our CLOOB-based method at 0.755 and CLIP-based method at 0.796. Our model for predicting GA/CNV conversion over 24 months yielded AUROC scores of 0.701 and 0.698, paralleling Schmidt-Erfurth et al.51, but without extensive image segmentation. Finally, Kokilepersaud et al.52 used supervised contrastive learning on the OLIVES dataset, achieving an AUROC of around 0.800, which our multi-modal pre-trained models outperformed in the DME/DR prediction tasks and matched in the fundus image-based experiments.
In addition to the good performance across a diverse spectrum of clinically pertinent external downstream tasks, we conducted a feasibility analysis to explore the intriguing prospect of interchanging imaging modalities for predictive purposes. The fundamental attribute of contrastive learning, characterized by the close mapping of corresponding pairs within the latent space, empowers us to extend insights and predictions seamlessly from one imaging modality to another. Our results show a performance decay below 20% in most cases, underscoring the potential utility of predictive models even in clinical scenarios where the more resource-intensive OCT modality may be inaccessible. This emphasizes the robustness and adaptability of our approach in diverse clinical contexts.
Nonetheless, this work also has a few limitations. First, the contrastive pre-training of the encoders was done on data acquired from nAMD patients only, and it is unclear how training with a larger dataset encompassing various retinal diseases would affect the latent space. Second, we did not apply any mechanism to adjust for the image domain shift resulting from using devices from several vendors for image acquisition, which was reflected in the latent space. This suggests that the models learn to extract relevant features from each type of image; however, they do not share much information between the vendors. Furthermore, while the subsampled OCT volumes used in this work capture more information about the retina's structure and disease-related changes than existing 2D approaches that focus solely on the central B-scan, the subsampling may limit the model from reaching its full potential. Due to computational constraints, an often-faced disadvantage of SSL methods, we had to balance efficiency and accuracy by subsampling the OCT volumes. In future work, we plan to explore methods for encoding the full 3D OCT volume, which would enable the capture of more detailed spatial information and potentially enhance the model's performance, particularly for tasks requiring comprehensive volumetric analysis. Lastly, we performed the training and evaluation of the supervised models in a cross-sectional manner, ignoring temporal correlations and treatment effects. Exploiting temporal modeling techniques could improve performance and should be further explored.
In conclusion, this work highlights the potential of multi-modal contrastive deep-learning models to leverage vast amounts of unlabeled data, offering promising avenues for various ophthalmic image interpretation tasks. This approach reduces the dependency on annotated datasets and mitigates inefficiencies in clinical workflows stemming from extensive labeling efforts. Our proposed method is a simple yet effective approach to developing 3D and multi-modal AI models and holds promise in advancing medical imaging tasks and facilitating more efficient and accurate patient care in ophthalmology. Future research should focus on enhancing the strength of multi-modal contrastive pre-training by addressing technical challenges related to limited training data and computational resources. Additionally, exploring the integration of further modalities to create robust and transferable encoders for predicting clinical visual function measures presents an intriguing avenue for further exploration in the clinical setting.
Data availability
The raw datasets are not publicly accessible and are the property of the respective companies. However, the data may be available from the Medical University of Vienna subject to local and national ethical approvals and can be requested from the authors via email at hrvoje.bogunovic@meduniwien.ac.at. The OLIVES dataset is publicly available at https://zenodo.org/records/7105232.
Code availability
The implementation of the CLIP/CLOOB-based contrastive pre-training was based on the code available on GitHub at https://github.com/ml-jku/cloob.
Received: 2 July 2024; Accepted: 31 October 2024
References
1. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
2. Shen, D., Wu, G. & Suk, H.-I. Deep learning in medical image analysis. Annu. Rev. Biomed. Eng. 19, 221–248 (2017).
3. Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017).
4. Esteva, A. et al. Deep learning-enabled medical computer vision. NPJ Digit. Med. 4, 5 (2021).
5. Fink, O. et al. Potential, challenges and future directions for deep learning in prognostics and health management applications.
Eng. Appl. Artif. Intell. 92, 103678 (2020).
6. Jing, L. & Tian, Y. Self-supervised visual feature learning with deep neural networks: A survey. IEEE Trans. Pattern Anal. Mach.
Intell. 43, 4037–4058 (2020).
7. Le-Khac, P. H., Healy, G. & Smeaton, A. F. Contrastive representation learning: A framework and review. IEEE Access 8, 193907–
193934 (2020).
8. He, K. et al. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 16000–16009 (2022).
9. Albelwi, S. Survey on self-supervised learning: Auxiliary pretext tasks and contrastive learning methods in imaging. Entropy 24,
551 (2022).
10. Rani, V., Nabi, S. T., Kumar, M., Mittal, A. & Kumar, K. Self-supervised learning: A succinct review. Arch. Comput. Methods Eng.
30, 2761–2775 (2023).
11. Huang, S.-C. et al. Self-supervised learning for medical image classification: A systematic review and implementation guidelines. NPJ Digit. Med. 6, 74 (2023).
12. Nielsen, M., Wenderoth, L., Sentker, T. & Werner, R. Self-supervision for medical image classification: State-of-the-art performance with ~100 labeled training samples per class. Bioengineering 10, 895 (2023).
13. You, C., Zhao, R., Staib, L.H. & Duncan, J.S. Momentum contrastive voxel-wise representation learning for semi-supervised
volumetric medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted
Intervention. 639–652 (Springer, 2022).
14. You, C., Dai, W., Min, Y., Staib, L. & Duncan, J.S. Bootstrapping semi-supervised medical image segmentation with anatomical-
aware contrastive distillation. In International Conference on Information Processing in Medical Imaging. 641–653 (Springer, 2023).
15. You, C. et al. Mine your own anatomy: Revisiting medical image segmentation with extremely limited labels. In IEEE Transactions
on Pattern Analysis and Machine Intelligence (2024).
16. Azad, B. et al. Foundational models in medical imaging: A comprehensive survey and future vision. arXiv preprint arXiv:2310.18689 (2023).
17. Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T. J. & Zou, J. A visual-language foundation model for pathology image analysis
using medical twitter. Nat. Med. 29, 2307–2316 (2023).
18. Moor, M. et al. Foundation models for generalist medical articial intelligence. Nature 616, 259–265 (2023).
19. Schneider, J., Meske, C. & Kuss, P. Foundation models: A new paradigm for artificial intelligence. Bus. Inf. Syst. Eng. 1–11 (2024).
20. Tu, T. et al. Towards Generalist Biomedical AI. arXiv preprint arXiv:2307.14334 (2023).
21. Singhal, K. et al. Large language models encode clinical knowledge. Nature 1–9 (2023).
22. Moor, M. et al. Med-Flamingo: A Multimodal Medical Few-shot Learner. arXiv:2307.15189 (2023).
23. Zakka, C. et al. Almanac: Retrieval-Augmented Language Models for Clinical Medicine (2023). arXiv:2303.01229
24. Schmidt-Erfurth, U., Sadeghipour, A., Gerendas, B. S., Waldstein, S. M. & Bogunović, H. Artificial intelligence in retina. Prog. Retinal Eye Res. 67, 1–29. https://doi.org/10.1016/j.preteyeres.2018.07.004 (2018).
25. De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24, 1342–1350. https://doi.org/10.1038/s41591-018-0107-6 (2018).
26. Abràmoff, M. D., Lavin, P. T., Birch, M., Shah, N. & Folk, J. C. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. npj Digit. Med. 1, 39. https://doi.org/10.1038/s41746-018-0040-6 (2018).
27. Keane, P. A. & Topol, E. J. With an eye to AI and autonomous diagnosis. npj Digit. Med. 1, 40. https://doi.org/10.1038/s41746-018-0048-y (2018).
28. Yoo, T. K. et al. The possibility of the combination of OCT and fundus images for improving the diagnostic accuracy of deep learning for age-related macular degeneration: A preliminary experiment. Med. Biol. Eng. Comput. 57, 677–687 (2019).
29. Vaghefi, E., Hill, S., Kersten, H. M. & Squirrell, D. Multimodal retinal image analysis via deep learning for the diagnosis of intermediate dry age-related macular degeneration: A feasibility study. J. Ophthalmol. 2020 (2020).
30. Jin, K. et al. Multimodal deep learning with feature level fusion for identification of choroidal neovascularization activity in age-related macular degeneration. Acta Ophthalmol. 100, e512–e520 (2022).
31. Li, X., Jia, M., Islam, M. T., Yu, L. & Xing, L. Self-supervised feature learning via exploiting multi-modal data for retinal disease
diagnosis. IEEE Trans. Med. Imaging 39, 4023–4033 (2020).
32. Holmberg, O. G. et al. Self-supervised retinal thickness prediction enables deep learning from unlabelled data to boost classification
of diabetic retinopathy. Nat. Mach. Intell. 2, 719–726 (2020).
33. Azizi, S. et al. Robust and ecient medical imaging with self-supervision. arXiv preprint[SPACE]arXiv:2205.09723 (2022).
34. Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 1–8 (2023).
35. Radford, A. et al. Learning transferable visual models from natural language supervision. In International Conference on Machine
Learning. 8748–8763 (PMLR, 2021).
36. Fürst, A. et al. CLOOB: Modern Hopeld networks with InfoLOOB outperform CLIP. In Advances in Neural Information Processing
Systems (Koyejo, S. et al. eds.) . Vol.35. 20450–20468 (Curran Associates, Inc., 2022).
37. Busbee, B.G. et al. Twelve-month ecacy and safety of 0.5 mg or 2.0 mg ranibizumab in patients with subfoveal neovascular age-
related macular degeneration. Ophthalmology 120, 1046–1056 (2013).
38. Prabhushankar, M. et al. OLVIES dataset: Ophthalmic labels for investigating visual eye semantics. Adv. Neural Inf. Process. Syst.
35, 9201–9216 (2022).
39. Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In
International Conference on Machine Learning. 1597–1607 (PMLR, 2020).
40. Oord, A. V.D., Li, Y. & Vinyals, O. R epresentation learning with contrastive predictive coding. arXiv preprint[SPACE]arXiv:1807.03748
(2018).
41. Poole, B., Ozair, S., Van Den Oord, A., Alemi, A. & Tucker, G. On variational bounds of mutual information. In International
Conference on Machine Learning. 5171–5180 (PMLR, 2019).
Scientic Reports | (2024) 14:26802 11
| https://doi.org/10.1038/s41598-024-78515-y
www.nature.com/scientificreports/
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Acknowledgements
This work received financial support from the Austrian Science Fund (FWF), Grant-DOI: 10.55776/FG9. The funder played no role in the study design, data collection, analysis, and interpretation of data or the writing of this manuscript. For the purpose of open access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.
Author contributions
E.S. and H.B. designed the experiments, with E.S. executing and E.S. and H.B. analyzing the outcomes. E.R., N.S., A.M., and G.K. contributed their expertise in the CLOOB model and offered technical assistance for its application to the specific use case. U.S.E., H.B., and G.K. oversaw the project. All authors reviewed and endorsed the final manuscript.
Declarations
Competing interests
H.B.: Contract Research to the Medical University of Vienna: Heidelberg Engineering. U.S.-E.: Scientific Consultant: Genentech, Heidelberg Engineering, Kodiak, Novartis, RetInSight, Roche; Contract Research to the Medical University of Vienna: Apellis, Genentech, Kodiak. The remaining authors have no competing interests to declare.
Additional information
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1038/s41598-024-78515-y.
Correspondence and requests for materials should be addressed to E.S. or H.B.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and
indicate if changes were made. The images or other third party material in this article are included in the article’s
Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included
in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or
exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy
of this licence, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2024
Scientic Reports | (2024) 14:26802 12
| https://doi.org/10.1038/s41598-024-78515-y
www.nature.com/scientificreports/
Content courtesy of Springer Nature, terms of use apply. Rights reserved