Improved Automatic Diabetic Retinopathy
Severity Classiﬁcation Using Deep Multimodal
Fusion of UWF-CFP and OCTA Images
Mostafa El Habib Daho1,2,∗, Yihao Li1,2,∗, Rachid Zeghlache1,2, Yapo Cedric
Atse1,2, Hugo Le Boité3,4, Sophie Bonnin5, Deborah Cosette6, Pierre
Deman7,8, Laurent Borderie8, Capucine Lepicard9, Ramin Tadayoni3,5,
Béatrice Cochener1,2,10, Pierre-Henri Conze11,2, Mathieu Lamard1,2, and
1Univ Bretagne Occidentale, Brest, France
2LaTIM UMR 1101, Inserm, Brest, France
3Ophthalmology department, Lariboisiere Hospital, APHP, Paris, France
4Paris Cité University, Paris, France
5Ophthalmology Department, Rothschild Foundation Hospital, Paris, France
6Carl Zeiss Meditec Inc, Dublin, CA, United States
7ADCIS, Saint-Contest, F-14280 France
8Evolucare Technologies, Le Pecq, F-78230 France
9AP-HP, Paris, France
10 Ophthalmology Department, CHRU Brest, Brest, France
11 IMT Atlantique, Brest, France
Abstract. Diabetic Retinopathy (DR), a prevalent and severe complica-
tion of diabetes, aﬀects millions of individuals globally, underscoring the
need for accurate and timely diagnosis. Recent advancements in imag-
ing technologies, such as Ultra-WideField Color Fundus Photography
(UWF-CFP) imaging and Optical Coherence Tomography Angiography
(OCTA), provide opportunities for the early detection of DR but also
pose signiﬁcant challenges given the disparate nature of the data they
produce. This study introduces a novel multimodal approach that lever-
ages these imaging modalities to notably enhance DR classiﬁcation. Our
approach integrates 2D UWF-CFP images and 3D high-resolution 6x6
mm³ OCTA (both structure and flow) images using a fusion of ResNet50
and 3D-ResNet50 models, with Squeeze-and-Excitation (SE) blocks to
amplify relevant features. Additionally, to increase the model’s general-
ization capabilities, a multimodal extension of Manifold Mixup, applied
to concatenated multimodal features, is implemented. Experimental re-
sults demonstrate a remarkable enhancement in DR classiﬁcation per-
formance with the proposed multimodal approach compared to methods
relying on a single modality only. The methodology laid out in this work
holds substantial promise for facilitating more accurate, early detection
of DR, potentially improving clinical outcomes for patients.
Keywords: Diabetic Retinopathy Classiﬁcation ·Multimodal Informa-
tion Fusion ·Deep learning ·UWF-CFP ·OCTA
arXiv:2310.01912v1 [eess.IV] 3 Oct 2023
1 Introduction
Diabetic Retinopathy (DR), a common ocular complication of diabetes, is a leading
cause of blindness globally. The disease is characterized by progressive
damage to the retina due to prolonged hyperglycemia and is estimated to aﬀect
approximately one-third of all people with diabetes. As such, timely and accu-
rate diagnosis of DR is crucial for eﬀective management and treatment. However,
the subtle and complex nature of the disease’s early stages presents a challenge
for such a diagnosis.
Recent advances in imaging techniques have signiﬁcantly enhanced the ability
to detect and classify DR. Ultra-WideField Color Fundus Photography (UWF-
CFP) imaging and Optical Coherence Tomography Angiography (OCTA) are
two such techniques that have shown great promise. UWF-CFP imaging offers a
panoramic view of the retina, allowing for a more comprehensive assessment,
while OCTA provides depth-resolved images of retinal blood flow, revealing
detailed microvascular changes indicative of DR.
Each of these imaging modalities has its own merits and offers a distinct
perspective on retinal pathology. Leveraging the information from both could
potentially enhance the diagnosis and classiﬁcation of DR [8, 24]. However, the
integration of these modalities poses a signiﬁcant challenge due to the disparate
nature of the data they produce, especially in terms of dimensionality (2D versus
3D) and ﬁeld of view.
In recent years, deep learning (DL) has emerged as a powerful tool for medi-
cal image analysis, demonstrating great performance in a wide range of tasks
[7,9,14, 15]. These models, particularly Convolutional Neural Networks (CNNs),
have shown their ability to learn complex, hierarchical representations from raw
image data, making them a natural choice for multimodal image fusion.
In the quest to enhance DL models, the field has benefited significantly from
regularization techniques such as Manifold Mixup. By generating virtual training
examples through convex combinations of hidden state representations, this
technique reduces a model's sensitivity to the data distribution and
encourages smoother decision boundaries.
Building upon these advanced techniques, several proposed methods in the state
of the art have employed multimodal imaging [10, 17]. These methods aim to
utilize the complementary information available in diﬀerent types of images.
Recent works have effectively used mixing strategies to enhance multimodal
DL models. For example, the M3ixup approach leverages a mixup strategy
to enhance multimodal representation learning and increase robustness against
missing modalities by mixing different modalities and aligning mixed views with
original multimodal representations. The LeMDA (Learning Multimodal Data
Augmentation) method automatically learns to jointly augment multimodal
data in feature space, enhancing the performance of multimodal deep learning
architectures and achieving good results across various applications. MixGen
introduces a joint data augmentation for vision-language representation learning
to boost data efficiency, generating new image-text pairs while preserving semantic
relationships. This method has shown remarkable performance improvements
across various vision-language tasks. Furthermore, TMMDA (Token Mixup
Multimodal Data Augmentation) for Multimodal Sentiment Analysis (MSA)
generates virtual modalities from the mixed token-level representation of raw
modalities, enhancing representation learning on limited labeled datasets.
Despite the significant results obtained, these methods were proposed for
vision-language and vision-audio fusion and are not suited to 2D image/3D volume
fusion. This study proposes a new multimodal DL approach for DR classification,
integrating 2D UWF-CFP images and 3D OCTA images and incorporating
a custom mixing strategy. Regarding the modalities used in this work, recent
research has used UWF-CFP and OCTA imaging for the diagnosis of diseases such
as Alzheimer's disease. However, to the best of our knowledge, our study is the first
to develop a DL model for the classiﬁcation of DR using both UWF-CFP and
OCTA imaging modalities, which contributes signiﬁcantly to the existing body
2.1 Model architecture
We utilize two separate CNN architectures, ResNet50 and 3D-ResNet50, to extract
features from the 2D UWF-CFP images and the 3D OCTA volumes, respectively.
ResNet50 was chosen as a backbone for feature extraction due to its strong
performance in computer vision tasks. Its structure provides a balance between
depth and complexity, allowing the network to learn complex patterns without
suffering from overfitting. To further improve such models' performance,
Squeeze-and-Excitation (SE) blocks have gained attention in the DL community.
As shown in Fig. 1(d), the SE blocks dynamically
recalibrate channel-wise feature responses by explicitly modeling the interde-
pendencies between channels, thus helping the model focus on more informative
features. They have been demonstrated to signiﬁcantly improve the represen-
tational power of deep networks without a signiﬁcant additional computational
The 3D-ResNet50, a 3D extension of the ResNet50 architecture, integrated with
SE blocks, is applied to process OCTA images (Fig.1(a)). This model expands
traditional 2D convolution operations into the 3D space, making it particularly
appropriate for volumetric image data. This enables the network to decipher
spatial hierarchies inherent in volumetric data, thus facilitating a comprehensive
feature extraction from OCTA volumes. SE blocks in the 3D-ResNet50 model
play the same role as in the 2D ResNet50 model, thus enhancing the performance
of the 3D backbone. For the rest of the paper, we will refer to these
models as SE-ResNet50 and 3D-SE-ResNet50.
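The squeeze, excitation, and scaling steps of an SE block can be illustrated with a minimal sketch in plain Python. This is not the authors' PyTorch implementation: the function name `se_recalibrate`, the toy weights, and the omission of bias terms are assumptions made for illustration. Because each channel is treated as a flat list of activations, the same logic applies to 2D feature maps and 3D feature volumes.

```python
import math

def se_recalibrate(features, w1, w2):
    """Squeeze-and-Excitation over C channels, each a flat list of activations.

    w1: (C // r) x C weights of the bottleneck layer (r = reduction ratio).
    w2: C x (C // r) weights of the expansion layer.
    Biases are omitted to keep the sketch short (an assumption).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: expansion layer followed by a sigmoid gate in (0, 1).
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: every activation of a channel is multiplied by that channel's gate.
    scaled = [[g * v for v in ch] for g, ch in zip(gates, features)]
    return scaled, gates

# Toy example: C = 4 channels, reduction ratio r = 2, hand-picked weights.
features = [[1.0, 3.0], [0.0, 0.0], [2.0, 2.0], [4.0, 0.0]]
w1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
w2 = [[1.0, 0.0], [0.0, 0.0], [0.0, -1.0], [0.0, 0.0]]
scaled, gates = se_recalibrate(features, w1, w2)
print([round(g, 3) for g in gates])
```

In this toy setting, channels whose excitation input is zero receive a neutral gate of 0.5, while the first channel is emphasized and the third is attenuated relative to them. A trained network learns `w1` and `w2` rather than using fixed values.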
Fig. 1. Proposed pipeline.
2.2 Fusion strategy
The fusion of multiple modalities has been an area of active research due to the
enhanced performances it oﬀers [2, 13, 28]. Such fusion can be executed at input,
feature, and decision levels, each oﬀering distinct advantages and disadvantages.
In this work, we employ an input-level fusion for merging the structure and ﬂow
information embedded in OCTA images. Numerous studies aﬃrm that merg-
ing these distinct types of information can signiﬁcantly enhance the accuracy of
DR diagnosis [10, 25]. Input-level fusion involves integrating multiple modalities
into a single data tensor that is subsequently processed by a DL model (Fig. 1(a)).
This method is effective without the need for registration, as the structure and
flow data align with each other by design.
On the other hand, the fusion of UWF-CFP and OCTA images is performed
through a diﬀerent approach, primarily due to the absence of inherent align-
ment between these imaging modalities. Here, a feature-level fusion strategy
is adopted, which allows us to use diﬀerent backbones for each modality (SE-
ResNet50 and 3D-SE-ResNet50), thus eﬀectively addressing the alignment chal-
lenge. We have chosen feature-level fusion over decision-level fusion to capitalize
on the rich interplay between the modalities at the feature level. This strategy
facilitates the extraction of features and the fusion of high-dimensional feature-
level information, making it especially suited for unregistered or dimensionally
diverse data [3, 4, 22, 23].
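The two fusion levels can be contrasted with a small plain-Python sketch. It assumes that each backbone output is reduced by global average pooling before concatenation, which the text does not state explicitly; the helper names and toy values are hypothetical. The point it illustrates is that branches with different spatial dimensionality still yield vectors that can be concatenated without registration.

```python
def stack_structure_flow(structure, flow):
    """Input-level fusion: two co-registered volumes become two input channels."""
    if len(structure) != len(flow):
        raise ValueError("structure and flow must have the same number of voxels")
    return [structure, flow]

def global_average_pool(feature_map):
    """Collapse each channel (a flat list of activations) to its mean."""
    return [sum(ch) / len(ch) for ch in feature_map]

def fuse_features(cfp_map, octa_map):
    """Feature-level fusion: pool each branch, then concatenate the vectors."""
    return global_average_pool(cfp_map) + global_average_pool(octa_map)

# Input-level fusion of two toy 4-voxel volumes into a 2-channel input.
octa_input = stack_structure_flow([0.1, 0.2, 0.3, 0.4], [0.0, 1.0, 0.0, 1.0])

# Hypothetical backbone outputs: 3 channels on a 2x2 grid (2D branch) and
# 2 channels on a 2x2x2 grid (3D branch), flattened per channel.
cfp_map = [[2.0, 2.0, 2.0, 2.0], [0.0, 2.0, 0.0, 2.0], [4.0, 0.0, 0.0, 0.0]]
octa_map = [[1.0] * 8, [0.0, 4.0] + [0.0] * 6]
fused = fuse_features(cfp_map, octa_map)
print(fused)  # [2.0, 1.0, 1.0, 1.0, 0.5]
```

The fused vector has one entry per channel of each branch (3 + 2 here), regardless of how many pixels or voxels each branch processed.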
2.3 Manifold Mixup
To enhance the model's robustness and generalization capabilities, we incorporated
a multimodal extension of Manifold Mixup into our training process.
The original Manifold Mixup method is a recently introduced regularization
technique. It generates virtual training examples by forming convex combina-
tions of the hidden state representations of two randomly chosen training exam-
ples and their associated labels.
Extending the concept of Input Mixup to the hidden layers, Manifold Mixup
serves as a regularization method that encourages neural networks to predict
interpolated hidden representations with less confidence. It leverages semantic
interpolations as an auxiliary training signal, leading to neural networks with
smoother decision boundaries across multiple representation levels.
Consequently, neural networks trained with Manifold Mixup can learn class
representations with fewer directions of variance, thus yielding a model that
exhibits enhanced performance on unseen data. The operational process of
the Manifold Mixup approach is as follows:
1. The original Manifold Mixup mixes hidden representations at a layer chosen
randomly from a set of predefined eligible layers. Instead, in our proposed
implementation, we have purposefully selected the layer containing the
concatenated feature maps from UWF-CFP and OCTA images to apply
Manifold Mixup. This choice is not only the simplest way to introduce
Manifold Mixup but also ensures we are capitalizing on a layer that
encapsulates a high-dimensional, multimodal feature space. Creating numerous
virtual training samples from the fusion layer significantly improves the
model's ability to generalize to new data.
2. Feed two training examples into the neural network until the selected layer is reached.
3. Extract the feature representations ($z_i$ for multimodal data $x_i$ and $z_j$ for
multimodal data $x_j$).
4. Mix the extracted feature representations according to Eq. 1 in order to derive
the new representation (new features $z'$ associated with new label $y'$).
$$(z', y') = \big(\lambda z_i + (1 - \lambda) z_j,\ \lambda y_i + (1 - \lambda) y_j\big) \tag{1}$$
where $z_i$ and $z_j$ are the features of two random training examples, and
$y_i$ and $y_j$ are their corresponding labels. $\lambda \in [0, 1]$ is a Mixup coefficient
sampled from a Beta distribution $\mathrm{Beta}(\alpha, \alpha)$, where $\alpha$ is a hyperparameter
that determines the shape of the Beta distribution.
5. Carry out the forward pass in the network for the remaining layers with the
mixed representation $z'$.
6. Use the output of the mixed data to compute the loss and gradients. Given
$L$ the original loss function, the new loss $L'$ is computed as:
$$L' = \lambda L(y_i, y') + (1 - \lambda) L(y_j, y') \tag{2}$$
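The mixing steps above can be sketched in plain Python as follows. This is an illustrative reading rather than the authors' code: Eq. 2 is interpreted here as the λ-weighted sum of the cross-entropy losses between the prediction on the mixed features and each of the two original labels, and the vector `probs` is a hypothetical stand-in for the output of the layers that follow the fusion layer.

```python
import math
import random

def mix_features(z_i, y_i, z_j, y_j, lam):
    """Eq. 1: convex combination of two fused feature vectors and one-hot labels."""
    z_mix = [lam * a + (1.0 - lam) * b for a, b in zip(z_i, z_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return z_mix, y_mix

def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a (possibly soft) target."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)

def mixup_loss(probs, y_i, y_j, lam):
    """Lambda-weighted sum of the losses against the two original labels."""
    return lam * cross_entropy(probs, y_i) + (1.0 - lam) * cross_entropy(probs, y_j)

rng = random.Random(0)               # fixed seed so the sketch is reproducible
alpha = 0.2                          # value reported for the grid search in Sec. 3.2
lam = rng.betavariate(alpha, alpha)  # Mixup coefficient sampled from Beta(alpha, alpha)

# Toy fused features and one-hot labels for two examples (two classes for brevity).
z_i, y_i = [1.0, 0.0, 2.0], [1.0, 0.0]
z_j, y_j = [0.0, 4.0, 2.0], [0.0, 1.0]
z_mix, y_mix = mix_features(z_i, y_i, z_j, y_j, lam)

probs = [0.7, 0.3]  # hypothetical classifier output for z_mix
loss = mixup_loss(probs, y_i, y_j, lam)
```

For cross-entropy, this weighted sum equals the loss computed directly against the soft label `y_mix`, since the loss is linear in the target. With a small α such as 0.2, Beta(α, α) concentrates its mass near 0 and 1, so most mixed samples stay close to one of the two originals.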
Through this process, Manifold Mixup enhances our fusion strategy by op-
erating on the joint feature representation (Fig.1(b)), thereby ensuring that the
model can generalize from the learned features of UWF-CFP and OCTA images.
3 Experiments and Results
3.1 Dataset
The data used in this study come from the "Évaluation Intelligente de la Rétinopathie
diabétique" (EviRed) project, a comprehensive initiative that collected data between
2020 and 2022 from 14 hospitals and recruitment centers across France.
This database included UWF-CFP images and OCTA images from patients at
various stages of DR. The dataset comprised images of 875 eyes belonging to
444 patients and was carefully divided into one (ﬁxed) test set, and multiple
train and validation sets (through 5-fold cross-validation) to ensure a broad
representation and unbiased learning. Each patient’s eye was labeled by an oph-
thalmologist into one of the 6 DR classes: Normal, mild nonproliferative diabetic
retinopathy (NPDR), moderate NPDR, severe NPDR, proliferative DR (PDR),
or Pan-Retinal Photocoagulation (PRP).
The UWF-CFP images in the dataset, captured using the Clarus 500 (Carl
Zeiss Meditec Inc., Dublin, CA, USA), varied in size, ranging from 3900×3900
to 7900×4900 pixels. This size variation arises from the image stitching process
for montage creation, not from changes in the device’s resolution. Considering
the clinicians’ focus on the seven Early Treatment Diabetic Retinopathy Study
(ETDRS) fields, we carried out center cropping on each image to 3584×3584 pixels.
This process ensured that all seven fields were included in the image.
Subsequently, we resized these cropped images to 1024×1024, a size that guarantees
no loss of details.
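As a worked example of the cropping arithmetic, the following sketch computes the centered 3584×3584 window for the smallest and largest image sizes quoted above. The helper name `center_crop_box` is hypothetical, and the sketch assumes the crop is centered on the image midpoint.

```python
def center_crop_box(width, height, crop=3584):
    """Return (left, top, right, bottom) of a centered crop x crop window."""
    if width < crop or height < crop:
        raise ValueError("image is smaller than the crop size")
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop

print(center_crop_box(3900, 3900))  # (158, 158, 3742, 3742)
print(center_crop_box(7900, 4900))  # (2158, 658, 5742, 4242)
print(3584 / 1024)                  # 3.5, the downsampling factor per axis
```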
The high-resolution 6x6 mm³ OCTA images, offering 500×224×500 voxels and
centered on the macula, were captured using the Zeiss PLEX Elite 9000. Each
OCTA acquisition includes a 2D en-face localizer and structural and flow 3D volumes.
Due to the memory limits of the graphics processing unit (32 GB GPU), our
3D-SE-ResNet50 could only accommodate input tensors of up to 224 × 224 ×
224 × 2. This limitation guided our data pre-processing. During training,
we applied random cropping to the OCTA volumes. During prediction, we
extracted N = 10 randomly cropped volumes from each OCTA image, each of
which was processed together with the full UWF-CFP image to make a prediction.
The final prediction for an examination was the most severe of these N
predictions (test-time augmentation).
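The test-time augmentation rule can be sketched in plain Python as below. The sketch assumes that the six classes are indexed in the order listed in the dataset description and that a higher index counts as more severe, which the text does not state explicitly. `dummy_predict` is a hypothetical stand-in for the trained network, which in practice would receive the cropped OCTA volume together with the full UWF-CFP image.

```python
import random

CLASSES = ["Normal", "mild NPDR", "moderate NPDR", "severe NPDR", "PDR", "PRP"]

def random_crop_origin(volume_shape, crop_shape, rng):
    """Pick the corner of a random crop that fits inside the volume."""
    return tuple(rng.randint(0, v - c) for v, c in zip(volume_shape, crop_shape))

def predict_exam(predict_fn, volume_shape, crop_shape, n_crops, rng):
    """Run predict_fn on n_crops random crops and keep the most severe class index."""
    origins = [random_crop_origin(volume_shape, crop_shape, rng)
               for _ in range(n_crops)]
    return max(predict_fn(origin) for origin in origins)

def dummy_predict(origin):
    # Hypothetical stand-in for the network: maps a crop position to a class index.
    return sum(origin) % len(CLASSES)

rng = random.Random(0)
severity = predict_exam(dummy_predict, (500, 224, 500), (224, 224, 224), 10, rng)
print(CLASSES[severity])
```

Because the OCTA volume has only 224 voxels along its second axis, the crop origin on that axis is always 0 and the randomness affects the two 500-voxel axes only.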
3.2 Implementation details
Our models were implemented using the PyTorch deep learning library, and
all experiments were conducted using an NVIDIA Tesla V100s GPU. For UWF-CFP
images, we used the SE-ResNet50 architecture with weights pre-trained on
ImageNet, while for OCTA images, we trained from scratch our implementation
of the 3D-SE-ResNet50 backbone with input-level fusion for structure and flow
volumes. The key enhancements to our model were the incorporation of
SE blocks in both ResNet models and the use of Manifold Mixup on multimodal
features for model regularization. In our implementation, we set the reduction
ratio, a crucial SE hyperparameter, to 16, following the practice from the original
SE network paper. For Mixup, we carried out a grid search on the α parameter,
which determines the Beta(α, α) distribution from which the interpolation
parameter λ is sampled during Manifold Mixup training. This search identified
α = 0.2 as the value that yielded the best model performance. The two models were
trained jointly on the UWF-CFP and OCTA datasets, using a cross-entropy loss
function and an AdamW optimizer. During training, we used a learning rate of
0.001 with the OneCycle scheduler, a decay factor of 0.0001, and a batch size of
4 over 200 epochs.
3.3 Results and discussion
To compare the performance of our proposed method with the individual modal-
ities, we trained standalone models using either UWF-CFP or OCTA images
with the same training settings as described above. This provided a baseline
performance for each modality, against which the performance of the multi-
modal approach was compared. In addition, an ablation study was conducted to
further understand each component’s impact and contribution to our pipeline.
We compared the performance of our model without the Manifold Mixup and
the SE blocks.
The performance of the proposed models was evaluated in terms of the Area
Under the Receiver Operating Characteristic (ROC) Curve (AUC). This metric
was chosen due to its ability to provide an aggregate measure of performance
across the four DR severity cutoffs (≥ mild NPDR, ≥ moderate NPDR, ≥ severe
NPDR, and ≥ PDR).
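To make the cutoff-based evaluation concrete, the following plain-Python sketch computes an AUC for one severity cutoff from six-class probability outputs. How the authors derive a score per cutoff is not stated, so scoring each eye by the summed probability of all classes at or above the cutoff is an assumption, as is the class indexing (0 = Normal up to 5 = PRP). The toy probabilities and grades are invented for illustration.

```python
def binary_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute an AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cutoff_auc(probabilities, grades, cutoff):
    """AUC for 'grade >= cutoff', scoring each eye by its summed probability mass."""
    scores = [sum(p[cutoff:]) for p in probabilities]
    labels = [g >= cutoff for g in grades]
    return binary_auc(scores, labels)

# Toy predictions for four eyes over six classes, with their reference grades.
probabilities = [
    [0.7, 0.2, 0.1, 0.0, 0.0, 0.0],
    [0.3, 0.4, 0.2, 0.1, 0.0, 0.0],
    [0.5, 0.3, 0.1, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.1, 0.2, 0.5, 0.1],
]
grades = [0, 1, 3, 4]

auc_mild = cutoff_auc(probabilities, grades, cutoff=1)      # >= mild NPDR
auc_moderate = cutoff_auc(probabilities, grades, cutoff=2)  # >= moderate NPDR
print(auc_mild, auc_moderate)  # 1.0 0.75
```

The third eye is deliberately under-scored by the toy model, which leaves the ≥ mild NPDR ranking intact but lowers the AUC at the ≥ moderate NPDR cutoff.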
Table 1 presents the performance of the different models: the ResNet50 model
trained on UWF-CFP images, the 3D-ResNet50 model trained on OCTA images,
the proposed multimodal pipeline, the multimodal model without SE, the
pipeline without Manifold Mixup (MM in the table), and the pipeline without
SE and Manifold Mixup.
Data        SE  MM  ≥ mild NPDR  ≥ moderate NPDR  ≥ severe NPDR  ≥ PDR
UWF-CFP     -   -   0.7983       0.7925           0.7906         0.9159
OCTA        -   -   0.8316       0.7627           0.7338         0.7576
Multimodal  ✓   ✓   0.8566       0.8037           0.7922         0.8820
Multimodal  -   ✓   0.8241       0.7969           0.7682         0.8522
Multimodal  ✓   -   0.8431       0.7782           0.7566         0.8420
Multimodal  -   -   0.8140       0.7775           0.7525         0.8164
Table 1. Performance (AUC) of the models in DR classification. SE: Squeeze-and-Excitation blocks; MM: Manifold Mixup.
Our approach, which combines UWF-CFP and OCTA images in a
multimodal pipeline, outperformed the models based on individual modalities
for three of the four DR severity cutoffs. Specifically, the multimodal model
achieved an AUC score of 0.8566 for ≥ mild NPDR, higher than 0.7983
for UWF-CFP alone and 0.8316 for OCTA alone. This trend continued with ≥
moderate NPDR and ≥ severe NPDR, where our multimodal model attained
AUC scores of 0.8037 and 0.7922, respectively, compared to 0.7925 and 0.7906
for UWF-CFP and 0.7627 and 0.7338 for OCTA. These outcomes underscore the
importance of capitalizing on diverse image modalities to provide a more compre-
hensive, holistic analysis, thereby enhancing the robustness and accuracy of DR
classiﬁcation. Our study suggests that each imaging modality captures distinct
aspects of DR, and the concurrent utilization of both modalities in our models
appears to improve the diagnosis, which is aligned with clinical studies [8,24].
The greater success of UWF-CFP alone for the ≥ PDR cutoff (AUC of 0.9159 versus
0.8820 for the multimodal model) can be attributed to its wide-field view of the
retina, which allows for the detection of
peripheral lesions and signs of PRP laser impacts. Conversely, OCTA images
proved to be particularly useful for ≥mild NPDR detection due to their central
focus on the macula and the high-resolution imaging of the microvasculature.
Regarding the added components in our pipeline, the Manifold Mixup and the
SE blocks were both found to enhance the model's performance. For example,
omitting the SE blocks caused a decrease in AUC scores across all DR severity cutoffs.
This indicates the critical role of SE blocks in bolstering feature representations
and overall model robustness. Similarly, when the Manifold Mixup was excluded,
there was a noticeable drop in performance, corroborating the eﬀectiveness of
such a regularization technique in improving model generalization.
4 Conclusion
Our findings demonstrate the efficacy of the proposed multimodal model in
improving DR classification. This model, which integrates UWF-CFP and OCTA
images using a feature-level fusion strategy and employs both our proposed
adaptation of the Manifold Mixup technique and SE blocks, delivers a compelling
performance. The ablation study further attests to the contribution of each
component within our pipeline. These findings underline the value of
multimodal approaches coupled with regularization techniques such as Manifold
Mixup and feature recalibration components such as SE blocks for medical image
classification tasks.
To the best of our knowledge, our study is the ﬁrst to propose a pipeline for
the classiﬁcation of DR using both UWF-CFP and OCTA images. However, we
believe several improvements and extensions could further enhance the classiﬁca-
tion performance. The application of cross-modal attention mechanisms may pro-
vide a more eﬀective way of fusing features from diﬀerent modalities by focusing
on the most relevant information from each. Similarly, implementing Manifold
Mixup at diﬀerent levels of the model, rather than solely at the concatenation
layer, could provide further regularization and performance improvements.
Moreover, introducing novel components, such as Transformer blocks, might prove
beneficial in capturing complex relationships within and across modalities.
This work takes place in the framework of EviRed, an ANR RHU project. This
work benefits from State aid managed by the French National Research Agency
under the "Investissement d'Avenir" program bearing the reference ANR-18-RHUS-
References
1. Early treatment diabetic retinopathy study design and baseline patient character-
istics: Etdrs report number 7. Ophthalmology 98(5, Supplement), 741–756 (1991).
2. Akhavan Aghdam, M., Shariﬁ, A., Pedram, M.M.: Combination of rs-fmri and smri
data to discriminate autism spectrum disorders in young children using deep belief
network. Journal of digital imaging 31, 895–903 (2018)
3. Al-Absi, H.R., Islam, M.T., Refaee, M.A., Chowdhury, M.E., Alam, T.: Cardio-
vascular disease diagnosis from dxa scan and retinal images using deep learning.
Sensors 22(12), 4310 (2022)
4. El-Sappagh, S., Abuhmed, T., Islam, S.R., Kwak, K.S.: Multimodal multitask deep
learning model for alzheimer’s disease progression detection based on time series
data. Neurocomputing 412, 197–215 (2020)
5. Hao, X., Zhu, Y., Appalaraju, S., Zhang, A., Zhang, W., Li, B., Li, M.: Mixgen: A
new multi-modal data augmentation (2023)
6. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 7132–7141 (2018).
7. Lahsaini, I., El Habib Daho, M., Chikh, M.A.: Deep transfer learning based classi-
ﬁcation model for covid-19 using chest ct-scans. Pattern Recognition Letters 152,
122–128 (2021). https://doi.org/10.1016/j.patrec.2021.08.035
8. Li, J., Wei, D., Mao, M., Li, M., Liu, S., Li, F., Chen, L., Liu, M., Leng, H., Wang,
Y., Ning, X., Liu, Y., Dong, W., Zhong, J.: Ultra-wideﬁeld color fundus photogra-
phy combined with high-speed ultra-wideﬁeld swept-source optical coherence to-
mography angiography for non-invasive detection of lesions in diabetic retinopathy.
Frontiers in Public Health 10 (2022). https://doi.org/10.3389/fpubh.2022.1047608
9. Li, T., Bo, W., Hu, C., Kang, H., Liu, H., Wang, K., Fu, H.: Applications of deep
learning in fundus images: A review (2021), https://arxiv.org/abs/2101.09864
10. Li, Y., El Habib Daho, M., Conze, P.H., Al Hajj, H., Bonnin, S., Ren, H.,
Manivannan, N., Magazzeni, S., Tadayoni, R., Cochener, B., Lamard, M., Quel-
lec, G.: Multimodal information fusion for glaucoma and diabetic retinopathy
classiﬁcation. In: Ophthalmic Medical Image Analysis. pp. 53–62. Cham (2022).
11. Lin, R., Hu, H.: Adapt and explore: Multimodal mixup for representation learning.
Available at SSRN (2023). https://doi.org/10.2139/ssrn.4461697
12. Liu, Z., Tang, Z., Shi, X., Zhang, A., Li, M., Shrivastava, A., Wilson, A.G.: Learning
multimodal data augmentation in feature space (2023)
13. Qian, X., Zhang, B., Liu, S., Wang, Y., Chen, X., Liu, J., Yang, Y., Chen, X., Wei,
Y., Xiao, Q., et al.: A combined ultrasonic b-mode and color doppler system for
the classiﬁcation of breast masses using neural network. European Radiology 30,
14. Quellec, G., Al Hajj, H., Lamard, M., Conze, P.H., Massin, P., Cochener, B.: Ex-
plain: Explanatory artiﬁcial intelligence for diabetic retinopathy diagnosis. Medical
Image Analysis 72, 102118 (2021). https://doi.org/10.1016/j.media.2021.102118
15. Shamshad, F., Khan, S., Zamir, S.W., Khan, M.H., Hayat, M., Khan, F.S., Fu, H.:
Transformers in medical imaging: A survey. Medical Image Analysis 88, 102802
16. Silva, P.S., Dela Cruz, A.J., Ledesma, M.G., van Hemert, J., Rad-
wan, A., Cavallerano, J.D., Aiello, L.M., Sun, J.K., Aiello, L.P.: Diabetic
retinopathy severity and peripheral lesions are associated with nonperfusion
on ultrawide ﬁeld angiography. Ophthalmology 122(12), 2465–2472 (2015).
17. Sleeman, W.C., Kapoor, R., Ghosh, P.: Multimodal classiﬁcation: Current land-
scape, taxonomy and future directions. ACM Comput. Surv. 55(7) (dec 2022).
18. Sun, Z., Yang, D., Tang, Z., et al.: Optical coherence tomography angiogra-
phy in diabetic retinopathy: an updated review. Eye 35(11), 149–161 (2021).
19. Teo, Z.L., Tham, Y.C., Yu, M., Chee, M.L., Rim, T.H., Cheung, N., Bikbov, M.M.,
Wang, Y.X., Tang, Y., Lu, Y., et al.: Global prevalence of diabetic retinopathy and
projection of burden through 2045: systematic review and meta-analysis. Ophthal-
mology 128(11), 1580–1591 (2021)
20. Verma, V., Lamb, A., Beckham, C., Najaﬁ, A., Mitliagkas, I., Courville, A., Lopez-
Paz, D., Bengio, Y.: Manifold mixup: Better representations by interpolating hid-
den states (2019)
21. Wisely, C.E., Wang, D., Henao, R., Grewal, D.S., Thompson, A.C., Robbins, C.B.,
Yoon, S.P., Soundararajan, S., Polascik, B.W., Burke, J.R., Liu, A., Carin, L.,
Fekrat, S.: Convolutional neural network to identify symptomatic alzheimer’s dis-
ease using multimodal retinal imaging. British Journal of Ophthalmology 106(3),
388–395 (2022). https://doi.org/10.1136/bjophthalmol-2020-317659
22. Wu, J., Fang, H., Li, F., Fu, H., Lin, F., Li, J., Huang, L., Yu, Q., Song, S., Xu,
X., et al.: Gamma challenge: glaucoma grading from multi-modality images. arXiv
preprint arXiv:2202.06511 (2022)
23. Xiong, J., Li, F., Song, D., Tang, G., He, J., Gao, K., Zhang, H., Cheng, W., Song,
Y., Lin, F., et al.: Multimodal machine learning using visual ﬁelds and peripapillary
circular oct scans in detection of glaucomatous optic neuropathy. Ophthalmology
129(2), 171–180 (2022)
24. Yang, J., Zhang, B., Wang, E., et al.: Ultra-wide ﬁeld swept-source op-
tical coherence tomography angiography in patients with diabetes without
clinically detectable retinopathy. BMC Ophthalmology 21(1), 192 (2021).
25. Zang, P., Hormel, T.T., Wang, X., Tsuboi, K., Huang, D., Hwang, T.S., Jia, Y.:
A diabetic retinopathy classiﬁcation framework based on deep-learning analysis of
oct angiography. Translational Vision Science & Technology 11(7), 10–10 (2022)
26. Zhang, H., Ciss´e, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical
risk minimization. CoRR abs/1710.09412 (2017), http://arxiv.org/abs/1710.
27. Zhao, X., Chen, Y., Liu, S., Zang, X., Xiang, Y., Tang, B.: Tmmda: A new token
mixup multimodal data augmentation for multimodal sentiment analysis. In: Pro-
ceedings of the ACM Web Conference 2023. p. 1714–1722. WWW ’23, Association
for Computing Machinery (2023). https://doi.org/10.1145/3543507.3583406
28. Zong, W., Lee, J.K., Liu, C., Carver, E.N., Feldman, A.M., Janic, e.a.: A deep dive
into understanding tumor foci classiﬁcation using multiparametric mri based on
convolutional neural network. Medical physics 47(9), 4077–4086 (2020)