
Improved Automatic Diabetic Retinopathy Severity Classification Using Deep Multimodal Fusion of UWF-CFP and OCTA Images



Diabetic Retinopathy (DR), a prevalent and severe complication of diabetes, affects millions of individuals globally, underscoring the need for accurate and timely diagnosis. Recent advancements in imaging technologies, such as Ultra-WideField Color Fundus Photography (UWF-CFP) imaging and Optical Coherence Tomography Angiography (OCTA), provide opportunities for the early detection of DR but also pose significant challenges given the disparate nature of the data they produce. This study introduces a novel multimodal approach that leverages these imaging modalities to notably enhance DR classification. Our approach integrates 2D UWF-CFP images and 3D high-resolution 6\(\,\times \,\)6 mm\(^3\) OCTA (both structure and flow) images using a fusion of ResNet50 and 3D-ResNet50 models, with Squeeze-and-Excitation (SE) blocks to amplify relevant features. Additionally, to increase the model’s generalization capabilities, a multimodal extension of Manifold Mixup, applied to concatenated multimodal features, is implemented. Experimental results demonstrate a remarkable enhancement in DR classification performance with the proposed multimodal approach compared to methods relying on a single modality only. The methodology laid out in this work holds substantial promise for facilitating more accurate, early detection of DR, potentially improving clinical outcomes for patients.
Mostafa El Habib Daho 1,2, Yihao Li 1,2, Rachid Zeghlache 1,2, Yapo Cedric Atse 1,2, Hugo Le Boité 3,4, Sophie Bonnin 5, Deborah Cosette 6, Pierre Deman 7,8, Laurent Borderie 8, Capucine Lepicard 9, Ramin Tadayoni 3,5, Béatrice Cochener 1,2,10, Pierre-Henri Conze 11,2, Mathieu Lamard 1,2, and Gwenolé Quellec 2
1 Univ Bretagne Occidentale, Brest, France
2 LaTIM UMR 1101, Inserm, Brest, France
3 Ophthalmology Department, Lariboisière Hospital, APHP, Paris, France
4 Paris Cité University, Paris, France
5 Ophthalmology Department, Rothschild Foundation Hospital, Paris, France
6 Carl Zeiss Meditec Inc, Dublin, CA, United States
7 ADCIS, Saint-Contest, F-14280 France
8 Evolucare Technologies, Le Pecq, F-78230 France
9 AP-HP, Paris, France
10 Ophthalmology Department, CHRU Brest, Brest, France
11 IMT Atlantique, Brest, France
Keywords: Diabetic Retinopathy Classification · Multimodal Information Fusion · Deep learning · UWF-CFP · OCTA
arXiv:2310.01912v1 [eess.IV] 3 Oct 2023
1 Introduction
Diabetic Retinopathy (DR), a common ocular complication of diabetes, is a leading cause of blindness globally [19]. The disease is characterized by progressive damage to the retina due to prolonged hyperglycemia and is estimated to affect approximately one-third of all people with diabetes. Timely and accurate diagnosis of DR is therefore crucial for effective management and treatment. However, the subtle and complex nature of the disease’s early stages makes such a diagnosis challenging.
Recent advances in imaging techniques have significantly enhanced the ability to detect and classify DR. Ultra-WideField Color Fundus Photography (UWF-CFP) imaging and Optical Coherence Tomography Angiography (OCTA) are two such techniques that have shown great promise. UWF-CFP imaging offers a panoramic view of the retina, allowing for a more comprehensive assessment [16], while OCTA provides depth-resolved images of retinal blood flow, revealing detailed microvascular changes indicative of DR [18].
Each of these imaging modalities offers a distinct perspective on retinal pathology, and leveraging the information from both could enhance the diagnosis and classification of DR [8, 24]. However, integrating them is challenging because the data they produce differ substantially, especially in dimensionality (2D versus 3D) and field of view.
In recent years, deep learning (DL) has emerged as a powerful tool for medical image analysis, demonstrating strong performance in a wide range of tasks [7, 9, 14, 15]. DL models, particularly Convolutional Neural Networks (CNNs), can learn complex, hierarchical representations from raw image data, making them a natural choice for multimodal image fusion.
Regularization techniques such as Manifold Mixup [20] have further improved DL models. By generating virtual training examples as convex combinations of hidden state representations, Manifold Mixup reduces a model’s sensitivity to the training data distribution and encourages smoother decision boundaries.
Building upon these techniques, several state-of-the-art methods have employed multimodal imaging [10, 17] to exploit the complementary information available in different types of images. Recent works have also used mixing strategies to enhance multimodal DL models. For example, the M3ixup approach [11] leverages a mixup strategy to enhance multimodal representation learning and increase robustness against missing modalities by mixing different modalities and aligning mixed views with original multimodal representations. The LeMDA (Learning Multimodal Data Augmentation) method [12] automatically learns to jointly augment multimodal data in feature space, improving the performance of multimodal deep learning architectures across various applications. MixGen [5] introduces a joint data augmentation for vision-language representation learning to boost data efficiency, generating new image-text pairs while preserving semantic relationships; it has shown performance improvements across various vision-language tasks. Furthermore, TMMDA (Token Mixup Multimodal Data Augmentation) [27] for Multimodal Sentiment Analysis (MSA) generates virtual modalities from the mixed token-level representation of raw modalities, enhancing representation learning on limited labeled datasets.
Despite their significant results, these methods were proposed for vision-language and vision-audio fusion and are not suited to 2D image/3D volume fusion. This study proposes a new multimodal DL approach for DR classification that integrates 2D UWF-CFP images and 3D OCTA images and incorporates a custom mixing strategy. Regarding the modalities used in this work, recent research has combined UWF-CFP and OCTA imaging for the diagnosis of diseases such as Alzheimer’s disease [21]. However, to the best of our knowledge, our study is the first to develop a DL model for the classification of DR using both UWF-CFP and OCTA imaging modalities.
2 Methods
2.1 Model architecture
We use two separate CNN backbones to extract features from each imaging modality: ResNet50 for 2D UWF-CFP images and 3D-ResNet50 for 3D OCTA images. ResNet50 was chosen as a backbone for feature extraction due to its strong performance in computer vision tasks. Its structure provides a balance between depth and complexity, allowing the network to learn complex patterns while limiting overfitting. To further improve such models’ performance, Squeeze-and-Excitation (SE) blocks have gained attention in the DL community [6]. As shown in Fig. 1(d), the SE blocks dynamically recalibrate channel-wise feature responses by explicitly modeling the interdependencies between channels, thus helping the model focus on more informative features. They have been shown to significantly improve the representational power of deep networks without a significant additional computational cost.
The 3D-ResNet50, a 3D extension of the ResNet50 architecture, integrated with SE blocks, is applied to process OCTA images (Fig. 1(a)). This model extends traditional 2D convolution operations to 3D space, making it well suited to volumetric image data. This enables the network to capture the spatial hierarchies inherent in volumetric data, thus facilitating comprehensive feature extraction from OCTA volumes. SE blocks in the 3D-ResNet50 model play the same role as in the 2D ResNet50 model, thus enhancing the performance of the 3D backbone. For the rest of the paper, we will refer to these models as SE-ResNet50 and 3D-SE-ResNet50.
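As an illustration of the channel recalibration performed by an SE block, the following is a minimal NumPy sketch of its forward pass. It is not the authors' implementation: the function name `se_block`, the random weights, and the toy feature-map sizes are assumptions, and biases are omitted. The same function handles 2D and 3D feature maps because the squeeze step simply averages over all spatial axes.

```python
import numpy as np


def se_block(features, w_reduce, w_expand):
    """Apply a Squeeze-and-Excitation gate to a (C, *spatial) feature map.

    features: array of shape (C, H, W) or (C, D, H, W).
    w_reduce: (C // r, C) weights of the bottleneck layer (r = reduction ratio).
    w_expand: (C, C // r) weights of the layer restoring C channels.
    Biases are omitted to keep the sketch short.
    """
    channels = features.shape[0]
    # Squeeze: global average pooling over every spatial axis, one value per channel.
    squeezed = features.reshape(channels, -1).mean(axis=1)
    # Excitation: bottleneck MLP (ReLU, then sigmoid), one gate in (0, 1) per channel.
    hidden = np.maximum(w_reduce @ squeezed, 0.0)
    gates = 1.0 / (1.0 + np.exp(-(w_expand @ hidden)))
    # Recalibration: rescale each channel by its gate, broadcasting over spatial axes.
    gate_shape = (channels,) + (1,) * (features.ndim - 1)
    return features * gates.reshape(gate_shape)


rng = np.random.default_rng(0)
channels, reduction = 64, 16                      # toy channel count, reduction ratio 16
w_reduce = rng.normal(scale=0.1, size=(channels // reduction, channels))
w_expand = rng.normal(scale=0.1, size=(channels, channels // reduction))

fmap_2d = rng.normal(size=(channels, 8, 8))       # stand-in for a 2D feature map
fmap_3d = rng.normal(size=(channels, 4, 8, 8))    # stand-in for a 3D feature map
out_2d = se_block(fmap_2d, w_reduce, w_expand)
out_3d = se_block(fmap_3d, w_reduce, w_expand)
```

Because every gate lies strictly between 0 and 1, the block can only attenuate channels relative to one another; in a trained network the two weight matrices are learned jointly with the backbone.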
Fig. 1. Proposed pipeline.
2.2 Fusion strategy
The fusion of multiple modalities has been an area of active research due to the performance gains it offers [2, 13, 28]. Such fusion can be executed at the input, feature, or decision level, each offering distinct advantages and disadvantages. In this work, we employ input-level fusion to merge the structure and flow information embedded in OCTA images. Several studies report that merging these distinct types of information can significantly enhance the accuracy of DR diagnosis [10, 25]. Input-level fusion integrates multiple modalities into a single data tensor that is subsequently processed by a DL model (Fig. 1(a)). This method requires no registration here, as the structure and flow data are aligned with each other by design.
The fusion of UWF-CFP and OCTA images, on the other hand, follows a different approach, primarily because these imaging modalities have no inherent alignment. Here, a feature-level fusion strategy is adopted, which allows us to use a different backbone for each modality (SE-ResNet50 and 3D-SE-ResNet50), thus effectively addressing the alignment challenge. We chose feature-level fusion over decision-level fusion to capitalize on the rich interplay between the modalities at the feature level. This strategy facilitates the extraction of features and the fusion of high-dimensional feature-level information, making it especially suited for unregistered or dimensionally diverse data [3, 4, 22, 23].
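The difference between the two fusion levels can be sketched with NumPy arrays. This is only an illustration under stated assumptions: `toy_encoder` is a hypothetical stand-in for the CNN backbones, and all array sizes and weights are toy values rather than those of the actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Input-level fusion: structure and flow OCTA volumes share one voxel grid,
# so they can be stacked as two channels of a single tensor.
structure = rng.random((16, 16, 16))   # toy stand-in for a cropped structure volume
flow = rng.random((16, 16, 16))        # toy stand-in for the matching flow volume
octa_input = np.stack([structure, flow], axis=0)   # shape (2, D, H, W)


def toy_encoder(x, weights):
    """Hypothetical stand-in for a CNN backbone: pool each channel, then project."""
    pooled = x.reshape(x.shape[0], -1).mean(axis=1)   # one value per input channel
    return np.maximum(weights @ pooled, 0.0)          # non-negative feature vector


# Feature-level fusion: each modality has its own encoder; only the resulting
# feature vectors are concatenated, so no pixel-wise registration is needed.
cfp_image = rng.random((3, 32, 32))    # toy stand-in for an RGB UWF-CFP image
w_cfp = rng.normal(size=(8, 3))        # toy weights of the 2D branch
w_octa = rng.normal(size=(8, 2))       # toy weights of the 3D branch
z_cfp = toy_encoder(cfp_image, w_cfp)
z_octa = toy_encoder(octa_input, w_octa)
z_fused = np.concatenate([z_cfp, z_octa])   # joint multimodal feature vector
```

The concatenated vector `z_fused` plays the role of the fusion layer: everything downstream of it sees both modalities at once, while everything upstream remains modality-specific.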
2.3 Manifold Mixup
To enhance the model’s robustness and generalization capabilities, we integrated a multimodal extension of Manifold Mixup into our training process. The original Manifold Mixup method [20] is a regularization technique that generates virtual training examples by forming convex combinations of the hidden state representations of two randomly chosen training examples and their associated labels.
Extending the concept of Input Mixup [26] to the hidden layers, Manifold Mixup serves as a robust regularization method that encourages neural networks to predict interpolated hidden representations with less confidence. It leverages semantic interpolations as an auxiliary training signal, leading to neural networks with smoother decision boundaries across multiple representation levels. Consequently, neural networks trained with Manifold Mixup can learn class representations with fewer directions of variance, thus yielding a model that exhibits enhanced performance on unseen data [20]. The operational process of the Manifold Mixup approach is as follows:
1. The original Manifold Mixup mixes the hidden representations at a layer chosen randomly from a set of predefined eligible layers. In our implementation, we instead always apply Manifold Mixup at the layer containing the concatenated feature maps from UWF-CFP and OCTA images. This choice is not only the simplest way to introduce Manifold Mixup but also ensures that we capitalize on a layer that encapsulates a high-dimensional, multimodal feature space. Creating numerous virtual training samples from the fusion layer improves the model’s ability to generalize to new data.
2. Feed two multimodal samples through the neural network until the selected layer is reached.
3. Extract the feature representations (\(z_i\) for multimodal data \(x_i\) and \(z_j\) for multimodal data \(x_j\)).
4. Mix the extracted feature representations according to Eq. 1 in order to derive the new representation (new features \(\tilde{z}\) associated with new label \(\tilde{y}\)):
\[(\tilde{z}, \tilde{y}) = \left(\lambda z_i + (1-\lambda) z_j,\ \lambda y_i + (1-\lambda) y_j\right) \tag{1}\]
where \(z_i\) and \(z_j\) are the features of two random training examples, and \(y_i\) and \(y_j\) are their corresponding labels. \(\lambda \in [0,1]\) is a Mixup coefficient sampled from a Beta distribution Beta(\(\alpha\), \(\alpha\)), where \(\alpha\) is a hyperparameter that determines the shape of the Beta distribution.
5. Carry out the forward pass in the network for the remaining layers with the mixed data.
6. Use the network output \(\hat{y}\) for the mixed data to compute the loss and gradients. Given \(L\) the original loss function, the new loss \(\tilde{L}\) is computed as:
\[\tilde{L} = \lambda L(y_i, \hat{y}) + (1-\lambda) L(y_j, \hat{y}) \tag{2}\]
Through this process, Manifold Mixup enhances our fusion strategy by operating on the joint feature representation (Fig. 1(b)), thereby encouraging the model to generalize from the learned features of UWF-CFP and OCTA images.
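The mixing and loss steps listed above can be sketched in NumPy as follows. This is a hedged illustration, not the authors' code: the helper names, the toy feature dimension, the random linear head standing in for the remaining layers, and the use of a single \(\lambda\) per batch with a random permutation to pick partners are all assumptions.

```python
import numpy as np


def mixup_fused_features(z, y_onehot, alpha, rng):
    """Mix a batch of fused feature vectors and their one-hot labels (Eq. 1)."""
    lam = rng.beta(alpha, alpha)                 # lambda ~ Beta(alpha, alpha)
    perm = rng.permutation(len(z))               # random partner j for every sample i
    z_mix = lam * z + (1.0 - lam) * z[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return z_mix, y_mix, lam, perm


def cross_entropy(logits, targets):
    """Mean cross-entropy between logits and (possibly soft) target distributions."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())


rng = np.random.default_rng(0)
batch, feat_dim, n_classes = 4, 16, 6            # toy sizes; 6 DR classes
z = rng.normal(size=(batch, feat_dim))           # toy fused (UWF-CFP + OCTA) features
labels = rng.integers(0, n_classes, size=batch)
y = np.eye(n_classes)[labels]

z_mix, y_mix, lam, perm = mixup_fused_features(z, y, alpha=0.2, rng=rng)

w_head = rng.normal(size=(feat_dim, n_classes))  # toy head standing in for the remaining layers
logits = z_mix @ w_head

# Eq. 2: weight the loss against each original label by lambda and 1 - lambda.
loss = lam * cross_entropy(logits, y) + (1.0 - lam) * cross_entropy(logits, y[perm])
```

With a cross-entropy loss, Eq. 2 gives the same value as the cross-entropy against the mixed label `y_mix`, since cross-entropy is linear in its target; the two formulations are therefore interchangeable in this setting.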
3 Experiments and Results
3.1 Dataset
The data used in this study arise from the “Évaluation Intelligente de la Rétinopathie diabétique” (EviRed) project, a comprehensive initiative that collected data between 2020 and 2022 from 14 hospitals and recruitment centers across France. This database included UWF-CFP images and OCTA images from patients at various stages of DR. The dataset comprised images of 875 eyes belonging to 444 patients and was divided into one (fixed) test set and multiple train and validation sets (through 5-fold cross-validation) to ensure a broad representation and unbiased learning. Each eye was labeled by an ophthalmologist into one of 6 DR classes: Normal, mild nonproliferative diabetic retinopathy (NPDR), moderate NPDR, severe NPDR, proliferative DR (PDR), or Pan-Retinal Photocoagulation (PRP).
The UWF-CFP images in the dataset, captured using the Clarus 500 (Carl Zeiss Meditec Inc., Dublin, CA, USA), varied in size from 3900×3900 to 7900×4900 pixels. This size variation arises from the image stitching process used for montage creation, not from changes in the device’s resolution. Considering the clinicians’ focus on the seven Early Treatment Diabetic Retinopathy Study (ETDRS) fields [1], we center-cropped each image to 3584×3584 pixels, which ensured that all seven fields were included in the image. Subsequently, we resized these cropped images to 1024×1024 pixels, a size chosen to avoid losing relevant details.
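The crop-then-resize step can be sketched as follows. The crop and output sizes come from the text; the function names and the nearest-neighbour interpolation are assumptions, since the interpolation method is not specified.

```python
import numpy as np


def center_crop(image, size):
    """Return the central size x size window of an (H, W, C) image."""
    height, width = image.shape[:2]
    top = (height - size) // 2
    left = (width - size) // 2
    return image[top:top + size, left:left + size]


def resize_nearest(image, size):
    """Resize an (H, W, C) image to size x size by nearest-neighbour sampling."""
    rows = np.arange(size) * image.shape[0] // size
    cols = np.arange(size) * image.shape[1] // size
    return image[rows[:, None], cols[None, :]]


montage = np.zeros((3900, 3900, 3), dtype=np.uint8)   # smallest montage size reported
montage[1950, 1950] = 255                              # mark the image centre
cropped = center_crop(montage, 3584)                   # keeps the central region
resized = resize_nearest(cropped, 1024)                # network input size
```

Cropping before resizing keeps the downscaling factor fixed at 3.5 for every image, regardless of the original montage size.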
The high-resolution 6×6 mm\(^3\) OCTA images, offering 500×224×500 voxels and centered on the macula, were captured using the Zeiss PLEX Elite 9000. Each OCTA acquisition includes a 2D en-face localizer and 3D structural and flow volumes. Due to the memory limit of the graphics processing unit (32 GB GPU), our 3D-SE-ResNet50 could only accommodate input tensors of up to 224×224×224×2. This limitation guided our data pre-processing. During training, we applied random cropping. During prediction, we extracted N = 10 random crops from the OCTA image, each of which was processed together with the full UWF-CFP image to make a prediction. The final prediction for an examination was the most severe of these N predictions (test-time augmentation).
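The test-time augmentation described above can be sketched as follows. This is an illustration only: `toy_model` is a hypothetical classifier, the volume is a scaled-down stand-in, and the sketch assumes that class indices are ordered by increasing severity so that the most severe prediction is the maximum.

```python
import numpy as np


def random_crop(volume, crop_shape, rng):
    """Cut a random (d, h, w) block out of a (C, D, H, W) volume."""
    starts = [int(rng.integers(0, full - crop + 1))
              for full, crop in zip(volume.shape[1:], crop_shape)]
    window = tuple(slice(s, s + c) for s, c in zip(starts, crop_shape))
    return volume[(slice(None),) + window]


def predict_examination(octa, cfp, model, crop_shape, n_crops, rng):
    """Predict one grade per random OCTA crop and keep the most severe one."""
    grades = [model(cfp, random_crop(octa, crop_shape, rng)) for _ in range(n_crops)]
    return max(grades)


def toy_model(cfp, octa_crop):
    """Hypothetical classifier: grade 3 if the crop contains the bright voxel, else 1."""
    return 3 if octa_crop.max() > 0.5 else 1


rng = np.random.default_rng(0)
octa = np.zeros((2, 50, 22, 50))       # toy stand-in for a 2 x 500 x 224 x 500 volume
octa[1, 25, 11, 25] = 1.0              # a single "lesion" voxel in the flow channel
cfp = np.zeros((3, 64, 64))            # toy stand-in for the full UWF-CFP image
grade = predict_examination(octa, cfp, toy_model, (22, 22, 22), n_crops=10, rng=rng)
```

Taking the maximum rather than the mean favours sensitivity: a lesion visible in only one crop is enough to raise the grade assigned to the whole examination.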
3.2 Implementation details
Our models were implemented using the PyTorch deep learning library, and all experiments were conducted on an NVIDIA Tesla V100s GPU. For UWF-CFP images, we used the SE-ResNet50 architecture with weights pre-trained on ImageNet, while for OCTA images, we trained our implementation of the 3D-SE-ResNet50 backbone from scratch, with input-level fusion of the structure and flow volumes. The key enhancements to our model were the SE blocks incorporated in both ResNet models and Manifold Mixup applied to the multimodal features for regularization. We set the reduction ratio, a crucial SE hyperparameter, to 16, following the original SE network paper [6]. For Mixup, we carried out a grid search over the \(\alpha\) parameter, which defines the Beta distribution Beta(\(\alpha\), \(\alpha\)) from which the interpolation parameter \(\lambda\) is sampled during Manifold Mixup training. This search identified 0.2 as the optimal value for \(\alpha\), which yielded the best model performance. The two backbones were trained jointly on the UWF-CFP and OCTA datasets, using a cross-entropy loss function and an AdamW optimizer. During training, we used a learning rate of 0.001 with the OneCycle scheduler, a decay factor of 0.0001, and a batch size of 4 over 200 epochs.
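To make the learning-rate policy concrete, here is a plain-Python sketch of a one-cycle schedule. Only the peak learning rate (0.001) and the number of epochs (200) come from the text. The warm-up fraction, the initial and final divisors, and the cosine annealing are assumptions that mirror the defaults of PyTorch's `OneCycleLR`, and `steps_per_epoch` is a hypothetical value; the function illustrates the policy and is not a reimplementation of the library scheduler.

```python
import math


def one_cycle_lr(step, total_steps, max_lr=1e-3, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` for a cosine-annealed one-cycle schedule."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_steps = pct_start * (total_steps - 1)

    def cosine(start, end, fraction):
        # Cosine interpolation from `start` (fraction 0) to `end` (fraction 1).
        return end + (start - end) * (1.0 + math.cos(math.pi * fraction)) / 2.0

    if step <= warmup_steps:
        return cosine(initial_lr, max_lr, step / warmup_steps)
    decay_fraction = (step - warmup_steps) / (total_steps - 1 - warmup_steps)
    return cosine(max_lr, min_lr, decay_fraction)


epochs, steps_per_epoch = 200, 100     # steps_per_epoch is a hypothetical value
total_steps = epochs * steps_per_epoch
schedule = [one_cycle_lr(step, total_steps) for step in range(total_steps)]
```

Under these assumptions the rate rises from 0.00004 to the 0.001 peak during the first 30% of training and then decays towards zero for the remainder.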
3.3 Results and discussion
To compare the performance of our proposed method with the individual modalities, we trained standalone models using either UWF-CFP or OCTA images with the same training settings as described above. This provided a baseline performance for each modality, against which the multimodal approach was compared. In addition, an ablation study was conducted to assess each component’s contribution to our pipeline: we evaluated our model without Manifold Mixup, without the SE blocks, and without both.
The performance of the proposed models was evaluated in terms of the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC). This metric was chosen due to its ability to provide an aggregate measure of performance across the four DR severity cutoffs (mild NPDR, moderate NPDR, severe NPDR, and PDR).
Table 1 presents the performance of the different models: the ResNet50 model trained on UWF-CFP images, the 3D-ResNet50 model trained on OCTA images, the proposed multimodal pipeline, the multimodal pipeline without SE blocks, the pipeline without Manifold Mixup (MM in the table), and the pipeline without SE blocks and Manifold Mixup.
Data         SE    MM    mild NPDR   moderate NPDR   severe NPDR   PDR
UWF-CFP                  0.7983      0.7925          0.7906        0.9159
OCTA                     0.8316      0.7627          0.7338        0.7576
Multimodal   yes   yes   0.8566      0.8037          0.7922        0.8820
Multimodal   no    yes   0.8241      0.7969          0.7682        0.8522
Multimodal   yes   no    0.8431      0.7782          0.7566        0.8420
Multimodal   no    no    0.8140      0.7775          0.7525        0.8164
Table 1. Performance of Models in DR Classification (AUC per severity cutoff; SE: Squeeze-and-Excitation blocks, MM: Manifold Mixup)
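A per-cutoff AUC of the kind reported in Table 1 can be computed as in the sketch below. Several assumptions are made for illustration: the six classes are indexed by increasing severity (0 for Normal up to 5 for PRP), an eye is positive for a cutoff when its grade is at or above that cutoff, each eye is scored by the predicted probability mass at or above the cutoff, and the toy probabilities are invented.

```python
def binary_auc(scores, positives):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, p in zip(scores, positives) if p]
    neg = [s for s, p in zip(scores, positives) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cutoff_auc(class_probs, grades, cutoff):
    """AUC for detecting grade >= cutoff from per-class probabilities."""
    scores = [sum(probs[cutoff:]) for probs in class_probs]
    positives = [grade >= cutoff for grade in grades]
    return binary_auc(scores, positives)


# Toy predictions for four eyes over the six classes (hypothetical values).
class_probs = [
    [0.7, 0.2, 0.1, 0.0, 0.0, 0.0],
    [0.2, 0.6, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.2, 0.6, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.2, 0.6, 0.1],
]
grades = [0, 1, 3, 4]                  # ground-truth grade index of each eye
cutoff_names = {1: "mild NPDR", 2: "moderate NPDR", 3: "severe NPDR", 4: "PDR"}
aucs = {name: cutoff_auc(class_probs, grades, k) for k, name in cutoff_names.items()}
```

The pairwise definition used in `binary_auc` is quadratic in the number of eyes, which is acceptable for a sketch; a rank-based implementation would be preferable for large test sets.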
Our approach, which combines UWF-CFP and OCTA images in a multimodal pipeline, outperformed the models based on individual modalities on three of the four severity cutoffs. Specifically, the multimodal model achieved an AUC of 0.8566 for mild NPDR, higher than 0.7983 for UWF-CFP alone and 0.8316 for OCTA alone. This trend continued with moderate NPDR and severe NPDR, where our multimodal model attained AUCs of 0.8037 and 0.7922, respectively, compared to 0.7925 and 0.7906 for UWF-CFP and 0.7627 and 0.7338 for OCTA. These outcomes underscore the value of combining diverse image modalities to provide a more comprehensive analysis, thereby enhancing the robustness and accuracy of DR classification. Our study suggests that each imaging modality captures distinct aspects of DR, and the concurrent use of both modalities in our models appears to improve the diagnosis, in line with clinical studies [8, 24].
The greater success of UWF-CFP alone at the PDR cutoff (AUC of 0.9159, versus 0.8820 for the multimodal model) can be attributed to its wide-field view of the retina, which allows for the detection of peripheral lesions and signs of PRP laser impacts. Conversely, OCTA images proved to be particularly useful for mild NPDR detection due to their central focus on the macula and the high-resolution imaging of the microvasculature.
Regarding the added components in our pipeline, both Manifold Mixup and the SE blocks were found to enhance the model’s performance. For example, omitting the SE blocks decreased the AUC at all four DR severity cutoffs, which indicates the important role of SE blocks in strengthening feature representations and overall model robustness. Similarly, excluding Manifold Mixup led to a noticeable drop in performance, corroborating the effectiveness of this regularization technique in improving model generalization.
4 Conclusion
Our findings demonstrate the efficacy of the proposed multimodal model in improving DR classification. This model, which integrates UWF-CFP and OCTA images using a feature-level fusion strategy and employs both our proposed adaptation of the Manifold Mixup technique and SE blocks, delivers a compelling performance. The ablation study further attests to the significance of each component within our pipeline. These findings underline the value of multimodal approaches coupled with regularization and channel-attention techniques, such as Manifold Mixup and SE blocks, for medical image classification tasks.
To the best of our knowledge, our study is the first to propose a pipeline for the classification of DR using both UWF-CFP and OCTA images. However, we believe several improvements and extensions could further enhance the classification performance. Cross-modal attention mechanisms may provide a more effective way of fusing features from different modalities by focusing on the most relevant information from each. Similarly, implementing Manifold Mixup at different levels of the model, rather than solely at the concatenation layer, could provide further regularization and performance improvements. Moreover, introducing novel components, such as Transformer blocks, might prove beneficial in capturing complex relationships within and across modalities.
This work takes place within the framework of EviRed, an ANR RHU project. It benefits from State aid managed by the French National Research Agency under the “Investissement d’Avenir” program bearing the reference ANR-18-RHUS-
References
1. Early Treatment Diabetic Retinopathy Study design and baseline patient characteristics: ETDRS report number 7. Ophthalmology 98(5, Supplement), 741–756 (1991)
2. Akhavan Aghdam, M., Sharifi, A., Pedram, M.M.: Combination of rs-fMRI and sMRI data to discriminate autism spectrum disorders in young children using deep belief network. Journal of Digital Imaging 31, 895–903 (2018)
3. Al-Absi, H.R., Islam, M.T., Refaee, M.A., Chowdhury, M.E., Alam, T.: Cardiovascular disease diagnosis from DXA scan and retinal images using deep learning. Sensors 22(12), 4310 (2022)
4. El-Sappagh, S., Abuhmed, T., Islam, S.R., Kwak, K.S.: Multimodal multitask deep learning model for Alzheimer’s disease progression detection based on time series data. Neurocomputing 412, 197–215 (2020)
5. Hao, X., Zhu, Y., Appalaraju, S., Zhang, A., Zhang, W., Li, B., Li, M.: MixGen: A new multi-modal data augmentation (2023)
6. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7132–7141 (2018)
7. Lahsaini, I., El Habib Daho, M., Chikh, M.A.: Deep transfer learning based classification model for COVID-19 using chest CT-scans. Pattern Recognition Letters 152, 122–128 (2021)
8. Li, J., Wei, D., Mao, M., Li, M., Liu, S., Li, F., Chen, L., Liu, M., Leng, H., Wang, Y., Ning, X., Liu, Y., Dong, W., Zhong, J.: Ultra-widefield color fundus photography combined with high-speed ultra-widefield swept-source optical coherence tomography angiography for non-invasive detection of lesions in diabetic retinopathy. Frontiers in Public Health 10 (2022)
9. Li, T., Bo, W., Hu, C., Kang, H., Liu, H., Wang, K., Fu, H.: Applications of deep learning in fundus images: A review (2021)
10. Li, Y., El Habib Daho, M., Conze, P.H., Al Hajj, H., Bonnin, S., Ren, H., Manivannan, N., Magazzeni, S., Tadayoni, R., Cochener, B., Lamard, M., Quellec, G.: Multimodal information fusion for glaucoma and diabetic retinopathy classification. In: Ophthalmic Medical Image Analysis. pp. 53–62. Cham (2022)
11. Lin, R., Hu, H.: Adapt and explore: Multimodal mixup for representation learning. Available at SSRN (2023)
12. Liu, Z., Tang, Z., Shi, X., Zhang, A., Li, M., Shrivastava, A., Wilson, A.G.: Learning multimodal data augmentation in feature space (2023)
13. Qian, X., Zhang, B., Liu, S., Wang, Y., Chen, X., Liu, J., Yang, Y., Chen, X., Wei, Y., Xiao, Q., et al.: A combined ultrasonic B-mode and color Doppler system for the classification of breast masses using neural network. European Radiology 30, 3023–3033 (2020)
14. Quellec, G., Al Hajj, H., Lamard, M., Conze, P.H., Massin, P., Cochener, B.: ExplAIn: Explanatory artificial intelligence for diabetic retinopathy diagnosis. Medical Image Analysis 72, 102118 (2021)
15. Shamshad, F., Khan, S., Zamir, S.W., Khan, M.H., Hayat, M., Khan, F.S., Fu, H.: Transformers in medical imaging: A survey. Medical Image Analysis 88, 102802
16. Silva, P.S., Dela Cruz, A.J., Ledesma, M.G., van Hemert, J., Radwan, A., Cavallerano, J.D., Aiello, L.M., Sun, J.K., Aiello, L.P.: Diabetic retinopathy severity and peripheral lesions are associated with nonperfusion on ultrawide field angiography. Ophthalmology 122(12), 2465–2472 (2015)
17. Sleeman, W.C., Kapoor, R., Ghosh, P.: Multimodal classification: Current landscape, taxonomy and future directions. ACM Comput. Surv. 55(7) (Dec 2022)
18. Sun, Z., Yang, D., Tang, Z., et al.: Optical coherence tomography angiography in diabetic retinopathy: an updated review. Eye 35(11), 149–161 (2021)
19. Teo, Z.L., Tham, Y.C., Yu, M., Chee, M.L., Rim, T.H., Cheung, N., Bikbov, M.M., Wang, Y.X., Tang, Y., Lu, Y., et al.: Global prevalence of diabetic retinopathy and projection of burden through 2045: systematic review and meta-analysis. Ophthalmology 128(11), 1580–1591 (2021)
20. Verma, V., Lamb, A., Beckham, C., Najafi, A., Mitliagkas, I., Courville, A., Lopez-Paz, D., Bengio, Y.: Manifold mixup: Better representations by interpolating hidden states (2019)
21. Wisely, C.E., Wang, D., Henao, R., Grewal, D.S., Thompson, A.C., Robbins, C.B., Yoon, S.P., Soundararajan, S., Polascik, B.W., Burke, J.R., Liu, A., Carin, L., Fekrat, S.: Convolutional neural network to identify symptomatic Alzheimer’s disease using multimodal retinal imaging. British Journal of Ophthalmology 106(3), 388–395 (2022)
22. Wu, J., Fang, H., Li, F., Fu, H., Lin, F., Li, J., Huang, L., Yu, Q., Song, S., Xu, X., et al.: GAMMA challenge: glaucoma grading from multi-modality images. arXiv preprint arXiv:2202.06511 (2022)
23. Xiong, J., Li, F., Song, D., Tang, G., He, J., Gao, K., Zhang, H., Cheng, W., Song, Y., Lin, F., et al.: Multimodal machine learning using visual fields and peripapillary circular OCT scans in detection of glaucomatous optic neuropathy. Ophthalmology 129(2), 171–180 (2022)
24. Yang, J., Zhang, B., Wang, E., et al.: Ultra-wide field swept-source optical coherence tomography angiography in patients with diabetes without clinically detectable retinopathy. BMC Ophthalmology 21(1), 192 (2021)
25. Zang, P., Hormel, T.T., Wang, X., Tsuboi, K., Huang, D., Hwang, T.S., Jia, Y.: A diabetic retinopathy classification framework based on deep-learning analysis of OCT angiography. Translational Vision Science & Technology 11(7), 10–10 (2022)
26. Zhang, H., Cissé, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. CoRR abs/1710.09412 (2017)
27. Zhao, X., Chen, Y., Liu, S., Zang, X., Xiang, Y., Tang, B.: TMMDA: A new token mixup multimodal data augmentation for multimodal sentiment analysis. In: Proceedings of the ACM Web Conference 2023. pp. 1714–1722. WWW ’23, Association for Computing Machinery (2023)
28. Zong, W., Lee, J.K., Liu, C., Carver, E.N., Feldman, A.M., Janic, B., et al.: A deep dive into understanding tumor foci classification using multiparametric MRI based on convolutional neural network. Medical Physics 47(9), 4077–4086 (2020)
ResearchGate has not been able to resolve any citations for this publication.
Full-text available
Purpose To compare the detection rate of diabetic retinopathy (DR) lesions and the agreement of DR severity grading using the ultra-widefield color fundus photography (UWF CFP) combined with high-speed ultra-widefield swept-source optical coherence tomography angiography (UWF SS-OCTA) or fluorescein angiography (FFA). Methods This prospective, observational study recruited diabetic patients who had already taken the FFA examination from November 2021 to June 2022. These patients had either no DR or any stage of DR. All participants were imaged with a 200° UWF CFP and UWF SS-OCTA using a 24 × 20 mm scan model. Images were independently evaluated for the presence or absence of DR lesions including microaneurysms (MAs), intraretinal hemorrhage (IRH), non-perfusion areas (NPAs), intraretinal microvascular abnormalities (IRMAs), venous beading (VB), neovascularization elsewhere (NVE), neovascularization of the optic disc (NVD), and vitreous or preretinal hemorrhage (VH/PRH). Agreement of DR severity grading based on UWF CFP plus UWF SS-OCTA and UWF CFP plus FFA was compared. All statistical analyses were performed using SPSS V.26.0. Results One hundred and fifty-three eyes of 86 participants were enrolled in the study. The combination of UWF CFP with UWF SS-OCTA showed a similar detection rate compared with UWF CFP plus FFA for all the characteristic DR lesions ( p >0.05), except NPAs ( p = 0.039). Good agreement was shown for the identification of VB (κ = 0.635), and very good agreement for rest of the DR lesions between the two combination methods (κ-value ranged from 0.858 to 0.974). When comparing the grading of DR severity, very good agreement was achieved between UWF CFP plus UWF SS-OCTA and UWF CFP plusr FFA (κ = 0.869). Conclusion UWF CFP plus UWF SS-OCTA had a very good agreement in detecting DR lesions and determining the severity of DR compared with UWF CFP plus FFA. 
This modality has the potential to be used as a fast, reliable, and non-invasive method for DR screening and monitoring in the future.
Full-text available
Multimodal information is frequently available in medical tasks. By combining information from multiple sources, clinicians are able to make more accurate judgments. In recent years, multiple imaging techniques have been used in clinical practice for retinal analysis: 2D fundus photographs, 3D optical coherence tomography (OCT) and 3D OCT angiography, etc. Our paper investigates three multimodal information fusion strategies based on deep learning to solve retinal analysis tasks: early fusion, intermediate fusion, and hierarchical fusion. The commonly used early and intermediate fusion are simple but do not fully exploit the complementary information between modalities. We developed a hierarchical fusion approach that focuses on combining features across multiple dimensions of the network, as well as exploring the correlation between modalities. These approaches were applied to glaucoma and diabetic retinopathy classification, using the public GAMMA dataset (fundus photographs and OCT) and a private dataset of PLEX®Elite 9000 (Carl Zeis Meditec Inc.) OCT angiography acquisitions, respectively. Our hierarchical fusion method performed the best in both cases and paved the way for better clinical diagnosis.KeywordsGlaucoma classificationDiabetic retinopathy classificationMultimodal information fusionDeep learningComputer-aided diagnosis
Purpose: Reliable classification of referable and vision threatening diabetic retinopathy (DR) is essential for patients with diabetes to prevent blindness. Optical coherence tomography (OCT) and its angiography (OCTA) have several advantages over fundus photographs. We evaluated a deep-learning-aided DR classification framework using volumetric OCT and OCTA. Methods: Four hundred fifty-six OCT and OCTA volumes were scanned from eyes of 50 healthy participants and 305 patients with diabetes. Retina specialists labeled the eyes as non-referable (nrDR), referable (rDR), or vision threatening DR (vtDR). Each eye underwent a 3 × 3-mm scan using a commercial 70 kHz spectral-domain OCT system. We developed a DR classification framework and trained it using volumetric OCT and OCTA to classify eyes into rDR and vtDR. For the scans identified as rDR or vtDR, 3D class activation maps were generated to highlight the subregions which were considered important by the framework for DR classification. Results: For rDR classification, the framework achieved a 0.96 ± 0.01 area under the receiver operating characteristic curve (AUC) and 0.83 ± 0.04 quadratic-weighted kappa. For vtDR classification, the framework achieved a 0.92 ± 0.02 AUC and 0.73 ± 0.04 quadratic-weighted kappa. In addition, the multiple DR classification (non-rDR, rDR but non-vtDR, or vtDR) achieved a 0.83 ± 0.03 quadratic-weighted kappa. Conclusions: A deep learning framework based only on OCT and OCTA can provide specialist-level DR classification using a single imaging modality. Translational relevance: The proposed framework can be used to develop a clinically valuable automated DR diagnosis system because of the specialist-level performance shown in this study.
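The abstracts above report agreement as kappa and quadratic-weighted kappa. As a reading aid, the following is a minimal sketch of the standard quadratic-weighted kappa for ordinal grades; it is not the code used in any of the cited studies. The function name `quadratic_weighted_kappa` and the integer encoding of grades (for example 0 = nrDR, 1 = rDR, 2 = vtDR) are assumptions made for illustration only.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_classes):
    """Quadratic-weighted Cohen's kappa for two lists of integer grades.

    Grades are assumed to be integers in range(n_classes); this encoding
    is an illustrative assumption, not taken from the cited papers.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of grades")
    if n_classes < 2:
        raise ValueError("at least two classes are required")

    n = len(rater_a)
    # Observed confusion matrix: rows = rater A, columns = rater B.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    # Marginal grade counts for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_classes))
              for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Disagreement weight grows with the squared grade distance.
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters graded independently.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one grade")
    return 1.0 - weighted_observed / weighted_expected


# Toy example: four eyes, one adjacent-grade disagreement.
example_kappa = quadratic_weighted_kappa([0, 1, 2, 2], [0, 1, 1, 2], 3)
```

Because the weights are quadratic, confusing nrDR with vtDR is penalized four times as heavily as confusing adjacent grades, which is why this variant is commonly preferred for ordinal severity scales.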
Following unprecedented success on natural language tasks, Transformers have been successfully applied to several computer vision problems, achieving state-of-the-art results and prompting researchers to reconsider the supremacy of convolutional neural networks (CNNs) as de facto operators. Capitalizing on these advances in computer vision, the medical imaging field has also witnessed growing interest in Transformers, which can capture global context, in contrast to CNNs with local receptive fields. Inspired by this transition, in this survey, we attempt to provide a comprehensive review of the applications of Transformers in medical imaging covering various aspects, ranging from recently proposed architectural designs to unsolved issues. Specifically, we survey the use of Transformers in medical image segmentation, detection, classification, restoration, synthesis, registration, clinical report generation, and other tasks. In particular, for each of these applications, we develop a taxonomy, identify application-specific challenges, provide insights to solve them, and highlight recent trends. Further, we provide a critical discussion of the field's current state as a whole, including the identification of key challenges, open problems, and promising future directions. We hope this survey will ignite further interest in the community and provide researchers with an up-to-date reference regarding applications of Transformer models in medical imaging. Finally, to cope with the rapid development in this field, we intend to regularly update the relevant latest papers and their open-source implementations at
Multimodal classification research has been gaining popularity with new datasets in domains such as satellite imagery, biometrics, and medicine. Prior research has shown the benefits of combining data from multiple sources compared to traditional unimodal data, which has led to the development of many novel multimodal architectures. However, the lack of consistent terminologies and architectural descriptions makes it difficult to compare different solutions. We address these challenges by proposing a new taxonomy for describing multimodal classification models based on trends found in recent publications. Examples of how this taxonomy could be applied to existing models are presented, as well as a checklist to aid in the clear and complete presentation of future models. Many of the most difficult aspects of unimodal classification have not yet been fully addressed for multimodal datasets, including big data, class imbalance, and instance-level difficulty. We also provide a discussion of these challenges and future directions of research.
COVID-19 is an infectious and contagious virus. As of this writing, more than 160 million people have been infected since its emergence, including more than 125,000 in Algeria. In this work, we first collected a dataset of 4,986 COVID and non-COVID images confirmed by RT-PCR tests at Tlemcen hospital in Algeria. Then we performed transfer learning on deep learning models that obtained the best results on the ImageNet dataset, such as DenseNet121, DenseNet201, VGG16, VGG19, Inception Resnet-V2, and Xception, in order to conduct a comparative study. Based on this comparison, we propose an explainable model built on the DenseNet201 architecture and the Grad-CAM explanation algorithm to detect COVID-19 in chest CT images and explain the output decision. Experiments have shown promising results and indicate that the introduced model can be beneficial for diagnosing and following up patients with COVID-19.
Purpose: To develop and test a multi-modal artificial intelligence (AI) algorithm, FusionNet, using the pattern deviation probability plots (PDPs) from visual field (VF) reports and circular peripapillary optical coherence tomography (OCT) to detect glaucomatous optic neuropathy (GON). Design: Cross-sectional study. Subjects: A total of 2463 pairs of VF and OCT images from 1083 patients. Methods: A novel deep learning algorithm (FusionNet) based on bimodal input of VF-OCT paired data was developed to detect GON. VF data were collected using Humphrey Field Analyzer (HFA). OCT images were collected from three types of devices (DRI-OCT, Cirrus OCT and Spectralis OCT). A total of 2463 pairs of VF and OCT images were divided into four datasets: 1567 for training (HFA and DRI-OCT), 441 for primary validation (HFA and DRI-OCT), 255 for the internal test set (HFA and Cirrus OCT), and 200 for the external test set (HFA and Spectralis OCT). GON was defined as retinal nerve fibre layer (RNFL) thinning with corresponding VF defects. Main outcome measures: The diagnostic performance of FusionNet was compared with that of VFNet (with VF data as input) and OCTNet (with OCT data as input). Results: FusionNet achieved an area under the receiver operating characteristic curve (AUROC) of 0.950 (0.931-0.968) and outperformed VFNet (AUROC: 0.868 [0.834-0.902]), OCTNet (AUROC: 0.809 [0.768-0.850]), and two glaucomatologists (AUROC: 0.882 [0.847-0.917], AUROC: 0.883 [0.849-0.918]) in the primary validation set. In the internal and external test sets, FusionNet performance continued to be superior (AUROC: 0.917 [0.876-0.958], AUROC: 0.873 [0.822-0.924]) to VFNet (AUROC: 0.854 [0.796-0.912], AUROC: 0.772 [0.707-0.838]), and OCTNet (AUROC: 0.811 [0.753-0.869], AUROC: 0.785 [0.721-0.850]). There was no significant difference between the two glaucomatologists (AUROC: 0.869 [0.818-0.920] and 0.839 [0.777-0.901]; AUROC: 0.841 [0.780-0.902]) and FusionNet in the internal and external test sets, except for glaucomatologist 2 (AUROC: 0.858 [0.805-0.912]) in the internal test set. Conclusions: FusionNet, developed using paired VF-OCT data, demonstrated superior performance to both VFNet and OCTNet in detecting GON, suggesting multi-modal machine learning models are valuable in detecting GON.