MamT4: Multi-view Attention Networks for
Mammography Cancer Classification
Alisher Ibragimov1, Sofya Senotrusova1, Arsenii Litvinov1, Egor Ushakov1, Evgeny Karpulevich1, and Yury Markin1
1Information Systems Department, ISP RAS, Russia
{ibragimov,senotrusova,filashkov,ushakov,karpulevich,ustas}@ispras.ru
Abstract—In this study, we introduce a novel method, called
MamT4, which is used for simultaneous analysis of four mam-
mography images. A decision is made based on one image of a
breast, with attention also devoted to three additional images:
another view of the same breast and two images of the other
breast. This approach enables the algorithm to closely replicate
the practice of a radiologist who reviews the entire set of
mammograms for a patient. Furthermore, this paper emphasizes
the preprocessing of images, specifically proposing a cropping
model (U-Net based on ResNet-34) to help the method remove
image artifacts and focus on the breast region. To the best of
our knowledge, this study is the first to achieve a ROC-AUC
of 84.0 ± 1.7 and an F1 score of 56.0 ± 1.3 on an independent
test set of the Vietnamese digital mammography dataset (VinDr-Mammo),
preprocessed with the cropping model.
Index Terms—Breast cancer, Computer-aided diagnosis, Deep
learning, Multi-view mammogram
I. INTRODUCTION
Breast cancer is a leading cause of cancer-related deaths
among women [1]. Regular screening is essential for early
detection, with mammography being the primary screening
tool [2]. Mammography utilizes low-dose X-rays to detect
tissue changes in the breast, making it effective in detecting
malignancies like microcalcifications and clusters of calci-
fications [3]. Radiologists interpret mammograms based on
standard terminology and the BI-RADS classification system,
facilitating standardized reporting and risk assessment [4].
Although mammography is effective, it can result in false
positives or negatives, requiring additional tests like biop-
sies [5]. To improve the efficiency of early screening, au-
tomated approaches in mammography, such as computer-
aided diagnosis (CAD) systems, as well as solutions using
machine learning and deep learning (DL) technologies, are
being actively developed, assisting radiologists in interpreting
mammograms [6].
Deep Learning has emerged as a highly effective method
for image classification [7]. Furthermore, DL has become one
of the popular methods for detection of cancer pathologies,
particularly, on mammograms [8].
A key aspect of mammographic examinations is the acquisition of images in
different projections for each breast, requiring four images in total – two
for each breast: the mediolateral oblique (MLO) and craniocaudal (CC) views.
Radiologists analyze the symmetry of lesions [9]. This unique feature affects
the training of DL models and motivates the use of multi-view learning, a
novel approach based on learning from all four projections at once, which is
presented in this study.
Fig. 1. An overview of the training process of the CNN and classification layer
applied to a binary classification problem using a single view. Subsequently,
the trained CNN block is employed to derive a feature vector ($z_i$) from the
mammography image ($x_i$).
Given the widespread signif-
icance of breast cancer diagnosis, this research on utilizing
deep learning methods offers a valuable chance to enhance
the automation of breast pathology detection and streamline
the tasks of radiologists.
To sum up, the contributions of our paper are:
1) We propose MamT4: a novel classification framework
based on a Transformer Encoder that utilizes feature-vector
representations from the four views of a mammography
study and outperforms single-view methods in classifying
cancer status.
2) To the best of our knowledge, this paper is the first to
achieve a ROC-AUC of 84.0 ± 1.7 and an F1 score of 56.0 ± 1.3
on the VinDr-Mammo dataset (test subset).
3) As a preprocessing step, we propose cropping the breast
along its border using U-Net to increase the quality of
classification.
II. BACKGROUND AND PRELIMINARIES
A. U-Net
U-Net [10], introduced in 2015, is an encoder-decoder
network tailored for semantic segmentation, which excels
in medical image segmentation. The architecture efficiently
maps low-resolution encoder features back to the input resolution
through a decoder that reuses the corresponding encoder features
via skip connections for precise pixelwise classification. This setup enables U-Net
to accurately delineate detailed features in medical images,
crucial for identifying and segmenting various anatomical
structures and abnormalities [11]. Its ability to handle small
datasets effectively and its adaptability to various medical
imaging modalities have made U-Net a standard choice in
medical image analysis, enhancing diagnostic accuracy and
aiding in clinical decision-making.
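The cropping model used later in this paper pairs U-Net with a ResNet-34 encoder pretrained on ImageNet. As a minimal sketch of such a configuration, assuming the segmentation_models_pytorch library (the paper does not name its implementation), the model could be instantiated as follows:

```python
import segmentation_models_pytorch as smp
import torch

# Hypothetical sketch: U-Net with a ResNet-34 encoder pretrained on ImageNet,
# predicting a single-channel breast-vs-background mask.
model = smp.Unet(
    encoder_name="resnet34",      # encoder backbone
    encoder_weights="imagenet",   # ImageNet-pretrained weights
    in_channels=3,                # mammograms replicated to 3 channels
    classes=1,                    # binary breast mask
)

x = torch.randn(1, 3, 512, 512)   # dummy mammogram batch
with torch.no_grad():
    mask_logits = model(x)        # shape: (1, 1, 512, 512)
```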
B. Transformer Encoder
Our approach draws inspiration from the ViT (Vision Trans-
former) framework [12]. The Transformer Encoder (TE) block
consists of layer normalization, multi-head self-attention (MSA), and
a multi-layer perceptron (MLP). As shown in Figure 2, the first TE block
accepts the combined embeddings as input; for all subsequent blocks, the
inputs are the outputs of the previous TE block. There are L such TE blocks
in total. Inside the TE, the inputs are first passed through a layer norm and
then fed to an MSA layer with N heads. Once we obtain the outputs
from the MSA layer, these are added to the inputs (with skip
connection) to get the outputs that again get passed to layer
norm before being fed to the MLP block. The MLP consists
of two linear layers and a GELU non-linearity. The outputs
from the MLP block are again added to the inputs to get the
final output from one TE block.
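For illustration, a minimal pre-norm TE block matching this description (layer norm, MSA with a skip connection, layer norm, then a two-layer MLP with GELU and a skip connection) could be sketched in PyTorch as follows; the dimensions are placeholders, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TEBlock(nn.Module):
    """One Transformer Encoder block: pre-norm MSA and MLP, each with a skip connection."""
    def __init__(self, dim: int = 192, num_heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # MSA on the normalized inputs, added back through the skip connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # MLP on the normalized inputs, added back through the skip connection
        x = x + self.mlp(self.norm2(x))
        return x

tokens = torch.randn(2, 33, 192)   # (batch, 32 patch tokens + 1 class token, dim)
out = TEBlock()(tokens)            # same shape as the input
```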
C. Loss Function
The focal loss (FL) function was selected to achieve greater
stability when training on both frequent (normal cases)
and rare (cancer cases) images [13]. For the case of binary
classification, the focal loss can be written in the following
form [14]:
$\mathrm{FL}(p_t) = -\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t)$,   (1)
where $\gamma \ge 0$ is a tunable focusing parameter, and $\alpha_t$ is a
weighting factor for the different classes that balances the importance
of positive and negative examples. Namely, $\alpha_1 = 1 - N_c/N$ for the
class {cancer} and $\alpha_0 = 1 - \alpha_1$ for the class {normal}, where
$N_c$ is the number of images marked as {cancer} and $N$ is the
total number of images in the dataset. For notational convenience,
$p_t$ was defined as:
$p_t = \begin{cases} p & \text{if } y = \text{cancer} \\ 1 - p & \text{otherwise.} \end{cases}$   (2)
In the above, $y$ specifies the ground-truth class and $p \in [0, 1]$
is the model's estimated probability for the class with label
$y = \text{cancer}$. This approach improves performance on the
small number of cancer images due to the modulating factor
$(1 - p_t)^{\gamma}$.
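A minimal sketch of this binary focal loss, written directly from Eqs. (1)–(2) (the paper itself does not provide code), might look like:

```python
import torch

def binary_focal_loss(p: torch.Tensor, y: torch.Tensor,
                      alpha1: float = 0.95, gamma: float = 2.0) -> torch.Tensor:
    """Focal loss for binary classification, following Eqs. (1)-(2).

    p: predicted probability of the 'cancer' class, in [0, 1]
    y: ground-truth labels, 1 for cancer and 0 for normal
    alpha1: weight of the cancer class (alpha0 = 1 - alpha1)
    """
    eps = 1e-7
    p_t = torch.where(y == 1, p, 1.0 - p)                                   # Eq. (2)
    alpha_t = torch.where(y == 1,
                          torch.full_like(p, alpha1),
                          torch.full_like(p, 1.0 - alpha1))
    loss = -alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=eps))  # Eq. (1)
    return loss.mean()

p = torch.tensor([0.9, 0.2, 0.7])
y = torch.tensor([1, 0, 1])
print(binary_focal_loss(p, y))
```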
III. METHODOLOGY
In this paper, we propose a novel method to improve the
quality of mammography image classification. To achieve
this goal, our approach relies on the two main components
presented in Section II: image preprocessing using U-Net and
a Transformer Encoder used to implement MamT4. An overview of
our proposed method is shown in Figure 2.
A. Crop mammogram
The first step in image preprocessing is selecting the region
of interest (ROI) through cropping. This process involves
segmentation of large images to focus on specific areas that
are of interest for further analysis, thereby making the task
easier for subsequent neural network predictions. Cropping
is useful as it helps concentrate the model’s attention on
relevant features without the distraction of background ar-
tifacts, which can be especially beneficial in datasets with
limited examples [15]. By selecting these regions and resizing
them uniformly, models can more consistently learn from and
recognize similar patterns in new data.
B. Multi-view Analysis
Inspired by previous work on the utilization of several
images (Section V) to predict labels in classification tasks, we
present a method that includes two training stages to construct
a classification model that considers feature vectors from all
images in a mammography exam.
First, we train a CNN whose last layer is replaced by a
classification layer with only one neuron to solve the binary
classification problem ({cancer, normal}), as shown in Figure 1.
The CNN weights are initialized from a model pretrained on
ImageNet [16]. The trained model serves as a mapping
of the image $x_i$ into the feature vector $f(x_i) = z_i$. Second, the
CNN block from the previous stage is taken to build a four-view
mammogram classifier based on the Transformer Encoder
(MamT4, Figure 2). The CNN extracts feature vectors ($z_i^0$, $z_i^1$,
$z_i^2$, $z_i^3$) from both breasts (left and right) and both projections
(MLO and CC). During training at this stage, the weights of the CNN
block are frozen. MamT4 predicts labels for the $x_i^0$
image, with the $x_i^1$, $x_i^2$, $x_i^3$ images of the study serving as additional information
(or as metadata).
Each feature vector $z_i$ from EfficientNet-B3 (the motivation for
using EfficientNet-B3 is given in Section IV) has length
1536. We divide each vector into 8 tokens (the number can
be considered a hyperparameter) of size 192; four
vectors therefore yield 32 tokens. Learnable position embeddings are
added to the tokens to retain positional information. Similar to
BERT [17] and ViT, we add a learnable [class] token,
whose state at the output of the TE is fed to the MLP head (which
is nothing but a linear layer of size $192 \times 1$) to get class predictions.
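As an illustrative sketch of this tokenization and classification head (not the authors' exact implementation; layer names and module choices are assumptions), the four 1536-dimensional feature vectors can be reshaped into 32 tokens of size 192, prepended with a learnable [class] token, combined with positional embeddings, and passed through a Transformer Encoder:

```python
import torch
import torch.nn as nn

class MamT4Head(nn.Module):
    """Sketch of the four-view head: 4 x 1536 features -> 32 tokens of size 192 -> TE -> MLP head."""
    def __init__(self, feat_dim: int = 1536, token_dim: int = 192,
                 depth: int = 12, num_heads: int = 12):
        super().__init__()
        self.token_dim = token_dim
        num_tokens = 4 * (feat_dim // token_dim)               # 4 views x 8 tokens = 32 tokens
        self.cls_token = nn.Parameter(torch.zeros(1, 1, token_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens + 1, token_dim))
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=num_heads,
                                           activation="gelu", batch_first=True,
                                           norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(token_dim, 1)                    # MLP head: 192 -> 1

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, 4, 1536) feature vectors from the frozen CNN, one per view
        b = z.size(0)
        tokens = z.reshape(b, -1, self.token_dim)              # (batch, 32, 192)
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, tokens], dim=1) + self.pos_embed   # prepend [class] token, add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                              # logit from the [class] token state

features = torch.randn(2, 4, 1536)   # batch of two studies, four views each
logits = MamT4Head()(features)       # shape: (2, 1)
```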
IV. EXPERIMENTS
We evaluate our proposed MamT4 framework on cancer
classification tasks. Experiments are conducted on the VinDr-
Mammo dataset [18], which was released quite recently and
contains 5,000 mammography exams (four images per patient,
20,000 digital mammograms in total). The annotated exams
are split into a training set of 4,000 exams and a test set
of 1,000 exams. The dataset provides BI-RADS labels; the
images are binarized similarly to the solution proposed in [19],
specifically: categories 1 and 2 – “normal”, 4 and 5 – “cancer”,
with category 3 excluded. Thus, each image has one of two labels
{cancer, normal}.
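A minimal sketch of this label binarization, assuming the per-image BI-RADS annotations are loaded into a pandas DataFrame with a hypothetical breast_birads column (column names are assumptions, not the dataset's exact schema):

```python
import pandas as pd

# Hypothetical annotation table; the real VinDr-Mammo CSV schema may differ.
df = pd.DataFrame({"image_id": ["a", "b", "c", "d", "e"],
                   "breast_birads": [1, 2, 3, 4, 5]})

# Categories 1, 2 -> "normal"; 4, 5 -> "cancer"; category 3 is excluded.
mapping = {1: "normal", 2: "normal", 4: "cancer", 5: "cancer"}
df = df[df["breast_birads"] != 3].copy()
df["label"] = df["breast_birads"].map(mapping)
print(df[["image_id", "label"]])
```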
Fig. 2. A summary of the MamT4 framework that is used for cancer classification on $x_i^0$ images. Here, $x_i^0$ represents the primary view, while $x_i^1$ is the
corresponding ipsilateral view to the primary one. Similarly, $x_i^2$ depicts the corresponding bilateral view to the main view, whereas $x_i^3$ illustrates the ipsilateral
view of $x_i^2$. The CNN block, which gives the feature vectors ($z_i^0$, $z_i^1$, $z_i^2$, $z_i^3$), each with fixed length 1536, is untrainable during this stage. Each
vector is divided into fixed-size patches, each of which is linearly embedded. After adding position embeddings, the resulting sequence of vectors is fed to
a Transformer Encoder. In order to perform classification, we use the standard approach of adding an extra learnable [class] token to the sequence. The
illustration of the Transformer Encoder was inspired by Dosovitskiy et al. [12].
Evaluation. We use ROC-AUC (the “gold standard” for
binary classification with neural networks [20]) and F1 score
(the harmonic mean of the Precision and Recall) to measure
classification performance. Although we use 5-fold cross-validation
to choose the EfficientNet-B3 encoder, cross-validation is not
convenient for the proposed two-stage training method, because
the second stage of training MamT4 would have to reuse the exact
dataset splits from the first stage of training the CNN block. Thus,
the results in Tables III and IV are obtained by training with
5 different random seeds.
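For reference, both metrics can be computed with scikit-learn from per-image predicted probabilities and ground-truth labels; a minimal sketch (the decision threshold is an assumption, not stated in the paper):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

y_true = np.array([0, 1, 0, 1, 1, 0])               # 1 = cancer, 0 = normal
y_prob = np.array([0.1, 0.8, 0.3, 0.6, 0.4, 0.2])   # predicted cancer probabilities

roc_auc = roc_auc_score(y_true, y_prob)
y_pred = (y_prob >= 0.5).astype(int)                 # assumed decision threshold
f1 = f1_score(y_true, y_pred)                        # F1 of the cancer class
f1_macro = f1_score(y_true, y_pred, average="macro")
print(roc_auc, f1, f1_macro)
```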
A. Experimental Setup
Implementation Details. Our models are trained with the
PyTorch framework. In the training process, we set the initial
learning rate to $10^{-5}$ and start to attenuate the learning
rate when the F1 score on the test set stops improving within
5 epochs. If not specified otherwise, we use FL with the
following parameters: $\alpha_0 = 0.05$, $\alpha_1 = 0.95$, and $\gamma = 2.0$.
In the proposed approach, we first randomly apply the cropping method
to the mammogram. Then the image is resized to
$512 \times 512 \times 3$. We train the model for 200 epochs with the
option to stop early if the F1 score does not improve
within 10 epochs on the test set. We train the MamT4 model
with $L = N = 12$ TE blocks and MSA heads.
The optimal values of $L$ and $N$ could be treated as
hyperparameters in future studies.
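A minimal sketch of this schedule, assuming the Adam optimizer (the paper does not name the optimizer) with ReduceLROnPlateau driven by the test-set F1 score and a simple early-stopping counter; evaluate_f1 is a hypothetical helper:

```python
import torch

model = torch.nn.Linear(10, 1)                       # placeholder for the actual network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
# Reduce the learning rate when the monitored F1 score stops improving for 5 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", patience=5)

def evaluate_f1(model: torch.nn.Module) -> float:
    """Hypothetical stand-in for computing the F1 score on the test set."""
    return 0.0

best_f1, epochs_without_improvement = 0.0, 0
for epoch in range(200):
    # ... one training epoch and evaluation on the test set would go here ...
    f1 = evaluate_f1(model)
    scheduler.step(f1)                               # attenuate the learning rate on plateau
    if f1 > best_f1:
        best_f1, epochs_without_improvement = f1, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= 10:         # early stopping after 10 epochs
            break
```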
Baseline. Our framework is based on the EfficientNet-
B3 [21] model, pretrained on ImageNet [16], that is used as an
encoder. That model is chosen due to its superior performance
on ImageNet, publicly available pre-trained weights, and its
optimal balance between high performance and a manageable
number of parameters. We performed comparisons with other
models on the VinDr-Mammo dataset. Prior to the main
TABLE I
ENCODER PERFORMANCE COMPARISON BASED ON ROC-AUC METRICS ON THE VINDR-MAMMO DATASET.
Encoder                 ROC-AUC
EfficientNet-B3         79.2 ± 0.8
Swin (Tiny)             58.5 ± 3.2
Swin-V2 (Tiny)          69.5 ± 2.3
SegFormer-B1            61.6 ± 2.5
ResNet-18               78.1 ± 1.6
MobileNet-V3 (Large)    76.6 ± 1.8
TABLE II
DISTRIBUTION OF DATASETS SPLIT INTO TRAINING AND TESTING SUBSETS, INDICATING PERCENTAGE CONTRIBUTIONS TO THE CROPPING DATASET.
Dataset             Train (80%)   Test (20%)   Total
CBIS-DDSM [27]      46            16           62 (7.8%)
INbreast [28]       49            14           63 (7.9%)
KAU-BCMD [29]       51            6            57 (7.1%)
MIAS [30]           39            13           52 (6.5%)
CMMD [31]           55            11           66 (8.2%)
VinDr-Mammo [18]    400           100          500 (62.5%)
experiment, EfficientNet-B3 demonstrates the highest performance,
achieving a ROC-AUC score of 79.2 ± 0.8. Models
including Swin (Tiny) [22], Swin-V2 (Tiny) [23], SegFormer-
B1 [24], ResNet-18 [25], and MobileNet-V3 (Large) [26] were
evaluated. Detailed results for these models, averaged over five
independent runs across different data splits, are provided in
Table I.
Preprocessing. We tried different methods for obtaining
a breast mask: a classic method based on color selection
and a neural network method. The color thresholding method
involves two stages. In the first stage, we select a color
threshold for the images, setting it at one-quarter of the mean,
and then apply the threshold to create a binary mask. After
this, we select the largest region by area. This method is
computationally efficient and simple; however, it has lim-
itations because color values do not usually represent the
breast accurately, especially due to the presence of extraneous
artifacts such as labels on mammography images. It is also
challenging to separate the breast, which is the ROI, from other
body parts that may accidentally be present in the image.
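A minimal sketch of this baseline, assuming a grayscale mammogram loaded as a NumPy array and using OpenCV connected components to keep the largest region:

```python
import cv2
import numpy as np

def threshold_breast_mask(image: np.ndarray) -> np.ndarray:
    """Binary breast mask via quarter-of-the-mean thresholding + largest connected region."""
    threshold = image.mean() / 4.0
    mask = (image > threshold).astype(np.uint8)

    # Keep only the largest connected component (excluding the background label 0).
    num_labels, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    if num_labels <= 1:
        return mask
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    return (labels == largest).astype(np.uint8)

image = (np.random.rand(256, 256) * 255).astype(np.uint8)   # placeholder mammogram
mask = threshold_breast_mask(image)
```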
The second method is based on neural network predictions
using U-Net with a ResNet-34 encoder, pretrained on Ima-
geNet. We use our own dataset, in which non-professional annotators,
in consultation with mammologists, labeled the borders of
the breast without considering projection and laterality. This
dataset is compiled using images from six public datasets,
including VinDr-Mammo. The proportions of the cropping
dataset are shown in Table II. We develop a universal model
that works for each type of projection. We test this method
and observe a quality improvement compared to the color
thresholding method. Our fine-tuned U-Net model achieves a
mean IoU of 98.6% on the test subset.
After obtaining the mask prediction, we perform a centered
crop of the breast and then feed the cropped image into
a neural network to predict the label.
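A minimal sketch of this step, under the assumption that the crop is taken as the bounding box of the predicted mask and then resized to the 512 × 512 input resolution used above:

```python
import cv2
import numpy as np

def crop_to_breast(image: np.ndarray, mask: np.ndarray, size: int = 512) -> np.ndarray:
    """Crop the image to the bounding box of the predicted breast mask and resize."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:                       # empty mask: fall back to the full image
        cropped = image
    else:
        cropped = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return cv2.resize(cropped, (size, size), interpolation=cv2.INTER_LINEAR)

image = (np.random.rand(1024, 768) * 255).astype(np.uint8)   # placeholder mammogram
mask = np.zeros_like(image); mask[100:900, 50:600] = 1       # placeholder predicted mask
cropped = crop_to_breast(image, mask)                        # shape: (512, 512)
```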
Augmentations. We apply augmentations, such as random
shuffling, blurring, Gaussian noise, horizontal flips, hue sat-
uration value shifts, sharpening, grid dropouts, grid distor-
tions, coarse dropouts, and pixel dropouts. Cropping is applied
as follows: each image in the training set is cropped with
a probability of 0.5 (“with crop aug.”) or with a probability
of 1 (“with crop all”); all images in the test set are cropped.
In the four-view model, we randomly replace $x_i^1$, $x_i^2$, $x_i^3$ with
black images (we call this the EmptyImage augmentation) to indicate
to the model that the dataset may not always contain four images for
one patient; the probability of passing a black image, $p$, is
set to 0.2 during training and to 0 during testing.
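As a hedged sketch (the exact augmentation parameters are not reported), most of the listed augmentations map naturally onto a recent version of the albumentations library (the "random shuffling" step is omitted here), and the EmptyImage augmentation can be a simple random replacement of the auxiliary views with black images:

```python
import albumentations as A
import numpy as np

# Assumed albumentations pipeline covering the listed augmentations; probabilities are placeholders.
train_aug = A.Compose([
    A.Blur(p=0.2),
    A.GaussNoise(p=0.2),
    A.HorizontalFlip(p=0.5),
    A.HueSaturationValue(p=0.2),
    A.Sharpen(p=0.2),
    A.GridDropout(p=0.2),
    A.GridDistortion(p=0.2),
    A.CoarseDropout(p=0.2),
    A.PixelDropout(p=0.2),
])

def empty_image_augmentation(views: list, p: float = 0.2) -> list:
    """Randomly replace the auxiliary views x1, x2, x3 with black images (EmptyImage)."""
    primary, *auxiliary = views
    auxiliary = [np.zeros_like(v) if np.random.rand() < p else v for v in auxiliary]
    return [primary] + auxiliary

views = [np.random.randint(0, 255, (512, 512, 3), dtype=np.uint8) for _ in range(4)]
views = [train_aug(image=v)["image"] for v in views]
views = empty_image_augmentation(views, p=0.2)   # p = 0 at test time
```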
B. Results and Discussion
For testing, we compare a model trained on the original
dataset without cropping, which tests ordinary non-cropped
images, against a model trained with crop augmentation and
evaluated on the same test set but preprocessed with U-Net
for cropping. The results, presented in Table III, indicate
that cropping improves the quality of predictions. We also
investigate Grad-CAM++ [32] visualizations of the neural
networks’ predictions (Figure 3). Cropping creates a simpler
dataset for the deep learning model, containing only relevant
information. We also believe that cropping enhances prediction
quality because it adjusts the entire breast size to fit within the
picture area, thereby achieving scale unification. Furthermore,
with the entire image area covered by the breast, the model
can better focus on the breast, as the crucial elements, breast
tissues, appear larger than in original images.
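Such Grad-CAM++ visualizations can be generated with the pytorch-grad-cam package; the following is a hedged sketch, not the authors' code, and the backbone and target layer are placeholders (the paper uses EfficientNet-B3):

```python
import torch
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from torchvision.models import resnet18

model = resnet18()                             # placeholder backbone for illustration
target_layers = [model.layer4[-1]]             # assumed last convolutional block

cam = GradCAMPlusPlus(model=model, target_layers=target_layers)
input_tensor = torch.randn(1, 3, 512, 512)     # preprocessed (cropped, resized) mammogram
# Class index 0 is a placeholder for the "cancer" output neuron.
grayscale_cam = cam(input_tensor=input_tensor, targets=[ClassifierOutputTarget(0)])
print(grayscale_cam.shape)                     # (1, 512, 512) heatmap to overlay on the image
```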
Fig. 3. The visualization contrasts the predictions of two models: the model
without preprocessing and the model utilizing cropping. In the first two
columns, we demonstrate instances where the cropping model made correct
predictions on the cropped images, while the model without cropping failed
to do so on the original images. The third column presents a less common
scenario (34 instances as opposed to 84) where the model without cropping
correctly identifies the images as normal, and the cropping model errs.
Following that, there is an example where both models correctly identify
cancer. Finally, the last column shows a case where both models make
incorrect predictions.
Table III shows the performance of MamT4 on the VinDr-Mammo
dataset compared with the single-view method (with
crop and with crop aug.) and the method without cropping
(w/o crop). Comparing the two cropping methods, all three
metrics coincide within the standard deviation, so for consis-
tency in the rest of the experiments, “with crop” means “with
crop aug.” The relative improvement of four-view MamT4
(with crop) compared to single-view EfficientNet-B3 (w/o
crop) is 4.4% in ROC-AUC, 11.3% in F1 score, and 5.7% in F1
score (macro). Note that the VinDr-Mammo dataset includes
the complete set of four images for each study, with the sole
exception of one patient out of 4,000 from the training set,
who is manually removed from the dataset (4,000→3,999).
Our approach with EmptyImage augmentation can be used
for datasets like CBIS-DDSM [27], which does not have all four
mammogram images for every patient.
Recently, ROC-AUC of 75.3% and F1 score (macro) of
76.0% have been achieved on the VinDr-Mammo dataset [33].
They divide BI-RADS into two classes differently than we do:
BI-RADS 2 – “normal”, 4 and 5 – “cancer”, while BI-RADS 1 and 3
are not included. Table IV shows the performance of our methods
under this class division. In that case, FL has $\alpha_1 = 0.87$.
We adopt a two-stage approach in our new framework:
MamT4, a cancer classification model, is based on the analysis
of four views. It is important to note that our framework is
versatile and can be adapted to predict different scenarios, such
as two projections for a single breast or any required number
of images in a specific area.
TABLE III
PERFORMANCE COMPARISON OF THREE METHODS ON THE VINDR-MAMMO DATASET.
Method             ROC-AUC      F1           F1 (macro)
w/o crop           79.6 ± 2.0   44.7 ± 0.4   71.10 ± 0.20
with crop aug.     83.8 ± 0.4   53.2 ± 1.1   75.5 ± 0.5
with crop all      82.4 ± 1.0   52.8 ± 0.8   75.3 ± 0.4
MamT4 with crop    84.0 ± 1.7   56.0 ± 1.3   76.8 ± 0.8
TABLE IV
PERFORMANCE COMPARISON OF TWO METHODS ON THE VINDR-MAMMO DATASET, WHEN WE ASSUME BI-RADS 2 – “NORMAL” AND 4, 5 – “CANCER”; BI-RADS 1 AND 3 ARE NOT INCLUDED.
Method             ROC-AUC      F1           F1 (macro)
with crop          79.9 ± 0.9   57.8 ± 1.1   75.0 ± 0.7
MamT4 with crop    80.3 ± 1.1   61.0 ± 1.4   77.2 ± 0.7
V. RELATED WORK
Multi-view Analysis. The rationale for using four images
simultaneously for prediction is that, in radiologist practices,
the symmetry information from other images is utilized to
improve the accuracy of decisions. For example, a lesion in
one breast rarely appears in the corresponding area in the other
breast [34]. Other multi-view approaches, using two to four
images as inputs, have also been proposed [34]–[39]. Recent
studies indicate that multiple-view approaches improve breast
cancer diagnosis [33], [40], [41].
Applications. The proposed method of using a multi-view
model to determine the diagnosis of breast cancer from a mam-
mogram can also be applied to other areas of medicine where
multiple projections or types of images need to be analyzed
for more accurate diagnosis. For example, this technique can
be effectively applied to identify the diagnosis of other types
of cancer, such as lung, stomach, skin, from different types of
scans like computed tomography (CT), magnetic resonance
imaging (MRI), and ultrasound, where deep learning tech-
niques are already being actively applied [42], [43].
One example of medical imaging that requires the analysis
of multiple projections or types of images for a more accurate
diagnosis may be CT in the examination of patients with
head injuries. If there is a suspicion of skull fracture or other
injuries, doctors may need to review images from different
projections to get a complete picture of the injuries and
choose the best treatment method. Studies on multi-class
semantic segmentation and on detection of abnormalities in
traumatic brain injury have already shown their effectiveness
and the positive impact of artificial intelligence techniques in
optimizing workflow in radiology [44], [45]. Therefore, the
proposed method on analyzing multiple projections of CT
scans can significantly increase the performance in detecting
suspicious areas and thus help clinicians to make a more
accurate informed decision on further treatment of the patient.
VI. CONCLUSION
Our study achieved metrics on the independent VinDr-
Mammo dataset, including a ROC-AUC of 84.0 ± 1.7 and an
F1 score of 56.0 ± 1.3. The preprocessing method involved
a cropping model that focused on the breast region and
removed extraneous artifacts, while also enlarging the breast
to the image's size, allowing the classification model to better
distinguish details.
Our framework MamT4 utilized multi-view analysis based
on a Transformer Encoder to improve cancer classification accu-
racy. Depending on task domain, the number of input images
can be increased or decreased. This approach can also be
beneficial in various medical imaging applications beyond
mammography where different projections of the same object
are used, improving accuracy and helping physicians to make
informed patient treatment decisions.
REFERENCES
[1] P. Hopewood and M. Milroy, Quality Cancer Care: Survivorship
Before, During and After Treatment. Springer International
Publishing, 2018. [Online]. Available: https://books.google.ru/books?
id=W9ldDwAAQBAJ
[2] M. Mainiero, L. Moy, P. Baron, A. Didwania, R. diFlorio, E. Green,
S. Heller, A. Holbrook, S.-J. Lee, A. Lewin, A. Lourenco, K. Nance,
B. Niell, P. Slanetz, A. Stuckey, N. Vincoff, S. Weinstein, M. Yepes,
and M. Newell, “Acr appropriateness criteria ® breast cancer screening,”
Journal of the American College of Radiology, vol. 14, pp. S383–S390,
11 2017.
[3] J. Tang, R. M. Rangayyan, J. Xu, I. E. Naqa, and Y. Yang, “Computer-
aided detection and diagnosis of breast cancer with mammography:
Recent advances,” IEEE Transactions on Information Technology in
Biomedicine, vol. 13, no. 2, pp. 236–251, 2009.
[4] A. Y. Lee, D. J. Wisner, S. Aminololama-Shakeri, V. A. Arasu,
S. A. Feig, J. Hargreaves, H. Ojeda-Fournier, L. W. Bassett, C. J.
Wells, J. De Guzman, C. I. Flowers, J. E. Campbell, S. L. Elson,
H. Retallack, and B. N. Joe, “Inter-reader variability in the use of bi-
rads descriptors for suspicious findings on diagnostic mammography,”
Academic Radiology, vol. 24, no. 1, p. 60–66, Jan. 2017. [Online].
Available: http://dx.doi.org/10.1016/j.acra.2016.09.010
[5] C. D. Lehman, R. F. Arao, B. L. Sprague, J. M. Lee, D. S. M. Buist,
K. Kerlikowske, L. M. Henderson, T. L. Onega, A. N. Tosteson, G. H.
Rauscher, and D. L. Miglioretti, “National performance benchmarks for
modern screening digital mammography: Update from the breast cancer
surveillance consortium.” Radiology, vol. 283 1, pp. 49–58, 2017.
[Online]. Available: https://api.semanticscholar.org/CorpusID:4786906
[6] Y. Almalki, T. Soomro, M. Irfan, S. Alduraibi, and A. Ali, “Comput-
erized analysis of mammogram images for early detection of breast
cancer,” Healthcare, vol. 10, p. 801, 04 2022.
[7] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
pp. 436–44, 05 2015.
[8] M. Prodan, E. Paraschiv, and A. Stanciu, “Applying deep learning meth-
ods for mammography analysis and breast cancer detection,” Applied
Sciences, vol. 13, p. 4272, 03 2023.
[9] R. Warren, S. Duffy, and S. Alija, “The value of the second view in
screening mammography,” The British journal of radiology, vol. 69, pp.
105–8, 02 1996.
[10] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks
for biomedical image segmentation,” CoRR, vol. abs/1505.04597, 2015.
[Online]. Available: http://arxiv.org/abs/1505.04597
[11] A. Ibragimov, S. Senotrusova, K. Markova, E. Karpulevich, A. Ivanov,
E. Tyshchuk, P. Grebenkina, O. Stepanova, A. Sirotskaya, A. Kovaleva
et al., “Deep semantic segmentation of angiogenesis images,” Interna-
tional Journal of Molecular Sciences, vol. 24, no. 2, p. 1102, 2023.
[12] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al.,
“An image is worth 16x16 words: Transformers for image recognition
at scale,” arXiv preprint arXiv:2010.11929, 2020.
[13] S. Asgari Taghanaki, K. Abhishek, J. P. Cohen, J. Cohen-Adad, and
G. Hamarneh, “Deep semantic segmentation of natural and medical
images: a review,” Artificial Intelligence Review, vol. 54, no. 1, pp. 137–
178, 2021.
[14] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss
for dense object detection,” in Proceedings of the IEEE international
conference on computer vision, 2017, pp. 2980–2988.
[15] D. Abdelhafiz, C. Yang, R. Ammar, and S. Nabavi, “Deep
convolutional neural networks for mammography: advances, challenges
and applications,” BMC Bioinformatics, vol. 20, no. 11, p. 281, Jun
2019. [Online]. Available: https://doi.org/10.1186/s12859-019-2823-4
[16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large
scale visual recognition challenge,” International journal of computer
vision, vol. 115, pp. 211–252, 2015.
[17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training
of deep bidirectional transformers for language understanding,” arXiv
preprint arXiv:1810.04805, 2018.
[18] H. T. Nguyen, H. Q. Nguyen, H. H. Pham, K. Lam, L. T. Le, M. Dao, and
V. Vu, “VinDr-Mammo: A large-scale benchmark dataset for computer-
aided diagnosis in full-field digital mammography,” Sci. Data, vol. 10,
no. 1, p. 277, May 2023.
[19] L. Shen, L. R. Margolies, J. H. Rothstein, E. Fluder, R. McBride, and
W. Sieh, “Deep learning to improve breast cancer detection on screening
mammography,” Sci. Rep., vol. 9, no. 1, p. 12495, Aug. 2019.
[20] E. Kegeles, A. Naumov, E. A. Karpulevich, P. Volchkov, and P. Baranov,
“Convolutional neural networks can predict retinal differentiation in
retinal organoids,” Front. Cell. Neurosci., vol. 14, p. 171, Jul. 2020.
[21] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for
convolutional neural networks,” CoRR, vol. abs/1905.11946, 2019.
[Online]. Available: http://arxiv.org/abs/1905.11946
[22] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo,
“Swin transformer: Hierarchical vision transformer using shifted
windows,” CoRR, vol. abs/2103.14030, 2021. [Online]. Available:
https://arxiv.org/abs/2103.14030
[23] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao,
Z. Zhang, L. Dong, F. Wei, and B. Guo, “Swin transformer V2:
scaling up capacity and resolution,” CoRR, vol. abs/2111.09883, 2021.
[Online]. Available: https://arxiv.org/abs/2111.09883
[24] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo,
“Segformer: Simple and efficient design for semantic segmentation with
transformers,” CoRR, vol. abs/2105.15203, 2021. [Online]. Available:
https://arxiv.org/abs/2105.15203
[25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[26] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang,
Y. Zhu, R. Pang, V. Vasudevan et al., “Searching for mobilenetv3,”
in Proceedings of the IEEE/CVF international conference on computer
vision, 2019, pp. 1314–1324.
[27] R. S. Lee, F. Gimenez, A. Hoogi, K. K. Miyake, M. Gorovoy, and D. L.
Rubin, “A curated mammography data set for use in computer-aided
detection and diagnosis research,” Sci. Data, vol. 4, no. 1, p. 170177,
Dec. 2017.
[28] I. C. Moreira, I. Amaral, I. Domingues, A. Cardoso, M. J. Cardoso,
and J. S. Cardoso, “INbreast: toward a full-field digital mammographic
database,” Acad. Radiol., vol. 19, no. 2, pp. 236–248, Feb. 2012.
[29] A. S. Alsolami, W. Shalash, W. Alsaggaf, S. Ashoor, H. Refaat, and
M. Elmogy, “King abdulaziz university breast cancer mammogram
dataset (KAU-BCMD),” Data (Basel), vol. 6, no. 11, p. 111, Oct. 2021.
[30] J. Suckling, J. Parker, S. Astley, I. W. Hutt, C. R. M. Boggis,
I. W. Ricketts, E. A. Stamatakis, N. Cerneaz, S. Kok, P. Taylor,
D. Betal, and J. Savage, “The mammographic image analysis
society digital mammogram database,” 1994. [Online]. Available:
https://api.semanticscholar.org/CorpusID:56649461
[31] C. Cui, L. Li, H. Cai, Z. Fan, L. Zhang, T. Dan, J. Li, and J. Wang, “The
chinese mammography database (CMMD): An online mammography
database with biopsy confirmed types for machine diagnosis of breast,”
2021.
[32] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian,
“Grad-cam++: Generalized gradient-based visual explanations for
deep convolutional networks,” in 2018 IEEE Winter Conference on
Applications of Computer Vision (WACV). IEEE, Mar. 2018. [Online].
Available: http://dx.doi.org/10.1109/WACV.2018.00097
[33] T. T. Truong, H. T. Nguyen, T. B. Lam, D. V. Nguyen, and P. H.
Nguyen, “Delving into ipsilateral mammogram assessment under multi-
view network,” in International Workshop on Machine Learning in
Medical Imaging. Springer, 2023, pp. 367–376.
[34] Z. Yang, Z. Cao, Y. Zhang, Y. Tang, X. Lin, R. Ouyang, M. Wu,
M. Han, J. Xiao, L. Huang et al., “Momminet-v2: Mammographic multi-
view mass identification networks,” Medical Image Analysis, vol. 73, p.
102204, 2021.
[35] Y. Chen, H. Wang, C. Wang, Y. Tian, F. Liu, Y. Liu, M. Elliott, D. J.
McCarthy, H. Frazer, and G. Carneiro, “Multi-view local co-occurrence
and global consistency learning improve mammogram classification gen-
eralisation,” in International Conference on Medical Image Computing
and Computer-Assisted Intervention. Springer, 2022, pp. 3–13.
[36] H. Wang, J. Feng, Z. Zhang, H. Su, L. Cui, H. He, and L. Liu, “Breast
mass classification via deeply integrating the contextual information
from multi-view data,” Pattern Recognition, vol. 80, pp. 42–52, 2018.
[37] G. Carneiro, J. Nascimento, and A. P. Bradley, “Automated analysis
of unregistered multi-view mammograms with deep learning,” IEEE
transactions on medical imaging, vol. 36, no. 11, pp. 2355–2365, 2017.
[38] Y. Li, H. Chen, L. Cao, and J. Ma, “A survey of computer-aided detection
of breast cancer with mammography,” J Health Med Inf, vol. 4, no. 7,
pp. 1–6, 2016.
[39] H. T. Nguyen, S. B. Tran, D. B. Nguyen, H. H. Pham, and H. Q.
Nguyen, “A novel multi-view deep learning approach for bi-rads and
density assessment of mammograms,” in 2022 44th Annual International
Conference of the IEEE Engineering in Medicine & Biology Society
(EMBC). IEEE, 2022, pp. 2144–2148.
[40] H. N. Khan, A. R. Shahid, B. Raza, A. H. Dar, and H. Alquhayz,
“Multi-view feature fusion based four views model for mammogram
classification using convolutional neural network,” IEEE Access, vol. 7,
pp. 165 724–165 733, 2019.
[41] K. J. Geras, S. Wolfson, Y. Shen, N. Wu, S. Kim, E. Kim, L. Heacock,
U. Parikh, L. Moy, and K. Cho, “High-resolution breast cancer screening
with multi-view deep convolutional neural networks,” arXiv preprint
arXiv:1703.07047, 2017.
[42] V. Kumar, C. Prabha, P. Sharma, N. Mittal, S. Askar, and
M. Abouhawwash, “Unified deep learning models for enhanced lung
cancer prediction with resnet-50–101 and efficientnet-b3 using dicom
images,” BMC Medical Imaging, vol. 24, 03 2024.
[43] H. Ueyama, Y. Kato, Y. Akazawa, N. Yatagai, H. Komori, T. Takeda,
K. Matsumoto, K. Ueda, K. Matsumoto, M. Hojo, T. Yao, A. Nagahara,
and T. Tada, “Application of artificial intelligence using a convolutional
neural network for diagnosis of early gastric cancer based on magnifying
endoscopy with narrow-band imaging,” Journal of Gastroenterology and
Hepatology, vol. 36, 07 2020.
[44] M. Monteiro, V. Newcombe, F. Mathieu, K. Adatia, K. Kamnitsas,
E. Ferrante, T. Das, D. Whitehouse, D. Rueckert, D. Menon, and
B. Glocker, “Multiclass semantic segmentation and quantification of
traumatic brain injury lesions on head ct using deep learning: an
algorithm development and multicentre validation study,” The Lancet
Digital Health, vol. 2, 05 2020.
[45] L. Poonamallee and S. Joshi, “Automated detection of intracranial
hemorrhage from head ct scans applying deep learning techniques
in traumatic brain injuries: A comparative review,” Indian Journal of
Neurotrauma, vol. 20, 07 2023.