Adaptation of a deep learning malignancy model from full-field digital mammography to digital breast tomosynthesis

Sadanand Singh, Thomas Paul Matthews, Meet Shah, Brent Mombourquette, Trevor
Tsue, Aaron Long, Ranya Almohsen, Stefano Pedemonte, and Jason Su
Whiterabbit AI, Inc., Santa Clara, CA, USA
Mammography-based screening has helped reduce the breast cancer mortality rate, but has also been associated
with potential harms due to low specificity, leading to unnecessary exams or procedures, and low sensitivity.
Digital breast tomosynthesis (DBT) improves on conventional mammography by increasing both sensitivity and
specificity and is becoming common in clinical settings. However, deep learning (DL) models have been developed
mainly on conventional 2D full-field digital mammography (FFDM) or scanned film images. Due to a lack of large
annotated DBT datasets, it is difficult to train a model on DBT from scratch. In this work, we present methods
to generalize a model trained on FFDM images to DBT images. In particular, we use average histogram matching
(HM) and DL fine-tuning methods to generalize an FFDM model to the 2D maximum intensity projection (MIP)
of DBT images. In the proposed approach, the differences between the FFDM and DBT domains are reduced
via HM and then the base model, which was trained on abundant FFDM images, is fine-tuned. When evaluating
on image patches extracted around identified findings, we are able to achieve similar areas under the receiver
operating characteristic curve (ROC AUC) of 0.9 for FFDM and 0.85 for MIP images, as compared to a
ROC AUC of 0.75 when tested directly on MIP images.
Keywords: Mammography, Tomosynthesis, Deep Learning, Domain Adaptation, Transfer Learning
Breast cancer is the most commonly diagnosed cancer and the second leading cancer-related cause of death
among women in the United States.1 Although mammography-based screening has been shown to reduce breast
cancer mortality,2 it has also been associated with physical and psychological harms caused by false positives and
unnecessary biopsies.3–5 To address these concerns, many clinics have started to switch their screening programs
from 2D full-field digital mammography (FFDM) to 3D digital breast tomosynthesis (DBT),6 which has been
shown to increase the sensitivity of breast cancer screening7,8 and reduce false positives.9
Deep learning (DL) using convolutional neural networks has previously been used to aid in the evaluation of
screening mammography to enhance the specificity of malignancy prediction, particularly for FFDM exams.10,11
Several challenges exist, however, in translating these successes to DBT exams. First, the performance of DL
models generally scales with the availability of labeled data, but because DBT has only recently been adopted,
most large-scale mammography datasets consist mainly of FFDM exams. Second, the 3D volumes of DBT exams
can be quite large (e.g., 2457×1996×70 pixels). This can lead to computational difficulties as well as training
issues related to the curse of dimensionality, which are exacerbated by the low prevalence of cancer training
samples and the often small finding sizes associated with early detection.
This study focuses on methods to adapt DL malignancy models originally developed for FFDM exams to DBT
exams in the case where the amount of available DBT data is quite limited. In order to overcome the large size
of 3D DBT images, we instead consider the maximum intensity projection (MIP) of these 3D volumes. Several
methods of adapting a model trained on patches of FFDM images to patches of MIP images are evaluated and
compared. The impact of histogram matching on reducing domain shift and simplifying the adaptation problem
is also considered.
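The projection step can be illustrated with a minimal sketch. It assumes the DBT volume is stored as a NumPy array with axes ordered (rows, columns, slices), matching the 2457×1996×70 example above; the function name and the toy volume are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def max_intensity_projection(volume: np.ndarray, slice_axis: int = -1) -> np.ndarray:
    """Collapse a 3D volume to 2D by keeping, per pixel, the maximum over all slices."""
    if volume.ndim != 3:
        raise ValueError("expected a 3D volume, e.g. (rows, cols, slices)")
    return volume.max(axis=slice_axis)


# Toy volume standing in for a DBT exam (real volumes are far larger).
rng = np.random.default_rng(0)
volume = rng.integers(0, 4096, size=(8, 6, 5), dtype=np.uint16)
mip = max_intensity_projection(volume)
```

The projection reduces the input to a single 2D image per view, so a 2D patch model can be reused without modification, at the cost of discarding depth information.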
arXiv:2001.08381v1 [cs.CV] 23 Jan 2020
2.1 Dataset
The data was collected from a large academic medical center located in the mid-western region of the United
States between 2008 and 2017. This study was approved by the internal institutional review board of the
university from which the data was collected. Informed consent was waived for this retrospective study. The
data consists of a large set of FFDM exams and a smaller set of DBT exams. The exams were interpreted
by one of 11 radiologists with breast imaging experience ranging from 2 to 30 years. Radiologist assessments
and pathology outcomes were extracted from the mammography reporting software of the site (Magview v7.1,
Magview, Burtonsville, Maryland).
Patients were randomly split into training (‘train’), validation (‘val’) and testing (‘test’) sets with an 80:10:10
ratio. Since the split was performed at the patient level, no patient had images in more than one of the above
sets. This split was shared by both the FFDM and DBT datasets. All training and hyperparameter searches
were performed on the training and validation sets. Performance on the test set was evaluated only once all
model selection, training, and fine-tuning had been carried out.
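A patient-level split of this kind might be sketched as follows. The helper name, the seed, and the toy identifiers are assumptions made for illustration; only the 80:10:10 ratio and the rule that a patient belongs to exactly one set come from the text.

```python
import random


def split_patients(patient_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign every unique patient to exactly one of 'train', 'val' or 'test'."""
    unique_ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique_ids)
    n_train = round(ratios[0] * len(unique_ids))
    n_val = round(ratios[1] * len(unique_ids))
    assignment = {}
    for idx, pid in enumerate(unique_ids):
        if idx < n_train:
            assignment[pid] = "train"
        elif idx < n_train + n_val:
            assignment[pid] = "val"
        else:
            assignment[pid] = "test"
    return assignment


# Toy data: 100 patients with 4 images each, stored as (patient_id, image_id).
images = [(f"P{p:03d}", f"P{p:03d}_img{k}") for p in range(100) for k in range(4)]
assignment = split_patients([pid for pid, _ in images])
# Every image inherits the split of its patient, so no patient spans two sets.
image_split = {image_id: assignment[pid] for pid, image_id in images}
```

Because the assignment is keyed by patient, the same mapping can be applied to both the FFDM and the DBT images of a patient, which is what sharing the split across datasets requires.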
Images were categorized as one of four classes: (1) normal, no notable findings were identified by the
radiologist; (2) benign, all notable findings were determined to be benign by the radiologist or by biopsy;
(3) high-risk, a biopsy determined a finding to contain tissue types likely to develop into cancer; and
(4) malignant, a biopsy determined a finding to contain malignant tissue types. We combine the normal and
benign labels into the negative class and the high-risk and malignant labels into the positive class. All
malignant and high-risk images had exactly one radiologist annotation, indicating the location of the biopsied
finding. These annotations were made during the course of standard clinical care for the patient. Benign
samples have zero or one annotation.
A detailed distribution of the data across the different classes can be found in Table 1.
Table 1: Detailed statistics of the collected FFDM and DBT training (train), validation (val), and testing (test)
datasets. The numbers of patients, exams, and images for each set are given, as well as the distribution of
malignancy by image.

FFDM
            Train             Val              Test
Patients    49965             6239             6213
Exams       158650            19933            19618
Images      664234            83920            82296
Normal      606080 (91.3%)    76092 (90.7%)    75073 (91.2%)
Benign      56660 (8.5%)      7631 (9.1%)      7023 (8.5%)
High Risk   404 (0.1%)        41 (0.1%)        42 (0.1%)
Malignant   1090 (0.2%)       156 (0.2%)       158 (0.2%)

DBT
            Train             Val              Test
Patients    10684             1399             1357
Exams       14828             1944             1855
Images      54380             7140             6791
Normal      48006 (88.3%)     6171 (86.4%)     6058 (89.2%)
Benign      6175 (11.4%)      939 (13.2%)      689 (10.2%)
High Risk   86 (0.2%)         13 (0.2%)        15 (0.2%)
Malignant   113 (0.2%)        17 (0.2%)        29 (0.4%)
2.2 Patch Model
The DL model is based on the ResNet architecture12 and has 29 layers and approximately 6 million parameters.
It accepts a 512×512 image patch from an FFDM or MIP image and predicts the probability that the patch
contains a malignant or high-risk finding.
The original images had 4096×3328 or 3328×2560 pixels for FFDM images or 2457×1996 or 2457×1890
pixels for the MIP images. Example FFDM and MIP images can be seen in Figure 1. To obtain the input
to the model, an initial patch of 1024×1024 pixels is extracted from the image and downsampled by bilinear
interpolation to 512×512 pixels, yielding a patch at half the resolution of the original image. The resulting patch
covers 7.7-12.3% of the area of the original image.
For samples with annotations indicating the finding location, patches are centered at the center of the
annotations. For samples without annotations, the breast is segmented using a pre-chosen threshold and a patch
is selected centered at a randomly chosen pixel within the breast. Patches are sampled such that they are always
fully contained within the image and may be translated to satisfy this criterion.
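The patch extraction described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function name and return values are hypothetical, and the 2× downsampling is done with a 2×2 block mean, which coincides with bilinear interpolation only under the assumption of an exact factor of two with half-pixel-centered sampling.

```python
import numpy as np


def extract_patch(image, center_rc, patch_size=1024, out_size=512):
    """Cut a square patch around center_rc, shifted if needed to stay inside the image,
    then downsample it to out_size x out_size by averaging non-overlapping blocks."""
    rows, cols = image.shape
    if rows < patch_size or cols < patch_size:
        raise ValueError("image is smaller than the requested patch")
    if patch_size % out_size != 0:
        raise ValueError("patch_size must be an integer multiple of out_size")
    half = patch_size // 2
    # Translate the window so that it is fully contained within the image.
    top = min(max(center_rc[0] - half, 0), rows - patch_size)
    left = min(max(center_rc[1] - half, 0), cols - patch_size)
    patch = image[top:top + patch_size, left:left + patch_size].astype(np.float32)
    factor = patch_size // out_size
    patch = patch.reshape(out_size, factor, out_size, factor).mean(axis=(1, 3))
    return patch, (top, left)


# Toy example: an 8x8 window around a point near the top-right corner of a
# 20x16 image; the window is shifted inward before being reduced to 4x4.
toy_image = np.arange(20 * 16).reshape(20, 16)
toy_patch, toy_origin = extract_patch(toy_image, center_rc=(1, 15), patch_size=8, out_size=4)
```

With the default arguments the same function takes a 1024×1024 window and returns a 512×512 patch, as in the text.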
2.3 FFDM Training
The base model is trained on the FFDM data, with a uniform sampling of the two classes (equal probability of
sampling a positive or negative class sample). Images were augmented during training with random horizontal
and vertical flipping, additive Gaussian white noise with a standard deviation of 1.0, random translation drawn
from a Gaussian distribution with a standard deviation of 20 pixels, and random rotation drawn from a
uniform distribution from -30 to +30 degrees.
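The sampling and augmentation scheme can be sketched as below. All function names are hypothetical, and the flip probability of 0.5 is an assumption (the text only says the flips are random). The sketch draws the translation and rotation parameters but applies only the flips and the noise, since the other two require resampling the image.

```python
import numpy as np


def sample_balanced_index(labels, rng):
    """Pick the positive or negative class with equal probability,
    then draw one sample uniformly from the chosen class."""
    labels = np.asarray(labels)
    target = int(rng.integers(0, 2))
    candidates = np.flatnonzero(labels == target)
    return int(rng.choice(candidates))


def sample_augmentation(rng):
    """Draw one set of augmentation parameters using the distributions in the text."""
    return {
        "flip_horizontal": bool(rng.integers(0, 2)),  # assumed probability 0.5
        "flip_vertical": bool(rng.integers(0, 2)),    # assumed probability 0.5
        "shift_rc": rng.normal(0.0, 20.0, size=2),    # translation in pixels
        "rotation_deg": float(rng.uniform(-30.0, 30.0)),
        "noise_std": 1.0,
    }


def apply_flips_and_noise(patch, params, rng):
    """Apply the flips and additive Gaussian white noise to a 2D patch."""
    out = patch.astype(np.float32)
    if params["flip_horizontal"]:
        out = out[:, ::-1]
    if params["flip_vertical"]:
        out = out[::-1, :]
    noise = rng.normal(0.0, params["noise_std"], size=out.shape)
    return out + noise.astype(np.float32)


rng = np.random.default_rng(0)
labels = np.array([0] * 990 + [1] * 10)   # heavily imbalanced toy labels
index = sample_balanced_index(labels, rng)
params = sample_augmentation(rng)
augmented = apply_flips_and_noise(np.zeros((4, 6)), params, rng)
```

Sampling the class first means positives are seen about half the time even though they make up well under 1% of the images in Table 1.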
The model is trained to minimize a cross entropy loss function using the Adam optimizer13 with an initial
learning rate of $5 \times 10^{-5}$ and a weight decay of $5 \times 10^{-4}$. An epoch is defined as 40000
samples shown to the model. The model was trained for 100 epochs, and the model chosen for evaluation is the
one that maximized the area under the receiver operating characteristic curve (ROC AUC) on the validation set.
2.4 Domain Adaptation
Figure 2: Average cumulative histograms for FFDM
and MIP images. The intensity values have been scaled
so that they range from 0 to 4095 for both image types.
In domain adaptation, a model trained on one domain is adapted to another domain for which there exists far
less data. Previous work has shown that deep neural networks often learn task- and domain-agnostic features,
particularly in the earlier layers of the network.14 When the domains and tasks are similar, larger portions of
the network may be reused.
Here, we explore the use of histogram matching to reduce the domain shift between the FFDM and DBT
domains. The FFDM patch model is adapted to DBT exams, both with and without histogram matching,
using two different fine-tuning methods.
2.4.1 Histogram Matching
Histogram matching (HM) applies a non-linear transformation that maps the cumulative histogram (c.d.f.) of
one domain onto the average c.d.f. of another domain.15 In particular, HM is employed to transform MIP images
to better match the FFDM images originally used to train the model.
Figure 1: Sample images from each domain for different malignancy classes - Normal, Benign and Malignant.
The red box indicates the location of a radiologist-annotated finding.
Algorithm 1: Histogram matching
The algorithm describes the procedure of histogram matching images from two domains. Here, $X[i]$ represents
the $i$-th pixel value of the image $X$. The inverse mapping to a pixel value in the reference domain is
performed by linear interpolation.
Input: Source image $X_S \in [0, K-1]^N$,
       Source c.d.f. $F_S$ (defined over the $K$ source intensity levels),
       Reference c.d.f. $F_R$ (defined over the $L$ reference intensity levels)
Output: Histogram-matched image $X'_S \in [0, L-1]^N$
for $i$ in 0 to $N-1$ do
    $p_S = X_S[i]$
    $p'_S = F_R^{-1}(F_S(p_S))$
    $X'_S[i] = p'_S$
return $X'_S$
The procedure for HM is outlined in Algorithm 1 and given in greater detail as follows. Let $F_S$ be the c.d.f.
of the source image, whose intensity distribution is to be updated, and let $F_R$ be the c.d.f. of the reference
domain, whose intensity distribution we hope to match. Let $p_S \in [0, K-1]$ be a pixel value for the source
image and $p_R \in [0, L-1]$ be a pixel value in the reference domain such that $F_S(p_S) = F_R(p_R)$. Then, our
transformed image will have the pixel value $p'_S = F_R^{-1}(F_S(p_S))$, where the inverse mapping is calculated
via linear interpolation.
The average c.d.f. of the FFDM data was calculated over 1200 randomly chosen training samples, comprised
of equal amounts of the normal, benign and malignant classes. Similarly, the average c.d.f. of the MIP data was
calculated over 600 randomly chosen training samples, comprised of equal amounts of the normal, benign and
malignant classes. The histograms for the FFDM and MIP images can be seen in Figure 2. The application of
histogram matching can be qualitatively visualized in Figure 1.
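The mapping $p'_S = F_R^{-1}(F_S(p_S))$ can be sketched in a few lines of NumPy. The function names are hypothetical; the sketch assumes integer-valued images, normalizes each cumulative histogram to end at 1 before averaging, and uses `np.interp` for the linearly interpolated inverse.

```python
import numpy as np


def average_cdf(images, n_levels):
    """Average the normalized cumulative histograms of a set of integer images."""
    cdfs = []
    for img in images:
        hist = np.bincount(img.ravel(), minlength=n_levels).astype(np.float64)
        cdfs.append(np.cumsum(hist) / hist.sum())
    return np.mean(cdfs, axis=0)


def match_histogram(source, cdf_source, cdf_reference):
    """Map each source pixel p to F_R^{-1}(F_S(p)) through a lookup table."""
    reference_levels = np.arange(len(cdf_reference))
    # F_R is non-decreasing, so interpolating the levels against F_R inverts it.
    # Flat stretches of F_R (unused intensities) make the inverse non-unique there.
    lut = np.interp(cdf_source, cdf_reference, reference_levels)
    return lut[source]


# Toy domains: a 4-level "source" image and an 8-level "reference" distribution.
toy_source = np.array([[0, 1], [2, 3]])
toy_cdf_source = average_cdf([toy_source], n_levels=4)
toy_cdf_reference = np.arange(1, 9) / 8.0   # uniform over 8 levels
toy_matched = match_histogram(toy_source, toy_cdf_source, toy_cdf_reference)
```

Since both average c.d.f.s are fixed after being estimated once, the lookup table is the same for every image and HM adds no trainable parameters.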
2.4.2 Fine-tuning
Two methods were used to fine-tune the base model trained on FFDM images for use with the original or
histogram-matched MIP images. For the first approach, only the last fully connected layer of the model was
re-trained. This is referred to as the conventional fine-tuning approach. For the second approach, a version of the
SpotTune algorithm was implemented.16 SpotTune is an adaptive fine-tuning approach that finds the optimal
fine-tuning policy (which layers to fine-tune) per instance of target data.
The underlying idea behind SpotTune is that different training samples from the target domain require
fine-tuning updates to different sets of layers in the pre-trained network. The SpotTune training procedure
involves predicting, for each training input, the specific layers to be fine-tuned and layers to be kept frozen.
This input-dependent fine-tuning approach enables targeting layers per input instance and leads to better
accuracy.16 We refer readers to the original paper16 for further details of SpotTune.
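The per-input routing idea can be illustrated with a toy sketch. It is not the SpotTune implementation: each residual block is replaced by a small linear map with a ReLU, and the binary policy is passed in by hand, whereas in SpotTune it is predicted for each input by a separate policy network. All names are hypothetical.

```python
import numpy as np


def residual_branch(x, weight):
    """Toy stand-in for the residual branch of a ResNet block: linear map plus ReLU."""
    return np.maximum(x @ weight, 0.0)


def routed_forward(x, frozen_weights, tuned_weights, policy):
    """Pass one input through a stack of residual blocks. For each block, the policy
    selects either the frozen pre-trained weights (0) or the fine-tuned copy (1)."""
    for w_frozen, w_tuned, use_tuned in zip(frozen_weights, tuned_weights, policy):
        weight = w_tuned if use_tuned else w_frozen
        x = x + residual_branch(x, weight)
    return x


rng = np.random.default_rng(0)
n_blocks, dim = 4, 8
frozen = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(n_blocks)]
tuned = [w + rng.normal(scale=0.01, size=(dim, dim)) for w in frozen]
x = rng.normal(size=dim)

# Hypothetical policy for this input: reuse the first two blocks, fine-tune the last two.
y = routed_forward(x, frozen, tuned, policy=[0, 0, 1, 1])

# Averaging policies over many inputs gives a per-block fine-tuning frequency.
policies = np.array([[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]])
finetune_frequency = policies.mean(axis=0)
```

Conventional last-layer fine-tuning corresponds to a fixed policy that reuses every block, which is why it cannot help when many blocks need to change.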
The fine-tuned model used the same data augmentations as the original FFDM model. The model is fine-tuned
using a cross-entropy loss and the Adam optimizer with a learning rate of $5 \times 10^{-5}$ and a weight decay
of $1 \times 10^{-4}$. The model chosen is the one that maximized the validation ROC AUC.
The performance of all models is measured on the test datasets using the area under the receiver operating
characteristic curve (ROC AUC). Test patches are extracted in the same way as described in Section 2.2. Since
patch locations for samples without annotations are chosen randomly, we average the results over three random
seeds and use the standard deviation of the results as an error estimate. A summary of all the results can be
found in Table 2.
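The evaluation protocol can be sketched as follows. The ROC AUC is computed here from its rank interpretation (the probability that a positive sample scores above a negative one, with ties counted as one half). The function names and the three example values are purely illustrative, and whether the reported errors use the population or the sample standard deviation is not stated, so the population form is an assumption.

```python
import numpy as np


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=np.float64)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("both classes must be present")
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def summarize_over_seeds(aucs):
    """Mean and (population) standard deviation over repeated patch extractions."""
    aucs = np.asarray(aucs, dtype=np.float64)
    return float(aucs.mean()), float(aucs.std())


example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# Illustrative values only, one per random seed; not results from the paper.
mean_auc, std_auc = summarize_over_seeds([0.80, 0.90, 1.00])
```

The pairwise formulation is quadratic in memory, which is acceptable for test sets of the size in Table 1 but would need a rank-based variant for much larger ones.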
Table 2: Performance of the models for different domains. Results are shown on a test set for which both FFDM
and DBT/MIP images are available. MIP with HM refers to MIP images pre-processed to look more like FFDM
images. Errors shown here refer to the standard deviation over 3 independent realizations of patch extraction.

Training Data   Testing Data   Procedure            ROC AUC
FFDM            FFDM           Train from scratch   0.909 ± 0.001
FFDM            MIP            Test only            0.751 ± 0.001
FFDM            MIP            Fine-tune            0.759 ± 0.003
FFDM            MIP            SpotTune16           0.825 ± 0.002
FFDM            MIP with HM    Test only            0.847 ± 0.001
FFDM            MIP with HM    Fine-tune            0.837 ± 0.001
FFDM            MIP with HM    SpotTune16           0.830 ± 0.002
The base FFDM patch model has a ROC AUC of 0.909 on the FFDM images. For MIP images, the
performance of the base model drops to a ROC AUC of 0.751. If the MIP images are pre-processed using
the average histogram matching method, the ROC AUC goes up to 0.847. This shows that our simple fixed
non-linear transformation via histogram matching reduces the domain shift considerably.
To improve performance further, we apply conventional fine-tuning and SpotTune,16 both with and without HM.
Conventional fine-tuning on the limited number of original MIP images yields almost no improvement (ROC AUC
of 0.759 versus 0.751 for the base model), whereas SpotTune reaches a ROC AUC of 0.825. With HM, conventional
fine-tuning reaches a ROC AUC of 0.837, compared with 0.759 without HM, while SpotTune changes only slightly,
from 0.825 to 0.830. Neither method improves on the histogram-matched test-only result: overall, the simple
strategy of only pre-processing via histogram matching leads to the best ROC AUC of 0.847 on MIP images.
Figure 3: Visualization of SpotTune policies to re-use
or fine-tune a residual block for MIP images with and
without HM.
SpotTune learns a policy per sample for selecting a
ResNet block for fine-tuning. For two cases, with and
without HM, the probability of fine-tuning different
ResNet blocks can be seen in Figure 3. With HM, the
relative gap in performance of SpotTune vs. Test-only
is small, perhaps due to the limited number of ResNet
blocks that are modified by the SpotTune algorithm.
Without HM, the gap is large and a significant portion
of the blocks are updated.
This observation also offers insight into the effectiveness of conventional fine-tuning with and without
HM. Without HM, fine-tuning the last layer of the network is unable to improve performance, as more
extensive changes to the network are needed, as indicated by the large number of ResNet blocks changed by
the SpotTune algorithm. With HM, the performance of SpotTune and conventional fine-tuning is more similar,
as only the late layers need to be extensively modified.
We present our work on the adaptation of a patch-level deep learning malignancy model from FFDM to
DBT exams. The original model was trained and evaluated on a large set of FFDM images, and the model
was adapted using methods requiring few DBT exams. In particular, by matching the intensity histograms of
the MIP images to those of the FFDM images, we can achieve good classification performance without
additional training.
A prior study17 considered the use of transfer learning for malignancy classification for FFDM and DBT
exams. However, that study examined transfer learning for a model that was not initially trained on medical
images. It also mainly focused on malignant and benign patches (4× smaller than our patches) and employed
a much smaller dataset. A very recent study18 considered domain adaptation from FFDM to DBT images, but
relied on a large multi-site dataset of DBT images. It is unclear how effective that approach would be if the
DBT dataset were smaller.
A deep learning malignancy model was trained to identify high risk or malignant findings in patches of full-field
digital mammography (FFDM) images. Several strategies were evaluated to adapt this model to the maximum
intensity projections (MIP) of digital breast tomosynthesis (DBT) exams. The effectiveness of each domain
adaptation approach depended strongly on the amount of domain shift and the amount of available training
data. Without HM, the amount of domain shift was large and SpotTune was the most effective even with limited
training data. However, following HM, the domain shift was much smaller and the simplest approach proved best.
Histogram matching can, therefore, be an effective strategy for domain adaptation when the amount of available
training data in the target domain is limited. This approach is simple and intuitive and can be easily adapted
to other problems in medical imaging where obtaining a large amount of labeled data for every image modality
is difficult.
In this study, we focus exclusively on domain adaptation for patch-level models. Thus, future work includes
using these adaptation techniques and patch models to train whole-image models, which provide a malignancy
probability for the entire image and, eventually, the entire examination. Another interesting approach is learning
the cyclic transformation via a CycleGAN.19 This generative model can learn how to transform a 3D DBT image
into a 2D image from the same distribution as the original FFDM images (and vice versa). By learning this
conditional distribution, we could synthesize 2D images that maintain the important, subtle features that may
have been lost by using maximum projection.
This work has not been submitted to any journal or conference for publication or presentation considerations.
This work was supported by Whiterabbit AI, Inc. The following authors are employed by and/or have a
financial interest in Whiterabbit: Sadanand Singh, Thomas Paul Matthews, Meet Shah, Brent Mombourquette,
Trevor Tsue, Aaron Long, Stefano Pedemonte, and Jason Su.
[1] Siegel, R. L., Miller, K. D., and Jemal, A., “Cancer statistics, 2019,” CA: A Cancer Journal for
Clinicians 69(1), 7–34 (2019).
[2] Tabar, L., Dean, P. B., Chen, T. H., Yen, A. M., Chen, S. L., Fann, J. C., Chiu, S. Y., Ku, M. M., Wu, W. Y.,
Hsu, C. Y., Chen, Y. C., Beckmann, K., Smith, R. A., and Duffy, S. W., “The incidence of fatal breast
cancer measures the increased effectiveness of therapy in women participating in mammography screening,”
Cancer 125(4), 515–523 (2019).
[3] Smith, R. A., Duffy, S. W., Gabe, R., Tabar, L., Yen, A. M., and Chen, T. H., “The randomized trials
of breast cancer screening: what have we learned?,” Radiologic Clinics of North America 42(5), 793–806
(2004).
[4] Tabár, L., Yen, A. M.-F., Wu, W. Y.-Y., Chen, S. L.-S., Chiu, S. Y.-H., Fann, J. C.-Y., Ku, M. M.-S., Smith,
R. A., Duffy, S. W., and Chen, T. H.-H., “Insights from the breast cancer screening trials: How screening
affects the natural history of breast cancer and implications for evaluating service screening programs,” The
Breast Journal 21(1), 13–20 (2015).
[5] Webb, M. L., Cady, B., Michaelson, J. S., Bush, D. M., Calvillo, K. Z., Kopans, D. B., and Smith, B. L., “A
failure analysis of invasive breast cancer: Most deaths from disease occur in women not regularly screened,”
Cancer 120(18), 2839–2846 (2014).
[6] Richman, I. B., Hoag, J. R., Xu, X., Forman, H. P., Hooley, R., Busch, S. H., and Gross, C. P., “Adoption
of Digital Breast Tomosynthesis in Clinical Practice,” JAMA Intern Med 179, 1292–1295 (2019).
[7] Hooley, R. J., Durand, M. A., and Philpotts, L. E., “Advances in digital breast tomosynthesis,” American
Journal of Roentgenology 208(2), 256–266 (2017).
[8] Lee, Y. Z., Puett, C., Inscoe, C. R., Jia, B., Kim, C., Walsh, R., Yoon, S., Kim, S. J., Kuzmiak, C. M.,
Zeng, D., Lu, J., and Zhou, O., “Initial clinical experience with stationary digital breast tomosynthesis,”
Academic Radiology 26, 1363–1373 (2019).
[9] Friedewald, S. M., Rafferty, E. A., Rose, S. L., Durand, M. A., Plecha, D. M., Greenberg, J. S., Hayes, M. K.,
Copit, D. S., Carlson, K. L., Cink, T. M., Barke, L. D., Greer, L. N., Miller, D. P., and Conant, E. F., “Breast
Cancer Screening Using Tomosynthesis in Combination with Digital Mammography,” JAMA 311(24), 2499–
2507 (2014).
[10] Zhu, W., Lou, Q., Vang, Y. S., and Xie, X., “Deep multi-instance networks with sparse label assignment
for whole mammogram classification,” in [Medical Image Computing and Computer Assisted Intervention -
MICCAI 2017], 603–611 (2017).
[11] Wu, N., Phang, J., Park, J., Shen, Y., Huang, Z., Zorin, M., Jastrzębski, S., Févry, T., Katsnelson, J.,
Kim, E., Wolfson, S., Parikh, U., Gaddam, S., Lin, L. L. Y., Ho, K., Weinstein, J. D., Reig, B., Gao, Y.,
Toth, H., Pysarenko, K., Lewin, A., Lee, J., Airola, K., Mema, E., Chung, S., Hwang, E., Samreen, N.,
Kim, S. G., Heacock, L., Moy, L., Cho, K., and Geras, K. J., “Deep neural networks improve radiologists’
performance in breast cancer screening,” IEEE Trans. Medical Imaging (2019). (pre-print).
[12] He, K., Zhang, X., Ren, S., and Sun, J., “Deep residual learning for image recognition,” in [2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR)], 770–778 (2016).
[13] Kingma, D. P. and Ba, J., “Adam: A Method for Stochastic Optimization,” in [International Conference
on Learning Representations (ICLR)], (2015).
[14] Yosinski, J., Clune, J., Bengio, Y., and Lipson, H., “How transferable are features in deep neural networks?,”
in [Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume
2], 3320–3328 (2014).
[15] Gonzalez, R., [Digital image processing], Pearson, London, 4 ed. (March 2017).
[16] Guo, Y., Shi, H., Kumar, A., Grauman, K., Rosing, T., and Feris, R., “Spottune: transfer learning through
adaptive fine-tuning,” in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition],
4805–4814 (2019).
[17] Mendel, K., Li, H., Sheth, D., and Giger, M., “Transfer learning from convolutional neural networks for
computer-aided diagnosis: A comparison of digital breast tomosynthesis and full-field digital mammography,”
Academic Radiology 26(6), 735–743 (2019).
[18] Lotter, W., Diab, A. R., Haslam, B., Kim, J. G., Grisot, G., Wu, E., Wu, K., Onieva, J. O., Boxerman,
J. L., Wang, M., Bandler, M., Vijayaraghavan, G., and Sorensen, A. G., “Robust breast cancer detection
in mammography and digital breast tomosynthesis using annotation-efficient deep learning approach,”
arXiv:1912.11027 [cs, eess] (Dec. 2019).
[19] Zhu, J., Park, T., Isola, P., and Efros, A. A., “Unpaired image-to-image translation using cycle-consistent
adversarial networks,” in [2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)],
2223–2232 (2017).
... To this end, we compare SpotTUnet with the best unlearnable layer choice strategies within the supervised DA setup and show it to be a reliable tool for domain shift analysis. While authors of [13] demonstrate SpotTune to perform worse than histogram matching preprocessing in the medical image classification task, we argue that histogram matching is a task-specific method and show its extremely poor segmentation quality in our task. Many approaches competitive to SpotTune have been developed recently, but their focus is more narrow: obtaining the best score on the Target domain rather than analyzing domain shift properties. ...
... Supervised DA methods On the rest of the 25 testing pairs, we compare 4 methods: fine-tuning of the first network layers, fine-tuning of the whole network from [12], histogram matching from [13], and SpotTUnet. We load a baseline model pretrained on the corresponding Source domain and then fine-tune it via one of the methods or preprocess the Target data in case of histogram matching. ...
Full-text available
Domain Adaptation (DA) methods are widely used in medical image segmentation tasks to tackle the problem of differently distributed train (source) and test (target) data. We consider the supervised DA task with a limited number of annotated samples from the target domain. It corresponds to one of the most relevant clinical setups: building a sufficiently accurate model on the minimum possible amount of annotated data. Existing methods mostly fine-tune specific layers of the pretrained Convolutional Neural Network (CNN). However, there is no consensus on which layers are better to fine-tune, e.g. the first layers for images with low-level domain shift or the deeper layers for images with high-level domain shift. To this end, we propose SpotTUnet – a CNN architecture that automatically chooses the layers which should be optimally fine-tuned. More specifically, on the target domain, our method additionally learns the policy that indicates whether a specific layer should be fine-tuned or reused from the pretrained network. We show that our method performs at the same level as the best of the non-flexible fine-tuning methods even under the extreme scarcity of annotated data. Secondly, we show that SpotTUnet policy provides a layer-wise visualization of the domain shift impact on the network, which could be further used to develop robust domain generalization methods. In order to extensively evaluate SpotTUnet performance, we use a publicly available dataset of brain MR images (CC359), characterized by explicit domain shift. We release a reproducible experimental pipeline (
... The ROI patches (dimensions: 512 × 512 ) were extracted from FFDM images, reconstructed 2D mammography images and DBT z-Stack slice images. The AUC metric was used to compare the performance of the model using In another work by Singh et al. (2020) , the authors proposed a method for adapting a deep learning model trained from FFDM images to be used for DBT images. In their study, FFDM and DBT images were labeled into four classes. ...
... The results showed that their ensemble model using the AlexNet model had the best performance with an AUC of 0.97. Matthews et al. (2020) proposed a CNN model to transfer FFDM trained domain to DBT s2D images. The proposed model in this study was pre-trained ResNet and a data set with 78445 s2D images was used. ...
Full-text available
The relatively recent reintroduction of deep learning has been a revolutionary force in the interpretation of diagnostic imaging studies. However, the technology used to acquire those images is undergoing a revolution itself at the very same time. Digital breast tomosynthesis (DBT) is one such technology, which has transformed the field of breast imaging. DBT, a form of three-dimensional mammography, is rapidly replacing the traditional two-dimensional mammograms. These parallel developments in both the acquisition and interpretation of breast images present a unique case study in how modern AI systems can be designed to adapt to new imaging methods. They also present a unique opportunity for co-development of both technologies that can better improve the validity of results and patient outcomes. In this review, we explore the ways in which deep learning can be best integrated into breast cancer screening workflows using DBT. We first explain the principles behind DBT itself and why it has become the gold standard in breast screening. We then survey the foundations of deep learning methods in diagnostic imaging, and review the current state of research into AI-based DBT interpretation. Finally, we present some of the limitations of integrating AI into clinical practice and the opportunities these present in this burgeoning field.
... When compared to stacked CNN, the addition of residual blocks in the network increases representation power, leads to faster convergence, and lowers training errors [82]. A work by Singh et al. [83] proposed a pre-trained model with FFDM images, which was utilized for DBT images. This study used two fine-tuning methods: (1) fine-tuning the last two layers and (2) fine-tuning only the optimal layers. ...
Full-text available
Breast cancer is now the most frequently diagnosed cancer in women, and its percentage is gradually increasing. Optimistically, there is a good chance of recovery from breast cancer if identified and treated at an early stage. Therefore, several researchers have established deep-learning-based automated methods for their efficiency and accuracy in predicting the growth of cancer cells utilizing medical imaging modalities. As of yet, few review studies on breast cancer diagnosis are available that summarize some existing studies. However, these studies were unable to address emerging architectures and modalities in breast cancer diagnosis. This review focuses on the evolving architectures of deep learning for breast cancer detection. In what follows, this survey presents existing deep-learning-based architectures, analyzes the strengths and limitations of the existing studies, examines the used datasets, and reviews image pre-processing techniques. Furthermore, a concrete review of diverse imaging modalities, performance metrics and results, challenges, and research directions for future researchers is presented.
... Wang et al. [27] used CNN to classify benign and malignant mammograms, and proved that the use of transfer learning on similar data can help improve the performance of the model. And the research of Matthews et al. [28] also proved this view. They used the full-field digital mammography database to train the proposed model, and generalized the model to the classification of digital breast tomosynthesis images through transfer learning. ...
Full-text available
Breast cancer is the most common cancer in women and poses a great threat to women's life and health. Mammography is an effective method for the diagnosis of breast cancer, but the results are largely limited by the clinical experience of radiologists. Therefore, the main purpose of this study is to perform two-stage classification (Normal/Abnormal and Benign/Malignancy) of two- view mammograms through convolutional neural network. In this study, we constructed a multi-view feature fusion network model for classification of mammograms from two views, and we proposed a multi-scale attention DenseNet as the backbone network for feature extraction. The model consists of two independent branches, which are used to extract the features of two mammograms from different views. Our work mainly focuses on the construction of multi-scale convolution module and attention module. The final experimental results show that the model has achieved good performance in both classification tasks. We used the DDSM database to evaluate the proposed method. The accuracy, sensitivity and AUC values of normal and abnormal mammograms classification were 94.92%, 96.52% and 94.72%, respectively. And the accuracy, sensitivity and AUC values of benign and malignant mammograms classification were 95.24%, 96.11% and 95.03%, respectively.
Full-text available
Microcalcification clusters (MCs) are among the most important biomarkers for breast cancer, especially in cases of nonpalpable lesions. The vast majority of deep learning studies on digital breast tomosynthesis (DBT) are focused on detecting and classifying lesions, especially soft-tissue lesions, in small regions of interest previously selected. Only about 25% of the studies are specific to MCs, and all of them are based on the classification of small preselected regions. Classifying the whole image according to the presence or absence of MCs is a difficult task due to the size of MCs and all the information present in an entire image. A completely automatic and direct classification, which receives the entire image, without prior identification of any regions, is crucial for the usefulness of these techniques in a real clinical and screening environment. The main purpose of this work is to implement and evaluate the performance of convolutional neural networks (CNNs) regarding an automatic classification of a complete DBT image for the presence or absence of MCs (without any prior identification of regions). In this work, four popular deep CNNs are trained and compared with a new architecture proposed by us. The main task of these trainings was the classification of DBT cases by absence or presence of MCs. A public database of realistic simulated data was used, and the whole DBT image was taken into account as input. DBT data were considered without and with preprocessing (to study the impact of noise reduction and contrast enhancement methods on the evaluation of MCs with CNNs). The area under the receiver operating characteristic curve (AUC) was used to evaluate the performance. Very promising results were achieved with a maximum AUC of 94.19% for the GoogLeNet. The second-best AUC value was obtained with a new implemented network, CNN-a, with 91.17%. 
This CNN had the particularity of also being the fastest, thus becoming a very interesting model to be considered in other studies. With this work, encouraging outcomes were achieved in this regard, obtaining similar results to other studies for the detection of larger lesions such as masses. Moreover, given the difficulty of visualizing the MCs, which are often spread over several slices, this work may have an important impact on the clinical analysis of DBT images.
In this paper, we develop a detection module, with rigorous training and testing, built on a dense convolutional neural network model. The model is trained on the features needed to model cancer detection well. The method involves preprocessing computerized tomography (CT) images for optimal classification at the testing stage. A 10-fold cross-validation is conducted to test the reliability of the model for cancer detection. The experimental validation is carried out in Python. The results show that the model detects cancer instances robustly on large image datasets. The simulation results show that the proposed method reaches 94% accuracy, higher than the other methods considered. It also reduces detection errors when classifying cancer instances, compared with several existing methods.
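The 10-fold cross-validation mentioned in this abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so the function name `k_fold_indices`, the absence of shuffling, and the sample count below are all assumptions made for illustration, not the authors' procedure.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation.

    Hypothetical helper: samples are split in their given order, with no
    shuffling or stratification.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Each sample appears in exactly one test fold across the 10 splits.
for train_idx, test_idx in k_fold_indices(25, k=10):
    print(len(train_idx), len(test_idx))
```

In practice the data would usually be shuffled (and often stratified by class) before splitting; that step is left out here to keep the sketch short.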
Breast cancer remains a global challenge, causing over 600,000 deaths in 2018 (ref. 1). To achieve earlier cancer detection, health organizations worldwide recommend screening mammography, which is estimated to decrease breast cancer mortality by 20–40% (refs. 2,3). Despite the clear value of screening mammography, significant false positive and false negative rates along with non-uniformities in expert reader availability leave opportunities for improving quality and access (refs. 4,5). To address these limitations, there has been much recent interest in applying deep learning to mammography (refs. 6–18), and these efforts have highlighted two key difficulties: obtaining large amounts of annotated training data and ensuring generalization across populations, acquisition equipment and modalities. Here we present an annotation-efficient deep learning approach that (1) achieves state-of-the-art performance in mammogram classification, (2) successfully extends to digital breast tomosynthesis (DBT; ‘3D mammography’), (3) detects cancers in clinically negative prior mammograms of patients with cancer, (4) generalizes well to a population with low screening rates and (5) outperforms five out of five full-time breast-imaging specialists with an average increase in sensitivity of 14%. By creating new ‘maximum suspicion projection’ (MSP) images from DBT data, our progressively trained, multiple-instance learning approach effectively trains on DBT exams using only breast-level labels while maintaining localization-based interpretability. Altogether, our results demonstrate promise towards software that can improve the accuracy of and access to screening mammography worldwide.
We present a deep convolutional neural network for breast cancer screening exam classification, trained and evaluated on over 200,000 exams (over 1,000,000 images). Our network achieves an AUC of 0.895 in predicting the presence of cancer in the breast, when tested on the screening population. We attribute the high accuracy to a few technical advances. (i) Our network’s novel two-stage architecture and training procedure, which allows us to use a high-capacity patch-level network to learn from pixel-level labels alongside a network learning from macroscopic breast-level labels. (ii) A custom ResNet-based network used as a building block of our model, whose balance of depth and width is optimized for high-resolution medical images. (iii) Pretraining the network on screening BI-RADS classification, a related task with more noisy labels. (iv) Combining multiple input views in an optimal way among a number of possible choices. To validate our model, we conducted a reader study with 14 readers, each reading 720 screening mammogram exams, and show that our model is as accurate as experienced radiologists when presented with the same data. We also show that a hybrid model, averaging the probability of malignancy predicted by a radiologist with a prediction of our neural network, is more accurate than either of the two separately. To further understand our results, we conduct a thorough analysis of our network’s performance on different subpopulations of the screening population, the model’s design, training procedure, errors, and properties of its internal representations. Our best models are publicly available at
Background: Women and their health care providers need a reliable answer to this important question: If a woman chooses to participate in regular mammography screening, then how much will this choice improve her chances of avoiding a death from breast cancer compared with women who choose not to participate? Methods: To answer this question, we used comprehensive registries for population, screening history, breast cancer incidence, and disease-specific death data in a defined population in Dalarna County, Sweden. The annual incidence of breast cancer was calculated along with the annual incidence of breast cancers that were fatal within 10 and within 11 to 20 years of diagnosis among women aged 40 to 69 years who either did or did not participate in mammography screening during a 39-year period (1977–2015). For an additional comparison, corresponding data are presented from 19 years of the prescreening period (1958–1976). All patients received stage-specific therapy according to the latest national guidelines, irrespective of the mode of detection. Results: The benefit for women who chose to participate in an organized breast cancer screening program was a 60% lower risk of dying from breast cancer within 10 years after diagnosis (relative risk, 0.40; 95% confidence interval, 0.34–0.48) and a 47% lower risk of dying from breast cancer within 20 years after diagnosis (relative risk, 0.53; 95% confidence interval, 0.44–0.63) compared with the corresponding risks for nonparticipants. Conclusions: Although all patients with breast cancer stand to benefit from advances in breast cancer therapy, the current results demonstrate that women who have participated in mammography screening obtain a significantly greater benefit from the therapy available at the time of diagnosis than do those who have not participated.
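The "60% lower risk" and "47% lower risk" figures in this abstract follow directly from the reported relative risks, since the relative risk reduction is one minus the relative risk. The short sketch below only reproduces that arithmetic; the function name is a hypothetical helper, and the two relative risks are the values quoted in the abstract.

```python
def relative_risk_reduction(relative_risk):
    """Return the relative risk reduction, 1 - RR, as a fraction."""
    if relative_risk < 0:
        raise ValueError("relative risk cannot be negative")
    return 1.0 - relative_risk


# Relative risks quoted in the abstract (participants vs. nonparticipants).
reported = [("within 10 years", 0.40), ("within 20 years", 0.53)]
for label, rr in reported:
    reduction = relative_risk_reduction(rr)
    print(f"{label}: RR = {rr:.2f}, risk reduction = {reduction:.0%}")
```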
Rationale and objectives: With the growing adoption of digital breast tomosynthesis (DBT) in breast cancer screening, we compare the performance of deep learning computer-aided diagnosis on DBT images to that of conventional full-field digital mammography (FFDM). Materials and methods: In this study, we retrospectively collected FFDM and DBT images of 78 biopsy-proven lesions from 76 patients. A region of interest was selected for each lesion on FFDM, synthesized 2D, and DBT key slice images. Features were extracted from each lesion using a pretrained convolutional neural network (CNN) and served as input to a support vector machine classifier trained in the task of predicting likelihood of malignancy. Results: From receiver operating characteristic (ROC) analysis of all 78 lesions, the synthesized 2D image performed best in both the craniocaudal view (area under the ROC curve [AUC] = 0.81, SE = 0.05) and mediolateral oblique view (AUC = 0.88, SE = 0.04) in the task of lesion characterization. When craniocaudal and mediolateral oblique data of each lesion were merged through soft voting, the DBT key slice image performed best (AUC = 0.89, SE = 0.04). When only masses and architectural distortions (ARDs) were considered, DBT performed significantly better than FFDM (p = 0.024). Conclusion: DBT performed significantly better than FFDM in the merged view classification of mass and ARD lesions. The increased performance suggests that the information extracted by the CNN from DBT images may be more relevant to lesion malignancy status than the information extracted from FFDM images. Therefore, this study provides supporting evidence for the efficacy of computer-aided diagnosis on DBT in the evaluation of mass and ARD lesions.
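Soft voting, as used in this abstract to merge the two views of each lesion, is commonly implemented as an average of the per-view predicted probabilities. The sketch below assumes equal weights for the two views, which the abstract does not state; the function name and the example probabilities are hypothetical.

```python
def soft_vote(cc_probs, mlo_probs):
    """Average per-lesion malignancy probabilities from two views.

    Assumes equal weighting of the two views; both lists must be
    ordered by lesion.
    """
    if len(cc_probs) != len(mlo_probs):
        raise ValueError("both views must cover the same lesions")
    return [(cc + mlo) / 2.0 for cc, mlo in zip(cc_probs, mlo_probs)]


# Hypothetical classifier outputs for three lesions, one list per view.
cc_view = [0.20, 0.90, 0.55]
mlo_view = [0.40, 0.70, 0.35]
merged = soft_vote(cc_view, mlo_view)
print([round(p, 2) for p in merged])
```

The merged scores can then be passed to the same ROC analysis as the single-view scores.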
Mammogram classification is directly related to computer-aided diagnosis of breast cancer. Traditional methods rely on regions of interest (ROIs) which require great efforts to annotate. Inspired by the success of using deep convolutional features for natural image analysis and multi-instance learning (MIL) for labeling a set of instances/patches, we propose end-to-end trained deep multi-instance networks for mass classification based on whole mammogram without the aforementioned ROIs. We explore three different schemes to construct deep multi-instance networks for whole mammogram classification. Experimental results on the INbreast dataset demonstrate the robustness of proposed networks compared to previous work using segmentation and detection annotations.
Although digital mammography has been widely used for breast cancer screening for more than a decade, it has imperfect sensitivity and specificity. A newer technology, digital breast tomosynthesis (DBT), may have a lower recall rate and a higher cancer detection rate than 2-dimensional mammography, although most studies of DBT were retrospective and did not evaluate long-term health outcomes.¹ The use of DBT has some important trade-offs compared with 2-dimensional mammography, including higher costs and higher radiation dose with some machines.² Although the US Preventive Services Task Force and the American Cancer Society have not specifically endorsed DBT for routine breast cancer screening, citing insufficient evidence, the American College of Radiology supports its use.³,⁴ Our objectives were to describe adoption of DBT for breast cancer screening in a large privately insured population, characterize regional patterns of adoption, and identify regional-level characteristics associated with that adoption.
Rationale and objectives: A linear array of carbon nanotube-enabled x-ray sources allows for stationary digital breast tomosynthesis (sDBT), during which projection views are collected without the need to move the x-ray tube. This work presents our initial clinical experience with a first-generation sDBT device. Materials and methods: Following informed consent, women with a "suspicious abnormality" (Breast Imaging Reporting and Data System 4), discovered by digital mammography and awaiting biopsy, were also imaged by the first-generation sDBT device. Four radiologists participated in this paired-image study, completing questionnaires while interpreting the mammograms and sDBT image stacks. Areas under the receiver operating characteristic curve were used to measure reader performance (likelihood of correctly identifying malignancy based on pathology as ground truth), while a multivariate analysis assessed preference, as readers compared one modality to the next when interpreting diagnostically important image features. Results: Findings from 43 women were available for analysis, in whom 12 cases of malignancy were identified by pathology. The mean area under the receiver operating characteristic curve was significantly higher (p < 0.05) for sDBT than mammography for all breast density categories and breast thicknesses. Additionally, readers preferred sDBT over mammography when evaluating mass margins and shape, architectural distortion, and asymmetry, but preferred mammography when characterizing microcalcifications. Conclusion: Readers preferred sDBT over mammography when interpreting soft-tissue breast features and were diagnostically more accurate using images generated by sDBT in a Breast Imaging Reporting and Data System 4 population. However, the findings also demonstrated the need to improve microcalcification conspicuity, which is guiding both technological and image-processing design changes in future sDBT devices.
Conference Paper
Many deep neural networks trained on natural images exhibit a curious phenomenon in common: on the first layer they learn features similar to Gabor filters and color blobs. Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks. Features must eventually transition from general to specific by the last layer of the network, but this transition has not been studied extensively. In this paper we experimentally quantify the generality versus specificity of neurons in each layer of a deep convolutional neural network and report a few surprising results. Transferability is negatively affected by two distinct issues: (1) the specialization of higher layer neurons to their original task at the expense of performance on the target task, which was expected, and (2) optimization difficulties related to splitting networks between co-adapted neurons, which was not expected. In an example network trained on ImageNet, we demonstrate that either of these two issues may dominate, depending on whether features are transferred from the bottom, middle, or top of the network. We also document that the transferability of features decreases as the distance between the base task and target task increases, but that transferring features even from distant tasks can be better than using random features. A final surprising result is that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.