Adaptation of a deep learning malignancy model from
full-field digital mammography to digital breast tomosynthesis
Sadanand Singh1, Thomas Paul Matthews1, Meet Shah1, Brent Mombourquette1, Trevor
Tsue1, Aaron Long1, Ranya Almohsen1, Stefano Pedemonte1, and Jason Su1
1Whiterabbit AI, Inc., Santa Clara, CA, USA
Mammography-based screening has helped reduce the breast cancer mortality rate, but has also been associated
with potential harms due to low specificity, leading to unnecessary exams or procedures, and low sensitivity.
Digital breast tomosynthesis (DBT) improves on conventional mammography by increasing both sensitivity and
specificity and is becoming common in clinical settings. However, deep learning (DL) models have been developed
mainly on conventional 2D full-field digital mammography (FFDM) or scanned film images. Due to a lack of large
annotated DBT datasets, it is difficult to train a model on DBT from scratch. In this work, we present methods
to generalize a model trained on FFDM images to DBT images. In particular, we use average histogram matching
(HM) and DL fine-tuning methods to generalize an FFDM model to the 2D maximum intensity projection (MIP)
of DBT images. In the proposed approach, the differences between the FFDM and DBT domains are reduced
via HM and then the base model, which was trained on abundant FFDM images, is fine-tuned. When evaluating
on image patches extracted around identified findings, we achieve areas under the receiver
operating characteristic curve (ROC AUC) of ∼0.9 for FFDM and ∼0.85 for MIP images, as compared to a
ROC AUC of ∼0.75 when the FFDM model is tested directly on MIP images without adaptation.
Keywords: Mammography, Tomosynthesis, Deep Learning, Domain Adaptation, Transfer Learning
Breast cancer is the most commonly diagnosed cancer and the second leading cancer-related cause of death
among women in the United States.1 Although mammography-based screening has been shown to reduce breast
cancer mortality,2 it has also been associated with physical and psychological harms caused by false positives and
unnecessary biopsies.3–5 To address these concerns, many clinics have started to switch their screening programs
from 2D full-field digital mammography (FFDM) to 3D digital breast tomosynthesis (DBT),6 which has been
shown to increase the sensitivity of breast cancer screening7,8 and reduce false positives.9
Deep learning (DL) using convolutional neural networks has previously been used to aid in the evaluation of
screening mammography to enhance the specificity of malignancy prediction, particularly for FFDM exams.10,11
Several challenges exist, however, in translating these successes to DBT exams. First, the performance
of DL models generally scales with the availability of labeled data, but because DBT has been adopted only recently,
most large-scale mammography datasets consist mainly of FFDM exams. Second, the 3D volumes of DBT exams
can be quite large (e.g., 2457×1996×70 pixels). This can lead to computational difficulties as well as training
issues related to the curse of dimensionality, which are exacerbated by the low prevalence of cancer training
samples and the often small finding sizes associated with early detection.
This study focuses on methods to adapt DL malignancy models originally developed for FFDM exams to DBT
exams in the case where the amount of available DBT data is quite limited. In order to overcome the large size
of 3D DBT images, we instead consider the maximum intensity projection (MIP) of these 3D volumes. Several
methods of adapting a model trained on patches of FFDM images to patches of MIP images are evaluated and
compared. The impact of histogram matching on reducing domain shift and simplifying the adaptation problem
is also considered.
Further author information: (Send correspondence to Whiterabbit.ai)
Whiterabbit.ai: E-mail: firstname.lastname@example.org
arXiv:2001.08381v1 [cs.CV] 23 Jan 2020
The data was collected from a large academic medical center located in the mid-western region of the United
States between 2008 and 2017. This study was approved by the internal institutional review board of the
university from which the data was collected. Informed consent was waived for this retrospective study. The
data consists of a large set of FFDM exams and a smaller set of DBT exams. The exams were interpreted
by one of 11 radiologists with breast imaging experience ranging from 2 to 30 years. Radiologist assessments
and pathology outcomes were extracted from the mammography reporting software of the site (Magview v7.1,
Magview, Burtonsville, Maryland).
Patients were randomly split into training (‘train’), validation (‘val’) and testing (‘test’) sets with an 80:10:10
ratio. Since the split was performed at the patient level, no patient had images in more than one of the above
sets. This split was shared by both the FFDM and DBT datasets. All training and hyperparameter searches
were performed on the training and validation sets. Performance on the test set was evaluated only once all
model selection, training, and fine-tuning had been carried out.
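The patient-level split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_patients`, the fixed seed, and the use of integer patient identifiers are all assumptions made for the example. Each exam and image would inherit the set of its patient, so the same assignment can be reused for the FFDM and DBT datasets.

```python
import random


def split_patients(patient_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly assign each unique patient to 'train', 'val', or 'test'.

    Hypothetical helper: splitting on patient IDs (not on images) is what
    keeps all images of one patient inside a single set.
    """
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(ratios[0] * len(ids))
    n_val = round(ratios[1] * len(ids))
    assignment = {}
    for i, pid in enumerate(ids):
        if i < n_train:
            assignment[pid] = "train"
        elif i < n_train + n_val:
            assignment[pid] = "val"
        else:
            assignment[pid] = "test"
    return assignment


# Example with 1000 made-up patient IDs.
split = split_patients(range(1000))
```

Because the assignment is keyed by patient, looking up the set for any image only requires its patient ID.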
Images were categorized as one of four classes: (1) normal, no notable findings were identified by the
radiologist; (2) benign, all notable findings were determined to be benign by the radiologist or by biopsy; (3) high-risk,
a biopsy determined a finding to contain tissue types likely to develop into cancer; and (4) malignant, a biopsy
determined a finding to contain malignant tissue types. We combine the normal and benign labels into the negative
class and the high-risk and malignant labels into the positive class. All malignant and high-risk images
had exactly one radiologist annotation, indicating the location of the biopsied finding. These annotations were
made during the course of the standard clinical care for the patient. Benign samples have zero or one annotation.
A detailed distribution of the data across the different classes can be found in Table 1.
Table 1: Detailed statistics of the collected FFDM and DBT training (train), validation (val), and testing (test)
datasets. The numbers of patients, exams, and images for each set are given, as well as the distribution of
malignancy by image.

FFDM
             Train             Val              Test
Patients     49965             6239             6213
Exams        158650            19933            19618
Images       664234            83920            82296
Normal       606080 (91.3%)    76092 (90.7%)    75073 (91.2%)
Benign       56660 (8.5%)      7631 (9.1%)      7023 (8.5%)
High Risk    404 (0.1%)        41 (0.1%)        42 (0.1%)
Malignant    1090 (0.2%)       156 (0.2%)       158 (0.2%)

DBT
             Train             Val              Test
Patients     10684             1399             1357
Exams        14828             1944             1855
Images       54380             7140             6791
Normal       48006 (88.3%)     6171 (86.4%)     6058 (89.2%)
Benign       6175 (11.4%)      939 (13.2%)      689 (10.2%)
High Risk    86 (0.2%)         13 (0.2%)        15 (0.2%)
Malignant    113 (0.2%)        17 (0.2%)        29 (0.4%)
2.2 Patch Model
The DL model is based on the ResNet architecture12 and has 29 layers and approximately 6 million parameters. It accepts
a 512×512 image patch from an FFDM or MIP image and predicts the probability that the patch contains a
malignant or high-risk finding.
The original FFDM images had 4096×3328 or 3328×2560 pixels, and the MIP images had 2457×1996 or 2457×1890
pixels. Example FFDM and MIP images can be seen in Figure 1. To obtain the input
to the model, an initial patch of 1024×1024 pixels is extracted from the image and downsampled by bilinear
interpolation to 512×512 pixels, yielding a patch at half the resolution of the original image. The resulting patch
covers 7.7-12.3% of the area of the original FFDM image; for the smaller MIP images, the same patch size
corresponds to a larger fraction of the image.
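As a quick check of the coverage figures, the fraction of each image covered by a 1024×1024 patch can be computed directly from the image sizes quoted above. The snippet is only a worked calculation; the dictionary keys and the helper name `patch_coverage` are illustrative.

```python
PATCH_SIDE = 1024  # side length of the initial patch, in original-image pixels

# Image sizes (rows, columns) as quoted in the text.
image_sizes = {
    "FFDM 4096x3328": (4096, 3328),
    "FFDM 3328x2560": (3328, 2560),
    "MIP 2457x1996": (2457, 1996),
    "MIP 2457x1890": (2457, 1890),
}


def patch_coverage(height, width, patch_side=PATCH_SIDE):
    """Fraction of the image area covered by one square patch."""
    return patch_side * patch_side / (height * width)


coverage = {name: patch_coverage(h, w) for name, (h, w) in image_sizes.items()}
for name, frac in coverage.items():
    print(f"{name}: {100 * frac:.1f}%")
```

For the two FFDM sizes this gives about 7.7% and 12.3%, matching the quoted range, while the same patch covers roughly a fifth of a MIP image.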
For samples with annotations indicating the finding location, patches are centered at the center of the
annotations. For samples without annotations, the breast is segmented using a pre-chosen threshold and a patch
is selected centered at a randomly chosen pixel within the breast. Patches are sampled such that they are always
fully contained within the image and may be translated to satisfy this criterion.
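The patch placement rules above can be sketched with NumPy as follows. This is an illustrative reading of the text, not the authors' implementation: the function names, the `threshold` argument, and the assumption that every image is at least as large as the patch are all hypothetical. The downsampling step uses 2×2 block averaging, which coincides with bilinear interpolation sampled at the block centers for an exact factor of two; the resampling routine actually used is not specified in the text.

```python
import numpy as np


def patch_origin(center, image_shape, patch_side=1024):
    """Top-left corner of a patch centered at `center`, shifted to stay inside the image."""
    origin = []
    for c, size in zip(center, image_shape):
        start = c - patch_side // 2
        # Translate the patch if it would cross an image border.
        start = max(0, min(start, size - patch_side))
        origin.append(start)
    return tuple(origin)


def random_breast_center(image, threshold, rng):
    """Pick a random pixel whose intensity exceeds the segmentation threshold."""
    rows, cols = np.nonzero(image > threshold)
    k = rng.integers(len(rows))
    return int(rows[k]), int(cols[k])


def extract_patch(image, center, patch_side=1024):
    """Crop a patch and halve its resolution by 2x2 block averaging."""
    r, c = patch_origin(center, image.shape, patch_side)
    patch = image[r:r + patch_side, c:c + patch_side]
    half = patch_side // 2
    return patch.reshape(half, 2, half, 2).mean(axis=(1, 3))
```

For an annotated sample, `center` would be the center of the annotation; for an unannotated one, it would come from `random_breast_center`.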
2.3 FFDM Training
The base model is trained on the FFDM data, with a uniform sampling of the two classes (equal probability of
sampling a positive or negative class sample). Images were augmented during training with random horizontal
and vertical flipping, additive Gaussian white noise with a standard deviation of 1.0, random translation drawn
from a Gaussian distribution with a standard deviation of 20 pixels, and random rotation drawn from a
uniform distribution from -30 to +30 degrees.
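The sampling of augmentation parameters can be sketched as below. The distributions for noise, translation, and rotation follow the text; the flip probability of 0.5, the zero mean of the translation, and all function and key names are assumptions made for this example. Only flipping and noise are applied here, since translation and rotation would act on the patch location and resampling.

```python
import numpy as np


def sample_augmentation(rng):
    """Draw one set of augmentation parameters for a training patch."""
    return {
        "flip_horizontal": bool(rng.random() < 0.5),   # assumed probability
        "flip_vertical": bool(rng.random() < 0.5),     # assumed probability
        "shift_pixels": rng.normal(0.0, 20.0, size=2),  # Gaussian, std 20 pixels
        "rotation_degrees": rng.uniform(-30.0, 30.0),   # uniform in [-30, 30]
        "noise_std": 1.0,                               # additive white noise
    }


def apply_flips_and_noise(patch, params, rng):
    """Apply the flip and noise parts of the augmentation to a 2D patch."""
    if params["flip_horizontal"]:
        patch = patch[:, ::-1]
    if params["flip_vertical"]:
        patch = patch[::-1, :]
    return patch + rng.normal(0.0, params["noise_std"], size=patch.shape)


def sample_balanced_label(rng):
    """Choose the class to sample from: positive (1) or negative (0) with equal probability."""
    return int(rng.random() < 0.5)
```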
The model is trained to minimize a cross-entropy loss function using the Adam optimizer13 with an initial
learning rate of $5 \times 10^{-5}$ and a weight decay of $5 \times 10^{-4}$. An epoch is defined as 40000 samples shown to the
model. The model was trained for 100 epochs, and the model chosen for evaluation is the one that maximized
the area under the receiver operating characteristic curve (ROC AUC) on the validation set.
2.4 Domain Adaptation
Figure 2: Average cumulative histograms for FFDM
and MIP images. The intensity values have been scaled
so that they range from 0 to 4095 for both image types.
In domain adaptation, a model trained on one domain is adapted to another domain for which there
exists far less data. Previous work has shown that deep neural networks often learn task- and
domain-agnostic features, particularly in the earlier layers of the network.14 When the domains and
tasks are similar, larger portions of the network may be reused.
Here, we explore the use of histogram matching to reduce the domain shift between the FFDM and DBT
domains. The FFDM patch model is adapted to DBT exams, both with and without histogram matching,
using two different fine-tuning methods.
2.4.1 Histogram Matching
A non-linear transformation is used to transform the cumulative histogram (c.d.f.) from one domain to the
average c.d.f. of another domain,15 referred to as histogram matching (HM). In particular, HM is employed
to transform MIP images to better match the FFDM images originally used to train the model.
Figure 1: Sample images from each domain for different malignancy classes - Normal, Benign and Malignant.
The red box indicates the location of a radiologist-annotated finding.
Algorithm 1: Histogram matching
The algorithm describes the procedure of histogram matching images from two
domains. Here, $X[i]$ represents the $i$-th pixel value of the image $X$. The
inverse mapping to a pixel value in the reference domain is performed by linear
interpolation.
Input: Source image $X_S \in [0, K-1]^N$,
       Source c.d.f. $F_S \in \mathbb{N}^K$,
       Reference c.d.f. $F_R \in \mathbb{N}^L$
Output: Histogram-matched image $X'_S \in [0, L-1]^N$
for $i$ in $0$ to $N-1$ do
    $p_S = X_S[i]$
    $p'_S = F_R^{-1}(F_S(p_S))$
    $X'_S[i] = p'_S$
end
The procedure for HM is outlined in Algorithm 1 and given in greater detail as follows. Let $F_S$ be the c.d.f.
of the source image, whose intensity distribution is to be updated, and let $F_R$ be the c.d.f. of the reference
domain, whose intensity distribution we hope to match. Let $p_S \in [0, K-1]$ be a pixel value for the source image
and $p_R \in [0, L-1]$ be a pixel value in the reference domain such that $F_S(p_S) = F_R(p_R)$. Then, our transformed
image will have the pixel value $p'_S = F_R^{-1}(F_S(p_S))$, where the inverse mapping is calculated via linear
interpolation.
The average c.d.f. of the FFDM data was calculated over 1200 randomly chosen training samples, composed
of equal numbers of the normal, benign and malignant classes. Similarly, the average c.d.f. of the MIP data was
calculated over 600 randomly chosen training samples, composed of equal numbers of the normal, benign and
malignant classes. The histograms for the FFDM and MIP images can be seen in Figure 2. The effect of
histogram matching can be qualitatively visualized in Figure 1.
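The histogram matching procedure can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, images are assumed to hold non-negative integers below the number of levels, and each c.d.f. is normalized to [0, 1] (rather than kept as cumulative counts) so that images of different sizes can be averaged. The inverse mapping uses `np.interp` for the linear interpolation, after dropping flat segments of the reference c.d.f. so that the interpolation grid is strictly increasing.

```python
import numpy as np


def average_cdf(images, num_levels):
    """Average the normalized cumulative histograms of a set of integer images."""
    cdfs = []
    for img in images:
        hist = np.bincount(img.ravel(), minlength=num_levels)
        cdfs.append(np.cumsum(hist) / img.size)
    return np.mean(cdfs, axis=0)


def match_histogram(source, source_cdf, reference_cdf):
    """Map each source pixel p_S to F_R^{-1}(F_S(p_S)) by linear interpolation."""
    ref_levels = np.arange(len(reference_cdf))
    # Keep the first level at which each distinct c.d.f. value is reached.
    xp, first_index = np.unique(reference_cdf, return_index=True)
    lookup = np.interp(source_cdf, xp, ref_levels[first_index])
    # `lookup[p]` is the matched value for source intensity p (floating point).
    return lookup[source]
```

In the setting of this paper, `source_cdf` would be the average c.d.f. of the MIP images and `reference_cdf` the average c.d.f. of the FFDM images, so the lookup table is fixed and can be computed once.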
Two methods were used to fine-tune the base model trained on FFDM images for use with the original or
histogram-matched MIP images. For the first approach, only the last fully connected layer of the model was
re-trained. This is referred to as the conventional fine-tuning approach. For the second approach, a version of the
SpotTune algorithm was implemented.16 SpotTune is an adaptive fine-tuning approach that finds the optimal
fine-tuning policy (which layers to fine-tune) per instance of target data.
The underlying idea behind SpotTune is that different training samples from the target domain require
fine-tuning updates to different sets of layers in the pre-trained network. The SpotTune training procedure involves
predicting, for each training input, the specific layers to be fine-tuned and the layers to be kept frozen. This
input-dependent fine-tuning approach enables targeting layers per input instance and has been reported to lead to
better accuracy.16 We refer readers to the original paper16 for further details of SpotTune.
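The per-instance routing idea can be illustrated with a small NumPy sketch. It reflects our reading of the SpotTune paper, in which each residual block exists in a frozen copy and a fine-tuned copy and a per-sample binary decision selects between them; it is not the implementation used in this work. The decision is passed in directly here, whereas SpotTune predicts it with a separate policy network, and the two block functions are placeholders.

```python
import numpy as np


def spottune_block(x, frozen_block, tuned_block, use_tuned):
    """Residual block output with a per-sample choice of frozen or fine-tuned weights.

    x:         batch of features, shape (batch, ...)
    use_tuned: boolean array of shape (batch,), one routing decision per sample
    """
    # Broadcast the per-sample decision over the feature dimensions.
    mask = use_tuned.reshape(-1, *([1] * (x.ndim - 1))).astype(x.dtype)
    return x + mask * tuned_block(x) + (1.0 - mask) * frozen_block(x)


# Toy example: the "frozen" block outputs zeros, the "tuned" block doubles its input.
features = np.ones((2, 3))
decisions = np.array([True, False])
routed = spottune_block(features, lambda v: 0.0 * v, lambda v: 2.0 * v, decisions)
```

In this toy case the first sample passes through the fine-tuned path and the second through the frozen path, which is the behavior visualized per block in the policy figure.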
The fine-tuned model used the same data augmentations as the original FFDM model. The model is
fine-tuned using a cross-entropy loss and the Adam optimizer with a learning rate of $5 \times 10^{-5}$ and a weight decay of
$1 \times 10^{-4}$. The model chosen is the one that maximized the validation ROC AUC.
The performance of all models is measured on the test datasets using the area under the receiver operating
characteristic curve (ROC AUC). On the test data, we extract patches in the same way as explained in Section
2.2. Since patch extraction is random, we average the results over three random seeds. The standard deviation of
the results is used as an error estimate. A summary of all the results can be found in Table 2.
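The evaluation protocol can be sketched as follows. The ROC AUC is computed here from its rank interpretation (the probability that a positive sample scores higher than a negative one, with ties counted as one half), and the per-seed values are summarized by their mean and standard deviation. The function names are illustrative, and the use of the population standard deviation (NumPy's default) is an assumption, since the text does not say which estimator was used.

```python
import numpy as np


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)


def summarize(aucs):
    """Mean and standard deviation of the ROC AUC over patch-extraction seeds."""
    aucs = np.asarray(aucs, dtype=float)
    return aucs.mean(), aucs.std()
```

The pairwise comparison is quadratic in the number of samples, which is acceptable for a sketch; a sorting-based implementation would be preferable for large test sets.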
Table 2: Performance of the models for different domains. Results are shown on a test set for which both FFDM
and DBT/MIP images are available. MIP with HM refers to MIP images pre-processed to look more like FFDM
images. Errors shown here refer to the standard deviation over 3 independent realizations of patch extraction.

Training Data    Testing Data    Procedure             ROC AUC
FFDM             FFDM            Train from scratch    0.909 ± 0.001
FFDM             MIP             Test only             0.751 ± 0.001
FFDM             MIP             Fine-tune             0.759 ± 0.003
FFDM             MIP             SpotTune16            0.825 ± 0.002
FFDM             MIP with HM     Test only             0.847 ± 0.001
FFDM             MIP with HM     Fine-tune             0.837 ± 0.001
FFDM             MIP with HM     SpotTune16            0.830 ± 0.002
The base FFDM patch model has a ROC AUC of 0.909 on the FFDM images. For MIP images, the
performance of the base model drops to a ROC AUC of 0.751. If the MIP images are pre-processed using
the average histogram matching method, the ROC AUC rises to 0.847. This shows that our simple fixed
non-linear transformation via histogram matching reduces the domain shift considerably.
In an attempt to improve performance further, we apply fine-tuning and SpotTune,16 both with and without HM.
Without HM, conventional fine-tuning on the limited number of MIP images yields only a marginal change in
ROC AUC (0.751 to 0.759), whereas SpotTune reaches a ROC AUC of 0.825. With HM, conventional fine-tuning
reaches a ROC AUC of 0.837, compared to 0.759 without HM, while SpotTune changes only slightly, from 0.825
to 0.830. Neither method improves on the test-only result with HM. Overall, we find that the simple strategy
of only pre-processing via histogram matching leads to the best ROC AUC of 0.847 on MIP images.
Figure 3: Visualization of SpotTune policies to re-use
or fine-tune a residual block for MIP images with and
without HM.
SpotTune learns a policy per sample for selecting which ResNet blocks to fine-tune. For the two cases, with and
without HM, the probability of fine-tuning different ResNet blocks can be seen in Figure 3. With HM, the
relative gap in performance of SpotTune vs. test-only is small, perhaps due to the limited number of ResNet
blocks that are modified by the SpotTune algorithm. Without HM, the gap is large and a significant portion
of the blocks are updated.
This observation also offers insight into the effectiveness of conventional fine-tuning with and without
HM. Without HM, fine-tuning the last layer of the network is unable to improve performance, as more
extensive changes to the network are needed, as indicated by the large number of ResNet blocks changed by the
SpotTune algorithm. With HM, the performance of SpotTune and that of conventional fine-tuning are more
similar, as only the late layers need to be extensively modified.
4. NEW OR BREAKTHROUGH WORK TO BE PRESENTED
We present our work on the adaptation of a patch-level deep learning malignancy model from FFDM to
DBT exams. The original model was trained and evaluated on a large set of FFDM images, and the model
was adapted using methods requiring few DBT exams. In particular, by accounting for histogram-level differences
in image intensities via histogram matching, we can achieve good classification performance without additional training.
A prior study17 considered the use of transfer learning for malignancy classification for FFDM and DBT
exams. However, that study examined transfer learning for a model that was not initially trained on medical
images. It also mainly focused on malignant and benign patches (4× smaller than our patches) and employed
a much smaller dataset. A very recent study18 considered domain adaptation from FFDM to DBT images, but
relied on a large multi-site dataset of DBT images. It is unclear how effective that approach would be if the
DBT dataset were smaller.
A deep learning malignancy model was trained to identify high-risk or malignant findings in patches of full-field
digital mammography (FFDM) images. Several strategies were evaluated to adapt this model to the maximum
intensity projections (MIP) of digital breast tomosynthesis (DBT) exams. The effectiveness of each domain
adaptation approach depended strongly on the amount of domain shift and the amount of available training
data. Without HM, the amount of domain shift was large and SpotTune was the most effective even with limited
training data. However, following HM, the domain shift was much smaller and the simplest approach proved best.
Histogram matching can, therefore, be an effective strategy for domain adaptation when the amount of available
training data in the target domain is limited. This approach is simple and intuitive and can be easily adapted
to other problems in medical imaging where obtaining a large amount of labeled data for every image modality
is difficult.
6. FUTURE WORK
In this study, we focus exclusively on domain adaptation for patch-level models. Thus, future work includes
using these adaptation techniques and patch models to train whole-image models, which provide a malignancy
probability for the entire image and, eventually, the entire examination. Another interesting approach is learning
the cyclic transformation via a CycleGAN.19 This generative model can learn how to transform a 3D DBT image
into a 2D image from the same distribution as the original FFDM images (and vice versa). By learning this
conditional distribution, we could synthesize 2D images that maintain the important, subtle features that may
have been lost by using maximum projection.
This work has not been submitted to any journal or conference for publication or presentation considerations.
This work was supported by Whiterabbit AI, Inc. The following authors are employed by and/or have a
financial interest in Whiterabbit: Sadanand Singh, Thomas Paul Matthews, Meet Shah, Brent Mombourquette,
Trevor Tsue, Aaron Long, Stefano Pedemonte, and Jason Su.
[1] Siegel, R. L., Miller, K. D., and Jemal, A., “Cancer statistics, 2019,” CA: A Cancer Journal for
Clinicians 69(1), 7–34 (2019).
[2] Tabar, L., Dean, P. B., Chen, T. H., Yen, A. M., Chen, S. L., Fann, J. C., Chiu, S. Y., Ku, M. M., Wu, W. Y.,
Hsu, C. Y., Chen, Y. C., Beckmann, K., Smith, R. A., and Duffy, S. W., “The incidence of fatal breast
cancer measures the increased effectiveness of therapy in women participating in mammography screening,”
Cancer 125(4), 515–523 (2019).
[3] Smith, R. A., Duffy, S. W., Gabe, R., Tabar, L., Yen, A. M., and Chen, T. H., “The randomized trials
of breast cancer screening: what have we learned?,” Radiologic Clinics of North America 42(5), 793–806
[4] Tabár, L., Yen, A. M.-F., Wu, W. Y.-Y., Chen, S. L.-S., Chiu, S. Y.-H., Fann, J. C.-Y., Ku, M. M.-S., Smith,
R. A., Duffy, S. W., and Chen, T. H.-H., “Insights from the breast cancer screening trials: How screening
affects the natural history of breast cancer and implications for evaluating service screening programs,” The
Breast Journal 21(1), 13–20 (2015).
[5] Webb, M. L., Cady, B., Michaelson, J. S., Bush, D. M., Calvillo, K. Z., Kopans, D. B., and Smith, B. L., “A
failure analysis of invasive breast cancer: Most deaths from disease occur in women not regularly screened,”
Cancer 120(18), 2839–2846 (2014).
[6] Richman, I. B., Hoag, J. R., Xu, X., Forman, H. P., Hooley, R., Busch, S. H., and Gross, C. P., “Adoption
of Digital Breast Tomosynthesis in Clinical Practice,” JAMA Intern Med 179, 1292–1295 (2019).
[7] Hooley, R. J., Durand, M. A., and Philpotts, L. E., “Advances in digital breast tomosynthesis,” American
Journal of Roentgenology 208(2), 256–266 (2017).
[8] Lee, Y. Z., Puett, C., Inscoe, C. R., Jia, B., Kim, C., Walsh, R., Yoon, S., Kim, S. J., Kuzmiak, C. M.,
Zeng, D., Lu, J., and Zhou, O., “Initial clinical experience with stationary digital breast tomosynthesis,”
Academic Radiology 26, 1363–1373 (2019).
[9] Friedewald, S. M., Rafferty, E. A., Rose, S. L., Durand, M. A., Plecha, D. M., Greenberg, J. S., Hayes, M. K.,
Copit, D. S., Carlson, K. L., Cink, T. M., Barke, L. D., Greer, L. N., Miller, D. P., and Conant, E. F., “Breast
Cancer Screening Using Tomosynthesis in Combination with Digital Mammography,” JAMA 311(24), 2499–
[10] Zhu, W., Lou, Q., Vang, Y. S., and Xie, X., “Deep multi-instance networks with sparse label assignment
for whole mammogram classification,” in [Medical Image Computing and Computer Assisted Intervention -
MICCAI 2017], 603–611 (2017).
[11] Wu, N., Phang, J., Park, J., Shen, Y., Huang, Z., Zorin, M., Jastrzębski, S., Févry, T., Katsnelson, J.,
Kim, E., Wolfson, S., Parikh, U., Gaddam, S., Lin, L. L. Y., Ho, K., Weinstein, J. D., Reig, B., Gao, Y.,
Toth, H., Pysarenko, K., Lewin, A., Lee, J., Airola, K., Mema, E., Chung, S., Hwang, E., Samreen, N.,
Kim, S. G., Heacock, L., Moy, L., Cho, K., and Geras, K. J., “Deep neural networks improve radiologists’
performance in breast cancer screening,” IEEE Trans. Medical Imaging (2019). (pre-print).
[12] He, K., Zhang, X., Ren, S., and Sun, J., “Deep residual learning for image recognition,” in [2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR)], 770–778 (2016).
[13] Kingma, D. P. and Ba, J., “Adam: A Method for Stochastic Optimization,” in [International Conference
on Learning Representations (ICLR)], (2015).
[14] Yosinski, J., Clune, J., Bengio, Y., and Lipson, H., “How transferable are features in deep neural networks?,”
in [Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume
2], 3320–3328 (2014).
[15] Gonzalez, R., [Digital image processing], Pearson, London, 4 ed. (March 2017).
[16] Guo, Y., Shi, H., Kumar, A., Grauman, K., Rosing, T., and Feris, R., “SpotTune: transfer learning through
adaptive fine-tuning,” in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition],
[17] Mendel, K., Li, H., Sheth, D., and Giger, M., “Transfer learning from convolutional neural networks for
computer-aided diagnosis: A comparison of digital breast tomosynthesis and full-field digital
mammography,” Academic Radiology 26(6), 735–743 (2019).
[18] Lotter, W., Diab, A. R., Haslam, B., Kim, J. G., Grisot, G., Wu, E., Wu, K., Onieva, J. O., Boxerman,
J. L., Wang, M., Bandler, M., Vijayaraghavan, G., and Sorensen, A. G., “Robust breast cancer
detection in mammography and digital breast tomosynthesis using annotation-efficient deep learning approach,”
arXiv:1912.11027 [cs, eess] (Dec. 2019).
[19] Zhu, J., Park, T., Isola, P., and Efros, A. A., “Unpaired image-to-image translation using cycle-consistent
adversarial networks,” in [2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)],