
Adaptation of a deep learning malignancy model from full-field digital mammography to digital breast tomosynthesis

Sadanand Singh¹, Thomas Paul Matthews¹, Meet Shah¹, Brent Mombourquette¹, Trevor
Tsue¹, Aaron Long¹, Ranya Almohsen¹, Stefano Pedemonte¹, and Jason Su¹
¹Whiterabbit AI, Inc., Santa Clara, CA, USA
Mammography-based screening has helped reduce the breast cancer mortality rate, but has also been associated
with potential harms due to low specificity, leading to unnecessary exams or procedures, and low sensitivity.
Digital breast tomosynthesis (DBT) improves on conventional mammography by increasing both sensitivity and
specificity and is becoming common in clinical settings. However, deep learning (DL) models have been developed
mainly on conventional 2D full-field digital mammography (FFDM) or scanned film images. Due to a lack of large
annotated DBT datasets, it is difficult to train a model on DBT from scratch. In this work, we present methods
to generalize a model trained on FFDM images to DBT images. In particular, we use average histogram matching
(HM) and DL fine-tuning methods to generalize an FFDM model to the 2D maximum intensity projection (MIP)
of DBT images. In the proposed approach, the differences between the FFDM and DBT domains are reduced
via HM and then the base model, which was trained on abundant FFDM images, is fine-tuned. When evaluating
on image patches extracted around identified findings, we are able to achieve similar areas under the receiver
operating characteristic curve (ROC AUC) of 0.9 for FFDM and 0.85 for MIP images, as compared to a
ROC AUC of 0.75 when tested directly on MIP images.
Keywords: Mammography, Tomosynthesis, Deep Learning, Domain Adaptation, Transfer Learning
Breast cancer is the most commonly diagnosed cancer and the second leading cancer-related cause of death
among women in the United States [1]. Although mammography-based screening has been shown to reduce breast
cancer mortality [2], it has also been associated with physical and psychological harms caused by false positives and
unnecessary biopsies [3–5]. To address these concerns, many clinics have started to switch their screening programs
from 2D full-field digital mammography (FFDM) to 3D digital breast tomosynthesis (DBT) [6], which has been
shown to increase the sensitivity of breast cancer screening [7,8] and reduce false positives [9].
Deep learning (DL) using convolutional neural networks has previously been used to aid in the evaluation of
screening mammography to enhance the specificity of malignancy prediction, particularly for FFDM exams [10,11].
Several challenges exist, however, in translating these successes to DBT exams. First, the performance
of DL models generally scales with the availability of labeled data, but because DBT has only recently been adopted,
most large-scale mammography datasets consist mainly of FFDM exams. Second, the 3D volumes of DBT exams
can be quite large (e.g., 2457×1996×70 pixels). This can lead to computational difficulties as well as training
issues related to the curse of dimensionality, which are exacerbated by the low prevalence of cancer training
samples and the often small finding sizes associated with early detection.
This study focuses on methods to adapt DL malignancy models originally developed for FFDM exams to DBT
exams in the case where the amount of available DBT data is quite limited. In order to overcome the large size
of 3D DBT images, we instead consider the maximum intensity projection (MIP) of these 3D volumes. Several
methods of adapting a model trained on patches of FFDM images to patches of MIP images are evaluated and
compared. The impact of histogram matching on reducing domain shift and simplifying the adaptation problem
is also considered.
arXiv:2001.08381v1 [cs.CV] 23 Jan 2020
2.1 Dataset
The data was collected from a large academic medical center located in the mid-western region of the United
States between 2008 and 2017. This study was approved by the internal institutional review board of the
university from which the data was collected. Informed consent was waived for this retrospective study. The
data consists of a large set of FFDM exams and a smaller set of DBT exams. The exams were interpreted
by one of 11 radiologists with breast imaging experience ranging from 2 to 30 years. Radiologist assessments
and pathology outcomes were extracted from the mammography reporting software of the site (Magview v7.1,
Magview, Burtonsville, Maryland).
Patients were randomly split into training (‘train’), validation (‘val’) and testing (‘test’) sets with an 80:10:10
ratio. Since the split was performed at the patient level, no patient had images in more than one of the above
sets. This split was shared by both the FFDM and DBT datasets. All training and hyperparameter searches
were performed on the training and validation sets. Performance on the test set was evaluated only once all
model selection, training, and fine-tuning had been carried out.
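The patient-level split described above can be illustrated with a minimal sketch. The function name `split_patients`, the seed, and the toy patient identifiers are assumptions made for illustration; the source only specifies a random 80:10:10 split at the patient level.

```python
import numpy as np

def split_patients(patient_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign each unique patient to 'train', 'val', or 'test' (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.unique(patient_ids))
    n_train = int(round(ratios[0] * len(shuffled)))
    n_val = int(round(ratios[1] * len(shuffled)))
    assignment = {}
    for k, pid in enumerate(shuffled):
        if k < n_train:
            assignment[pid] = "train"
        elif k < n_train + n_val:
            assignment[pid] = "val"
        else:
            assignment[pid] = "test"
    return assignment

# Toy data: 100 hypothetical patients with 6 images each.
image_patient_ids = np.repeat(np.arange(100), 6)
assignment = split_patients(image_patient_ids)

# Every image inherits the set of its patient, so no patient spans two sets.
image_sets = np.array([assignment[pid] for pid in image_patient_ids])
```

Because the assignment is keyed by patient rather than by image or exam, the same mapping can be reused for both the FFDM and DBT images of a patient, which is how a shared split could be obtained.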
Images were categorized as one of four classes: (1) normal, no notable findings were identified by the
radiologist; (2) benign, all notable findings were determined to be benign by the radiologist or by biopsy;
(3) high-risk, a biopsy determined a finding to contain tissue types likely to develop into cancer; and
(4) malignant, a biopsy determined a finding to contain malignant tissue types. We combine the normal and
benign labels into the negative class and the high-risk and malignant labels into the positive class. All
malignant and high-risk images had exactly one radiologist annotation, indicating the location of the biopsied
finding. These annotations were made during the course of standard clinical care for the patient. Benign
samples have zero or one annotation.
A detailed distribution of the data across the different classes can be found in Table 1.
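As a small worked example of the label grouping, the sketch below maps the four image classes to the binary labels and computes the resulting positive fraction from the training-set image counts reported in Table 1. The class-name strings and function names are hypothetical; only the counts come from the table.

```python
# Hypothetical string names for the four image classes.
POSITIVE_CLASSES = {"high_risk", "malignant"}
NEGATIVE_CLASSES = {"normal", "benign"}

def to_binary_label(image_class):
    """Map a four-way class to the binary label used for training."""
    if image_class in POSITIVE_CLASSES:
        return 1
    if image_class in NEGATIVE_CLASSES:
        return 0
    raise ValueError(f"unknown class: {image_class}")

def positive_fraction(counts):
    """Fraction of images whose class belongs to the positive group."""
    positives = sum(n for name, n in counts.items() if to_binary_label(name) == 1)
    return positives / sum(counts.values())

# Training-set image counts from Table 1.
ffdm_train = {"normal": 606080, "benign": 56660, "high_risk": 404, "malignant": 1090}
dbt_train = {"normal": 48006, "benign": 6175, "high_risk": 86, "malignant": 113}

print(f"FFDM train positive fraction: {100 * positive_fraction(ffdm_train):.2f}%")
print(f"DBT train positive fraction:  {100 * positive_fraction(dbt_train):.2f}%")
```

The positive class is well under 1% of images in both datasets, which is the class imbalance that the balanced sampling in training is meant to counteract.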
Table 1: Detailed statistics of the collected FFDM and DBT training (train), validation (val), and testing (test)
datasets. The numbers of patients, exams, and images for each set are given, as well as the distribution of
malignancy by image.
FFDM
             Train              Val              Test
Patients     49965              6239             6213
Exams        158650             19933            19618
Images       664234             83920            82296
Normal       606080 (91.3%)     76092 (90.7%)    75073 (91.2%)
Benign       56660 (8.5%)       7631 (9.1%)      7023 (8.5%)
High Risk    404 (0.1%)         41 (0.1%)        42 (0.1%)
Malignant    1090 (0.2%)        156 (0.2%)       158 (0.2%)
DBT
             Train              Val              Test
Patients     10684              1399             1357
Exams        14828              1944             1855
Images       54380              7140             6791
Normal       48006 (88.3%)      6171 (86.4%)     6058 (89.2%)
Benign       6175 (11.4%)       939 (13.2%)      689 (10.2%)
High Risk    86 (0.2%)          13 (0.2%)        15 (0.2%)
Malignant    113 (0.2%)         17 (0.2%)        29 (0.4%)
2.2 Patch Model
The DL model is based on ResNet [12], with 29 layers and approximately 6 million parameters. It accepts
a 512×512 image patch from an FFDM or MIP image and predicts the probability that the patch contains a
malignant or high-risk finding.
The original FFDM images had 4096×3328 or 3328×2560 pixels, and the MIP images had 2457×1996 or 2457×1890
pixels. Example FFDM and MIP images can be seen in Figure 1. To obtain the input
to the model, an initial patch of 1024×1024 pixels is extracted from the image and downsampled by bilinear
interpolation to 512×512 pixels, yielding a patch at half the resolution of the original image. The resulting patch
covers 7.7-12.3% of the area of the original image.
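The area fraction quoted above can be checked with simple arithmetic. The sketch below divides the area of the initial 1024×1024 patch by the area of each image size listed in the text. Under this arithmetic the 7.7-12.3% range is reproduced by the two FFDM sizes, while the two MIP sizes give roughly 21-23%, so the quoted range appears to describe the FFDM images. The names used here are illustrative only.

```python
PATCH_SIDE = 1024  # side of the initial patch, in original-resolution pixels

# Image sizes (height, width) as listed in the text.
IMAGE_SIZES = {
    "FFDM": [(4096, 3328), (3328, 2560)],
    "MIP": [(2457, 1996), (2457, 1890)],
}

def patch_area_percent(height, width, side=PATCH_SIDE):
    """Percentage of the image area covered by one side x side patch."""
    return 100.0 * side * side / (height * width)

for modality, sizes in IMAGE_SIZES.items():
    for height, width in sizes:
        percent = patch_area_percent(height, width)
        print(f"{modality} {height}x{width}: {percent:.1f}%")
```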
For samples with annotations indicating the finding location, patches are centered at the center of the
annotations. For samples without annotations, the breast is segmented using a pre-chosen threshold and a patch
is selected centered at a randomly chosen pixel within the breast. Patches are sampled such that they are always
fully contained within the image and may be translated to satisfy this criterion.
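The patch sampling rules above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the threshold value, and the toy image are assumptions. For an exact factor of two, 2×2 block averaging coincides with bilinear interpolation under the half-pixel-center sampling convention, so it is used here as a stand-in for the bilinear downsampling step.

```python
import numpy as np

def clamp_center(center, image_shape, side=1024):
    """Translate a patch center so the side x side patch lies fully inside the image."""
    half = side // 2
    row = int(np.clip(center[0], half, image_shape[0] - half))
    col = int(np.clip(center[1], half, image_shape[1] - half))
    return row, col

def random_breast_center(image, threshold, rng):
    """Pick a random pixel inside the thresholded breast region (threshold is assumed)."""
    rows, cols = np.nonzero(image > threshold)
    k = rng.integers(len(rows))
    return int(rows[k]), int(cols[k])

def extract_patch(image, center, side=1024):
    """Extract a side x side patch and downsample it by a factor of two."""
    row, col = clamp_center(center, image.shape, side)
    half = side // 2
    patch = image[row - half:row + half, col - half:col + half].astype(np.float32)
    # 2x2 block mean: equals bilinear resampling at exactly half resolution.
    return patch.reshape(side // 2, 2, side // 2, 2).mean(axis=(1, 3))

# Toy MIP-sized image with a bright rectangular "breast" region.
rng = np.random.default_rng(0)
image = np.zeros((2457, 1996), dtype=np.uint16)
image[100:2000, :900] = 1000
center = random_breast_center(image, threshold=500, rng=rng)
patch = extract_patch(image, center)
```

For annotated samples, `center` would instead be the center of the radiologist annotation; the clamping step is the same in both cases.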
2.3 FFDM Training
The base model is trained on the FFDM data, with uniform sampling of the two classes (equal probability of
sampling a positive or negative class sample). Images were augmented during training with random horizontal
and vertical flipping, additive Gaussian white noise with a standard deviation of 1.0, random translation drawn
from a Gaussian distribution with a standard deviation of 20 pixels, and random rotation drawn from a
uniform distribution from -30 to +30 degrees.
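A minimal sketch of the class-balanced sampling and the augmentation draws described above is given below. The function names and the flip probability of 0.5 are assumptions; the noise, translation, and rotation parameters are taken from the text. For brevity the sketch only draws the translation and rotation parameters and does not apply them to the image.

```python
import numpy as np

def sample_balanced_index(labels, rng):
    """Pick the class with equal probability, then a sample uniformly within it."""
    target = rng.integers(2)  # 0 (negative) or 1 (positive), each with probability 0.5
    candidates = np.flatnonzero(labels == target)
    return int(rng.choice(candidates))

def flip_and_add_noise(patch, rng, noise_std=1.0, flip_prob=0.5):
    """Random horizontal/vertical flips followed by additive Gaussian white noise."""
    if rng.random() < flip_prob:
        patch = patch[:, ::-1]
    if rng.random() < flip_prob:
        patch = patch[::-1, :]
    return patch + rng.normal(0.0, noise_std, size=patch.shape)

def draw_geometric_params(rng):
    """Draw a translation (pixels) and a rotation angle (degrees)."""
    shift = rng.normal(0.0, 20.0, size=2)   # Gaussian, standard deviation 20 pixels
    angle = rng.uniform(-30.0, 30.0)        # uniform over [-30, 30) degrees
    return shift, angle

# Toy label vector with a rare positive class.
rng = np.random.default_rng(0)
labels = np.array([0] * 1000 + [1] * 5)
drawn = np.array([labels[sample_balanced_index(labels, rng)] for _ in range(2000)])
augmented = flip_and_add_noise(np.zeros((512, 512)), rng)
shift, angle = draw_geometric_params(rng)
```

Even though positives make up well under 1% of the toy labels, roughly half of the drawn samples are positive, which is the intended effect of sampling the class first.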
The model is trained to minimize a cross entropy loss function using the Adam optimizer [13] with an initial
learning rate of $5 \times 10^{-5}$ and a weight decay of $5 \times 10^{-4}$. An epoch is defined as 40000 samples shown to the
model. The model was trained for 100 epochs, and the model chosen for evaluation is the one that maximized
the area under the receiver operating characteristic curve (ROC AUC) on the validation set.
2.4 Domain Adaptation
Figure 2: Average cumulative histograms for FFDM and MIP images. The intensity values have been scaled
so that they range from 0 to 4095 for both image types.
In domain adaptation, a model trained on one domain is adapted to another domain for which there
exists far less data. Previous work has shown that deep neural networks often learn task- and domain-agnostic
features, particularly in the earlier layers of the network [14]. When the domains and tasks are similar,
larger portions of the network may be reused.
Here, we explore the use of histogram matching to reduce the domain shift between the FFDM and DBT
domains. The FFDM patch model is adapted to DBT exams, both with and without histogram matching,
using two different fine-tuning methods.
2.4.1 Histogram Matching
Histogram matching (HM) applies a non-linear intensity transformation that maps the cumulative
histogram (c.d.f.) of one domain onto the average c.d.f. of another domain [15]. In particular, HM is
employed to transform MIP images to better match the FFDM images originally used to train the model.
Figure 1: Sample images from each domain for different malignancy classes - Normal, Benign and Malignant.
The red box indicates the location of a radiologist-annotated finding.
Algorithm 1: Histogram matching
The algorithm describes the procedure of histogram matching images from two
domains. Here, $X[i]$ represents the $i$-th pixel value of the image $X$. The
inverse mapping to a pixel value in the reference domain is performed by linear
interpolation.
Input: Source image $X_S \in [0, K-1]^N$,
       Source c.d.f. $F_S \in \mathbb{N}^K$,
       Reference c.d.f. $F_R \in \mathbb{N}^L$
Output: Histogram-matched image $X'_S \in [0, L-1]^N$
for $i$ in 0 to $N-1$ do
    $p_S = X_S[i]$
    $p'_S = F_R^{-1}(F_S(p_S))$
    $X'_S[i] = p'_S$
return $X'_S$
The procedure for HM is outlined in Algorithm 1 and given in greater detail as follows. Let $F_S$ be the c.d.f.
of the source image, whose intensity distribution is to be updated, and let $F_R$ be the c.d.f. of the reference
domain, whose intensity distribution we hope to match. Let $p_S \in [0, K-1]$ be a pixel value for the source image
and $p_R \in [0, L-1]$ be a pixel value in the reference domain such that $F_S(p_S) = F_R(p_R)$. Then, our transformed
image will have the pixel value $p'_S = F_R^{-1}(F_S(p_S))$, where the inverse mapping is calculated via linear
interpolation.
The average c.d.f. of the FFDM data was calculated over 1200 randomly chosen training samples, comprised
of equal amounts of the normal, benign and malignant classes. Similarly, the average c.d.f. of the MIP data was
calculated over 600 randomly chosen training samples, comprised of equal amounts of the normal, benign and
malignant classes. The histograms for the FFDM and MIP images can be seen in Figure 2. The application of
histogram matching can be qualitatively visualized in Figure 1.
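The histogram matching procedure can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names and the toy images are hypothetical, the c.d.f.s are normalized to [0, 1], and the inverse reference c.d.f. is evaluated with `np.interp`, which performs the linear interpolation. Both the source and reference c.d.f.s are passed in as arguments, so either a per-image or a domain-averaged source c.d.f. can be used.

```python
import numpy as np

def image_cdf(image, n_levels):
    """Normalized cumulative histogram of an integer-valued image."""
    hist = np.bincount(image.ravel(), minlength=n_levels)
    return np.cumsum(hist) / image.size

def average_cdf(images, n_levels):
    """Average of the per-image c.d.f.s over a set of images."""
    return np.mean([image_cdf(img, n_levels) for img in images], axis=0)

def match_histogram(source, cdf_source, cdf_reference):
    """Map each source pixel p to F_R^{-1}(F_S(p)) via a lookup table."""
    reference_levels = np.arange(len(cdf_reference))
    # Inverse of the reference c.d.f. by linear interpolation: for each value
    # F_S(p), find the (fractional) reference level with the same c.d.f. value.
    lookup = np.interp(cdf_source, cdf_reference, reference_levels)
    return lookup[source]

# Toy data standing in for MIP (source) and FFDM (reference) images, 12-bit range.
rng = np.random.default_rng(0)
mip_like = [rng.integers(0, 2048, size=(64, 64)) for _ in range(4)]
ffdm_like = [rng.integers(1024, 4096, size=(64, 64)) for _ in range(4)]

cdf_s = average_cdf(mip_like, n_levels=4096)
cdf_r = average_cdf(ffdm_like, n_levels=4096)
matched = match_histogram(mip_like[0], cdf_s, cdf_r)
```

Because the lookup table depends only on the two average c.d.f.s, it can be computed once and applied to every MIP image as a fixed pre-processing step, with no model training involved.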
2.4.2 Fine-tuning
Two methods were used to fine-tune the base model trained on FFDM images for use with the original or
histogram-matched MIP images. For the first approach, only the last fully connected layer of the model was
re-trained. This is referred to as the conventional fine-tuning approach. For the second approach, a version of the
SpotTune algorithm was implemented [16]. SpotTune is an adaptive fine-tuning approach that finds the optimal
fine-tuning policy (which layers to fine-tune) per instance of target data.
The underlying idea behind SpotTune is that different training samples from the target domain require
fine-tuning updates to different sets of layers in the pre-trained network. The SpotTune training procedure involves
predicting, for each training input, the specific layers to be fine-tuned and layers to be kept frozen. This
input-dependent fine-tuning approach enables targeting layers per input instance and leads to better accuracy [16]. We
refer readers to the original paper [16] for further details of SpotTune.
The fine-tuned model used the same data augmentations as the original FFDM model. The model is
fine-tuned using cross entropy loss and the Adam optimizer with a learning rate of $5 \times 10^{-5}$ and a weight decay of
$1 \times 10^{-4}$. The model chosen is the one that maximized the validation ROC AUC.
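The conventional fine-tuning approach described above, in which only the last fully connected layer is re-trained, is equivalent to fitting a logistic output layer on features produced by the frozen network. The sketch below illustrates this idea on synthetic features. It is not the authors' setup: plain gradient descent is swapped in for Adam, weight decay is omitted, and the feature dimension, step size, number of steps, and all names are assumptions.

```python
import numpy as np

def cross_entropy(features, labels, weights, bias):
    """Mean binary cross entropy of a logistic output layer."""
    logits = features @ weights + bias
    # log(1 + exp(z)) - y*z, written with logaddexp for numerical stability.
    return float(np.mean(np.logaddexp(0.0, logits) - labels * logits))

def finetune_last_layer(features, labels, weights, bias, lr=0.1, n_steps=500):
    """Update only the final layer; the features come from frozen earlier layers."""
    w, b = weights.copy(), float(bias)
    for _ in range(n_steps):
        logits = features @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad_logits = (probs - labels) / len(labels)  # gradient of the mean loss
        w -= lr * (features.T @ grad_logits)
        b -= lr * grad_logits.sum()
    return w, b

# Synthetic stand-in for penultimate-layer features of target-domain patches.
rng = np.random.default_rng(0)
features = rng.normal(size=(400, 8))
labels = (features[:, 0] > 0).astype(float)

w0, b0 = np.zeros(8), 0.0
loss_before = cross_entropy(features, labels, w0, b0)
w1, b1 = finetune_last_layer(features, labels, w0, b0)
loss_after = cross_entropy(features, labels, w1, b1)
```

Since the earlier layers are never updated, this approach can only succeed when the frozen features are already informative for the target domain.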
The performance of all models is measured on the test datasets using the area under the receiver operating
characteristic curve (ROC AUC). On the test data, we extract patches in the same way as explained in Section
2.2. Because patch extraction involves random sampling, we average the results over three random seeds. The
standard deviation of the results is used as an error estimate. A summary of all the results can be found in Table 2.
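The evaluation protocol above can be sketched as follows: compute the ROC AUC for each random realization and report the mean and standard deviation. The ROC AUC here is computed through its rank-sum (Mann-Whitney U) identity. The scores below are synthetic and carry no relation to the reported results, and the use of the population standard deviation (NumPy's default) is an assumption, since the text does not state which estimator was used.

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC AUC via the rank-sum identity, using average ranks for tied scores."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0  # average 1-based rank of the tie group
        i = j + 1
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u_statistic = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)

# Three independent realizations with synthetic labels and scores.
aucs = []
for seed in range(3):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=200)
    scores = labels + rng.normal(0.0, 1.0, size=200)
    aucs.append(roc_auc(labels, scores))

print(f"ROC AUC: {np.mean(aucs):.3f} ± {np.std(aucs):.3f}")
```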
Table 2: Performance of the models for different domains. Results are shown on a test set for which both FFDM
and DBT/MIP images are available. MIP with HM refers to MIP images pre-processed to look more like FFDM
images. Errors shown here refer to the standard deviation over 3 independent realizations of patch extraction.
Training Data   Testing Data   Procedure            ROC AUC
FFDM            FFDM           Train from scratch   0.909 ± 0.001
FFDM            MIP            Test only            0.751 ± 0.001
FFDM            MIP            Fine-tune            0.759 ± 0.003
FFDM            MIP            SpotTune [16]        0.825 ± 0.002
FFDM            MIP with HM    Test only            0.847 ± 0.001
FFDM            MIP with HM    Fine-tune            0.837 ± 0.001
FFDM            MIP with HM    SpotTune [16]        0.830 ± 0.002
The base FFDM patch model has a ROC AUC of 0.909 on the FFDM images. For MIP images, the
performance of the base model drops to a ROC AUC of 0.751. If the MIP images are pre-processed using
the average histogram matching method, the ROC AUC goes up to 0.847. This shows that our simple fixed
non-linear transformation via histogram matching reduces the domain shift considerably.
In order to improve performance further, we apply fine-tuning and SpotTune [16], both with and without HM.
Fine-tuning on the limited number of regular MIP images does not show any improvement in performance;
however, SpotTune leads to a ROC AUC of 0.825. When fine-tuned on MIP images with HM, fine-tuning
improves ROC AUC from 0.759 to 0.837; however, SpotTune leads only to a minor improvement in ROC AUC
from 0.825 to 0.830. Overall, we find that the simple strategy of only pre-processing via histogram matching
leads to the best ROC AUC of 0.847 on MIP images.
Figure 3: Visualization of SpotTune policies to re-use or fine-tune a residual block for MIP images with and
without HM.
SpotTune learns a policy per sample for selecting a ResNet block for fine-tuning. For two cases, with and
without HM, the probability of fine-tuning different ResNet blocks can be seen in Figure 3. With HM, the
relative gap in performance of SpotTune vs. Test-only is small, perhaps due to the limited number of ResNet
blocks that are modified by the SpotTune algorithm. Without HM, the gap is large and a significant portion
of the blocks are updated.
This observation also offers insight into the effectiveness of conventional fine-tuning with and without
HM. Without HM, fine-tuning the last layer of the network is unable to improve performance, as more extensive
changes to the network are needed, as indicated by the large number of ResNet blocks changed by the
SpotTune algorithm. With HM, the performance of SpotTune and conventional fine-tuning are more similar,
as only the late layers need to be extensively modified.
We present our work on the adaptation of a patch-level deep learning malignancy model from FFDM to
DBT exams. The original model was trained and evaluated on a large set of FFDM images and the model
was adapted using methods requiring few DBT exams. In particular, by incorporating histogram-level changes
in image features, we can achieve good classification performance without additional training.
A prior study [17] considered the use of transfer learning for malignancy classification for FFDM and DBT
exams. However, that study examined transfer learning for a model that was not initially trained on medical
images. It also mainly focused on malignant and benign patches (4× smaller than our patches) and employed
a much smaller dataset. A very recent study [18] considered domain adaptation from FFDM to DBT images, but
relied on a large multi-site dataset of DBT images. It is unclear how effective that approach would be if the
DBT dataset were smaller.
A deep learning malignancy model was trained to identify high risk or malignant findings in patches of full-field
digital mammography (FFDM) images. Several strategies were evaluated to adapt this model to the maximum
intensity projections (MIP) of digital breast tomosynthesis (DBT) exams. The effectiveness of each domain
adaptation approach depended strongly on the amount of domain shift and the amount of available training
data. Without HM, the amount of domain shift was large and SpotTune was the most effective even with limited
training data. However, following HM, the domain shift was much smaller and the simplest approach proved best.
Histogram matching can, therefore, be an effective strategy for domain adaptation when the amount of available
training data in the target domain is limited. This approach is simple and intuitive and can be easily adapted
to other problems in medical imaging where obtaining a large amount of labeled data for every image modality
is difficult.
In this study, we focus exclusively on domain adaptation for patch-level models. Thus, future work includes
using these adaptation techniques and patch models to train whole-image models, which provide a malignancy
probability for the entire image and, eventually, the entire examination. Another interesting approach is learning
the cyclic transformation via a CycleGAN [19]. This generative model can learn how to transform a 3D DBT image
into a 2D image from the same distribution as the original FFDM images (and vice versa). By learning this
conditional distribution, we could synthesize 2D images that maintain the important, subtle features that may
have been lost by using maximum intensity projection.
This work has not been submitted to any journal or conference for publication or presentation considerations.
This work was supported by Whiterabbit AI, Inc. The following authors are employed by and/or have a
financial interest in Whiterabbit: Sadanand Singh, Thomas Paul Matthews, Meet Shah, Brent Mombourquette,
Trevor Tsue, Aaron Long, Stefano Pedemonte, and Jason Su.
[1] Siegel, R. L., Miller, K. D., and Jemal, A., “Cancer statistics, 2019,” CA: A Cancer Journal for
Clinicians 69(1), 7–34 (2019).
[2] Tabar, L., Dean, P. B., Chen, T. H., Yen, A. M., Chen, S. L., Fann, J. C., Chiu, S. Y., Ku, M. M., Wu, W. Y.,
Hsu, C. Y., Chen, Y. C., Beckmann, K., Smith, R. A., and Duffy, S. W., “The incidence of fatal breast
cancer measures the increased effectiveness of therapy in women participating in mammography screening,”
Cancer 125(4), 515–523 (2019).
[3] Smith, R. A., Duffy, S. W., Gabe, R., Tabar, L., Yen, A. M., and Chen, T. H., “The randomized trials
of breast cancer screening: what have we learned?,” Radiologic Clinics of North America 42(5), 793–806
[4] Tabár, L., Yen, A. M.-F., Wu, W. Y.-Y., Chen, S. L.-S., Chiu, S. Y.-H., Fann, J. C.-Y., Ku, M. M.-S., Smith,
R. A., Duffy, S. W., and Chen, T. H.-H., “Insights from the breast cancer screening trials: How screening
affects the natural history of breast cancer and implications for evaluating service screening programs,” The
Breast Journal 21(1), 13–20 (2015).
[5] Webb, M. L., Cady, B., Michaelson, J. S., Bush, D. M., Calvillo, K. Z., Kopans, D. B., and Smith, B. L., “A
failure analysis of invasive breast cancer: Most deaths from disease occur in women not regularly screened,”
Cancer 120(18), 2839–2846 (2014).
[6] Richman, I. B., Hoag, J. R., Xu, X., Forman, H. P., Hooley, R., Busch, S. H., and Gross, C. P., “Adoption
of Digital Breast Tomosynthesis in Clinical Practice,” JAMA Intern Med 179, 1292–1295 (2019).
[7] Hooley, R. J., Durand, M. A., and Philpotts, L. E., “Advances in digital breast tomosynthesis,” American
Journal of Roentgenology 208(2), 256–266 (2017).
[8] Lee, Y. Z., Puett, C., Inscoe, C. R., Jia, B., Kim, C., Walsh, R., Yoon, S., Kim, S. J., Kuzmiak, C. M.,
Zeng, D., Lu, J., and Zhou, O., “Initial clinical experience with stationary digital breast tomosynthesis,”
Academic Radiology 26, 1363–1373 (2019).
[9] Friedewald, S. M., Rafferty, E. A., Rose, S. L., Durand, M. A., Plecha, D. M., Greenberg, J. S., Hayes, M. K.,
Copit, D. S., Carlson, K. L., Cink, T. M., Barke, L. D., Greer, L. N., Miller, D. P., and Conant, E. F., “Breast
Cancer Screening Using Tomosynthesis in Combination with Digital Mammography,” JAMA 311(24), 2499–
2507 (2014).
[10] Zhu, W., Lou, Q., Vang, Y. S., and Xie, X., “Deep multi-instance networks with sparse label assignment
for whole mammogram classification,” in [Medical Image Computing and Computer Assisted Intervention -
MICCAI 2017], 603–611 (2017).
[11] Wu, N., Phang, J., Park, J., Shen, Y., Huang, Z., Zorin, M., Jastrzębski, S., Févry, T., Katsnelson, J.,
Kim, E., Wolfson, S., Parikh, U., Gaddam, S., Lin, L. L. Y., Ho, K., Weinstein, J. D., Reig, B., Gao, Y.,
Toth, H., Pysarenko, K., Lewin, A., Lee, J., Airola, K., Mema, E., Chung, S., Hwang, E., Samreen, N.,
Kim, S. G., Heacock, L., Moy, L., Cho, K., and Geras, K. J., “Deep neural networks improve radiologists’
performance in breast cancer screening,” IEEE Trans. Medical Imaging (2019). (pre-print).
[12] He, K., Zhang, X., Ren, S., and Sun, J., “Deep residual learning for image recognition,” in [2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR)], 770–778 (2016).
[13] Kingma, D. P. and Ba, J., “Adam: A Method for Stochastic Optimization,” in [International Conference
on Learning Representations (ICLR)], (2015).
[14] Yosinski, J., Clune, J., Bengio, Y., and Lipson, H., “How transferable are features in deep neural networks?,”
in [Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume
2], 3320–3328 (2014).
[15] Gonzalez, R., [Digital image processing], Pearson, London, 4 ed. (March 2017).
[16] Guo, Y., Shi, H., Kumar, A., Grauman, K., Rosing, T., and Feris, R., “Spottune: transfer learning through
adaptive fine-tuning,” in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition],
4805–4814 (2019).
[17] Mendel, K., Li, H., Sheth, D., and Giger, M., “Transfer learning from convolutional neural networks for
computer-aided diagnosis: A comparison of digital breast tomosynthesis and full-field digital mammography,”
Academic Radiology 26(6), 735–743 (2019).
[18] Lotter, W., Diab, A. R., Haslam, B., Kim, J. G., Grisot, G., Wu, E., Wu, K., Onieva, J. O., Boxerman,
J. L., Wang, M., Bandler, M., Vijayaraghavan, G., and Sorensen, A. G., “Robust breast cancer detection
in mammography and digital breast tomosynthesis using annotation-efficient deep learning approach,”
arXiv:1912.11027 [cs, eess] (Dec. 2019).
[19] Zhu, J., Park, T., Isola, P., and Efros, A. A., “Unpaired image-to-image translation using cycle-consistent
adversarial networks,” in [2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)],
2223–2232 (2017).
... First, many of them work with synthesized 2D images rather than the entire 3D image. These images are generated from the DBT images by proprietary algorithms bundled with the scanner [15] or by third-party methods aggregating information from all slices such as dynamic feature image [16] or maximum intensity projection [17]. In addition, there exist approaches which create attention-weighted summaries of small subsets of DBT slices, resulting in multiple "slabs" that resemble 2D mammography [18]. ...
... Second, the studies that use DBT instead of synthetic 2D images still do not utilize the entire image at once in training. Some utilize a subset of DBT slices [19] whereas others process small region of interest (ROI) patches pre-extracted from lesions [17], [20], [21]. A model utilizing a subset of slices risks missing those potentially most informative to the diagnosis. ...
... Third, some of the works require more detailed labels such as bounding-boxes or pixel-level segmentations [12], [17], [20]- [24]. While training the model with pixel or lesion-level labels may improve the performance compared to just using image-level labels, the former are more labor-intensive and time-consuming to collect, especially for 3D data. ...
Full-text available
3D imaging enables accurate diagnosis by providing spatial information about organ anatomy. However, using 3D images to train AI models is computationally challenging because they consist of 10x or 100x more pixels than their 2D counterparts. To be trained with high-resolution 3D images, convolutional neural networks resort to downsampling them or projecting them to 2D. We propose an effective alternative, a neural network that enables efficient classification of full-resolution 3D medical images. Compared to off-the-shelf convolutional neural networks, our network, 3D Globally-Aware Multiple Instance Classifier (3D-GMIC), uses 77.98%-90.05% less GPU memory and 91.23%-96.02% less computation. While it is trained only with image-level labels, without segmentation labels, it explains its predictions by providing pixel-level saliency maps. On a dataset collected at NYU Langone Health, including 85,526 patients with full-field 2D mammography (FFDM), synthetic 2D mammography, and 3D mammography, 3D-GMIC achieves an AUC of 0.831 (95% CI: 0.769-0.887) in classifying breasts with malignant findings using 3D mammography. This is comparable to the performance of GMIC on FFDM (0.816, 95% CI: 0.737-0.878) and synthetic 2D (0.826, 95% CI: 0.754-0.884), which demonstrates that 3D-GMIC successfully classified large 3D images despite focusing computation on a smaller percentage of its input compared to GMIC. Therefore, 3D-GMIC identifies and utilizes extremely small regions of interest from 3D images consisting of hundreds of millions of pixels, dramatically reducing associated computational challenges. 3D-GMIC generalizes well to BCS-DBT, an external dataset from Duke University Hospital, achieving an AUC of 0.848 (95% CI: 0.798-0.896).
... Recent studies have already addressed these issues by exploiting mammography data to enhance DBT breast cancer detection. For instance, various works have demonstrated that pretraining on mammography data, followed by finetuning on DBT, substantially enhances DBT lesion detection [19,18,20]. While these methods outperform random initialization or natural images pretraining, they are notably vulnerable to catastrophic forgetting [15]. ...
Digital Breast Tomosynthesis (DBT) is an advanced breast imaging modality that offers superior lesion detection accuracy compared to conventional mammography, albeit at the trade-off of longer reading time. Accelerating lesion detection from DBT using deep learning is hindered by limited data availability and huge annotation costs. A possible solution to this issue could be to leverage the information provided by a more widely available modality, such as mammography, to enhance DBT lesion detection. In this paper, we present a novel framework, CoMoTo, for improving lesion detection in DBT. Our framework leverages unpaired mammography data to enhance the training of a DBT model, improving practicality by eliminating the need for mammography during inference. Specifically, we propose two novel components, Lesion-specific Knowledge Distillation (LsKD) and Intra-modal Point Alignment (ImPA). LsKD selectively distills lesion features from a mammography teacher model to a DBT student model, disregarding background features. ImPA further enriches LsKD by ensuring the alignment of lesion features within the teacher before distilling knowledge to the student. Our comprehensive evaluation shows that CoMoTo is superior to traditional pretraining and image-level KD, improving performance by 7% Mean Sensitivity under low-data setting. Our code is available at .
... The researchers compared these methods using both histogram-matched MIP images and original MIP images. The findings revealed that the best performance, in terms of (AUC = 0.847) was achieved by fine-tuning the last two layers with MIP-HM [25]. ...
Full-text available
Background: Breast cancer is caused by the uncontrolled growth of abnormal cells, resulting in a mass in the breast tissue. Early detection of the disease can significantly improve its prognosis and treatment outcome. Digital Breast Tomosynthesis, a three-dimensional imaging technology of the breast tissue, is emerging as a standard for breast imaging with improved screening and diagnostic results. The additional information obtained from tomosynthesis reduces the misleading effects of tissue overlap and improves the detection, identification, and localization of abnormalities. Benign-malignant breast tumors classification at early detection is a crucial step to improve diagnosis and prolong patient survival. The aim of this study is to classify benign-malignancy of masses in Digital Breast Tomosynthesis images based on new DCT-DOST and Radiomic features on an open database of TCIA that consist of 224 lesion bounding boxes. Methods: In order to effectively obtain both DCT-DOST and Radiomics characteristics, a 2D central slice of the DBT image, which encompasses a substantial anatomical size of the breast tumor, was employed. In the pre-processing stage, rescale intensity method was used to improve contrast and image quality. After that, by using a binary mask, the mass of the breast tissue was subjected to segmentation. Then, using radiomic and DCT-DOST features, new features were extracted from the images. In addition, we also investigated the importance of feature selection and class balancing. By pooling the radiomics and DCT-DOST features into a hybrid feature set, we investigated the compatibility of these two sets with respect to benign malignancy prediction. 
Results: Finally, for classification using Random Forest algorithm, K-Nearest Neighbor and Support Vector Machine, it was shown that the best result of the evaluation metrics are respectively equal to 87.80%, 78.51%, 82.78% and 75.19% in terms of mean AUC, Accuracy, Sensitivity and Specificity which was obtained by Random Forest classifier. Conclusion: The findings from the empirical analysis reveal that integrating DCT-DOST and Radiomic features into the same learning algorithm improves the discrimination power.
... A model trained on FFDM was proposed in [45], which can be used for DBT. The model was based on the ResNet model. ...
Full-text available
Background Recent advancements in computing power and state-of-the-art algorithms have helped in more accessible and accurate diagnosis of numerous diseases. In addition, the development of de novo areas in imaging science, such as radiomics and radiogenomics, have been adding more to personalize healthcare to stratify patients better. These techniques associate imaging phenotypes with the related disease genes. Various imaging modalities have been used for years to diagnose breast cancer. Nonetheless, digital breast tomosynthesis (DBT), a state-of-the-art technique, has produced promising results comparatively. DBT, a 3D mammography, is replacing conventional 2D mammography rapidly. This technological advancement is key to AI algorithms for accurately interpreting medical images. Objective and methods This paper presents a comprehensive review of deep learning (DL), radiomics and radiogenomics in breast image analysis. This review focuses on DBT, its extracted synthetic mammography (SM), and full-field digital mammography (FFDM). Furthermore, this survey provides systematic knowledge about DL, radiomics, and radiogenomics for beginners and advanced-level researchers. Results A total of 500 articles were identified, with 30 studies included as the set criteria. Parallel benchmarking of radiomics, radiogenomics, and DL models applied to the DBT images could allow clinicians and researchers alike to have greater awareness as they consider clinical deployment or development of new models. This review provides a comprehensive guide to understanding the current state of early breast cancer detection using DBT images. Conclusion Using this survey, investigators with various backgrounds can easily seek interdisciplinary science and new DL, radiomics, and radiogenomics directions towards DBT.
Breast cancer prediction is a critical area of research aimed at improving early detection and enhancing treatment strategies. With the fast development of Machine Learning techniques, interest in leveraging these algorithms for accurate and efficient breast cancer prediction has increased dramatically. This survey paper gives a comprehensive overview of the state-of-the-art Machine Learning approaches employed in breast cancer prediction. The study analyzes a wide range of research studies, methodologies, and datasets to present a complete picture of the state of the field, the problems it faces, and where it is going. Diverse Machine Learning techniques, including deep learning models, SVMs, random forests, ANNs, and ensemble methods, are explored in terms of their strengths, weaknesses, and the specific breast cancer prediction tasks to which they have been applied. The study also discusses the diverse input data modalities used, ranging from traditional mammograms and histopathological images to genomics and proteomics data. Challenges such as dataset imbalance, feature selection, interpretability, and generalizability are examined, along with proposed solutions and prospective directions for research. This survey paper aims to help scientists, doctors, and others in the healthcare field understand the advancements and potential of predicting breast cancer with Machine Learning, contributing to the development of more precise and dependable predictive models for improved patient outcomes.
Difference of Gaussians (DoG) convolutional filters are one of the earliest image processing methods employed for detecting microcalcifications on mammogram images, predating the widespread use of machine and deep learning methods. DoG is a blob enhancement filter that subtracts one Gaussian-smoothed version of an image from another, less Gaussian-smoothed version of the same image. Smoothing with a Gaussian kernel suppresses high-frequency spatial information, so DoG can be regarded as a band-pass filter. However, due to their small size and the superimposed breast tissue, microcalcifications vary greatly in contrast-to-noise ratio and sharpness. This makes it difficult to find a single DoG configuration that enhances all microcalcifications. In this work, we propose a convolutional network, named DoG-MCNet, where the first layer automatically learns a bank of DoG filters parameterized by their associated standard deviations. We experimentally show that when employed for microcalcification detection, our DoG layer acts as a learnable bank of band-pass preprocessing filters and improves detection performance by 4.86% AUFROC over baseline MCNet and 1.53% AUFROC over a state-of-the-art multicontext ensemble of CNNs.
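The band-pass behavior described in the abstract above can be illustrated with a minimal sketch. It is a fixed, one-dimensional DoG written with the standard library only, not the learnable DoG-MCNet layer. The function names, the 3-sigma kernel radius, and the edge-replication padding are all assumptions made for illustration.

```python
import math


def gaussian_kernel(sigma):
    """Return a normalized 1D Gaussian kernel (weights sum to 1).

    The 3-sigma truncation radius is an illustrative assumption.
    """
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Convolve a 1D signal with a Gaussian, replicating the edge values."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index at borders
            acc += w * signal[j]
        out.append(acc)
    return out


def difference_of_gaussians(signal, sigma_narrow, sigma_wide):
    """Less-smoothed signal minus more-smoothed signal (band-pass response)."""
    if sigma_narrow >= sigma_wide:
        raise ValueError("sigma_narrow must be smaller than sigma_wide")
    narrow = smooth(signal, sigma_narrow)
    wide = smooth(signal, sigma_wide)
    return [a - b for a, b in zip(narrow, wide)]


# A flat background gives (near) zero response; a small bright spot stands out.
flat = [5.0] * 41
spot = [0.0] * 41
spot[20] = 1.0
flat_response = difference_of_gaussians(flat, 1.0, 3.0)
spot_response = difference_of_gaussians(spot, 1.0, 3.0)
peak_index = max(range(len(spot_response)), key=spot_response.__getitem__)
print(max(abs(v) for v in flat_response), peak_index)
```

Because both kernels are normalized, the constant background cancels, while a structure close in size to the narrow kernel survives the subtraction. A single pair of standard deviations enhances only one size range, which is the limitation the abstract gives as motivation for learning a bank of such filters.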
Cancer, including breast cancer, is a major cause of death brought on by abnormal cell proliferation in the body. It poses a significant threat to the safety and health of people globally. Several imaging methods, such as mammography, CT scans, MRI, and ultrasound, as well as biopsies, can help detect breast cancer. A biopsy is commonly performed in histopathology to examine an image and assist in diagnosing breast cancer. However, accurately identifying the appropriate Region of Interest (ROI) remains challenging due to the complexity of the pre-processing, feature extraction, segmentation, and other conventional machine learning phases. This reduces the system's efficiency and accuracy. To reduce the variability that exists among viewers, the aim of this work is to build superior deep-learning algorithms. This research introduces a classifier that can detect and classify images simultaneously, without any human involvement. It employs a transfer-driven ensemble learning approach, where the framework comprises two main phases: production and detection of pseudo-color images, and segmentation based on an ROI Pooling CNN, which then feeds its output to ensemble models such as Efficientnet, ResNet101, and VGG19. Before the feature extraction process, data augmentation is necessary, involving minor adjustments such as random cropping, horizontal flipping, and color space augmentations. Implementing the proposed segmentation and classification algorithms in a decision-making framework could decrease the frequency of incorrect diagnoses and enhance classification accuracy. This could aid pathologists in obtaining a second opinion and facilitate the early identification of diseases. With a prediction accuracy of 98.3%, the proposed method outperforms the individual pre-trained models, namely Efficientnet, ResNet101, VGG16, and VGG19, by 2.3%, 1.71%, 2.01%, and 1.47%, respectively.
Importance: An accurate and robust artificial intelligence (AI) algorithm for detecting cancer in digital breast tomosynthesis (DBT) could significantly improve detection accuracy and reduce health care costs worldwide. Objectives: To make training and evaluation data for the development of AI algorithms for DBT analysis available, to develop well-defined benchmarks, and to create publicly available code for existing methods. Design, setting, and participants: This diagnostic study is based on a multi-institutional international grand challenge in which research teams developed algorithms to detect lesions in DBT. A data set of 22 032 reconstructed DBT volumes was made available to research teams. Phase 1, in which teams were provided 700 scans from the training set, 120 from the validation set, and 180 from the test set, took place from December 2020 to January 2021, and phase 2, in which teams were given the full data set, took place from May to July 2021. Main outcomes and measures: The overall performance was evaluated by mean sensitivity for biopsied lesions using only DBT volumes with biopsied lesions; ties were broken by including all DBT volumes. Results: A total of 8 teams participated in the challenge. The team with the highest mean sensitivity for biopsied lesions was the NYU B-Team, with 0.957 (95% CI, 0.924-0.984), and the second-place team, ZeDuS, had a mean sensitivity of 0.926 (95% CI, 0.881-0.964). When the results were aggregated, the mean sensitivity for all submitted algorithms was 0.879; for only those who participated in phase 2, it was 0.926. Conclusions and relevance: In this diagnostic study, an international competition produced algorithms with high sensitivity for using AI to detect lesions on DBT images. 
A standardized performance benchmark for the detection task using publicly available clinical imaging data was released, with detailed descriptions and analyses of submitted algorithms accompanied by a public release of their predictions and code for selected methods. These resources will serve as a foundation for future research on computer-assisted diagnosis methods for DBT, significantly lowering the barrier of entry for new researchers.
Breast cancer remains a global challenge, causing over 600,000 deaths in 2018 (ref. 1). To achieve earlier cancer detection, health organizations worldwide recommend screening mammography, which is estimated to decrease breast cancer mortality by 20–40% (refs. 2,3). Despite the clear value of screening mammography, significant false positive and false negative rates along with non-uniformities in expert reader availability leave opportunities for improving quality and access (refs. 4,5). To address these limitations, there has been much recent interest in applying deep learning to mammography (refs. 6–18), and these efforts have highlighted two key difficulties: obtaining large amounts of annotated training data and ensuring generalization across populations, acquisition equipment and modalities. Here we present an annotation-efficient deep learning approach that (1) achieves state-of-the-art performance in mammogram classification, (2) successfully extends to digital breast tomosynthesis (DBT; ‘3D mammography’), (3) detects cancers in clinically negative prior mammograms of patients with cancer, (4) generalizes well to a population with low screening rates and (5) outperforms five out of five full-time breast-imaging specialists with an average increase in sensitivity of 14%. By creating new ‘maximum suspicion projection’ (MSP) images from DBT data, our progressively trained, multiple-instance learning approach effectively trains on DBT exams using only breast-level labels while maintaining localization-based interpretability. Altogether, our results demonstrate promise towards software that can improve the accuracy of and access to screening mammography worldwide.
We present a deep convolutional neural network for breast cancer screening exam classification, trained and evaluated on over 200,000 exams (over 1,000,000 images). Our network achieves an AUC of 0.895 in predicting the presence of cancer in the breast, when tested on the screening population. We attribute the high accuracy to a few technical advances. (i) Our network’s novel two-stage architecture and training procedure, which allows us to use a high-capacity patch-level network to learn from pixel-level labels alongside a network learning from macroscopic breast-level labels. (ii) A custom ResNet-based network used as a building block of our model, whose balance of depth and width is optimized for high-resolution medical images. (iii) Pretraining the network on screening BI-RADS classification, a related task with more noisy labels. (iv) Combining multiple input views in an optimal way among a number of possible choices. To validate our model, we conducted a reader study with 14 readers, each reading 720 screening mammogram exams, and show that our model is as accurate as experienced radiologists when presented with the same data. We also show that a hybrid model, averaging the probability of malignancy predicted by a radiologist with a prediction of our neural network, is more accurate than either of the two separately. To further understand our results, we conduct a thorough analysis of our network’s performance on different subpopulations of the screening population, the model’s design, training procedure, errors, and properties of its internal representations. Our best models are publicly available at
Background Women and their health care providers need a reliable answer to this important question: If a woman chooses to participate in regular mammography screening, then how much will this choice improve her chances of avoiding a death from breast cancer compared with women who choose not to participate? Methods To answer this question, we used comprehensive registries for population, screening history, breast cancer incidence, and disease‐specific death data in a defined population in Dalarna County, Sweden. The annual incidence of breast cancer was calculated along with the annual incidence of breast cancers that were fatal within 10 and within 11 to 20 years of diagnosis among women aged 40 to 69 years who either did or did not participate in mammography screening during a 39‐year period (1977‐2015). For an additional comparison, corresponding data are presented from 19 years of the prescreening period (1958‐1976). All patients received stage‐specific therapy according to the latest national guidelines, irrespective of the mode of detection. Results The benefit for women who chose to participate in an organized breast cancer screening program was a 60% lower risk of dying from breast cancer within 10 years after diagnosis (relative risk, 0.40; 95% confidence interval, 0.34‐0.48) and a 47% lower risk of dying from breast cancer within 20 years after diagnosis (relative risk, 0.53; 95% confidence interval, 0.44‐0.63) compared with the corresponding risks for nonparticipants. Conclusions Although all patients with breast cancer stand to benefit from advances in breast cancer therapy, the current results demonstrate that women who have participated in mammography screening obtain a significantly greater benefit from the therapy available at the time of diagnosis than do those who have not participated.
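The percentage reductions quoted in the abstract above follow directly from the reported relative risks, since a relative risk RR corresponds to a (1 - RR) × 100% lower risk. The short sketch below reproduces that arithmetic. The function name is hypothetical, and only the two relative risks stated in the abstract are used.

```python
def risk_reduction_percent(relative_risk):
    """Convert a relative risk into a percentage risk reduction."""
    if relative_risk < 0:
        raise ValueError("relative risk cannot be negative")
    return (1.0 - relative_risk) * 100.0


# Relative risks as reported in the abstract (participants vs. nonparticipants).
reported = [("within 10 years", 0.40), ("within 20 years", 0.53)]
for label, rr in reported:
    print(f"{label}: RR {rr:.2f} -> {risk_reduction_percent(rr):.0f}% lower risk")
```

This prints a 60% reduction for the 10-year window and a 47% reduction for the 20-year window, matching the figures in the abstract.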
Rationale and objectives: With the growing adoption of digital breast tomosynthesis (DBT) in breast cancer screening, we compare the performance of deep learning computer-aided diagnosis on DBT images to that of conventional full-field digital mammography (FFDM). Materials and methods: In this study, we retrospectively collected FFDM and DBT images of 78 biopsy-proven lesions from 76 patients. A region of interest was selected for each lesion on FFDM, synthesized 2D, and DBT key slice images. Features were extracted from each lesion using a pretrained convolutional neural network (CNN) and served as input to a support vector machine classifier trained in the task of predicting likelihood of malignancy. Results: From receiver operating characteristic (ROC) analysis of all 78 lesions, the synthesized 2D image performed best in both the craniocaudal view (area under the ROC curve [AUC] = 0.81, SE = 0.05) and mediolateral oblique view (AUC = 0.88, SE = 0.04) in the task of lesion characterization. When craniocaudal and mediolateral oblique data of each lesion were merged through soft voting, the DBT key slice image performed best (AUC = 0.89, SE = 0.04). When only masses and architectural distortions (ARDs) were considered, DBT performed significantly better than FFDM (p = 0.024). Conclusion: DBT performed significantly better than FFDM in the merged view classification of mass and ARD lesions. The increased performance suggests that the information extracted by the CNN from DBT images may be more relevant to lesion malignancy status than the information extracted from FFDM images. Therefore, this study provides supporting evidence for the efficacy of computer-aided diagnosis on DBT in the evaluation of mass and ARD lesions.
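Soft voting, as mentioned in the abstract above, combines classifier outputs by averaging predicted probabilities rather than counting hard labels. The sketch below assumes an equal-weight average over the two views of one lesion. The abstract does not state the weighting, so the `soft_vote` helper, its optional `weights` argument, and the example probabilities are illustrative only.

```python
def soft_vote(view_probabilities, weights=None):
    """Merge per-view malignancy probabilities into a single lesion-level score.

    view_probabilities: one probability per view (for example CC and MLO).
    weights: optional per-view weights; equal weighting is assumed by default.
    """
    if not view_probabilities:
        raise ValueError("at least one view probability is required")
    if weights is None:
        weights = [1.0] * len(view_probabilities)
    if len(weights) != len(view_probabilities):
        raise ValueError("weights and probabilities must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(p * w for p, w in zip(view_probabilities, weights))
    return weighted_sum / total_weight


# Hypothetical classifier outputs for the two views of one lesion.
cc_probability = 0.8
mlo_probability = 0.6
merged = soft_vote([cc_probability, mlo_probability])
print(f"merged malignancy score: {merged:.2f}")
```

Averaging probabilities keeps the confidence of each view in the merged score, which is why the merged score can then be passed to ROC analysis just like a single-view score.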
Mammogram classification is directly related to computer-aided diagnosis of breast cancer. Traditional methods rely on regions of interest (ROIs) which require great efforts to annotate. Inspired by the success of using deep convolutional features for natural image analysis and multi-instance learning (MIL) for labeling a set of instances/patches, we propose end-to-end trained deep multi-instance networks for mass classification based on whole mammogram without the aforementioned ROIs. We explore three different schemes to construct deep multi-instance networks for whole mammogram classification. Experimental results on the INbreast dataset demonstrate the robustness of proposed networks compared to previous work using segmentation and detection annotations.
Although digital mammography has been widely used for breast cancer screening for more than a decade, it has imperfect sensitivity and specificity. A newer technology, digital breast tomosynthesis (DBT), may have a lower recall rate and a higher cancer detection rate than 2-dimensional mammography, although most studies of DBT were retrospective and did not evaluate long-term health outcomes.¹ The use of DBT has some important trade-offs compared with 2-dimensional mammography, including higher costs and higher radiation dose with some machines.² Although the US Preventive Services Task Force and the American Cancer Society have not specifically endorsed DBT for routine breast cancer screening, citing insufficient evidence, the American College of Radiology supports its use.³,⁴ Our objectives were to describe adoption of DBT for breast cancer screening in a large privately insured population, characterize regional patterns of adoption, and identify regional-level characteristics associated with that adoption.
Rationale and objectives: A linear array of carbon nanotube-enabled x-ray sources allows for stationary digital breast tomosynthesis (sDBT), during which projection views are collected without the need to move the x-ray tube. This work presents our initial clinical experience with a first-generation sDBT device. Materials and methods: Following informed consent, women with a "suspicious abnormality" (Breast Imaging Reporting and Data System 4), discovered by digital mammography and awaiting biopsy, were also imaged by the first-generation sDBT device. Four radiologists participated in this paired-image study, completing questionnaires while interpreting the mammograms and sDBT image stacks. Areas under the receiver operating characteristic curve were used to measure reader performance (likelihood of correctly identifying malignancy based on pathology as ground truth), while a multivariate analysis assessed preference, as readers compared one modality to the next when interpreting diagnostically important image features. Results: Findings from 43 women were available for analysis, in whom 12 cases of malignancy were identified by pathology. The mean area under the receiver operating characteristic curve was significantly higher (p < 0.05) for sDBT than mammography for all breast density categories and breast thicknesses. Additionally, readers preferred sDBT over mammography when evaluating mass margins and shape, architectural distortion, and asymmetry, but preferred mammography when characterizing microcalcifications. Conclusion: Readers preferred sDBT over mammography when interpreting soft-tissue breast features and were diagnostically more accurate using images generated by sDBT in a Breast Imaging Reporting and Data System 4 population. However, the findings also demonstrated the need to improve microcalcification conspicuity, which is guiding both technological and image-processing design changes in future sDBT devices.
Many deep neural networks trained on natural images exhibit a curious phenomenon in common: on the first layer they learn features similar to Gabor filters and color blobs. Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks. Features must eventually transition from general to specific by the last layer of the network, but this transition has not been studied extensively. In this paper we experimentally quantify the generality versus specificity of neurons in each layer of a deep convolutional neural network and report a few surprising results. Transferability is negatively affected by two distinct issues: (1) the specialization of higher layer neurons to their original task at the expense of performance on the target task, which was expected, and (2) optimization difficulties related to splitting networks between co-adapted neurons, which was not expected. In an example network trained on ImageNet, we demonstrate that either of these two issues may dominate, depending on whether features are transferred from the bottom, middle, or top of the network. We also document that the transferability of features decreases as the distance between the base task and target task increases, but that transferring features even from distant tasks can be better than using random features. A final surprising result is that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.