Deep Learning for Identifying Metastatic Breast Cancer

The International Symposium on Biomedical Imaging (ISBI) held a grand challenge to evaluate computational systems for the automated detection of metastatic breast cancer in whole slide images of sentinel lymph node biopsies. Our team won both competitions in the grand challenge, obtaining an area under the receiver operating characteristic curve (AUC) of 0.925 for the task of whole slide image classification and a score of 0.7051 for the tumor localization task. A pathologist independently reviewed the same images, obtaining a whole slide image classification AUC of 0.966 and a tumor localization score of 0.733. Combining our deep learning system's predictions with the human pathologist's diagnoses increased the pathologist's AUC to 0.995, representing an approximately 85 percent reduction in human error rate. These results demonstrate the power of using deep learning to produce significant improvements in the accuracy of pathological diagnoses.
Dayong Wang, Aditya Khosla*, Rishab Gargeya, Humayun Irshad, Andrew H Beck
Beth Israel Deaconess Medical Center, Harvard Medical School
*CSAIL, Massachusetts Institute of Technology
1. Introduction
The medical specialty of pathology is tasked with pro-
viding definitive disease diagnoses to guide patient treat-
ment and management decisions [4]. Standardized, accu-
rate and reproducible pathological diagnoses are essential
for advancing precision medicine. Since the mid-19th century,
the primary tool used by pathologists to make diagnoses
has been the microscope [1]. Limitations of the qualitative
visual analysis of microscopic images include lack of
standardization, diagnostic errors, and the significant cognitive
load required to manually evaluate millions of cells
across hundreds of slides in a typical pathologist’s workday
[15,17,7]. Consequently, over the past several decades
there has been increasing interest in developing computa-
tional methods to assist in the analysis of microscopic im-
ages in pathology [9,8].
From October 2015 to April 2016, the International
Symposium on Biomedical Imaging (ISBI) held the Came-
lyon Grand Challenge 2016 (Camelyon16) to identify top-
performing computational image analysis systems for the
task of automatically detecting metastatic breast cancer in
digital whole slide images (WSIs) of sentinel lymph node
biopsies. The evaluation of breast sentinel lymph nodes is
an important component of the American Joint Committee
on Cancer’s TNM breast cancer staging system, in which
patients with a sentinel lymph node positive for metastatic
cancer will receive a higher pathologic TNM stage than pa-
tients negative for sentinel lymph node metastasis [6], fre-
quently resulting in more aggressive clinical management,
including axillary lymph node dissection [13,14].
The manual pathological review of sentinel lymph nodes
is time-consuming and laborious, particularly in cases in
which the lymph nodes are negative for cancer or contain
only small foci of metastatic cancer. Many centers have
implemented testing of sentinel lymph nodes with immuno-
histochemistry for pancytokeratins [5], which are proteins
expressed on breast cancer cells and not normally present
in lymph nodes, to improve the sensitivity of cancer metastasis
detection. However, limitations of pancytokeratin
immunohistochemistry testing of sentinel lymph nodes include
increased cost, increased time for slide preparation,
and increased number of slides required for pathological
review. Further, even with immunohistochemistry-stained
slides, the identification of small cancer metastases can be
tedious and inaccurate.
Computer-assisted image analysis systems have been de-
veloped to aid in the detection of small metastatic foci
from pancytokeratin-stained immunohistochemistry slides
of sentinel lymph nodes [22]; however, these systems are
not used clinically. Thus, the development of effective and
cost-efficient methods for sentinel lymph node evaluation
remains an active area of research [11], as there would be
value to a high-performing system that could increase accu-
racy and reduce cognitive load at low cost.
Here, we present a deep learning-based approach for the
identification of cancer metastases from whole slide im-
ages of breast sentinel lymph nodes. Our approach uses
millions of training patches to train a deep convolutional
neural network to make patch-level predictions to discrim-
inate tumor-patches from normal-patches. We then aggre-
gate the patch-level predictions to create tumor probability
heatmaps and perform post-processing over these heatmaps
to make predictions for the slide-based classification task
and the tumor-localization task. Our system won both competitions
at the Camelyon Grand Challenge 2016, with performance
approaching human-level accuracy. Finally, combining
the predictions of our deep learning system with a
pathologist’s interpretations produced a significant reduc-
tion in the pathologist’s error rate.
2. Dataset and Evaluation Metrics
In this section, we describe the Camelyon16 dataset pro-
vided by the organizers of the competition and the evalua-
tion metrics used to rank the participants.
2.1. Camelyon16 Dataset
The Camelyon16 dataset consists of a total of 400 whole
slide images (WSIs) split into 270 for training and 130 for
testing. Both splits contain samples from two institutions
(Radboud UMC and UMC Utrecht) with specific details
provided in Table 1.
Table 1: Number of slides in the Camelyon16 dataset.
Institution    Train (cancer)   Train (normal)   Test
Radboud UMC    90               70               80
UMC Utrecht    70               40               50
Total          160              110              130
The ground truth data for the training slides consists of
a pathologist’s delineation of regions of metastatic cancer
on WSIs of sentinel lymph nodes. The data was provided in
two formats: XML files containing vertices of the annotated
contours of the locations of cancer metastases and WSI bi-
nary masks indicating the location of the cancer metastasis.
2.2. Evaluation Metrics
Submissions to the competition were evaluated on the
following two metrics:
Slide-based Evaluation: For this metric, teams were
judged on performance at discriminating between
slides containing metastasis and normal slides. Com-
petition participants submitted a probability for each
test slide indicating its predicted likelihood of contain-
ing cancer. The competition organizers measured the
participant performance using the area under the receiver
operating characteristic curve (AUC).
Lesion-based Evaluation: For this metric, partic-
ipants submitted a probability and a corresponding
(x, y) location for each predicted cancer lesion within
the WSI. The competition organizers measured participant
performance as the average sensitivity for detecting
all true cancer lesions in a WSI across 6 false positive
rates: 1/4, 1/2, 1, 2, 4, and 8 false positives per WSI.
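The two metrics above can be illustrated with a minimal sketch. It is not the organizers' evaluation code: the function names are hypothetical, the FROC part assumes each detection has already been matched to at most one annotated lesion, and it reads sensitivity off the curve as a step function rather than interpolating.

```python
import numpy as np

FP_RATES = (0.25, 0.5, 1, 2, 4, 8)  # false positives per WSI, as listed above


def slide_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def froc_score(probs, is_true_positive, n_lesions, n_slides, fp_rates=FP_RATES):
    """Average sensitivity at the given numbers of false positives per slide.

    probs            : confidence of every detection, pooled over all slides
    is_true_positive : True where a detection hits a distinct annotated lesion
                       (assumption: the matching is done beforehand)
    """
    probs = np.asarray(probs, dtype=float)
    hits = np.asarray(is_true_positive, dtype=bool)
    order = np.argsort(-probs)                  # sweep the threshold high to low
    tp = np.cumsum(hits[order])
    fp = np.cumsum(~hits[order])
    sensitivity = np.concatenate([[0.0], tp / n_lesions])
    fps_per_slide = np.concatenate([[0.0], fp / n_slides])
    # Best sensitivity reached while staying at or below each allowed FP rate.
    sens_at = [sensitivity[fps_per_slide <= r].max() for r in fp_rates]
    return float(np.mean(sens_at))
```

For example, `slide_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])` ranks three of the four positive/negative pairs correctly and so gives 0.75.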
3. Method
In this section, we describe our approach to cancer
metastasis detection.
3.1. Image Pre-processing
Figure 1: Visualization of tissue region detection during image
pre-processing (described in Section 3.1). Detected tissue
regions are highlighted with the green curves.
To reduce computation time and to focus our analysis
on regions of the slide most likely to contain cancer metas-
tasis, we first identify tissue within the WSI and exclude
background white space. To achieve this, we adopt a
threshold-based segmentation method to automatically detect the
background region. In particular, we first convert the original
image from the RGB color space to the HSV color
space, then the optimal threshold values in each channel are
computed using the Otsu algorithm [16], and the final mask
images are generated by combining the masks from H and
S channels. The detection results are visualized in Fig. 1,
where the tissue regions are highlighted using green curves.
According to the detection results, the average percentage
of background region per WSI is approximately 82%.
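The tissue detection step can be sketched as follows, using NumPy only. This is an illustration under stated assumptions rather than the pipeline actually used: the function names are hypothetical, the image is assumed to be a low-resolution RGB thumbnail of the slide, pixels above each Otsu threshold are assumed to be tissue, and the H and S masks are assumed to be combined with a logical AND (the text does not say how they are combined).

```python
import numpy as np


def rgb_to_hue_sat(rgb):
    """Return hue and saturation, both scaled to [0, 1], for an RGB uint8 image."""
    rgb = rgb.astype(float) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mx, mn = rgb.max(axis=-1), rgb.min(axis=-1)
    delta = mx - mn
    safe = np.where(delta == 0, 1.0, delta)          # avoid division by zero
    hue = np.where(mx == r, ((g - b) / safe) % 6,
                   np.where(mx == g, (b - r) / safe + 2, (r - g) / safe + 4))
    hue = np.where(delta == 0, 0.0, hue) / 6.0       # grey pixels get hue 0
    sat = delta / np.where(mx == 0, 1.0, mx)
    return hue, sat


def otsu_threshold(values, bins=256):
    """Otsu's method: pick the cut that maximises between-class variance."""
    hist, edges = np.histogram(values, bins=bins, range=(0.0, 1.0))
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist).astype(float)               # pixels at or below each bin
    w1 = w0[-1] - w0                                 # pixels above each bin
    s0 = np.cumsum(hist * centers)
    mean0 = s0 / np.maximum(w0, 1)
    mean1 = (s0[-1] - s0) / np.maximum(w1, 1)
    between = w0 * w1 * (mean0 - mean1) ** 2
    return edges[1:][np.argmax(between)]             # upper edge of the best bin


def detect_tissue(rgb):
    """Boolean mask that is True on tissue and False on background."""
    hue, sat = rgb_to_hue_sat(rgb)
    mask_h = hue >= otsu_threshold(hue)
    mask_s = sat >= otsu_threshold(sat)
    return mask_h & mask_s                           # assumption: logical AND
```

The fraction of background for a slide is then `1 - detect_tissue(thumbnail).mean()`.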
3.2. Cancer Metastasis Detection Framework
Our cancer metastasis detection framework consists of a
patch-based classification stage and a heatmap-based post-
processing stage, as depicted in Fig. 2.
Figure 2: The framework of cancer metastases detection.
During model training, the patch-based classification
stage takes as input whole slide images and the ground
truth image annotation, indicating the locations of regions
of each WSI containing metastatic cancer. We randomly
extract millions of small positive and negative patches from
the set of training WSIs. If a patch is located in
a tumor region, it is a tumor (positive) patch and is labeled 1;
otherwise, it is a normal (negative) patch and is labeled 0.
Following selection of positive and negative training
examples, we train a supervised classification model to
discriminate between these two classes of patches, and we
embed all the prediction results into a heatmap image. In the
heatmap-based post-processing stage, we use the tumor
probability heatmap to compute the slide-based evaluation
and lesion-based evaluation scores for each WSI.
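The patch labeling rule described above can be sketched as follows. The function name and parameters are hypothetical, and the sketch assumes that a patch counts as "located in a tumor region" when its center pixel falls inside the annotation mask; the text does not define the rule more precisely. Reading pixels from the slide files is not shown.

```python
import numpy as np


def sample_labeled_patches(tumor_mask, tissue_mask, n_patches, patch=256, seed=0):
    """Draw random patch locations inside tissue and label each one.

    Returns a list of (row, col, label) giving the top-left corner of each
    patch; label is 1 when the patch center lies in an annotated tumor
    region (assumed rule), otherwise 0.
    """
    rng = np.random.default_rng(seed)
    half = patch // 2
    h, w = tumor_mask.shape
    # Candidate centers: tissue pixels far enough from the border for a full patch.
    rows, cols = np.nonzero(tissue_mask[half:h - half, half:w - half])
    picks = rng.choice(len(rows), size=n_patches, replace=False)
    samples = []
    for i in picks:
        r, c = rows[i] + half, cols[i] + half        # center in image coordinates
        label = int(tumor_mask[r, c])
        samples.append((int(r - half), int(c - half), label))
    return samples
```

In practice positives and negatives would likely be sampled separately to control the class balance; the sketch draws from all tissue pixels for brevity.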
3.3. Patch-based Classification Stage
During training, this stage uses as input 256x256 pixel
patches from positive and negative regions of the WSIs
and trains a classification model to discriminate between
the positive and negative patches. We evaluated the per-
formance of four well-known deep learning network ar-
chitectures for this classification task: GoogLeNet [20],
AlexNet [12], VGG16 [19] and a face-oriented deep network
(FaceNet) [21], as shown in Table 2. The two deeper networks
(GoogLeNet and VGG16) achieved the best patch-based
classification performance. In our framework, we adopt
GoogLeNet as our deep network structure since it is gen-
erally faster and more stable than VGG16. The network
structure of GoogLeNet consists of 27 layers in total and
more than 6 million parameters.
Table 2: Evaluation of Various Deep Models
Model            Patch classification accuracy
GoogLeNet [20]   98.4%
AlexNet [12]     92.1%
VGG16 [19]       97.9%
FaceNet [21]     96.8%
In our experiments, we evaluated a range of magnification
levels, including 40×, 20× and 10×, and we obtained
the best performance with 40× magnification. We used
only the 40× magnification in the experimental results
reported for the Camelyon competition.
After generating tumor-probability heatmaps using
GoogLeNet across the entire training dataset, we noted that
a significant proportion of errors were due to false positive
classification from histologic mimics of cancer. To improve
model performance on these regions, we extract additional
training examples from these difficult negative regions and
retrain the model with a training set enriched for these hard
negative patches.
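Selecting hard negatives from a training heatmap can be sketched as below. The function name, the 0.5 cut-off and the sample cap are assumptions made for illustration; the text does not state how the difficult regions were selected.

```python
import numpy as np


def mine_hard_negatives(heatmap, tumor_mask, threshold=0.5, max_samples=1000, seed=0):
    """Return (row, col) heatmap positions predicted as tumor but annotated normal."""
    false_pos = (heatmap >= threshold) & ~tumor_mask.astype(bool)
    coords = np.argwhere(false_pos)
    if len(coords) > max_samples:
        # Subsample so that one slide cannot dominate the enriched training set.
        rng = np.random.default_rng(seed)
        coords = coords[rng.choice(len(coords), size=max_samples, replace=False)]
    return coords
```

Patches cut at the returned positions would be added to the negative class before retraining.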
We present one of our results in Fig. 3. Given a whole
slide image (Fig. 3(a)) and a deep learning based patch clas-
sification model, we generate the corresponding tumor re-
gion heatmap (Fig. 3(b)), which highlights the tumor area.
Figure 3: Visualization of tumor region detection. (a) Tumor slide.
(b) Heatmap. (c) Heatmap overlaid on slide.
3.4. Post-processing of tumor heatmaps to compute
slide-based and lesion-based probabilities
After completion of the patch-based classification stage,
we generate a tumor probability heatmap for each WSI. On
these heatmaps, each pixel contains a value between 0 and
1, indicating the probability that the pixel contains tumor.
We now perform post-processing to compute slide-based
and lesion-based scores for each heatmap.
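Assembling a heatmap from overlapping patch predictions can be sketched as follows. Here `predict_patch` is a hypothetical stand-in for the trained network (any callable that maps a patch to a tumor probability), and the stride of 64 pixels is an assumption; the text does not give the amount of overlap.

```python
import numpy as np


def build_heatmap(image, predict_patch, patch=256, stride=64):
    """Slide a window over the image and store one tumor probability per step."""
    h, w = image.shape[:2]
    n_rows = (h - patch) // stride + 1
    n_cols = (w - patch) // stride + 1
    heatmap = np.zeros((n_rows, n_cols), dtype=np.float32)
    for i in range(n_rows):
        for j in range(n_cols):
            r, c = i * stride, j * stride            # top-left corner of this patch
            heatmap[i, j] = predict_patch(image[r:r + patch, c:c + patch])
    return heatmap
```

Each heatmap pixel therefore corresponds to one stride-sized step on the slide; in practice only windows that overlap detected tissue would need to be evaluated.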
3.4.1 Slide-based Classification
For the slide-based classification task, the post-processing
takes as input a heatmap for each WSI and produces as output
a single probability of tumor for the entire WSI. We extract
28 geometrical and morphological features from each heatmap,
including the percentage of tumor region over the whole
tissue region, the area ratio between
tumor region and the minimum surrounding convex
region, the average prediction values, and the longest axis
of the tumor region. We compute these features over tu-
mor probability heatmaps across all training cases, and we
build a random forest classifier to discriminate the WSIs
with metastases from the negative WSIs. On the test cases,
our slide-based classification method achieved an AUC of
0.925, making it the top-performing system for the slide-
based classification task in the Camelyon grand challenge.
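A few of the heatmap features named above can be sketched as follows. This covers only four illustrative features, not the 28 used; the function name and the 0.5 binarization threshold are assumptions, the longest axis is approximated by the bounding-box extent, and the maximum probability is an added example feature that the text does not list. The resulting vectors, one per slide, would then be passed to a random forest classifier (for example scikit-learn's `RandomForestClassifier`), which is not shown here.

```python
import numpy as np


def heatmap_features(heatmap, tissue_mask, threshold=0.5):
    """Return a small feature vector summarising one tumor probability heatmap."""
    tumor = heatmap >= threshold                     # assumed binarization
    tissue_area = max(int(tissue_mask.sum()), 1)
    rows, cols = np.nonzero(tumor)
    if len(rows) == 0:
        longest_axis = 0
    else:
        # Bounding-box extent as a simple proxy for the longest axis.
        longest_axis = max(rows.max() - rows.min(), cols.max() - cols.min()) + 1
    mean_prob = heatmap[tissue_mask].mean() if tissue_mask.any() else 0.0
    return np.array([
        tumor.sum() / tissue_area,   # tumor region over whole tissue region
        mean_prob,                   # average prediction value inside tissue
        heatmap.max(),               # strongest prediction (assumed extra feature)
        longest_axis,                # approximate longest axis of the tumor region
    ], dtype=float)
```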
3.4.2 Lesion-based Detection
For the lesion-based detection post-processing, we aim to
identify all cancer lesions within each WSI with few false
positives. To achieve this, we first train a deep model (D-I)
using our initial training dataset described above. We then
train a second deep model (D-II) with a training set that is
enriched for tumor-adjacent negative regions. This model
(D-II) produces fewer false positives than D-I but has re-
duced sensitivity. In our framework, we first threshold the
heatmap produced from D-I at 0.90, which creates a binary
heatmap. We then identify connected components within
the tumor binary mask, and we use the central point as the
tumor location for each connected component. To estimate
the probability of tumor at each of these (x, y) locations, we
take the average of the tumor probability predictions gener-
ated by D-I and D-II across each connected component. The
scoring metric for Camelyon16 was defined as the average
sensitivity at 6 predefined false positive rates: 1/4, 1/2, 1, 2,
4, and 8 FPs per whole slide image. Our system achieved
a score of 0.7051, which was the highest score in the com-
petition and was 22 percent higher than the second-ranking
score (0.5761).
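The lesion-level post-processing described above can be sketched with SciPy's `ndimage` module. The function name is hypothetical, the "central point" is assumed to be the centroid of each connected component, and the default 4-connectivity of `ndimage.label` is an assumption, since the text does not specify either.

```python
import numpy as np
from scipy import ndimage


def detect_lesions(heatmap_d1, heatmap_d2, threshold=0.90):
    """Return a list of (row, col, probability), one entry per candidate lesion."""
    binary = heatmap_d1 >= threshold                 # binary heatmap from D-I
    labeled, n = ndimage.label(binary)               # connected components
    if n == 0:
        return []
    index = np.arange(1, n + 1)
    # Centroid of each component, used as the reported lesion location.
    centers = ndimage.center_of_mass(binary.astype(float), labeled, index)
    # Average of the D-I and D-II predictions, taken over each component.
    combined = (heatmap_d1 + heatmap_d2) / 2.0
    probs = ndimage.mean(combined, labeled, index)
    return [(float(r), float(c), float(p)) for (r, c), p in zip(centers, probs)]
```

Because D-II produces fewer false positives, averaging it with D-I lowers the score of components that only D-I considers tumor, so they rank below components on which both models agree.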
4. Experimental Results
4.1. Evaluation Results from Camelyon16
In this section, we briefly present the evaluation results
generated by the Camelyon16 organizers, which are also
available on the competition website.
There are two kinds of evaluation in Camelyon16: Slide-based
Evaluation and Lesion-based Evaluation. We won
both of these challenging tasks.
Slide-based Evaluation: The merits of the algorithms
were assessed for discriminating between slides containing
metastasis and normal slides. Receiver operating characteristic
(ROC) analysis at the slide level was performed, and
the measure used for comparing the algorithms was the area
under the ROC curve (AUC). Our submitted result was generated
based on the algorithm in Section 3.4.1. As shown in
Fig. 4, the AUC is 0.9250. Notice that our algorithm performed
much better than the second best method when the
False Positive Rate (FPR) is low.
Figure 4: Receiver Operating Characteristic (ROC) curve of
Slide-based Classification.
Lesion-based Evaluation: For the lesion-based evaluation,
free-response receiver operating characteristic (FROC)
curves were used. The FROC curve is defined as the plot of
sensitivity versus the average number of false positives per
image. Our submitted result was generated based on the
algorithm in Section 3.4.2. As shown in Fig. 5, we can make
two observations: first, the pathologist did not make any
false positive predictions; second, when the average number
of false positives is larger than 2, which indicates that there
will be two false positive alerts in each slide on average, our
performance (in terms of sensitivity) even outperformed the
pathologist.
Figure 5: Free-Response Receiver Operating Characteristic
(FROC) curve of the Lesion-based Detection.
4.2. Combining Deep Learning System with a Hu-
man Pathologist
To evaluate the top-ranking deep learning systems
against a human pathologist, the Camelyon16 organizers
had a pathologist examine the test images used in the
competition. For the slide-based classification task, the human
pathologist achieved an AUC of 0.9664, reflecting a 3.4 percent
error rate. When the predictions of our deep learning
system were combined with the predictions of the human
pathologist, the AUC was raised to 0.9948, reflecting a drop
in the error rate to 0.52 percent.
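The error rates quoted above are consistent with reading the error rate as 1 minus the AUC. Under that reading, which is an interpretation of the reported numbers rather than a stated definition, the relative reduction can be checked with a short calculation:

```python
def error_rate(auc):
    """Error rate implied by an AUC, taken here as 1 - AUC (assumed reading)."""
    return 1.0 - auc


pathologist = error_rate(0.9664)     # pathologist alone
combined = error_rate(0.9948)        # pathologist combined with the deep learning system
relative_reduction = (pathologist - combined) / pathologist

print(f"{pathologist:.2%} -> {combined:.2%}, relative reduction {relative_reduction:.1%}")
# prints "3.36% -> 0.52%, relative reduction 84.5%"
```

This matches the approximately 85 percent reduction in error rate reported for the combined system.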
5. Discussion
Here we present a deep learning-based system for the
automated detection of metastatic cancer from whole slide
images of sentinel lymph nodes. Key aspects of our sys-
tem include: enrichment of the training set with patches
from regions of normal lymph node that the system was
initially mis-classifying as cancer; use of a state-of-the-
art deep learning model architecture; and careful design of
post-processing methods for the slide-based classification
and lesion-based detection tasks.
Historically, approaches to histopathological image anal-
ysis in digital pathology have focused primarily on low-
level image analysis tasks (e.g., color normalization, nu-
clear segmentation, and feature extraction), followed by
construction of classification models using classical ma-
chine learning methods, including: regression, support vec-
tor machines, and random forests. Typically, these algo-
rithms take as input relatively small sets of image features
(on the order of tens) [9,10]. Building on this framework,
approaches have been developed for the automated extrac-
tion of moderately high dimensional sets of image features
(on the order of thousands) from histopathological images
followed by the construction of relatively simple, linear
classification models using methods designed for dimen-
sionality reduction, such as sparse regression [2].
Since 2012, deep learning-based approaches have con-
sistently shown best-in-class performance in major com-
puter vision competitions, such as the ImageNet Large
Scale Visual Recognition Challenge (ILSVRC) [18].
Deep learning-based approaches have also recently shown
promise for applications in pathology. A team from the re-
search group of Juergen Schmidhuber used a deep learning-
based approach to win the ICPR 2012 and MICCAI 2013
challenges focused on algorithm development for mitotic
figure detection [3]. In contrast to the types of machine
learning approaches historically used in digital pathology,
in deep learning-based approaches there tend to be no dis-
crete human-directed steps for object detection, object seg-
mentation, and feature extraction. Instead, the deep learn-
ing algorithms take as input only the images and the image
labels (e.g., 1 or 0) and learn a very high-dimensional and
complex set of model parameters with supervision coming
only from the image labels.
Our winning approach in the Camelyon Grand Challenge
2016 utilized a 27-layer deep network architecture and ob-
tained near human-level classification performance on the
test data set. Importantly, the errors made by our deep
learning system were not strongly correlated with the errors
made by a human pathologist. Thus, although the patholo-
gist alone is currently superior to our deep learning system
alone, combining deep learning with the pathologist pro-
duced a major reduction in pathologist error rate, reducing
it from over 3 percent to less than 1 percent. More generally,
these results suggest that integrating deep learning-based
approaches into the workflow of the diagnostic pathologist
could drive improvements in the reproducibility, accuracy
and clinical value of pathological diagnoses.
6. Acknowledgments
We thank all the Camelyon Grand Challenge 2016 or-
ganizers with special acknowledgments to lead coordinator
Babak Ehteshami Bejnordi. AK and AHB are co-founders
of PathAI, Inc.
