Otáloraetal. BMC Med Imaging (2021) 21:77
Combining weakly andstrongly supervised
learning improves strong supervision inGleason
pattern classication
Sebastian Otálora1,2*, Niccolò Marini1,2, Henning Müller1,3 and Manfredo Atzori1,4
Background: One challenge in training deep convolutional neural network (CNN) models with whole slide images (WSIs) is providing the required large number of costly, manually annotated image regions. Strategies to alleviate the scarcity of annotated data include: using transfer learning, data augmentation and training the models with less expensive image-level annotations (weakly supervised learning). However, it is not clear how to combine the use of transfer learning in a CNN model when different data sources are available for training, or how to leverage the combination of large amounts of weakly annotated images with a set of local region annotations. This paper aims to evaluate CNN training strategies based on transfer learning to leverage the combination of weak and strong annotations in heterogeneous data sources. The trade-off between classification performance and annotation effort is explored by evaluating a CNN that learns from strong labels (region annotations) and is later fine-tuned on a dataset with less expensive weak (image-level) labels.
Results: As expected, the model performance on strongly annotated data steadily increases as the percentage of strong annotations that are used increases, reaching a performance comparable to pathologists (κ = 0.691 ± 0.02). Nevertheless, the performance sharply decreases when applied to the WSI classification scenario (κ = 0.307 ± 0.133). Moreover, it provides only a lower performance regardless of the number of annotations used. The model performance increases when fine-tuning the model for the task of Gleason scoring with the weak WSI labels (κ = 0.528 ± 0.05).
Conclusion: Combining weak and strong supervision improves strong supervision in the classification of Gleason patterns using tissue microarrays (TMA) and WSI regions. Our results contribute effective strategies for training CNN models combining few annotated data and heterogeneous data sources. The performance increases in the controlled TMA scenario with the number of annotations used to train the model. Nevertheless, the performance is hindered when the trained TMA model is applied directly to the more challenging WSI classification problem. This demonstrates that a good pre-trained model for prostate cancer TMA image classification may lead to the best downstream model if fine-tuned on the WSI target dataset. We have made available the source code repository for reproducing the experiments in the paper: https://github.com/ilmaro8/Digital_Pathology_Transfer_Learning
Keywords: Computational pathology, Weak supervision, Transfer learning, Prostate cancer, Deep learning
© The Author(s) 2021. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Background
Prostate cancer (PCa), the fourth most common cancer worldwide, is the sixth leading cause of cancer death among men, with 1.2 million new cases and more than 350,000 deaths in the world in 2018 [1]. PCa also has
2 Computer Science Centre (CUI), University of Geneva, Route de Drize 7, Battelle A, Carouge, Switzerland. Full list of author information is available at the end of the article.
the second-highest incidence of all cancers in men [1]. The gold standard for the diagnosis of PCa is the visual inspection of tissue samples from needle biopsies or prostatectomies. The Gleason score (GS) is the standard grading system currently used to describe the aggressiveness of PCa [2].
The Gleason scoring assigns a number given the architectural patterns shown in prostate tissue samples observed under a microscope. The system describes tumor appearance and aberrations in the prostate tissue glands in biopsies or radical prostatectomy specimens. GS staging is required for treatment planning, and pathologists use it to help in prognosis and guide the therapy. The Gleason score comprises the sum of the two patterns (Gleason patterns, from 1 to 5) most frequently present in the tissue slide, producing a final score in the range of 2 to 10. The revised Gleason score from the International Society of Urological Pathology (ISUP) is used in pathology routine [3].
Thanks to the recent improvements in digital pathology and the availability of slide scanners, the diagnosis is now also increasingly performed through the visual inspection of high-resolution scans of a tissue sample, known as Whole-Slide Images (WSIs) [4].
Computer assisted diagnosis (CAD) systems in digital pathology cover tasks such as the automatic classification or grading of a disease, segmentation of regions of interest like mitotic cells and tumor-infiltrating lymphocytes, as well as the retrieval of similar cases, among others [5, 6]. One of the aims of computational pathology (CP) is the development of CAD tools for helping pathologists in the analysis of WSIs, which can easily be over 100,000² pixels in size. The lack of a large set of images annotated in detail (pixel level) is one of the constraints that researchers face when training deep learning models in computational pathology. Despite the lack of annotations, methods have shown partial success in medical imaging when trained with small sets of annotations using techniques such as multiple instance learning, active, transfer and weakly supervised learning [7–18].
Deep Convolutional Neural Network (CNN) models are currently the backbone of the state-of-the-art methods to analyze prostate WSIs [7, 8, 17] in computational pathology. The success of CNNs relies on automatically learning the relevant features to classify the input images using a large set of annotated data with supervised learning. Obtaining big annotated datasets can be feasible for natural images, where the annotation effort is reduced by leveraging crowd-sourcing platforms such as Amazon Mechanical Turk [19]. In medical imaging, and particularly in histopathology, annotations of regions of interest require qualified personnel who undergo years of training and often have limited time for such activities due to a heavy clinical workload. Manually increasing the number of annotations in a training dataset is well known to improve the generalization performance of machine learning models. This is difficult in medical applications because of the costly annotations. For this reason, alternative sources of supervision using readily available and inexpensive labels (i.e., from clinical or pathology reports) are being studied to obtain larger annotated datasets without the time
and cost of the pixel-wise annotations. Transfer learning refers to the set of techniques where models are trained on one dataset (or task) and then applied to another task (or dataset), where the features that are learned by the model can be reused. Empirical results show that it is better to transfer the weights of an ImageNet pre-trained network than to train from scratch for histopathology classification tasks [20]. The effect of transfer learning can be seen as a good choice of initial parameters for a deep neural network that will have a strong regularizing effect on its performance [21]. There are several motivations for transfer learning. One of the principal reasons is to discover generic features likely to be of interest for many classification tasks on similar data [22]. In the context of medical imaging, two types of transfer learning are commonly described in the literature [17, 23]: (1) the model features, for example the weights of a CNN architecture, are used to extract a continuous vector representation, i.e. a feature vector, and a linear classifier is trained on top of this generic inferred representation to predict labels that were not used for the initial training of the model; (2) a pre-trained model is used as the initialization for a second model and optimized or fine-tuned using a new set of labels that correspond to the new task. For computational pathology tasks, the second strategy of fine-tuning models from natural image datasets, such as ImageNet, was shown to yield better results than the off-the-shelf feature extraction [23]. Transfer learning is also a good strategy to speed up model training time, since the models converge faster. Strong labels refer to manually delineated regions of interest in the images. These labels are also commonly called pixel-wise labels since usually each pixel can be assigned to one category depending on the contour of the manually delineated region. In computational pathology, such datasets with strong labels are rare because having a thoroughly annotated dataset is very expensive with such large images.
Weak labels refer to general categories such as cancer, benign tissue, or a score in a grading system of a specific organ, e.g., the Gleason grade in prostate cancer [7, 8], that are attached to an image as a whole and not to image regions. Spatial details are often lacking in pathology
reports, since the exact location of the relevant areas is
not given.
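The two transfer-learning modes described above (off-the-shelf feature extraction versus fine-tuning) differ only in which parameters are updated on the new task. A minimal sketch, using a random linear map as a stand-in for a pre-trained CNN backbone (illustrative only; all names and the toy squared loss are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained backbone: a fixed linear feature extractor.
W_backbone = rng.normal(size=(8, 4)) * 0.1   # "pre-trained" weights
W_head = np.zeros((4, 1))                    # new task-specific classifier

def train_step(x, y, fine_tune, lr=0.05):
    """One SGD step on a squared loss over (x @ W_backbone) @ W_head.
    Mode 1 (fine_tune=False): backbone frozen, only the head is trained.
    Mode 2 (fine_tune=True): backbone weights are fine-tuned as well."""
    global W_backbone, W_head
    h = x @ W_backbone                       # extracted feature vector
    err = h @ W_head - y                     # prediction error
    grad_head = np.outer(h, err)
    grad_backbone = np.outer(x, err @ W_head.T)
    W_head -= lr * grad_head
    if fine_tune:                            # mode 2: also adapt the backbone
        W_backbone -= lr * grad_backbone

x, y = rng.normal(size=8), np.array([1.0])
frozen = W_backbone.copy()
train_step(x, y, fine_tune=False)            # mode 1 leaves the backbone intact
assert np.array_equal(W_backbone, frozen)
```

In mode 1 the backbone is effectively a fixed feature extractor; in mode 2 the pre-trained weights act only as an initialization and drift toward the new task.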
The line of research that investigates how to use inexpensive weak labels, known in the machine learning literature as weakly supervised learning, has recently obtained good results in medical imaging, including computational pathology [8, 9, 15, 24–27]. In computational pathology, weak labels (or WSI-level labels), such as a grade for the entire image, are often readily available for digital pathology applications since they are usually included in pathology reports summarizing the findings of the pathologist about the tissue slide or WSI. Weakly supervised learning makes it possible to bypass the need for strong supervision by using weaker labels that are often easier to obtain with limited effort, also in large quantities. However, this usually comes at the cost of requiring a large amount of weakly annotated data to reach good performance, i.e., a number of WSIs in the order of thousands [9].
A potential solution to the need for a large annotated dataset is to gather different supervision sources to train and test a model for a specific task, leveraging several datasets and transfer learning. Alleviating the potential of overfitting to one dataset allows obtaining a robust model that accounts for a higher variability of the samples. This opens the question of how to use different types of supervision and data sources together in practice, since this can alleviate the requirement of many weak annotations. Furthermore, the trade-off between strong and weak supervision for CNN models in computational pathology is not clearly defined.
As discussed in the above paragraphs, weakly supervised CNN models pose feasible solutions to tackle classification tasks in computational pathology, particularly with the help of transfer learning approaches, most frequently using ImageNet pre-trained models. However, the computational pathology literature lacks methods dealing with the problem of building upon knowledge acquired in one histopathology dataset to another with similar characteristics or with the same underlying classification task.
There are many weakly annotated datasets available, and federated learning in hospitals is also starting to be used [9]. The investigation of deep learning methods that leverage different supervision levels, as well as knowing how many annotations are required for training such models, is of paramount importance for the practical deployment of computational pathology models in digital pathology laboratories.
In this paper, fully and weakly supervised CNN models are compared using two openly accessible datasets and transfer learning, targeting prostate cancer scoring: tissue microarrays with strong annotations and WSIs with weak labels. We evaluated the performance gap between models trained with an increasing number of strong annotations, models trained with weakly labeled images, and models trained with both sources of data.
The organization of the article is as follows: first, we discuss the related work dealing with transfer learning and weakly supervised learning in computational pathology and how our contributions fit into the context of automatic PCa grading with deep learning. Then, in the experimental setup section, we describe the characteristics of the datasets, the baseline CNN models and the transfer learning strategies. The "Results" section presents the results of the experiments, and finally the "Discussion" section discusses the results in the context of similar approaches and models that might benefit from the strategies presented in the paper (Table 1).
Related work
The related work is summarized in Table 4. Both the notion of combining weak and strong supervision [15] and transfer learning approaches [23] exist in the computational pathology literature. In this section we discuss the existing work and explain the contributions of this paper in the context of the related work. Recently, Otálora et al. [29] compared weakly supervised strategies for the fine-grained task of Gleason grading, reporting that the use of class-wise data augmentation and a DenseNet architecture using transfer learning led to their best kappa score in a set of 341 WSIs from the TCGA-PRAD repository. Nevertheless, the authors did not include strong supervision in the training of their models, despite having a limited number of WSIs. In the work of Ström et al. [12], the authors trained models with transfer learning using an ImageNet pre-trained InceptionV3 deep CNN architecture for the Gleason scoring task, using 1631 strongly annotated WSIs. Arvaniti et al. [15] present a CNN architecture that combines weak and strong supervision for the classification tasks of high vs. low and Gleason scoring groups (6, 3 + 4 = 7, 4 + 3 = 7, 8, 9–10) used
Table 1 Number of TMA cores in the TMAZ dataset

Class/set   Train   Val   Test
Benign      61      42    12
GS6         165     35    88
GS7         58      25    38
GS8         120     15    91
GS9         26      2     3
GS10        78      14    13
Total       508     133   245
in clinical practice [30]. The model penalizes the weak supervision signal of patches by weighting them using the predicted probability of the strong labels. The model using self-weighted weak supervision obtained an accuracy of 0.848 for the binary task and a competitive Kendall's tau for Gleason scoring, using 447 WSIs and 886 annotated tissue microarrays.
Campanella etal.[9] use transfer learning and a mas-
sive dataset of more than 44,000 WSIs to weakly train
CNN classifiers. e ImageNet pre-trained classifiers
were trained using a multiple instance learning paradigm,
using bags in which the assigned label referred only to a
non-empty subset of elements in the bag, accounting for
the inherent label noise. While their results paved the way
for automated screening tools in computational pathol-
ogy (where the pathologist can discard non-cancerous
slides), their generalization to different clinical scenarios
with highly heterogeneous data and fine-grained classes
remains to be confirmed. Bulten[7] and colleagues use
several CNN classifiers and immunohistochemistry
labels to distill the noisy weak WSI-level labels obtaining
a Gleason grade classifier with a performance compara-
ble to the pathologist inter-rater agreement on a set of
IHC-H&E registered biopsies.
Recently, multiple instance learning techniques and attention models have had partial success in weakly supervised scenarios [9, 13, 26]. It was shown that, despite the noise, the use of weak labels only allows a good performance for the low vs. high Gleason score classification [9, 31], as well as the Gleason score tasks [8]. This might be due to the definition of the Gleason scoring system itself, where the score directly correlates with the percentage of areas with the most frequently repeated Gleason patterns present in the image. However, to reach a clinical-grade performance, the number of images needed for training can sometimes be in the order of thousands of slides, as shown by Campanella et al. [9] for PCa detection.
This paper aims at contributing to answer two questions: (1) How to build up knowledge from different datasets to train histopathology image classifiers; we are particularly interested in how to distill the less expensive, weakly annotated data in conjunction with few annotated regions. (2) How many annotations are necessary to train a model for automatic Gleason grading. To answer these questions, we evaluate the trade-off of using small amounts of strong annotations from one data source, jointly with weak annotations from another data source, to train supervised CNN models. CNN models are fine-tuned with different levels of pixel-wise labels of prostate cancer grading from TMA images. Then, the trained model is evaluated on Gleason scoring in the same type of TMA cores and then on WSIs from the TCGA-PRAD repository. Second, the pre-trained model is further fine-tuned with images that use the weak labels from the TCGA-PRAD reports to perform Gleason scoring at the WSI level. Despite the TMA dataset with strong labels having different visual characteristics from the WSI dataset with weak labels, fine-tuning the best model trained with TMA cores with the weak labels reduces the effect of domain shift, obtaining considerably better results than directly predicting on the external datasets and also better than training with the weak labels only, as reported in the "Results" section. Since in this paper we use only around 400 slides, the weakly supervised training serves as a performance baseline to compare against the other presented strategies. The main technical contributions of this paper are the following:
- A thorough evaluation of transfer learning, using fixed amounts of strong labels, in the task of prostate cancer image classification with CNNs, using openly accessible datasets for evaluation of the generalization power of the model.
- A systematic evaluation of the CNN performance dependency on strong labels, weak labels and a combination of the two types of labels in the task of prostate cancer image classification with heterogeneous datasets.
Figure 1 summarizes the overall workflow of the pro-
posed approach to measure the dependency on annota-
tions with different data sources in CNN models for PCa
grading. First, we gather and preprocess the PCa grad-
ing datasets with heterogeneous characteristics ("Data-
sets" section), and with a different level of specificity in
the labels. e first dataset consists of TMAs that are
thoroughly annotated with pixel-wise Gleason pattern
annotations. e second dataset is composed of WSIs
of prostactectomies, in this case we used the WSI-level
label that was extracted from the pathology reports.
Next, the CNN models are trained according to three
strategies ("Datasets" section) as follows. e first one
takes fixed percentages of pixel-wise annotations and
trains fully supervised CNN models for each percent-
age. Like this we can evaluate how the performance of
the CNN depends on the number of strong annotations
provided for training the model. is is reported in
"Results" section. e second strategy consists of tak-
ing each of the previously trained CNN architectures
and fine-tune the weights using the weakly supervised
dataset from the external source. Here, we evaluated if
Fig. 1 The main components of our approach: datasets for PCa grading with strong and weak labels ("Datasets" section); CNN model training with three strategies ("Datasets" section); and third, the tests performed in two scenarios of PCa grading: tissue microarrays of prostate tissue and prostatectomy WSIs ("Results" section). The patches from the strongly labeled TMAs are used to train CNN models with an increasing number of annotations, evaluating the performance depending on the number of strong labels used for training. The ImageNet pre-trained models are either trained using only the weak WSI-level label or fine-tuned with WSI patches and weak labels, combining different sources of supervision. The models are tested in the two scenarios of PCa grading: tissue microarrays and prostatectomies. Arrows of the same color indicate the data or model input from the previous step.
it was better to perform direct inference with the previously trained models or to transfer the weights of the fully supervised CNN and combine them using weakly supervised fine-tuning. In the third strategy, we trained the CNN model from scratch using only the WSI-level labels as supervision. While there are many non-relevant areas used for training the models, results show that the performance is robust in a test set with similar preparation conditions.
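The fixed-percentage sampling of strong annotations in the first strategy could be sketched as follows (the per-class stratification and helper names are our assumptions for illustration, not the paper's released code):

```python
import random
from collections import defaultdict

def subsample_strong_annotations(patches, percentage, seed=42):
    """Return `percentage` percent of the strongly annotated patches,
    sampled per class so the class balance is preserved.

    patches: list of (patch_id, gleason_label) tuples.
    """
    by_class = defaultdict(list)
    for patch_id, label in patches:
        by_class[label].append(patch_id)
    rng = random.Random(seed)
    subset = []
    for label, ids in by_class.items():
        k = max(1, round(len(ids) * percentage / 100))  # keep at least one
        subset.extend((pid, label) for pid in rng.sample(ids, k))
    return subset

train = [(f"patch_{i}", i % 4) for i in range(1000)]   # 4 balanced classes
subset = subsample_strong_annotations(train, 20)        # keep 20% of labels
print(len(subset))  # 200
```

A fully supervised model would then be trained once per percentage level, and its performance plotted against the annotation budget.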
Datasets
For the evaluation of the approaches, two openly accessible datasets are used in the experiments for Gleason scoring and Gleason pattern classification. The datasets originate from different sources. Despite the fact that the tissue is stained in all cases with hematoxylin and eosin (H&E) and that all the datasets are from prostate tissue, the visual appearance of the images shows many differences in the preparation. As shown in previous work [32, 33], not using color augmentation or stain-invariant methods decreases the performance of CNN models in WSI analysis. We account for this variability using image augmentations as described in the "Datasets" section.
TMAZ: tissue microarrays with strong annotations
The first dataset is the pixel-wise, pathologist-annotated tissue microarray dataset released by Arvaniti et al. [34], referred to as TMAZ (Tissue MicroArrays Zürich). This dataset contains 886 prostate TMA cores, from a cohort of 641 patients for model training/validation and 245 for testing, with one image core per patient. The TMAs were scanned at 40x resolution (0.23 microns per pixel) at the University Hospital of Zurich (NanoZoomer-XR Digital slide scanner, Hamamatsu). Tumor stages and Gleason scores were assigned according to the criteria of the Union for International Cancer Control and the World Health Organization/International Society of Urological Pathology. In this dataset, the annotated classes are benign epithelium and Gleason patterns 3, 4 and 5. Given the pixel-level region annotations, the Gleason score is computed using the two most prominent patterns present in the core; if only one pattern is present in the core, the Gleason score is twice that pattern. Overlapping patches covering enough gland context are extracted from the annotation masks. All the patches are scaled to the input size of the CNN models, which are pre-trained with ImageNet [18]. The number of patches extracted from the annotations for each Gleason pattern is listed in Table 2, and the number of TMA cores used in the experiments is listed in Table 1.
TCGA-PRAD: prostatectomies with weak labels
The second dataset consists of 301 cases of prostatectomy WSIs from the public resource of The Cancer Genome Atlas (TCGA) repository of prostate adenocarcinoma (TCGA-PRAD) [35, 36]. Removal of the full prostate (radical prostatectomy) is only done in severe cases; there are thus no fully benign tissue slides in the TCGA-PRAD dataset, since a benign prostatectomy would imply a resection of the whole prostate when the tissue was healthy. When more benign lesions are present, the imaging technique chosen is usually ultrasound or magnetic resonance imaging. In contrast, a single TMA core might contain only benign tissue due to its small size (0.7 mm in diameter vs. 3 cm of height in a prostatectomy slide) and the potential tissue sampling strategies, which do not always account for the tumor.
From the 301 cases, 171 are used for training, 84 for validation and 46 for testing. The selected cases are a subset of all the available cases in TCGA-PRAD: we selected only the formalin-fixed paraffin-embedded slides. The other available slides also contained frozen sections, which include morphological changes due to the dehydration process that can lead to noisier region extraction. Each WSI is paired with its corresponding primary and secondary Gleason pattern (weak) labels from the available pathology reports [35, 37]. Due to the massive size in pixels of a WSI, which can be over 100,000² pixels, the patch extraction used for the supervised CNN training is performed only within relevant tissue regions. The HistoQC tool [38] is used to avoid extracting patches from connective tissue, also avoiding background and pen marks. After HistoQC generates a mask with the usable tissue areas, relevant areas (patches) need to be extracted. The blue-ratio ranking (BR), as described in Rousson et al. [39], is applied to obtain patches with high cell density. We extracted 3000 patches at random locations within the HistoQC mask and selected the top 1000 blue-ratio ranked regions. To keep a comparable pixel size in the areas extracted and in the TMA images, all the patches are computed at a roughly
Table 2 Number of patches for each Gleason pattern in the TMAZ dataset

Class/Set   Train    Val    Test
Benign      1831     1260   127
GP3         5992     1352   1602
GP4         4472     831    2121
GP5         2766     457    387
Total       15,061   3901   4237

Due to high class imbalance, particularly for the high GP 5 and the benign tissue, class-wise data augmentation was applied
similar apparent magnification (0.25 microns per pixel), but including some heterogeneity. Each region is then downsampled to a patch matching the CNN input size, which covers enough context and gland content. The detailed number of cases is given in Table 3.
Model training
In the second row of Fig. 1, an overview of the approaches for the CNN model training is illustrated. The three strategies differ in what type of dataset is used (weakly labeled, strongly labeled, or a combination of both) and how the model is trained (incrementally fully supervised, fine-tuned with weak annotations, or only with weakly supervised learning). The architecture chosen for all the CNN models is the MobileNet architecture [40]. MobileNet was chosen as it allows comparability to the previously reported results of Arvaniti et al. [14]. MobileNet is a lightweight CNN architecture with fewer than 5 million parameters and was shown to obtain a performance that is comparable to pathologists on the TMAZ dataset [14]. Training the CNN classifiers from scratch is not optimal considering the relatively small size of the training set and the heterogeneity of data in color and tissue structures. For this reason, the models are initialized with pre-trained weights that were computed on the ImageNet challenge and then fine-tuned with the respective datasets [18, 41]. The ImageNet pre-trained MobileNet architecture is kept constant throughout the three strategies. Dropout and weight regularization techniques were applied to avoid over-fitting. A dropout layer is placed in between the dense layers, with a probability of 0.2. This value was selected in line with the analysis made by Arvaniti and colleagues, where it is also stated that networks that were not regularized exhibit divergence in the validation cross-entropy loss scores. In the intermediate layers of the CNN, L2 regularization was used, with a lambda parameter equal to 0.01. All the CNN models are trained to predict the Gleason pattern of an input image patch. In the case where strong labels are available, i.e. the TMAZ dataset, the performance is compared against the ground-truth labels. For the weakly annotated TCGA-PRAD dataset, the models are optimized to predict the primary Gleason pattern reported for the WSI, and the Gleason score is computed using a majority voting of probabilities for the patches extracted in each WSI. This aggregation is simple, but it reflects the nature of the grading system: adding the two most common patterns seen in the tissue sample. In the rest of this section, we discuss the details of each of the strategies.
To avoid model overfitting, it is usually beneficial to have more samples that exploit the symmetries in histopathology images (where orientation does not influence the prediction), as well as to account for the subtle differences in stain and preparation methods. For this reason, a pipeline for data augmentation is implemented for all the models using the Augmentor open-source library [42]. Augmentation is preferred over stain normalization since recent studies show that the former often obtains better results than normalization techniques in tackling color heterogeneity [33, 43]. Class-wise data augmentation is applied to alleviate the imbalance of the classes. The number of augmentations applied is inversely proportional to the number of samples in each class, as previous studies suggest [8]. The procedure includes four kinds of operations: rotation, flipping, morphological distortions and color augmentation. We obtain the equivalent of an extra half of training data by applying the operations to the images in the training set with a probability of 0.5. Rotation augmentation is performed by choosing a rigid rotation (90°, 180° or 270°) randomly with equal probability for each rotation. Flipping augmentation can be a vertical or horizontal flipping of the image. Morphological augmentation is a random elastic distortion applied to the image patch. The patch was divided into a 5x5 grid and the magnitude of the distortion was set to 8. Color augmentation was also applied, changing the colors of the input patches. The StainTools library [44] is used here, with the Vahadane stain extractor [45] and both sigma parameters set to 0.3. All the CNN architectures are optimized and evaluated using the internal validation partition and then evaluated in different test set scenarios as described in the results section.
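The class-wise balancing described above can be illustrated by allocating a per-class augmentation budget inversely proportional to the class frequencies. The allocation scheme and example class counts below are our own assumptions; only the inverse proportionality and the "extra half of training data" target come from the text.

```python
from collections import Counter

def class_augmentation_budget(labels, extra_fraction=0.5):
    """Allocate augmented samples per class, inversely proportional to
    class frequency, so that roughly an extra half of the training data
    is generated in total (allocation scheme is illustrative)."""
    counts = Counter(labels)
    inverse = {c: 1.0 / n for c, n in counts.items()}
    total_inverse = sum(inverse.values())
    extra_total = int(extra_fraction * len(labels))
    # Distribute the extra budget proportionally to the inverse frequencies.
    return {c: round(extra_total * inverse[c] / total_inverse) for c in counts}

# Hypothetical imbalanced patch counts per class.
budget = class_augmentation_budget(["benign"] * 600 + ["GP4"] * 300 + ["GP5"] * 100)
```

The rarest class receives the largest share of the augmentation budget, mitigating the class imbalance before training.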
Incremental supervised learning
In the supervised strategy, a comparison of the classification performance of patch-wise Gleason pattern classifiers is performed by training CNN classifiers with an increasing number of strong annotations. The MobileNet architecture is fine-tuned using patches extracted from the pathologist annotations, following the approach of Arvaniti et al. [34]. The CNN is trained to classify the patches into four classes: benign tissue and Gleason patterns 3, 4 and 5.

Table 3 Number of WSIs from the TCGA-PRAD dataset that were used in each partition

Class/set | Training | Validation | Test
GS6 | 13 | 20 | 5
GS7 (3+4) | 42 | 10 | 6
GS7 (4+3) | 30 | 14 | 11
GS8 | 37 | 12 | 13
GS9-10 | 49 | 28 | 11
Total | 171 | 84 | 46

Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 8 of 14
Otálora et al. BMC Med Imaging (2021) 21:77

To evaluate the contribution of the number of annotations (strong supervision) to the performance of the models, five subsets of the TMA training set are extracted. Each subset has a percentage of the samples and associated pixel-level annotations from the original training set: 20%, 40%, 60%, 80% and 100%. The number of patches in each subset is 3090, 6030, 9029, 12059 and 15061, respectively. Following
the original dataset partitions from the setup of Arvaniti et al. [46], the TMA training dataset includes cores from three arrays (ZT111, ZT199, ZT204), the validation set includes the cores from the ZT76 array and the test partition includes cores from the ZT80 array. The TMA cores are heterogeneous in color representation, as can be noticed by visual inspection of the arrays, which are characterized by slightly different stain colors. A fixed number of cores from each array of the training set is selected to have a balanced number of samples for each class. 30 patches are extracted from each TMA core. This value was chosen considering the size of each patch and the amount of tissue covered. It represents a good trade-off between the amount of overlap of the patches and the percentage of tissue extracted from the TMA, evaluated qualitatively. The patches are randomly generated within the annotation mask of the pathologist. We discard the patches with too little tissue. This criterion is applied to avoid the extraction of uninformative patches from the slide background. To have more robust performance estimates, we did not fix the random seed in the experiments. We performed ten experiments to account for the different random seeds and SGD convergent solutions. The ten classifiers for each percentage were trained with the same configuration of hyper-parameters. All the models were trained for 25 epochs, until convergence in validation performance was observed. The input for the models is a batch of 32 image patches, using the Adam optimizer with a learning rate of 0.001 and learning-rate decay, which are standard values for many image classification tasks [47]. The learning rate was explored over the set {10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹} and, despite the performance measures being robust with respect to the learning rate, we observed a slightly better validation error with the value of 10⁻³.
Fine-tuning with weak annotations
Fine-tuning a model that is initially trained with few annotated samples, using a new set of samples for the same underlying task (i.e., Gleason grading), is a suitable technique for combining several levels of supervision, since it allows reusing the knowledge acquired on one dataset on others where the input distribution might shift slightly, for example due to slide preparation. In this strategy, we start with an ImageNet pre-trained MobileNet CNN, which is initially fine-tuned with a fixed number of strongly annotated data from the TMAZ dataset. Finally, the model is further fine-tuned with the regions extracted from the whole slide images of the TCGA-PRAD dataset. Strictly speaking, this means that there are two stages of fine-tuning: the first one from the ImageNet pre-trained models to TMAZ, and then from TMAZ to the TCGA dataset.

The loss function of the classification models minimizes the categorical cross-entropy between the predicted and the pathologist-reported primary Gleason pattern of each of the WSIs. The models are trained at the patch level and then the results are aggregated using majority voting to obtain the Gleason score. Since there are ten models for each percentage of supervised annotations with the fully supervised training, each model is fine-tuned and the performance for each percentage of strongly annotated data is averaged over the ten models. The transfer learning models are trained with the same hyperparameters as the supervised models. They are trained for 5 epochs, with a batch size of 32 samples, using the Adam optimizer with a learning rate of 0.001 and learning-rate decay. As with the TMAs, the patches from the WSIs are also highly heterogeneous in color representation. In total, there are 50 fine-tuned MobileNet models using the TCGA-PRAD dataset: 10 models for each annotation percentage.
Weakly supervised learning
In this strategy, we use only the weakly labeled dataset to train the network to predict the global Gleason score. Training a model only with weak labels is a challenging scenario, since many of the patches that are fed to the network might not contain the relevant visual characteristics or patterns that are associated with the global Gleason score [8, 48].

The weakly supervised models are trained with 100% of the weak labels available. The models are trained for 5 epochs, with an input batch size of 32 samples and the Adam optimizer with a learning rate of 0.001 and learning-rate decay. Since the prediction is at the patch level but the labels used for evaluation are at the WSI level, the predictions have to be aggregated. The WSI label is computed by taking the majority voting of the most frequently predicted Gleason patterns. Similar to the training phase, there are two test sets for evaluating the models of the three proposed strategies: the strongly annotated TMAZ and the weakly annotated TCGA-PRAD test sets. The classification performance is measured as the inter-rater agreement with the ground truth. The raters can be either the pathologist who annotated the dataset and made a report, or the prediction model that assigns classes to the image patches. The performance measure used is Cohen's kappa (κ), which is often used in PCa grading [12, 14, 15] because
it quantifies the agreement between the algorithm and the human raters and is also used to quantify inter-rater disagreement. A perfect agreement has a score of κ = 1 and, since κ is normalized by random chance, a random assignment of ratings (classes) has κ = 0. The kappa score used throughout the reporting of the results in this section is the quadratically weighted κ, which penalizes predicted values that are far from their actual class, i.e. if the annotation for a patch is GP4 and the predicted class is GP5, it is penalized less than if the predicted class is benign. The quadratically weighted κ score is defined as:

$$\kappa = 1 - \frac{\sum_{i,j} w_{ij} O_{ij}}{\sum_{i,j} w_{ij} E_{ij}}, \qquad w_{ij} = \frac{(i-j)^2}{(N-1)^2}$$

where i, j are the ordered scores, N is the total number of classes (Gleason scores, or Gleason patterns in the pattern classification case), O_{ij} is the number of images that were classified with a score i by the first rater and j by the second, and E_{ij} denotes the expected number of images receiving rating i by the first expert and rating j by the second. The quadratic term (i − j)² penalizes rating predictions that are far apart: when the predicted Gleason score is far from the ground-truth class, w_{ij} gets closer to 1.
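The quadratically weighted kappa defined above can be computed directly from the observed and expected rating matrices. The sketch below follows the standard definition; the variable names are ours, not the paper's.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_classes):
    """Quadratically weighted Cohen's kappa between two raters whose
    ratings are integers in [0, n_classes)."""
    observed = np.zeros((n_classes, n_classes))
    for a, b in zip(rater_a, rater_b):
        observed[a, b] += 1
    # Expected matrix under chance agreement: outer product of marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    i, j = np.indices((n_classes, n_classes))
    weights = (i - j) ** 2 / (n_classes - 1) ** 2  # quadratic penalty
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

Perfect agreement yields κ = 1, while systematic maximal disagreement yields negative values.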
After all models are trained, an evaluation in different PCa grading scenarios is performed. In order to evaluate the generalization power of the models, they are evaluated on different test sets where possible. Results suggest that the strategy of training with a small number of annotations and fine-tuning the models with weak labels of the data at hand is better than performing direct inference with pre-trained models and than training the models from scratch using the weak labels.

The plots presented in this section have 5 data points (except for the weakly supervised training), corresponding to the subsets of strong annotations used to train the models (20%, 40%, 60%, 80%, 100%). Each of the five points represents the average of the results of the ten trained models for each percentage. Along with the average, confidence intervals at 95% are drawn as a shaded area. Between each pair of points, the intermediate values are interpolated to display the performance tendencies. The figures and tables display the performance on the different test partitions, which are used neither for training nor for model selection. The tests carried out for each strategy are described in the following subsections (Table 4).
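The shaded areas can be reproduced from the ten per-percentage κ scores with a simple mean and 95% interval. The paper does not state which interval estimator was used, so the normal approximation below (1.96 standard errors) is an assumption, as are the example scores.

```python
import statistics

def mean_ci95(scores):
    """Mean and normal-approximation 95% confidence interval over
    repeated training runs (estimator choice is an assumption)."""
    m = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return m, (m - half_width, m + half_width)

# Hypothetical kappa scores of ten models trained with one annotation percentage.
mean, (low, high) = mean_ci95([0.52, 0.55, 0.50, 0.58, 0.54, 0.53, 0.56, 0.51, 0.57, 0.49])
```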
Incremental supervised learning (TMAZ test set)
In this setup, the MobileNet classifiers, trained with an increasing number of annotations from the TMAZ dataset, are used to perform inference on the TMAZ test set. The full training and test procedures on the TMAZ dataset correspond to the leftmost branch (blue arrows) in Fig. 1. On the TMAZ test set, the evaluation is straightforward: for each percentage of annotations, the best model of each of the ten repetitions, as evaluated on the validation partition, is selected. Then, the ten models are used to predict the Gleason patterns for the patches in each TMA core and aggregate them to compute the Gleason score. In this evaluation, the patches originate from the same center as the training patches and the labels are of the same type as those used for training the models.

In the case of Gleason pattern classification on the TMAs using the models trained with strong labels, the performance is monotonically increasing, as shown in Fig. 2. The average performance of the models when using 100% of the annotations is κ = 0.55 ± 0.03, which is comparable to the performance reported by Arvaniti et al. [14].
The plot in Fig. 3 represents the Gleason score classification on TMAs using the models trained with strong labels. As a reference, the inter-pathologist agreement is represented in both cases with a star. The model performance increases until 40% of the annotations are used and then remains approximately stable up to 100% of the annotations. The average performance of the 10 models when using 100% of the annotations is κ = 0.69 ± 0.02, which is comparable to the pathologists' agreement of κ = 0.71.

Table 4 Reported performance for prostate cancer grading and scoring using deep learning models

Reference | Classes | Results | #Patients | Annotations | Multicenter
Arvaniti [14] | GS6, GS7, GS8, GS9, GS10 | | 641 | Strong | No
Nagpal [10] | GS6, GS7, GS8, GS9, GS10 | ACC | 342 | Strong | Yes
Burlutskiy [11] | With/out basal cells | | 229 | Strong | No
Ström [12] | ISUP: 1,2,3,4,5 | | 976 | Strong | Yes
Otálora [8] | GS6, GS7, GS8, GS9, GS10 | | 341 | Weak | Yes
This work | ISUP: 1,2,3,4,5 | | 341 WSI + 641 TMA | Weak and strong | Yes
Arvaniti [15] | ISUP: 1,2,3,4,5 | | 447 WSI + 641 TMA | Weak and strong | Yes
Bulten [7] | ISUP: 1,2,3,4,5 | | 1243 | Weak and strong | Yes
Campanella [9] | Benign versus cancer | AUCs of 0.98 | 7159 | Weak | Yes

The first four rows correspond to strongly supervised methods using pixel-level annotations. The last four rows are weakly supervised methods that use global labels. Multi-center studies involve training with images from multiple institutions, which increases complexity and requires good generalization performance.
Weakly labeled prostatectomies (TCGA-PRAD test set)
The results of the three strategies on the TCGA-PRAD evaluation are shown in Fig. 4, and in Table 5 the results are summarized for the case where the inference and fine-tuning models use 100% of the strong labels. To compare the weak ground-truth label with the prediction of the models, the individual patch predictions need to be aggregated. The probabilities inferred for all the 1000 patches of each WSI are aggregated with a majority voting rule. The following rule is applied to take into account WSIs with the same primary and secondary Gleason patterns: if the most frequently represented pattern has more than twice the number of patches of the second, then this pattern is assumed to be both the primary and the secondary Gleason pattern. In the prostatectomies TCGA-PRAD test set, the evaluations correspond to the three arrows in the weakly labeled prostatectomies branch of Fig. 1, where all three model training strategies are evaluated:
1. Incremental fully supervised learning models: In this test, the MobileNet classifiers trained with an increasing number of annotations from the TMAZ dataset are used to perform inference on the patches of the TCGA-PRAD whole slide test images. Here, the models perform inference on the automatically extracted region patches, without further fine-tuning. The hypothesis is that the model can be transferred, since the patterns learnt on the TMAZ dataset are similar to those in the TCGA-PRAD WSIs (which is likely the case, since the grading system is the same and pathologists usually have no problems grading either of the image sources). Despite the differences in image size and visual characteristics, the trained models should learn high-level representations of the visual Gleason patterns that are transferable to external prostate cancer image datasets.
2. Weakly supervised training: This is the case where the models do not use the annotations from the TMA dataset at all, but only the weak labels as the source of supervision. The last dense layer of the model predicts both the primary and secondary Gleason patterns. In this case, the model should capture the relevant patterns of the TCGA-PRAD dataset due to the large number of selected patches used for training, which are also processed with the data-augmentation pipeline described in the "Datasets" section.
Fig. 2 Results for the average performance of the trained models on the TMAZ dataset (Gleason pattern results), as measured by the κ-score, as a function of the strong annotation percentage used for training (incremental fully supervised). The pathologist inter-rater agreement (κ = 0.67) is shown as a reference.

Fig. 3 Results for the average performance of the trained models on the TMAZ dataset (Gleason score results), as measured by the κ-score, as a function of the strong annotation percentage used for training (incremental fully supervised). The pathologist inter-rater agreement (κ = 0.71) is shown as a reference.
Fig. 4 Results for the average performance of the trained models using the TCGA-PRAD test dataset (Gleason score results). The performance is measured by the κ-score as a function of the strong annotation percentage, comparing incremental fully supervised training, fine-tuning the TMA models with weak labels, and weakly supervised training (κ = 0.49).
3. Fine-tuning TMA models with weak labels: These models perform inference using the fine-tuned TCGA-PRAD features; in this case, the transferred features are learned on the TMAZ dataset. The model weights used as initialization for each percentage are the weights of the best TMA model (as measured by the κ-score on the TMAZ validation set) for the corresponding percentage of the annotations, changing the last dense layer to predict both the primary and secondary Gleason patterns. In this case, the model is expected to further reduce the difference between the datasets by adapting to the particularities of the TCGA-PRAD dataset.
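The patch-to-WSI aggregation rule described earlier (majority voting, with the primary pattern doubling as the secondary when it is more than twice as frequent as the runner-up) can be sketched as follows; the function name is ours.

```python
from collections import Counter

def wsi_gleason_patterns(patch_predictions):
    """Aggregate patch-level Gleason pattern predictions into the
    WSI-level (primary, secondary) pair by majority voting."""
    ranked = Counter(patch_predictions).most_common(2)
    primary, primary_count = ranked[0]
    # If the top pattern has more than twice the patches of the second,
    # it is taken as both the primary and the secondary pattern.
    if len(ranked) == 1 or primary_count > 2 * ranked[1][1]:
        return primary, primary
    return primary, ranked[1][0]
```

The Gleason score then follows by adding the two patterns, e.g. a (4, 3) pair gives Gleason score 7.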
Figure4 shows the performance for the three strategies
on the TCGA-PRAD test set. e performance is meas-
ured by the weighted Cohen Kappa (
-score) as a func-
tion of the percentage of TMAZ annotations used to
train the models.
e performance of the first strategy of the models
using incremental full supervision is shown as the blue
line. is strategy reaches
κ=0.31 ±0.13
as the aver-
age for the 10 trained models when using the 100% of
the annotations. is strategy obtains the worst perfor-
mance, regardless of the number of the annotations used.
e second strategy of weakly supervised training, i.e.
fine-tuning Mobilenet from ImageNet weights using only
the TCGA-PRAD dataset, achieves
κ=0.49 ±0.08
is represented with the green line. e weakly supervised
strategy obtained better performance than the incremen-
tal fully supervised models, despite being trained only
with the noisy, weak labels. e third transfer learning
technique that fine-tunes the incremental fully super-
vised models reaches a
κ=0.52 ±0.05
as performance
when fine-tuning the TMAZ model with 100% of the
annotations. e performance for the fully supervised
models with each percentage of data used is repre-
sented as the orange line. is transfer learning strategy
outperformed the other two strategies and it suggests a
performance increase when the number of annotations
used to train the fine-tuned models also increases. e
performance seems to be more robust, i.e., having a nar-
rower confidence interval, as long as more annotations
are provided to the fine-tuned models. e performance
gap between the incremental fully supervised strategy
is broader than between the weakly supervised strategy.
e results for Gleason score and primary and second-
ary Gleason patterns are summarized in Table5. e cor-
rect and misclassified cases for each ISUP grade and each
method are in the confusion matrices shown in Fig.5.
In the following paragraphs we discuss the results and
trade-offs in sample efficiency of each of the methods.
An increase of the classification performance on the TMA dataset is observed when increasing the number of annotations used to train the model. This was expected because of the controlled conditions of the experiment and annotations, where all the data originates from the same pathology laboratory. The performance trend suggests that more pixel-wise annotations can further increase the performance in Gleason pattern classification. The performance for Gleason scoring is always above the performance of Gleason pattern classification. This may be due to the way in which the Gleason score is calculated, since the order of the patterns is irrelevant to the final score in this evaluation, thus making the task easier.

Fig. 5 Confusion matrices displaying the correctly classified cases for each of the five ISUP grades (diagonals of the matrices) in the TCGA-PRAD dataset, for the supervised (100% of strong labels), weakly supervised and fine-tuning (100% of strong labels) methods. The top row shows the normalized versions of the bottom-row matrices, which display the total number of cases.
The inter-dataset heterogeneity is evident, as the best model trained with the TMA dataset failed to generalize to the TCGA-PRAD dataset directly. The image heterogeneity in the TMAZ dataset is lower than in the TCGA-PRAD dataset. Furthermore, the strongly annotated training dataset is relatively small (up to 15,000 patches). Besides the visual differences, the lack of annotations makes the patch extraction process more prone to errors, which is tackled only partially by extracting patches from usable tissue masks only and ranking the patches with the blue-ratio criterion. The augmentations used for the TCGA-PRAD dataset make the training of the CNN more robust to tissue appearance.

For this reason, the fine-tuning alleviated the performance decrease considerably, surpassing the performance of the weak training only, which shows that the model correctly leverages the transferred TMA weights while adjusting to the particularities of the TCGA-PRAD dataset.
There is an important gap between the performance of the model in the controlled TMAZ dataset, with κ = 0.69 ± 0.02, and the best models on TCGA-PRAD, with κ = 0.52 ± 0.05, which suggests that there is room for improvement by designing a model that can better leverage the combination of a small set of strong labels and a large set of weakly annotated data. The most effective transfer learning method for the WSI dataset was fine-tuning the trained TMA model weights using the weakly annotated patches, despite the fact that these patches are less reliable than those from the TMAZ dataset (because they refer to the global WSI diagnosis rather than to local structures). This shows a behaviour similar to what occurs in transfer learning with CNN models trained on natural images [41], where, even if the task or dataset differs substantially, there is still a set of basic features that carries over to the fine-tuned models and allows better generalization.
Our strategy is easy to implement and applicable to different scenarios, but it also has a few limitations. The principal limitation is that weak labels introduce some degree of noise that is difficult to quantify. There are no ground-truth regions of interest to compare with; nevertheless, this problem is evident in the performance gap from strong supervision to the weakly trained models. For the limitation introduced by the noise in the weak labels, we foresee improvements by adopting attention mechanisms and multiple instance learning models that allow the model to discard the noisy, non-relevant samples and that have already proved useful and effective in histopathology use cases [9, 13, 26].
Second, the combination of weakly and fully supervised learning yields the best experimental results at the whole slide image level, consistent with previous similar work using such a combination [15]. An interesting observation is that achieving the best performance comes at the cost of using at least 70% of the strong annotations in conjunction with all the weak labels to outperform the baseline of weakly supervised training. The need for such a large number of strong labels might be a drawback in the case of having very few strong labels. This need for additional supervision could be reduced using active and semi-supervised learning techniques [48, 49].
We also performed internal experiments using a large set of prostate biopsies (https://www.kaggle.com/c/prostate-cancer-grade-assessment/; the dataset is currently under embargo). The results showed the same trend and similar results as for the TCGA dataset, thus making the approach applicable to the images commonly used in clinical practice.
In this paper, transfer learning strategies that combine strong and weak supervision are evaluated in two cases of prostate cancer image classification in histopathology: tissue microarrays of prostate tissue and prostatectomy WSIs. The results of the techniques show that the Gleason pattern classification performance of a CNN model can be improved using a combination of strong labels and a large amount of weak labels via transfer learning.

Table 5 Performance for the evaluated methods on the test data of the TCGA-PRAD dataset

Method | κ-GS | κ-PGP | κ-SGP | Avg. acc. | Error rate | Micro-precision
Supervised (100%) | 0.30 ± 0.13 | 0.23 ± 0.16 | 0.10 ± 0.10 | 0.51 ± 0.06 | 2.45 ± 0.33 | 0.27 ± 0.05
Weakly supervised | 0.49 ± 0.08 | 0.36 ± 0.11 | 0.30 ± 0.09 | 0.67 ± 0.03 | 1.65 ± 0.18 | 0.43 ± 0.04
Fine-tuning (100%) | 0.52 ± 0.05 | 0.34 ± 0.10 | 0.40 ± 0.10 | 0.69 ± 0.02 | 1.51 ± 0.14 | 0.46 ± 0.03
The performance increases in the controlled TMA scenario with a larger number of annotations used to train the model. Nevertheless, the performance is hindered when the trained TMA model is applied directly to the more challenging WSI classification problem. This demonstrates that a good pre-trained model for prostate cancer image classification may lead to the best downstream model if fine-tuned on the target dataset, but it may not be the most transferable and generalizable pre-trained model otherwise. In future work, we plan to close the generalization gap between the weakly supervised trained models and the fully supervised ones even further, by better leveraging the combination of both sources of supervision and by designing and training better semi- and weakly-supervised learning models.
This project has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No. 825292 (ExaMode, http://www.examode.eu/). Infrastructure from the SURFsara HPC center was used to train the CNN models in parallel. Otálora thanks Colciencias through the call 756 for PhD studies. The authors also thank NVIDIA for the donation of the Titan Xp GPU used for the weakly supervised experiments.
Authors’ contributions
SO, NM, MA, and HM conceived the presented idea. SO designed the experimental setup; NM and SO performed the experiments. MA and HM verified the methods and supervised the findings of this work. All authors discussed the results and contributed to the final manuscript. All authors read and approved the final manuscript.
Availability of data and materials
The results shown in this paper are partly based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. The datasets described in this paper are publicly available. The TMAZ dataset can be found at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OCYCMP. The TCGA-PRAD whole slide images can be found at https://portal.gdc.cancer.gov/projects/TCGA-PRAD (only diagnostic slides were used). The source code for reproducing the experiments will be available upon publication of the manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
Henning Müller is on the advisory board of "Zebra Medical Vision" and "ContextVision". The remaining authors declare no competing interests.
Author details
1 HES-SO Valais, Technopôle 3, 3960 Sierre, Switzerland. 2 Computer Science
Centre (CUI), University of Geneva, Route de Drize 7, Battelle A, Carouge,
Switzerland. 3 Faculty of Medicine, University of Geneva, 1 rue Michel-Servet,
1211 Geneva, Switzerland. 4 Department of Neuroscience, University
of Padova, via Belzoni 160, 35121 Padova, Italy.
Received: 1 March 2021 Accepted: 20 April 2021
1. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin.
2. Epstein JI. An update of the gleason grading system. J Urol.
3. Egevad L, Delahunt B, Srigley JR, Samaratunga H (2016) International
society of urological pathology (ISUP) grading of prostate cancer—an
ISUP consensus on contemporary grading. Wiley Online Library
4. Fraggetta F, Garozzo S, Zannoni GF, Pantanowitz L, Rossi ED. Routine
digital pathology workflow: the catania experience. J Pathol Inform.
5. Komura D, Ishikawa S. Machine learning methods for histopathological
image analysis. Comput Struct Biotechnol J. 2018;16:34–42.
6. Schaer R, Otálora S, Jimenez-del-Toro O, Atzori M, Müller H. Deep
learning-based retrieval system for gigapixel histopathology cases and
the open access literature. J Pathol Inform. 2019;10:19.
7. Bulten W, Pinckaers H, van Boven H, Vink R, de Bel T, van Ginneken B, van
der Laak J, Hulsbergen-van de Kaa C, Litjens G (2020) Automated gleason
grading of prostate biopsies using deep learning. Lancet Oncol
8. Otálora S, Atzori M, Khan A, Jimenez-del-Toro O, Andrearczyk V, Müller H
(2020) A systematic comparison of deep learning strategies for weakly
supervised gleason grading. In: Medical imaging 2020: digital pathology,
vol 11320, International Society for Optics and Photonics, p 113200
9. Campanella G, Hanna MG, Geneslaw L, Miraflor A, Silva VWK, Busam KJ,
Brogi E, Reuter VE, Klimstra DS, Fuchs TJ. Clinical-grade computational
pathology using weakly supervised deep learning on whole slide images.
Nat Med. 2019;25(8):1301–9.
10. Nagpal K, Foote D, Liu Y, Chen P-HC, Wulczyn E, Tan F, Olson N, Smith JL,
Wren MA. Development and validation of a deep learning algorithm for
improving gleason scoring of prostate cancer. Digital Med. 2019;2(1):48.
11. Burlutskiy N, Pinchaud N, Gu F, Hägg D, Andersson M, Björk L, Eurén K, Svensson C, Wilén LK, Hedlund M (2019) Segmenting potentially cancerous areas in prostate biopsies using semi-automatically annotated data. In: Cardoso MJ, Feragen A, Glocker B, Konukoglu E, Oguz I, Unal G, Vercauteren T (eds) Proceedings of the 2nd international conference on medical imaging with deep learning. Proceedings of machine learning research, vol 102, PMLR, London, United Kingdom 2019, pp 92–108. http://proceedings.mlr.press/v102/burlutskiy19a.html
12. Ström P, Kartasalo K, Olsson H, Solorzano L, Delahunt B, Berney DM, Bostwick DG, Evans AJ, Grignon DJ, Humphrey PA et al (2019) Pathologist-level grading of prostate biopsies with artificial intelligence. arXiv: 1907.
13. Ilse M, Tomczak J, Welling M (2018) Attention-based deep multiple instance learning. In: Dy J, Krause A (eds) Proceedings of the 35th international conference on machine learning. Proceedings of machine learning research, vol 80, PMLR, Stockholmsmässan, Stockholm, Sweden 2018, pp 2127–2136. http://proceedings.mlr.press/v80/ilse18a.html
14. Arvaniti E, Fricker KS, Moret M, Rupp N, Hermanns T, Fankhauser C, Wey N, Wild PJ, Rueschoff JH, Claassen M (2018) Automated gleason grading of prostate cancer tissue microarrays via deep learning. Sci Rep
15. Arvaniti E, Claassen M (2018) Coupling weak and strong supervision for classification of prostate cancer histopathology images. In: Medical imaging meets NIPS workshop
16. Otálora S, Perdomo O, González F, Müller H (2017) Training deep convo-
lutional neural networks with active learning for exudate classification
in eye fundus images. In: Intravascular imaging and computer assisted
stenting, and large-scale annotation of biomedical data and expert label
synthesis, Springer, pp 146–154
17. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, Van Der
Laak JA, Van Ginneken B, Sánchez CI. A survey on deep learning in medi-
cal image analysis. Med Image Anal. 2017;42:60–88.
18. Tajbakhsh N, Shin JY, Gurudu SR, Hurst RT, Kendall CB, Gotway MB, Liang J.
Convolutional neural networks for medical image analysis: full training or
fine tuning? IEEE Trans Med Imaging. 2016;35(5):1299–312.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Otáloraetal. BMC Med Imaging (2021) 21:77
19. Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, IEEE, pp 248–255
20. Kieffer B, Babaie M, Kalra S, Tizhoosh HR (2017) Convolutional neural networks for histopathology image classification: training vs. using pre-trained networks. In: 2017 Seventh international conference on image processing theory, tools and applications (IPTA), IEEE, pp 1–6
21. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press. http://www.deeplearningbook.org
22. Bengio Y (2012) Deep learning of representations for unsupervised and transfer learning. In: Proceedings of ICML workshop on unsupervised and transfer learning, pp 17–36
23. Mormont R, Geurts P, Marée R (2018) Comparison of deep transfer learning strategies for digital pathology. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 2262–2271
24. Han S, Hwang SI, Lee HJ. A weak and semi-supervised segmentation method for prostate cancer in TRUS images. J Digit Imaging.
25. Li J, Li W, Gertych A, Knudsen BS, Speier W, Arnold CW (2019) An attention-based multi-resolution model for prostate whole slide image classification and localization. In: Medical computer vision workshop—CVPR
26. Katharopoulos A, Fleuret F (2019) Processing megapixel images with deep attention-sampling models. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th international conference on machine learning. Proceedings of machine learning research, vol 97, PMLR, Long Beach, California, USA, pp 3282–3291. http://proceedings.mlr.press/v97/katharopoulos19a.html
27. van der Laak J, Ciompi F, Litjens G. No pixel-level annotations needed. Nat Biomed Eng. 2019:1–2.
28. Recht B, Roelofs R, Schmidt L, Shankar V (2019) Do ImageNet classifiers generalize to ImageNet? arXiv:1902.10811
29. Otálora S, Atzori M, Khan A, Jimenez-del-Toro O, Andrearczyk V, Müller H (2020) Systematic comparison of deep learning strategies for weakly supervised Gleason grading. In: Tomaszewski JE, Ward AD (eds) Medical imaging 2020: digital pathology, vol 11320, SPIE, International Society for Optics and Photonics, pp 142–149. https://doi.org/10.1117/12.2548571
30. Epstein JI, Zelefsky MJ, Sjoberg DD, Nelson JB, Egevad L, Magi-Galluzzi C, Vickers AJ, Parwani AV, Reuter VE, Fine SW, et al. A contemporary prostate cancer grading system: a validated alternative to the Gleason score. Eur Urol. 2016;69(3):428–35.
31. del Toro OJ, Atzori M, Otálora S, Andersson M, Eurén K, Hedlund M, Rönnquist P, Müller H (2017) Convolutional neural networks for an automatic classification of prostate tissue slides with high-grade Gleason score. In: Medical imaging 2017: digital pathology, vol 10140, International Society for Optics and Photonics, p 101400
32. Tellez D, Litjens G, Bándi P, Bulten W, Bokhorst J-M, Ciompi F, van der Laak J. Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology. Med Image Anal. 2019;58:101544. https://doi.org/10.1016/j.media.2019.
33. Otálora S, Atzori M, Andrearczyk V, Khan A, Müller H. Staining invariant features for improving generalization of deep convolutional neural networks in computational pathology. Front Bioeng Biotechnol. 2019;7:198.
34. Tellez D, Litjens G, van der Laak J, Ciompi F (2019) Neural image compression for gigapixel histopathology image analysis. IEEE Trans Pattern Anal Mach Intell
35. Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, et al. The cancer imaging archive (TCIA): maintaining and operating a public information repository. J Digit Imaging.
36. Zuley M, Jarosz R, Drake B, Rancilio D, Klim A, Rieger-Christ K, Lemmerman J (2016) Radiology data from the cancer genome atlas prostate adenocarcinoma [TCGA-PRAD] collection. Cancer Imaging Arch
37. Abeshouse A, Ahn J, Akbani R, Ally A, Amin S, Andry CD, Annala M, Aprikian A, Armenia J, Arora A, et al. The molecular taxonomy of primary prostate cancer. Cell. 2015;163(4):1011–25.
38. Janowczyk A, Zuo R, Gilmore H, Feldman M, Madabhushi A. HistoQC: an open-source quality control tool for digital pathology slides. JCO Clin Cancer Inform. 2019;3:1–7.
39. Rousson M, Hedlund M, Andersson M, Jacobsson L, Läthén G, Norell B, Jimenez-del-Toro O, Müller H, Atzori M (2018) Tumor proliferation assessment of whole slide images. In: Medical imaging 2018: digital pathology, vol 105810, International Society for Optics and Photonics
40. Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861
41. Yosinski J, Clune J, Bengio Y, Lipson H (2014) How transferable are features in deep neural networks? In: Advances in neural information processing systems, pp 3320–3328
42. Bloice MD, Stocker C, Holzinger A (2017) Augmentor: an image augmentation library for machine learning. arXiv preprint arXiv:1708.04680
43. Tellez D, Litjens G, Bándi P, Bulten W, Bokhorst J-M, Ciompi F, van der Laak J. Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology. Med Image Anal. 2019;58:101544.
44. Byfield P et al (2020) StainTools: tools for tissue image stain normalisation and augmentation in Python. GitHub repository. https://github.com/Peter554/StainTools
45. Vahadane A, Peng T, Sethi A, Albarqouni S, Wang L, Baust M, Steiger K, Schlitter AM, Esposito I, Navab N. Structure-preserving color normalization and sparse stain separation for histological images. IEEE Trans Med Imaging. 2016;35(8):1962–71.
46. Arvaniti E, Fricker K, Moret M, Rupp N, Hermanns T, Fankhauser C, Wey N, Wild P, Rüschoff JH, Claassen M (2018) Replication data for: automated Gleason grading of prostate cancer tissue microarrays via deep learning. https://doi.org/10.7910/DVN/OCYCMP
47. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
48. Cheplygina V, de Bruijne M, Pluim JP. Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. Med Image Anal. 2019;54:280–96.
49. Otálora S, Marini N, Müller H, Atzori M (2020) Semi-weakly supervised learning for prostate cancer image classification with teacher-student deep convolutional networks. In: Interpretable and annotation-efficient learning for medical image computing, Springer, pp 193–203
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
... Although these approaches provided a reasonable accuracy, they heavily relied on pixel-level annotated datasets which should be done by experts and are very time-consuming. To reduce the need for annotating the data, Otalora et al. Otálora, Marini, Müller and Atzori (2021) proposed a transfer learning approach in which a CNN model is pretrained on the patches extracted from a fully annotated (region annotations) dataset and is later fine-tuned on a dataset with less expensive weak (image-level) labels. Their results showed that the model performance increases when fine-tuning a pretrained model on a fully annotated dataset for the task of Gleason scoring with the weak WSI labels. ...
... Their results showed that the model performance increases when fine-tuning a pretrained model on a fully annotated dataset for the task of Gleason scoring with the weak WSI labels. Despite of a good performance, the framework proposed in Otálora et al. (2021) still needs a large fully annotated dataset for their pretrained model. To address this limitation, weakly supervised algorithms have been developed for the Gleason grading of prostate cancer. ...
... The model proposed in Otálora et al. (2021) was trained on the Gleason 2019 dataset, and their weakly supervised model was used for the TCGA-PRAD dataset. As described in Section 2, the model predicts the primary and secondary scores of the patches in the slide, and WSI labels are computed by taking the majority voting of the most frequently predicted Gleason patterns. ...
Prostate cancer is the most common cancer in men worldwide and the second leading cause of cancer death in the United States. One of the prognostic features in prostate cancer is the Gleason grading of histopathology images. The Gleason grade is assigned based on tumor architecture on Hematoxylin and Eosin (H&E) stained whole slide images (WSI) by the pathologists. This process is time-consuming and has known interobserver variability. In the past few years, deep learning algorithms have been used to analyze histopathology images, delivering promising results for grading prostate cancer. However, most of the algorithms rely on the fully annotated datasets which are expensive to generate. In this work, we proposed a novel weakly-supervised algorithm to classify prostate cancer grades. The proposed algorithm consists of three steps: (1) extracting discriminative areas in a histopathology image by employing the Multiple Instance Learning (MIL) algorithm based on Transformers, (2) representing the image by constructing a graph using the discriminative patches, and (3) classifying the image into its Gleason grades by developing a Graph Convolutional Neural Network (GCN) based on the gated attention mechanism. We evaluated our algorithm using publicly available datasets, including TCGAPRAD, PANDA, and Gleason 2019 challenge datasets. We also cross validated the algorithm on an independent dataset. Results show that the proposed model achieved state-of-the-art performance in the Gleason grading task in terms of accuracy, F1 score, and cohen-kappa. The code is available at
... The annotation of the WSI in most data sets is at least overseen by a pathologist and can range from a single slide-level label of an overall Gleason score (GS) or Gleason grade 25 to individual gland-level Gleason pattern assignments. 26,27 Some studies attempt to strengthen the generalizability of their algorithm by using multiple data sets because annotation variation exists in clinical practice. 26,[28][29][30][31][32][33] Another point that must be considered is that annotations are only as good as the person making them. ...
... 26,27 Some studies attempt to strengthen the generalizability of their algorithm by using multiple data sets because annotation variation exists in clinical practice. 26,[28][29][30][31][32][33] Another point that must be considered is that annotations are only as good as the person making them. Observer variability is known to be an issue to the consistency of Gleason grading among pathologists, 34 leading some researchers to look at possible genetic indicators for cancer detection instead of hematoxylin and eosin (H&E) slide annotation. ...
Context.—: Automated prostate cancer detection using machine learning technology has led to speculation that pathologists will soon be replaced by algorithms. This review covers the development of machine learning algorithms and their reported effectiveness specific to prostate cancer detection and Gleason grading. Objective.—: To examine current algorithms regarding their accuracy and classification abilities. We provide a general explanation of the technology and how it is being used in clinical practice. The challenges to the application of machine learning algorithms in clinical practice are also discussed. Data sources.—: The literature for this review was identified and collected using a systematic search. Criteria were established prior to the sorting process to effectively direct the selection of studies. A 4-point system was implemented to rank the papers according to their relevancy. For papers accepted as relevant to our metrics, all cited and citing studies were also reviewed. Studies were then categorized based on whether they implemented binary or multi-class classification methods. Data were extracted from papers that contained accuracy, area under the curve (AUC), or κ values in the context of prostate cancer detection. The results were visually summarized to present accuracy trends between classification abilities. Conclusions.—: It is more difficult to achieve high accuracy metrics for multiclassification tasks than for binary tasks. The clinical implementation of an algorithm that can assign a Gleason grade to clinical whole slide images (WSIs) remains elusive. Machine learning technology is currently not able to replace pathologists but can serve as an important safeguard against misdiagnosis.
... When estimating Gleason grading, many papers only focused on classifying tiles or small regions like TMAs by taking advantage of classical CNN architectures trained on large datasets of natural images such as ImageNet [60]. In that context, tiles were encoded into features which corresponded to the input data for classification [21,38,[61][62][63][64][65][66]. One of the first papers in the field used a cohort of 641 TMAs, obtaining a quadratic Cohen Kappa of 0.71 [21]. ...
... Indeed, there is no standardization on the staining protocols, which translates to WSIs. This bias could be overcome using normalization such as Vahadane [94] or Macenko methods [95] or GANs (Generative Adversarial Networks) [24] or color augmentation [32,36,61,65,66,73,96], but most articles do not use image normalization to overcome this bias (50 out of 77 do not see Figure 3). In addition, most scanners have their proprietary format, which can impair the generalizability of the algorithms. ...
Full-text available
Deep learning (DL), often called artificial intelligence (AI), has been increasingly used in Pathology thanks to the use of scanners to digitize slides which allow us to visualize them on monitors and process them with AI algorithms. Many articles have focused on DL applied to prostate cancer (PCa). This systematic review explains the DL applications and their performances for PCa in digital pathology. Article research was performed using PubMed and Embase to collect relevant articles. A Risk of Bias (RoB) was assessed with an adaptation of the QUADAS-2 tool. Out of the 77 included studies, eight focused on pre-processing tasks such as quality assessment or staining normalization. Most articles (n = 53) focused on diagnosis tasks like cancer detection or Gleason grading. Fifteen articles focused on prediction tasks, such as recurrence prediction or genomic correlations. Best performances were reached for cancer detection with an Area Under the Curve (AUC) up to 0.99 with algorithms already available for routine diagnosis. A few biases outlined by the RoB analysis are often found in these articles, such as the lack of external validation. This review was registered on PROSPERO under CRD42023418661.
... Our primary objective is to classify cSCC WSI into one of four grading classes: normal (tumor not present), well-differentiated, moderatelydifferentiated, and poorly-differentiated. We address this problem in the multiple-instance learning (MIL) paradigm because of the success of previous studies that used MIL for weakly-supervised cancer grading and, thus, transform each WSI into a bag of tiled patches (instances) (Mun et al. 2021;Silva-Rodriguez et al. 2021;Otálora et al. 2021). ...
Full-text available
Cutaneous squamous cell cancer (cSCC) is the second most common skin cancer in the US. It is diagnosed by manual multi-class tumor grading using a tissue whole slide image (WSI), which is subjective and suffers from inter-pathologist variability. We propose an automated weakly-supervised grading approach for cSCC WSIs that is trained using WSI-level grade and does not require fine-grained tumor annotations. The proposed model, RACR-MIL, transforms each WSI into a bag of tiled patches and leverages attention-based multiple-instance learning to assign a WSI-level grade. We propose three key innovations to address general as well as cSCC-specific challenges in tumor grading. First, we leverage spatial and semantic proximity to define a WSI graph that encodes both local and non-local dependencies between tumor regions and leverage graph attention convolution to derive contextual patch features. Second, we introduce a novel ordinal ranking constraint on the patch attention network to ensure that higher-grade tumor regions are assigned higher attention. Third, we use tumor depth as an auxiliary task to improve grade classification in a multitask learning framework. RACR-MIL achieves 2-9% improvement in grade classification over existing weakly-supervised approaches on a dataset of 718 cSCC tissue images and localizes the tumor better. The model achieves 5-20% higher accuracy in difficult-to-classify high-risk grade classes and is robust to class imbalance.
... This decision was based on two key factors. First, in the context of LVO detection on CTA, bounding boxes (i.e., object detection) supply potent supervision signals for the learning process while reducing the need for resource-intensive pixel-level annotations (i.e., segmentation) 28,29 . Second, presenting the model's predictions for LVO detection enables users to understand how the model reached its decision in a specific case, potentially improving the reliability and explainability of the predictions. ...
Full-text available
The use of deep learning (DL) techniques for automated diagnosis of large vessel occlusion (LVO) and collateral scoring on computed tomography angiography (CTA) is gaining attention. In this study, a state-of-the-art self-configuring object detection network called nnDetection was used to detect LVO and assess collateralization on CTA scans using a multi-task 3D object detection approach. The model was trained on single-phase CTA scans of 2425 patients at five centers, and its performance was evaluated on an external test set of 345 patients from another center. Ground-truth labels for the presence of LVO and collateral scores were provided by three radiologists. The nnDetection model achieved a diagnostic accuracy of 98.26% (95% CI 96.25–99.36%) in identifying LVO, correctly classifying 339 out of 345 CTA scans in the external test set. The DL-based collateral scores had a kappa of 0.80, indicating good agreement with the consensus of the radiologists. These results demonstrate that the self-configuring 3D nnDetection model can accurately detect LVO on single-phase CTA scans and provide semi-quantitative collateral scores, offering a comprehensive approach for automated stroke diagnostics in patients with LVO.
... As for the global AI competition, the Prostate cANcer graDe Assessment (PANDA) challenge, a group of AI Gleason grading algorithms developed during a global competition generalized well to intercontinental and multinational cohorts with pathologist-level performance [10]. Other works [23,[28][29][30][31][32][33][34] have also looked into developing deep learning algorithms to classify prostate cancer Gleason scores based on histopathological images. ...
Full-text available
Background Prostate cancer is often a slowly progressive indolent disease. Unnecessary treatments from overdiagnosis are a significant concern, particularly low-grade disease. Active surveillance has being considered as a risk management strategy to avoid potential side effects by unnecessary radical treatment. In 2016, American Society of Clinical Oncology (ASCO) endorsed the Cancer Care Ontario (CCO) Clinical Practice Guideline on active surveillance for the management of localized prostate cancer. Methods Based on this guideline, we developed a deep learning model to classify prostate adenocarcinoma into indolent (applicable for active surveillance) and aggressive (necessary for definitive therapy) on core needle biopsy whole slide images (WSIs). In this study, we trained deep learning models using a combination of transfer, weakly supervised, and fully supervised learning approaches using a dataset of core needle biopsy WSIs (n=1300). In addition, we performed an inter-rater reliability evaluation on the WSI classification. Results We evaluated the models on a test set (n=645), achieving ROC-AUCs of 0.846 for indolent and 0.980 for aggressive. The inter-rater reliability evaluation showed s-scores in the range of 0.10 to 0.95, with the lowest being on the WSIs with both indolent and aggressive classification by the model, and the highest on benign WSIs. Conclusion The results demonstrate the promising potential of deployment in a practical prostate adenocarcinoma histopathological diagnostic workflow system.
Full-text available
Medical imaging model construction often faces a challenge in the shortage of training data. To tackle this limitation, transfer learning is frequently used by leveraging pre-trained models to enhance performance on smaller and more specific datasets. We investigated transfer learning in cancer histopathology imaging, assessing three popular deep neural network algorithms on three target datasets in various fine-tuning configurations. Our study discovered that pre-training cancer histopathology image datasets did not surpass those pre-trained with ImageNet or random initialization. It also revealed that the performance of pre-trained models improves with the increase of images used in fine-tuning. Consequently, the selection of pre-training datasets and fine-tuning configurations is critical when applying transfer learning in medical imaging. This study sheds light on the potential limitations and opportunities of transfer learning in cancer histopathology imaging, promoting the development of accurate and effective medical imaging models.
With digital clinical workflows in histopathology departments, the possibility to use machine-learning-based decision support is increasing. Still, there are many challenges despite often good results on retrospective data. Explainable AI can help to find bias in data and also integrated decision support with other available clinical data. The ExaMode project has implemented many tools and automatic pipelines for such decision support. Most of the algorithms are available for research use and can thus be of help for other researchers in the domain.
Prostate cancer is a dangerous type of cancer that kills a lot of men because it is hard to diagnose. Images taken of people with carcinoma have complex and important parts that are hard to get out with traditional diagnostic methods. Deep learning (DL) can classify the aggressiveness of prostate cancer by automatically extracting characteristics from whole-slide images of prostate biopsies that have been annotated by skilled pathologists. This study uses transfer learning to create resilient DL convolutional neural networks. A technique of risk assessment for prostate cancer called Gleason grading is based on the pathologist who reports the results and is vulnerable to bias. Systems that use DL have the potential to improve the efficiency and objectivity of Gleason grading. Based on a sizable, high-quality training dataset, a cutting-edge convolutional network architecture, and an extensive training set, we developed DL-based models for identifying prostate cancer tissue in whole-slide images (MobileNet V2, InceptionResNet V2, DenseNet 169, ResNet101 V2, and NasNetMobile). Accuracy, loss, and RMSE measurements were used in a confusion matrix to evaluate performance. DenseNet 169 provided the best results, with validation accuracy of 89.76% (ISUP Grade 0), training accuracy of 95.63% (ISUP Grade 1), validation accuracy of 96.98% (ISUP Grade 2), validation accuracy of 91.98% (ISUP Grade 3), and training accuracy of 95.63% (ISUP Grade 5). InceptionResNet V2 has obtained the highest average accuracy (validation), 84.99%. The results demonstrate that InceptionResNet V2 performed better than other models.
Conference Paper
Full-text available
Due to the shortage of training data, transfer learn-ing is frequently used in constructing medical imaging models.In this study, we perform transfer learning pre-training dataset and fine-tuning effect analysis in cancer histopathology imaging by evaluating three popular deep neural network algorithms on three target datasets under various fine-tuning configurations.Pre-training models with cancer histopathology image datasets appear to perform worse or not better than pre-training mod-els with ImageNet or random initialization. Furthermore, this study demonstrates that the performance of pre-trained models improves with the increase of images used in fine-tuning, which was previously overlooked.
Full-text available
Deep Convolutional Neural Networks (CNN) are at the backbone of the state–of–the art methods to automatically analyze Whole Slide Images (WSIs) of digital tissue slides. One challenge to train fully-supervised CNN models with WSIs is providing the required amount of costly, manually annotated data. This paper presents a semi-weakly supervised model for classifying prostate cancer tissue. The approach follows a teacher-student learning paradigm that allows combining a small amount of annotated data (tissue microarrays with regions of interest traced by pathologists) with a large amount of weakly-annotated data (whole slide images with labels extracted from the diagnostic reports). The task of the teacher model is to annotate the weakly-annotated images. The student is trained with the pseudo-labeled images annotated by the teacher and fine-tuned with the small amount of strongly annotated data. The evaluation of the methods is in the task of classification of four Gleason patterns and the Gleason score in prostate cancer images. Results show that the teacher-student approach improves significatively the performance of the fully-supervised CNN, both at the Gleason pattern level in tissue microarrays (respectively \(\kappa = 0.594 \pm 0.022\) and \(\kappa = 0.559 \pm 0.034\)) and at the Gleason score level in WSIs (respectively \(\kappa = 0.403 \pm 0.046\) and \(\kappa = 0.273 \pm 0.12\)). Our approach opens the possibility of transforming large weakly–annotated (and unlabeled) datasets into valuable sources of supervision for training robust CNN models in computational pathology.
The Gleason grading system has remained the most powerful prognostic predictor for patients with prostate cancer since the 1960s. Its application requires highly trained pathologists, is tedious, and yet suffers from limited inter-pathologist reproducibility, especially for the intermediate Gleason score 7. Automated annotation procedures constitute a viable solution to remedy these limitations. In this study, we present a deep learning approach for automated Gleason grading of prostate cancer tissue microarrays with Hematoxylin and Eosin (H&E) staining. Our system was trained using detailed Gleason annotations on a discovery cohort of 641 patients and was then evaluated on an independent test cohort of 245 patients annotated by two pathologists. On the test cohort, the inter-annotator agreements between the model and each pathologist, quantified via Cohen's quadratic kappa statistic, were 0.75 and 0.71 respectively, comparable with the inter-pathologist agreement (kappa = 0.71). Furthermore, the model's Gleason score assignments achieved pathology expert-level stratification of patients into prognostically distinct groups, on the basis of disease-specific survival data available for the test cohort. Overall, our study shows promising results regarding the applicability of deep learning-based solutions towards more objective and reproducible prostate cancer grading, especially for cases with heterogeneous Gleason patterns.
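Cohen's kappa with quadratic weights, the agreement metric quoted throughout these studies, is straightforward to compute from a confusion matrix. A minimal NumPy implementation (the function name is ours):

```python
import numpy as np

def quadratic_kappa(a, b, n_classes):
    """Cohen's kappa with quadratic weights between two raters' labels."""
    a, b = np.asarray(a), np.asarray(b)
    # Observed agreement as a confusion matrix.
    O = np.zeros((n_classes, n_classes))
    for i, j in zip(a, b):
        O[i, j] += 1
    # Expected matrix under chance agreement (outer product of marginals).
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    # Quadratic disagreement weights: zero on the diagonal, growing with
    # the squared distance between the assigned grades.
    idx = np.arange(n_classes)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    return 1.0 - (W * O).sum() / (W * E).sum()
```

Perfect agreement yields 1.0, chance agreement 0.0, and systematic disagreement is negative; the quadratic weights penalize a two-grade error four times more than a one-grade error, which suits ordinal scales like the Gleason score.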
Prostate cancer (PCa) is one of the most frequent cancers in men. Its grading is required before initiating treatment. The Gleason Score (GS) aims at describing and measuring the regularity in gland patterns observed by a pathologist on microscopic or digital images of prostate biopsies and prostatectomies. Deep Learning-based (DL) models are the state-of-the-art computer vision techniques for Gleason grading, learning high-level features with high classification power. However, obtaining robust models with clinical-grade performance requires a large number of local annotations. Previous research showed that it is feasible to detect low- and high-grade PCa from digitized tissue slides relying only on the less expensive report-level (weak) supervision, i.e., global rather than local labels. Despite this, few articles focus on classifying the finer-grained GS classes with weakly supervised models. The objective of this paper is to compare weakly supervised strategies for classification of the five GS classes from the whole slide image, using the global diagnostic label from the pathology reports as the only source of supervision. We compare different models trained on hand-crafted features, shallow and deep learning representations. Training and evaluation are done on the publicly available TCGA-PRAD dataset, comprising 341 whole slide images of radical prostatectomies, from which small patches are extracted within tissue areas and assigned the global report label as ground truth. Our results show that DL networks and class-wise data augmentation outperform the other strategies and their combinations, reaching a score of κ = 0.44, which could be further improved with a larger dataset or by combining strong and weakly supervised models.
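In this weak-label setup, patch-level predictions must be aggregated back into a single slide-level class. The abstract does not specify the aggregation rule, so this sketch shows one common choice, a majority vote over patch predictions:

```python
import numpy as np

def slide_prediction(patch_probs):
    """Aggregate patch-level class probabilities (n_patches, n_classes)
    into one slide-level label by majority vote over patch argmaxes."""
    votes = np.argmax(patch_probs, axis=1)
    return int(np.bincount(votes, minlength=patch_probs.shape[1]).argmax())
```

Alternatives such as averaging the probability vectors before the argmax, or weighting patches by tissue area, are equally plausible under the same weak supervision.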
One of the main obstacles for the implementation of deep convolutional neural networks (DCNNs) in the clinical pathology workflow is their low capability to overcome variability in slide preparation and scanner configuration, which leads to changes in tissue appearance. Some of these variations may not be included in the training data, which means the models risk not generalizing well. Addressing such variations and evaluating them in reproducible scenarios allows understanding when the models generalize better, which is crucial for performance improvements and better DCNN models. Staining normalization techniques (often based on color deconvolution and deep learning) and color augmentation approaches have shown improvements in the generalization of classification tasks for several tissue types. Domain-invariant training of DCNNs is also a promising technique to address the problem of training a single model for different domains, since it includes the source domain information to guide the training toward domain-invariant features, achieving state-of-the-art results in classification tasks. In this article, deep domain adaptation in convolutional networks (DANN) is applied to computational pathology and compared with widely used staining normalization and color augmentation methods in two challenging classification tasks. The classification tasks rely on two openly accessible datasets, targeting Gleason grading in prostate cancer and mitosis classification in breast tissue. The benchmark of the different techniques and their combinations in two DCNN architectures allows us to assess the generalization abilities and advantages of each method in the considered classification tasks. The code for reproducing our experiments and preprocessing the data is publicly available. Quantitative and qualitative results show that the use of DANN helps model generalization to external datasets.
The combination of several techniques to manage color heterogeneity suggests that combining methods, such as color augmentation with DANN training, can improve generalization even further. The results do not show a single best technique among the considered methods, even when combining them. However, color augmentation and DANN training most often obtain the best results (alone or combined with color normalization). The statistical significance of the results and the embedding visualizations provide useful insights for designing DCNNs that generalize to unseen staining appearances. Furthermore, in this work we release, for the first time, code for DANN evaluation on open-access datasets for computational pathology. This work opens the possibility of further research on using DANN models together with techniques that can overcome tissue preparation differences across datasets, to tackle limited generalization.
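The mechanism at the heart of DANN is the gradient reversal layer: identity in the forward pass, sign-flipped and scaled gradients in the backward pass, so the feature extractor is trained to confuse the domain classifier while still serving the label classifier. A framework-free sketch of just that layer (real implementations hook into an autograd engine such as PyTorch's):

```python
import numpy as np

class GradientReversal:
    """Gradient reversal layer used in DANN-style domain-adversarial
    training. Forward: identity. Backward: gradients multiplied by
    -lam, pushing features toward domain invariance."""

    def __init__(self, lam=1.0):
        self.lam = lam  # trade-off between label and domain objectives

    def forward(self, x):
        # Features pass through unchanged to the domain classifier.
        return x

    def backward(self, grad_output):
        # The domain classifier's gradient is reversed before reaching
        # the feature extractor.
        return -self.lam * grad_output
```

During training, `lam` is typically ramped up from 0 so the domain loss only influences the features once the label classifier has stabilized.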
The development of decision support systems for pathology and their deployment in clinical practice have been hindered by the need for large manually annotated datasets. To overcome this problem, we present a multiple instance learning-based deep learning system that uses only the reported diagnoses as labels for training, thereby avoiding expensive and time-consuming pixel-wise manual annotations. We evaluated this framework at scale on a dataset of 44,732 whole slide images from 15,187 patients without any form of data curation. Tests on prostate cancer, basal cell carcinoma and breast cancer metastases to axillary lymph nodes resulted in areas under the curve above 0.98 for all cancer types. Its clinical application would allow pathologists to exclude 65–75% of slides while retaining 100% sensitivity. Our results show that this system has the ability to train accurate classification models at unprecedented scale, laying the foundation for the deployment of computational decision support systems in clinical practice.
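Under the standard multiple instance learning assumption used for this kind of slide-level supervision, a slide is positive if at least one of its tiles is positive, so the simplest slide score is the maximum tile probability. A minimal sketch of that decision rule (the production system described above additionally trains an aggregation model on top of the ranked tiles):

```python
import numpy as np

def slide_score(tile_probs):
    """MIL max-pooling: the slide score is the probability of its most
    suspicious tile."""
    return float(np.max(tile_probs))

def slide_label(tile_probs, threshold=0.5):
    """A slide is called positive if any tile clears the threshold."""
    return slide_score(tile_probs) >= threshold
```

With only the diagnosis as supervision, training alternates between scoring all tiles of a slide and backpropagating through the top-ranked tiles, so no pixel-level annotation is ever needed.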
The purpose of this research is to exploit a weakly and semi-supervised deep learning framework to segment prostate cancer in TRUS images, alleviating the time-consuming work of radiologists in drawing lesion boundaries and enabling training on data that lack complete annotations. A histologically proven benchmark dataset of 102 case images was built, and 22 images were randomly selected for evaluation. A portion of the training images was strongly supervised, annotated pixel by pixel, and a deep neural network was trained on them. The remaining training images, with only weak supervision (just the location of the lesion), were fed to the trained network to produce intermediate pixel-wise labels for the weakly supervised images. The network was then retrained on all training images with the original and intermediate labels, and the training images were fed to the retrained network to produce refined labels. Comparing the distances from the centers of mass of the refined and intermediate labels to the weak supervision location, the closer one replaced the previous label; this constitutes the label update. After the label updates, the test set images were fed to the retrained network for evaluation. The proposed method shows better results with weakly and semi-supervised data than a method using only a small portion of strongly supervised data, although the improvement is not as large as when the fully strongly supervised dataset is used. In terms of mean intersection over union (mIoU), the proposed method reached about 0.6 when the ratio of strongly supervised data was 40%, about 2% lower than the 100% strongly supervised case. The proposed method thus seems able to help alleviate the time-consuming work of radiologists in drawing lesion boundaries, and to train neural networks on data that do not have complete annotations.
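The label-update criterion described above, keeping whichever candidate mask lies closer to the weak lesion location, can be sketched directly; the function names are ours:

```python
import numpy as np

def center_of_mass(mask):
    """(row, col) center of mass of a binary segmentation mask."""
    ys, xs = np.nonzero(mask)
    return np.array([ys.mean(), xs.mean()])

def update_label(intermediate, refined, weak_location):
    """Label update: keep the candidate mask whose center of mass is
    closer to the weakly supervised lesion location."""
    d_int = np.linalg.norm(center_of_mass(intermediate) - weak_location)
    d_ref = np.linalg.norm(center_of_mass(refined) - weak_location)
    return refined if d_ref < d_int else intermediate
```

Applied once per weakly supervised image after each retraining round, this rule lets the point-level supervision arbitrate between successive pseudo-label generations.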
Background: The Gleason score is the strongest correlating predictor of recurrence for prostate cancer, but has substantial inter-observer variability, limiting its usefulness for individual patients. Specialised urological pathologists have greater concordance; however, such expertise is not widely available. Prostate cancer diagnostics could thus benefit from robust, reproducible Gleason grading. We aimed to investigate the potential of deep learning to perform automated Gleason grading of prostate biopsies. Methods: In this retrospective study, we developed a deep-learning system to grade prostate biopsies following the Gleason grading standard. The system was developed using randomly selected biopsies, sampled by the biopsy Gleason score, from patients at the Radboud University Medical Center (pathology report dated between Jan 1, 2012, and Dec 31, 2017). A semi-automatic labelling technique was used to circumvent the need for manual annotations by pathologists, using pathologists' reports as the reference standard during training. The system was developed to delineate individual glands, assign Gleason growth patterns, and determine the biopsy-level grade. For validation of the method, a consensus reference standard was set by three expert urological pathologists on an independent test set of 550 biopsies. Of these 550, 100 were used in an observer experiment, in which the system, 13 pathologists, and two pathologists in training were compared with respect to the reference standard. The system was also evaluated on an external test dataset of 886 cores, of which 245 came from a different centre and were independently graded by two pathologists. Findings: We collected 5759 biopsies from 1243 patients.
The developed system achieved a high agreement with the reference standard (quadratic Cohen's kappa 0.918, 95% CI 0.891-0.941) and scored highly at clinical decision thresholds: benign versus malignant (area under the curve 0.990, 95% CI 0.982-0.996), grade group of 2 or more (0.978, 0.966-0.988), and grade group of 3 or more (0.974, 0.962-0.984). In an observer experiment, the deep-learning system scored higher (kappa 0.854) than the panel (median kappa 0.819), outperforming 10 of 15 pathologist observers. On the external test dataset, the system obtained a high agreement with the reference standard set independently by two pathologists (quadratic Cohen's kappa 0.723 and 0.707) and within inter-observer variability (kappa 0.71). Interpretation: Our automated deep-learning system achieved a performance similar to pathologists for Gleason grading and could potentially contribute to prostate cancer diagnosis. The system could potentially assist pathologists by screening biopsies, providing second opinions on grade group, and presenting quantitative measurements of volume percentages. Funding: Dutch Cancer Society.
A deep-learning model for cancer detection trained on a large number of scanned pathology slides and associated diagnosis labels enables model development without the need for pixel-level annotations.
We propose Neural Image Compression (NIC), a two-step method to build convolutional neural networks for gigapixel image analysis solely using weak image-level labels. First, gigapixel images are compressed using a neural network trained in an unsupervised fashion, retaining high-level information while suppressing pixel-level noise. Second, a convolutional neural network (CNN) is trained on these compressed image representations to predict image-level labels, avoiding the need for fine-grained manual annotations. We compared several encoding strategies, namely reconstruction error minimization, contrastive training and adversarial feature learning, and evaluated NIC on a synthetic task and two public histopathology datasets. We found that NIC can exploit visual cues associated with image-level labels successfully, integrating both global and local visual information. Furthermore, we visualized the regions of the input gigapixel images where the CNN attended to, and confirmed that they overlapped with annotations from human experts.
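The first step of NIC, turning a gigapixel image into a small grid of patch embeddings, can be sketched as follows; a fixed random projection stands in here for the unsupervised encoder that the paper actually trains (e.g. by reconstruction or contrastive loss), and the function name is ours:

```python
import numpy as np

def compress_gigapixel(image, patch_size, encoder):
    """NIC step 1: tile the image with a non-overlapping grid and replace
    each patch by a low-dimensional embedding, producing a compressed
    'image' of shape (rows, cols, code_dim) for a downstream CNN."""
    H, W, _ = image.shape
    rows, cols = H // patch_size, W // patch_size
    codes = [[encoder(image[r * patch_size:(r + 1) * patch_size,
                            c * patch_size:(c + 1) * patch_size])
              for c in range(cols)] for r in range(rows)]
    return np.array(codes)

# Stand-in encoder: a fixed random projection of the flattened patch
# (16x16 RGB -> 32-dimensional code).
rng = np.random.default_rng(0)
P = rng.normal(size=(32, 16 * 16 * 3)) / np.sqrt(16 * 16 * 3)
encoder = lambda tile: P @ tile.reshape(-1)
```

The compressed grid preserves spatial layout, which is what lets the second-stage CNN integrate global and local visual cues from only the image-level label.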
Stain variation is a phenomenon observed when distinct pathology laboratories stain tissue slides, which then exhibit similar but not identical color appearance. Due to this color shift between laboratories, convolutional neural networks (CNNs) trained with images from one lab often underperform on unseen images from another lab. Several techniques have been proposed to reduce the generalization error, mainly grouped into two categories: stain color augmentation and stain color normalization. The former simulates a wide variety of realistic stain variations during training, producing stain-invariant CNNs. The latter aims to match training and test color distributions in order to reduce stain variation. For the first time, we compared some of these techniques and quantified their effect on CNN classification performance using a heterogeneous dataset of hematoxylin and eosin histopathology images from 4 organs and 9 pathology laboratories. Additionally, we propose a novel unsupervised method to perform stain color normalization using a neural network. Based on our experimental results, we provide practical guidelines on how to use stain color augmentation and stain color normalization in future computational pathology applications.
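A typical stain color augmentation of the kind compared in this study perturbs the stain channels obtained by Ruifrok-Johnson color deconvolution. This sketch uses the standard H&E-DAB stain matrix; the perturbation ranges `alpha` (channel scale) and `beta` (channel shift) are hypothetical hyperparameters:

```python
import numpy as np

# Ruifrok-Johnson stain vectors (rows: hematoxylin, eosin, DAB).
M = np.array([[0.65, 0.70, 0.29],
              [0.07, 0.99, 0.11],
              [0.27, 0.57, 0.78]])
M_inv = np.linalg.inv(M)

def augment_stain(rgb, alpha, beta, rng=None):
    """Stain color augmentation: deconvolve RGB into stain concentrations,
    perturb each stain channel (scale ~ U(1-alpha, 1+alpha), shift ~
    U(-beta, beta)), then convolve back. rgb is float in (0, 1], (H, W, 3)."""
    rng = rng or np.random.default_rng()
    od = -np.log(np.clip(rgb, 1e-6, 1.0))   # optical density (Beer-Lambert)
    hed = od @ M_inv                         # per-pixel stain concentrations
    scale = rng.uniform(1 - alpha, 1 + alpha, 3)
    shift = rng.uniform(-beta, beta, 3)
    hed = hed * scale + shift                # jitter each stain channel
    return np.clip(np.exp(-(hed @ M)), 0.0, 1.0)
```

With `alpha = beta = 0` the round trip is the identity; small nonzero values simulate the lab-to-lab stain shifts the CNN should become invariant to.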