Medical Image Analysis 68 (2021) 101890
HookNet: Multi-resolution convolutional neural networks for semantic
segmentation in histopathology whole-slide images
Mart van Rijthoven a,*, Maschenka Balkenhol a, Karina Siliņa b, Jeroen van der Laak a,c, Francesco Ciompi a

a Diagnostic Image Analysis Group and the Department of Pathology, Radboud University Medical Center, Nijmegen, the Netherlands
b Institute of Experimental Immunology, University of Zurich, Zurich, Switzerland
c Center for Medical Image Science and Visualization, Linköping University, Linköping, Sweden
Article info

Article history:
Received 30 December 2019
Revised 22 October 2020
Accepted 23 October 2020
Available online 29 October 2020

Keywords:
Computational pathology
Semantic segmentation
Multi-resolution
Deep learning

Abstract
We propose HookNet, a semantic segmentation model for histopathology whole-slide images, which combines context and details via multiple branches of encoder-decoder convolutional neural networks. Concentric patches at multiple resolutions with different fields of view feed different branches of HookNet, and intermediate representations are combined via a hooking mechanism. We describe a framework to design and train HookNet for achieving high-resolution semantic segmentation and introduce constraints to guarantee pixel-wise alignment in feature maps during hooking. We show the advantages of using HookNet in two histopathology image segmentation tasks where tissue type prediction accuracy strongly depends on contextual information, namely (1) multi-class tissue segmentation in breast cancer and (2) segmentation of tertiary lymphoid structures and germinal centers in lung cancer. We show the superiority of HookNet when compared with single-resolution U-Net models working at different resolutions, as well as with a recently published multi-resolution model for histopathology image segmentation. We have made HookNet publicly available by releasing the source code¹ as well as in the form of web-based applications²,³ based on the grand-challenge.org platform.

© 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
1. Introduction
Semantic image segmentation is the separation of concepts by
grouping pixels belonging to the same concept, with the aim of
simplifying image representation and understanding. In medical
imaging, tumor detection and segmentation are necessary steps for
diagnosis and disease characterization. This is especially relevant
in histopathology, where tissue samples containing a wide variety and number of cells within a specific context have to be analyzed by pathologists for diagnostic purposes.
The introduction of high-resolution and high-throughput digital scanners has de facto revolutionized the field of pathology by digitizing tissue samples and producing gigapixel whole-slide images (WSI). In this context, the digital nature of WSIs makes it possible to use computer algorithms for automated histopathology image segmentation, which can be a valuable diagnostic tool for pathologists to identify and characterize different types of tissue, including cancer.

* Corresponding author.
E-mail address: mart.vanrijthoven@radboudumc.nl (M. van Rijthoven).
1 https://github.com/computationalpathologygroup/hooknet
2 https://grand-challenge.org/algorithms/hooknet-breast/
3 https://grand-challenge.org/algorithms/hooknet-lung/
1.1. Context and details in histopathology
It has long been known that although individual cancer cells may share morphological characteristics, the way they grow into specific patterns makes a profound difference in the prognosis of the patient. As an example, in hematoxylin and eosin (H&E) stained
breast tissue samples, different histological types of breast can-
cer can be distinguished. For instance, an invasive tumor that
originates in the breast duct (invasive ductal carcinoma, IDC) can
show a wide variety in growth patterns. In contrast, an inva-
sive tumor that originates in the breast lobules (invasive lobu-
lar carcinoma, ILC), is characterized by individually arranged tu-
mor cells. Furthermore, the same type of ductal carcinoma cells
can be confined within the breast duct (ductal carcinoma in situ,
DCIS) or become invasive by spreading outside the duct (IDC) (see
Fig. 1 ) ( Lakhani, 2012 ). To differentiate between these types of can-
cer, pathologists typically combine observations. For example, they
Fig. 1. Examples of ductal carcinoma in situ (DCIS), invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC) in breast tissue, and tertiary lymphoid structures (TLS) and germinal centers (GC) in lung tissue. For each example, multi-field-of-view/multi-resolution (MFMR) patches (introduced in Section 1.3) are shown: both a low-resolution/large-field-of-view and a concentric, high-resolution/small-field-of-view patch are depicted.
look at the global architectural composition of the tissue sample
and analyze the context of each tissue component, including can-
cer, to identify the presence of duct (both healthy and potentially
cancerous) and other tissue structures. Additionally, they zoom-in
into each region of interest, where the tissue is examined at a
high-resolution, to obtain the details of the cancer cells, and char-
acterize the tumor based on its local cellular composition. Another
example where pathologists take advantage of both context and
details is the spatial distribution of immune cells, which may be
detected in the presence of inflammation inside the tumor or the
stromal compartment of the cancer regions, as well as in specific
clustered groups called tertiary lymphoid structures (TLS), which
may develop in response to cancer. In subsequent stages of TLS
maturation, germinal centers (GC) are formed within the TLS (see
Fig. 1). It has been shown that the development of GC-containing TLS has significant relevance for patient survival and is an essential factor for the understanding of tumor development and treatment (Sautès-Fridman et al., 2016; Siliņa et al., 2018). A GC always lies within a TLS. A TLS contains a high density of lymphocytes with poorly visible cytoplasm, while GCs rather share similarities with other less dense tissues like tumor nests. To identify the TLS region and to differentiate between TLS and GC, both fine-grained details and contextual information are needed.
1.2. The receptive field and the field of view
In recent years, the vast majority of image analysis algorithms
are based on convolutional neural networks (CNN), a deep learn-
ing model that can tackle several computer vision tasks, includ-
ing semantic segmentation ( Long et al., 2015 ; Jégou et al., 2017 ;
Chen et al., 2018 ). In semantic segmentation, label prediction at
pixel level depends on the receptive field (RF), which is the extent
of the area of the input that is observable by a model. The size
of the RF of a CNN depends on the filter size, the pooling factor,
and the number of convolutional and pooling layers. By increas-
ing these parameters, the RF also increases, allowing the model to
capture more contextual information. However, this often comes at
the cost of an increase in the input size, which causes a high mem-
ory consumption due to large feature maps. As a consequence, a number of implicit restrictions often have to be applied in model optimization, such as reducing the number of model parameters, the number of feature maps, the mini-batch size, or the size of the predicted output, which may result in ineffective training and inefficient inference.
Another aspect concerning the observable information is the
field of view (FoV), which is the distance over the area (i.e., the ac-
tual space that the pixels disclose) in an input image and depends
on the spatial resolution of the input image. The FoV has implica-
tions for the RF: the same model, the same input size, and the
same RF size, can comprise a wider FoV by considering an im-
age at lower resolution due to the compressed distance over the
area (i.e., fewer pixels disclose the same FoV of the original reso-
lution). Thereby, using a down-sampled representation of the orig-
inal input image, the model can benefit from more contextual ag-
gregation ( Graham and Rajpoot, 2018 ), at the cost of losing high-
resolution details. Furthermore, contextual aggregation is limited
by the input dimensions, meaning that a RF size can only exceed
the input dimensions if padded artificial input pixels are used (a
technique usually referred to as the use of same padding), which
do not contain contextual information. While reducing the origi-
nal input dimensions can be used to focus on scale information
( Kausar et al., 2018; Li et al., 2018 ), the potential contextual infor-
mation remains unchanged.
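To make the relation between these hyperparameters and the RF concrete, the following minimal Python sketch applies the standard recursive RF computation; the layer configuration below is illustrative, not a model from this paper:

```python
# Minimal receptive-field calculator: each layer is (kernel_size, stride).
# The configuration below is illustrative (a U-Net-style encoder with two
# 3x3 convolutions followed by 2x2 max-pooling, repeated four times).
def receptive_field(layers):
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # growth scales with the cumulative stride
        jump *= stride
    return rf

encoder = [(3, 1), (3, 1), (2, 2)] * 4
print(receptive_field(encoder))  # RF size in input pixels (here: 76)
```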
1.3. Multi-field-of-view multi-resolution patches
Whole-slide images (WSI) are pyramidal data structures con-
taining multi-resolution gigapixel images, including down-sampled
representations of the original image. In the context of CNN model development, it is not possible to capture a complete WSI at full resolution in the RF, because the billions of pixels in a WSI exceed the memory capacity of a single modern GPU, which is usually used to train CNN models. A common way to
overcome this limitation is to train a CNN with patches (i.e., sub-
regions from a WSI). Due to the multi-resolution nature of WSIs,
the patches can originate from different spatial resolutions, which
are expressed in micrometers per pixel (μm/px). A patch is ex-
tracted by selecting a position within the WSI, together with a
size and a particular resolution. When extracting a patch at the
highest available resolution, the potential contextual information
is not yet depleted because there is available tissue around the
patch that is not considered in the RF. By extracting a concen-
tric (i.e., centered on the same location of the whole-slide im-
age) patch, with the same size but lower resolution, the same RF
aggregates more contextual information and includes information
that was not available before. Multiple concentric patches with the
same size, extracted at different resolutions, can be interpreted
as having multiple FoVs. Hence, we call a set of these patches multi-field-of-view multi-resolution (MFMR) patches (see Fig. 1). To
date, research using MFMR patches extracted from histopathology
images, has mostly been focused on combining features obtained
from MFMR patches for patch classification ( Alsubaie et al., 2018;
Sirinukunwattana et al., 2018; Wetteland et al., 2019 ). However,
using patch classification for the purpose of semantic segmenta-
tion results in a coarse segmentation map or heavy computation
due to a required sliding window approach, which is needed for
the classification of every pixel. For the task of segmentation, the
use of MFMR patches is not straightforward: when combining fea-
tures obtained from MFMR patches, pixel-wise alignment should
be enforced when integrating them. Gu et al. (2018) proposed to
use a U-Net architecture ( Ronneberger et al., 2015 ) for processing
a high-resolution patch, and additional encoders to process lower
resolution patches. Subsequently, feature maps from the additional
encoders are cropped, up-sampled, and concatenated in the de-
coder parts of U-Net, at the places where skip connections are
concatenated as well. The feature maps from the additional en-
coder are up-sampled without skip connections, at the cost of lo-
calization precision. Furthermore, their proposed model concate-
nates feature maps at every depth in the decoder, which might
be redundant and results in a model with high memory consumption.
Moreover, considering the necessity of pixel-wise alignment,
their model is restricted to same padding, which can introduce
artifacts.
A multi-class problem where classes are known to depend on both context and fine-grained details can benefit from the combined information in a set of MFMR patches. However, this is still an unsolved problem. The challenge is to simultaneously output a high-resolution segmentation based on fine details detectable at high resolution while incorporating contextual features; a minimal sketch of concentric MFMR patch extraction is shown below.
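The sketch below reads concentric MFMR patches from a WSI pyramid with the OpenSlide Python API. This is an assumption: the released source code may differ, the file name, coordinates, and spacings are hypothetical, and an additional resize would be needed if a requested resolution falls between pyramid levels.

```python
# Sketch of concentric MFMR patch extraction (OpenSlide API assumed).
import openslide

def mfmr_patch(slide, center_x, center_y, size, spacing, base_spacing):
    """Extract a size x size RGB patch centered on (center_x, center_y)
    (level-0 pixel coordinates) at `spacing` um/px, where `base_spacing`
    is the um/px of pyramid level 0."""
    downsample = spacing / base_spacing
    level = slide.get_best_level_for_downsample(downsample)
    # top-left corner of the patch, expressed in level-0 coordinates
    x = int(center_x - (size / 2) * downsample)
    y = int(center_y - (size / 2) * downsample)
    return slide.read_region((x, y), level, (size, size)).convert('RGB')

slide = openslide.OpenSlide('breast_case.tif')               # hypothetical WSI
p_target = mfmr_patch(slide, 20000, 18000, 284, 0.5, 0.24)   # fine details
p_context = mfmr_patch(slide, 20000, 18000, 284, 8.0, 0.24)  # wide context
```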
1.4. Our contribution
This paper introduces HookNet, a multi-branch segmentation
framework based on convolutional neural networks that can simul-
taneously incorporate contextual information and high-resolution
details to produce fine-grained segmentation maps of histopathol-
ogy images. From the perspective of multi-branch segmentation
models, the proposed framework introduces several novel technical
contributions: (1) it directly and efficiently combines pixel-aligned
feature maps across branches via a hooking mechanism; (2) it al-
lows for skip connections in all branches, thereby supporting local-
ization precision, and simultaneously uses valid convolutions while
keeping the combined feature maps pixel aligned.
From the perspective of applications in digital pathology, this
work contributes with two multi-class and multi-organ segmen-
tation models addressing problems where tissue is subjected to
high-resolution details and context. The first application focuses
on DCIS, IDC, and ILC segmentation in histopathology breast tis-
sue. The second application focuses on the segmentation of GC,
TLS, and tumor in lung tissue. To the best of our knowledge, these
tissue types have never been simultaneously and separately seg-
mented with a single model. Both models are released in the form
of web applications.
2. Materials
In order to train and assess the performance of HookNet, we
collected two datasets: a breast dataset and a lung dataset.
2.1. Breast dataset
We collected 86 breast cancer tissue sections containing IDC
(n = 34), DCIS (n = 35), and ILC (n = 17). For the DCIS and IDC
cases, we used H&E stained tissue sections, which were initially
made for routine diagnostics. All tissue sections were prepared ac-
cording to the laboratory protocols from the Department of Pathol-
ogy of Radboud University Medical Center, Nijmegen (the Nether-
lands). Slides were digitized using a Pannoramic P250 Flash II scan-
ner (3DHistech, Hungary) at a spatial resolution of 0.24 μm/px. For
ILC, new tissue sections were cut and stained for H&E, after which,
slides were scanned using the same scanner and scanning protocol
as for the IDC/DCIS cases. After digitization of the WSIs, the H&E
stained ILC sections were de-stained and subsequently re-stained
Fig. 2. Example of the procedure to manually annotate ILC regions in breast cancer slides. Left: slide stained with H&E, with a manual annotation of the region containing the bulk of the tumor and details of manual (sparse) annotations of ILC and of healthy epithelium. Right: immunohistochemistry of the same tissue sample, de-stained from H&E and re-stained with P120, used by a medical research assistant to guide manual annotations and to identify ILC cells.
(Brand et al., 2014) using P120 catenin antibody (P120) (Canas-Marques and Schnitt, 2016), which stains lobular carcinoma cells cytoplasmically, in contrast to the membranous staining pattern of normal epithelial cells. P120 stained sections were subsequently scanned
using the same scanner and protocol as the H&E sections. This pro-
cedure allowed us to have both H&E and immunohistochemistry
(IHC) of the same tissue section.
Three people were involved in the creation of manual annota-
tions: two medical research assistants, who had undergone a su-
pervised training procedure in the pathology department to specif-
ically recognize and annotate breast cancer tissue in histopathology
slides, and a resident pathologist (MB), with six years of experi-
ence in diagnostics and research in digital pathology. To guide the
procedure of annotating ILC, the resident pathologist visually iden-
tified and annotated the region containing the bulk of tumor in the
HE slide (see Fig. 2 ). Successively, the research assistants used this
information next to the available IHC slide to identify and annotate
ILC cells. Additionally, the research assistants made annotations of
DCIS, IDC, fatty tissue, benign epithelium, and an additional class
of other tissue, containing inflammatory cells, skin/nipple, erythro-
cytes, and stroma. All annotations were finally checked by the resi-
dent pathologist and corrections were made when needed. The in-
house developed open-source software ASAP ( Litjens et al., 2018 )
was used to make manual annotations. As a result, 6,279 regions were annotated, of which 1,810 contained ILC cells. Sparse anno-
tations of tissue regions were made, meaning that drawn contours
could be both non-exhaustive (i.e., not all instances of that tissue
type were annotated) and non-dense (i.e., not all pixels belonging
to the same instance were included in the drawn contour). Ex-
amples of sparse annotations are depicted in Fig. 2 . As a result,
6 classes were annotated in this dataset. For training, validation, and testing purposes, the WSIs were divided into training (n = 50), validation (n = 18), and test (n = 18) sets, all containing a similar distribution of cancer types.
2.2. Lung dataset
We randomly selected 27 diagnostic H&E-stained digital slides from The Cancer Genome Atlas lung squamous cell carcinoma (TCGA-LUSC) data collection, which is publicly available in the Genomic Data Commons (GDC) Data Portal (Grossman et al., 2016).
For this dataset, sparse annotations of TLS, GC, tumor, and other
lung parenchyma were made by a senior researcher (KS) with
more than six years of experience in tumor immunology and
histopathology, and checked by a resident pathologist (MB). As a result, 1,098 annotations covering 4 classes were made in this dataset. For model development and performance assessment, we used 3-fold cross-validation, which allowed us to test the performance of the presented models on all available slides. Each fold contains 12:6:9 slides for training:validation:testing. We made sure that all splits had an appropriate class balance.
3. HookNet: multi-branch encoder-decoder network
In this section we present “HookNet”, a convolutional neural
network model for semantic segmentation that processes concen-
tric MFMR patches via multiple branches of encoder-decoder models
and combines information from different branches via a “hooking”
mechanism (see Fig. 3 ). The aim of HookNet is to produce semantic
segmentation by combining information from (1) low-resolution
patches with a large field of view, which carry contextual visual
information, and (2) high-resolution patches with a small field of
view, which carry fine-grained visual information. For this purpose,
we propose HookNet as a model that consists of two encoder-
decoder branches, namely a context branch, which extracts features
from input patches containing contextual information, and a target
branch, which extracts fine-grained details from the highest reso-
lution input patches for the target segmentation. The key idea of
this model is that fine-grained and contextual information can be
combined by concatenating feature maps across branches, thereby resembling the process pathologists go through when zooming in and out while examining tissue.
We present the four main HookNet components in the order in which they should be designed to fulfill the constraints necessary for a seamless and accurate segmentation output, namely (1) branch architecture and properties, (2) the extraction of MFMR patches, (3) the constraints of the "hooking" mechanism, and (4) the handling of targets and losses.
3.1. Context and target branches
The first step in the design of HookNet is the definition of its
branches. Without loss of generality, we designed the model under
the assumptions that (1) the two branches have the same architec-
ture but do not share their weights, and (2) each branch consists of
an encoder-decoder model based on the U-Net ( Ronneberger et al.,
2015) architecture. As in the original U-Net model, each convolutional layer performs valid 3×3 convolutions with stride 1, followed by max-pooling layers with a 2×2 down-sampling factor. For the up-sampling path, we adopted the approach proposed in Odena et al. (2016), consisting of nearest-neighbour 2×2 up-scaling followed by convolutional layers.
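The building blocks above could be sketched as follows; PyTorch is an assumption (the paper does not prescribe a framework) and the channel widths are placeholders:

```python
# Per-branch building blocks: valid 3x3 convolutions with stride 1,
# 2x2 max-pooling, and nearest-neighbour 2x2 up-scaling followed by a
# convolutional layer (Odena et al., 2016). A sketch, not the authors'
# exact implementation.
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3),  # 'valid': no padding
        nn.ReLU(),
        nn.Conv2d(c_out, c_out, kernel_size=3),
        nn.ReLU())

downsample = nn.MaxPool2d(kernel_size=2)

def up_block(c_in, c_out):
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(c_in, c_out, kernel_size=3),
        nn.ReLU())
```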
3.2. MFMR input patches
The input to HookNet is a pair (P_C, P_T) of (M × M × 3) MFMR RGB concentric patches extracted at two different spatial resolutions r_C and r_T, measured in μm/px, for the context (C) and the target (T) branch, respectively. In this way, we ensure that the field of view of P_T corresponds to the central square region of size (M·r_T/r_C × M·r_T/r_C × 3) of P_C, but at higher resolution. In order to get a seamless segmentation output and to avoid artifacts due to misalignment of feature maps in the encoder-decoder branches, specific design choices should be made on (1) the size and (2) the resolution of the input patches. First, M has to be chosen such that all feature maps in the encoder path have an even size before each pooling layer. Introduced initially in Ronneberger et al. (2015), this constraint is crucial for HookNet, as an unevenly sized feature map will cause misalignment of feature maps not only via skip connections but also across branches. Hence, this constraint ensures that feature maps across the two branches remain pixel-wise aligned. Second, r_T and r_C should be chosen in such a way that, given the branches' architecture, a pair of feature maps in the decoding paths across branches comprise the same resolution. This pair is an essential requisite for the "hooking" mechanism detailed in Section 3.3. In practice, given the depth D (i.e., the number of pooling layers) of the encoder-decoder architecture, r_C and r_T should take on values such that the following inequality holds: 2^D · r_T ≥ r_C.
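Both constraints can be checked programmatically. The sketch below assumes the two-valid-3×3-convolutions-per-level layout of Section 3.1, and additionally requires the resolution ratio to map to an integer depth difference, as implied by the hooking mechanism of Section 3.3:

```python
import math

# (1) input size M must keep every encoder feature map even-sized before
# each pooling layer; (2) the ratio r_C / r_T must not exceed 2**D and
# must be a power of two.
def valid_input_size(m, depth):
    for _ in range(depth):
        m -= 4                    # two valid 3x3 convolutions shrink by 4
        if m <= 0 or m % 2 != 0:  # must be even (and positive) before pooling
            return False
        m //= 2                   # 2x2 max-pooling halves the map
    return True

def resolutions_compatible(r_target, r_context, depth):
    ratio = r_context / r_target
    return ratio <= 2 ** depth and math.log2(ratio).is_integer()

print(valid_input_size(284, 4))             # True: the paper's M = 284, D = 4
print(resolutions_compatible(0.5, 8.0, 4))  # True: ratio 16 = 2**4
```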
3.3. Hooking mechanism
We propose to combine, i.e., hook up, information from the context branch into the target branch via the simple concatenation of feature maps extracted from the decoding paths of the two branches. Our choice for concatenation as the operation to combine feature maps is based on the success of skip connections in the original U-Net, which also use concatenation. Moreover, compared with pre-defined hard-coded operations such as activation-wise multiplication or sum of feature maps, concatenation allows the downstream convolutional layer to learn the optimal operation to combine the features during the parameter optimization procedure.

The feature maps from the context and target branch should not be concatenated before the bottleneck layer, such that semantic encoding occurs separately. We postulate that concatenation could be best done at the beginning of the decoder in the target branch, to take maximum advantage of the inherent up-sampling in the decoding path, where the concatenated feature maps can benefit from every skip connection within the target branch. We call this concatenation "hooking", and in order to guarantee pixel-wise alignment in feature maps, we define the spatial resolution of a feature map as SRF = 2^d · r, where d is the depth in the encoder-decoder model and r is the resolution of the input patch measured
Fig. 3. HookNet model architecture. Concentric patches with multiple views at multiple resolutions (MFMR patches) were used as input to a dual encoder-decoder model. Skip connections for both branches are omitted for clarity. Feature maps were down- and up-sampled by a factor of 2. In this example, the feature maps at depth 2 in the decoder part of the context branch comprise the same resolution as the feature maps in the bottleneck of the target branch. To combine contextual information with high-resolution information, feature maps from the context branch were hooked into the target branch before the first decoder layer by cropping and concatenation.
in μm/px. To define the relative depths where the hooking can take place, we define an SRF ratio between a pair of feature maps as

SRF_C / SRF_T = 2^(d_C − d_T) · (r_C / r_T)    (1)

where d_T and d_C are the relative depths for the target and context branch, respectively. In practice, hooking can take place when feature maps from both branches comprise the same resolution: SRF_C / SRF_T = 1.
As a result, the central square regions in the feature maps of the context branch at depth d_C correspond to the feature maps of the target branch at depth d_T. The size of this central square region is equal to the size of the feature maps of the target branch, because both feature maps comprise the same resolution. To do the actual hooking, simple cropping can be applied, such that context branch feature maps are concatenated, pixel-aligned, together with feature maps in the target branch.
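A minimal sketch of this crop-and-concatenate step, assuming PyTorch and NCHW tensors that already comprise the same SRF:

```python
import torch

# Center-crop the context feature map to the target map's spatial size,
# then concatenate along the channel axis.
def hook(context_features, target_features):
    size = target_features.shape[2]               # target spatial size
    dh = (context_features.shape[2] - size) // 2
    dw = (context_features.shape[3] - size) // 2
    cropped = context_features[:, :, dh:dh + size, dw:dw + size]
    return torch.cat([cropped, target_features], dim=1)
```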
3.4. Loss function
While training HookNet, a separate loss can be computed for each branch. We propose a loss function L = λ·L_high + (1 − λ)·L_low, where L_high and L_low are pixel-wise categorical cross-entropy losses for the target and the context branch, respectively, and λ controls the importance of each branch.
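A minimal sketch of this loss, assuming PyTorch logits of shape (N, C, H, W) and integer label maps of shape (N, H, W):

```python
import torch.nn.functional as F

# Weighted two-branch loss: pixel-wise categorical cross-entropy per
# branch, mixed by lambda.
def hooknet_loss(target_logits, context_logits,
                 target_labels, context_labels, lam=0.75):
    l_high = F.cross_entropy(target_logits, target_labels)   # target branch
    l_low = F.cross_entropy(context_logits, context_labels)  # context branch
    return lam * l_high + (1.0 - lam) * l_low
```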
3.5. Pixel-based sampling

Patches were sampled with a particular tissue type, i.e., class label, at the center location of the sampled patch. Due to the sparseness of the ground truth labels, some patches contained fewer ground truth pixels than other patches. During training, we ensured that every class label is equally represented through the following pixel-based sampling strategy. In the first mini-batch, patches were randomly sampled. In all subsequent mini-batches, patch sampling was guided by the accumulated number of ground-truth pixels for every class seen in the previous mini-batches. Classes with a lower pixel accumulation had a higher chance of being sampled, to compensate for under-represented classes. A sketch of this strategy is given below.
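A minimal sketch of this sampling strategy; all names are hypothetical, and the released source code may implement it differently:

```python
# Class-sampling probabilities are inversely proportional to the number
# of ground-truth pixels accumulated per class over previous mini-batches.
import numpy as np

n_classes = 6                       # e.g., the breast dataset
pixel_counts = np.ones(n_classes)   # start uniform; avoids division by zero
rng = np.random.default_rng(0)

def sample_class():
    weights = 1.0 / pixel_counts
    return rng.choice(n_classes, p=weights / weights.sum())

def update_counts(ground_truth_patch):
    # accumulate the labelled pixels per class from the sampled patch
    for c in range(n_classes):
        pixel_counts[c] += np.sum(ground_truth_patch == c)
```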
3.6. Model training setup
Patches were extracted with dimensions of 284 × 284 × 3, and we used a mini-batch size of 12, which allows for two times the number of classes to be in a batch. Convolutional layers used valid convolutions, L2 regularization, and the ReLU activation function. Each convolutional layer was followed by batch normalization. Both branches had a depth of 4 (i.e., 4 down-sampling and 4 up-sampling operations). As mentioned in Section 3.1, for down- and up-sampling operations, we used max-pooling and nearest-neighbour up-scaling followed by a convolutional layer. To predict the soft labels, we used the softmax activation function.

The contribution of the losses from the target and context branch can be controlled with the λ value. We tested λ = 1 to ignore the context loss, λ = 0.75 to give more importance to the target branch, λ = 0.5 for equal importance, and λ = 0.25 to give more importance to the context loss. Moreover, we made use of the Adam optimizer with a learning rate of 5 × 10^-6. We customized the number of filters for all models, such that every model has approximately 50 million parameters. We trained for 200 epochs, where each epoch consists of 1000 training steps followed by the calculation of the F1 score on the validation set, which was used to determine the best model. To increase the variation of the datasets and account for color changes induced by the variability of staining, we applied spatial, color, noise and stain (Tellez et al., 2018) augmentations. No stain normalization techniques were used in this work.
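Putting these settings together, a hedged sketch of the training loop (PyTorch assumed; `model`, `sample_minibatch`, `macro_f1`, and `validation_set` are hypothetical helpers, `hooknet_loss` is the sketch from Section 3.4, and expressing the L2 regularization as weight decay is an assumption):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=5e-6, weight_decay=1e-5)
best_f1 = 0.0
for epoch in range(200):
    for step in range(1000):
        patches, labels = sample_minibatch(batch_size=12)
        optimizer.zero_grad()
        target_logits, context_logits = model(*patches)
        loss = hooknet_loss(target_logits, context_logits, *labels, lam=0.75)
        loss.backward()
        optimizer.step()
    f1 = macro_f1(model, validation_set)  # model selection on validation F1
    if f1 > best_f1:
        best_f1 = f1
        torch.save(model.state_dict(), 'best_hooknet.pt')
```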
4. Experiments
In order to assess HookNet, we compared it to five individual U-Net models trained with patches extracted at the following resolutions: 0.5, 1.0, 2.0, 4.0, and 8.0 μm/px. The models are expressed as U-Net(r_t) and HookNet(r_t, r_c), where r_t and r_c are the input resolutions for the target and context branch, respectively. The aim of HookNet is to output high-resolution segmentation maps, and therefore the target branch processes input patches extracted at 0.5 μm/px. For the context branch, we extracted patches at the intermediate (2.0 μm/px) and extreme (8.0 μm/px) resolutions that were tested for the single-resolution models and showed potential value in single-resolution performance measures. For the breast data, these resolutions were 2.0 and 8.0 μm/px (see Table 1). For the lung data, only the intermediate resolution of 2.0 μm/px showed potential value (see Table 3). In the HookNet models, 'hooking' from the context branch into the target branch took place at relative depths where the feature maps of both branches comprise the same resolution, which depends on the input resolutions. Considering the target resolution of 0.5 μm/px, we applied 'hooking' from depth 2 (the middle) of the context encoder and depth 0 (the end) of the context decoder into depth 4 (the start, or bottleneck) of the target decoder, for the context resolutions 2.0 μm/px and 8.0 μm/px, respectively.
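These relative depths follow directly from Eq. (1); a small sketch (function name hypothetical):

```python
import math

# Relative context depth at which hooking satisfies Eq. (1), given the
# target hook depth and the two input resolutions.
def context_hook_depth(d_target, r_target, r_context):
    return d_target - int(math.log2(r_context / r_target))

print(context_hook_depth(4, 0.5, 2.0))  # -> 2, as used above
print(context_hook_depth(4, 0.5, 8.0))  # -> 0, the end of the context decoder
```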
To the best of our knowledge, the model proposed by Gu et al. (2018), namely MRN, is the most recent model along the same line as HookNet. Therefore, we compared HookNet to MRN. However, HookNet differs from MRN by (1) using 'valid' instead of 'same' convolutions, (2) using an additional branch consisting of an encoder-decoder (which enables multi-loss models) instead of a branch with an encoder only, and (3) using a single up-sampling path via the decoder of the target branch instead of multiple independent up-samplings. We instantiated MRN with one extra encoder and used input sizes of 256 × 256 × 3. The convolutions in MRN use same padding, which results in a bigger output size compared to using valid convolutions, therefore allowing for more pixel examples in each output prediction. For this reason, and to allow MRN to be trained on a single GPU, we used a mini-batch size of 6 instead of 12.
All U-Net models and the HookNet model using a single loss (where λ = 1) were trained within approximately 2 days. HookNet trained with the additional contextual loss, as well as MRN, were trained within approximately 2.5 days. We argue that this increase in training time is due to the extra loss in HookNet and the larger size of the feature maps in MRN, which result from using 'same' padding. All training times were measured using a GeForce GTX 1080 Ti and 10 CPUs for parallel patch extraction and data augmentation.
Table 1
Performance of U-Net with different input resolutions on the Radboudumc test set of breast cancer tissue types. Performance is reported in terms of F1 score per tissue type: ductal carcinoma in situ (DCIS), invasive ductal carcinoma (IDC), invasive lobular carcinoma (ILC), benign epithelium (BE), Other, and Fat, as well as the overall score (macro F1) measured on all classes together.

Model   Resolution (μm/px)  DCIS  IDC   ILC   Benign  Other  Fat   Overall
U-Net   0.5                 0.47  0.55  0.85  0.75    0.95   0.99  0.76
U-Net   1.0                 0.67  0.69  0.79  0.87    0.98   1.00  0.83
U-Net   2.0                 0.79  0.83  0.79  0.84    0.98   1.00  0.87
U-Net   4.0                 0.83  0.85  0.63  0.73    0.96   1.00  0.83
U-Net   8.0                 0.86  0.81  0.20  0.66    0.96   1.00  0.75
Table 2
Performance of U-Net(0.5), the Multi-Resolution Network (MRN) from Gu et al. (2018), and HookNet on the Radboudumc test set of breast cancer tissue types. Performance is reported in terms of F1 score per tissue type: ductal carcinoma in situ (DCIS), invasive ductal carcinoma (IDC), invasive lobular carcinoma (ILC), benign epithelium (BE), Other, and Fat, as well as the overall score (macro F1) measured on all classes together.

Model    Target res.  Context res.  λ     DCIS  IDC   ILC   Benign  Other  Fat   Overall
U-Net    0.5          N/A           N/A   0.47  0.55  0.85  0.75    0.95   0.99  0.76
MRN      0.5          8.0           N/A   0.72  0.75  0.81  0.74    0.92   1.00  0.83
HookNet  0.5          2.0           1.0   0.62  0.75  0.82  0.82    0.98   1.00  0.83
HookNet  0.5          8.0           1.0   0.84  0.90  0.86  0.80    0.97   1.00  0.90
HookNet  0.5          8.0           0.5   0.83  0.87  0.88  0.76    0.98   1.00  0.89
HookNet  0.5          8.0           0.25  0.83  0.88  0.81  0.72    0.97   1.00  0.87
HookNet  0.5          8.0           0.75  0.84  0.89  0.91  0.84    0.98   1.00  0.91
5. Results
Quantitative performance on the breast dataset, in terms of F1 score for each considered class as well as an overall macro F1 (Haghighi et al., 2018), is reported in Table 1 for all U-Net models at each considered resolution. Quantitative performance for all models with target resolution 0.5 μm/px (i.e., U-Net(0.5), MRN and HookNet) is reported in Table 2. Likewise, for the lung dataset, quantitative performance is reported in Table 3 for all U-Net models at each considered resolution, and in Table 4 for all models with target resolution 0.5 μm/px (i.e., U-Net(0.5), MRN and HookNet).

Confusion matrices for U-Net and HookNet models on the breast and lung test sets are depicted in Figs. 4 and 5, respectively. Finally, visual results are shown for each class of breast and lung tissue in Figs. 6 and 7, respectively.
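For reference, the overall score is the unweighted mean of the per-class F1 scores; a minimal sketch with scikit-learn on flattened pixel labels (the arrays are dummy placeholders):

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 1, 2, 2])   # ground-truth pixel labels, flattened
y_pred = np.array([0, 1, 1, 2, 1])   # predicted pixel labels, flattened
print(f1_score(y_true, y_pred, average='macro'))
```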
5.1. Single-resolution models
Experimental results of single-resolution U-Net models on DCIS and ILC confirm our initial hypothesis, namely an increase in performance that correlates with an increase of context (microns per pixel) for DCIS (e.g., from F1 = 0.47 at 0.5 μm/px to F1 = 0.86 at 8.0 μm/px), and a completely opposite trend for ILC (e.g., from F1 = 0.85 at 0.5 μm/px to F1 = 0.20 at 8.0 μm/px), corroborating the need for a multi-resolution model.
Table 3
Performance of U-Net trained with different input resolutions on the TCGA test set of lung cancer tissue. Performance is reported in terms of F1 score per tissue type: tertiary lymphoid structures (TLS), germinal centers (GC), Tumor, and Other, as well as the overall score (macro F1) measured on all classes together.

Model   Resolution (μm/px)  TLS   GC    Tumor  Other  Overall
U-Net   0.5                 0.81  0.38  0.75   0.87   0.70
U-Net   1.0                 0.86  0.44  0.71   0.86   0.72
U-Net   2.0                 0.84  0.49  0.67   0.85   0.71
U-Net   4.0                 0.80  0.37  0.56   0.80   0.63
U-Net   8.0                 0.78  0.35  0.39   0.77   0.57
As expected, the lack of context causes confusion between DCIS and IDC in U-Net(0.5), where breast duct structures are not visible due to the limited field of view, whereas the lack of details causes confusion between ILC and IDC in U-Net(8.0) (see Fig. 4), where all loose tumor cells are interpreted as part of a single bulk.
The highest performances for IDC and benign breast epithelium were observed at relatively intermediate resolutions. The performance of segmenting fatty tissue is comparable for every model, and the performance of segmenting other tissue decreases when using relatively low resolutions (4.0 and 8.0 μm/px) or high resolution (0.5 μm/px).
For lung tissue, we observed an increase in performance that correlates with an increase in context and a decrease in resolution. This is mostly due to an increase in F1 score for GC in U-Net(2.0), similar to the increase observed for DCIS, whereas the lack of details causes confusion between Tumor and Other in U-Net(2.0), similar to what was observed for ILC in breast tissue.
5.2. Multi-resolution models
In breast tissue segmentation, the performance of HookNet strongly depends on which fields of view are combined. We obtained the best results, with an overall F1 score of 0.91, for HookNet(0.5, 8.0) with λ = 0.75, which substantially differs from HookNet(0.5, 2.0). HookNet(0.5, 8.0) shows an overall increase for all tissue types at the output resolution of 0.5 μm/px, except for a small decrease in performance for DCIS compared with U-Net trained with patches at resolution 8.0 μm/px, and improves the performance on IDC, mostly due to an improvement in ILC segmentation, which likely increases the precision of the model for IDC. Note that DCIS is the only class where the U-Net working at the lowest considered resolution gives the best performance (F1 = 0.86). However, that same U-Net has a dismal F1 score of 0.20 for ILC. HookNet(0.5, 8.0) processes the same low-resolution input, but increases the F1 score for ILC by 0.66 compared to U-Net(8.0), and at the same time increases the F1 score for DCIS by 0.37 compared to U-Net(0.5).
Fig. 4. Confusion matrices for the models U-Net(0.5), U-Net(8.0) and HookNet(0.5, 8.0, λ = 0.75), which were trained on the breast dataset.
Table 4
Performance of U-Net(0.5), the Multi-Resolution Network (MRN) from Gu et al. (2018), and HookNet on the TCGA test set of lung cancer tissue. Performance is reported in terms of F1 score per tissue type: tertiary lymphoid structures (TLS), germinal centers (GC), Tumor, and Other, as well as the overall score (macro F1) measured on all classes together.

Model    Target res.  Context res.  λ     TLS   GC    Tumor  Other  Overall
U-Net    0.5          N/A           N/A   0.81  0.38  0.75   0.87   0.70
MRN      0.5          2.0           N/A   0.83  0.40  0.71   0.88   0.71
HookNet  0.5          2.0           1.0   0.84  0.48  0.72   0.87   0.73
HookNet  0.5          2.0           0.5   0.85  0.48  0.68   0.85   0.72
HookNet  0.5          2.0           0.25  0.82  0.43  0.69   0.86   0.70
HookNet  0.5          2.0           0.75  0.85  0.45  0.71   0.87   0.72
Fig. 5. Confusion matrices for the models U-Net(0.5), U-Net(2.0) and HookNet(0.5, 2.0, λ = 1), which were trained on the lung dataset.
In general, HookNet(0.5, 8.0) improves the F1 score for all classes compared to U-Net(0.5) and U-Net(8.0), except for a small difference of 0.02 in F1 score for DCIS segmentation. As for the single U-Net models, all HookNet models perform comparably on the fatty tissue and other tissue classes, as can be observed in Fig. 6.
In lung tissue segmentation, the best HookNet (with λ = 1.0) outperforms U-Net(0.5) on the classes TLS and GC with an increase of 0.03 and 0.10 in F1 score, respectively, and at the same time shows a decrease in F1 score for Tumor by 0.01. The F1 scores for the Other class are the same for both models. A mixture of different models outperforms HookNet on the individual classes (U-Net(1.0) for TLS, U-Net(2.0) for GC, U-Net(0.5) for Tumor, and MRN for Other). However, HookNet achieves the highest overall F1 score compared to all the considered models.
We observed that HookNet, using the same fields of view as MRN, performs better than MRN with respect to the overall F1 score, for both breast and lung tissue segmentation. Finally, we observe that for breast tissue segmentation, HookNet(0.5, 8.0) performs best when giving more importance to the target branch (i.e., λ = 0.75), while for lung tissue segmentation the best F1 scores were obtained when ignoring the context loss (i.e., λ = 1).
To verify whether there is a significant difference between HookNet and the other models with the same target resolution (i.e., 0.5 μm/px), we calculated the F1 score per test slide. We applied the Wilcoxon test, which revealed that for the breast dataset, the differences between HookNet and U-Net (p-value = 0.004), and between HookNet and MRN (p-value = 0.001), are statistically significant. For the lung dataset, the differences between HookNet and U-Net (p-value = 0.442), and between HookNet and MRN (p-value = 0.719), are not statistically significant. These results suggest that HookNet substantially benefits from wide contextual information (e.g., 8.0 μm/px for the input resolution), whereas the added value of context may be less prominent, but still beneficial, when relevant contextual information is restricted (e.g., 2.0 μm/px for the input resolution).

Nonetheless, based on the improved F1 scores (see Table 4) and confusion scores (see Fig. 5) for TLS and GC, we argue that HookNet can reduce the confusion between classes that are subject to contextual information.
6. Discussion
The main outcome of this research is two-fold. The first outcome is a framework to effectively combine information from context and details in histopathology images. We have shown its effect in segmentation tasks, in comparison with other single-resolution approaches and with one recently presented multi-resolution approach. The presented framework takes MFMR patches as input and applies a series of convolutional and pooling layers, ensuring that feature maps are combined according to (1) the same spatial resolution, without the need for arbitrary up-scaling and interpolation, as done in Gu et al. (2018), but allowing a direct concatenation of feature maps from the context branch to the target branch; and (2) pixel-wise alignment, effectively combined with the use of valid convolutions, which mitigates the risk of artifacts in the output segmentation map. The optimal combination of fields of view used in the two branches of HookNet has been determined experimentally. We first tested single-resolution U-Net models and then combined the fields of view that showed the best performance in two critical aspects of the problem-specific segmentation task, namely segmentation of DCIS and ILC for the breast dataset, and Tumor and GC for the lung dataset. At the moment, no procedure exists to select the optimal combination of spatial resolutions a priori, and empirical case-based analysis is needed.
The second outcome consists of two models for multi-class
semantic segmentation in breast and lung cancer histopathology
samples stained with H&E. In both cases, we have included tu-
mor as one of the classes to segment, as well as other types of
tissue that can be present in the tumor tissue compartment, and
made a specific distinction between three breast cancer subtypes,
namely DCIS, IDC, and ILC. Although a specific set of classes in
breast and lung cancer tissue samples have been used as appli-
cations to show the potential of HookNet, presented methods are
general and extendable to an arbitrary number of classes, as well
as applicable to histopathology images of other organs. Qualitative
examples of segmentation output at whole-slide image level are
depicted in Fig. 8 , which shows the potential for using the outcome
of this paper in several applications. Segmentation of TLS and GC
in lung squamous cell carcinoma can be used to automate TLS
Fig. 6. Segmentation results on breast tissue shown for DCIS, IDC, ILC, Benign epithelium, Other, and Fat. HookNet results are shown for λ = 0.75. The last three rows focus on failure examples of HookNet.
detection in lung cancer histopathology images, which will allow
us to easily scale the analysis to a large number of cases, with
the aim of further investigating the prognostic and predictive value
of TLS count. At the same time, segmentation of tumor and other
tissue types allows to describe the morphology and tissue archi-
tecture of the tumor microenvironment, for example identifying
the region of the tumor bulk, or the interface between tumor and
stroma, an active research topic in immune-oncology, due to the
role of tumor-infiltrating lymphocytes (TILs), which have to be as-
sessed in the tumor-associated stroma ( Salgado et al., 2015 ) as well
as in the tumor bulk and at the invasive margin ( Galon et al., 2006;
2014 ). Furthermore, segmentation of both benign and malignant
epithelial cells in breast cancer can be used as the first step in an
automated pipeline for breast cancer grading, where the tumor re-
gion has to be identified to perform mitotic count , and regions of
both healthy and cancer epithelial cells have to be compared to
assess nuclear pleormophism .
In order to show the advantage of a multi-resolution approach
compared to a single-resolution model in semantic segmentation
of histopathology images, several design choices have been made
in this paper. Our future research will be focused on investigating
Fig. 7. Segmentation results on lung tissue for TLS and GC, Tumor, and Other. HookNet results are shown for λ = 1.0. The last two rows focus on failure examples of HookNet.
the general applicability and design of HookNet with respect
to the used constraints. First, U-Net was used as the base model
for HookNet branches. This choice was motivated by the effective-
ness and flexibility of the encoder-decoder U-Net model, as well
as the presence of skip connections. Other encoder-decoder mod-
els can be adopted to build a HookNet model. Second, inspired
and motivated by the multi-resolution nature of WSIs, we devel-
oped and solely applied HookNet to histopathology images. How-
ever, we argue that HookNet has the potential to be useful for any
application where a combination of context and details is essential
to produce an accurate segmentation map. Several applications of
HookNet can be found in medical imaging, but it has the potential
for being extended to natural images as well. Third, we showed that using two branches allows taking advantage of clear trends, such as the performance of single-resolution models on DCIS and ILC (see Fig. 1) in the breast cancer data. However, when focusing on the IDC class, we note that a single-resolution U-Net performs best at intermediate resolutions. This motivates further research into incorporating more branches, to include intermediate fields of view as well. Fourth, we limited HookNet, as well as the models used for comparison, to 50 million parameters, which allows model training using a single modern GPU with 11 GB of RAM. Introducing more branches will likely require a multi-GPU approach, allowing for experimenting with deeper/wider networks and speeding up inference time.
We compared HookNet in a single-loss (λ = 1) and in a multi-loss setup (λ = 0.75, 0.5, or 0.25). Our results showed that the multi-loss model, when giving more importance to the target branch (e.g., λ = 0.75), performs best for breast tissue segmentation, and that the single-loss model (e.g., λ = 1.0) scores best for lung tissue segmentation. Future work will focus on an extensive optimization search for the value of λ.
Fig. 8. WSI predictions for the different models. First row: WSI example with DCIS. Second row: WSI example with IDC. Third row: WSI example with ILC.
Finally, we reported results in terms of F1 scores for both the Radboudumc and TCGA datasets based on sparse manual annotations. Although this is a common approach to obtaining a large heterogeneous set of data in medical imaging, we observed that this approach limits the assessment of performance in the transition zones between different tissue types. Extending the evaluation to an additional set of densely annotated data is part of our future research, as is the effort of generating such manual annotations.
7. Conclusion
In this