CHAOS challenge - combined (CT-MR) Healthy
Abdominal Organ Segmentation
A. Emre Kavur a,∗, N. Sinem Gezer b, Mustafa Barış b, Sinem Aslan c,d, Pierre-Henri Conze e, Vladimir Groza f, Duc Duy Pham g,
Soumick Chatterjee h,i, Philipp Ernst h, Savaş Özkan j, Bora Baydar j, Dmitry Lachinov k, Shuo Han l, Josef Pauli g, Fabian Isensee m,
Matthias Perkonigg n, Rachana Sathish o, Ronnie Rajan p, Debdoot Sheet o, Gurbandurdy Dovletov g, Oliver Speck i, Andreas Nürnberger h,
Klaus H. Maier-Hein m, Gözde Bozdağı Akar j, Gözde Ünal q, Oğuz Dicle b, M. Alper Selver r,∗∗
aGraduate School of Natural and Applied Sciences, Dokuz Eylul University, Izmir, Turkey
bDepartment of Radiology, Faculty Of Medicine, Dokuz Eylul University, Izmir, Turkey
cCa’ Foscari University of Venice, ECLT and DAIS, Venice, Italy
dEge University, International Computer Institute, Izmir, Turkey
eIMT Atlantique, LaTIM UMR 1101, Brest, France
fMedian Technologies, Valbonne, France
gIntelligent Systems, Faculty of Engineering, University of Duisburg-Essen, Germany
hData and Knowledge Engineering Group, Otto von Guericke University, Magdeburg, Germany
iBiomedical Magnetic Resonance, Otto von Guericke University Magdeburg, Germany.
jDepartment of Electrical and Electronics Engineering, Middle East Technical University, Ankara, Turkey
kDepartment of Ophthalmology and Optometry, Medical Uni. of Vienna, Austria
lJohns Hopkins University, Baltimore, USA
mDivision of Medical Image Computing, German Cancer Research Center, Heidelberg, Germany
nCIR Lab Dept of Biomedical Imaging and Image-guided Therapy Medical Uni. of Vienna, Austria
oDepartment of Electrical Engineering, Indian Institute of Technology, Kharagpur, India
pSchool of Medical Science and Technology, Indian Institute of Technology, Kharagpur, India
qDepartment of Computer Engineering, İstanbul Technical University, İstanbul, Turkey
rDepartment of Electrical and Electronics Engineering, Dokuz Eylul University, Izmir, Turkey
ABSTRACT
Segmentation of abdominal organs has been a comprehensive, yet unresolved, research field for many years. In the last decade, intensive developments in deep learning (DL) introduced new state-of-the-art segmentation systems. Despite outperforming the overall accuracy of existing systems, the effects of DL model properties and parameters on the performance are hard to interpret. This makes comparative analysis a necessary tool towards interpretable studies and systems. Moreover, the performance of DL for emerging learning approaches such as cross-modality and multi-modal semantic segmentation tasks has rarely been discussed. In order to expand the knowledge on these topics, the CHAOS – Combined (CT-MR) Healthy Abdominal Organ Segmentation challenge was organized in conjunction with the IEEE International Symposium on Biomedical Imaging (ISBI), 2019, in Venice, Italy. Abdominal organ segmentation from routine acquisitions plays an important role in several clinical applications, such as pre-surgical planning or morphological and volumetric follow-ups for various diseases. These applications require a certain level of performance on a diverse set of metrics, such as the maximum symmetric surface distance (MSSD) to determine surgical error margins, or overlap errors for tracking size and shape differences. Previous abdomen-related challenges mainly focused on tumor/lesion detection and/or classification with a single modality. Conversely, CHAOS provides both abdominal CT and MR data from healthy subjects for single and multiple abdominal organ segmentation. Five different but complementary tasks were designed to analyze the capabilities of participating approaches from multiple perspectives. The results were investigated thoroughly and compared with manual annotations and interactive methods. The analysis shows that DL models for single modalities (CT / MR) can achieve reliable volumetric analysis performance (DICE: 0.98 ± 0.00 / 0.95 ± 0.01), but the best MSSD performance remains limited (21.89 ± 13.94 / 20.85 ± 10.63 mm). The performance of participating models decreases dramatically for cross-modality tasks (liver DICE: 0.88 ± 0.15, MSSD: 36.33 ± 21.97 mm). Despite contrary examples in other applications, multi-tasking DL models designed to segment all organs were observed to perform worse than organ-specific ones (performance drop of around 5%). Nevertheless, some of the successful models show better performance with their multi-organ versions. We conclude that the exploration of these pros and cons in both single vs. multi-organ and cross-modality segmentation is poised to have an impact on further research for developing effective algorithms that would support real-world clinical applications. Finally, having more than 1500 participants and receiving more than 550 submissions, another important contribution of this study is the analysis of shortcomings of challenge organization, such as the effects of multiple submissions and the peeking phenomenon.
∗Corresponding author: e-mail: emrekavur@gmail.com
∗∗Corresponding author: e-mail: alper.selver@deu.edu.tr
1. Introduction
In the last decade, medical imaging and image processing
benchmarks have become effective venues to compare perfor-
mance of different approaches in clinically important tasks (Ay-
ache and Duncan, 2016). These benchmarks have gained a par-
ticularly important role in the analysis of learning-based sys-
tems by enabling the use of common datasets for training and
testing (Simpson et al., 2019). Challenges that use these benchmarks bear a prominent role in reporting outcomes of the state-
of-the-art results in a structured way (Kozubek, 2016). In this
respect, the benchmarks establish standard datasets, evaluation
strategies, fusion possibilities (e.g. ensembles), and unresolved
difficulties related to the specific biomedical image processing
task(s) being tested (Menze et al., 2014). An extensive web-
site, grand-challenge.org (van Ginneken and Kerkstra, 2015),
has been designed for hosting the challenges related to medi-
cal image segmentation and currently includes around 200 chal-
lenges.
A comprehensive exploration of biomedical image analysis
challenges reveals that the construction of datasets, inter- and
intra-observer variations for ground truth generation as well as
evaluation criteria might prevent establishing the true potential
of such events (Reinke et al., 2018b). Suggestions, caveats,
and roadmaps are being provided by reviews (Maier-Hein et al.,
2018; Reinke et al., 2018a) to improve the challenges.
Considering the dominance of machine learning (ML) ap-
proaches, two main points are continuously being emphasized:
1) recognition of current roadblocks in applying ML to med-
ical imaging, 2) increasing the dialogue between radiologists
and data scientists (Prevedello et al., 2019). Accordingly, chal-
lenges are either continuously updated (Menze et al., 2014), re-
peated after some time (Staal et al., 2004), or new ones having
similar focuses are being organized to overcome the pitfalls and
shortcomings of existing ones.
Abdominal imaging is one of the important sub-fields of di-
agnostic radiology. It focuses on imaging the organs/structures
in the abdomen such as the liver, kidneys, spleen, bladder,
prostate, and pancreas by CT, MRI, ultrasonography, or any other
dedicated imaging modality. Emergencies that require treat-
ment or intervention such as acute liver failure, impaired kidney
function, and abdominal aortic aneurysm must be immediately
detected by abdominal imaging. It plays an important role in iden-
tifying various diseases during routine controls and follow-ups.
Therefore, studies and challenges in the segmentation of ab-
dominal organs/structures have always constituted an important
research field.
A detailed literature review about the challenges related to
abdominal organs (see Section 2) revealed that the existing
challenges in the field are dominated by CT scans and tu-
mor/lesion classification tasks. Up to now, there have only been
a few benchmarks containing abdominal MRI series (Tab. 1).
Although this situation was typical for the last decades, the
emerging technology of MRI makes it the preferred modality
for further and detailed analysis of the abdomen. The remark-
able developments in MRI technology in terms of resolution,
dynamic range, and speed enable joint analyses of these modal-
ities (Hirokawa et al., 2008).
To gauge the current state-of-the-art in automated abdomi-
nal segmentation and observe the performance of various ap-
proaches on different tasks such as cross-modality learning and
multi-modal segmentation, we organized the Combined (CT-
MR) Healthy Abdominal Organ Segmentation (CHAOS) chal-
lenge in conjunction with the IEEE International Symposium
on Biomedical Imaging (ISBI) in 2019. For this purpose, we
prepared and made available a unique dataset of CT and MR
scans from unpaired abdominal image series. A consensus-
based multiple expert annotation strategy was used to generate
ground truth. A subset of this dataset was provided to the par-
ticipants for training, and the remaining images were used to
test performance against the (hidden) manual delineations us-
ing various metrics. In this paper, we report both setup and the
results of this CHAOS benchmark as well as its outcomes.
The rest of the paper is organized as follows. A review of the
current challenges in abdominal organ segmentation is given in
Section 2 together with surveys on benchmark methods. Next, CHAOS datasets, setup, ground truth generation, and the tasks are presented in Section 3. Section 4 describes the evaluation strategy. Then, participating methods are comparatively summarized in Section 5. Section 6 presents the results, and Section 7 provides a discussion and concludes the paper.
2. Related Work
According to our literature analysis, currently, there exist 12
challenges focusing on abdominal organs (van Ginneken and
Kerkstra, 2015) (see Tab. 1). Being one of the pioneering chal-
lenges, SLIVER07 initialized the liver benchmarking (Heimann
et al., 2009; Van Ginneken et al., 2007). It provided a com-
parative study of a range of algorithms for liver segmentation
under several intentionally included difficulties such as patient
orientation variations or tumors and lesions. Its outcomes re-
ported a snapshot of the methods that were popular for medical
image analysis at that time. However, since then, abdomen-
related challenges mostly targeted disease and tumor detec-
tion rather than organ segmentation. In 2008, “3D Liver Tu-
mor Segmentation Challenge (LTSC08)” (Deng and Du, 2008)
was organized as the continuation of the SLIVER07 challenge
to segment liver tumors from abdomen CT scans. Similarly,
Shape 2014 and 2015 (Kistler et al., 2013) challenges focused
on liver segmentation from CT data. VISCERAL Anatomy 3
(Jimenez-del Toro et al., 2016) provided a unique challenge,
which was a very comprehensive platform for segmenting not
only upper-abdominal organs, but also various other organs
such as left/right lung, urinary bladder, and pancreas. “Multi-
Atlas Labeling Beyond the Cranial Vault - Workshop and Chal-
lenge” focused on multi-atlas segmentation with abdominal and
cervix clinically acquired CT scans (Landman et al., 2015).
Table 1. Overview of challenges that have upper abdomen data and task. (Other structures are not shown in the table.)
Challenge | Task(s) | Structure (Modality) | Organization and year
SLIVER07 (Van Ginneken et al., 2007) | Single model segmentation | Liver (CT) | MICCAI 2007, Australia
LTSC08 (Deng and Du, 2008) | Single model segmentation | Liver tumor (CT) | MICCAI 2008, USA
Shape 2014 (Kistler et al., 2013) | Building organ model | Liver (CT) | Delémont, Switzerland
Shape 2015 (Kistler et al., 2013) | Completing partial segmentation | Liver (CT) | Delémont, Switzerland
VISCERAL Anatomy 3 (Jimenez-del Toro et al., 2016) | Multi-model segmentation | Kidney, urinary bladder, gallbladder, spleen, liver, and pancreas (CT and MRI for all organs) | VISCERAL Consortium, 2014
Multi-Atlas Labeling Beyond the Cranial Vault (Landman et al., 2015) | Multi-atlas segmentation | Adrenal glands, aorta, esophagus, gall bladder, kidneys, liver, pancreas, splenic/portal veins, spleen, stomach, and vena cava (CT) | MICCAI 2015
LiTS (Bilic et al., 2019) | Single model segmentation | Liver and liver tumor (CT) | ISBI 2017, Australia; MICCAI 2017, Canada
Pancreatic Cancer Survival Prediction (Guinney et al., 2017) | Quantitative assessment of cancer | Pancreas (CT) | MICCAI 2018, Spain
MSD (Simpson et al., 2019) | Multi-model segmentation | Liver (CT), liver tumor (CT), spleen (CT), hepatic vessels in the liver (CT), pancreas and pancreas tumor (CT) | MICCAI 2018, Spain
KiTS19 (Weight et al., 2019) | Single model segmentation | Kidney and kidney tumor (CT) | MICCAI 2019, China
CHAOS | Multi-model segmentation | Liver, kidney(s), spleen (CT, MRI for all organs) | ISBI 2019, Italy
LiTS - Liver Tumor Segmentation Challenge (Bilic et al., 2019)
is another example that covers liver and liver tumor segmenta-
tion tasks in CT. Other similar challenges can be listed as Pan-
creatic Cancer Survival Prediction (Guinney et al., 2017), which
targets pancreas cancer tissues in CT scans; KiTS19 (Weight
et al., 2019) challenge, which provides CT data for kidney tu-
mor segmentation.
In 2018, Medical Segmentation Decathlon (MSD) (Simpson
et al., 2019) was organized by a joint team and provided a sub-
stantial challenge that contained many structures such as liver
parenchyma, hepatic vessels and tumors, spleen, brain tumors,
hippocampus, and lung tumors. The focus of the challenge was
not only to evaluate the performance for each structure, but to
observe the generalizability, translatability, and transferability
of a system. Thus, the main idea behind MSD was to under-
stand the key elements of DL systems that can work on many
tasks. To provide such a source, MSD included a wide range of
challenges including small and unbalanced sample sizes, vary-
ing object scales, and multi-class labels. The approach of MSD
underlines the ultimate goal of the challenges that is to pro-
vide extensive datasets on several different tasks, and evaluation
through a standardized analysis and validation process.
In this respect, a recent survey showed that another trend in
medical image segmentation is the development of more com-
prehensive computational anatomical models leading to multi-
organ related tasks rather than traditional organ and/or disease-
specific tasks (Cerrolaza et al., 2019). By incorporating inter-
organ relations into the process, multi-organ related tasks re-
quire a complete representation of the complex and flexible ab-
dominal anatomy. Thus, this emerging field requires new effi-
cient computational and machine learning models.
Inspired by the above-mentioned visionary studies, CHAOS
was organized to strengthen the field by aiming at objec-
tives that involve emerging ML concepts, including cross-
modality learning, and multi-modal segmentation. In this re-
spect, CHAOS focuses on segmenting multiple organs from un-
paired patient datasets acquired by two modalities: CT and MR
(including two different pulse sequences).
3. CHAOS Challenge
3.1. Data Information and Details
The CHAOS challenge dataset contains scans of 80 patients. Forty of them
(22 male, 18 female, ages between 18 and 63 with average
44.85±11.29) went through a single CT scan and 40 of them
(23 male, 17 female, ages between 18 and 76 with average
54.60±14.25) went through MR scans including 2 pulse se-
quences of the upper abdomen area. We present example im-
ages for CT and MR modalities in Fig.1. Both CT and MR
datasets include healthy abdomen organs without any patholog-
ical abnormalities (tumors, metastasis, and so on).
There are various clinical reasons for measurement of vol-
ume, size, and shape through the precise segmentation of
healthy abdominal organs. For instance, the liver volume is
affected by several diseases including congestive heart failure,
cancer, cirrhosis, infections, metabolic disorders, and congeni-
tal diseases. The dimensions of the liver may give clues about
the severity of the disease. The growth pattern of the liver and
its change in the treatment process also provide valuable in-
formation about the importance and prognosis of the disease.
For this reason, determining whether the liver is enlarged or
not, calculating its volume, and specifying related effects are
very important. Precise segmentation is also required to plan
liver transplant surgeries. For example, determining whether
a portion of the liver to be resected is sufficient for the recip-
ient patient and whether the remaining liver will be sufficient
for the donor is an important part of treatment decisions (Low
et al., 2008). Furthermore, the determination of the most suit-
able donors for living donated transplantation and pre-operative
planning needs accurate segmentation of the liver. For these
reasons, the objective was to evaluate the liver volume per-
fectly in the challenge, and accordingly, healthy patient livers
were studied. Besides, there are several medical reasons that
require the segmentation of not only the liver but also other
solid organs in the abdominal region. For example, the spleen
enlarges in cases of portal hypertension, infections and espe-
cially in lymphoproliferative diseases (Robertson et al., 2001).
Because of its amorphous structure, a 3-dimensional visualization, which requires segmentation, can provide much better information about the organ compared to a 2-dimensional image analysis (Joiner et al., 2015; Lamb et al., 2002; Linguraru et al., 2013). Furthermore, segmentation can be used for monitoring the cortex thickness of the kidneys and calculating cyst-to-parenchyma ratios as in polycystic kidney disease, and it is particularly valuable for volumetric monitoring of renal tumors (King et al., 2000).
The datasets for the CHAOS challenge were collected from
the Department of Radiology, Dokuz Eylul University Hospi-
tal, Izmir, Turkey. The scan protocols are briefly explained in
the following subsections. Further details and explanations are
available on the CHAOS website1. This study was approved by
the Institutional Review Board of Dokuz Eylul University.
1CHAOS data information: https://chaos.grand-challenge.org/Data/
3.1.1. CT Data Specifications
The CT volumes were acquired at the portal venous phase af-
ter contrast agent injection. In this phase, the liver parenchyma
is enhanced maximally through blood supply by the portal vein.
Portal veins are enhanced well but some enhancements also ex-
ist for hepatic veins. This phase is widely used for liver and
vessel segmentation, prior to surgery. Since the tasks related to
CT data only include liver segmentation, this set has only an-
notations for the liver. The details of the data are presented in
Tab.2 and a sample case is illustrated in Fig.1, left and Fig.2.
3.1.2. MRI Data Specifications
The MRI dataset includes two different sequences (T1 and
T2) for 40 patients. In total, there are 120 DICOM datasets from
T1-DUAL in-phase (40 datasets), oppose-phase (40 datasets),
and T2-SPIR (40 datasets). Each of these sets was acquired
from routine screening of the abdomen in the clinic. T1-DUAL
in-phase and oppose-phase images are registered. Therefore,
their ground truth is the same. On the other hand, T1 and T2
sequences are not registered. The datasets were acquired on a
1.5T Philips MRI, which produces 12-bit DICOM images. The
details of this dataset are given in Tab.2 and a sample case is
illustrated in Fig.1, middle and right, and Fig.3.
3.2. Aims and Tasks
The CHAOS challenge has three separate but related aims:
1. segmentation of the liver from CT scans,
2. segmentation of solid abdominal organs (liver, spleen, kid-
neys) from MRI sequences.
3. segmentation of organs from mixed (CT-MRI) datasets.
CHAOS provides different opportunities for segmentation al-
gorithm design to the participants through five individual tasks:
Task 1: Liver Segmentation (CT-MRI) focuses on using
a single system that can segment the liver from both CT and
multi-modal MRI (T1-DUAL and T2-SPIR sequences). This
corresponds to “cross-modality” learning, which is expected to
be used more frequently as the abilities of DL are improving
(Valindria et al., 2018).
Task 2: Liver Segmentation (CT) covers a regular segmen-
tation task, which can be considered relatively easy due to the
inclusion of only healthy livers aligned in the same direction
and patient position. On the other hand, the diffusion of contrast
agent to parenchyma and the enhancement of the inner vascular
tree create considerable difficulties.
Task 3: Liver Segmentation (MRI) has a similar objective
to Task 2, but targets multi-modal MRI data randomly collected
within the routine clinical workflow. The methods are expected
to work on both T1-DUAL (in-phase and oppose-phase) as well
as T2-SPIR MR sequences.
Task 4: Segmentation of abdominal organs (CT-MRI) is
similar to Task 1 with an extension to multiple organ segmen-
tation from MR. In this task, the interesting part is that only
the liver is annotated as ground truth in the CT datasets, but the
MRI datasets have four annotated abdominal organs.
Fig. 1. Example slices from CHAOS CT, MR (T1-DUAL in-phase) and MR (T2-SPIR) datasets (liver:red, right kidney:dark blue, left kidney:light blue
and spleen:yellow).
Fig. 2. 3D visualization of the liver from the CHAOS CT dataset (case 35).
Fig. 3. 3D visualization of the liver (red), right kidney (dark blue), left kidney (light blue) and spleen (yellow) from the CHAOS MR dataset (case 40).
Table 2. Statistics about the CHAOS CT and MRI datasets.
Specification | CT | MR
Number of patients (Train + Test) | 20 + 20 | 20 + 20
Number of sets (Train + Test) | 20 + 20 | 60 + 60*
In-plane spatial resolution | 512 × 512 | 256 × 256
Number of axial slices in each examination [min-max] | [78 - 294] | [26 - 50]
Average axial slice number | 160 | 32 × 3*
Total axial slice number | 6407 | 3868 × 3*
X spacing (mm/voxel), left-right [min-max] | [0.54 - 0.79] | [0.72 - 2.03]
Y spacing (mm/voxel), anterior-posterior [min-max] | [0.54 - 0.79] | [0.72 - 2.03]
Slice thickness (mm) [min-max] | [2.0 - 3.2] | [4.4 - 8.0]
* MRI sets are collected from 3 different pulse sequences. For each patient, T1-DUAL registered in-phase and oppose-phase and T2-SPIR MRI data are acquired.
In other words, the model input is the same (i.e. a 2D slice image or a 3D volume), but the number of outputs is different for CT and MR.
Such a task is added to the challenge because, in the routine
clinical workflow, the aim of acquisition can vary: it can be for
multiple organs or a single one. When the scan is performed
for a single organ, the remaining organs might not be acquired
completely. Thus, a model, which would be used for abdomi-
nal organ segmentation in daily workflow, should handle these
varying conditions and output types.
Task 5: Segmentation of abdominal organs (MRI) is the
same as Task 3 but extended to four abdominal organs.
Here, it is important to point out that Task 1 can be seen as a
union of Tasks 2 and 3. Similarly, Task 4 can be seen as a union
of Tasks 1/2 and 5. In that case, the teams might externally di-
vide the datasets into two (i.e. CT and MR originated) and feed
them separately to the systems they have developed for Tasks 1
and 4. However, the main aim of these tasks is obtaining sys-
tems that can handle variations of different datasets and better
fit to the clinical workflow. Accordingly, for Tasks 1 and 4, a
fusion of individual models obtained from different modalities
(i.e. two models, one working on CT and the other on MRI) is
not valid. In more detail, it is not allowed to combine systems
that are specifically set for a single modality and operate com-
pletely independently. Instead, novel model designs and better
training strategies are expected to handle the challenges asso-
ciated with these tasks. Alternatively, for Task 4, the fusion of
individual modality-specific models can be used if a shared pre-
processing block detects modality type and processes the output
by different sub-systems. However, this is not valid for Task 1,
which specifically aims at cross-modality training. Besides, the
fusion of individual models for MRI sequences (T1-DUAL and
T2-SPIR) is allowed in all MRI-included tasks due to the lower
spatial dimension of the MR scans. More details about the tasks
are available on the CHAOS challenge website.2 3
3.3. Annotations for reference segmentation
All 2D slices were labeled manually by three different ra-
diology experts who have 10, 12, and 28 years of experience,
respectively. The final shapes of the reference segmentations
were decided by majority voting. Also, in some extraordinary
situations such as when inferior vena cava (IVC) is accepted
as a part of the liver, experts have made joint decisions. In
CHAOS, voxels that belong to IVC were excluded unless they
are not completely inside the liver. Although this handcrafted
annotation process has taken a considerable amount of time,
it was carried out to create a consistent and consensus-based
ground truth image series.
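For illustration, a voxel-wise majority vote over the three expert masks can be sketched as follows (a minimal Python example; the function name and interface are illustrative, and the actual CHAOS reference additionally reflects joint expert decisions for special cases such as the IVC):

import numpy as np

def majority_vote(masks):
    """Voxel-wise majority vote over binary expert masks of identical shape.

    A voxel is labeled foreground if more than half of the annotators marked
    it as foreground (with three annotators: at least two votes).
    """
    votes = np.sum(np.stack([m.astype(np.uint8) for m in masks]), axis=0)
    return votes > (len(masks) / 2)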
3.4. Challenge Setup and Distribution of the Data
Both CT and MRI datasets were divided into 20 sets for train-
ing and 20 sets for testing. When dividing the sets into training
and testing, attention was paid to the fact that the cases in both
sets contain similar features (resolution, slice thickness, age of
2CHAOS description: https://chaos.grand-challenge.org/
3CHAOS FAQ: https://chaos.grand-challenge.org/News and FAQ/
patients) as stratification criteria. We provided the CHAOS participants with training data including ground truth labels, and test data containing only the original images. To provide sufficient data
that contains enough variability, the datasets in the training data
were selected to represent all the difficulties that are observed
on the whole database, such as varying Hounsfield range and
non-homogeneous parenchyma texture of the liver due to the
injection of contrast media in CT images, sudden changes in
planar view, and the effect of bias field in MR images.
The images were distributed as DICOM files to present the
data in its original form. The only modification was remov-
ing patient-related information for anonymization. The ground
truth was also presented as image series to match the original
format. CHAOS data can be accessed with its DOI number via
the zenodo.org webpage under CC-BY-SA 4.0 license (Kavur
et al., 2019). One of the important aims of the challenges is
to provide data for long-term academic studies. We expect that
this data will be used not only for the CHAOS challenge but
also for other scientific studies such as cross-modality work or
medical image synthesis from different modalities.
4. Evaluation
4.1. Metrics
Since the outcomes of medical image segmentation are used
for various clinical procedures, using a single metric for 3D
segmentation evaluation is not a proper approach to ensure ac-
ceptable results for all requirements (Maier-Hein et al., 2018;
Yeghiazaryan and Voiculescu, 2015). Thus, in the CHAOS
challenge, four metrics were combined. The metrics were cho-
sen among the most frequently used ones in previous chal-
lenges (Maier-Hein et al., 2018). Their purpose was to analyze
results in terms of overlapping, volumetric, and spatial differ-
ences between a solution and the ground truth. Distance measures were converted to millimeters using an affine transform matrix computed from the DICOM metadata attributes (pixel spacing, image position (patient), and image orientation (patient)).
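A minimal sketch of how such a voxel-to-millimeter mapping can be built from the standard DICOM geometry attributes is given below (Python with pydicom and NumPy; the function name is illustrative, and a complete volume affine would additionally derive the slice axis from the positions of consecutive slices):

import numpy as np
import pydicom

def slice_affine(ds):
    """Build a 4x4 affine mapping (column, row, 0, 1) indices of one DICOM
    slice to patient coordinates in millimeters."""
    # e.g. ds = pydicom.dcmread("ct_slice.dcm")
    row_cos = np.asarray(ds.ImageOrientationPatient[:3], dtype=float)  # direction of increasing column index
    col_cos = np.asarray(ds.ImageOrientationPatient[3:], dtype=float)  # direction of increasing row index
    row_spacing, col_spacing = (float(v) for v in ds.PixelSpacing)     # spacing between rows / between columns (mm)
    origin = np.asarray(ds.ImagePositionPatient, dtype=float)          # position of the first voxel (mm)

    affine = np.eye(4)
    affine[:3, 0] = row_cos * col_spacing   # step for one column index
    affine[:3, 1] = col_cos * row_spacing   # step for one row index
    affine[:3, 3] = origin
    return affine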
Let us assume that S represents the set of voxels in a segmentation result, and G, the set of voxels in the ground truth. The utilized metrics are as follows:
1. DICE coefficient (DICE) is calculated as 2|S ∩ G| / (|S| + |G|), where |·| denotes cardinality (the larger, the better).
2. Relative absolute volume difference (RAVD) compares two volumes in percent. RAVD = (abs(|S| − |G|) / |G|) × 100, where 'abs' denotes the absolute value (the smaller, the better).
3. Average symmetric surface distance (ASSD) is the average Hausdorff distance between border voxels in S and G. The unit of this metric is millimeters (the smaller, the better).
4. Maximum symmetric surface distance (MSSD) is the maximum Hausdorff distance between border voxels in S and G. The unit of this metric is millimeters (the smaller, the better).
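Purely as an illustration of the definitions above (the official evaluation code is linked in Section 4.2), a minimal NumPy/SciPy sketch for binary masks could look as follows; function names are illustrative and edge cases such as empty masks are not handled:

import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice(S, G):
    """DICE = 2|S and G| / (|S| + |G|) for boolean arrays S and G."""
    return 2.0 * np.logical_and(S, G).sum() / (S.sum() + G.sum())

def ravd(S, G):
    """Relative absolute volume difference in percent."""
    return abs(float(S.sum()) - float(G.sum())) / float(G.sum()) * 100.0

def _border(mask):
    # Border voxels: the mask minus its erosion.
    return np.logical_xor(mask, binary_erosion(mask))

def assd_mssd(S, G, spacing):
    """Average and maximum symmetric surface distance in mm.

    `spacing` is the voxel size per axis in mm, e.g. (3.0, 0.7, 0.7).
    """
    dist_to_G = distance_transform_edt(~_border(G), sampling=spacing)
    dist_to_S = distance_transform_edt(~_border(S), sampling=spacing)
    d_sg = dist_to_G[_border(S)]   # distances from S's border to G's border
    d_gs = dist_to_S[_border(G)]   # distances from G's border to S's border
    assd = (d_sg.sum() + d_gs.sum()) / (d_sg.size + d_gs.size)
    mssd = max(d_sg.max(), d_gs.max())
    return assd, mssd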
4.2. Scoring System
In the literature, there are mainly two ways of ranking re-
sults via multiple metrics. One way is ordering the results by
metrics’ statistical significance with respect to all results. An-
other way is converting the metric outputs to the same scale
and averaging all (Langville and Meyer, 2013). In CHAOS,
we adopted the second approach. Values coming from each
metric have been transformed to span the interval [0,100] so
that higher values correspond to better segmentation. For this
transformation, it was reasonable to apply thresholds in order
to cut off unacceptable results and increase the sensitivity of
the corresponding metric. We are aware of the fact that deci-
sions on metrics and thresholds have a very critical impact on
ranking (Maier-Hein et al., 2018). Therefore, instead of deter-
mining the threshold in an ad-hoc manner, we used intra- and
inter-annotator scores obtained from the experts, who created
the ground truth.
The radiologists repeated the annotation process of the five
abdomen scans for both CT and MRI (i.e. 10 datasets in total)
two times to enable intra-annotator variability analysis. These
reference masks were used for the calculation of the challenge
metrics in a pair-wise manner. In Tab.3, all metrics were calcu-
lated among repeatedly labeled patient sets for each annotator
individually. The average is used to observe the intra-annotator
variability. In Tab. 4, we ran all metrics among patient sets that were annotated by different experts. Their averages are given in
the table.
Table 3. Metrics between two ground truth masks generated by the same annotators (A1, A2, A3) over 5 CT and 5 MRI sets in order to observe intra-annotator variability.
Metric | A1 | A2 | A3
CT - DICE | 0.979 ± 0.013 | 0.982 ± 0.011 | 0.971 ± 0.019
CT - RAVD (%) | 0.358 ± 0.133 | 0.339 ± 0.105 | 0.344 ± 0.112
CT - ASSD (mm) | 0.289 ± 0.108 | 0.257 ± 0.102 | 0.243 ± 0.113
CT - MSSD (mm) | 5.783 ± 2.154 | 5.756 ± 2.098 | 5.357 ± 2.127
MRI - DICE | 0.968 ± 0.035 | 0.976 ± 0.076 | 0.969 ± 0.192
MRI - RAVD (%) | 0.438 ± 0.191 | 0.408 ± 0.312 | 0.472 ± 0.394
MRI - ASSD (mm) | 0.464 ± 0.155 | 0.423 ± 0.440 | 0.412 ± 0.421
MRI - MSSD (mm) | 6.113 ± 2.961 | 6.147 ± 5.903 | 6.057 ± 5.918
Table 4. Metrics between two ground truth masks generated between annotator pairs (A1 and A2, A1 and A3, A2 and A3) over all patient sets in order to observe inter-annotator variability.
Metric | A1 and A2 | A1 and A3 | A2 and A3
DICE | 0.952 ± 0.098 | 0.949 ± 0.092 | 0.961 ± 0.091
RAVD (%) | 1.525 ± 0.125 | 1.569 ± 0.118 | 1.465 ± 0.119
ASSD (mm) | 1.622 ± 0.961 | 1.564 ± 0.989 | 1.492 ± 0.9574
MSSD (mm) | 9.174 ± 4.487 | 9.028 ± 0.428 | 8.877 ± 0.421
The differences between intra- and inter-annotator variability show the amount of performance change when the same annotation process is repeated by the same expert at a different time
or by another expert, respectively. According to Tables 3 and
4, the amount of change for inter- and intra-annotator cases de-
pends on the chosen metric and modality type. Regarding the
modality type, the small spacing and inter-slice distance of CT
allow a narrower range compared to MRI. Regarding the dependence on the chosen metric, for DICE, for example, the change between intra- and inter-annotator variability is relatively small.
On the other hand, the changes in other metrics are observed
to be higher. Based on these analyses, the thresholds are de-
termined by discussions among physicians and computer scien-
tists. As a result, the thresholds were determined as given in
Tab. 5. (The effects of thresholds on ranking stability and robustness are discussed in Section 6.7.)
Table 5. Summary of the metrics and thresholds. ∆ represents the longest possible distance in the 3D volume.
Metric name | Best value | Worst value | Threshold
DICE | 1 | 0 | DICE > 0.8
RAVD | 0% | ∞ | RAVD < 5%
ASSD | 0 mm | ∆ | ASSD < 15 mm
MSSD | 0 mm | ∆ | MSSD < 60 mm
The metric values outside the threshold range get zero points.
The values within the range are mapped to the interval [0,100].
Then, the scores of each case in the test data are calculated
as the mean of the four scores. The missing cases (sets that
do not have segmentation results) get zero points and these
points are included in the final score calculation. The aver-
age of the scores across all test cases determines the overall
score of the team for the specified task. The code for all met-
rics (in MATLAB, Python, and Julia) is available at https://github.com/emrekavur/CHAOS-evaluation. Also, more
details about the metrics, the CHAOS scoring system, and a
mini-experiment that compares sensitivities of different metrics
to distorted segmentations are provided on the same website.
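For illustration only, the threshold-and-rescale step described above can be sketched as follows (a hedged Python example assuming a linear mapping between the best value and the threshold; the exact mapping in the official scoring code may differ in detail):

def metric_to_score(value, best, threshold):
    """Map a metric value to [0, 100]; values beyond the threshold score 0."""
    frac = abs(value - best) / abs(threshold - best)
    return max(0.0, 100.0 * (1.0 - frac))

def case_score(dice, ravd, assd, mssd):
    """Per-case score: mean of the four per-metric scores (thresholds from Tab. 5).

    Missing cases receive 0 points before the per-task average is taken.
    """
    scores = [
        metric_to_score(dice, best=1.0, threshold=0.8),   # DICE > 0.8
        metric_to_score(ravd, best=0.0, threshold=5.0),   # RAVD < 5 %
        metric_to_score(assd, best=0.0, threshold=15.0),  # ASSD < 15 mm
        metric_to_score(mssd, best=0.0, threshold=60.0),  # MSSD < 60 mm
    ]
    return sum(scores) / len(scores)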
5. Participating Methods
In this section, we present the majority of the results from the
conference participants and the best two of the post-conference
results collected among the online submissions. To be specific,
METU MMLAB and nnU-Net results belong to online submis-
sions while others are from the conference session. Statistics
about submission numbers are presented in Tab.6. Each method
is assigned a unique color code as shown in the figures and
tables. The majority of the applied methods (i.e. all except
IITKGP-KLIV) used variations of U-Net, a Convolutional Neural Network (CNN) architecture that was first proposed by Ronneberger et al. (2015) for segmentation of biomedical images. This is typical, as the architecture dominates most recent DL-based segmentation studies, even in the presence of limited annotated data, which is a common scenario for biomedical image applications.
Among all studies, two rely on ensembles (i.e. MedianCHAOS and nnU-Net), which use multiple models and combine their results.
Table 6. CHAOS challenge submission statistics for on-site and online sessions (between 11 April 2019 and 1 October 2020).
Submission numbers | Task 1 | Task 2 | Task 3 | Task 4 | Task 5
On-site | 5 | 14 | 7 | 4 | 5
Online | 30 | 379 | 178 | 29 | 187
Maximum number of submissions by one team (on-site) | 1 | 5 | 1 | 1 | 1
Maximum number of submissions by one team (online) | 3 | 12 | 10 | 5 | 9
The following paragraphs summarize the participants' methods. Brief comparisons in terms of methodological details and training strategy are given in Tab. 7. Pre-processing, post-processing, and data augmentation strategies are provided in Tab. 8.
OvGUMEMoRIAL: A modified version of Attention U-
Net proposed in (Abraham and Khan, 2019) is used. Differently
from the original U-Net architecture (Ronneberger et al., 2015),
in Attention U-Net (Abraham and Khan, 2019), soft attention
gates are used, a multiscaled input image pyramid is employed
for better feature representation, and Tversky loss is computed
for the four different scaled levels. The modification adopted by
the OvGUMEMoRIAL team is that they employed parametric
ReLU activation function instead of ReLU, where an extra pa-
rameter, i.e., coefficient of leakage, is learned during training.
The ADAM optimizer is used; training is accomplished by 120
epochs with a batch size of 256.
ISDUE: The proposed architecture is constructed from three main modules, namely 1) a convolutional autoencoder network composed of a prior encoder f_enc^p and a decoder g_dec; 2) a segmentation hourglass network composed of an imitating encoder f_enc^i and the decoder g_dec; 3) a U-Net module, h_unet, which is used to enhance the decoder g_dec by guiding the decoding process for better localization capabilities. The segmentation networks, i.e. the U-Net module and the hourglass network module, are optimized separately using the DICE loss and regularized by L_sc with a regularization weight of 0.001. The autoencoder is optimized separately using the DICE loss. The ADAM optimizer is used with an initial learning rate of 0.001, a batch size of 1, and 2400 iterations to train each model. Data augmentation is performed by applying random translation and rotation operations during training.
Lachinov: The proposed model is based on the 3D U-Net
architecture, with skip connections between contracting and ex-
panding paths and exponentially growing number of channels
across consecutive spatial resolution levels. The encoding path
is constructed by a residual network which provides efficient
training. Group normalization (Wu and He, 2018) is adopted
instead of the batch normalization (Ioffe and Szegedy, 2015),
by assigning the number of groups to 4. Data augmentation is
applied by performing random mirroring of the first two axes
of the cropped regions which is followed by random 90 degrees
rotation along the last axis and intensity shift with contrast aug-
mentations.
IITKGP-KLIV: In order to accomplish multi-modality
segmentation using a single framework, a multi-task adversarial
learning strategy is employed to train a base segmentation net-
work SUMNet (Nandamuri et al., 2019) with batch normaliza-
tion. To perform adversarial learning, two auxiliary classifiers,
namely C1 and C2, and a discriminator network, i.e. D, are
used. C1 is trained by the input from the encoder part of SUM-
Net which provides modality-specific features. A C2 classifier
is used to predict the class labels for the selected segmentation
maps. The segmentation network and classifier C2 are trained
using cross-entropy loss while the discriminator D and auxil-
iary classifier C1 are trained by binary cross-entropy loss. The
ADAM optimizer is used for optimization. The input data to
the network is the combination of all four modalities, i.e. CT,
MRI T1-DUAL in-phase, and oppose-phase as well as MRI T2-
SPIR.
METU MMLAB: This model is also designed as a varia-
tion of U-Net. In addition, a Conditional Adversarial Network
(CAN) is introduced in the proposed model. Batch Normaliza-
tion is performed before convolution. In this way, vanishing
gradients are prevented and selectivity is increased. Moreover,
parametric ReLU is employed to preserve the negative values
using a trainable leakage parameter. In order to improve the per-
formance around the edges, a CAN is employed during train-
ing (not as a post-process operation). This introduces a new
loss function to the system which regularizes the parameters
for sharper edge responses. Normalization of each CT image
is performed for pre-processing and 3D connected component
analysis is utilized for post-processing.
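As an aside, the connected-component post-processing reported by several teams (see Tab. 8) is typically some variant of keeping the largest foreground component after thresholding; a generic sketch (not any team's actual code) is shown below:

import numpy as np
from scipy.ndimage import label

def largest_component(probability_map, threshold=0.5):
    """Threshold a probability map and keep only the largest 3D connected component."""
    mask = probability_map > threshold
    labeled, num_components = label(mask)
    if num_components == 0:
        return mask                      # nothing segmented, return the empty mask
    sizes = np.bincount(labeled.ravel())
    sizes[0] = 0                         # ignore the background label
    return labeled == np.argmax(sizes)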
PKDIA: The team proposed an approach based on con-
ditional generative adversarial networks where the generator is
constructed by cascaded partially pre-trained encoder-decoder
networks (Conze et al., 2020b) extending the standard U-Net
(Ronneberger et al., 2015) architecture. More specifically, first,
the standard U-Net encoder part is exchanged for a deeper net-
work, i.e. VGG-19 by omitting the top layers. Differently from
the standard U-Net (Ronneberger et al., 2015), 1) 64 channels
(32 channels for standard U-Net) are generated by the first con-
volutional layer; 2) after each max-pooling operation, the num-
ber of channels doubles until it reaches 512 (256 for standard U-
Net); 3) second max-pooling operation is followed by 4 consec-
utive convolutional layers instead of 2. For training, the ADAM
optimizer with a learning rate of 10^-5 is used. The fuzzy DICE
score is employed as the loss function.
MedianCHAOS: An averaged ensemble of five different networks is used. The first one is the DualTail-
Net architecture that is composed of an encoder, central block,
and 2 dependent decoders. While performing downsampling by
9
max-pooling operation, the max-pooling indices are saved for
each feature map to be used during the upsampling operation.
The decoder is composed of two branches: one that consists of
four blocks and starts from the central block of the U-net ar-
chitecture, and another one that consists of 3 blocks and starts
from the last encoder block. These two branches are processed
in parallel where the corresponding feature maps are concate-
nated after each upsampling operation. The decoder is followed
by a 1 ×1 convolution and sigmoid activation function which
provides a binary segmentation map at the output.
The other four networks are U-Net architecture variants,
i.e. TernausNet (U-Net with VGG11 backbone (Iglovikov and
Shvets, 2018)), LinkNet34 (Shvets et al., 2018), and two net-
works with ResNet-50 and SE-ResNet50 backbones. The latter two both use encoders pretrained on ImageNet and decoders that consist of convolution, ReLU, and transposed convolutions with stride 2. The two best final submissions were the averaged ensem-
bles of predictions obtained by these five networks. The train-
ing process for each network was performed with the ADAM
optimizer. DualTail-Net and LinkNet34 were trained with soft
DICE loss and the other three networks were trained with
the combined loss: 0.5*soft DICE +0.5*BCE (binary cross-
entropy). No additional post-processing was performed.
Mountain: A 3D network architecture modified from the
U-Net in (Han et al., 2019) is used. Differently from U-Net in
(Ronneberger et al., 2015), in (Han et al., 2019) a pre-activation
residual block in each scale level is used at the encoder part;
instead of max pooling, convolutions with stride 2 are employed to reduce the spatial size; and instead of batch normalization,
instance normalization (Ulyanov et al., 2017) is used since in-
stance normalization is invariant to linear changes in the inten-
sity of each individual image. Finally, it sums up the outputs of
all levels in the decoder as the final output to encourage conver-
gence. Two networks adopting the aforementioned architecture
with a different number of channels and levels are used here.
The first network, NET1, is used to locate an organ such as the
liver. It outputs a mask of the organ to crop out the region of
interest to reduce the spatial size of the input to the second net-
work, NET2. The output of NET2 is used as the final segmenta-
tion of this organ. The ADAM optimizer is used with the initial
learning rate of 1 × 10^-3, β1 = 0.9, β2 = 0.999, and ε = 1 × 10^-8.
DICE coefficient was used as the loss function. The batch size
was set to 1. Random rotation, scaling, and elastic deformation
were used for data augmentation during training.
CIR MPerkonigg: In order to train the network jointly for
all modalities, the IVD-Net architecture of (Dolz et al., 2018) is
employed. It follows the structure of U-Net (Ronneberger et al.,
2015) with a number of modifications listed as follows:
1) Dense connections between encoder path of IVD-Net are
not used since no improvement is obtained with that scheme,
2) Not all images are used as input to the network during
training.
Residual convolutional blocks (He et al., 2016) are used.
Data augmentation is performed by accomplishing affine trans-
formations, elastic transformations in 2D, histogram shifting,
flipping, and Gaussian noise addition. In addition, Modality
Dropout (Li et al., 2016) is used as the regularization technique
where modalities are dropped with a certain probability when
the training is performed using multiple modalities which helps
decrease overfitting on certain modalities. Training is done by
using the ADAM optimizer with a learning rate of 0.001 for 75
epochs.
nnU-Net: The nnU-Net team participated in the challenge
with an internal variant of nnU-Net (Isensee et al., 2019), which
is the winner of Medical Segmentation Decathlon (MSD) in
2018, (Simpson et al., 2019). They have made submissions for
Task 3 and Task 5. These tasks need to process T1-DUAL in-
phase and oppose-phase images as well as T2-SPIR images.
While the T1-DUAL images are registered and can be used
as separate color channel inputs, the team chose not to do so
because this would have required substantial modification to
nnU-Net (2 input modalities for T1-DUAL, 1 input modality
for T2-SPIR). Instead, T1-DUAL in-phase and oppose-phase
were treated as separate training examples, resulting in a total
of 60 training examples for the aforementioned tasks.
No external data was used. Task 3 is a subset of Task 5,
so training was done only once and the predictions for Task
3 were generated by isolating the liver label. The submitted
predictions are the result of an ensemble of five 3D U-Nets (“3d fullres” configuration of nnU-Net). The five models originate from cross-validation on the training cases. Furthermore,
since only one prediction is accepted for both T1-DUAL image
types, an ensemble of the predictions of T1-DUAL in-phase and
oppose-phase was used.
6. Results
The training dataset was published approximately three
months before the on-site session. The test dataset was given 24
hours before the challenge session. The submissions were eval-
uated during the conference, and the winners were announced.
After the on-site session, training and test datasets were pub-
lished on the zenodo.org website (Kavur et al., 2019) and the
online submission system was activated on the challenge web-
site.
To compare the automatic DL methods with semi-automatic
ones, interactive methods including both traditional iterative
models and more recent techniques were employed from our
previous work (Kavur et al., 2020). In this respect, we report the
results and discuss the accuracy and repeatability of emerging
automatic DL algorithms with those of well-established interac-
tive methods, which are applied by a team of imaging scientists
and radiologists through two dedicated viewers: Slicer (Kikinis
et al., 2014) and exploreDICOM (Fischer et al., 2010).
There exist two separate leaderboards at the challenge web-
site, one for the conference session4 and another for post-conference online submissions5. Detailed metric values and
converted scores are presented in Tab.9. Box plots of all results
for each task are presented separately in Fig.4. Also, scores on
each testing case are shown in Fig. 5 for all tasks.
4https://chaos.grand-challenge.org/Results CHAOS/
5https://chaos.grand-challenge.org/evaluation/results/
Table 7. Brief comparison of participating methods
Team Details of the method Training strategy
OvGUMEMoRIAL
(P. Ernst, S. Chatterjee, O. Speck, A. Nürnberger)
•Modified Attention 2D U-Net (Abraham and Khan, 2019), employing soft atten-
tion gates and multiscaled input image pyramid for better feature representation is
used.
•Parametric ReLU activation is used instead of ReLU, where an extra parameter,
i.e. coefficient of leakage, is learned during training.
•Tversky loss is computed for the four different scaled levels.
•The ADAM optimizer is used, training is accomplished by 120
epochs with a batch size of 256.
ISDUE
(D. D. Pham, G. Dovletov,
J. Pauli)
•The proposed architecture consists of three main modules:
i. Autoencoder net composed of a prior encoder f_enc^p and a decoder g_dec;
ii. Hourglass net composed of an imitating encoder f_enc^i and the decoder g_dec;
iii. 2D U-Net module, i.e. h_unet, which is used to enhance the decoder g_dec by guiding the decoding process for better localization capabilities.
•The segmentation networks are optimized separately using the DICE loss and regularized by L_sc with a weight of λ = 0.001.
•The autoencoder is optimized separately using DICE loss.
•The ADAM optimizer with an initial learning rate of 0.001, and 2400 iterations are performed to train each model.
Lachinov
(D. Lachinov)
•3D U-Net, with skip connections between contracting/expanding paths and
exponentially growing number of channels across the consecutive resolution levels
(Lachinov, 2019).
•The encoding path is constructed by a residual network for efficient training.
•Group normalization (Wu and He, 2018) is adopted instead of batch (Ioffe and
Szegedy, 2015) (# of groups =4).
•Pixel shuffle is used as an upsampling operator
•The network was trained with ADAM optimizer with learning
rate 0.001 and decaying with a rate of 0.1 at the 7th and 9th epoch.
•The network is trained with batch size 6 for 10 epochs. Each
epoch has 3200 iterations in it.
•The loss function employed is DICE loss.
IITKGP-KLIV
(R. Sathish, R. Rajan, D. Sheet)
•To achieve multi-modality segmentation using a single framework, a multi-task
adversarial learning strategy is employed to train a base segmentation network 2D
SUMNet (Nandamuri et al., 2019) with batch normalization.
•Adversarial learning is performed by two auxiliary classifiers, namely C1 and C2,
and a discriminator network D.
•The segmentation network and C2 are trained using cross-
entropy loss while the discriminator D and auxiliary classifier C1
are trained by binary cross-entropy loss.
•The ADAM optimizer. Input is the combination of all four
modalities, i.e. CT, MRI T1 DUAL In-phase and Oppose-phase
MRI T2 SPIR.
METU MMLAB
(S. Özkan, B. Baydar, G. B. Akar)
•A 2D U-Net variation and a Conditional Adversarial Network (CAN) is introduced.
•Batch Normalization is performed before convolution to prevent vanishing
gradients and increase selectivity.
•Parametric ReLU to preserve negative values using a trainable leakage parameter.
•To improve the performance around the edges, a CAN is
employed during training (not as a post-process operation).
•This introduces a new loss function to the system which regular-
izes the parameters for sharper edge responses.
PKDIA (P.-H.Conze) •An approach based on Conditional Generative Adversarial Networks (cGANs)
is proposed: the generator is built by cascaded pre-trained encoder-decoder (ED)
networks (Conze et al., 2020b) extending the standard 2D U-Net (sU-Net) (Ron-
neberger et al., 2015) (VGG19, following (Conze et al., 2020a)), with 64 channels
(instead of 32 for sU-Net) generated by the first convolutional layer.
•After each max-pooling, the channel number doubles until 512 (256 for sU-Net).
Max-pooling followed by 4 consecutive conv. layers instead of 2. The auto-context
paradigm is adopted by cascading two EDs (Yan et al., 2019): the output of the first
is used as features for the second.
•The ADAM optimizer with a learning rate of 10^-5 is used.
•The fuzzy DICE score is employed as a loss function.
•The batch size was set to 3 for CT and 5 for MR scans.
MedianCHAOS
(V.Groza)
•Averaged ensemble of five different networks is used. The first one is DualTail-Net
that is composed of an encoder, central block, and 2 dependent decoders.
•The other four networks are U-Net variants, i.e. TernausNet (2D U-Net with
VGG11 backbone (Iglovikov and Shvets, 2018)), LinkNet34 (Shvets et al., 2018),
and two with ResNet-50 and SE-Resnet50.
•The training for each network was performed with the ADAM.
•DualTail-Net and LinkNet34 were trained with soft DICE loss
and the other three networks were trained with the combined loss:
0.5*soft DICE +0.5*BCE (binary cross-entropy).
Mountain (Shuo Han) •3D network adopting the U-Net variant in (Han et al., 2019) is used. It differs
from U-Net in (Ronneberger et al., 2015), by adopting: i. A pre-activation residual
block in each scale level at the encoder, ii. Convolutions with stride 2 to reduce the
spatial size, iii. Instance normalization (Ulyanov et al., 2017).
•Two nets, i.e. NET1 and NET2, adopting (Han et al., 2019) with different channels
and levels. NET1 locates organ and outputs a mask for NET2 performing finer
segmentation.
•The ADAM optimizer is used with an initial learning rate of 1 × 10^-3, β1 = 0.9, β2 = 0.999, and ε = 1 × 10^-8.
•DICE coefficient was used as the loss function. The batch size
was set to 1.
CIR MPerkonigg
(M. Perkonigg)
•For joint training with all modalities, the IVD-Net (Dolz et al., 2018) (which is
an extension of 2D U-Net Ronneberger et al. (2015)) is used with a number of
modifications:
i. dense connections between encoder path of IVD-Net are not used since no
improvement is achieved
ii. training images are split.
•Moreover, residual convolutional blocks (He et al., 2016) are used.
•Modality Dropout (Li et al., 2016) is used as the regularization
technique to decrease over-fitting on certain modalities.
•Training is done by using the ADAM optimizer with a learning
rate of 0.001 for 75 epochs.
nnU-Net
(F. Isensee, K. H. Maier-Hein)
•An internal variant of nnU-Net (Isensee et al., 2019), which is the winner of
Medical Segmentation Decathlon (MSD) in 2018 (Simpson et al., 2019), is used.
•The ensemble of five 3D U-Nets (“3d fullres” configuration), which originate from
cross-validation on the training cases. Ensemble of T1 in-phase and oppose-phase
was used.
•T1 in and out are treated as separate training examples, resulting
in a total of 60 training examples for the tasks.
•Task 3 is a subset of Task 5, so training was done only once and
the predictions for Task 3 were generated by isolating the liver.
Table 8. Pre-processing, post-processing and data augmentation operations together with participated tasks.
Team Pre-process Data augmentation Post-process Tasks
OvGUMEMoRIAL Training with resized images
(128 ×128). Inference: full-sized.
- Threshold by 0.5 1,2,3,4,5
ISDUE Training with resized images
(96,128,128)
Random translate and rotate Threshold by 0.5. Bicubic
interpolation for refinement.
1,2,3,4,5
Lachinov Resampling 1.4×1.4×2 z-score
normalization
Random ROI crop 192 ×192 ×64,
mirror X-Y, transpose X-Y, Window
Level - Window Width
Threshold by 0.5 1,2,3
IITKGP-KLIV Training with resized images
(256 ×256), whitening. Additional
class for body.
- Threshold by 0.5 1,2,3,4,5
METU MMLAB Min-max normalization for CT - Threshold by 0.5. Connected
component analysis for
selecting/eliminating some of the
model outputs.
1,3,5
PKDIA Training with resized
images: 256 ×256 MR, 512 ×512
CT.
Random scale, rotate, shear and shift Threshold by 0.5. Connected
component analysis for
selecting/eliminating some of the
model outputs.
1,2,3,4,5
MedianCHAOS
LUT [-240,160] HU range,
normalization.
- Threshold by 0.5. 2
Mountain Resampling 1.2×1.2×4.8, zero
padding. Training with resized
images: 384 ×384 ×64. Rigid
register MR.
Random rotate, scale, elastic
deformation
Threshold by 0.5. Connected
component analysis for
selecting/eliminating some of the
model outputs.
3,5
CIR MPerkonigg Normalization to zero mean unit
variance.
2D Affine and elastic transforms,
histogram shift, flip and adding
Gaussian noise.
Threshold by 0.5. 3
nnU-Net Normalization to zero mean unit
variance, Resampling 1.6×1.6×5.5
Add Gaussian noise /blur, rotate,
scale, WL-WW, simulated low
resolution, Gamma, mirroring
Threshold by 0.5. 3,5
As expected, the tasks that received the highest number of submissions and
scores were the ones focusing on the segmentation of a single
organ from a single modality. Thus, the vast majority of the
submissions were for liver segmentation from CT images (Task
2), followed by liver segmentation from MR images (Task 3).
Accordingly, in the following subsections, the results are pre-
sented in the order of performance/participation in Tab.6 (i.e.
from the task having the most submissions to the one having the
fewest). In this way, the segmentation from cross- and multi-
modality/organ concepts (Tasks 1 and 4) are discussed in light
of the performances of more conventional approaches (Tasks 2,
3, and 5).
6.1. Remarks about Multiple Submissions
Currently, more than 1500 participants are registered to
the CHAOS challenge through the “chaos.grand-challenge.org”
website. There is no direct correlation between “the number of
participants” and “the number of submitted results” (i.e. 550),
because there are passive participants, who never submitted a
result, as well as very active ones, who made multiple submis-
sions. The organizers put no restrictions for registration as long
as the candidates agree to the terms of use (i.e. research pur-
pose only etc.). This first step of joining is intentionally left
unrestricted to encourage wide participation. Moreover, there
is also no restriction requiring the participant to submit a result.
The registration and evaluation system is completely automated
(thanks to the grand-challenge.org website design) and all eval-
uated results are immediately published on the leaderboard re-
gardless of their score. There is no disqualification or exclusion
based on the number of submissions. However, the organizers
check the submitted results quantitatively and qualitatively to
prevent peeking as described in the following paragraphs.
Although test datasets should only be considered as the un-
seen (new) data provided to the algorithms to evaluate their
performance, there is a way to use them at the algorithm de-
velopment stage. This kind of use is called “peeking”, which
is done through reporting too many performance results by it-
erative submissions (Kuncheva, 2014). We claim that peeking
can be considered as one of the shortcomings of the image seg-
mentation grand-challenges. Since access to the ground truth is
not required, peeking makes it possible to use test data to tune parameters, although parameter tuning should only be done during the validation phase.
Fig. 4. Box plot of the methods' scores for (a) Task 1, (b) Task 2, (c) Task 3, (d) Task 4, and (e) Task 5 on test data. White diamonds represent the mean values of the scores. Solid vertical lines inside the boxes represent medians. Separate dots show the scores of each individual case.
Fig. 5. Distribution of the scores for individual cases on test data (panels (a)-(g)).
Table 9. Metric values and corresponding scores of submissions. The given values represent the average of all cases and all organs of the related tasks in
the test data. The best results are given in bold.
Team Name Mean Score DICE DICE Score RAVD (%) RAVD Score ASSD (mm) ASSD Score MSSD (mm) MSSD Score
Task 1
OvGUMEMoRIAL 55.78 ±19.20 0.88 ±0.15 83.14 ±28.16 13.84 ±30.26 24.67 ±31.15 11.86 ±65.73 76.31 ±21.13 57.45 ±67.52 31.29 ±26.01
ISDUE 55.48 ±16.59 0.87 ±0.16 83.75 ±25.53 12.29 ±15.54 17.82 ±30.53 5.17 ±8.65 75.10 ±22.04 36.33 ±21.97 44.83 ±21.78
PKDIA 50.66 ±23.95 0.85 ±0.26 84.15 ±28.45 6.65 ±6.83 21.66 ±30.35 9.77 ±23.94 75.84 ±28.76 46.56 ±45.02 42.28 ±27.05
Lachinov 45.10 ±21.91 0.87 ±0.13 77.83 ±33.12 10.54 ±14.36 21.59 ±32.65 7.74 ±14.42 63.66 ±31.32 83.06 ±74.13 24.30 ±27.78
METU MMLAB 42.54 ±18.79 0.86 ±0.09 75.94 ±32.32 18.01 ±22.63 14.12 ±25.34 8.51 ±16.73 60.36 ±28.40 62.61 ±51.12 24.94 ±25.26
IITKGP-KLIV 40.34 ±20.25 0.72 ±0.31 60.64 ±44.95 9.87 ±16.27 24.38 ±32.20 11.85 ±16.87 50.48 ±37.71 95.43 ±53.17 7.22 ±18.68
Task 2
PKDIA* 82.46 ±8.47 0.98 ±0.00 97.79 ±0.43 1.32 ±1.30 73.6 ±26.44 0.89 ±0.36 94.06 ±2.37 21.89 ±13.94 64.38 ±20.17
MedianCHAOS6 80.45 ±8.61 0.98 ±0.00 97.55 ±0.42 1.54 ±1.22 69.19 ±24.47 0.90 ±0.24 94.02 ±1.6 23.71 ±13.66 61.02 ±21.06
MedianCHAOS3 80.43 ±9.23 0.98 ±0.00 97.59 ±0.44 1.41 ±1.23 71.78 ±24.65 0.9 ±0.27 94.02 ±1.79 27.35 ±21.28 58.33 ±21.74
MedianCHAOS1 79.91 ±9.76 0.97 ±0.1 97.49 ±0.51 1.68 ±1.45 66.8 ±28.03 0.94 ±0.29 93.75 ±1.91 23.04 ±10 61.6 ±16.67
MedianCHAOS2 79.78 ±9.68 0.97 ±0.00 97.49 ±0.47 1.5 ±1.2 69.99 ±23.96 0.99 ±0.37 93.39 ±2.48 27.96 ±23.02 58.23 ±20.27
MedianCHAOS5 73.39 ±6.96 0.97 ±0.01 97.32 ±0.41 1.43 ±1.12 71.44 ±22.43 1.13 ±0.43 92.47 ±2.87 60.26 ±50.11 32.34 ±26.67
OvGUMEMoRIAL 61.13 ±19.72 0.90 ±0.21 90.18 ±21.25 9×10³ ±4×10³ 44.35 ±35.63 4.89 ±12.05 81.03 ±20.46 55.99 ±38.47 28.96 ±26.73
MedianCHAOS4 59.05 ±16 0.96 ±0.02 96.19 ±1.97 3.39 ±3.9 50.38 ±33.2 3.88 ±5.76 77.4 ±28.9 91.97 ±57.61 12.23 ±19.17
ISDUE 55.79 ±11.91 0.91 ±0.04 87.08 ±20.6 13.27 ±7.61 4.16 ±12.93 3.25 ±1.64 78.30 ±10.96 27.99 ±9.99 53.60 ±15.76
IITKGP-KLIV 55.35 ±17.58 0.92 ±0.22 91.51 ±21.54 8.36 ±21.62 30.41 ±27.12 27.55 ±114.04 81.97 ±21.88 102.37 ±110.9 17.50 ±21.79
Lachinov 39.86 ±27.90 0.83 ±0.20 68.00 ±40.45 13.91 ±20.4 22.67 ±33.54 11.47 ±22.34 53.28 ±33.71 93.70 ±79.40 15.47 ±24.15
Task 3
nnU-Net 75.10 ±7.61 0.95 ±0.01 95.42 ±1.32 2.85 ±1.55 47.92 ±25.36 1.32 ±0.83 91.19 ±5.55 20.85 ±10.63 65.87 ±15.73
PKDIA 70.71 ±6.40 0.94 ±0.01 94.47 ±1.38 3.53 ±2.14 41.8 ±24.85 1.56 ±0.68 89.58 ±4.54 26.06 ±8.20 56.99 ±12.73
Mountain 60.82 ±10.94 0.92 ±0.02 91.89 ±1.99 5.49 ±2.77 25.97 ±27.95 2.77 ±1.32 81.55 ±8.82 35.21 ±14.81 43.88 ±17.60
ISDUE 55.17 ±20.57 0.85 ±0.19 82.08 ±28.11 11.8 ±15.69 24.65 ±27.58 6.13 ±10.49 73.50 ±25.91 40.50 ±24.45 40.45 ±20.90
CIR MPerkonigg 53.60 ±17.92 0.91 ±0.07 84.35 ±19.83 10.69 ±20.44 31.38 ±25.51 3.52 ±3.05 77.42 ±18.06 82.16 ±50 21.27 ±23.61
METU MMLAB 53.15 ±10.92 0.89 ±0.03 81.06 ±18.76 12.64 ±6.74 10.94 ±15.27 3.48 ±1.97 77.03 ±12.37 35.74 ±14.98 43.57 ±17.88
Lachinov 50.34 ±12.22 0.90 ±0.05 82.74 ±18.74 8.85 ±6.15 21.04 ±21.51 5.87 ±5.07 68.85 ±19.21 77.74 ±43.7 28.72 ±15.36
OvGUMEMoRIAL 41.15 ±21.61 0.81 ±0.15 64.94 ±37.25 49.89 ±71.57 10.12 ±14.66 5.78 ±4.59 64.54 ±24.43 54.47 ±24.16 25.01 ±20.13
IITKGP-KLIV 34.69 ±8.49 0.63 ±0.07 46.45 ±1.44 6.09 ±6.05 43.89 ±27.02 13.11 ±3.65 40.66 ±9.35 85.24 ±23.37 7.77 ±12.81
Task 4
ISDUE 58.69 ±18.65 0.85 ±0.21 81.36 ±28.89 14.04 ±18.36 14.08 ±27.3 9.81 ±51.65 78.87 ±25.82 37.12 ±60.17 55.95 ±28.05
PKDIA 49.63 ±23.25 0.88 ±0.21 85.46 ±25.52 8.43 ±7.77 18.97 ±29.67 6.37 ±18.96 82.09 ±23.96 33.17 ±38.93 56.64 ±29.11
OvGUMEMoRIAL 43.15 ±13.88 0.85 ±0.16 79.10 ±29.51 5×10³ ±5×10⁴ 12.07 ±23.83 5.22 ±12.43 73.00 ±21.83 74.09 ±52.44 22.16 ±26.82
IITKGP-KLIV 35.33 ±17.79 0.63 ±0.36 50.14 ±46.58 13.51 ±20.33 15.17 ±27.32 16.69 ±19.87 40.46 ±38.26 130.3 ±67.59 8.39 ±22.29
Task 5
nnU-Net 72.44 ±5.05 0.95 ±0.02 94.6 ±1.59 5.07 ±2.57 37.17 ±20.83 1.05 ±0.55 92.98 ±3.69 14.87 ±5.88 75.52 ±8.83
PKDIA 66.46 ±5.81 0.93 ±0.02 92.97 ±1.78 6.91 ±3.27 28.65 ±18.05 1.43 ±0.59 90.44 ±3.96 20.1 ±5.90 66.71 ±9.38
Mountain 60.2 ±8.69 0.90 ±0.03 85.81 ±10.18 8.04 ±3.97 21.53 ±15.50 2.27 ±0.92 84.85 ±6.11 25.57 ±8.42 58.66 ±10.81
ISDUE 56.25 ±19.63 0.83 ±0.23 79.52 ±28.07 18.33 ±27.58 12.51 ±15.14 5.82 ±11.72 77.88 ±26.93 32.88 ±33.38 57.05 ±21.46
METU MMLAB 56.01 ±6.79 0.89 ±0.03 80.22 ±12.37 12.44 ±4.99 15.63 ±13.93 3.21 ±1.39 79.19 ±8.01 32.70 ±9.65 49.29 ±12.69
OvGUMEMoRIAL 44.34 ±14.92 0.79 ±0.15 64.37 ±32.19 76.64 ±122.44 9.45 ±11.98 4.56 ±3.15 71.11 ±18.22 42.93 ±17.86 39.48 ±16.67
IITKGP-KLIV 25.63 ±5.64 0.56 ±0.06 41.91 ±11.16 13.38 ±11.2 11.74 ±11.08 18.7 ±6.11 35.92 ±8.71 114.51 ±45.63 11.65 ±13.00
* Corrected submission of PKDIA, received right after the on-site session (during the challenge, they submitted the same results but in a reversed orientation; therefore, the winner of Task 2 at the conference session was MedianCHAOS6).
Ultimately, the peeking phenomenon is an important and still unresolved problem, particularly for evaluating the real-life performance of machine learning-based models. A comparison of a fair development workflow and peeking attempts is illustrated in Fig. 6.
Fig. 6. Illustration of a fair study (green area) and peeking attempts (red area)
After a rigorous analysis of the literature and of the outcomes of previous challenges, the CHAOS organizers identified the following potential source of peeking:
Multiple submissions are normally allowed, and the results are disclosed to the participants. This allows participants to tune their models on the test data, which is a form of “peeking” (Smialowski et al., 2010; Reunanen, 2003; Diciotti et al., 2013). Therefore, a model that is fine-tuned using the outcomes (performance metrics or resulting images) of successive submissions may become over-tuned to the test data. Besides, at each iteration, the model is likely to become more sophisticated and not always reproducible.
This issue requires further in-depth analysis through extensive experimentation.
Table 10. Results of selected peeking attempts identified from the online results of the CHAOS challenge. The impact of peeking can be observed from the score changes. (Team names are anonymized.)
Participant Number of iterative
submissions Score change
Team A 21 +29.09%
Team B 19 +15.71%
Team C 16 +12.47%
Team D 15 +30.02%
Team E 15 +26.06%
Team F 13 +10.10%
During the evaluation of the results submitted to CHAOS (while still allowing multiple submissions without any restriction), the results were also analyzed for possible peeking attempts; the related observations are discussed below.
Since there is still no statistical tool that can mathematically prove peeking through multiple submissions, the corresponding analyses are carried out interactively by communicating with the participants.
To differentiate whether a score increase is due to an advancement of the technique/model or due to peeking, we consider two indications of peeking: 1) submitting consecutive results within short time intervals, and 2) making only data-specific changes in the succeeding submission (i.e. changes affecting only some particular cases). Although this is clearly a limited subset of all possible cases, the available tools only allow the analysis of these conditions. If the above-mentioned suspicions are supported by the evaluation metrics, the participants are asked to justify the improvements leading to their better performance.
An example of a suspicious condition can be described as follows: assume that a participating Team X submitted their results and received score Y. Just a couple of hours later, they submitted another result and received score Y+α. Of course, this can be attributed to model improvement or parameter adjustment, but when the results are investigated case by case, it may be observed that only the performance of some particular cases has changed. If the team provides no reasonable explanation for this improvement, the corresponding participants are assumed to have benefited from peeking. The number of submissions
and the percentage of score increase of the detected teams are given in Tab.10. These results show that the impact of peeking can be noteworthy in some cases.
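The two indicators described above (short submission intervals and changes affecting only particular cases) translate into a simple screening rule over the submission log. The sketch below is only an illustration of that rule under assumed data structures (a list of submissions with timestamps and per-case scores) and assumed thresholds; it is not the evaluation platform's actual code.

from datetime import timedelta

def flag_possible_peeking(submissions, max_gap_hours=6, max_changed_fraction=0.3):
    """Flag pairs of consecutive submissions by the same team that (1) arrive
    within a short interval and (2) change the scores of only a few cases.

    submissions: list of dicts with keys 'team', 'time' (datetime) and
    'case_scores' (dict: case id -> score), sorted by submission time.
    The thresholds are illustrative; flagged pairs still need manual review.
    """
    by_team = {}
    for sub in submissions:
        by_team.setdefault(sub["team"], []).append(sub)

    flagged = []
    for team, subs in by_team.items():
        for prev, curr in zip(subs, subs[1:]):
            # Indicator 1: consecutive results submitted in a short time interval.
            quick = curr["time"] - prev["time"] <= timedelta(hours=max_gap_hours)
            # Indicator 2: only some particular cases changed between submissions.
            shared = prev["case_scores"].keys() & curr["case_scores"].keys()
            changed = [c for c in shared
                       if prev["case_scores"][c] != curr["case_scores"][c]]
            targeted = (len(shared) > 0 and len(changed) > 0
                        and len(changed) / len(shared) <= max_changed_fraction)
            if quick and targeted:
                flagged.append((team, prev["time"], curr["time"], changed))
    return flagged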
6.2. CT Liver Segmentation (Task 2)
This task includes one of the most frequently studied cases
and a very mature field of abdominal segmentation. There-
fore, it provides a good opportunity to test the effectiveness of
the participating models compared to the existing approaches.
Although the provided datasets only include healthy organs,
the injection of contrast media creates several additional chal-
lenges, as described in Section III.B. Nevertheless, the highest
scores of the challenge were obtained in this task (Fig.4b).
The on-site winner was MedianCHAOS with a score of 80.45±8.61, and the online winner is PKDIA with 82.46±8.47. As MedianCHAOS is an ensemble strategy, the performance of its sub-networks is illustrated in Fig. 5.c. When individual metrics are analyzed, the DICE performance seems outstanding (i.e. 0.98±0.00) for both winners (i.e. scores of 97.79±0.43 for PKDIA and 97.55±0.42 for MedianCHAOS). Similarly, the ASSD performance is very strong, with high mean scores and small variance (i.e. 0.89±0.36 [score: 94.06±2.37] for PKDIA and 0.90±0.24 [94.02±1.6] for MedianCHAOS). On the other hand, the RAVD and MSSD scores are dramatically lower, resulting in reduced overall performance. This outcome holds for all tasks and participating methods.
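For reference, the two volumetric measures behind these scores can be computed directly from binary masks. The sketch below uses the standard definitions (DICE as twice the overlap divided by the sum of the two volumes, RAVD as the absolute volume difference relative to the reference volume, in percent); it is an illustrative implementation, not the challenge's official evaluation code.

import numpy as np

def dice_coefficient(pred, ref):
    """DICE = 2 * |pred AND ref| / (|pred| + |ref|), for binary masks."""
    pred, ref = np.asarray(pred, bool), np.asarray(ref, bool)
    overlap = np.logical_and(pred, ref).sum()
    total = pred.sum() + ref.sum()
    return 2.0 * overlap / total if total > 0 else 1.0

def ravd_percent(pred, ref):
    """Relative absolute volume difference, as a percentage of the reference volume."""
    pred_vol = int(np.asarray(pred, bool).sum())
    ref_vol = int(np.asarray(ref, bool).sum())
    return 100.0 * abs(pred_vol - ref_vol) / ref_vol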
Regarding the semi-automatic approaches in (Kavur et al., 2020), the best three entries received scores of 72.8 (active contours, with a mean interaction time (MIT) of 25 minutes), 68.1 (robust static segmenter, with an MIT of 17 minutes), and 62.3 (watershed, with an MIT of 8 minutes). Thus, the successful deep learning-based automatic segmentation designs among the participants have outperformed the interactive approaches by a large margin. The quality of the segmentation
reaches almost the inter-expert level for volumetric analysis and
average surface differences. However, there is still a need for
improvement considering the metrics related to maximum er-
ror margins (i.e. RAVD and MSSD). An important drawback
of the deep approaches is that they might completely fail and
generate unreasonably low scores for particular cases, such as
the inferior vena cava region shown in Fig.7.
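Since the distance-based metrics discussed here (ASSD as the average and MSSD as the maximum symmetric surface distance) drive the surgically relevant error margins, a brief sketch of how they can be computed from binary masks may be useful. It is an illustration under the standard definitions using distance transforms, not the official evaluation code; the voxel spacing is passed explicitly because spacing differs between the CT and MR data.

import numpy as np
from scipy import ndimage

def surface_distances(pred, ref, spacing=(1.0, 1.0, 1.0)):
    """Return (ASSD, MSSD) in millimetres for two binary masks.

    The surfaces are extracted as the voxels removed by a one-voxel erosion,
    and distances are read from the Euclidean distance transform of the
    opposite surface; 'spacing' is the voxel size in mm along each axis.
    """
    pred, ref = np.asarray(pred, bool), np.asarray(ref, bool)
    pred_surface = pred ^ ndimage.binary_erosion(pred)
    ref_surface = ref ^ ndimage.binary_erosion(ref)
    # Distance of every voxel to the nearest surface voxel of the other mask.
    dist_to_ref = ndimage.distance_transform_edt(~ref_surface, sampling=spacing)
    dist_to_pred = ndimage.distance_transform_edt(~pred_surface, sampling=spacing)
    d = np.concatenate([dist_to_ref[pred_surface], dist_to_pred[ref_surface]])
    return d.mean(), d.max()  # ASSD (average), MSSD (maximum)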
Regarding the effect of architectural design on performance,
comparative analyses have been performed through some well-
established deep frameworks (i.e. DeepMedic (Kamnitsas et al.,
2017) and NiftyNet (Gibson et al., 2018)). These models have
been applied with their default parameters and they have both
achieved scores of around 70. Thus, considering the participating models that received scores below 70, it is safe to conclude that crafting new deep architectural designs or performing extensive parameter tuning does not necessarily translate into more successful systems.
6.3. MR Liver Segmentation (Task 3)
Segmentation from MR can be considered a more difficult
operation compared to segmentation from CT because CT im-
ages have a typical histogram and dynamic range defined by
Hounsfield Units (HU), whereas MRI does not have such a
standardization. Moreover, artifacts and other factors in clin-
ical routine cause critical degradation of the MR image qual-
ity. The on-site winner of this task was PKDIA with a score of 70.71±6.40. PKDIA had the most successful results not only in terms of the mean score but also in terms of the distribution of the results (shown in Fig.4c and 5d). Robustness to deviations in MR data quality is an important factor that affects performance. For instance, CIR MPerkonigg, which has the most successful scores for some cases, could not achieve a high overall score.
The online winner is nnU-Net with 75.10±7.61. When the
scores of individual metrics are analyzed for PKDIA and nnU-
Net, DICE (i.e. 0.94±0.01 [score: 94.47±1.38] for PKDIA
and 0.95±0.01 [score: 95.42±1.32] for nnU-Net) and ASSD
(i.e. 1.32±0.83 [score: 91.19±5.55] for nnU-Net and 1.56±0.68
[score: 89.58±4.54] for PKDIA) performance is again ex-
tremely good, while RAVD and MSSD scores are critically
lower than the CT results. The reason behind this can also
be attributed to the lower resolution and higher spacing of the
MR data, which cause a higher spatial error for each misclassi-
fied pixel/voxel (see Tab.2). Comparisons with the interactive
methods show that they tend to make regional mistakes due to
the spatial enlargement strategies. The main challenge for them
is to differentiate the outline when the liver is adjacent to iso-
dense structures. On the other hand, automatic methods show
much more distributed mistakes all over the liver. Further anal-
ysis also revealed that interactive segmentation methods tend to
make fewer over-segmentations. This is partly related to itera-
tive parameter adjustment by the operator which prevents unex-
pected results. Overall, the participating methods performed as well as the interactive methods when only volumetric metrics are considered. However, the interactive approaches seem to outperform the deep models on the other metrics.
6.4. CT-MR Liver (Cross-Modality) Segmentation (Task 1)
This task targets cross-modality learning and it involves the
usage of CT and MR information together during training. A
model that can effectively accomplish cross-modality learning
would: 1) help to satisfy the need for large amounts of training data by providing more images, and 2) reveal features of an organ that are common across the incorporated modalities. To compare cross-modality learning with single-modality training, Fig.5a should be compared to Fig.5c for CT. Such a comparison clearly reveals that the participating models trained only on CT data perform noticeably better than the models trained on both modalities. A similar observation can also be made for the MR results by comparing Fig.5b and Fig.5d. This shows that there are still improvements
necessary for a single solution working on images of multiple
modalities. However, remarkable developments in the machine
learning field may overcome these problems.
The on-site winner of this task was OvGUMEMoRIAL with
a score of 55.78±19.20. Although its DICE performance is quite satisfactory (i.e. 0.88±0.15, corresponding to a score of 83.14±28.16), the other measures explain the lower overall score. A very interesting observation here is that the Task 1 score of OvGUMEMoRIAL is lower than its CT-only score (61.13±19.72) but higher than its MR-only score (41.15±21.61). Another interesting observation is that PKDIA, the highest-scoring non-ensemble model for both Task 2
Fig. 7. Example image from the CHAOS CT dataset, (case 35, slice 95), borders of segmentation results on ground truth mask and zoomed onto inferior
vena cava (IVC) region (marked with dashed lines on the middle image). In this example, the contrast between liver tissue and IVC is relatively lower
due to sub-optimal timing during the CT scan. Accordingly, it creates a challenging case for the participating algorithms. (Scores of this slice are;
PKDIA:91.13, MedianCHAOS6:91.84, MedianCHAOS3:85.42, MedianCHAOS1:82.55, MedianCHAOS2:83.46, MedianCHAOS5:88.38, OvGUMEMo-
RIAL:85.61, MedianCHAOS4:81.74, ISDUE:65.49, IITKGP-KLIV:66.27, Lachinov:64.18)
(CT) and Task 3 (MR), had a dramatic performance drop in this task.
It is important to examine the scores of individual cases together with their distribution across all data. This can help to analyze the generalization capabilities and real-life usability of these systems. For example, Fig.4.a shows a noteworthy situation: the winner of Task 1, OvGUMEMoRIAL, shows a less consistent performance than the second-ranked method (ISDUE) in terms of standard deviation. Fig.5a and 5b show that the competing algorithms have slightly higher scores on the CT data than on the MR data. However, if the scattering of the individual scores across the data is considered, the CT scores have higher variability. This shows that reaching equal
generalization for multiple modalities is a challenging task for
Convolutional Neural Networks (CNNs).
6.5. Multi-Modal MR Abdominal Organ Segmentation (Task 5)
Task 5 investigates how DL models contribute to the develop-
ment of more comprehensive computational anatomical models
leading to multi-organ related tasks. Deep models have the po-
tential to provide a complete representation of the complex and
flexible abdominal anatomy by incorporating inter-organ rela-
tions through their internal hierarchical feature extraction pro-
cesses. In order to qualitatively analyze their performance, an illustration of the ground truth and the results of all teams on a sample image is presented in Figs. 8 and 9.
The on-site winner was PKDIA with a score of 66.46±5.81
and the online winner is nnU-Net with 72.44±5.05. When the
scores of individual metrics are analyzed in comparison to Task
3, the DICE performance seems to remain almost the same
for nnU-Net and PKDIA. This is an important outcome as all
four organs are segmented instead of a single one. It is also worth pointing out that the third-place model (i.e. Mountain) has almost exactly the same overall score for Tasks 3 and 5. The same observation is also valid for the standard deviation of these models. Considering RAVD, the performance decrease seems to be larger than for DICE. This reduced
Fig. 8. Example image from the CHAOS MRI T2SPIR dataset, (case 24, slice 23), borders of segmentation results on ground truth mask for liver and
spleen, and zoomed onto the marked region. The inhomogeneous intensity distribution of the liver causes errors (both over- and under-segmentation in
specific regions) on the segmentation results. In general, such a problem is not observed for the segmentation of kidneys and spleen. (Scores of this slice
are; nnU-net:62.07, PKDIA:65.15, mountain:59.08, METU MMLAB:57.98, ISDUE:53.76, OvGUMEMoRIAL:49.65, IITKGP-KLIV:23.19)
Fig. 9. Illustration of ground truth and all results for Task 5. The image was taken from the CHAOS MR dataset (case 40, slice 15). White lines on the
results represent borders of ground truth. (Scores of this slice are; nnU-net:74.52, PKDIA:74.37, mountain:44.09, METU MMLAB:42.32, ISDUE:60.90,
OvGUMEMoRIAL:55.21, IITKGP-KLIV:55.46)
DICE and RAVD performance is partially compensated by bet-
ter MSSD and ASSD performances.
Follow-up studies that use the CHAOS dataset but whose outputs were not submitted to the challenge have also reported results for this task (Sinha and Dolz, 2020). Considering an attention-based model (Wang et al., 2018) as the baseline (DICE: 0.83 ±0.06), an ablation study was carried out that reported increased segmentation performance (Sinha and Dolz,
2020). The limitations are reduced by capturing richer contex-
tual dependencies through guided self-attention mechanisms.
Architectural modifications for integrating local features with
their global dependencies and adaptive highlighting of interde-
pendent channel maps reveal better performance (DICE: 0.87
±0.05) compared to some other models such as U-net (Ron-
neberger et al., 2015) (DICE: 0.81 ±0.08), DANet (Fu et al.,
2019) (DICE: 0.83 ±0.10), PAN(ResNet) (Li et al., 2018)
(DICE: 0.84 ±0.06), and UNet Attention (Schlemper et al.,
2019) (DICE: 0.85 ±0.05).
Despite these slight improvements achieved by novel architectural designs, the performance of the proposed models still seems to be below that of the best three contestants of the CHAOS challenge (i.e. nnU-Net: 0.95 ±0.02, PKDIA: 0.93 ±0.02, and Mountain: 0.90 ±0.03). This is also observed for other metrics such as ASSD (nnU-Net: 1.05 ±0.55). Nevertheless, the modifications by (Sinha and Dolz, 2020) reduced the standard deviation of the other metrics rather than their mean
values. The qualitative analysis performed to visualize the effect of the proposed modifications illustrates that some models (such as U-Net) typically under-segment certain organs and produce smoother segmentations, causing a loss of fine-grained detail. The architectural modifications are especially helpful in compensating for such drawbacks by focusing the attention of the model on anatomically more relevant areas.
6.6. CT-MR Abdominal Organ Segmentation (Cross-Modality
Multi Modal) (Task 4)
This task covers the segmentation of both the liver in CT and
four abdominal organs in MRI data. Hence, it can be considered
as the most difficult task since it contains both cross-modality
learning and multiple organ segmentation. Therefore, it is not
surprising that it has the lowest participation and scores.
The on-site winner was ISDUE with a score of 58.69±18.65.
Fig. 5.e-f shows that their solution had a consistent and high-performing score distribution on both CT and MR data. It can be thought that the two convolutional encoders in their system, which are able to compress information about the anatomy, boost performance on cross-modality data. On the other hand, PKDIA also showed promising performance with a score of 49.63±23.25. Despite its success on the MRI sets, its CT performance can be considered unsatisfactory, similar to its situation in Task 1. This suggests that the CNN-based encoder may not have been trained effectively: as the encoder part of the solution relies on transfer learning, fine-tuning the pre-trained weights was not successful across multiple modalities. The OvGUMEMoRIAL team achieved the third position with an average score of 43.15 and showed a balanced performance on both modalities. Their method can be considered successful in terms of generalization compared to the other participating teams.
Together with the outcomes of Tasks 1 and 5, this shows that, with the current strategies and architectures, CNNs have better segmentation performance on single-modality tasks. This might
be considered as an expected outcome because the success of
CNNs is highly dependent on the consistency and homogeneity
of the data. Using multiple modalities creates a high variance
in the data even though all data were normalized. On the other
hand, the results also revealed that CNNs have good potential
for cross-modality tasks if appropriately extended models are
constructed. This potential was not that clear before the devel-
opment of deep learning strategies for segmentation.
6.7. Remarks on Ranking Stability and Robustness
It is well known and extensively discussed in the medical
imaging community that the evaluation strategy plays a key role
in the rankings (Maier-Hein et al., 2018). The performance of a
model relies on how the metrics are transformed into scores, and the main factor in this transformation is the selection of the thresholds. In CHAOS, the expert physicians (i.e. a team of radiologists and surgeons) determined these thresholds after extensive discussions. Nevertheless, to explore the stability of the rankings with respect to their dependency on the thresholds, the scores were re-calculated using new threshold values. To measure the sensitivity, the thresholds were changed by 5% in two ways:
1. Thresholds were changed by 5% to accept/favor more precise segmentation results (higher for DICE, lower for the other metrics). The DICE threshold was increased to 0.84, while the RAVD, ASSD, and MSSD thresholds were decreased to 4.75%, 14.2 mm, and 57 mm, respectively.
2. Thresholds were changed by 5% to accept/favor less precise segmentation results (lower for DICE, higher for the other metrics). These values were calculated as 0.76 for DICE, 5.25% for RAVD, 15.75 mm for ASSD, and 63 mm for MSSD.
For both cases, all of the scores and rankings were re-
calculated. The results are presented in Table 11, which shows
that there are no noteworthy changes in the rankings. The only
change is between teams METU MMLAB and ISDUE (4th and
5th places on the scoreboard) in Task 5. However, the scores of
these teams are very close and the change occurs in the deci-
mals (i.e. 0.24 difference in favor of ISDUE at original thresh-
olds and 0.44 in favor of METU MMLAB when thresholds are
decreased by 5%). According to these results, it is possible to
state that the rankings in CHAOS are robust and they are not
influenced by small threshold changes.
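For illustration, the sensitivity test above can be sketched as code. The default thresholds below (DICE 0.80, RAVD 5%, ASSD 15 mm, MSSD 60 mm) are the values implied by the ±5% figures quoted above, and the linear mapping from a metric value to a 0-100 score (with values beyond the threshold scoring zero) is an assumption about the form of the scoring function, not a restatement of its official definition; missing cases are given zero points, as noted below.

def metric_to_score(value, threshold, best, higher_is_better):
    """Map a metric value linearly onto [0, 100]: 'best' maps to 100, the
    threshold maps to 0, and anything beyond the threshold is clipped to 0
    (assumed mapping; see the lead-in above)."""
    if higher_is_better:
        frac = (value - threshold) / (best - threshold)
    else:
        frac = (threshold - value) / (threshold - best)
    return 100.0 * max(0.0, min(1.0, frac))

# Default thresholds implied by the +/-5% values quoted above (assumed).
METRICS = {
    "DICE": dict(threshold=0.80, best=1.0, higher_is_better=True),
    "RAVD": dict(threshold=5.0, best=0.0, higher_is_better=False),   # percent
    "ASSD": dict(threshold=15.0, best=0.0, higher_is_better=False),  # mm
    "MSSD": dict(threshold=60.0, best=0.0, higher_is_better=False),  # mm
}

def case_score(values, precision_shift=0.0):
    """Average the four per-metric scores of one case (non-existing cases are
    given zero points elsewhere). precision_shift = +0.05 tightens the
    thresholds by 5% (DICE threshold up, error thresholds down), while
    precision_shift = -0.05 loosens them, reproducing the two sensitivity
    scenarios described in the text."""
    per_metric = []
    for name, cfg in METRICS.items():
        if cfg["higher_is_better"]:
            thr = cfg["threshold"] * (1.0 + precision_shift)
        else:
            thr = cfg["threshold"] * (1.0 - precision_shift)
        per_metric.append(metric_to_score(values[name], thr,
                                          cfg["best"], cfg["higher_is_better"]))
    return sum(per_metric) / len(per_metric)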
The authors believe that the stability of the rankings is achieved by carefully following the organizational suggestions given in (Maier-Hein et al., 2018). As highlighted by Maier-Hein et al. (2018), three main points can strongly affect the scores; when designing the challenge (especially its evaluation stage), their requirements were satisfied in CHAOS as follows:
1) Preventing ranking alterations due to minor changes in metrics: the CHAOS rankings are robust to such changes, as shown in Tab.11.
2) Making the ground truth less dependent on annotator differences: this issue is resolved by using three annotators and their consensus as the ground truth.
3) Handling of missing data to block rank manipulation: This
is handled by giving zero points to non-existing cases.
For the remaining major factor in the evaluation, the aggregation of the different metrics, averaging was chosen in CHAOS (instead of alternatives such as the median (Maier-Hein et al., 2018)), since the physicians consider the metrics used to be equally important from multiple clinical perspectives (such as surgical precision, follow-up analysis, etc.).
7. Discussions and Conclusion
In this paper, we presented the CHAOS challenge. We generated an unpaired cross-modality (CT-MR), multi-modality (MR T1-DUAL in/oppose, T2-SPIR) public dataset for five tasks and evaluated a considerable number of newly proposed, well-established, and state-of-the-art segmentation methods. Five different tasks targeting single-modality (CT or MR), cross-modality (CT and MR), and multi-modal (MR T1 in/oppose and T2 sequences) segmentation were prepared. The evaluation was performed using a scoring system based on four metrics. Our results indicate several important outcomes.
7.1. Task-Based Conclusions
Task-based conclusions can be highlighted as follows:
Table 11. Mean scores of the submissions against different thresholds. The left column represents the scores with selected thresholds in the challenge. The
middle column shows the scores obtained with 5% higher thresholds favoring more precise segmentation results (higher for DICE and lower for other
metrics). The right column shows the scores obtained with thresholds reduced by 5% (lower for DICE and higher for other metrics). The best results are
given in bold.
Team Name Mean score with
default thresholds
Mean score with
5% more precise thresholds
Mean score with
5% less precise thresholds
Task 1
OvGUMEMoRIAL 55.78 ±19.20 54.36 ±19.98 57.04 ±18.76
ISDUE 55.48 ±16.59 51.76 ±17.02 56.64 ±16.24
PKDIA 50.66 ±23.95 46.61 ±24.12 52.14 ±23.02
Lachinov 45.10 ±21.91 44.85 ±22.14 48.09 ±20.85
METU MMLAB 42.54 ±18.79 38.54 ±17.98 42.25 ±18.22
IITKGP-KLIV 40.34 ±20.25 34.65 ±20.95 43.21 ±19.87
Task 2
PKDIA 82.46 ±8.47 81.83 ±8.11 83.51 ±8.58
MedianCHAOS6 80.45 ±8.61 79.57 ±8.17 81.28 ±8.98
MedianCHAOS3 80.43 ±9.23 79.55 ±9.88 81.27 ±8.99
MedianCHAOS1 79.91 ±9.76 78.95 ±10.01 80.77 ±9.25
MedianCHAOS2 79.78 ±9.68 78.86 ±10.09 80.65 ±9.23
MedianCHAOS5 73.39 ±6.96 72.35 ±7.24 74.32 ±6.25
OvGUMEMoRIAL 61.13 ±19.72 60.00 ±20.14 62.16 ±19.11
MedianCHAOS4 59.05 ±16.00 58.14 ±16.88 59.88 ±15.24
ISDUE 55.79 ±11.91 54.87 ±12.25 57.61 ±11.43
IITKGP-KLIV 55.35 ±17.58 54.15 ±18.27 56.48 ±17.26
Lachinov 39.86 ±27.90 36.75 ±28.14 41.94 ±27.52
Task 3
nnU-Net 75.10 ±7.61 74.65 ±7.89 76.85 ±7.11
PKDIA 70.71 ±6.40 69.59 ±6.95 71.76 ±6.02
Mountain 60.82 ±10.94 59.18 ±11.24 61.95 ±10.56
ISDUE 55.17 ±20.57 54.11 ±21.01 56.15 ±20.06
CIR MPerkonigg 53.60 ±17.92 52.71 ±18.31 55.45 ±17.16
METU MMLAB 53.15 ±10.92 52.05 ±11.17 54.32 ±10.33
Lachinov 50.34 ±12.22 48.91 ±12.89 51.17 ±12.03
OvGUMEMoRIAL 41.15 ±21.61 39.07 ±22.10 42.60 ±21.35
IITKGP-KLIV 34.69 ±8.49 33.96 ±8.98 35.40 ±8.12
Task 4
ISDUE 58.69 ±18.65 56.27 ±18.97 58.79 ±18.12
PKDIA 49.63 ±23.25 48.85 ±23.96 50.65 ±22.99
OvGUMEMoRIAL 43.15 ±13.88 43.99 ±14.02 48.18 ±13.56
IITKGP-KLIV 35.33 ±17.79 24.61 ±18.02 37.36 ±17.32
Task 5
nnU-Net 72.44 ±5.05 71.89 ±5.56 73.57 ±4.98
PKDIA 66.46 ±5.81 64.77 ±6.01 67.32 ±5.45
Mountain 60.20 ±8.69 57.72 ±8.98 61.54 ±8.25
ISDUE 56.25 ±19.63 54.78 ±20.21 57.51 ±19.56
METU MMLAB 56.01 ±6.79 52.73 ±7.12 57.95 ±6.23
OvGUMEMoRIAL 44.34 ±14.92 40.97 ±15.25 46.07 ±14.65
IITKGP-KLIV 25.63 ±5.64 24.41 ±5.89 26.66 ±5.38
1) Since the start of the competition (11 April 2019), the
most popular task, Task 2 (liver segmentation on CT), has re-
ceived more than 200 submissions within eight months. Quan-
titative analyses on Task 2 show that CNNs for segmentation of the liver from CT have achieved great success. Deep learning-based automatic methods outperformed interactive semi-automatic strategies for CT liver segmentation. They have reached inter-expert variability for DICE and volumetry, but still need further improvements for the distance-based metrics that are critical for determining surgical error margins. Supporting the quantitative analyses, our qualitative observations suggest that the top methods can be used in real-life solutions with little post-processing effort.
2) Considering MR liver segmentation (Task 3), the partici-
pating deep models have performed almost equally well as in-
teractive ones for DICE, but lack in performance for distance-
based measures. Given the outstanding results for this task
and the fact that the resulting volumes will be visualized by a
radiologist-surgeon team prior to various operations in the con-
text of clinical routine, it can be concluded that minimal user
interaction, especially in the post-processing phase, would eas-
ily bring the single modality MR results to clinically acceptable
levels. Of course, this would require not only having a software
implementation of the participating methods, but also their integration into an adequate workstation/DICOM viewer that is easily accessible in the daily workflow of the physician.
3) Deep models perform better in the segmentation of the
four abdominal organs (Task 5) compared to the segmentation
of only the liver. However, it is not clear whether this improve-
ment can be attributed to multi-tasking. For instance, due to
its relatively larger size and higher shape variation, the MSSD performance of the models for the liver is worse compared to the other organs (i.e. average MSSD values on MRI: liver 61.01 mm, right kidney 44.31 mm, left kidney 46.57 mm, and spleen 44.22 mm). Accordingly, when all organs are segmented, aver-
age MSSD becomes lower (i.e. better) compared to liver seg-
mentation. Our in-depth analyses show that even for slight per-
formance gains, the reviewed methods will need substantial im-
provement, or new approaches have to be developed. However,
the impact and the importance of these slight gains in segmen-
tation quality may not justify the effort.
This conclusion is also validated by independent studies,
which used the CHAOS dataset and utilized ablation studies
to improve model performance. In (Sinha and Dolz, 2020), a
series of experiments are performed to validate the individual
contribution of different components to the segmentation per-
formance. Compared to the baseline, integrating spatial or attention modules into the architecture was observed to increase the performance by 2-3% for DICE and 12-18% for ASSD, while employing both modules brings only slight improvements for DICE and reduces ASSD. Thus, the channel attention module was chosen in the final design simply by observing the parametric simulation results. Besides, such ablation studies, relying on extensive experimentation under different settings, might yield data-dependent models with lower generalization ability.
4) We observed that the performances reported for the remaining tasks using cross-modality data, i.e., Task 1 (liver segmentation on CT + MRI) and Task 4 (segmentation of abdominal organs on CT + MRI), are clearly lower than the aforementioned ones. This shows that, despite the important developments brought by DL models for segmentation, their application to real-world clinical problems still needs major progress. Thus, cross-modality (CT-MR) learning still proved to be more challenging than individual training. Last but not least, multi-organ cross-modality segmentation remains the most challenging problem until appropriate ways are developed to take advantage of the multi-tasking properties of deep models and of the larger amount of data offered by cross-modal medical datasets. Such complicated tasks could benefit from spatial priors and global topological or shape representations in their loss functions, as employed by some of the submitted models.
7.2. Conclusions about Participating Models
Except for one, all teams involved in this challenge have used
a modification of U-Net as a primary regressor model or as a
support system. However, the high variance between reported
scores shows that the understanding of the model performance
still relies on many parameters including architectural design,
implementation, parametric modifications, optimizations, and
tuning. Although several common algorithmic properties can
be derived for high-scoring models, an interpretation and/or ex-
planation of why a particular model performs well or not is far
from being trivial as relations among these factors are poorly
defined. As discussed in the previous challenges, such an analy-
sis is almost impossible on a heterogeneous set of models devel-
oped by different teams and programming environments. More-
over, the selection of evaluation metrics, their transformations
to scoring, and calculation of the final scores might have an im-
pact on the reported performances.
We believe DL research is essential to develop effective so-
lutions for medical image segmentation. However, instead of
focusing solely on segmentation accuracy, the following issues
should be addressed to apply DL methods to real-world clin-
ical use: improving generalization through domain adaptation
strategies (Yang et al., 2019; Gholami et al., 2019; Schoenauer-
Sebag et al., 2019), optimizing neural network architectures to reduce the computational cost (Belagiannis et al., 2019; Carreira-Perpiñán and Idelbayev, 2018), attaching importance to repeatability and reproducibility (Nikolov et al., 2018), and focusing on interpretable solutions. Moreover, combining existing strategies, especially atlas-based methods (which are still commonly used for benchmarking (Kim et al., 2020)), with deep models would enable incorporating spatial knowledge and might have the potential to improve the performance of DL techniques (Gao et al., 2020).
7.3. Conclusions about Multiple Submissions, Peeking and En-
sembles
The organizers hope that the submissions included in this article had fair development stages without peeking attempts. In
general, the peeking problem does not exist with on-site chal-
lenges that announce the results in a short time. On the other
hand, it is a general problem of many online challenges not only
in medical image analysis but also in other fields. In CHAOS,
various approaches were tested to prevent peeking. Unfortunately, to the best of our knowledge, there is no clear and elegant way to handle this problem completely. According to our experience from online submissions, precautions such as limiting the number of submissions, requiring official university/company e-mail addresses, and demanding a manuscript that explains the methods are useful, but do not cover all situations. Another viable solution is accepting Docker containers that contain the source code of the algorithms instead of their results. However, this may require additional preparation time for both challenge organizers and participants.
The effort of participants to outperform other results may lead to misleading performance reports. The scores obtained at the end of the on-site challenge session make peeking almost impossible. However, this is not true for online submissions. For this reason, in this paper, we have put great effort into including online submissions that not only show high performance, but whose results can also be verified in one of the following ways:
1) Uploading source code and/or model to an open access
repository (such as GitHub),
2) Submitting a PDF document, which explains the utilized
approach, and
3) Providing references that show the previous uses of the
utilized method/model.
Finally, it is also worth pointing out that several medical segmentation challenges demonstrate the superiority of ensembles by combining the top-performing models on the scoreboard (Kamnitsas et al., 2018; Isensee et al., 2019). It is well known that, in many challenges, the amount of training data is limited due to the high expense of gathering and annotating medical volumetric datasets (Heimann et al., 2009; Bilic et al., 2019; Menze et al., 2014). Being relatively small for the proper training of a deep model, this can lead to overfitting of individual models. However, classifier ensembles are known to achieve better results compared to their base classifiers even when those classifiers are over-trained (Kuncheva, 2014; Prevedello et al., 2019). Accordingly, when the top methods (usually deep models) are combined through some rule (such as majority voting), the result usually becomes better than the best individual result (Bilic et al., 2019; Menze et al., 2014; Jimenez-del Toro et al., 2016; Kavur et al., 2020). On the other hand, such results should be analyzed carefully due to the dependency between the training and test data introduced during the construction of the ensembles.
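As a purely illustrative sketch (the combination rule is the generic majority vote mentioned above, not the ensemble of any particular challenge entry), binary masks from several models can be fused as follows:

import numpy as np

def majority_vote(masks):
    """Fuse binary segmentation masks from several models by majority voting:
    a voxel is foreground if more than half of the models mark it as such."""
    stacked = np.stack([np.asarray(m, bool) for m in masks], axis=0)
    return stacked.sum(axis=0) > (len(masks) / 2.0)

For an even number of models, ties are resolved towards background here; other rules, such as weighting the models by their validation performance, are equally possible.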
Acknowledgments
The organizers would like to thank Ivana Isgum and Tom
Vercauteren in the challenge committee of ISBI 2019 for their
guidance and support. We express our gratitude to supporting
organizations of the grand-challenge.org platform. We thank
Umut Baran Ekinci, Ece Köse, Fabian Isensee, David Völgyes, and Javier Coronel for their contributions. Last but not least,
our special thanks go to Ludmila I. Kuncheva for her valuable
contributions.
This work is supported by Scientific and Technological Re-
search Council of Turkey (TUBITAK) ARDEB-EEEAG un-
der grant number 116E133 and TUBITAK BIDEB-2214 In-
ternational Doctoral Research Fellowship Programme. The
work of P. Ernst, S. Chatterjee, O. Speck, and A. Nürnberger
was conducted within the context of the International Gradu-
ate School MEMoRIAL at OvGU Magdeburg, Germany, sup-
ported by ESF (project no. ZS/2016/08/80646). The work
of S. Aslan within the context of Ca’ Foscari University of
Venice is supported under TUBITAK BIDEB-2219 grant no. 1059B191701102.
References
Abraham, N., Khan, N.M., 2019. A novel focal tversky loss function with im-
proved attention u-net for lesion segmentation, in: 2019 IEEE 16th Interna-
tional Symposium on Biomedical Imaging (ISBI 2019), IEEE. pp. 683–687.
Ayache, N., Duncan, J., 2016. 20th anniversary of the medical image analysis
journal (MedIA). Medical Image Analysis 33, 1–3. URL: https://hal.
inria.fr/hal-01353697, doi:10.1016/j.media.2016.07.004.
Belagiannis, V., Farshad, A., Galasso, F., 2019. Adversarial network compres-
sion, in: Leal-Taixé, L., Roth, S. (Eds.), Computer Vision – ECCV 2018
Workshops, Springer International Publishing, Cham. pp. 431–449.
Bilic, P., Christ, P.F., Vorontsov, E., Chlebus, G., Chen, H., Dou, Q., Fu, C.W.,
Han, X., Heng, P.A., Hesser, J., et al., 2019. The liver tumor segmentation
benchmark (lits). arXiv preprint arXiv:1901.04056 .
Carreira-Perpiñán, M.A., Idelbayev, Y., 2018. “learning-compression” algo-
rithms for neural net pruning, in: The IEEE Conference on Computer Vision
and Pattern Recognition (CVPR).
Cerrolaza, J.J., Picazo, M.L., Humbert, L., Sato, Y., Rueckert, D., Ángel González Ballester, M., Linguraru, M.G., 2019. Computational anatomy for
multi-organ analysis in medical imaging: A review. Medical Image Analysis
56, 44 – 67. doi:https://doi.org/10.1016/j.media.2019.04.002.
Conze, P.H., Brochard, S., Burdin, V., Sheehan, F.T., Pons, C., 2020a. Healthy
versus pathological learning transferability in shoulder muscle mri segmen-
tation using deep convolutional encoder-decoders. Computerized Medical
Imaging and Graphics (CMIG) .
Conze, P.H., Kavur, A.E., Gall, E.C.L., Gezer, N.S., Meur, Y.L., Selver, M.A.,
Rousseau, F., 2020b. Abdominal multi-organ segmentation with cascaded
convolutional and adversarial deep networks. arXiv:2001.09521.
Deng, X., Du, G., 2008. 3d segmentation in the clinic: a grand challenge ii-liver
tumor segmentation, in: MICCAI workshop.
Diciotti, S., Ciulli, S., Mascalchi, M., Giannelli, M., Toschi, N., 2013. The
“peeking” effect in supervised feature selection on diffusion tensor imaging
data. American Journal of Neuroradiology URL: http://www.ajnr.org/
content/early/2013/07/18/ajnr.A3685.
Dolz, J., Desrosiers, C., Ayed, I.B., 2018. Ivd-net: Intervertebral disc local-
ization and segmentation in mri with a multi-modal unet, in: International
Workshop and Challenge on Computational Methods and Clinical Applica-
tions for Spine Imaging, Springer. pp. 130–143.
Fischer, F., Alper Selver, M., Hillen, W., Guzelis, C., 2010. Integrating seg-
mentation methods from different tools into a visualization program using
an object-based plug-in interface. IEEE Transactions on Information Tech-
nology in Biomedicine 14, 923–934. doi:10.1109/TITB.2010.2044243.
Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., Lu, H., 2019. Dual atten-
tion network for scene segmentation, in: 2019 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 3141–3149.
Gao, Y., Huang, R., Yang, Y., Zhang, J., Shao, K., Tao, C., Chen, Y., Metaxas,
D.N., Li, H., Chen, M., 2020. Focusnetv2: Imbalanced large and small or-
gan segmentation with adversarial shape constraint for head and neck ct im-
ages. Medical Image Analysis, 101831. doi:10.1016/j.media.2020.101831.
Gholami, A., Subramanian, S., Shenoy, V., Himthani, N., Yue, X., Zhao, S., Jin,
P., Biros, G., Keutzer, K., 2019. A novel domain adaptation framework for
medical image segmentation, in: Crimi, A., Bakas, S., Kuijf, H., Keyvan,
F., Reyes, M., van Walsum, T. (Eds.), Brainlesion: Glioma, Multiple Sclero-
sis, Stroke and Traumatic Brain Injuries, Springer International Publishing,
Cham. pp. 289–298.
Gibson, E., Li, W., Sudre, C., Fidon, L., Shakir, D.I., Wang, G., Eaton-Rosen,
Z., Gray, R., Doel, T., Hu, Y., Whyntie, T., Nachev, P., Modat, M., Barratt,
D.C., Ourselin, S., Cardoso, M.J., Vercauteren, T., 2018. Niftynet: a deep-
learning platform for medical imaging. Computer Methods and Programs in Biomedicine 158, 113–122. doi:10.1016/j.cmpb.2018.01.025.
van Ginneken, B., Kerkstra, S., 2015. Grand challenges in biomedical image
analysis. URL: http://grand-challenge.org/. accessed: 2019-07-07.
Guinney, J., Wang, T., Laajala, T.D., Winner, K.K., Bare, J.C., Neto, E.C.,
Khan, S.A., Peddinti, G., Airola, A., Pahikkala, T., et al., 2017. Prediction
of overall survival for patients with metastatic castration-resistant prostate
cancer: development of a prognostic model through a crowdsourced chal-
lenge with open clinical trial data. The Lancet Oncology 18, 132–142.
Han, S., He, Y., Carass, A., Ying, S.H., Prince, J.L., 2019. Cerebellum parcella-
tion with convolutional neural networks, in: Medical Imaging 2019: Image
Processing, International Society for Optics and Photonics. p. 109490K.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image
recognition, in: Proceedings of the IEEE conference on computer vision
and pattern recognition, pp. 770–778.
Heimann, T., Van Ginneken, B., Styner, M.A., Arzhaeva, Y., Aurich, V., Bauer,
C., Beck, A., Becker, C., Beichel, R., Bekes, G., et al., 2009. Comparison
and evaluation of methods for liver segmentation from ct datasets. IEEE
transactions on medical imaging 28, 1251–1265.
Hirokawa, Y., Isoda, H., Maetani, Y.S., Arizono, S., Shimada, K., Togashi,
K., 2008. Mri artifact reduction and quality improvement in the upper ab-
domen with propeller and prospective acquisition correction (pace) tech-
nique. American Journal of Roentgenology 191, 1154–1158.
Iglovikov, V., Shvets, A., 2018. Ternausnet: U-net with vgg11 en-
coder pre-trained on imagenet for image segmentation. arXiv preprint
arXiv:1801.05746 .
Ioffe, S., Szegedy, C., 2015. Batch normalization: accelerating deep network
training by reducing internal covariate shift, in: Proceedings of the 32nd
International Conference on International Conference on Machine Learning-
Volume 37, JMLR. org. pp. 448–456.
Isensee, F., Petersen, J., Klein, A., Zimmerer, D., Jaeger, P.F., Kohl,
S., Wasserthal, J., Koehler, G., Norajitra, T., Wirkert, S., Maier-Hein,
K.H., 2019. nnU-Net: Self-adapting Framework for U-Net-Based
Medical Image Segmentation. Springer Vieweg, Wiesbaden. p. 22.
URL: http://link.springer.com/10.1007/978-3-658-25326-4_7, doi:10.1007/978-3-658-25326-4_7.
Joiner, B.J., Simpson, A.L., Leal, J.N., D’Angelica, M.I., Do, R.K.G.,
2015. Assessing splenic enlargement on ct by unidimensional measurement
changes in patients with colorectal liver metastases. Abdominal imaging 40,
2338–2344. doi:10.1007/s00261-015-0451-7.
Kamnitsas, K., Bai, W., Ferrante, E., McDonagh, S., Sinclair, M., Pawlowski,
N., Rajchl, M., Lee, M., Kainz, B., Rueckert, D., Glocker, B., 2018. En-
sembles of multiple models and architectures for robust brain tumour seg-
mentation, in: Crimi, A., Bakas, S., Kuijf, H., Menze, B., Reyes, M. (Eds.),
Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain In-
juries, Springer International Publishing, Cham. pp. 450–462.
Kamnitsas, K., Ledig, C., Newcombe, V.F., Simpson, J.P., Kane, A.D., Menon,
D.K., Rueckert, D., Glocker, B., 2017. Efficient multi-scale 3d cnn with
fully connected crf for accurate brain lesion segmentation. Medical Image
Analysis 36, 61–78. doi:10.1016/j.media.2016.10.004.
Kavur, A.E., Gezer, N.S., Barış, M., Şahin, Y., Özkan, S., Baydar, B., Yüksel, U., Kılıkçıer, C., Olut, S., Bozdağı Akar, G., Ünal, G., Dicle, O., Selver, M.A., 2020. Comparison of semi-automatic and deep learning based automatic methods for liver segmentation in living liver transplant donors. Diagnostic and Interventional Radiology 26, 11–21. doi:10.5152/dir.2019.19025.
Kavur, A.E., Selver, M.A., Dicle, O., Barıs¸, M., Gezer, N.S., 2019. CHAOS
- Combined (CT-MR) Healthy Abdominal Organ Segmentation Challenge
Data. URL: http://doi.org/10.5281/zenodo.3362844, doi:10.
5281/zenodo.3362844. accessed: 2019-04-11.
Kikinis, R., Pieper, S.D., Vosburgh, K.G., 2014. 3D Slicer: A Platform
for Subject-Specific Image Analysis, Visualization, and Clinical Support.
Springer New York, New York, NY. pp. 277–289. doi:10.1007/978-1-4614-7657-3_19.
Kim, H., Jung, J., Kim, J., Cho, B., Kwak, J., Jang, J.Y., Lee, S.w., Lee,
J.G., Yoon, S.M., 2020. Abdominal multi-organ auto-segmentation using
3d-patch-based deep convolutional neural network. Scientific Reports 10,
6204. doi:10.1038/s41598-020-63285-0.
King, B.F., Reed, J.E., Bergstralh, E.J., Sheedy, P.F., Torres, V.E., 2000. Quan-
tification and longitudinal trends of kidney, renal cyst, and renal parenchyma
volumes in autosomal dominant polycystic kidney disease. Journal of the
American Society of Nephrology 11, 1505–1511. URL: https://jasn.
asnjournals.org/content/11/8/1505.
Kistler, M., Bonaretti, S., Pfahrer, M., Niklaus, R., Büchler, P., 2013. The
virtual skeleton database: an open access repository for biomedical research
and collaboration. Journal of medical Internet research 15, e245.
Kozubek, M., 2016. Challenges and benchmarks in bioimage analysis, in: Fo-
cus on Bio-Image Informatics. Springer, pp. 231–262.
Kuncheva, L.I., 2014. Combining Pattern Classifiers: Methods and Algo-
rithms: Second Edition. Wiley-Interscience. volume 9781118315. pp. 1–
357. doi:10.1002/9781118914564.
Lachinov, D., 2019. Segmentation of thoracic organs using pixel shuffle, in:
Proceedings of the 2019 Challenge on Segmentation of THoracic Organs at
Risk in CT Images, SegTHOR@ISBI 2019, April 8, 2019. URL: http://ceur-ws.org/Vol-2349/SegTHOR2019_paper_10.pdf.
Lamb, P.M., Lund, A., Kanagasabay, R.R., Martin, A., Webb, J.A.W., Reznek,
R.H., 2002. Spleen size: how well do linear ultrasound measurements cor-
relate with three-dimensional ct volume assessments? The British Journal
of Radiology 75, 573–577. doi:10.1259/bjr.75.895.750573.
Landman, B., Xu, Z., Igelsias, J.E., Styner, M., Langerak, T.R., Klein, A.,
2015. MICCAI multi-atlas labeling beyond the cranial vault – workshop
and challenge. doi:10.7303/syn3193805.
Langville, A.N., Meyer, C.D.C.D., 2013. Who’s #1? : the science of rating and
ranking. Princeton University Press. p. 247.
Li, F., Neverova, N., Wolf, C., Taylor, G., 2016. Modout: Learning to fuse
modalities via stochastic regularization. Journal of Computational Vision
and Imaging Systems 2.
Li, H., Xiong, P., An, J., Wang, L., 2018. Pyramid attention network for seman-
tic segmentation, in: The British Machine Vision Conference (BMVC).
Linguraru, M.G., Sandberg, J.K., Jones, E.C., Summers, R.M., 2013. Assess-
ing splenomegaly: Automated volumetric analysis of the spleen. Academic
Radiology 20, 675–684. doi:10.1016/j.acra.2013.01.011.
Low, G., Wiebe, E., Walji, A.H., Bigam, D.L., 2008. Imaging evaluation of po-
tential donors in living-donor liver transplantation. Clinical Radiology 63,
136–145. URL: https://doi.org/10.1016/j.crad.2007.08.008,
doi:10.1016/j.crad.2007.08.008.
Maier-Hein, L., Eisenmann, M., Reinke, A., Onogur, S., Stankovic, M., Scholz,
P., Arbel, T., Bogunovic, H., Bradley, A.P., Carass, A., Feldmann, C., Frangi,
A.F., Full, P.M., van Ginneken, B., Hanbury, A., Honauer, K., Kozubek,
M., Landman, B.A., März, K., Maier, O., Maier-Hein, K., Menze, B.H., Müller, H., Neher, P.F., Niessen, W., Rajpoot, N., Sharp, G.C., Sirinukunwattana, K., Speidel, S., Stock, C., Stoyanov, D., Taha, A.A., van der Sommen, F., Wang, C.W., Weber, M.A., Zheng, G., Jannin, P., Kopp-Schneider, A., 2018. Why rankings of biomedical image analysis competitions should be interpreted with care. Nature Communications 9, 5217. doi:10.1038/s41467-018-07619-7.
Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J.,
Burren, Y., Porz, N., Slotboom, J., Wiest, R., et al., 2014. The multimodal
brain tumor image segmentation benchmark (brats). IEEE transactions on
medical imaging 34, 1993–2024.
Nandamuri, S., China, D., Mitra, P., Sheet, D., 2019. Sumnet: Fully convo-
lutional model for fast segmentation of anatomical structures in ultrasound
volumes. arXiv preprint arXiv:1901.06920 .
Nikolov, S., Blackwell, S., Mendes, R., Fauw, J.D., Meyer, C., Hughes, C.,
Askham, H., Romera-Paredes, B., Karthikesalingam, A., Chu, C., Car-
nell, D., Boon, C., D’Souza, D., Moinuddin, S.A., Sullivan, K., Consor-
tium, D.R., Montgomery, H., Rees, G., Sharma, R., Suleyman, M., Back,
T., Ledsam, J.R., Ronneberger, O., 2018. Deep learning to achieve clini-
cally applicable segmentation of head and neck anatomy for radiotherapy.
arXiv:1809.04430.
Prevedello, L.M., Halabi, S.S., Shih, G., Wu, C.C., Kohli, M.D., Chokshi, F.H.,
Erickson, B.J., Kalpathy-Cramer, J., Andriole, K.P., Flanders, A.E., 2019.
Challenges related to artificial intelligence research in medical imaging and
the importance of image analysis competitions. Radiology: Artificial Intel-
ligence 1, e180031.
Reinke, A., Eisenmann, M., Onogur, S., Stankovic, M., Scholz, P., Full, P.M.,
Bogunovic, H., Landman, B.A., Maier, O., Menze, B., et al., 2018a. How
to exploit weaknesses in biomedical challenge design and organization,
in: International Conference on Medical Image Computing and Computer-
Assisted Intervention, Springer. pp. 388–395.
Reinke, A., Onogur, S., Stankovic, M., Scholz, P., Arbel, T., Bogunovic, H.,
Bradley, A.P., Carass, A., Feldmann, C., Frangi, A.F., et al., 2018b. Is the
winner really the best? a critical analysis of common research practice in
biomedical image analysis competitions. arXiv preprint arXiv:1806.02051 .
Reunanen, J., 2003. Overfitting in making comparisons between variable selec-
tion methods. Journal of Machine Learning Research 3, 1371–1382.
Robertson, F., Leander, P., Ekberg, O., 2001. Radiology of the spleen. European
Radiology 11, 80–95. doi:10.1007/s003300000528.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks
for biomedical image segmentation, in: International Conference on Medi-
cal image computing and computer-assisted intervention, Springer. pp. 234–
241.
Schlemper, J., Oktay, O., Schaap, M., Heinrich, M., Kainz, B., Glocker, B.,
Rueckert, D., 2019. Attention gated networks: Learning to leverage salient
regions in medical images. Medical Image Analysis 53, 197–207. URL:
http://dx.doi.org/10.1016/j.media.2019.01.012, doi:10.1016/
j.media.2019.01.012.
Schoenauer-Sebag, A., Heinrich, L., Schoenauer, M., Sebag, M., Wu,
L.F., Altschuler, S.J., 2019. Multi-domain adversarial learning.
arXiv:1903.09239.
Shvets, A.A., Rakhlin, A., Kalinin, A.A., Iglovikov, V.I., 2018. Automatic
instrument segmentation in robot-assisted surgery using deep learning, in:
2018 17th IEEE International Conference on Machine Learning and Appli-
cations (ICMLA), IEEE. pp. 624–628.
Simpson, A.L., Antonelli, M., Bakas, S., Bilello, M., Farahani, K., van
Ginneken, B., Kopp-Schneider, A., Landman, B.A., Litjens, G., Menze,
B., et al., 2019. A large annotated medical image dataset for the de-
velopment and evaluation of segmentation algorithms. arXiv preprint
arXiv:1902.09063 .
Sinha, A., Dolz, J., 2020. Multi-scale self-guided attention for medical image
segmentation. IEEE Journal of Biomedical and Health Informatics Early
Access, 1–1. doi:10.1109/JBHI.2020.2986926.
Smialowski, P., Frishman, D., Kramer, S., 2010. Pitfalls of supervised feature
selection. Bioinformatics 26, 440–443.
Staal, J., Abràmoff, M.D., Niemeijer, M., Viergever, M.A., Van Ginneken, B.,
2004. Ridge-based vessel segmentation in color images of the retina. IEEE
transactions on medical imaging 23, 501–509.
Jimenez-del Toro, O., Müller, H., Krenn, M., Gruenberg, K., Taha, A.A., Winterstein, M., Eggel, I., Foncubierta-Rodríguez, A., Goksel, O., Jakab, A.,
et al., 2016. Cloud-based evaluation of anatomical structure segmentation
and landmark detection algorithms: Visceral anatomy benchmarks. IEEE
transactions on medical imaging 35, 2459–2475.
Ulyanov, D., Vedaldi, A., Lempitsky, V., 2017. Improved texture networks:
Maximizing quality and diversity in feed-forward stylization and texture
synthesis, in: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 6924–6932.
Valindria, V.V., Pawlowski, N., Rajchl, M., Lavdas, I., Aboagye, E.O., Rockall,
A.G., Rueckert, D., Glocker, B., 2018. Multi-modal learning from unpaired
images: Application to multi-organ segmentation in ct and mri, in: 2018
IEEE Winter Conference on Applications of Computer Vision (WACV), pp.
547–556. doi:10.1109/WACV.2018.00066.
Van Ginneken, B., Heimann, T., Styner, M., 2007. 3D segmentation in the clinic: a grand challenge, 7–15.
Wang, Y., Deng, Z., Hu, X., Zhu, L., Yang, X., Xu, X., Heng, P.A., Ni, D.,
2018. Deep attentional features for prostate segmentation in ultrasound, in:
MICCAI.
Weight, C., Papanikolopoulos, N., Kalapara, A., Heller, N., 2019. URL:
https://kits19.grand-challenge.org/. accessed: 2019-07-08.
Wu, Y., He, K., 2018. Group normalization, in: Proceedings of the European
Conference on Computer Vision (ECCV), pp. 3–19.
Yan, Y., Conze, P.H., Decencière, E., Lamard, M., Quellec, G., Cochener, B.,
Coatrieux, G., 2019. Cascaded multi-scale convolutional encoder-decoders
for breast mass segmentation in high-resolution mammograms, in: Annual
International Conference of the IEEE Engineering in Medicine and Biol-
ogy Society, Berlin, Germany. pp. 6738–6741. doi:10.1109/EMBC.2019.
8857167.
Yang, J., Dvornek, N.C., Zhang, F., Chapiro, J., Lin, M., Duncan, J.S., 2019.
Unsupervised domain adaptation via disentangled representations: Applica-
tion to cross-modality liver segmentation, in: Shen, D., Liu, T., Peters, T.M.,
Staib, L.H., Essert, C., Zhou, S., Yap, P.T., Khan, A. (Eds.), Medical Image
Computing and Computer Assisted Intervention – MICCAI 2019, Springer
International Publishing, Cham. pp. 255–263.
Yeghiazaryan, V., Voiculescu, I., 2015. An Overview of Current Evaluation
Methods Used in Medical Image Segmentation. Technical Report RR-15-
08. Department of Computer Science. Oxford, UK.