Progressively Growing Convolutional Networks for
End-to-End Deformable Image Registration
Koen A.J. Eppenhof, Maxime W. Lafarge, and Josien P.W. Pluim
Medical Image Analysis, Eindhoven University of Technology, The Netherlands
ABSTRACT
Deformable image registration is often a slow process when using conventional methods. To speed up deformable
registration, there is growing interest in using convolutional neural networks. They are comparatively fast and can
be trained to estimate full-resolution deformation fields directly from pairs of images. Because deep learning-
based registration methods often require rigid or affine pre-registration of the images, they do not perform
true end-to-end image registration. To address this, we propose a progressive training method for end-to-end
image registration with convolutional networks. The network is first trained to find large deformations at a low
resolution using a smaller part of the full architecture. The network is then gradually expanded during training
by adding higher resolution layers that allow the network to learn more fine-grained deformations from higher
resolution data. By starting at a lower resolution, the network is able to learn larger deformations more quickly
at the start of training, making pre-registration redundant. We apply this method to pulmonary CT data, and
use it to register inhalation to exhalation images. We train the network using the CREATIS pulmonary CT data
set, and apply the trained network to register the DIRLAB pulmonary CT data set. By computing the target
registration error at corresponding landmarks we show that the error for end-to-end registration is significantly
reduced by using progressive training, while retaining sub-second registration times.
Keywords: Deformable image registration, multi-resolution methods, convolutional neural networks, deep learning, fast image registration
1. INTRODUCTION
In previous work we have shown that it is possible to train a neural network to estimate non-linear transformations on a thin-plate spline grid.[1] The network was trained on pairs of synthetically deformed pulmonary CT images, but was shown to generalize to inhale-exhale lung image registration. This work has since been extended to a network that can learn full-resolution deformation fields directly from two input images.[2] For this, we used the U-net architecture,[3] which allows estimation of full-resolution deformation fields at high speed because the registration consists of only one forward pass through the network. The accuracy of this method can compete with existing methods for pulmonary CT registration. One limitation of this previous work is that it requires affine pre-registration of the images: the network can only estimate the smaller local deformations between affinely registered images and fails to accurately estimate the deformation field without pre-registration. This limitation has been reported for other registration methods based on convolutional neural networks as well. Examples include deep learning-based pulmonary CT registration,[4] and brain MR registration, where the images are pre-registered to a brain atlas[5] or to each other.[6]
1.1 Aim
Given the need for pre-registration in deep learning-based registration methods, we propose an extension to
our previous method with the purpose of performing true end-to-end image registration, i.e. without requiring
pre-registration, while still retaining the high registration speed. Instead of training the network to find the
full resolution deformation field from the start, we reason that it is easier to learn large displacements at lower
resolutions first, and then expand to higher resolutions. Similar strategies are used in image registration methods
that iteratively optimize similarity metrics, by employing a multi-resolution image pyramid. A transformation
model found for a lower resolution version of the images is then used as the initialization for the higher resolution.
In optimization-based methods, this leads to more robust optimization that is less prone to local minima. In this
paper, we adapt these ideas to training a multi-resolution neural network for deformable registration. Instead
of training the network directly to learn a full-resolution deformation field between two input images, we first
let smaller versions of the network learn lower resolution versions of the deformation field. This way, we slowly
build up the network during training, until the network can construct full resolution deformation fields. We
implement this by parameterizing the network architecture such that it can be trained for a specific resolution
of inputs and outputs. This parameterization also allows smooth transitions between the resolutions.
2. METHODS
2.1 Architecture
Growing neural networks have been previously applied in the context of generative adversarial networks (GANs)
by Karras et al.[7] Here, we apply a similar method to the U-net architecture. The U-net architecture is composed of several 'resolution levels' that consist of convolutional layers operating on a specific resolution (image dimension), with low-resolution layers at the bottom and higher resolution layers at the top of the 'U'-shaped architecture. Each resolution level also has a distinct number of learned feature maps. This number doubles for every resolution level, e.g. if the convolutional layers in the top level learn N feature maps, then the convolutional layers in the level below it learn 2N feature maps. When we apply this architecture to registration, we use two images as inputs: the fixed and the moving image. The output of the U-net consists of three maps corresponding to the deformation field's three components, i.e. the displacements in the x-, y-, and z-directions.
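To make this concrete, the sketch below shows a minimal two-level 3D U-net for registration in PyTorch. It is an illustrative sketch, not the authors' implementation: the framework, block design, and feature width N = 16 are assumptions; only the two-channel input (fixed and moving image) and the three-channel displacement output follow the description above.

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    # Two 3x3x3 convolutions with batch normalization (an assumed block design).
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.layers(x)

class RegistrationUNet(nn.Module):
    # The fixed and moving images enter as two input channels; the output has
    # three channels: the x-, y-, and z-components of the deformation field.
    def __init__(self, n=16):
        super().__init__()
        self.enc_top = ConvBlock(2, n)        # top level learns N feature maps
        self.enc_low = ConvBlock(n, 2 * n)    # level below learns 2N feature maps
        self.pool = nn.AvgPool3d(2)
        self.up = nn.Upsample(scale_factor=2, mode='trilinear', align_corners=False)
        self.dec_top = ConvBlock(3 * n, n)    # skip connection is concatenated
        self.head = nn.Conv3d(n, 3, kernel_size=1)

    def forward(self, fixed, moving):
        top = self.enc_top(torch.cat([fixed, moving], dim=1))
        low = self.enc_low(self.pool(top))
        out = self.dec_top(torch.cat([self.up(low), top], dim=1))
        return self.head(out)                 # dense displacement field

For 1 x 1 x D x H x W fixed and moving tensors (with even D, H, W), the forward pass returns a 1 x 3 x D x H x W displacement field.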
2.2 Modifications to the architecture
The architecture remains similar to common implementations of a 3D U-net, except that there are inputs and outputs at every resolution level, instead of only at the top level (Figure 1). These inputs and outputs have resolutions that match the resolution of their level. On the left side of the network, the input images are downsampled to the resolution of the level at which they enter, using average pooling. The downsampled input passes through a single convolutional layer to match the number of feature maps N for its resolution level i. The result and the output of the pooling layer of the resolution level above it are summed, weighted by β_i and 1 − β_i, respectively. On the right side of each resolution level, an extra convolutional layer learns the three maps for the components of the vector field at the level's resolution. The vector fields are upsampled to the original resolution, after which they are weighted with parameters α_i and summed to produce one final deformation field at the original resolution of the input images.
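The following PyTorch-style sketch illustrates these two modifications. It is a simplified reading of the text, not the reference implementation: the function names and the use of adaptive average pooling are our choices, and the assignment of β_i to the injected images and 1 − β_i to the pooled features follows the sentence order above.

import torch.nn.functional as F

def inject_input(pooled, images, conv_in, beta_i):
    # Left side of resolution level i: average-pool the input image pair down
    # to this level's resolution, map it to the level's feature width with a
    # single convolution (conv_in), and sum it with the features pooled from
    # the level above, weighted by beta_i and 1 - beta_i.
    downsampled = F.adaptive_avg_pool3d(images, pooled.shape[2:])
    return beta_i * conv_in(downsampled) + (1.0 - beta_i) * pooled

def fuse_outputs(level_fields, alphas, full_size):
    # Right side: upsample each level's three-component vector field to the
    # original input resolution and form the alpha-weighted sum.
    fused = 0.0
    for field, alpha_i in zip(level_fields, alphas):
        fused = fused + alpha_i * F.interpolate(
            field, size=full_size, mode='trilinear', align_corners=False)
    return fused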
2.3 Progressive learning
The modifications to the architecture introduce a set of parameters α_i(t) that determine the influence of each resolution level on the output of the network. The β_i(t) parameters determine the weighting between the output of the pooling layers and the incoming inputs, and are defined as

    β_i(t) = Σ_{j=0}^{i} α_j(t).    (1)

The α_i(t) parameters are updated during training, and therefore depend on the training iteration t. At every point in training, the sum of these parameters is 1, i.e.

    Σ_i α_i(t) = 1 for all t.    (2)

The individual parameters are defined as

    α_i(t) = 0                 for t < τ_i or t > τ_i + Δ
             (t − τ_i)/δ       for τ_i ≤ t ≤ τ_i + δ
             1                 for τ_i + δ ≤ t ≤ τ_i + Δ − δ
             (τ_i + Δ − t)/δ   for τ_i + Δ − δ ≤ t ≤ τ_i + Δ.    (3)
Figure 1: The network architecture resembles the original U-net architecture in 3D, but with inputs and outputs at every resolution level. The parameters α_i determine which parts of the architecture are used.
Figure 2: Progressive growing scheme showing the parameters α_i(t) and β_i(t) changing over time during training. The colors correspond to the colors of the levels in Figure 1.
The schedules for the five resolution levels in the U-net use δ = 200, Δ = 600, and τ_i = −300 + 400i, and are shown in Figure 2. Whenever one of the weights α_i equals 1, we are essentially training a U-net with i + 1 resolution levels. Hence, once we reach α_0 = α_1 = α_2 = α_3 = 0 and α_4 = 1, we have arrived at a normal five-level U-net architecture. An example of the network architecture at three points in training is given in Figure 3, which shows the network at the start of training, at the transition between the first and second resolution levels, and after that transition is complete and two resolution levels are being trained.
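With these constants, the schedule of Eqs. (1)-(3) is straightforward to implement, as the Python sketch below shows. One detail is our inference rather than the equations': Figure 2 implies that the weight of the final level stays at 1 after its ramp-up, so the sketch clamps it there.

def alpha(i, t, n_levels=5, delta=200, Delta=600):
    # Eq. (3): ramp up over delta iterations, hold at 1, ramp down over delta.
    tau = -300 + 400 * i                  # tau_i staggers the levels in time
    if i == n_levels - 1 and t >= tau + delta:
        return 1.0                        # final level stays fully active (our inference)
    if t < tau or t > tau + Delta:
        return 0.0
    if t <= tau + delta:
        return (t - tau) / delta
    if t <= tau + Delta - delta:
        return 1.0
    return (tau + Delta - t) / delta

def beta(i, t):
    # Eq. (1): cumulative sum of the alphas up to and including level i.
    return sum(alpha(j, t) for j in range(i + 1))

# Sanity check of Eq. (2): the alphas sum to 1 at every iteration; e.g. at
# t = 150 they are [0.75, 0.25, 0, 0, 0], the state shown in Figure 3.
assert sum(alpha(i, 150) for i in range(5)) == 1.0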
2.4 Data
We train on and apply the network to pulmonary CT images. We train the network on the public CREATIS and POPI data sets,[8] and apply the trained network to the DIRLAB data set[9] to show generalization to a different data set of pulmonary CT images. We crop the images such that the lungs are still completely visible, and resize the resulting image to 128 × 128 × 128 voxels. We create lung masks by segmenting all voxels with Hounsfield units lower than −250, which corresponds to low-density tissue inside the lungs. These masks are used to focus the training on the voxels inside the lungs (see Section 2.6).
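A minimal sketch of this preprocessing, assuming NumPy/SciPy, a volume already cropped around the lungs, and linear interpolation for the resampling (the exact resampling routine is not specified in the text):

import numpy as np
from scipy.ndimage import zoom

def preprocess(ct, out_shape=(128, 128, 128)):
    # Resize the cropped CT volume (in Hounsfield units) to 128x128x128 voxels
    # and derive a binary lung mask by thresholding at -250 HU, which keeps
    # the low-density tissue inside the lungs.
    resized = zoom(ct, [o / s for o, s in zip(out_shape, ct.shape)], order=1)
    lung_mask = (resized < -250).astype(np.float32)   # M(x) in Eq. (4)
    return resized, lung_mask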
2.5 Training set construction
During training, random synthetic transformations are applied to the fourteen (seven pairs of) CREATIS and POPI images. Every iteration of training, one of the images in the training set is selected. Two random transformations are applied to this image: one to augment the data set (T_augm), and one that is actually learned by the network
Figure 3: Illustration of the transition between a one-level and a two-level network. The remainder of the network (levels 3, 4, and 5) is not shown for clarity. The values of the parameters α and β are shown in red for the state of the network at iterations 0, 150, and 300, respectively (e.g. α = [0.75, 0.25, 0, 0, 0] at iteration 150).
Figure 4: At training time, an image from the training set is transformed twice. The deformation field of the
net transformation between the images is learned by the network. At test time, the two images are replaced by
the moving and fixed image.
(T_learned). The latter is applied together with the augmentation transformation, such that we obtain a pair of images I[T_augm(x)] and I[(T_augm ∘ T_learned)(x)]. From these two images, the network learns the underlying deformation vector field, i.e. that of the transformation T_learned (Figure 4).
The augmentation transformation T_augm is created by sampling displacements on a 2 × 2 × 2 grid from a uniform distribution in the [−12.8, 12.8] voxel range. The learned transformation T_learned consists of a sequence of two transformations: one coarse B-spline transformation using a 4 × 4 × 4 B-spline grid, and one fine B-spline transformation using an 8 × 8 × 8 grid. The displacements on those grids are sampled from uniform distributions in the ranges [−25.6, 25.6] voxels and [−6.4, 6.4] voxels, respectively.
At test time, a pair of moving and fixed images replaces this pair of images: I[T_augm(x)] is replaced by the moving image, and I[(T_augm ∘ T_learned)(x)] is replaced by the fixed image.
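The sketch below illustrates this sampling scheme under two stated simplifications: B-spline interpolation of the control grids is approximated by cubic upsampling, and the composition T_augm ∘ T_learned is approximated by summing the dense displacement fields.

import numpy as np
from scipy.ndimage import zoom, map_coordinates

def random_field(grid, max_disp, shape=(128, 128, 128)):
    # Sample uniform displacements on a coarse control grid and upsample them
    # smoothly to a dense per-voxel field (cubic stand-in for B-splines).
    coarse = np.random.uniform(-max_disp, max_disp, size=(3,) + grid)
    return np.stack([zoom(c, [s / g for s, g in zip(shape, grid)], order=3)
                     for c in coarse])

def warp(image, field):
    # Resample the image at x + u(x) for every voxel x.
    grid = np.stack(np.meshgrid(*map(np.arange, image.shape), indexing='ij'))
    return map_coordinates(image, grid + field, order=1)

# Ranges from the text: augmentation on a 2x2x2 grid in [-12.8, 12.8] voxels;
# the learned transformation combines a 4x4x4 grid in [-25.6, 25.6] voxels
# with an 8x8x8 grid in [-6.4, 6.4] voxels.
u_augm = random_field((2, 2, 2), 12.8)
u_learned = random_field((4, 4, 4), 25.6) + random_field((8, 8, 8), 6.4)
# moving = warp(image, u_augm); fixed = warp(image, u_augm + u_learned)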
2.6 Training the network
The network is trained by minimizing the squared error between the true deformation field u_learned(x) = T_learned(x) − x and the network's estimate û_learned(x), within the lung mask M(x) ∈ [0, 1]. The loss function is defined as

    L = [ Σ_{x∈F} M(x) ||u_learned(x) − û_learned(x)||_2^2 ] / [ Σ_{x∈F} M(x) ].    (4)

This loss function is minimized using stochastic gradient descent with a momentum of 0.5 and an annealed learning rate

    η(t) = η_0 / (1 + λt)    (5)

with η_0 = 0.1 and λ = 10^−4. We use a batch size of one, and use batch normalization for all convolutional layers using moving averages of the means and variances of the parameters over all iterations, as proposed by Ioffe et al.[10] To show the effect of progressive training, we train the architecture once using the schedule in Figure 2, and once using the setting α_0 = α_1 = α_2 = α_3 = 0 and α_4 = 1 throughout training, i.e. the 'conventional' U-net. Because the architecture is the same in both cases, we use the exact same weight initialization, training data, and learning rate. At test time we use this final setting of α for both versions of the network.
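Eq. (4) and Eq. (5) translate directly into a few lines of PyTorch; in the sketch below, the placeholder network and the per-iteration scheduler stepping are our assumptions:

import torch

def masked_loss(u_true, u_pred, mask):
    # Eq. (4): squared L2 error of the displacement vectors, averaged over the
    # voxels inside the lung mask only. Shapes: (B, 3, D, H, W) fields and a
    # (B, D, H, W) mask.
    sq_err = ((u_true - u_pred) ** 2).sum(dim=1)
    return (mask * sq_err).sum() / mask.sum()

network = torch.nn.Conv3d(2, 3, 3, padding=1)    # placeholder for the U-net
optimizer = torch.optim.SGD(network.parameters(), lr=0.1, momentum=0.5)
scheduler = torch.optim.lr_scheduler.LambdaLR(   # Eq. (5): 0.1 / (1 + 1e-4 t)
    optimizer, lr_lambda=lambda t: 1.0 / (1.0 + 1e-4 * t))
# Training loop (batch size one): per iteration, compute the loss, step the
# optimizer, then step the scheduler so t advances with the iteration count.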
3. RESULTS
We registered the ten pairs of DIRLAB images using both the progressive and the conventional network, and evaluated the resulting deformation fields by computing target registration errors (TRE) on manually annotated corresponding landmarks in the DIRLAB set. The TRE metric captures misalignment by measuring the L2-norm between landmarks in the moving domain and the transformed landmarks from the fixed domain, i.e. TRE = ||x_M − T(x_F)||. Results per image pair are shown in Table 1, together with the boxplot in Figure 6 that shows the distributions of TRE values for both architectures. We also show correlation plots for the estimated displacements of the 3000 landmarks in the DIRLAB set in the x-, y-, and z-directions in Figure 5. Both show that, in general, the progressively trained network performs better.
Figure 5: Correlation plots showing the correlation between the estimated and true displacements of the DIRLAB
landmarks for both network variants, shown for the x-, y-, and z-components of the displacement.
Table 1: Target registration errors (TRE) in millimeters before registration, after registration using the conventionally trained network, and after registration using the progressively trained network.

Pair    Before registration    Conventional training    Progressive training
1       3.89 ± 2.78            2.05 ± 0.97              2.18 ± 1.05
2       4.34 ± 3.90            4.10 ± 2.01              2.06 ± 0.96
3       6.94 ± 4.05            2.96 ± 1.21              2.11 ± 1.04
4       9.83 ± 4.85            3.75 ± 2.28              3.13 ± 1.60
5       7.48 ± 5.50            4.00 ± 2.46              2.92 ± 1.70
6       10.89 ± 6.96           6.38 ± 6.05              4.20 ± 2.00
7       11.03 ± 7.42           5.00 ± 3.65              4.12 ± 2.97
8       14.99 ± 9.00           11.33 ± 5.01             9.43 ± 6.28
9       7.92 ± 3.97            4.50 ± 2.02              3.82 ± 1.69
10      7.30 ± 6.34            4.29 ± 2.61              2.87 ± 1.96
Total   8.46 ± 6.58            4.84 ± 4.03              3.68 ± 3.32

Figure 6: Boxplots of the average TRE values of the two networks.
We used the Wilcoxon signed-rank test to test for pairwise equality between the TRE values for all landmarks in both architectures. This indicated that the TRE values of the progressive architecture (median: 2.82 mm) and of the static architecture (median: 3.72 mm) differed significantly (N = 3000, T = 16.2, r = 0.55, p < 0.001). The median TRE values per image pair for each architecture also differed significantly (3.68 mm versus 4.79 mm, N = 10, T = 0, r = 1.0, p = 0.005), as can also be concluded from Figure 6. On average, the estimation of the deformation vector fields took 0.90 ± 0.05 seconds on an Nvidia Titan Xp GPU.
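For reference, the sketch below shows this evaluation under our assumptions about the data layout: landmarks as (K, 3) arrays of voxel indices, the estimated field as a (3, D, H, W) array of voxel displacements, and the voxel spacing in millimeters; the paired test is SciPy's Wilcoxon routine.

import numpy as np
from scipy.ndimage import map_coordinates
from scipy.stats import wilcoxon

def tre(landmarks_fixed, landmarks_moving, field, spacing_mm):
    # TRE = ||x_M - T(x_F)||: transform each fixed-domain landmark with the
    # estimated field and measure its distance, in millimeters, to the
    # corresponding moving-domain landmark.
    disp = np.stack([map_coordinates(field[c], landmarks_fixed.T, order=1)
                     for c in range(3)], axis=1)      # u(x_F) in voxels
    diff = (landmarks_moving - (landmarks_fixed + disp)) * spacing_mm
    return np.linalg.norm(diff, axis=1)               # per-landmark TRE in mm

# Paired comparison over all landmarks, as in the text:
# stat, p = wilcoxon(tre_progressive, tre_conventional)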
4. DISCUSSION
In this paper we propose a method for training a fully convolutional neural network that lets the network grow
progressively during training. We show on publicly available pulmonary CT data sets that this approach results in
significantly lower registration errors, and better correlation with the ground truth compared to the conventional
training approach. We hypothesize that because the network can focus on lower resolution data at the start of
training, it can quickly optimize a limited number of weights to model a large range of displacements on a small
vector field. The smooth transitions between resolution levels then allow the network to estimate the same range
of displacements in higher resolution vector fields. Once the final transition is complete, the architecture is no
different from an ordinary U-net.
The resulting network can perform end-to-end estimation of full-size deformation fields directly from two images. It generalizes from the training set to a separate set of data acquired on different hardware, at a different institute. By making elaborate use of data augmentation, we require only a few images to train on, without the need for manually annotated ground truth. Furthermore, because the method requires only one forward pass through the network, it takes less than one second to estimate a full-resolution deformation field. Although the results are promising, they are not yet up to the standard of the state of the art for lung registration, which achieves average TREs between 1.36 ± 1.01 mm and 2.13 ± 1.82 mm on the same DIRLAB data set.[11,12] To address this, future work will focus on making the training set's deformations more realistic for this particular application.
5. CONCLUSION
In this paper we propose a method for training a convolutional neural network that lets the network grow
progressively during training. We show on publicly available pulmonary CT data sets that this approach results
in lower registration errors, and better correlation with the ground truth compared to conventional training
approaches. The resulting network can perform end-to-end estimation of full-size deformation fields in less than
a second, directly from two images.
REFERENCES
[1] Eppenhof, K. A. J. and Pluim, J. P. W., "Deformable image registration using convolutional neural networks," Proceedings SPIE Medical Imaging, 10574-27 (2018).
[2] Eppenhof, K. A. J. and Pluim, J. P. W., "Pulmonary CT registration through supervised learning with convolutional neural networks," IEEE Transactions on Medical Imaging, in press (2018).
[3] Ronneberger, O., Fischer, P., and Brox, T., "U-net: Convolutional networks for biomedical image segmentation," Proceedings MICCAI, 234–241 (2015).
[4] Sokooti, H., de Vos, B. D., Berendsen, F. F., Lelieveldt, B. P. F., Isgum, I., and Staring, M., "Nonrigid image registration using multi-scale 3D convolutional neural networks," Proceedings MICCAI, 232–239 (2017).
[5] Yang, X., Kwitt, R., Styner, M., and Niethammer, M., "Quicksilver: Fast predictive image registration – a deep learning approach," NeuroImage 158, 378–396 (2017).
[6] Cao, X., Yang, J., Zhang, J., Nie, D., Kim, M., Wang, Q., and Shen, D., "Deformable image registration based on similarity-steered CNN regression," Proceedings MICCAI, 300–308 (2017).
[7] Karras, T., Aila, T., Laine, S., and Lehtinen, J., "Progressive growing of GANs for improved quality, stability, and variation," ICLR (2018).
[8] Vandemeulebroucke, J., Rit, S., Kybic, J., Clarysse, P., and Sarrut, D., "Spatiotemporal motion estimation for respiratory-correlated imaging of the lungs," Medical Physics 38(1), 166–178 (2011).
[9] Castillo, E., Castillo, R., Martinez, J., Shenoy, M., and Guerrero, T., "Four-dimensional deformable image registration using trajectory modeling," Physics in Medicine & Biology 55(1), 305 (2010).
[10] Ioffe, S. and Szegedy, C., "Batch normalization: Accelerating deep network training by reducing internal covariate shift," Proceedings ICML, 448–456 (2015).
[11] Schmidt-Richberg, A., Werner, R., Handels, H., and Ehrhardt, J., "Estimation of slipping organ motion by registration with direction-dependent regularization," Medical Image Analysis 16(1), 150–159 (2012).
[12] Berendsen, F. F., Kotte, A. N. T. J., Viergever, M. A., and Pluim, J. P., "Registration of organs with sliding interfaces and changing topologies," Proceedings SPIE Medical Imaging, 90340E-1 (2014).