Progressively Growing Convolutional Networks for
End-to-End Deformable Image Registration
Koen A.J. Eppenhof, Maxime W. Lafarge, and Josien P.W. Pluim
Medical Image Analysis, Eindhoven University of Technology, The Netherlands
ABSTRACT
Deformable image registration is often a slow process when using conventional methods. To speed up deformable
registration, there is growing interest in using convolutional neural networks. They are comparatively fast and can
be trained to estimate full-resolution deformation fields directly from pairs of images. Because deep learning-
based registration methods often require rigid or affine pre-registration of the images, they do not perform
true end-to-end image registration. To address this, we propose a progressive training method for end-to-end
image registration with convolutional networks. The network is first trained to find large deformations at a low
resolution using a smaller part of the full architecture. The network is then gradually expanded during training
by adding higher resolution layers that allow the network to learn more fine-grained deformations from higher
resolution data. By starting at a lower resolution, the network is able to learn larger deformations more quickly
at the start of training, making pre-registration redundant. We apply this method to pulmonary CT data, and
use it to register inhalation to exhalation images. We train the network using the CREATIS pulmonary CT data
set, and apply the trained network to register the DIRLAB pulmonary CT data set. By computing the target
registration error at corresponding landmarks we show that the error for end-to-end registration is significantly
reduced by using progressive training, while retaining sub-second registration times.
Keywords: Deformable image registration, multi-resolution methods, convolutional neural networks, deep learning, fast image registration
1. INTRODUCTION
In previous work we have shown that it is possible to train a neural network to estimate non-linear transformations
on a thin-plate spline grid.1 The network was trained on pairs of synthetically deformed pulmonary CT images,
but was shown to generalize to inhale-exhale lung image registration. This work has since been extended to
a network that can learn full-resolution deformation fields directly from two input images.2 For this, we used
the U-net architecture,3 which allows estimation of full-resolution deformation fields at high speeds because the
registration consists of only one forward pass through the network. The accuracy of this method can compete with
existing methods for pulmonary CT registration. One limitation of this previous work is that it requires affine
pre-registration of the images, because the network can only estimate the smaller local deformations between
affinely registered images and fails to accurately estimate the deformation field without pre-registration. This
limitation has also been reported for other registration methods based on convolutional neural networks. Examples
include deep learning-based pulmonary CT registration,4 and brain MR registration, where the images are pre-registered to a brain atlas,5 or to each other.6
1.1 Aim
Given the need for pre-registration in deep learning-based registration methods, we propose an extension of
our previous method that performs true end-to-end image registration, i.e. without requiring
pre-registration, while still retaining the high registration speed. Instead of training the network to find the
full resolution deformation field from the start, we reason that it is easier to learn large displacements at lower
resolutions first, and then expand to higher resolutions. Similar strategies are used in image registration methods
that iteratively optimize similarity metrics, which employ a multi-resolution image pyramid: a transformation
model found for a lower resolution version of the images is then used as initialization for the next higher resolution.
In optimization-based methods, this leads to a more robust optimization that is less prone to local minima. In this
paper, we adapt these ideas to training a multi-resolution neural network for deformable registration. Instead
of training the network directly to learn a full-resolution deformation field between two input images, we first
let smaller versions of the network learn lower resolution versions of the deformation field. This way, we slowly
build up the network during training, until the network can construct full resolution deformation fields. We
implement this by parameterizing the network architecture such that it can be trained for a specific resolution
of inputs and outputs. This parameterization also allows smooth transitions between the resolutions.
2. METHODS
2.1 Architecture
Growing neural networks have been previously applied in the context of generative adversarial networks (GANs)
by Karras et al.7 Here, we apply a similar method to the U-net architecture. The U-net architecture is composed
of several ‘resolution levels’ that consist of convolutional layers that operate on a specific resolution (image
dimension), with low-resolution layers at the bottom and higher resolution layers at the top of the ‘U’-shaped
architecture. Each resolution level also has a distinct number of learned feature maps. This number doubles for
every resolution level, e.g. if the convolutional layers in the top level learn N feature maps, then the convolutional
layers in the level below it learn 2N feature maps. When we apply this architecture to registration, we use two
images as inputs: the fixed and the moving image. The output of the U-net consists of three maps corresponding
to the deformation field’s three components, i.e. the displacements in the x, y, and z directions.
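For instance, with five resolution levels the per-level feature counts form a doubling series. A one-line sketch (the base value N is an assumed example; the paper does not state the exact channel counts):

```python
N = 32                                     # assumed top-level feature count (not given in the paper)
channels = [N * 2 ** i for i in range(5)]  # one entry per resolution level, top to bottom
print(channels)                            # [32, 64, 128, 256, 512]
```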
2.2 Modifications to the architecture
The architecture remains similar to common implementations of a 3D U-net, except that there are inputs and
outputs at every resolution level, instead of only the top level (Figure 1). These inputs and outputs have
resolutions that match the resolution of their level. On the left side of the network, the input images are
downsampled with average pooling to the resolution of the level at which they enter. The downsampled input
passes through a single convolutional layer to match the number of feature maps N for its resolution level i. The
result and the output of the pooling layer of the resolution level above it are summed, weighted by weights β_i
and 1 − β_i. On the right side of each resolution level, an extra convolutional layer learns the three maps for the
components of the vector field at the level’s resolution. The vector fields are upsampled to the original resolution,
after which they are weighted with parameters α_i, and summed to produce one final deformation field at the
original resolution of the input images.
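To make the data flow concrete, the sketch below shows how a single resolution level might blend its freshly downsampled input with the pooled features arriving from the level above (weighted β_i and 1 − β_i), and how the per-level vector fields are upsampled and merged with weights α_i. This is a minimal PyTorch sketch under our own naming; the kernel size, module structure, and class names are assumptions, as the paper does not specify them.

```python
import torch.nn as nn
import torch.nn.functional as F

class LevelInput(nn.Module):
    """Input branch of one resolution level: downsample the image pair and
    blend it with the pooled features coming from the level above."""

    def __init__(self, n_maps: int, downsample_steps: int):
        super().__init__()
        self.pool_factor = 2 ** downsample_steps          # 1 = full resolution
        self.conv = nn.Conv3d(2, n_maps, kernel_size=3, padding=1)

    def forward(self, images, pooled, beta):
        # images: (B, 2, D, H, W) fixed and moving image on the channel axis
        if self.pool_factor > 1:
            images = F.avg_pool3d(images, self.pool_factor)  # to level resolution
        fresh = self.conv(images)                            # match N feature maps
        # Weighted sum: the fresh input gets beta_i, the pooled features 1 - beta_i.
        return beta * fresh + (1.0 - beta) * pooled


def merge_fields(fields, alphas, full_shape):
    """Upsample each level's 3-channel vector field to the input resolution
    and sum them, weighted by the alpha_i of Section 2.3."""
    total = 0.0
    for field, a in zip(fields, alphas):
        if a > 0.0:                       # inactive levels contribute nothing
            up = F.interpolate(field, size=full_shape, mode='trilinear',
                               align_corners=False)
            total = total + a * up
    return total
```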
2.3 Progressive learning
The modifications to the architecture introduce a set of parameters α_i(t) that determine the influence of each
resolution level on the output of the network. The β_i(t) parameters determine the weighting between the outputs
of the pooling layers and the incoming inputs, and are defined as

$$\beta_i(t) = \sum_{j=0}^{i} \alpha_j(t). \tag{1}$$
The α_i(t) parameters are updated during training, and therefore depend on the training iteration t. At every
point in training, the sum of these parameters is 1, i.e.

$$\sum_i \alpha_i(t) = 1 \quad \text{for all } t. \tag{2}$$
The individual parameters are defined as

$$\alpha_i(t) = \begin{cases} 0 & \text{for } t < \tau_i \text{ and } t > \tau_i + \Delta \\ (t - \tau_i)/\delta & \text{for } \tau_i \le t \le \tau_i + \delta \\ 1 & \text{for } \tau_i + \delta \le t \le \tau_i + \Delta - \delta \\ (\tau_i + \Delta - t)/\delta & \text{for } \tau_i + \Delta - \delta \le t \le \tau_i + \Delta. \end{cases} \tag{3}$$
Figure 1: The network architecture resembles the original U-net architecture in 3D, but with inputs and outputs
at every resolution level. The parameters α_i determine which parts of the architecture are used.
Figure 2: Progressive growing scheme showing the parameters α_i(t) and β_i(t) changing over time during training.
The colors correspond to the colors of the levels in Figure 1.
The schedules for the five resolution levels in the U-net use δ = 200, Δ = 600, and τ_i = −300 + 400i, and are
shown in Figure 2. Whenever one of the weights α_i equals 1, we are essentially training a U-net with i + 1
resolution levels. Hence, once we reach α_0 = α_1 = α_2 = α_3 = 0 and α_4 = 1, we have arrived at a normal five-level
U-net architecture. An example of the network architecture at three points in training is given in Figure 3,
which shows the network at the start of training, at the transition between the first and second resolution levels,
and after that transition is complete and two resolution levels are being trained.
2.4 Data
We train on and apply the network to pulmonary CT images. We train the network on the public CREATIS and
POPI data sets,8 and apply the trained network to the DIRLAB data set9 to show generalization to a different
data set of pulmonary CT images. We crop the images such that the lungs remain completely visible, and resize
the resulting images to a 128 × 128 × 128 voxel resolution. We create lung masks by segmenting all voxels with
Hounsfield units lower than −250, which corresponds to low-density tissue inside the lungs. These masks are used
to focus the training on the voxels inside the lungs (see Section 2.6).
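A minimal sketch of the mask construction (numpy; the array and function names are ours, and we assume the volumes are already expressed in Hounsfield units):

```python
import numpy as np

def lung_mask(ct_hu: np.ndarray) -> np.ndarray:
    """Binary mask of low-density tissue: voxels below -250 Hounsfield units.

    ct_hu is a cropped, 128 x 128 x 128 CT volume in Hounsfield units; the
    resulting mask restricts the loss of Section 2.6 to voxels inside the lungs.
    """
    return (ct_hu < -250.0).astype(np.float32)
```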
2.5 Training set construction
During training, random synthetic transformations are applied to the fourteen (seven pairs of) CREATIS and POPI
images. In every iteration of training, one of the images in the training set is selected. Two random transformations
are applied to this image: one to augment the data set (T_augm), and one that is actually learned by the network (T_learned).
Figure 3: Illustration of the transition between a one-level and a two-level network. The remainder of the network
(levels 3, 4, and 5) is not shown for clarity. The values of the parameters α and β are shown in red for the
state of the network at iterations 0, 150, and 300, respectively.
Figure 4: At training time, an image from the training set is transformed twice. The deformation field of the
net transformation between the images is learned by the network. At test time, the two images are replaced by
the moving and fixed image.
The latter is applied together with the augmentation transformation, such that we obtain a pair of
images I[T_augm(x)] and I[(T_augm ∘ T_learned)(x)]. From these two images, the network learns the underlying
deformation vector field, i.e. that of the transformation T_learned (Figure 4).
The augmentation transformation T_augm is created by sampling displacements on a 2 × 2 × 2 grid from a
uniform distribution in the [−12.8, 12.8] voxel range. The learned transformation T_learned consists of a sequence
of two transformations: one coarse B-spline transformation using a 4 × 4 × 4 B-spline grid, and one fine
B-spline transformation using an 8 × 8 × 8 grid. The displacements on those grids are sampled from uniform
distributions in the ranges [−25.6, 25.6] voxels and [−6.4, 6.4] voxels, respectively.
At test time, a pair of moving and fixed images replaces this pair of images: I[T_augm(x)] will be replaced by
the moving image, and I[(T_augm ∘ T_learned)(x)] will be replaced by the fixed image.
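The sampling of these random control grids could look as follows (a sketch; the interpolation of the grids to dense B-spline deformation fields and the composition of the transformations are omitted):

```python
import numpy as np

def sample_grid(grid_shape, max_abs):
    """Uniform random displacements in [-max_abs, max_abs] voxels on a
    control-point grid, one grid per displacement component (x, y, z)."""
    return np.random.uniform(-max_abs, max_abs, size=(3, *grid_shape))

# Augmentation transform T_augm: a 2 x 2 x 2 grid with displacements
# drawn from [-12.8, 12.8] voxels.
t_augm = sample_grid((2, 2, 2), 12.8)

# Learned transform T_learned: a coarse 4 x 4 x 4 B-spline grid with
# displacements in [-25.6, 25.6] voxels, composed with a fine 8 x 8 x 8
# grid with displacements in [-6.4, 6.4] voxels.
t_coarse = sample_grid((4, 4, 4), 25.6)
t_fine = sample_grid((8, 8, 8), 6.4)

# Each grid would then be interpolated to a dense 128^3 B-spline deformation
# field; the image pair presented to the network is I[T_augm(x)] and
# I[(T_augm o T_learned)(x)].
```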
2.6 Training the network
The network is trained by minimizing the squared error between the true deformation field u_learned(x) =
T_learned(x) − x and the network’s estimate û_learned(x), within the lung mask M(x) ∈ [0, 1]. The loss function
is defined as

$$L = \frac{\sum_{x \in \Omega_F} M(x)\, \lVert u_\mathrm{learned}(x) - \hat{u}_\mathrm{learned}(x) \rVert_2^2}{\sum_{x \in \Omega_F} M(x)}. \tag{4}$$
This loss function is minimized using stochastic gradient descent with a momentum of 0.5 and an annealed
learning rate

$$\eta(t) = \frac{\eta_0}{1 + \lambda t} \tag{5}$$

with η_0 = 0.1 and λ = 10^{−4}. We use a batch size of one, and use batch normalization for all convolutional layers,
using moving averages of the means and variances over all iterations, as proposed by Ioffe et
al.10 To show the effect of progressive training, we train the architecture once using the schedule in Figure 2,
and once using the setting α_0 = α_1 = α_2 = α_3 = 0 and α_4 = 1 throughout training, i.e. the ‘conventional’ U-net.
Because the architecture is the same in both cases, we use the exact same weight initialization, training data,
and learning rate. At test time, we use this setting for α (i.e. α_4 = 1) for both versions of the network.
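A sketch of the masked loss of Equation (4) and the annealed learning rate of Equation (5) (PyTorch; the training-loop wiring beyond the stated momentum of 0.5 and batch size of one is an assumption):

```python
import torch

def masked_mse_loss(u_true: torch.Tensor, u_pred: torch.Tensor,
                    mask: torch.Tensor) -> torch.Tensor:
    """Equation (4): squared vector error averaged over lung-mask voxels.

    u_true, u_pred: (B, 3, D, H, W) deformation fields.
    mask:           (B, 1, D, H, W) lung mask M(x) with values in [0, 1].
    """
    sq_err = ((u_true - u_pred) ** 2).sum(dim=1, keepdim=True)  # ||u - u_hat||^2
    return (mask * sq_err).sum() / mask.sum()

def learning_rate(t: int, eta0: float = 0.1, lam: float = 1e-4) -> float:
    """Equation (5): eta(t) = eta_0 / (1 + lambda * t)."""
    return eta0 / (1.0 + lam * t)

# Example wiring (model is any network mapping an image pair to a field):
# optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate(0), momentum=0.5)
# for t, (images, u_true, mask) in enumerate(loader):
#     for g in optimizer.param_groups:
#         g['lr'] = learning_rate(t)          # anneal the learning rate
#     loss = masked_mse_loss(u_true, model(images), mask)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```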
3. RESULTS
We registered the ten pairs of DIRLAB images using both the progressive and the conventional network, and
evaluated the resulting deformation fields by computing target registration errors (TRE) on manually annotated
corresponding landmarks in the DIRLAB set. The TRE metric captures misalignment by measuring the L2-norm
between landmarks in the moving domain and the transformed landmarks from the fixed domain, i.e.
TRE = ||x_M − T(x_F)||. Results per image pair are shown in Table 1, together with the boxplot in Figure 6 that
shows the distributions of TRE values for both architectures. We also show correlation plots for the estimated
displacements of the 3000 landmarks in the DIRLAB set in the x-, y-, and z-directions in Figure 5. Both show
that, in general, the progressively trained network performs better.
Figure 5: Correlation between the estimated and true displacements of the DIRLAB landmarks for both network
variants, shown for the x-, y-, and z-components of the displacement.
Table 1: Target registration errors (TRE) in millimeters before registration, after registration using the conventional network, and after registration using the progressive network.

Pair    Before registration    Conventional training    Progressive training
1       3.89 ± 2.78            2.05 ± 0.97              2.18 ± 1.05
2       4.34 ± 3.90            4.10 ± 2.01              2.06 ± 0.96
3       6.94 ± 4.05            2.96 ± 1.21              2.11 ± 1.04
4       9.83 ± 4.85            3.75 ± 2.28              3.13 ± 1.60
5       7.48 ± 5.50            4.00 ± 2.46              2.92 ± 1.70
6       10.89 ± 6.96           6.38 ± 6.05              4.20 ± 2.00
7       11.03 ± 7.42           5.00 ± 3.65              4.12 ± 2.97
8       14.99 ± 9.00           11.33 ± 5.01             9.43 ± 6.28
9       7.92 ± 3.97            4.50 ± 2.02              3.82 ± 1.69
10      7.30 ± 6.34            4.29 ± 2.61              2.87 ± 1.96
Total   8.46 ± 6.58            4.84 ± 4.03              3.68 ± 3.32

Figure 6: Boxplots of the average TRE values of the two networks.
We used the Wilcoxon signed-rank test to test for
pairwise equality between the TRE values for all landmarks in both architectures. This indicated that the TRE
values of the progressive architecture (median: 2.82 mm) and the TRE values of the static architecture (median:
3.72 mm) differed significantly (N = 3000, T = 16.2, r = 0.55, p < 0.001). The median TRE values per image for
each architecture also differ significantly (3.68 mm versus 4.79 mm, N = 10, T = 0, r = 1.0, p = 0.005), as can
also be concluded from Figure 6. On average, the estimation of the deformation vector fields took 0.90 ± 0.05
seconds on an Nvidia Titan XP GPU.
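For reference, the TRE of this section can be computed per landmark as follows (a sketch; the nearest-voxel lookup of the displacement field and the coordinate conventions are implementation assumptions, not details from the paper):

```python
import numpy as np

def target_registration_error(landmarks_fixed, landmarks_moving,
                              field, spacing):
    """Per-landmark TRE = ||x_M - T(x_F)|| in millimeters.

    landmarks_fixed, landmarks_moving: (K, 3) landmark positions in voxels.
    field:   (3, D, H, W) estimated displacement field on the fixed image.
    spacing: (3,) voxel size in mm, to convert voxel offsets to millimeters.
    """
    idx = np.round(landmarks_fixed).astype(int)          # nearest-voxel lookup
    disp = field[:, idx[:, 0], idx[:, 1], idx[:, 2]].T   # (K, 3) displacements
    transformed = landmarks_fixed + disp                 # T(x_F)
    diff_mm = (landmarks_moving - transformed) * np.asarray(spacing)
    return np.linalg.norm(diff_mm, axis=1)               # (K,) TREs in mm
```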
4. DISCUSSION
In this paper we propose a method for training a fully convolutional neural network that lets the network grow
progressively during training. We show on publicly available pulmonary CT data sets that this approach results in
significantly lower registration errors, and better correlation with the ground truth compared to the conventional
training approach. We hypothesize that because the network can focus on lower resolution data at the start of
training, it can quickly optimize a limited number of weights to model a large range of displacements on a small
vector field. The smooth transitions between resolution levels then allow the network to estimate the same range
of displacements in higher resolution vector fields. Once the final transition is complete, the architecture is no
different from an ordinary U-net.
The resulting network can perform end-to-end estimation of full size deformation fields directly from two
images. It can generalize from the training set to a separate set of data acquired on different hardware, at a
different institute. By making elaborate use of data augmentation, we require only a few images to train on,
without the need for manually annotated ground truths. Furthermore, because the method requires only one forward
pass through the network, it takes less than one second to estimate a full-resolution deformation field. Although
the results are promising, they are not yet up to the standard of the state of the art for lung registration, which
reports average TREs between 1.36 ± 1.01 mm and 2.13 ± 1.82 mm on the same DIRLAB data set.11,12 To address
this, future work will focus on improving the training set’s deformations to be more realistic for this particular
application.
5. CONCLUSION
In this paper we propose a method for training a convolutional neural network that lets the network grow
progressively during training. We show on publicly available pulmonary CT data sets that this approach results
in lower registration errors, and better correlation with the ground truth compared to conventional training
approaches. The resulting network can perform end-to-end estimation of full size deformation fields in less than
a second, directly from two images.
REFERENCES
[1] Eppenhof, K. A. J. and Pluim, J. P. W., “Deformable image registration using convolutional neural networks,” Proceedings SPIE Medical Imaging, 10574–27 (2018).
[2] Eppenhof, K. A. J. and Pluim, J. P. W., “Pulmonary CT registration through supervised learning with convolutional neural networks,” IEEE Transactions on Medical Imaging, in press (2018).
[3] Ronneberger, O., Fischer, P., and Brox, T., “U-net: Convolutional networks for biomedical image segmentation,” Proceedings MICCAI, 234–241 (2015).
[4] Sokooti, H., de Vos, B. D., Berendsen, F. F., Lelieveldt, B. P. F., Isgum, I., and Staring, M., “Nonrigid image registration using multi-scale 3D convolutional neural networks,” Proceedings MICCAI, 232–239 (2017).
[5] Yang, X., Kwitt, R., Styner, M., and Niethammer, M., “Quicksilver: Fast predictive image registration - a deep learning approach,” NeuroImage 158, 378–396 (2017).
[6] Cao, X., Yang, J., Zhang, J., Nie, D., Kim, M., Wang, Q., and Shen, D., “Deformable image registration
based on similarity-steered CNN regression,” Proceedings MICCAI , 300–308 (2017).
[7] Karras, T., Aila, T., Laine, S., and Lehtinen, J., “Progressive growing of GANs for improved quality,
stability, and variation,” ICLR (2018).
[8] Vandemeulebroucke, J., Rit, S., Kybic, J., Clarysse, P., and Sarrut, D., “Spatiotemporal motion estimation
for respiratory-correlated imaging of the lungs,” Medical Physics 38(1), 166–178 (2011).
[9] Castillo, E., Castillo, R., Martinez, J., Shenoy, M., and Guerrero, T., “Four-dimensional deformable image
registration using trajectory modeling,” Physics in Medicine & Biology 55(1), 305 (2010).
[10] Ioffe, S. and Szegedy, C., “Batch normalization: Accelerating deep network training by reducing internal
covariate shift,” in [Proceedings ICML], 448–456 (2015).
[11] Schmidt-Richberg, A., Werner, R., Handels, H., and Ehrhardt, J., “Estimation of slipping organ motion by registration with direction-dependent regularization,” Medical Image Analysis 16(1), 150–159 (2012).
[12] Berendsen, F. F., Kotte, A. N. T. J., Viergever, M. A., and Pluim, J. P. W., “Registration of organs with sliding interfaces and changing topologies,” in [SPIE Medical Imaging], 90340E–1 (2014).