Content uploaded by Lars König
Author content
All content in this area was uploaded by Lars König on Oct 06, 2014
Content may be subject to copyright.
c
2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or
reuse of any copyrighted component of this work in other works.
A FAST AND ACCURATE PARALLEL ALGORITHM FOR NON-LINEAR IMAGE
REGISTRATION USING NORMALIZED GRADIENT FIELDS
Lars K¨
onig and Jan R¨
uhaak
Fraunhofer MEVIS, Project Group Image Registration, L¨
ubeck, Germany
ABSTRACT
We present a novel parallelized formulation for fast non-linear
image registration. By carefully analyzing the mathematical
structure of the intensity independent Normalized Gradient
Fields distance measure, we obtain a scalable, parallel algo-
rithm that combines fast registration and high accuracy to an
attractive package. Based on an initial formulation as an opti-
mization problem, we derive a per pixel parallel formulation
that drastically reduces computational overhead.
The method was evaluated on ten publicly available 4DCT
lung datasets, achieving an average registration error of only
0.94 mm at a runtime of about 20 s. By omitting the finest
level, we obtain a speedup to 6.56 s with a moderate increase
of registration error to 1.00 mm. In addition our algorithm
shows excellent scalability on a multi-core system.
Index Terms—Image registration, Computational effi-
ciency, Parallel algorithms
1. INTRODUCTION
The problem of image registration and generally correspon-
dence detection between two or more images has been ex-
tensively studied [1]. Applications in medical imaging range
from motion compensation to intra-operative fusion of dif-
ferent modalities. In particular, non-linear registration meth-
ods are able to capture complex deformations with high accu-
racy, enabling advanced diagnosis and treatment [2]. Many of
these methods, however, exhibit long processing times or re-
quire special hardware such as GPUs. While both resolutions
and the number of imaging modalities are increasing, efficient
tools that run on available hardware are needed.
In this paper, we present a novel approach to both efficient
and accurate non-linear image registration. We directly target
the underlying mathematical structure of the entire algorithm
instead of only optimizing selected parts. We perform a deep
analysis of the objective function associated with the regis-
tration model, by which we join the major building blocks to
a closed analytical formulation. This allows parallelization
on a per pixel level with close-to-zero memory consumption,
directly executable on standard CPUs.
This work was partly funded by the European Regional Development
Fund (EFRE).
2. RELATED WORK
The efficiency of registration algorithms has been widely dis-
cussed. A general framework for fast registration has been
presented in 2004 [3]. Related approaches try to reduce the
computational complexity using adaptive discretizations [4].
With the ubiquity of multicore systems, parallel imple-
mentations have moved into the focus of the research com-
munity, see [5] for an overview. A detailed approach to data-
distributed parallel registration was presented in [6], whereas
newer work deals with the use of GPUs for accelerating non-
linear registration, e.g. [7] and references therein. A different
approach for rigid registration has been provided in [8], ex-
ploiting the mathematical structure to obtain a fully parallel
algorithm. This idea is picked up in our work and extended to
non-linear image registration.
3. METHOD
To obtain a custom-tailored, efficient algorithm, we first give
a short outline about our registration framework which allows
a thorough analysis of the components and their interaction.
3.1. Registration Framework
The goal of image registration is to establish correspondence
between a reference and a template image [9]. The images
are acquired as discrete arrays R∈Rabc and ˆ
T∈Rˆaˆ
bˆcin
column vectors representing three-dimensional images, Rof
size a×b×cwith grid spacings hx, hy, hz,ˆ
Tanalogously.
Correspondence is established by deforming the template
image onto the reference image using a transformation Y∈
R3ABC consisting of ABC three-dimensional deformation
coordinates. To be able to evaluate the template at those co-
ordinates, the discrete image is transferred to a continuous
model using trilinear interpolation, obtaining the interpolation
function T:R3abc →Rabc, which maps a set of coordinates
to a deformed image in the reference image space.
In our model, the size of the deformation Yis independent
of any image extent. This allows to adapt the deformation res-
olution to the size of the structures to be registered, thus de-
creasing both problem size and registration time. For compar-
ing the deformed template with the reference image, the de-
∂ψ
∂r
z }| {
∂r
∂T
z }| {
∂T
∂P
z }| {
∇DNGF(Y)=(••••••••••)
••••
•••••
•••••
••••••
•••••
•••••
••••••
•••••
•••••
••••
•••
•••
•••
•••
•••
•••
•••
•••
•••
•••
·∂P
∂Y
Fig. 1. Schematic view of the sparse matrix structure in the computation of ∇D. Diagonals in ∂r
∂T resulting from neighboring
points in the same direction are shown in the same color.
formation needs to be converted to the reference image extent
using a function P:R3ABC →R3abc , so that the deformed
template can be evaluated as T(P(Y)) : R3ABC →Rabc.
To quantify correspondence between the two images, we
define a distance measure D(Y) : R3ABC →R, which mea-
sures the similarity of reference and deformed template im-
age, depending on the deformation Y. The minimization of D
is an ill-posed problem and needs a regularization term S(Y)
to ensure certain deformation properties, such as smoothness
or specific physical behavior. Combining these two terms, the
registration problem can be written as an optimization prob-
lem J(Y) = D(Y) + αS(Y)Y
−→ min, where αbalances
image similarity and deformation regularity. The optimiza-
tion problem is then solved by Newton-type methods [10].
Since the formulation of each part of this optimization
problem is crucial, we will now look precisely at the compo-
nents and derive specific methods for efficient parallel com-
putation of their function values as well as their derivatives.
3.2. Distance measure
We focus on the Normalized Gradient Fields (NGF) distance
measure [9], that has been successfully proven to be both well
suited for multimodal registration problems and paralleliza-
tion [8]. The general assumption in this distance term is that
intensity changes, which naturally represent edges, are pre-
served across different modalities. The NGF evaluates the
angles between these image gradients and has a lower value
the more parallel the gradients are aligned. The maximum
value is obtained by orthogonal gradients.
In [9] NGF has been introduced in a continuous frame-
work. To obtain a discretized formulation, we use the mid-
point quadrature rule on the reference image domain. With
the product of the image grid spacings ¯
h=hxhyhzand
k·kε=ph·,·i +ε2we can write the NGF as
D(Y) = ¯
h
2
abc
X
i=1 1−h∇Ti(P(Y)),∇Rii+τ%
k∇Ti(P(Y))kτk∇Rik%2!,
(1)
where τ, % > 0are modality dependent parameters, which
enable the gradient images to be filtered for noise.
3.3. Parallel derivative computation
The most time of the registration is typically spent evaluat-
ing the distance measure and its derivative. While the func-
tion value computation is directly parallelizable using (1), the
gradient computation is more involved. It consists of several
separate steps, that need to be investigated in detail to derive
a joint, parallelizable formulation. The steps can be described
as follows: Convert deformation to reference image grid →
Compute deformed template →Compute NGF residual →
Final summation step. These steps translate to the function
chain
R3ABC P
−→ R3abc T
−→ Rabc r
−→ Rabc ψ
−→ R(2)
with reduction function ψ:Rabc →R,(r1, . . . , rabc)>7→
¯
h
2Pabc
i=1(1 −r2
i).Using (1), the ith component of rcan be
written as
ri(T) = hg(Ti), g(Ri)i+τ%
kg(Ti)kτkg(Ri)k%
,
where g(Ti)is the image gradient approximation of Tat the
point iusing forward/backward finite differences as in [8].
Mathematically, the derivative of (1) can directly be
computed using the chain-rule, yielding ∇DNGF(Y) =
∂ψ
∂r
∂r
∂T
∂T
∂P
∂P
∂Y . Calculating this in a matrix-based fashion, the
formulation is difficult to parallelize because of dependen-
cies on intermediate results and unknown matrix structures.
Hence, we take a closer look at the structure of the single
components, which is visualized in Fig. 1. Exploiting the
banded structure of ∂r
∂T , which only contains non-zero ele-
ments at neighboring points, we can derive a compact closed
formulation of each gradient element. By evaluating the
complete matrix chain, point-wise, down to its very basic ele-
ments (the images), this formulation can directly be computed
fully in parallel from the input data, eliminating intermediate
memory write accesses and computational overhead.
With the set of offsets to points adjacent to point iin a 3D-
neighborhood defined as M={−z, −y, −x, 0,+x, +y, +z}
with zero Neumann boundary conditions, using the notation
as in [8] we can first define
(ˆri)l=1
2hlRi−l−Ri
kgi(R)k%kgi(T)kτ−(hgi(T),gi(R)i+%τ)(Ti−l−Ti)
kgi(R)k%kgi(T)k3
τ
with Ti:= Ti(P(Y)). Then the i+lth element of the row
vector ∂ri
∂T can be written as
∂ri
∂T i+l
=
(ˆri)l,if l∈ M \ 0
−Pj∈M\0(ˆri)j,if l= 0
0,otherwise
.
The final gradient element at position iis given by
(∇D)i=
X
j∈M
−rj∂rj
∂T i−j
∂Ti
∂pi+d·abc
,(3)
with d= 0,1,2for derivatives regarding x-,y-,z-coordinates,
respectively. This formulation does not contain dependencies
between single gradient elements and can be calculated with-
out intermediate steps from the input data. Thus it can be
fully parallelized, given a per-element formulation of the in-
terpolation function and grid conversion T(P(Y)). This will
be discussed in the next section.
3.4. Grid conversion
The conversion between deformation and reference image
discretization is performed using trilinear interpolation. As
the interpolation weights only depend on the spacing of
deformation and reference image, not on the current defor-
mation, the conversion is a linear operation with matrix P.
For both NGF function value and gradient, the conversion
from deformation to image grid is needed. This can easily
be implemented in a matrix-free fashion by looping over the
image grid, collecting all adjacent deformation grid points
with their associated interpolation weights and summing up.
Moreover, the computation can directly be parallelized as
there are no write conflicts.
Setting v:= ∂ψ
∂r
∂r
∂T
∂T
∂P , the gradient computation for NGF
is equivalent to the matrix-vector product P>v. We use a red-
black scheme for efficient parallel implementation. The iter-
ation is performed over the deformation grid cells, allowing
write access to eight grid points at the same time. The algo-
rithm is parallelized on the image slices: In the first loop, only
the odd slices are considered, allowing for unconflicted writes
as the slices themselves are computed serially. In the second
loop, the even slices are calculated, finalizing the result. Fig.
2 illustrates our approach.
3.5. Regularizer
The last term in the objective function is the regularizer.
Here, we choose Curvature Regularization [11], which favors
a smooth deformation field. It has successfully been used
in non-linear registration problems [12]. Since its computa-
tion is lightweight and easy parallelizable it is well suited to
accompany the parallelized NGF. Discretized on the transfor-
mation grid and using the decomposition Y=X+U, where
Fig. 2. Red-black scheme for transposed grid conversion in
2D, with deformation (blue) and image grid (white). The red
rows are processed in parallel, followed by the black rows.
Only the adjacent blue nodes are written in each step such
that no write conflicts can occur.
Xis the identity, the curvature regularizer can be written as
SCurv(Y) = ¯
hY
2
ABC
X
i=i(∆iU1)2+ (∆iU2)2+ (∆iU3)2,
where Uirepresents the ith component function of the vector
field deformation U. The function ∆i:RABC →Ris a finite
difference approximation to the Laplace operator at point i
∆iUk=X
j∈{x,y,z}
1
hY
khY
k(Uk)i−j−2 (Uk)i+ (Uk)i+j,
where i±x, i ±y, i ±zrepresent the neighboring points of
iin the respective directions and hY
kthe grid spacing of Y.
Here, we use zero Neumann boundary conditions. The i-th
element of the gradient of the regularizer is then given by
(∇SCurv)i=¯
hY(∆iU1+ ∆iU2+ ∆iU3).
Note that due to discretization on the deformation grid, no
grid conversion is needed for the regularizer. With this for-
mulation we have the complete objective function available
as per point parallelizable terms.
3.6. Optimization
To gain additional speedup and avoid being trapped in local
minima, the presented objective function is optimized in a
multi-level approach. For this, the problem is successively
solved on finer representations, using the minimizer from
each coarser level as a starting guess for the next finer level.
On each level the objective function is iteratively minimized
using an L-BFGS approach, which is known for its memory
efficiency and fast convergence [10].
4. EVALUATION
We have evaluated the accuracy and computational efficiency
of our method on the challenging problem of CT lung reg-
istration. Since the air volume inside the lung varies while
Case LME (a) LME (b) Time (a) Time (b)
10.78 ±0.89 0.76 ±0.89 18.71 s 4.12 s
20.79 ±0.90 0.80 ±0.88 19.58 s 5.71 s
30.93 ±1.05 0.96 ±1.07 18.64 s 4.42 s
41.27 ±1.27 1.33 ±1.29 22.95 s 4.05 s
51.07 ±1.46 1.18 ±1.45 18.77 s 5.50 s
60.90 ±0.99 1.03 ±1.04 19.71 s 7.31 s
70.85 ±0.98 0.92 ±0.93 27.34 s 10.12 s
81.03 ±1.23 1.13 ±1.15 24.98 s 9.22 s
90.94 ±0.93 1.00 ±0.96 20.42 s 6.82 s
10 0.86 ±0.97 0.91 ±0.99 17.89 s 8.36 s
Avg. 0.94 ±1.07 1.00 ±1.07 20.90 s 6.56 s
Table 1. DIR-Lab datasets: Comparison of runtime and land-
mark error (LME) with α= 5, τ, % = 100 and finest defor-
mation grid size of 653. Multi-level configuration (a) uses the
full resolution, (b) omits the finest level in the multi-level ap-
proach [12]. All values are given in millimeters. The initial
landmark error ranged from 3.89 ±2.79 mm to 14.99 ±9.01
mm. The registrations were performed on a stock 3.4 GHz
Intel i7-2600 quad-core PC running Ubuntu Linux.
Method Serial Parallel Speedup
NGF 55.08 s 4.13 s 13.34
∇NGF 94.96 s 7.72 s 12.30
P x 8.98 s 0.76 s 11.82
P>x9.18 s 0.77 s 11.92
Table 2. Higher resolution datasets (5123image resolution,
1293deformation grid size): Scaling of NGF gradient and
grid conversion on a 12-core dual CPU Intel Xeon E5645
breathing, the intensities in the acquired images are not di-
rectly comparable, which makes the datasets appropriate for
the intensity independent NGF. For the evaluation we used the
publicly available DIR-Lab 4DCT datasets [13, 14] and regis-
tered the extreme phases. These phases come with 300 expert
annotated landmark pairs that can be used to assess registra-
tion accuracy. As we are only interested in the deformation of
lung tissue, we segmented the lungs from the images [15].
To show the scalability of our algorithm, we performed
separate calculations of the NGF and the grid change opera-
tors single and multithreaded on a 12-core workstation.
5. RESULTS
On the DIR-Lab data we achieved a mean registration error of
0.94 mm with an average complete runtime of 20.9 seconds.
Omitting the finest level, we obtained a speedup to 6.56 sec-
onds with a moderate increase of average registration error to
1.00 mm. The detailed results of all cases are shown in Table
1. The result deformation fields were automatically checked
and found to be free of singularities.
For eight of the ten cases the landmark errors were equal
to or better than the lowest errors reported in [16]. Addition-
ally the computation time compares very favorably with the
competing algorithms. Comparing the single threaded com-
putation time to a multithreaded calculation on a 12-core sys-
tem, shown in Table 2, speedup factors from 11.82 to 13.34
are obtained, which implies a perfect linear scalability.
Hence, our algorithm combines accuracy and efficiency
to a very attractive package. In addition our method does not
require any special equipment such as multi-CPU servers or
specialized GPUs. It runs on readily available stock hardware
that is already used in the clinic.
6. REFERENCES
[1] A. Sotiras, C. Davatzikos, and N. Paragios, “Deformable med-
ical image registration: a survey.,” IEEE Transactions on Med-
ical Imaging, vol. 32, no. 7, pp. 1153–90, 2013.
[2] C. J. Galb´
an et al., “Computed tomography-based biomarker
provides unique signature for diagnosis of COPD phenotypes
and disease progression,” Nature Medicine, 2012.
[3] B. Fischer and J. Modersitzki, “A unified approach to fast im-
age registration and a new curvature based registration tech-
nique,” Linear Algebra Appl, vol. 380, pp. 107–124, 2004.
[4] E. Haber, S. Heldmann, and J. Modersitzki, “Adaptive mesh
refinement for nonparametric image registration,” SIAM J Sci
Comput, vol. 30, pp. 3012–3027, 2008.
[5] R. Shams et al., “A survey of medical image registration on
multicore and the GPU,” Signal Processing Magazine, IEEE,
vol. 27, no. 2, pp. 50–60, 2010.
[6] F. Ino, K. Ooyama, and K. Hagihara, “A data distributed par-
allel algorithm for nonrigid image registration,” Parallel Com-
puting, vol. 31, no. 1, pp. 19–43, 2005.
[7] X. Gu et al., “Implementation and evaluation of various
demons deformable image registration algorithms on a GPU,”
Phys Med Biol, vol. 55, no. 1, pp. 207, 2010.
[8] J. R ¨
uhaak, L. K¨
onig, et al., “A fully parallel algorithm for mul-
timodal image registration using normalized gradient fields,”
in ISBI, IEEE, 2013, pp. 572–575.
[9] J. Modersitzki, FAIR: Flexible Algorithms for Image Registra-
tion, SIAM, 2009.
[10] J. Nocedal and S.J. Wright, Numerical Optimization, Springer,
2006.
[11] B. Fischer and J. Modersitzki, “Curvature based image regis-
tration,” J Math Imaging Vis, vol. 18, no. 1, pp. 81–85, 2003.
[12] J. R¨
uhaak, S. Heldmann, T. Kipshagen, and B. Fischer,
“Highly accurate fast lung CT registration,” in SPIE Medical
Imaging, Image Processing, 2013.
[13] R. Castillo et al., “A framework for evaluation of deformable
image registration spatial accuracy using large landmark point
sets,” Phys Med Biol, vol. 54, no. 7, pp. 1849, 2009.
[14] E. Castillo et al., “Four-dimensional deformable image regis-
tration using trajectory modeling,” Phys Med Biol, vol. 55, no.
1, pp. 305, 2010.
[15] B. Lassen et al., “Lung and lung lobe segmentation methods
at Fraunhofer MEVIS,” in Fourth International Workshop on
Pulmonary Image Analysis, 2011, pp. 185–200.
[16] R. Castillo, “DIR-Lab - Results,” www.dir- lab.com/
Results.html, 2013, [Online; accessed 01-Oct-2013].