From Local to Global Random
Regression Forests: Exploring Anatomical
Landmark Localization
Darko Štern1⋆, Thomas Ebner2, and Martin Urschler1,2,3
1Ludwig Boltzmann Institute for Clinical Forensic Imaging, Graz, Austria
2Institute for Computer Graphics and Vision, Graz University of Technology, Austria
3BioTechMed-Graz, Austria
Abstract. State of the art anatomical landmark localization algorithms
pair local Random Forest (RF) detection with disambiguation of locally
similar structures by including high level knowledge about relative land-
mark locations. In this work we pursue the question, how much high-level
knowledge is needed in addition to a single landmark localization RF to
implicitly model the global configuration of multiple, potentially ambigu-
ous landmarks. We further propose a novel RF localization algorithm
that distinguishes locally similar structures by automatically identifying
them, exploring the back-projection of the response from accurate local
RF predictions. In our experiments we show that this approach achieves
competitive results in single and multi-landmark localization when ap-
plied to 2D hand radiographic and 3D teeth MRI data sets. Additionally,
when combined with a simple Markov Random Field model, we are able
to outperform state of the art methods.
1 Introduction
Automatic localization of anatomical structures consisting of potentially ambigu-
ous (i.e. locally similar) landmarks is a crucial step in medical image analysis
applications like registration or segmentation. Lindner et al. [5] propose a state
of the art localization algorithm, which is composed of a sophisticated statistical
shape model (SSM) that locally detects landmark candidates by a three-step optimization over a random forest (RF) response function. Similarly, Donner et al. [2]
use locally restricted classification RFs to generate landmark candidates, fol-
lowed by a Markov Random Field (MRF) optimizing their configuration. Thus,
in both approaches good RF localization accuracy is paired with disambiguation
of landmarks by including high-level knowledge about their relative location. A
different concept for localizing anatomical structures is from Criminisi et al. [1],
suggesting that the RF framework itself is able to learn global structure configu-
ration. This was achieved with random regression forests (RRF) using arbitrary
⋆ This work was supported by the province of Styria (HTI:Tech for Med ABT08-22-T-7/2013-13) and the Austrian Science Fund (FWF): P 28078-N33.
Fig. 1. Overview of our RRF based localization strategy. (a) 37 anatomical landmarks
in 2D hand X-ray images and differently colored MRF configurations. (b) In phase
1, RRF is trained locally on an area surrounding a landmark (radius R) with short
range features, resulting in accurate but ambiguous landmark predictions (c). (d) Back-
projection is applied to select pixels for training the RRF in phase 2 with larger feature
range (e). (f) Landmarks are estimated by accumulating the predictions of pixels in a local neighbourhood. (g,h) One of two independently predicted wisdom teeth from 3D MRI.
long range features and allowing pixels from all over the training image to glob-
ally vote for anatomical structures. Although roughly capturing global structure
configuration, their long range voting is inaccurate when pose variations are
present, which led to extending this concept with a graphical model [4]. Ebner
et al. [3] adapted the work of [1] for multiple landmark localization without the
need for an additional model and improved it by introducing a weighting of vot-
ing range at testing time and by adding a second RRF stage restricted to the
local area estimated by the global RRF. Despite putting more trust into the surroundings of a landmark, their results crucially depend on empirically tuned parameters defining the restricted area according to the first stage estimate.
In this work we pursue the question of how much high-level knowledge is needed
in addition to a single landmark localization RRF to implicitly model the global
configuration of multiple, potentially ambiguous landmarks [6]. Investigating dif-
ferent RRF architectures, we propose a novel single landmark localization RRF
algorithm, robust to ambiguous, locally similar structures. When extended with
a simple MRF model, our RRF outperforms the current state of the art method
of Lindner et al. [5] on a challenging multi-landmark 2D hand radiograph data set, while at the same time performing best in localizing single wisdom teeth landmarks from 3D head MRI.
2 Method
Although constrained by all surrounding objects, the location of an anatomical landmark is most accurately defined by its neighboring structures. While
increasing the feature range leads to more surrounding objects being seen for
defining a landmark, enlarging the area from which training pixels are drawn
leads to the surrounding objects being able to participate in voting for a land-
mark location. We explore these observations and investigate the influence of different feature and voting ranges by proposing several RRF strategies for single landmark localization. Following the ideas of Lindner et al. [5] and Donner et
al. [2], in the first phase of the proposed RRF architectures, the local surround-
ings of a landmark are accurately defined. The second RRF phase establishes
different algorithm variants by exploring distinct feature and voting ranges to
discriminate ambiguous, locally similar structures. In order to maintain the ac-
curacy achieved during the first RRF phase, locations outside of a landmark’s
local vicinity are recognized and banned from estimating the landmark location.
2.1 Training the RRF
We independently train an RRF for each anatomical landmark. Similar to [1,3], at each node of the T trees of a forest, the set of pixels S_n reaching node n is pushed to the left (S_{n,L}) or right (S_{n,R}) child node according to the splitting decision made by thresholding a feature response for each pixel. Feature responses are calculated as differences between the mean image intensities of two rectangles with maximal size s and maximal offset o relative to a pixel position v_i, i ∈ S_n. Each node stores a feature and threshold selected from a pool of N_F randomly generated features and N_T thresholds, maximizing the objective function I:
I = \sum_{i \in S_n} \left\lVert d_i - \bar{d}(S_n) \right\rVert^2 - \sum_{c \in \{L,R\}} \sum_{i \in S_{n,c}} \left\lVert d_i - \bar{d}(S_{n,c}) \right\rVert^2 \qquad (1)
For a pixel set S, d_i is the i-th voting vector, defined as the vector between the landmark position l and the pixel position v_i, while d̄(S) is the mean voting vector of the pixels in S. For later testing, we store at each leaf node l the mean value d_l of the relative voting vectors of all pixels reaching l.
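The greedy node training step described above can be sketched as follows. This is a minimal illustration of the rectangle-difference feature response and of the variance-reduction objective of Eq. (1), not the authors' implementation; the function names, the summed-area-table layout, and the rectangle parameterization are our own assumptions.

```python
import numpy as np

def haar_like_response(integral, pos, r1, r2):
    """Difference of mean intensities of two rectangles offset from `pos`.
    `integral` is a summed-area table with a zero top row/left column;
    r1, r2 are (dy, dx, height, width) rectangle offsets."""
    def rect_mean(r):
        y, x, h, w = pos[0] + r[0], pos[1] + r[1], r[2], r[3]
        s = (integral[y + h, x + w] - integral[y, x + w]
             - integral[y + h, x] + integral[y, x])
        return s / (h * w)
    return rect_mean(r1) - rect_mean(r2)

def split_objective(d, left_mask):
    """Eq. (1): reduction in the sum of squared deviations of the voting
    vectors d (N x dim) when splitting them into left/right child sets."""
    def ssd(v):
        return 0.0 if len(v) == 0 else np.sum((v - v.mean(axis=0)) ** 2)
    return ssd(d) - ssd(d[left_mask]) - ssd(d[~left_mask])
```

A split that separates two clusters of voting vectors yields a large objective value, so the greedy search over N_F features and N_T thresholds keeps the candidate with the highest such reduction.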
First training phase: Based on a set of pixels S_I, selected from the training images at locations inside a circle of radius R centered at the landmark position, the RRF is first trained locally with features whose rectangles have maximal size in each direction s_I and maximal offset o_I, see Fig. 1b. Training of this phase is finished when a maximal depth D_I is reached.
Second training phase: Here, our novel algorithm variants are designed by implementing different strategies for dealing with feature ranges and the selection of the area from which pixels are drawn during training. By pursuing the same local strategy as in the first phase for continuing training of the trees up to a maximal depth D_II, we establish the localRRF, similar to the RF part in [5,2]. If we continue training to depth D_II with a restriction to the pixels S_I but additionally allow long range features with maximal offset o_II > o_I and maximal size s_II > s_I, we get the fAdaptRRF. Another way of introducing long range features, while still keeping the same set of pixels S_I, was proposed for segmentation by Peter et al. [7]. They optimize for each forest node the feature size and offset instead
of the traditional greedy RF node training strategy. For later comparison, we have adapted the strategy from [7] to our localization task by training trees from the root node to a maximal depth D_II using this optimization. We denote it as PeterRRF. Finally, we propose two strategies in which both the feature range and the area from which pixels are selected are increased in the second training phase. By continuing training to depth D_II, allowing large scale features (o_II, s_II) in the second phase and simultaneously extending the training pixels (set of pixels S_II) to the whole image, we get the fpAdaptRRF. Here, S_II is determined by randomly sampling from pixels uniformly distributed in the image. The second strategy uses a different set of pixels S_II, selected according to back-projection images computed from the first training phase. This concept is a main contribution of our work, therefore the next paragraph describes it in more detail.
2.2 Pixel Selection by Back-projection Images
In the second training phase, pixels S_II from locally similar structures are explicitly introduced, since they provide information that may help in disambiguation.
We automatically identify similar structures by applying the RRF from the first
phase on all training images in a testing step as described in Section 2.3. Thus,
pixels from the area surrounding the landmark as well as pixels with a locally similar appearance to the landmark end up in the first phase RRF's terminal nodes, since the newly introduced pixels are pushed through the first phase trees. The obtained accumulators show a high response on structures whose appearance is similar to the landmark's local appearance (see Fig. 1c).
To identify pixels voting for a high response, we calculate for each accumulator a back-projection image (see Fig. 1d), obtained by summing for each pixel v all accumulator values at the target voting positions v + d_l over all trees. We finalize our backProjRRF strategy by selecting for each tree the training pixels S_II as N_px pixels randomly sampled with probability proportional to the back-projection image (see Fig. 1e).
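The back-projection step can be sketched in a few lines. This is an illustrative 2D toy version under our own naming assumptions (the paper stores one mean vote d_l per leaf; here we pass a precomputed per-pixel, per-tree vote array), not the authors' code.

```python
import numpy as np

def back_projection(accumulator, votes):
    """accumulator: (H, W) response image of the phase-1 RRF.
    votes: (H, W, T, 2) integer voting vectors d_l per pixel and tree.
    Each pixel's score is the sum of accumulator values at its
    predicted target positions v + d_l over all T trees."""
    H, W, T, _ = votes.shape
    bp = np.zeros((H, W))
    ys, xs = np.mgrid[0:H, 0:W]
    for t in range(T):
        ty = np.clip(ys + votes[:, :, t, 0], 0, H - 1)
        tx = np.clip(xs + votes[:, :, t, 1], 0, W - 1)
        bp += accumulator[ty, tx]
    return bp

def sample_phase2_pixels(bp, n_px, rng):
    """Draw n_px pixel indices with probability proportional to the
    back-projection image, as for the backProjRRF training set S_II."""
    p = bp.ravel() / bp.sum()
    flat = rng.choice(bp.size, size=n_px, p=p)
    return np.unravel_index(flat, bp.shape)
```

Pixels that voted into strong accumulator maxima, i.e. the landmark itself and its locally similar look-alikes, receive high back-projection scores and are therefore preferentially sampled for the second training phase.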
2.3 Testing the RRF
During testing, all pixels of a previously unseen image are pushed through the
RRF. Starting at the root node, pixels are passed recursively to the left or right
child node according to the feature tests stored at the nodes until a leaf node
is reached. The estimated location of the landmark, L(v), is calculated based on the pixel's position v and the relative voting vector d_l stored in the leaf node l. However, if the length |d_l| of the voting vector is larger than the radius R, i.e. pixel v is not in the area closely surrounding the landmark, the estimated location is omitted from the accumulation of landmark location predictions. Separately for each landmark, the pixels' estimations are stored in an accumulator image.
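The testing step above can be sketched as follows; a 2D toy version under our own naming assumptions, with the per-pixel leaf votes d_l precomputed rather than obtained by tree traversal.

```python
import numpy as np

def accumulate_votes(leaf_votes, R, shape):
    """leaf_votes: (H, W, 2) voting vector d_l per pixel.
    Votes longer than R are rejected, so only pixels claiming to lie in
    the landmark's close vicinity contribute to the accumulator."""
    H, W = shape
    acc = np.zeros(shape)
    for y in range(H):
        for x in range(W):
            d = leaf_votes[y, x]
            if np.linalg.norm(d) > R:
                continue  # pixel is outside the landmark's local vicinity
            ty, tx = y + int(d[0]), x + int(d[1])
            if 0 <= ty < H and 0 <= tx < W:
                acc[ty, tx] += 1
    return acc

def localize(acc):
    """Estimated landmark position: the accumulator maximum."""
    return np.unravel_index(np.argmax(acc), acc.shape)
```

Rejecting long votes is what preserves the phase-1 accuracy: distant, merely similar structures recognize themselves via their long stored votes and abstain.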
2.4 MRF Model
For multi-landmark localization, high-level knowledge about landmark configu-
ration may be used to further improve disambiguation between locally similar
structures. An MRF selects the best candidate for each landmark according to
the RRF accumulator values and a geometric model of the relative distances between landmarks, see Fig. 1a. In the MRF model, each landmark L_i corresponds to one variable, while the candidate locations, selected as the N_c strongest maxima in the landmark's accumulator, determine the possible states of that variable. The landmark configuration is obtained by optimizing the energy function
E(L) = \sum_{i=1}^{N_L} U_i(L_i) + \sum_{\{i,j\} \in C} P_{i,j}(L_i, L_j) \qquad (2)
where the unary term U_i is set to the RRF accumulator value of candidate L_i, and the relative distances of two landmarks in the training annotations define the pairwise term P_{i,j}, modeled as normal distributions for the landmark pairs in set C.
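A toy version of Eq. (2) can be evaluated by brute force over all candidate combinations; the paper instead uses a message passing algorithm over N_c = 75 candidates per landmark, and the maximization convention and function names below are our own assumptions.

```python
import itertools
import numpy as np

def pairwise_log_prob(dist, mean, std):
    """Log of a normal density on an inter-landmark distance,
    playing the role of the pairwise term P_{i,j}."""
    return -0.5 * ((dist - mean) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi))

def best_configuration(candidates, unaries, edges):
    """candidates: list of (Nc, 2) candidate position arrays per landmark.
    unaries: list of (Nc,) accumulator values (unary terms U_i).
    edges: dict {(i, j): (mean_dist, std_dist)} for landmark pairs in C.
    Returns the label tuple maximizing the energy of Eq. (2)."""
    best, best_E = None, -np.inf
    for labels in itertools.product(*[range(len(u)) for u in unaries]):
        E = sum(unaries[i][l] for i, l in enumerate(labels))
        for (i, j), (m, s) in edges.items():
            d = np.linalg.norm(candidates[i][labels[i]] - candidates[j][labels[j]])
            E += pairwise_log_prob(d, m, s)
        if E > best_E:
            best, best_E = labels, E
    return best
```

Even a candidate with a slightly weaker accumulator response wins if its distance to the neighboring landmarks matches the learned geometric model, which is exactly how the MRF disambiguates locally similar structures.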
3 Experimental Setup and Results
We evaluate the performance of our landmark localization RRF variants on data
sets of 2D hand X-ray images and 3D MR images of human teeth. As evaluation
measure, we use the Euclidean distance between ground truth and estimated
landmark position. To measure reliability, the number of outliers, defined as localization errors larger than 10 mm for hand landmarks and 7 mm for teeth, is calculated. For both data sets, which were normalized in intensity by histogram matching, we perform a three-fold cross-validation, splitting the data into 66% training and 33% testing data.
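The evaluation measure reduces to a few lines; a sketch under the stated protocol, with function names of our own choosing.

```python
import numpy as np

def localization_errors(pred, gt):
    """Euclidean distance between predicted and ground-truth landmark
    positions; pred, gt are (N, dim) arrays in mm."""
    return np.linalg.norm(pred - gt, axis=1)

def outlier_rate(errors, threshold_mm):
    """Fraction of localization errors above the dataset-specific
    threshold (10 mm for hand landmarks, 7 mm for teeth)."""
    return np.mean(errors > threshold_mm)
```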
Hand Dataset consists of 895 2D X-ray hand images publicly available from the Digital Hand Atlas Database1. Since the images lack a physical pixel resolution, we assume a wrist width of 50 mm, resample the images to a height of 1250 pixels, and normalize image distances according to the wrist width as defined by the ground-truth annotation of two landmarks (see Fig. 1a). For evaluation, N_L = 37 landmarks, many of them showing locally similar structures, e.g. finger tips or joints between the bones, were manually annotated by three experts.
Teeth Dataset consists of 280 3D proton density weighted MR images of the left or right side of the head. In the latter case, images were mirrored to create a consistent data set of images with 208 x 256 x 30 voxels and a physical resolution
of 0.59 x 0.59 x 1 mm per voxel. A dentist annotated each data set by specifying the center locations of the two wisdom teeth. Localization of wisdom teeth is
challenging due to the presence of other locally similar molars (see Fig. 1g).
Experimental setup: For each method described in Section 2, an RRF consisting of T = 7 trees is built separately for every landmark. The first RRF phase is trained using pixels from the training images within a range of R = 10 mm around each landmark position. The splitting criterion for each node is greedily optimized with N_F = 20 candidate features and N_T = 10 candidate thresholds, except for PeterRRF. The random feature rectangles are defined by a maximal
1 Available from http://www.ipilab.org/BAAweb/, as of Jan. 2016
[Fig. 2 plots the cumulative distribution (0.80-1.00) of the localization error in mm for the hand data set (0-15 mm) and the teeth data set (0-20 mm), comparing CriminisiRRF, EbnerRRF, localRRF, PeterRRF, fAdaptRRF, fpAdaptRRF, and backProjRRF.]
Fig. 2. Cumulative localization error distributions for hand and teeth data sets.
size in each direction s_I = 1 mm and a maximal offset o_I = R. In the second RRF phase, N_px = 10000 pixels are introduced and the feature range is increased to a maximal feature size s_II = 50 mm and a maximal offset in each direction o_II = 50 mm.
Treating each landmark independently on both the 2D hand and the 3D teeth data set, the single-landmark experiments show the performance of the methods in cases where it is not feasible (due to lack of annotations) or not semantically meaningful (e.g. third vs. other molars) to define all available locally similar structures. We compare our algorithms that start with local feature scale ranges and increase to more global scale ranges (localRRF, fAdaptRRF, PeterRRF, fpAdaptRRF, backProjRRF) with reimplementations of two related works that start from global feature scale ranges (CriminisiRRF [1], with maximal feature size s_II and offset o_II from pixels uniformly distributed over the image) and optionally decrease to more local ranges (EbnerRRF [3]). The first training phase stops for all methods at D_I = 13, while the second phase continues training within the same trees until D_II = 25. To ensure a fair comparison, we use the same RRF parameters for all methods, except for the number of candidate features in PeterRRF, which was set to N_F = 500 as suggested in [7]. Cumulative error distribution results of the single-landmark experiments can be found in Fig. 2.
Table 1 shows quantitative localization results regarding reliability for all hand
landmarks and for subset configurations (fingertips, carpals, radius/ulna).
The multi-landmark experiments allow us to investigate the benefits
of adding high level knowledge about landmark configuration via an MRF to
the prediction. In addition to our reimplementation of the related works [1,3], Lindner et al. [5] applied their code to our hand data set using D_I = 25 in their implementation of the local RF stage. To allow a fair comparison with Lindner et al. [5], we modify our two training phases by training two separate forests for the two stages until maximum depths D_I = D_II = 25, instead of continuing training the trees of a single forest. Thus, we investigate our presented backProjRRF, the combination of backProjRRF with an MRF, localRRF combined with an MRF, and the two state of the art methods of Ebner et al. [3] (EbnerRRF)
Table 1. Multi-landmark localization reliability results on hand radiographs for all
landmarks and subset configurations (compare Fig. 1 for configuration colors).
method              mean ± std [mm]    outliers
EbnerRRF            0.97 ± 2.45        228 (6.89‰)
Lindner et al. [5]  0.85 ± 1.01         20 (0.60‰)
localRRF+MRF        0.80 ± 0.91         14 (0.42‰)
backProjRRF         0.84 ± 1.58         57 (1.72‰)
backProjRRF+MRF     0.80 ± 0.91         15 (0.45‰)

landmark subset       localRRF+MRF   backProjRRF+MRF   backProjRRF
full (••••)           14 (0.4‰)      15 (0.5‰)         57 (1.7‰)
fingertips (•)        14 (3.1‰)       5 (1.1‰)         17 (3.8‰)
radius, ulna (•)     495 (92.2‰)      6 (1.1‰)         11 (2.0‰)
carpals (•)           17 (2.7‰)      13 (2.1‰)         14 (2.2‰)
and Lindner et al. [5]. The MRF, which is solved by a message passing algorithm, uses N_c = 75 candidate locations (i.e. local accumulator maxima) per landmark as the possible states of the MRF variables. Quantitative results on multi-landmark localization reliability for the 2D hand data set can be found in Table 1. Since all our methods including EbnerRRF are based on the same local RRFs, accuracy is the same, with a median error of µ_E^hand = 0.51 mm, which is slightly better than the accuracy of Lindner et al. [5] (µ_E^hand = 0.64 mm).
4 Discussion and Conclusion
Single landmark RRF localization performance is highly influenced by both the selection of the area from which training pixels are drawn and the range of the hand-crafted features used to construct its forest decision rules, yet the exact influence is currently not fully understood. As shown in Fig. 2, the global CriminisiRRF method does not give accurate localization results (median error µ_E^hand = 2.98 mm), although it shows the capability to discriminate ambiguous structures due to its use of long range features and training pixels from all over the image. As a reason for the low accuracy, we identified the greedy node optimization, which favors long range features even at deep tree levels, where no ambiguity among the training pixels is present anymore. Our implementation of PeterRRF [7], which overcomes greedy node optimization by selecting an optimal feature range in each node, shows a strong improvement in localization accuracy (µ_E^hand = 0.89 mm). Still, it is not as accurate as the method of Ebner et al. [3], which uses a local RRF with short range features in the second stage (µ_E^hand = 0.51 mm), while also requiring a significantly larger number (around 25 times) of feature candidates per node. The
drawback of EbnerRRF is essentially the same as for localRRF if the area from which the local RRF training pixels are drawn, despite being reduced by the global RRF of the first stage, still contains neighboring, locally similar structures. To investigate the RRF's capability to reliably discriminate ambiguous structures while preserving the high accuracy of locally trained RRFs, we switch the order of the EbnerRRF stages, thus inverting their logic in the spirit of [5,2]. Therefore, we extended localRRF by adding a second training phase that uses long range features and differently selects the areas from which training pixels are drawn. While increasing the feature range in fAdaptRRF gives the same accuracy as localRRF (µ_E^hand = 0.51 mm), reliability is improved, but not as strongly as when introducing novel pixels into the second training phase.
Training on novel pixels is required to make feature selection more effective in
discriminating locally similar structures, but it is important to note that they do not participate in voting at testing time, since the accuracy obtained in the first phase would otherwise be lost. With our proposed backProjRRF we force the algorithm to explicitly learn from examples which are hard to discriminate, i.e. pixels belonging to locally similar structures, as opposed to fpAdaptRRF, where pixels are randomly drawn from the image. Results in Fig. 2 reveal that the highest reliability (0.172% and 7.07% outliers on the 2D hand and 3D teeth data sets, respectively) is obtained by backProjRRF, while still achieving the same accuracy as localRRF.
In a multi-landmark setting, RRF based localization can be combined with high level knowledge from an MRF or SSM as in [5,2]. Method comparison results from Table 1 show that our backProjRRF combined with an MRF model outperforms the state-of-the-art method of [5] on the hand data set in terms of accuracy and reliability. However, compared to localRRF, our backProjRRF shows no benefit when both are combined with a strong graphical MRF model. In cases where such a strong graphical model is not affordable, e.g. if expert annotations are limited (see subset configurations in Table 1), combining backProjRRF with an MRF shows much better results in terms of reliability compared to localRRF+MRF. This is especially prominent in the results for the radius and ulna landmarks. Moreover, Table 1 shows that even without incorporating an MRF model, the results of our backProjRRF are competitive with the state of the art methods when only limited high level knowledge is available (fingertips, radius/ulna, carpals). Thus, in conclusion, we have shown the capability of RRFs to successfully model locally similar structures by implicitly encoding the global landmark configuration while still maintaining high localization accuracy.
References
1. Criminisi, A., Robertson, D., Konukoglu, E., Shotton, J., Pathak, S., White, S.,
Siddiqui, K.: Regression forests for efficient anatomy detection and localization in
computed tomography scans. Med. Image Anal. 17(8), 1293–1303 (2013)
2. Donner, R., Menze, B.H., Bischof, H., Langs, G.: Global localization of 3D anatom-
ical structures by pre-filtered Hough Forests and discrete optimization. Med. Image
Anal. 17(8), 1304–1314 (2013)
3. Ebner, T., Štern, D., Donner, R., Bischof, H., Urschler, M.: Towards Automatic
Bone Age Estimation from MRI: Localization of 3D Anatomical Landmarks. In:
MICCAI 2014, Part II. LNCS, vol. 8674, pp. 421–428 (2014)
4. Glocker, B., Zikic, D., Konukoglu, E., Haynor, D.R., Criminisi, A.: Vertebra Local-
ization in Pathological Spine CT via Dense Classification from Sparse Annotations.
In: MICCAI 2013, Part II. LNCS, vol. 8150, pp. 262–270 (2013)
5. Lindner, C., Bromiley, P.A., Ionita, M.C., Cootes, T.F.: Robust and Accurate Shape
Model Matching using Random Forest Regression-Voting. IEEE Trans. PAMI 37,
1862–1874 (2015)
6. Lindner, C., Thomson, J., The arcOGEN Consortium, Cootes, T.F.: Learning-Based
Shape Model Matching: Training Accurate Models with Minimal Manual Input. In:
MICCAI 2015, Part III. LNCS, vol. 9351, pp. 580–587 (2015)
7. Peter, L., Pauly, O., Chatelain, P., Mateus, D., Navab, N.: Scale-Adaptive Forest
Training via an Efficient Feature Sampling Scheme. In: MICCAI 2015, Part I. LNCS,
vol. 9349, pp. 637–644 (2015)