From Local to Global Random
Regression Forests: Exploring Anatomical
Landmark Localization
Darko Štern1⋆, Thomas Ebner2, and Martin Urschler1,2,3
1Ludwig Boltzmann Institute for Clinical Forensic Imaging, Graz, Austria
2Institute for Computer Graphics and Vision, Graz University of Technology, Austria
3BioTechMed-Graz, Austria
Abstract. State of the art anatomical landmark localization algorithms
pair local Random Forest (RF) detection with disambiguation of locally
similar structures by including high level knowledge about relative land-
mark locations. In this work we pursue the question, how much high-level
knowledge is needed in addition to a single landmark localization RF to
implicitly model the global configuration of multiple, potentially ambigu-
ous landmarks. We further propose a novel RF localization algorithm
that distinguishes locally similar structures by automatically identifying
them, exploring the back-projection of the response from accurate local
RF predictions. In our experiments we show that this approach achieves
competitive results in single and multi-landmark localization when ap-
plied to 2D hand radiographic and 3D teeth MRI data sets. Additionally,
when combined with a simple Markov Random Field model, we are able
to outperform state of the art methods.
1 Introduction
Automatic localization of anatomical structures consisting of potentially ambigu-
ous (i.e. locally similar) landmarks is a crucial step in medical image analysis
applications like registration or segmentation. Lindner et al. [5] propose a state
of the art localization algorithm, which is composed of a sophisticated statistical
shape model (SSM) that locally detects landmark candidates by three step opti-
mization over a random forest (RF) response function. Similarly, Donner et al. [2]
use locally restricted classification RFs to generate landmark candidates, fol-
lowed by a Markov Random Field (MRF) optimizing their configuration. Thus,
in both approaches good RF localization accuracy is paired with disambiguation
of landmarks by including high-level knowledge about their relative location. A
different concept for localizing anatomical structures is from Criminisi et al. [1],
suggesting that the RF framework itself is able to learn global structure configu-
ration. This was achieved with random regression forests (RRF) using arbitrary
⋆ This work was supported by the province of Styria (HTI:Tech for Med ABT08-22-T-7/2013-13) and the Austrian Science Fund (FWF): P 28078-N33.
Fig. 1. Overview of our RRF based localization strategy. (a) 37 anatomical landmarks
in 2D hand X-ray images and differently colored MRF configurations. (b) In phase
1, RRF is trained locally on an area surrounding a landmark (radius R) with short
range features, resulting in accurate but ambiguous landmark predictions (c). (d) Back-
projection is applied to select pixels for training the RRF in phase 2 with larger feature
range (e). (f) Estimated landmarks by accumulating predictions of pixels in local neigh-
bourhood. (g,h) One of two independently predicted wisdom teeth from 3D MRI.
long range features and allowing pixels from all over the training image to glob-
ally vote for anatomical structures. Although roughly capturing global structure
configuration, their long range voting is inaccurate when pose variations are
present, which led to extending this concept with a graphical model [4]. Ebner
et al. [3] adapted the work of [1] for multiple landmark localization without the
need for an additional model and improved it by introducing a weighting of vot-
ing range at testing time and by adding a second RRF stage restricted to the
local area estimated by the global RRF. Despite putting more trust into the
surroundings of a landmark, their results crucially depend on empirically tuned
parameters defining the restricted area according to first stage estimation.
In this work we pursue the question, how much high-level knowledge is needed
in addition to a single landmark localization RRF to implicitly model the global
configuration of multiple, potentially ambiguous landmarks [6]. Investigating dif-
ferent RRF architectures, we propose a novel single landmark localization RRF
algorithm, robust to ambiguous, locally similar structures. When extended with
a simple MRF model, our RRF outperforms the current state of the art method
of Lindner et al. [5] on a challenging multi-landmark 2D hand radiographs data
set, while at the same time performing best in localizing single wisdom teeth
landmarks from 3D head MRI.
2 Method
Although being constrained by all surrounding objects, the location of an anatom-
ical landmark is most accurately defined by its neighboring structures. While
increasing the feature range leads to more surrounding objects being seen for
defining a landmark, enlarging the area from which training pixels are drawn
leads to the surrounding objects being able to participate in voting for a land-
mark location. We explore these observations and investigate the influence of
different feature and voting ranges, by proposing several RRF strategies for sin-
gle landmark localization. Following the ideas of Lindner et al. [5] and Donner et
al. [2], in the first phase of the proposed RRF architectures, the local surround-
ings of a landmark are accurately defined. The second RRF phase establishes
different algorithm variants by exploring distinct feature and voting ranges to
discriminate ambiguous, locally similar structures. In order to maintain the ac-
curacy achieved during the first RRF phase, locations outside of a landmark’s
local vicinity are recognized and banned from estimating the landmark location.
2.1 Training the RRF
We independently train an RRF for each anatomical landmark. Similar to [1, 3], at each node of the T trees of a forest, the set of pixels S_n reaching node n is pushed to the left (S_{n,L}) or right (S_{n,R}) child node according to the splitting decision made by thresholding a feature response for each pixel. Feature responses are calculated as differences between the mean image intensities of two rectangles with maximal size s and maximal offset o relative to a pixel position v_i, i ∈ S_n. Each node stores the feature and threshold selected from a pool of N_F randomly generated features and N_T thresholds that maximize the objective function I:
I = \sum_{i \in S_n} \left\| d_i - \bar{d}(S_n) \right\|^2 - \sum_{c \in \{L,R\}} \sum_{i \in S_{n,c}} \left\| d_i - \bar{d}(S_{n,c}) \right\|^2 .    (1)
For a pixel set S, d_i is the i-th voting vector, defined as the vector between the landmark position l and the pixel position v_i, while \bar{d}(S) is the mean voting vector of the pixels in S. For later testing, we store at each leaf node l the mean value d_l of the relative voting vectors of all pixels reaching l.
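As an illustration, the objective in Eq. (1) can be evaluated for a candidate split as follows (a minimal sketch in NumPy; the function and variable names are ours, not from the authors' implementation):

```python
import numpy as np

def split_objective(d, go_left):
    """Evaluate objective I of Eq. (1) for one candidate split.

    d       : (N, 2) array of voting vectors d_i = l - v_i for pixels in S_n
    go_left : (N,) boolean mask, True where the feature response falls below
              the candidate threshold (pixel goes to the left child)
    Returns the reduction in summed squared deviation of the voting vectors
    around their mean; a larger I means more compact child voting vectors.
    """
    def sse(vectors):
        # Sum of squared distances to the mean voting vector d̄(S)
        if len(vectors) == 0:
            return 0.0
        return float(np.sum((vectors - vectors.mean(axis=0)) ** 2))

    return sse(d) - (sse(d[go_left]) + sse(d[~go_left]))

# Toy example: a split that separates two clusters of voting vectors
# scores higher than one that mixes them.
d = np.array([[1.0, 0.0], [1.1, 0.1], [-1.0, 0.0], [-0.9, -0.1]])
go_left = np.array([True, True, False, False])
assert split_objective(d, go_left) > split_objective(d, np.array([True, False, True, False]))
```

During training, this score would be computed for each of the N_F × N_T feature/threshold candidates and the maximizer stored at the node.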
First training phase: Based on a set of pixels S_I, selected from the training images at locations inside a circle of radius R centered at the landmark position, the RRF is first trained locally with features whose rectangles have maximal size in each direction s_I and maximal offset o_I, see Fig. 1b. Training of this phase is finished when a maximal depth D_I is reached.
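Rectangle-difference feature responses of this kind are commonly computed with integral images; a minimal sketch under that assumption (the paper does not specify its implementation, and all names here are ours):

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero top row/column for O(1) rectangle sums."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_mean(ii, top, left, height, width):
    """Mean intensity of a rectangle, via four integral-image lookups."""
    s = (ii[top + height, left + width] - ii[top, left + width]
         - ii[top + height, left] + ii[top, left])
    return s / (height * width)

def feature_response(ii, v, off1, off2, size1, size2):
    """Difference of mean intensities of two rectangles placed at offsets
    off1/off2 relative to the pixel position v."""
    y, x = v
    return (rect_mean(ii, y + off1[0], x + off1[1], *size1)
            - rect_mean(ii, y + off2[0], x + off2[1], *size2))
```

For brevity the sketch omits image-boundary handling, which a real implementation would need for large offsets.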
Second training phase: Here, our novel algorithm variants are designed by implementing different strategies for how to deal with feature ranges and the selection of the area from which pixels are drawn during training. By pursuing the same local strategy as in the first phase for continuing training of the trees up to a maximal depth D_II, we establish the localRRF, similar to the RF part in [5, 2]. If we continue training to depth D_II with a restriction to the pixels S_I but additionally allow long range features with maximal offset o_II > o_I and maximal size s_II > s_I, we get fAdaptRRF. Another way of introducing long range features, while still keeping the same set of pixels S_I, was proposed for segmentation in Peter et al. [7]. They optimize the feature size and offset for each forest node instead
of the traditional greedy RF node training strategy. For later comparison, we have adapted the strategy from [7] to our localization task by training trees from the root node to a maximal depth D_II using this optimization. We denote it as PeterRRF. Finally, we propose two strategies where the feature range and the area from which to select pixels are increased in the second training phase. By continuing training to depth D_II, allowing large scale features (o_II, s_II) in the second phase and simultaneously extending the training pixels (set of pixels S_II) to the whole image, we get the fpAdaptRRF. Here S_II is determined by randomly sampling from pixels uniformly distributed in the image. The second strategy uses a different set of pixels S_II, selected according to back-projection images computed from the first training phase. This concept is a main contribution of our work, therefore the next paragraph describes it in more detail.
2.2 Pixel Selection by Back-projection Images
In the second training phase, pixels S_II from locally similar structures are explicitly introduced, since they provide information that may help in disambiguation. We automatically identify similar structures by applying the RRF from the first phase to all training images in a testing step as described in Section 2.3. Since these pixels are pushed through the first phase trees, pixels from the area surrounding the landmark as well as pixels with a locally similar appearance to the landmark end up in the first phase RRF's terminal nodes. The obtained accumulators show a high response on structures whose appearance is similar to the landmark's local appearance (see Fig. 1c). To identify the pixels voting for a high response, we calculate for each accumulator a back-projection image (see Fig. 1d), obtained by summing for each pixel v all accumulator values at the target voting positions v + d_l of all trees. We finalize our backProjRRF strategy by selecting for each tree the training pixels S_II as N_px pixels randomly sampled with a probability proportional to the back-projection image (see Fig. 1e).
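The back-projection computation might be sketched as follows for the 2D case (our own simplification; we assume the first phase accumulator and per-pixel leaf voting vectors are already available, and all names are ours):

```python
import numpy as np

def back_projection_image(accumulator, leaf_votes):
    """Sketch of the back-projection image of Section 2.2.

    accumulator : (H, W) response image produced by the first phase RRF
    leaf_votes  : list over trees, each an (H, W, 2) array holding for every
                  pixel v the voting vector d_l of the leaf it reached
    For each pixel v, accumulator values at its target voting positions
    v + d_l are summed over all trees; pixels that voted for strong maxima
    (e.g. on locally similar structures) receive high values.
    """
    h, w = accumulator.shape
    backproj = np.zeros_like(accumulator, dtype=float)
    ys, xs = np.mgrid[0:h, 0:w]
    for votes in leaf_votes:
        ty = np.clip(ys + np.rint(votes[..., 0]).astype(int), 0, h - 1)
        tx = np.clip(xs + np.rint(votes[..., 1]).astype(int), 0, w - 1)
        backproj += accumulator[ty, tx]
    return backproj

def sample_pixels(backproj, n_px, rng=np.random.default_rng(0)):
    """Draw the S_II training pixels with probability proportional to the
    back-projection image."""
    p = backproj.ravel() / backproj.sum()
    idx = rng.choice(backproj.size, size=n_px, p=p)
    return np.unravel_index(idx, backproj.shape)
```

In this sketch, votes falling outside the image are simply clipped to the border; a real implementation might discard them instead.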
2.3 Testing the RRF
During testing, all pixels of a previously unseen image are pushed through the RRF. Starting at the root node, pixels are passed recursively to the left or right child node according to the feature tests stored at the nodes until a leaf node is reached. The estimated location of the landmark, L(v), is calculated from the pixel position v and the relative voting vector d_l stored in the leaf node l. However, if the length of the voting vector, |d_l|, is larger than the radius R, i.e. pixel v is not in the area closely surrounding the landmark, the estimate is omitted from the accumulation of the landmark location predictions. Separately for each landmark, the pixels' estimates are stored in an accumulator image.
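The testing step, including the rejection of votes longer than R, can be sketched like this (an illustrative simplification; leaf_vote_fn stands in for the actual tree traversal and is our own abstraction):

```python
import numpy as np

def accumulate_votes(pixel_positions, leaf_vote_fn, image_shape, radius):
    """Sketch of the accumulation in Section 2.3.

    pixel_positions : (N, 2) integer pixel coordinates v
    leaf_vote_fn    : maps a pixel position to the list of voting vectors
                      d_l from the leaves it reaches (one per tree)
    Votes with |d_l| > radius are discarded, so only pixels the forest
    judged to lie close to the landmark contribute to the accumulator.
    """
    acc = np.zeros(image_shape)
    for v in pixel_positions:
        for d in leaf_vote_fn(tuple(v)):
            if np.linalg.norm(d) > radius:
                continue  # pixel is not in the landmark's local vicinity
            t = np.rint(v + d).astype(int)
            if 0 <= t[0] < image_shape[0] and 0 <= t[1] < image_shape[1]:
                acc[t[0], t[1]] += 1
    return acc
```

The landmark estimate is then read off as the maximum (or a local maximum, in the multi-candidate MRF setting) of the accumulator.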
2.4 MRF Model
For multi-landmark localization, high-level knowledge about landmark configu-
ration may be used to further improve disambiguation between locally similar
structures. An MRF selects the best candidate for each landmark according to the RRF accumulator values and a geometric model of the relative distances between landmarks, see Fig. 1a. In the MRF model, each landmark L_i corresponds to one variable, while candidate locations, selected as the N_c strongest maxima in the landmark's accumulator, determine the possible states of a variable. The landmark configuration is obtained by optimizing the energy function

E(L) = \sum_{i=1}^{N_L} U_i(L_i) + \sum_{\{i,j\} \in C} P_{i,j}(L_i, L_j),    (2)

where the unary term U_i is set to the RRF accumulator value of candidate L_i, and the relative distances of two landmarks from the training annotations define the pairwise term P_{i,j}, modeled as normal distributions for landmark pairs in the set C.
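Evaluating the energy of Eq. (2) for a given configuration is straightforward; the sketch below uses exhaustive search over toy terms purely for illustration, whereas the paper solves the MRF with message passing (the sign convention, minimizing negated accumulator values and distance log-likelihoods, is our assumption):

```python
import itertools

def mrf_energy(states, unary, pairwise, edges):
    """Energy E(L) of Eq. (2) for one landmark configuration.

    states   : chosen candidate index per landmark
    unary    : unary[i][s], e.g. negated accumulator value of candidate s
    pairwise : pairwise[(i, j)][s_i][s_j], e.g. negated log-likelihood of the
               candidate distance under the learned normal distribution
    edges    : landmark pairs C of the model
    """
    e = sum(unary[i][s] for i, s in enumerate(states))
    e += sum(pairwise[(i, j)][states[i]][states[j]] for (i, j) in edges)
    return e

def brute_force_mrf(unary, pairwise, edges):
    """Exhaustive minimization; only feasible for tiny toy models."""
    n_states = [len(u) for u in unary]
    return min(itertools.product(*map(range, n_states)),
               key=lambda s: mrf_energy(s, unary, pairwise, edges))
```

With N_L = 37 landmarks and N_c = 75 candidates each, exhaustive search is of course intractable, which is why a message passing solver is used in practice.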
3 Experimental Setup and Results
We evaluate the performance of our landmark localization RRF variants on data sets of 2D hand X-ray images and 3D MR images of human teeth. As evaluation measure, we use the Euclidean distance between the ground truth and the estimated landmark position. To measure reliability, the number of outliers, defined as localization errors larger than 10 mm for hand landmarks and 7 mm for teeth, is calculated. For both data sets, which were normalized in intensity by performing histogram matching, we perform a three-fold cross-validation, splitting the data into 66% training and 33% testing data, respectively.
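The evaluation measures can be sketched as follows (our own helper functions, not taken from the paper's code):

```python
import numpy as np

def localization_errors(predicted, ground_truth):
    """Euclidean distance per landmark between prediction and annotation."""
    return np.linalg.norm(np.asarray(predicted) - np.asarray(ground_truth), axis=-1)

def outlier_rate(errors, threshold_mm):
    """Fraction of localizations with an error above the threshold
    (10 mm for hand landmarks, 7 mm for teeth)."""
    errors = np.asarray(errors)
    return np.count_nonzero(errors > threshold_mm) / errors.size
```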
Hand Dataset consists of 895 2D X-ray hand images publicly available from the Digital Hand Atlas Database1. Since the images lack physical pixel resolution information, we assume a wrist width of 50 mm, resample the images to a height of 1250 pixels, and normalize image distances according to the wrist width as defined by the ground-truth annotation of two landmarks (see Fig. 1a). For evaluation, N_L = 37 landmarks, many of them showing locally similar structures, e.g. finger tips or joints between the bones, were manually annotated by three experts.
Teeth Dataset consists of 280 3D proton density weighted MR images of
left or right side of the head. In the latter case, images were mirrored to create a
consistent data set of images with 208 x 256 x 30 voxels and a physical resolution
of 0.59 x 0.59 x 1 mm per voxel. Specifying their center locations, two wisdom
teeth per data set were annotated by a dentist. Localization of wisdom teeth is
challenging due to the presence of other locally similar molars (see Fig. 1g).
Experimental setup: For each method described in Section 2, an RRF consisting of T = 7 trees is built separately for every landmark. The first RRF phase is trained using pixels from the training images within a range of R = 10 mm around each landmark position. The splitting criterion for each node is greedily optimized with N_F = 20 candidate features and N_T = 10 candidate thresholds, except for PeterRRF. The random feature rectangles are defined by a maximal size in each direction s_I = 1 mm and a maximal offset o_I = R. In the second RRF phase, N_px = 10000 pixels are introduced and the feature range is increased to a maximal feature size s_II = 50 mm and a maximal offset in each direction o_II = 50 mm.
1 Available from http://www.ipilab.org/BAAweb/, as of Jan. 2016
Fig. 2. Cumulative localization error distributions on the hand and teeth data sets for CriminisiRRF, EbnerRRF, localRRF, PeterRRF, fAdaptRRF, fpAdaptRRF, and backProjRRF.
Treating each landmark independently on both the 2D hand and the 3D teeth data set, the single-landmark experiments show the performance of the methods in cases where it is not feasible (due to lack of annotations) or not semantically meaningful (e.g. third vs. other molars) to define all available locally similar
structures. We compare our algorithms that start with local feature scale ranges and increase to more global scale ranges (localRRF, fAdaptRRF, PeterRRF, fpAdaptRRF, backProjRRF) with reimplementations of two related works that start from global feature scale ranges (CriminisiRRF [1], with maximal feature size s_II and offset o_II, using pixels uniformly distributed over the image) and optionally decrease to more local ranges (EbnerRRF [3]). The first training phase stops for all methods at D_I = 13, while the second phase continues training within the same trees until D_II = 25. To ensure a fair comparison, we use the same RRF parameters for all methods, except for the number of candidate features in PeterRRF, which was set to N_F = 500 as suggested in [7]. Cumulative error distribution results of the single-landmark experiments can be found in Fig. 2.
Table 1 shows quantitative localization results regarding reliability for all hand
landmarks and for subset configurations (fingertips, carpals, radius/ulna).
The multi-landmark experiments allow us to investigate the benefits
of adding high level knowledge about landmark configuration via an MRF to
the prediction. In addition to our reimplementation of the related works [1, 3], Lindner et al. [5] applied their code to our hand data set, using D_I = 25 in their implementation of the local RF stage. To allow a fair comparison with Lindner et al. [5], we modify our two training phases by training two separate forests for the two stages until maximum depths D_I = D_II = 25, instead of continuing training the trees of a single forest. Thus, we investigate our presented backProjRRF, the combination of backProjRRF with an MRF, localRRF combined with an MRF, and the two state of the art methods from Ebner et al. [3] (EbnerRRF)
Table 1. Multi-landmark localization reliability results on hand radiographs for all
landmarks and subset configurations (compare Fig. 1 for configuration colors).
method               mean ± std [mm]    outliers
EbnerRRF             0.97 ± 2.45        228 (6.89‰)
Lindner et al. [5]   0.85 ± 1.01         20 (0.60‰)
localRRF+MRF         0.80 ± 0.91         14 (0.42‰)
backProj             0.84 ± 1.58         57 (1.72‰)
backProj+MRF         0.80 ± 0.91         15 (0.45‰)

landmark subset       localRRF+MRF    backProj+MRF    backProj
configuration
full                   14 (0.4‰)       15 (0.5‰)      57 (1.7‰)
fingertips             14 (3.1‰)        5 (1.1‰)      17 (3.8‰)
radius, ulna          495 (92.2‰)       6 (1.1‰)      11 (2.0‰)
carpals                17 (2.7‰)       13 (2.1‰)      14 (2.2‰)
and Lindner et al. [5]. The MRF, which is solved by a message passing algorithm, uses N_c = 75 candidate locations (i.e. local accumulator maxima) per landmark as the possible states of the MRF variables. Quantitative results on multi-landmark localization reliability for the 2D hand data set can be found in Table 1. Since all our methods including EbnerRRF are based on the same local RRFs, accuracy is the same, with a median error of μ_E^hand = 0.51 mm, which is slightly better than the accuracy of Lindner et al. [5] (μ_E^hand = 0.64 mm).
4 Discussion and Conclusion
Single landmark RRF localization performance is highly influenced both by the selection of the area from which training pixels are drawn and by the range of hand-crafted features used to construct its forest decision rules, yet the exact influence is currently not fully understood. As shown in Fig. 2, the global CriminisiRRF method does not give accurate localization results (median error μ_E^hand = 2.98 mm), although it shows the capability to discriminate ambiguous structures due to the use of long range features and training pixels from all over the image. As a reason for the low accuracy we identified the greedy node optimization, which favors long range features even at deep tree levels when no ambiguity among training pixels is present anymore. Our implementation of PeterRRF [7], which overcomes greedy node optimization by selecting an optimal feature range in each node, shows a strong improvement in localization accuracy (μ_E^hand = 0.89 mm). Still, it is not as accurate as the method of Ebner et al. [3], which uses a local RRF with short range features in the second stage (μ_E^hand = 0.51 mm), while also requiring a significantly larger number (around 25 times) of feature candidates per node. The
drawback of EbnerRRF is essentially the same as for localRRF if the area from which the local RRF training pixels are drawn, despite being reduced by the global RRF of the first stage, still contains neighboring, locally similar structures. To investigate the RRF's capability to discriminate ambiguous structures reliably while preserving the high accuracy of locally trained RRFs, we switch the order of the EbnerRRF stages, thus inverting their logic in the spirit of [5, 2]. Therefore, we extended localRRF by adding a second training phase that uses long range features for accurate localization and differently selects the areas from which training pixels are drawn. While increasing the feature range in fAdaptRRF shows the same accuracy as localRRF (μ_E^hand = 0.51 mm), reliability is improved, but not as strongly as when introducing novel pixels into the second training phase.
Training on novel pixels is required to make feature selection more effective in
discriminating locally similar structures, but it is important to note that they do
not participate in voting at testing time since the accuracy obtained in the first
phase would be lost. With our proposed backProjRRF we force the algorithm to
explicitly learn from examples which are hard to discriminate, i.e. pixels belonging to locally similar structures, as opposed to fpAdaptRRF, where pixels are randomly drawn from the image. Results in Fig. 2 reveal that the highest reliability (0.172% and 7.07% outliers on the 2D hand and 3D teeth data sets, respectively) is obtained by backProjRRF, while still achieving the same accuracy as localRRF.
In a multi-landmark setting, RRF based localization can be combined with
high level knowledge from an MRF or SSM as in [5, 2]. Method comparison results from Table 1 show that our backProjRRF combined with an MRF model outperforms the state-of-the-art method of [5] on the hand data set in terms of accuracy and reliability. However, compared to localRRF, our backProjRRF shows no benefit when both are combined with a strong graphical MRF model. In cases where such a strong graphical model is unaffordable, e.g. if expert annotations are limited (see subset configurations in Table 1), combining backProjRRF with an MRF shows much better results in terms of reliability compared to localRRF+MRF. This is especially prominent in the results for the radius and ulna landmarks. Moreover, Table 1 shows that even without incorporating an MRF model, the results of our backProjRRF are competitive with the state of the art methods when limited high level knowledge is available (fingertips, radius/ulna, carpals). Thus, in conclusion, we have shown the capability of the RRF to successfully model locally similar structures by implicitly encoding global landmark configuration while still maintaining high localization accuracy.
References
1. Criminisi, A., Robertson, D., Konukoglu, E., Shotton, J., Pathak, S., White, S.,
Siddiqui, K.: Regression forests for efficient anatomy detection and localization in
computed tomography scans. Med. Image Anal. 17(8), 1293–1303 (2013)
2. Donner, R., Menze, B.H., Bischof, H., Langs, G.: Global localization of 3D anatom-
ical structures by pre-filtered Hough Forests and discrete optimization. Med. Image
Anal. 17(8), 1304–1314 (2013)
3. Ebner, T., Štern, D., Donner, R., Bischof, H., Urschler, M.: Towards Automatic
Bone Age Estimation from MRI: Localization of 3D Anatomical Landmarks. In:
MICCAI 2014, Part II. LNCS, vol. 8674, pp. 421–428 (2014)
4. Glocker, B., Zikic, D., Konukoglu, E., Haynor, D.R., Criminisi, A.: Vertebra Local-
ization in Pathological Spine CT via Dense Classification from Sparse Annotations.
In: MICCAI 2013, Part II. LNCS, vol. 8150, pp. 262–270 (2013)
5. Lindner, C., Bromiley, P.A., Ionita, M.C., Cootes, T.F.: Robust and Accurate Shape
Model Matching using Random Forest Regression-Voting. IEEE Trans. PAMI 37,
1862–1874 (2015)
6. Lindner, C., Thomson, J., The arcOGEN Consortium, Cootes, T.: Learning-Based
Shape Model Matching: Training Accurate Models with Minimal Manual Input. In:
MICCAI 2015, Part III. LNCS, vol. 9351, pp. 580–587 (2015)
7. Peter, L., Pauly, O., Chatelain, P., Mateus, D., Navab, N.: Scale-Adaptive Forest
Training via an Efficient Feature Sampling Scheme. In: MICCAI 2015, Part I. LNCS,
vol. 9349, pp. 637–644 (2015)
... Specifically, the connections among landmarks describing a 3-D object are established according to their cost to the resulting shape representation and by considering concepts from transportation theory. Klinder et al. [16] first reported an automatically whole-spine vertebral bone identification, detection, segmentation method in CT images.Štern et al. [5] investigated different RRF architectures and proposed a novel random forest localization algorithm. The proposed algorithm can implicitly model the global configuration of multiple, potentially ambiguous landmarks and distinguish locally similar structures by automatically identifying and exploring the back-projection of the response from accurate local RF predictions. ...
... CoorTransforme-based detector outperforms state-of-the-art landmark detection methods including Ibragimov et al. [4], Stern et al. [5], Lindner et al. [6], Urschler et al. [7], Payer et al. [29], and Zhu et al. [8]. The results of the CoorTransformer and comparison methods are shown in Table I. ...
Preprint
Full-text available
Heatmap-based anatomical landmark detection is still facing two unresolved challenges: 1) inability to accurately evaluate the distribution of heatmap; 2) inability to effectively exploit global spatial structure information. To address the computational inability challenge, we propose a novel position-aware and sample-aware central loss. Specifically, our central loss can absorb position information, enabling accurate evaluation of the heatmap distribution. More advanced is that our central loss is sample-aware, which can adaptively distinguish easy and hard samples and make the model more focused on hard samples while solving the challenge of extreme imbalance between landmarks and non-landmarks. To address the challenge of ignoring structure information, a Coordinated Transformer, called CoorTransformer, is proposed, which establishes long-range dependencies under the guidance of landmark coordination information, making the attention more focused on the sparse landmarks while taking advantage of global spatial structure. Furthermore, CoorTransformer can speed up convergence, effectively avoiding the defect that Transformers have difficulty converging in sparse representation learning. Using the advanced CoorTransformer and central loss, we propose a generalized detection model that can handle various scenarios, inherently exploiting the underlying relationship between landmarks and incorporating rich structural knowledge around the target landmarks. We analyzed and evaluated CoorTransformer and central loss on three challenging landmark detection tasks. The experimental results show that our CoorTransformer outperforms state-of-the-art methods, and the central loss significantly improves the performance of the model with p-values< 0.05.
... 8,9 Most of the available solutions for landmark detection rely on machine learning, [10][11][12] however, previous methods have been proposed for other image modalities and have not been validated for Cone-beam computed tomography (CBCT) scans with various imaging acquisition protocols to lower radiation dose in dentistry. Other approaches for landmark identification rely on sub-optimal search strategies, i.e., exhaustive scanning, 11,12 oneshot displacement estimation, 13,14 or end-to-end image mapping techniques. 15,16 In many cases, these methods can lead to falsepositive detection results and excessively high computation times. ...
Article
Objective: To present and validate an open-source fully automated landmark placement (ALICBCT) tool for cone-beam computed tomography scans. Material and methods: One hundred and forty-three large and medium field of view cone-beam computed tomography (CBCT) were used to train and test a novel approach, called ALICBCT that reformulates landmark detection as a classification problem through a virtual agent placed inside volumetric images. The landmark agents were trained to navigate in a multi-scale volumetric space to reach the estimated landmark position. The agent movements decision relies on a combination of DenseNet feature network and fully connected layers. For each CBCT, 32 ground truth landmark positions were identified by 2 clinician experts. After validation of the 32 landmarks, new models were trained to identify a total of 119 landmarks that are commonly used in clinical studies for the quantification of changes in bone morphology and tooth position. Results: Our method achieved a high accuracy with an average of 1.54±0.87 mm error for the 32 landmark positions with rare failures, taking an average of 4.2s computation time to identify each landmark in one large 3D-CBCT scan using a conventional GPU. Conclusion: The ALICBCT algorithm is a robust automatic identification tool that has been deployed for clinical and research use as an extension in the 3D Slicer platform allowing continuous updates for increased precision.
Article
Full-text available
This study aims to develop a proficient and clinically applicable algorithm that can accurately assess bone age. This algorithm is based on the principles of the Tanner‐Whitehouse 3 (TW3) integral approach, and aims to achieve efficiency, scalability, and interpretability. We developed a model for bone age prediction in children. The model was tested on a pediatric dataset from a tertiary care hospital consisting of left‐hand radiographs of children between the age of 0 and 18. Our model consists of removing the arm portion using a pre‐trained YOLO network, localizing 37 key points in the hand bone portion using a spatial configuration network, and segmenting the original image through 20 of these points to obtain 20 fixed‐size patches. Finally, each of the 20 bone images is classified by training a visual transformer (ViT) model. In this study, a hybrid network, SVTNet, was developed that incorporates visual transformers to obtain estimates of bone age in the carpal (C series) and metacarpal (RUS series) bones. The sum of the clinical TW3 scoring region scores and bone maturity scores were utilized to determine the bone age for each corresponding region. The performance of the algorithm was evaluated in terms of both training and testing by evaluating 3871 left hand X‐ray micrographs obtained from a tertiary hospital in China. The results showed that the average absolute error of bone age estimation was 0.50 years for the RUS series of bones and 0.47 years for the C series of bones. The main contribution of this study is to propose, for the first time, a ViT‐based bone age assessment method that automates the entire process of the TW3 algorithm and is clinically interpretable, with predictive accuracy comparable to that of an experienced orthopedic surgeon.
Chapter
Accurate localization of anatomical landmarks has a critical role in clinical diagnosis, treatment planning, and research. Most existing deep learning methods for anatomical landmark localization rely on heatmap regression-based learning, which generates label representations as 2D Gaussian distributions centered at the labeled coordinates of each of the landmarks and integrates them into a single spatial resolution heatmap. However, the accuracy of this method is limited by the resolution of the heatmap, which restricts its ability to capture finer details. In this study, we introduce a multiresolution heatmap learning strategy that enables the network to capture semantic feature representations precisely using multiresolution heatmaps generated from the feature representations at each resolution independently, resulting in improved localization accuracy. Moreover, we propose a novel network architecture called hybrid transformer-CNN (HTC), which combines the strengths of both CNN and vision transformer models to improve the network’s ability to effectively extract both local and global representations. Extensive experiments demonstrated that our approach outperforms state-of-the-art deep learning-based anatomical landmark localization networks on the numerical XCAT 2D projection images and two public X-ray landmark detection benchmark datasets. Our code is available at https://github.com/seriee/Multiresolution-HTC.git.
Article
Full-text available
Objective and Impact Statement . In this work, we develop a universal anatomical landmark detection model which learns once from multiple datasets corresponding to different anatomical regions. Compared with the conventional model trained on a single dataset, this universal model not only is more lightweight and easier to train but also improves the accuracy of the anatomical landmark location. Introduction . The accurate and automatic localization of anatomical landmarks plays an essential role in medical image analysis. However, recent deep learning-based methods only utilize limited data from a single dataset. It is promising and desirable to build a model learned from different regions which harnesses the power of big data. Methods . Our model consists of a local network and a global network, which capture local features and global features, respectively. The local network is a fully convolutional network built up with depth-wise separable convolutions, and the global network uses dilated convolution to enlarge the receptive field to model global dependencies. Results . We evaluate our model on four 2D X-ray image datasets totaling 1710 images and 72 landmarks in four anatomical regions. Extensive experimental results show that our model improves the detection accuracy compared to the state-of-the-art methods. Conclusion . Our model makes the first attempt to train a single network on multiple datasets for landmark detection. Experimental results qualitatively and quantitatively show that our proposed model performs better than other models trained on multiple datasets and even better than models trained on a single dataset separately.
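The reason dilated convolutions enlarge the receptive field, as exploited by the global network above, can be illustrated with a small receptive-field calculator; the layer configurations below are illustrative, not the architecture from the paper.

```python
# Sketch of receptive-field growth: with stride 1, each convolution layer
# adds (kernel_size - 1) * dilation to the receptive field, so exponentially
# increasing dilations grow the receptive field far faster than plain convs.
def receptive_field(layers):
    """layers: list of (kernel_size, dilation) pairs, stride 1 throughout."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

plain = receptive_field([(3, 1)] * 4)                     # four ordinary 3x3 convs
dilated = receptive_field([(3, 1), (3, 2), (3, 4), (3, 8)])  # exponential dilations
print(plain, dilated)  # 9 vs 31
```

With the same number of 3x3 layers and parameters, the dilated stack covers more than three times the spatial context, which is what lets the global network model dependencies between distant landmarks.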
Chapter
Localization of coronary ostia landmarks in Computed Tomography Angiography (CTA) volumes is a crucial step in developing various automatic diagnostic procedures. In this study, we propose a one-step method of coronary ostia landmark localization that utilizes a residual U-Net with heatmap matching and 3D Differentiable Spatial to Numerical Transform (DSNT). We evaluate the method using two datasets: a Coronary Computed Tomography Angiography (CCTA) dataset containing 201 scans and a publicly available ImageTBAD dataset containing 77 CTA scans annotated with coronary ostia landmarks. On the CCTA dataset we report median Euclidean distance errors of 1.14 mm for the left coronary ostium and 0.98 mm for the right coronary ostium. On the ImageTBAD CTA dataset we report median Euclidean distance errors of 3.48 mm for the left coronary ostium and 2.97 mm for the right coronary ostium. Our evaluation shows that the proposed method improves accuracy of coronary ostia landmark localization when compared to other known methods.
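The core idea of the DSNT mentioned above is to turn a heatmap into a numerical coordinate differentiably, by normalizing the heatmap to a probability map and taking the expected coordinate. The 2D NumPy sketch below uses a softmax normalization as one common variant; the actual method operates on 3D network outputs.

```python
# Sketch of the Differentiable Spatial to Numerical Transform (DSNT) idea:
# softmax-normalize a heatmap into a probability map, then compute the
# expected (row, col) coordinate. Unlike argmax, this is differentiable and
# yields sub-pixel coordinates.
import numpy as np

def dsnt_2d(heatmap):
    """Expected (row, col) coordinate under the softmax of the heatmap."""
    p = np.exp(heatmap - heatmap.max())  # subtract max for numerical stability
    p /= p.sum()
    rows = np.arange(heatmap.shape[0])[:, None]
    cols = np.arange(heatmap.shape[1])[None, :]
    return float((p * rows).sum()), float((p * cols).sum())

h = np.zeros((32, 32))
h[10, 20] = 12.0  # sharp peak -> expectation lands close to (10, 20)
print(dsnt_2d(h))
```

The expectation is pulled slightly toward the image center by the uniform background mass, which is why in practice the network learns to produce sharply peaked heatmaps.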
Conference Paper
Full-text available
Recent work has shown that statistical model-based methods lead to accurate and robust results when applied to the segmentation of bone shapes from radiographs. To achieve good performance, model-based matching systems require large numbers of annotations, which can be very time-consuming to obtain. Non-rigid registration can be applied to unlabelled images to obtain correspondences from which models can be built. However, such models are rarely as effective as those built from careful manual annotations, and the accuracy of the registration is hard to measure. In this paper, we show that small numbers of manually annotated points can be used to guide the registration, leading to significant improvements in performance of the resulting model matching system, and achieving results close to those of a model built from dense manual annotations. Placing such sparse points manually is much less time-consuming than a full dense annotation, allowing good models to be built for new bone shapes more quickly than before. We describe detailed experiments on varying the number of sparse points, and demonstrate that manually annotating fewer than 30% of the points is sufficient to create robust and accurate models for segmenting hip and knee bones in radiographs. The proposed method includes a very effective and novel way of estimating registration accuracy in the absence of ground truth.
Article
Full-text available
A widely used approach for locating points on deformable objects in images is to generate feature response images for each point, and then to fit a shape model to these response images. We demonstrate that Random Forest regression-voting can be used to generate high quality response images quickly. Rather than using a generative or a discriminative model to evaluate each pixel, a regressor is used to cast votes for the optimal position of each point. We show that this leads to fast and accurate shape model matching when applied in the Constrained Local Model framework. We evaluate the technique in detail, and compare it with a range of commonly used alternatives across application areas: the annotation of the joints of the hands in radiographs and the detection of feature points in facial images. We show that our approach outperforms alternative techniques, achieving what we believe to be the most accurate results yet published for hand joint annotation and state-of-the-art performance for facial feature point detection.
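The regression-voting scheme described above can be sketched as follows. The trained Random Forest regressor is replaced by a stand-in that returns the true offset plus noise; image size, landmark position, and noise range are illustrative.

```python
# Sketch of Random Forest regression voting: each pixel predicts an offset to
# the landmark and casts a vote into an accumulator; the peak of the
# accumulated response image is the predicted landmark position.
import numpy as np

rng = np.random.default_rng(0)
shape, true_lm = (48, 48), np.array([30, 18])

votes = np.zeros(shape)
for r in range(shape[0]):
    for c in range(shape[1]):
        # Stand-in for the trained regressor: true offset plus integer noise.
        offset = true_lm - np.array([r, c]) + rng.integers(-2, 3, size=2)
        vr, vc = r + offset[0], c + offset[1]
        if 0 <= vr < shape[0] and 0 <= vc < shape[1]:
            votes[vr, vc] += 1.0

peak = np.unravel_index(np.argmax(votes), votes.shape)
print(peak)  # near the true landmark (30, 18)
```

Even though every individual vote is noisy, accumulating thousands of votes concentrates the response around the true position, which is what makes the response images "high quality" for the subsequent shape model fit.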
Conference Paper
Full-text available
Bone age estimation (BAE) is an important procedure in forensic practice which recently has seen a shift in attention from X-ray to MRI based imaging. To automate BAE from MRI, localization of the joints between hand bones is a crucial first step, which is challenging due to anatomical variations, different poses and repeating structures within the hand. We propose a landmark localization algorithm using multiple random regression forests, first analyzing the shape of the hand from information of the whole image, thus implicitly modeling the global landmark configuration, followed by a refinement based on more local information to increase prediction accuracy. We are able to clearly outperform related approaches on our dataset of 60 T1-weighted MR images, achieving a mean landmark localization error of 1.4 ± 1.5mm, while having only 0.25% outliers with an error greater than 10mm.
Article
Full-text available
The accurate localization of anatomical landmarks is a challenging task, often solved by domain specific approaches. We propose a method for the automatic localization of landmarks in complex, repetitive anatomical structures. The key idea is to combine three steps: (1) a classifier for pre-filtering anatomical landmark positions that (2) are refined through a Hough regression model, together with (3) a parts-based model of the global landmark topology to select the final landmark positions. During training landmarks are annotated in a set of example volumes. A classifier learns local landmark appearance, and Hough regressors are trained to aggregate neighborhood information to a precise landmark coordinate position. A non-parametric geometric model encodes the spatial relationships between the landmarks and derives a topology which connects mutually predictive landmarks. During the global search we classify all voxels in the query volume, and perform regression-based agglomeration of landmark probabilities to highly accurate and specific candidate points at potential landmark locations. We encode the candidates’ weights together with the conformity of the connecting edges to the learnt geometric model in a Markov Random Field (MRF). By solving the corresponding discrete optimization problem, the most probable location for each model landmark is found in the query volume. We show that this approach is able to consistently localize the model landmarks despite the complex and repetitive character of the anatomical structures on three challenging data sets (hand radiographs, hand CTs, and whole body CTs), with a median localization error of 0.80 mm, 1.19 mm and 2.71 mm, respectively.
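The final MRF selection step described above picks one candidate per landmark by trading off candidate weights (unary costs) against conformity with the learnt geometry (pairwise costs). For a chain topology the discrete optimization can be solved exactly by dynamic programming; the candidates, costs, and expected offsets below are illustrative toy values.

```python
# Sketch of MRF-based candidate selection on a chain: unary costs come from
# the regressor response, pairwise costs penalize deviation from a learnt
# mean offset between connected landmarks. Solved with Viterbi-style DP.
import numpy as np

# Candidates per landmark: (position, unary cost). Values are illustrative.
candidates = [
    [((10, 10), 0.2), ((40, 12), 0.1)],  # landmark 0
    [((20, 10), 0.3), ((50, 14), 0.2)],  # landmark 1
    [((30, 11), 0.1)],                   # landmark 2
]
expected_offset = [(10, 0), (10, 0)]  # learnt mean offset along the chain

def pairwise(p, q, mean):
    d = np.subtract(q, p) - mean
    return 0.01 * float(d @ d)

def solve_chain(cands, offsets):
    """Viterbi-style DP over candidate labels on a chain MRF."""
    cost = [u for _, u in cands[0]]
    back = []
    for i in range(1, len(cands)):
        new_cost, ptr = [], []
        for q, uq in cands[i]:
            best = min(range(len(cands[i - 1])),
                       key=lambda j: cost[j] + pairwise(cands[i - 1][j][0], q,
                                                        offsets[i - 1]))
            ptr.append(best)
            new_cost.append(cost[best] +
                            pairwise(cands[i - 1][best][0], q, offsets[i - 1]) + uq)
        cost, back = new_cost, back + [ptr]
    j = int(np.argmin(cost))
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [cands[i][path[i]][0] for i in range(len(cands))]

print(solve_chain(candidates, expected_offset))  # [(10, 10), (20, 10), (30, 11)]
```

Note that the geometrically consistent chain wins even though the competing candidates have lower unary costs individually; this is exactly how the MRF disambiguates repetitive structures. The non-parametric topology in the paper is a general graph, for which the authors solve the corresponding discrete optimization problem rather than this chain special case.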
Conference Paper
In the context of forest-based segmentation of medical data, modeling the visual appearance around a voxel requires the choice of the scale at which contextual information is extracted, which is of crucial importance for the final segmentation performance. Building on Haar-like visual features, we introduce a simple yet effective modification of the forest training which automatically infers the most informative scale at each stage of the procedure. Instead of the standard uniform sampling during node split optimization, our approach draws candidate features sequentially in a fine-to-coarse fashion. While being very easy to implement, this alternative is free of additional parameters, has the same computational cost as a standard training and shows consistent improvements on three medical segmentation datasets with very different properties.
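The sampling modification described above can be caricatured as follows: instead of drawing candidate feature scales uniformly at random during node-split optimization, candidates are drawn sequentially from fine to coarse. The scale set and candidate budget below are illustrative, and this sketch only captures the ordering idea, not the full training procedure.

```python
# Sketch of fine-to-coarse candidate sampling for Haar-like feature scales:
# the standard approach draws scales uniformly at random; the alternative
# walks the scales from finest to coarsest within the candidate budget.
import random

random.seed(0)
SCALES = [2, 4, 8, 16, 32]  # Haar-like feature box sizes in voxels

def uniform_candidates(n):
    """Standard uniform sampling over all scales."""
    return [random.choice(SCALES) for _ in range(n)]

def fine_to_coarse_candidates(n):
    """Spend an equal share of the budget at each scale, finest first."""
    per_scale = -(-n // len(SCALES))  # ceiling division
    out = []
    for s in SCALES:
        out.extend([s] * per_scale)
        if len(out) >= n:
            break
    return out[:n]

print(fine_to_coarse_candidates(10))  # [2, 2, 4, 4, 8, 8, 16, 16, 32, 32]
```

The appeal noted in the abstract is that this ordering adds no parameters and no extra computation over uniform sampling, while biasing early split candidates toward fine, local context.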
Conference Paper
Accurate localization and identification of vertebrae in spinal imaging is crucial for the clinical tasks of diagnosis, surgical planning, and post-operative assessment. The main difficulties for automatic methods arise from the frequent presence of abnormal spine curvature, small field of view, and image artifacts caused by surgical implants. Many previous methods rely on parametric models of appearance and shape whose performance can substantially degrade for pathological cases. We propose a robust localization and identification algorithm which builds upon supervised classification forests and avoids an explicit parametric model of appearance. We overcome the tedious requirement for dense annotations by a semi-automatic labeling strategy. Sparse centroid annotations are transformed into dense probabilistic labels which capture the inherent identification uncertainty. Using the dense labels, we learn a discriminative centroid classifier based on local and contextual intensity features which is robust to typical characteristics of spinal pathologies and image artifacts. Extensive evaluation is performed on a challenging dataset of 224 spine CT scans of patients with varying pathologies including high-grade scoliosis, kyphosis, and presence of surgical implants. Additionally, we test our method on a heterogeneous dataset of another 200, mostly abdominal, CTs. Quantitative evaluation is carried out with respect to localization errors and identification rates, and compared to a recently proposed method. Our approach is efficient and outperforms state-of-the-art on pathological cases.
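The transformation of sparse centroid annotations into dense probabilistic labels can be sketched in 2D with NumPy: each voxel gets a soft label weight that decays with distance to the annotated centroid, capturing identification uncertainty. The image size, centroid positions, and sigma are illustrative, and the paper's actual labeling strategy operates on 3D CT volumes.

```python
# Sketch of sparse-to-dense probabilistic labeling: one soft label map per
# annotated centroid (Gaussian distance weighting) plus a background map,
# normalized so the per-voxel label weights sum to 1.
import numpy as np

def dense_labels(shape, centroids, sigma=4.0):
    """Per-centroid soft label maps plus background, normalized per voxel."""
    rows = np.arange(shape[0])[:, None]
    cols = np.arange(shape[1])[None, :]
    maps = []
    for (cr, cc) in centroids:
        d2 = (rows - cr) ** 2 + (cols - cc) ** 2
        maps.append(np.exp(-d2 / (2.0 * sigma ** 2)))
    fg = np.stack(maps)                     # (num_centroids, H, W)
    bg = np.clip(1.0 - fg.sum(axis=0), 0.0, 1.0)
    dense = np.concatenate([bg[None], fg])  # background is channel 0
    return dense / dense.sum(axis=0, keepdims=True)

labels = dense_labels((32, 32), [(8, 8), (24, 20)])
print(labels.shape)  # (3, 32, 32): background + one channel per centroid
```

Training a classifier on these soft labels, rather than hard one-voxel annotations, is what lets the centroid classifier absorb the inherent identification uncertainty near vertebra boundaries.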
Article
This paper proposes a new algorithm for the efficient, automatic detection and localization of multiple anatomical structures within three-dimensional computed tomography (CT) scans. Applications include selective retrieval of patients images from PACS systems, semantic visual navigation and tracking radiation dose over time. The main contribution of this work is a new, continuous parametrization of the anatomy localization problem, which allows it to be addressed effectively by multi-class random regression forests. Regression forests are similar to the more popular classification forests, but trained to predict continuous, multi-variate outputs, where the training focuses on maximizing the confidence of output predictions. A single pass of our probabilistic algorithm enables the direct mapping from voxels to organ location and size. Quantitative validation is performed on a database of 400 highly variable CT scans. We show that the proposed method is more accurate and robust than techniques based on efficient multi-atlas registration and template-based nearest-neighbor detection. Due to the simplicity of the regressor's context-rich visual features and the algorithm's parallelism, these results are achieved in typical run-times of only ∼4s on a conventional single-core machine.
Ebner, T., Štern, D., Donner, R., Bischof, H., Urschler, M.: Towards Automatic Bone Age Estimation from MRI: Localization of 3D Anatomical Landmarks. In: MICCAI 2014, Part II. LNCS, vol. 8674, pp. 421-428 (2014)
Glocker, B., Zikic, D., Konukoglu, E., Haynor, D.R., Criminisi, A.: Vertebra Localization in Pathological Spine CT via Dense Classification from Sparse Annotations. In: MICCAI 2013, Part II. LNCS, vol. 8150, pp. 262-270 (2013)