
Automatic Landmark Detection for Preoperative Planning of Hip Surgery using Image Processing and Convolutional Neural Networks


Abstract

Preoperative planning software helps orthopedists prepare well for hip surgeries. Implants can be virtually positioned specific to the patient’s anatomy prior to the actual operation. Currently, landmarks (points on an x-ray) must be placed manually by the surgeon in a time-consuming and error-prone fashion to derive implant locations. Automating landmark detection helps to automate the whole planning process in a fast and accurate manner. Two different image processing and two convolutional neural network (CNN) approaches are therefore described and evaluated to detect 6 landmarks automatically on a clinical dataset of 1054 pelvis x-rays. A cascaded regression CNN (CCNNR) achieves a mean radial error of 2.34 ± 2.5 mm and a leg length discrepancy error of 2.2 ± 2.55 mm in an average time of 2.81 s per image (unoptimized, CPU). The CCNNR can hereby be applied to different 2D planning tasks. On another skull x-ray dataset (IEEE ISBI Challenge 2015 for automated cephalometry [1]), it outperforms previous approaches in all but one measure. Furthermore, a heuristic for cascade definition is presented to detect landmarks even in high-resolution images and with small dataset sizes by continuously and smoothly simplifying the task.
Department of Simulation and Graphics
Master Thesis
in partial fulfillment of the requirements for the degree of
Master of Science (M.Sc.)
Automatic Landmark Detection for
Preoperative Planning of Hip Surgery
using Image Processing and
Convolutional Neural Networks
Author Benjamin Bergner
Student number 195598
Submitted on February 22, 2018
Primary Reviewer Jun.-Prof. Dr. Christian Hansen
Secondary Reviewer Adj. Assoc. Prof. Dr. Claes Lundström
Statutory Declaration
I declare that I have authored this thesis independently, that I have not used
other than the declared sources / resources, and that I have explicitly marked
all material which has been quoted either literally or by content from the used
sources.
This thesis was not used in the same or in a similar version to achieve an academic
grading or is being published elsewhere.
Magdeburg, February 22, 2018
I foremost want to thank Mattias Bergbom for giving me the possibility to inten-
sify my machine learning studies in the digital health domain and for managing
everything necessary in order to conduct this thesis successfully.
I want to thank my advisors Christian Hansen and Claes Lundström as well as
Olof Sandberg for giving me valuable feedback on thesis structure and method-
ology. Furthermore, I want to thank Daniel Forsberg for helpful discussions on
machine learning.
Without Sectra’s orthopedic department, it would have been difficult (and less
insightful) to annotate over 1000 x-rays, thank you.
Last but not least, I want to express my gratitude to everybody who encouraged
and welcomed me warmly during Sweden’s cold winter season.
1 Introduction
2 Related Work
3 Methods
  3.1 Contour Change Points
  3.2 Recursive Regions of Interest
  3.3 Convolutional Neural Network Classification
  3.4 Shape Model
  3.5 Cascaded Convolutional Neural Network Regression
4 Data Processing
5 Implementation
  5.1 Pseudocodes
    5.1.1 CCP Feature Extraction
    5.1.2 RROI Prediction
  5.2 CNN Architectures and Learning
6 Evaluation
  6.1 Metrics
  6.2 Results
7 Discussion
  7.1 Method Comparison
  7.2 Product Integration
8 Conclusion & Future Work
A Challenging X-Rays
B CCNNR Prediction Paths
C Cephalometric Landmark Detection Details
List of Figures
Bibliography
1 Introduction
Total hip replacement (THR) is a common orthopedic surgery. In 2015, the
highest rates were reported in Switzerland, Germany and Austria (308, 299 and
271 operations per 100,000 population) [2]. Since 2000, the rate has increased
by 30% on average (OECD29). Osteoarthritis, one of the ten most disabling
diseases in developed countries, is often the reason for this surgery, which is
mostly performed on people above 60. Osteoarthritis is the degeneration of
cartilage in the joint space between caput and acetabulum. It results in pain
due to bone rubbing (compare both pelvis sides in figure 1.1 left). In THR,
cartilage and damaged bone are removed. Femoral head and stem are replaced by
prostheses. Diseased cartilage is reamed out and exchanged with an acetabular
prosthesis called acetabular component or cup (see figure 1.1 right).¹
Figure 1.1: Two different pelvis x-rays; left: Landmarks and corresponding
anatomy, right: Surgical outcome showing prostheses
Before entering the operating room, surgeons plan on the patient’s x-ray to fit
implants to the individual anatomy. This includes the selection of prosthesis
types and sizes as well as their location and orientation relative to the
anatomy. Leg length discrepancy (LLD) shall be compensated.² Plans can
subsequently be viewed on a monitor during operation and overall ensure that
surgeons are well prepared.
¹ For more information about THR, see [3]
THR planning on x-ray images is one of the features of Sectra’s preoperative
planning suite, OrthoStation. The process can be divided into the following steps:
1. X-ray import
2. Automatic marker detection for correct distance scaling within the image
3. Manual and guided landmark placement
4. Automatic calculation and setup of prosthesis sizes and locations
5. Manual correction and adjustment of components
With automation, planning is facilitated and surgical outcomes improve. One
of the last repetitive manual steps consists of placing landmarks on an x-ray.
Landmarks are salient and defined points used by humans and machines for
application-specific orientation. In orthopedics, they are needed to give auto-
matic suggestions for prosthesis positioning, alignment and standardized sizes.
Six important anatomical landmarks for THR planning are both sides of the
ischium and trochanter minor as well as the caput center and acetabular roof
(see figure 1.1 left). The first two landmarks are defined as the lowest ischium
points that form a tangent to it and accumulate equal area under the convex
anatomical curves on both sides. The landmarks on the trochanters minor are
defined as the uppermost points facing the image center. In some cases, these
definitions can apply to various locations (see also chapter 4). The caput
ideally has the shape of a circle within a 2D image that we want to capture.
Landmark 5 is the caput center whereas landmark 6 is shifted vertically from it
so that it touches the acetabular roof.
With the first 4 landmarks, LLD can be calculated. This measure serves for
automatic adjustment of prosthesis sizes, positions and their relations to each
other to compensate for unequal leg lengths. It is the difference between the
lengths of the lines drawn from the trochanters minor perpendicular to the line
spanned by right and left ischium (see figure 1.1 left). The distance between
caput center and acetabular roof can be considered the cup radius and hence
determines its size. In addition, landmark 5 specifies the orientation of the cup
(and consequently of the stem) as 45° between ischium line and caput center.
Since the caput is not always similar to a circle (see appendix A) and cup
positioning is subject to the surgeon’s choice, finding the last two landmarks is
regarded as an extra for this thesis and can only give default cup placements
that might need adjustments as personally desired.
² For more information about the value of preoperative planning in THR surgery, see [4]
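The LLD definition above can be illustrated with a small geometric sketch. This is not the thesis implementation: the function names, the (x, y) coordinate convention, the sign convention (right minus left) and the `mm_per_px` scaling factor are assumptions made for illustration only.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the infinite line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("the two ischium landmarks must be distinct")
    # |cross product| / |direction vector| gives the perpendicular distance
    return abs(dx * (py - ay) - dy * (px - ax)) / length


def leg_length_discrepancy(isch_r, isch_l, troch_r, troch_l, mm_per_px=1.0):
    """Signed LLD: right minus left perpendicular distance to the ischium line.

    mm_per_px is a hypothetical scale factor (e.g. from marker detection).
    """
    d_right = point_line_distance(troch_r, isch_r, isch_l)
    d_left = point_line_distance(troch_l, isch_r, isch_l)
    return (d_right - d_left) * mm_per_px


# Example with made-up pixel coordinates
lld = leg_length_discrepancy((0, 0), (10, 0), (-1, 5), (11, 3), mm_per_px=0.5)
```

Taking the absolute value of the result gives an unsigned discrepancy if the side does not matter.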
We can conclude that landmarks play a crucial role in THR planning. Detect-
ing them automatically (called automatic landmark detection or ALD) provides
surgeons (and patients) and Sectra various advantages as listed in table 1.1.
Surgeons                                                       | Business development
Decreases planning time                                        | Happier customers
Increases planning quality in stressful clinical daily routine | Follows automation chain in orthopedic solution
Increases planning quality for inexperienced surgeons          | Application scalable (hip, knee, shoulder, spine)
Less attention overhead                                        | Expandable to 3D planning (manual placement difficult)
Fewer clicks lead to better UX                                 | Usable as basis for segmentation (common image analysis task)
Wow effect                                                     | Usable for marketing (customers, potential customers, future employees)
Table 1.1: Advantages of automatic landmark detection for orthopedic surgeons
and business development
Manual placement is error-prone due to image quality and landmark definition
ambiguities, and it depends on the surgeon’s experience in the profession as well
as with the software. In addition, following a guide increases planning time and
attention overhead. Even though manual setting can be accurate, time restrictions
in clinical daily routine reduce the time spent on this mundane task and hence
lead to more errors. From a business development point of view, ALD is valuable
as it has wide applicability within the orthopedic suite and can be expanded to
3D planning.
The goal of this thesis is to investigate the possibilities to detect landmarks
automatically on hip x-rays. Those can then e.g. be suggested to the surgeon or
utilized to fully automate planning. Therefore, simple image processing as well as
machine learning techniques are tested. The research will be conducted in 2D to
reduce complexity in a new field for the company. In addition, the thesis outcome
is aimed to be transferable to production and hence must be reasonably fast.
In chapter 2, related work is reviewed. Chapter 3, the most extensive chapter,
explains all tested methods. Chapter 4 gives more information about the data used
to train and test models while chapter 5 focuses on implementation details.
In chapter 6, all methods are evaluated, and chapter 7 discusses these results by
comparing all approaches and gives insights for production usage. Chapter 8
concludes the outcomes and presents recommendations for future work.
2 Related Work
ALD is a common procedure in medical imaging. It is mostly used as a prelimi-
nary step for segmentation, as a means for registration algorithms and to measure
geometric relationships for anatomical analysis. The latter is the application in
this thesis. Especially for hips, only few publications exist.
In [5] e.g., the authors detect landmarks to support atlas-based segmentation of
pelvis and femur in CT images within a computer aided diagnosis and planning
system for periacetabular osteotomy. As predictor, random forest regression is
used. Around every landmark in the training data, sub-volumes are sampled.
Subsequently, mean intensity and variance of volume blocks as well as displace-
ments from block centers to the target landmark are calculated. Every tree then
votes for a landmark location for random sub-volumes in test images. Finally, a
Gaussian probability map is constructed to determine the best fit position. Sim-
ilar to the previous approach, [6] apply random forests with Haar-like features to
first detect objects in a global search followed by a local search. Their goal is to
find femur boundary points for segmentation. In [7, 8], the authors apply atlases
to segment bone structures and detect landmarks for 3D hip operation planning.
An expired Siemens patent [9] describes their invention of automatically finding
landmarks for THR surgery by intensity ridge maps within a computed region
of interest. Previous work for the hip is either used for segmentation, in 3D or
implements vague image analysis methods. In this thesis, new approaches for 2D
hip landmark detection instead of 3D are described and evaluated on a clinical
dataset especially for integration in planning software for hip surgery. Besides
conventional machine learning methods, two convolutional neural network vari-
ants are evaluated.
Apart from anatomical landmark detection in hip images, several applications
for other parts of the body have been investigated. [10] e.g. use a template
matching approach to find landmarks in chest images as initial registration of
two CT scans. [11] apply landmark detection to MR brain image registration.
Besides using random forests, they propose point jumping to iteratively refine
vote positions. Another example is shown in [12] where aortic valve landmarks
are identified for planning and visual guidance during surgery.
One recent publication [13] implements a CNN to find landmarks for radiographic
assessment of adolescent idiopathic scoliosis. They implement boost layers for fea-
ture outliers detection and alignment as well as anatomy shaped regression out-
puts. Another interesting application due to its similarity is anatomical landmark
detection on skull x-rays for automated cephalometry. Two public challenges were
organized and results are reported in [1, 14]. An open small high-resolution 2D
dataset has been provided and several authors evaluate their models on this data.
One recent paper [15] makes use of classification CNNs and a shape model [16]
to find landmarks even with few training images and achieves best performance
among other competitors within 2 mm distance bound. One of the methods in
this thesis will be evaluated on this dataset and applies a CNN regression cascade
suitable for small datasets without the dependence on an explicit shape model.
In addition, it provides fast predictions.
Further deep learning approaches on medical data in 3D are reviewed in [17]. Be-
sides use-cases in the field of medicine, another important ALD research branch
resides in images depicting faces (e.g. mouth corners, nose tip, eyes) and humans.
Some applications in this field are face detection, pose estimation [18], facial ex-
pression analysis [19], augmented reality [20, 21] and modeling of age progression
[22]. Even though not industry-related, the task of finding landmarks far away
from each other and dimensions (2D) are comparable to hip images in this thesis
(in contrast to most medical ALD research). Deep learning approaches are com-
mon in this domain. One of the reasons is the bigger amount of available data.
In [23], a cascade of regression CNNs is used for coarse-to-fine prediction of 5
facial landmarks in a training set of 10,000 images of size 39x39 px. Another
typical strategy is multi-task learning [24, 25, 26]. Multi-task learning assumes that
features learned for a related task are useful in predicting outputs for the main
task. In [24] e.g., comparable results to the cascaded approach are achieved with
CNNs learning head pose, gender, age, facial expression and attributes as well
as facial landmarks at the same time. Other complex methods in this industry
encompass deep deformation networks [27] and adversarial learning [28].
An abundance of training data even for small images is typical in these papers
because finding landmarks in faces can be difficult due to variances like people
wearing glasses, occlusions caused by hair, different head poses and expressions.
The clinical hip dataset is much smaller and comprises high resolution images
but methods do not need to handle above mentioned hurdles to the same extent.
In this thesis, it is shown that the idea of using cascades is also suitable on very
distinct dataset properties including visual features, variance types, dataset and
image sizes. Furthermore, a heuristic is presented to estimate cascade hyperpa-
rameters (size of refinement regions, restricted augmentation shifting) that make
learning on small datasets and big images possible.
Besides the previously mentioned contributions, the thesis provides a suitable
approach for production usage, a beginning path in machine learning for Sectra’s
orthopedic department and educational insights about the complexity of such
problems, how and how not to solve them.
3 Methods
A landmark inside an x-ray image is represented by a single pixel. The number
of possible solutions hence corresponds to the number of pixels (potentially
millions). It would facilitate the detection process to shrink the search space
into a region of interest (ROI). For ROI computation and within it, visual
features are extracted from a set of images I_train to describe landmark
locations l ∈ L_train and their surroundings. The combination of I_train and
L_train is called the training dataset D_train. With the help of those visual
features, landmark positions \hat{l} can be estimated in images I_test sampled
from a distinct set, called the test set D_test. A third validation or
development set D_val might be included to tune hyperparameters.
Local visual features might occur similarly in different areas within an image
(imagine the pelvis symmetry). Therefore, the extraction of multiple estimates
increases the chance of encountering the true location. Those estimates are called
candidates C_l of a landmark l. Furthermore, anatomical landmarks show similar
geometric relationships (distances, angles) between each other, e.g. the right
trochanter minor is (hopefully) always located to the right of the left one. We
can exploit these relationships with shape models to narrow down candidates to a
final estimate. The landmark detection process can consequently be divided into
three steps:
1. ROI extraction
2. Landmark candidate selection based on visual features
3. Final estimation by shape modeling
In order to solve the landmark detection problem, descriptions of four strategies
follow that make use of these steps in different ways. The first two, contour
change points and recursive regions of interest, are simple approaches based on
intuition and mere image processing that utilize hand-crafted features. The latter
two, CNN classification and cascaded CNN regression, are deep learning models
that learn features by themselves and make decisions based on them. While the
first three methods apply a shape model with explicit statistical analysis, the last
relies on an implicit model.
3.1 Contour Change Points
Due to high ambiguity of cup placements and high variances in the caput region,
the main task has been restricted to find the first 4 landmarks. Imagining the
pelvis contour, the trochanter minor and ischium landmarks lie on steep contour
changes. The question of landmark detection can therefore be shifted to contour
change point (CCP) detection. Figure 3.1 gives an overview of the approach
which divides into image preprocessing, feature extraction, feature storage and
[Figure 3.1 diagram: preprocessing (reduction, edge detection, morphological closing, skeletonization), feature extraction with chain codes (CC) and change points such as 54, 43 and 65, and a feature database keyed by landmark number with ground truth (GT) positions GT 1 to GT 4]
Figure 3.1: Contour change point approach overview
As a first preprocessing step, histogram equalization is applied to all images,
ensuring clearer contrast, i.e. a higher intensity gradient, around the landmark
area. Gaussian smoothing follows to reduce noise. Then, edges are detected with
the Canny algorithm [29]. Two thresholds t_min and t_max determine which parts of
the image are considered edges. Sobel kernel based gradient magnitudes below
t_min are non-edges, those above t_max are true edges, and gradient magnitudes
between the thresholds are considered edges if they connect to true ones. The
thresholds are adjusted dynamically (see 3.1, 3.2) using the median image
intensity m_i in the range [0,1], with d_min and d_max as hyperparameters.

t_{min} = \max(0, (1 - d_{min}) \cdot m_i)    (3.1)
t_{max} = \min(1, (1 + d_{max}) \cdot m_i)    (3.2)
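As a minimal sketch of equations 3.1 and 3.2, the two thresholds can be derived from the median intensity as follows. The function name and the flat list of intensities are assumptions for illustration; the values passed for d_min and d_max are made up.

```python
from statistics import median


def canny_thresholds(intensities, d_min, d_max):
    """Return (t_min, t_max) from intensities scaled to [0, 1].

    Follows equations 3.1 and 3.2: both thresholds are spread around the
    median intensity m_i and clipped to the valid range [0, 1].
    """
    m_i = median(intensities)
    t_min = max(0.0, (1.0 - d_min) * m_i)
    t_max = min(1.0, (1.0 + d_max) * m_i)
    return t_min, t_max


# Example with made-up values: median 0.4 gives thresholds near 0.2 and 0.6
t_min, t_max = canny_thresholds([0.2, 0.4, 0.6], d_min=0.5, d_max=0.5)
```

Because the thresholds follow the median, darker or brighter exposures shift both bounds together instead of relying on one fixed pair for all images.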
In order to connect nearby edges, morphological closing (dilation followed by
erosion) is applied. Since this operation introduces several filled areas, skele-
tonization subsequently decreases the contour thickness to a single pixel again.
The effects of each preprocessing step are depicted in figure 3.2.
Figure 3.2: Preprocessing steps exemplified on left ischium, top left: Gaussian
smoothing blurs histogram equalized image and hence reduces noise, top right:
Canny edge detection, bottom left: morphological closing, bottom right:
skeletonization
After contour lines have been made visible, they need to be converted into a
machine-readable format for feature extraction. Chain codes (CCs) [30] for hor-
izontal, vertical and diagonal directions can be used to translate contours into
numerical strings. A contour finding algorithm returns the coordinates of all di-
rection changes for each background separated contour as a prerequisite for the
subsequent pixel-wise transcription.
Several considerations must be addressed when converting coordinates into
numerical strings:
• Starting coordinates of each contour must be determined. Pixels lying at
  the coordinate extrema with only one neighbor are favored.
• A contour pixel can have multiple 8-connected neighbors, e.g. when lines
  cross (see figure 3.2 bottom right). Due to line crossings, a contour can
  consist of several chains.
• Already passed coordinates must be blacklisted to circumvent infinite loops
  in circular contours.
• A traversing orientation (e.g. clockwise) must be set.
Given extracted numerical chains (e.g. "00664422" for a rectangle), we are able to
find change points (CPs). Each CP consists of two numbers, the first representing
the original, the second the upcoming direction. The combination "54", for
example, symbolizes a direction change from down-left to left. As can be seen in
figure 3.1, a CP is not placed at every direction change but only at locations of
significant change to account for pelvis shape variance.
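The translation of a contour into a chain code can be sketched as below. The direction numbering is an assumption inferred from the two examples in the text ("00664422" for a clockwise rectangle, "54" for down-left followed by left); the code compass in figure 3.1 is not reproduced here. Pixels are given as (row, column) pairs with rows growing downwards, and the function name is hypothetical.

```python
# Assumed mapping from a (row step, column step) to a chain code:
# 0 = right, 2 = up, 4 = left, 6 = down, odd codes are the diagonals.
DIRECTION_TO_CODE = {
    (0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
    (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7,
}


def chain_code(path):
    """Translate an ordered list of 8-connected pixels into a code string."""
    codes = []
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        step = (r1 - r0, c1 - c0)
        if step not in DIRECTION_TO_CODE:
            raise ValueError(f"{(r0, c0)} and {(r1, c1)} are not 8-connected")
        codes.append(str(DIRECTION_TO_CODE[step]))
    return "".join(codes)


# A small rectangle traversed clockwise on screen, starting top-left
rectangle = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2),
             (2, 1), (2, 0), (1, 0), (0, 0)]
rectangle_code = chain_code(rectangle)
```

Finding the start pixel, handling crossings and blacklisting visited pixels are left out of this sketch; it only covers the pixel-wise transcription step.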
Each contour is traversed from a determined start point and the number of CC
occurrences is counted. Once a CC hits a change threshold t_c, it is set as the
first element of a CP. If t_c is not reached, the contour is too short and is
discarded. In order to find the second element, we continue traversing the
contour and again count contributions of each CC. Codes far away from the first
element in the code compass (see figure 3.1) are considered as contributing more
to a change than nearby codes. Given a first element of e.g. CC 6, CC 0 would
contribute twice as much to the change as CC 7. If the same direction as the first
element is encountered, the magnitude of change is decreased by 1. If the change
magnitude becomes negative, no significant change has been detected and all
contributions are reset. However, if t_c is hit, the second element is set to the
CC with the highest contribution. If several CCs contributed equally, the CC
nearest to the first element is chosen because it occurred most often. The
coordinate of the CP is the first occurrence of the second element in the current
session. Finally, the CPs are orientation-normalized (e.g. CC 46 should be equal
to CC 20).
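Two details of this procedure lend themselves to a short sketch: the compass distance that weights contributions, and the orientation normalization. Both functions are illustrative assumptions. The distance is taken as the circular distance on an 8-direction compass, which reproduces the example that CC 0 counts twice as much as CC 7 for a first element of CC 6. The normalization assumes that "46" and "20" describe the same corner traversed in opposite directions, and picking the lexicographically smaller string as the canonical form is an arbitrary choice made here.

```python
def compass_distance(a, b):
    """Circular distance between two chain codes on the 8-direction compass."""
    diff = abs(a - b) % 8
    return min(diff, 8 - diff)


def reverse_change_point(cp):
    """Change point seen when the same contour is traversed backwards.

    Reversing the traversal swaps the order of the two directions and turns
    each direction by 180 degrees (adds 4 modulo 8).
    """
    first, second = int(cp[0]), int(cp[1])
    return f"{(second + 4) % 8}{(first + 4) % 8}"


def normalize_change_point(cp):
    """Map a change point and its reversed twin to one canonical string."""
    return min(cp, reverse_change_point(cp))


weight_far = compass_distance(6, 0)
weight_near = compass_distance(6, 7)
```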
Once all CPs have been extracted, they need to be composed into features. A
feature in this context is a combination of CPs. Within a neighborhood radius r_N
centered at the ground truth (GT) landmark position, CPs and their k nearest
neighbors compose a feature. They are stored landmark-wise in a feature database
FDB filled from all i ∈ I_train. Features are utilized for prediction in a lazy
fashion. In an i ∈ I_test, features are extracted again and compared to entries
from FDB. Whenever a match for a landmark is found, the initially empty C_l is
updated as the union with the respective CP coordinate of the found feature.
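The lazy lookup can be sketched as follows. The concrete feature encoding (a tuple holding the CP followed by its k nearest neighbor CPs ordered by distance), the data layout and all function names are assumptions for illustration; the text does not fix these details.

```python
import math


def cp_features(change_points, k):
    """Pair every CP coordinate with a feature: its code plus k nearest codes.

    change_points is a list of ((row, col), code) tuples.
    """
    features = []
    for i, (coord, code) in enumerate(change_points):
        others = [cp for j, cp in enumerate(change_points) if j != i]
        others.sort(key=lambda cp: math.dist(coord, cp[0]))
        neighbors = tuple(c for _, c in others[:k])
        features.append((coord, (code,) + neighbors))
    return features


def build_feature_db(training_items, k, r_n):
    """Collect features of CPs lying within r_n of a ground truth landmark.

    training_items yields (change_points, {landmark: gt_coord}) per image.
    """
    db = {}
    for change_points, landmarks in training_items:
        for coord, feature in cp_features(change_points, k):
            for landmark, gt in landmarks.items():
                if math.dist(coord, gt) <= r_n:
                    db.setdefault(landmark, set()).add(feature)
    return db


def find_candidates(change_points, db, landmark, k):
    """Candidate set C_l: coordinates of test CPs whose feature is in the DB."""
    known = db.get(landmark, set())
    return {c for c, f in cp_features(change_points, k) if f in known}


train = [([((10, 10), "54"), ((10, 12), "46"), ((50, 50), "02")], {1: (10, 11)})]
fdb = build_feature_db(train, k=1, r_n=3)
test_cps = [((20, 20), "54"), ((20, 23), "46"), ((80, 80), "02")]
candidates = find_candidates(test_cps, fdb, landmark=1, k=1)
```

Exact tuple matching is the strictest variant; relaxed matching would replace the set membership test with a similarity check.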
Finally, several variations of feature extraction are feasible to account for
image variances and to avoid finding too many features in a test instance.
Some of them are:
• Allow similar CPs (e.g. "45", "46") as valid matches when comparing
  features in FDB and I_test
• Collect k nearest neighbors but accept features with #matches < k
• Reduce FDB size by only including the most frequently occurring features
3.2 Recursive Regions of Interest
The previous method does not benefit from search space reduction. Another
intuitive approach that makes extensive use of it is recursive regions of interest
(RROI). For our application, a true ROI is a subimage that contains a landmark.
It preserves the most important part of an image including characteristics around
a landmark. Most importantly, it discards unnecessary parts. If found accurately,
ROIs significantly shrink the possible options. ROIs can be applied before the
actual landmark detection algorithm. In case of the previously described CCP
algorithm e.g., this could reduce the number of candidates since local CP features
might occur many times within an image but less often in a smaller area. The
risk of using ROIs is that if they are not found accurately, i.e. the region does
not contain the landmark, subsequent detection algorithms cannot identify the
true location.
Once a ROI is found for a particular landmark, we can again simplify detection by
discovering another ROI within the previous one. This concept can be continued
in a recursive manner, not only to decrease the search space but to reveal the
landmark itself. In this extended case, ROIs are applied until only one pixel
remains and hence act as a landmark detection method. An example is depicted
in Figure 3.3 left.
Figure 3.3: Landmark prediction with recursive ROIs; left: successful detection of
landmark 1, middle: deep layers detect similar coordinates for landmark 3, right:
shallow layers detect different locations for landmark 3
A ROI for a landmark l from a test image i_test is determined by comparing
sampled patches p ∈ P from i_test with a reference patch r_l of the same size.
This is called correspondence matching. One measure of comparison is the
pixel-wise sum of squared differences (SSD) (see 3.3) with u, v as row and column
iterators and s as region size. The ROI is then taken to be the sample with
minimum SSD because it correlates best with the reference.¹

SSD(p, r_l) = \sum_{u=1}^{s} \sum_{v=1}^{s} (p(u, v) - r_l(u, v))^2    (3.3)

¹ More about correspondence matching and similarity measures in [31]
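A minimal sketch of SSD-based correspondence matching is given below. Images are plain nested lists of intensities to keep the example self-contained; the function names and the `stride` parameter (the step of the sliding window) are assumptions for illustration.

```python
def ssd(patch, reference):
    """Pixel-wise sum of squared differences between two equally sized patches."""
    return sum((p - r) ** 2
               for patch_row, ref_row in zip(patch, reference)
               for p, r in zip(patch_row, ref_row))


def crop(image, top, left, size):
    """Square sub-image of the given size with its top-left corner at (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def rank_patches(image, reference, stride=1):
    """Score every fully contained window; lowest SSD (best match) comes first."""
    size = len(reference)
    height, width = len(image), len(image[0])
    scored = []
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            score = ssd(crop(image, top, left, size), reference)
            scored.append((score, (top, left)))
    scored.sort()
    return scored


image = [[0, 0, 0, 0],
         [0, 0, 1, 2],
         [0, 0, 3, 4],
         [0, 0, 0, 0]]
reference = [[1, 2],
             [3, 4]]
best_score, best_corner = rank_patches(image, reference)[0]
```

Returning the full ranking rather than only the minimum makes it easy to keep several of the best windows as alternative hypotheses.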
Subimages as potential ROIs are sampled in a sliding window manner without
considering samples that lie partly outside their source image. A reference patch
(ground truth) of size s can be calculated as the pixel-wise mean \mu_l^s over
all training regions centered around the true landmark as in 3.4.

\mu_l^s(u, v) = \frac{\sum_{i \in I_{train}} i(u, v)}{|I_{train}|}, \quad (u, v) \in s    (3.4)
The mean is a simple but appropriate choice because affine transformations like
rotation and scaling are limited. Reasons for this are the similar acquisition
with an x-ray machine and the restricted variance of human anatomy. In addition,
we only look at patches centered around the true landmark coordinates, further
limiting differences. Even though the pelvis anatomy can vary, especially between
humans in need of surgery, it still has the same basic shape. Figure 3.4 shows
\mu_l^s for l ∈ L over I_train for the biggest region size. The means make the
most common visual features visible and appear as blurred versions of the
original images.
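Equation 3.4 can be sketched in a few lines. The helper names, the nested-list image format, the restriction to odd patch sizes and the choice to skip training images whose patch would leave the image are assumptions for illustration.

```python
def centered_patch(image, center, size):
    """Square patch of odd size centered at (row, col), or None if it leaves the image."""
    half = size // 2
    top, left = center[0] - half, center[1] - half
    if top < 0 or left < 0 or top + size > len(image) or left + size > len(image[0]):
        return None
    return [row[left:left + size] for row in image[top:top + size]]


def mean_reference_patch(images, landmarks, size):
    """Pixel-wise mean over the training patches centered at the true landmarks."""
    patches = [centered_patch(img, lm, size) for img, lm in zip(images, landmarks)]
    patches = [p for p in patches if p is not None]
    if not patches:
        raise ValueError("no training patch fits inside its image")
    n = len(patches)
    return [[sum(p[u][v] for p in patches) / n for v in range(size)]
            for u in range(size)]


# Two toy training images with constant intensities 2 and 4
img_a = [[2] * 5 for _ in range(5)]
img_b = [[4] * 5 for _ in range(5)]
reference_patch = mean_reference_patch([img_a, img_b], [(2, 2), (2, 2)], size=3)
```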
A second step that simplifies recognition of similar patches is scaling the
intensity values of both reference and samples to [0,1] (3.5). This treats every
image equally by giving more importance to visual structures instead of regional
brightness differences.

i_{norm} = \frac{i - i_{min}}{i_{max} - i_{min}}    (3.5)
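The scaling in 3.5 corresponds to a standard min-max normalization, sketched here for a patch stored as nested lists. The function name is hypothetical, and mapping a constant patch to all zeros is an assumption, since the formula is undefined when i_max equals i_min.

```python
def min_max_normalize(patch):
    """Scale all intensities of a patch linearly to the range [0, 1]."""
    values = [v for row in patch for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant patch carries no structure, map it to zeros
        return [[0.0 for _ in row] for row in patch]
    return [[(v - lo) / (hi - lo) for v in row] for row in patch]


normalized = min_max_normalize([[50, 100], [150, 250]])
```

After this step, a bright and a dark patch with the same structure produce identical values, so the SSD reflects shape rather than exposure.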
There are several hyperparameters that control RROI:
• Number of layers A
• Layer-wise region size s_a (already mentioned)
• Layer-wise sliding window stride stride_a
• Spawns per layer spawn_a
The last layer must either have a size of one pixel, or the center of the last
region is postulated as the predicted landmark. In deeper layers, s can be
decreased as the chance to encounter similar structures decreases along with it.
For every layer, not only must new samples be drawn relative to the previous
region, but intensity-adjusted means of equal size from I_train must also be
calculated (see Figure 3.5) for layer-wise correspondence matching.
Figure 3.4: Mean image intensities over training set for landmarks 1 to 6
s_a is assumed to be the number of vertical/horizontal pixels, defining a square,
for layer a with s_a > s_{a+1}. The bigger s_a, the higher the probability that
the true landmark will reside in the predicted ROI over images in I_test. If s_a
is too small, a patch similar to the true region might be predicted as ROI. The
right ischium e.g. appears similar to the left ischium in a small region but
becomes more dissimilar when symmetric structures like the trochanters minor
become visible in a bigger region.
stride_a controls how many pixels to move horizontally/vertically for sliding
window patch sampling, beginning at coordinate (0,0) in the top-left image corner.
Patches that partly lie outside the bounds are discarded. As layers become
deeper and s_a decreases, stride_a decreases with it. The hyperparameter has
two functions. First, the higher it is, the fewer computations need to be done.
Setting it to 1 results in an exhaustive search. Second, setting stride_a
reasonably high ensures that distant regions are spawned, a desired effect that
reduces redundant candidates and creates very distinct hypotheses that are
subsequently refined through the following layers.
Figure 3.5: Mean image intensities over training set around landmark 1 with
s_a decreasing from top-left to bottom-right
Since not finding a proper ROI has severe consequences for subsequent calculations
(see figure 3.3 right), several ROIs are spawned, controlled by spawn_a and based
on the minima of formula 3.3. Figure 3.3 middle shows another example where deeper
layers produce more similar predictions. Hyperparameter spawn_a together with
A yields the maximum possible size of C_l (3.6). It would be reached in the
non-redundancy case.

|C_l|_{max} = \prod_{a=1}^{A} spawn_a    (3.6)
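The bound in 3.6 and the effect of redundancy can be illustrated with a small sketch. The layer-wise expansion below is a simplification: `propose` is a hypothetical callback standing in for the SSD-based ROI search of one layer, and regions are represented by arbitrary hashable values.

```python
import math


def max_candidates(spawns_per_layer):
    """Upper bound on |C_l|: product of the spawn counts of all layers."""
    return math.prod(spawns_per_layer)


def expand_candidates(start_region, spawns_per_layer, propose):
    """Expand hypotheses layer by layer; identical regions are merged.

    propose(region, spawn) must return up to `spawn` child regions.
    """
    regions = {start_region}
    for spawn in spawns_per_layer:
        regions = {child for region in regions for child in propose(region, spawn)}
    return regions


spawns = [3, 2, 1, 1]

# Non-redundant case: every spawned child is distinct, so the bound is reached
distinct = expand_candidates((), spawns, lambda region, n: [region + (i,) for i in range(n)])

# Fully redundant case: all hypotheses collapse onto the same region
collapsed = expand_candidates((), spawns, lambda region, n: [("same",)] * n)
```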
3.3 Convolutional Neural Network Classification
Convolutional Neural Networks (CNNs) are a strong tool for image classification,
e.g. for predicting whether a tumor is malignant or benign [32]. In such
applications, complete images are fed into a CNN and the output is one of two
(binary) or more (multiclass) labels. Landmark detection can also be transformed
into a binary classification problem in which each pixel is classified into one
of two categories: it is (positive class) or is not (negative class) a landmark.
A separate CNN is trained and used for prediction for each landmark, so that
several training sets D_train,l need to be constructed.
CNNs are data hungry, a requirement that is still rarely satisfied in the medical
domain because of reconsenting/ethical approval issues and data silos. Additionally,
for landmark detection, the labeling effort and the recruitment of domain experts
make data acquisition a resource-intensive undertaking. A previous paper [15]
shows that even with few images, anatomical landmark detection on skull x-ray
data can be solved accurately with CNNs. One reason, besides the simple binary
problem formulation, is that many training data points can be generated from a
single image. This results in a training set much bigger than the image count
itself. Compared to simply taking the SSD as in the RROI algorithm or
constructing features by hand as in CCP, CNN classification (CNNC) promises
to be a better predictor thanks to automatic, data-specific representation learning.
It is hence tried out as a third method. What follows is a description of the method
applied in [15].
In order to construct D_train,l, positive and negative data points are randomly
sampled within certain radii, as can be seen in figure 3.6 left/middle. Positives
representing GT are sampled within a small radius around the true landmark
location of each training image. This assumes that nearby pixels can still be
considered valid landmark coordinates without decreasing performance.
A second, bigger circle is spanned as a non-sampling area to provide greater
distinction between positive and negative samples. Lastly, a third circle is
spanned around the true landmark coordinate, from which negatives are sampled
excluding the first two circles. Next, squares are placed with each sample point
as their center (see figure 3.6 right). The shuffled combination of subimages
i ∈ I_train from each base image together with their corresponding labels
y_l ∈ {0,1} makes up D_train,l, which is created for all l ∈ L.
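The sampling scheme can be sketched as follows. This is only an illustration under assumptions: the radii, sample counts and function names are hypothetical, image bounds are ignored, and only the patch center points (not the subimages) are generated.

```python
import math
import random

def sample_in_ring(center, r_inner, r_outer, rng):
    """Sample an integer pixel whose distance d to center satisfies r_inner <= d <= r_outer."""
    cx, cy = center
    while True:
        # Rejection sampling from the bounding square of the outer circle
        dx = rng.randint(-r_outer, r_outer)
        dy = rng.randint(-r_outer, r_outer)
        if r_inner <= math.hypot(dx, dy) <= r_outer:
            return (cx + dx, cy + dy)

def build_samples(landmark, r_pos, r_gap, r_neg, n_pos, n_neg, seed=0):
    """Return shuffled (point, label) pairs: label 1 near the landmark, 0 farther away."""
    rng = random.Random(seed)
    positives = [(sample_in_ring(landmark, 0, r_pos, rng), 1) for _ in range(n_pos)]
    # Nothing is sampled between r_pos and r_gap (the non-sampling area)
    negatives = [(sample_in_ring(landmark, r_gap, r_neg, rng), 0) for _ in range(n_neg)]
    samples = positives + negatives
    rng.shuffle(samples)
    return samples

samples = build_samples(landmark=(500, 400), r_pos=5, r_gap=20, r_neg=200,
                        n_pos=50, n_neg=50)
```

Each returned point would then serve as the center of a square subimage cut from the base image.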
Figure 3.6: Illustrative sampling procedure for landmark 1, left: negative sample
center points in red, positives in green, middle: zoomed to true landmark location,
non-sampling area represented by blue circle, right: sample boundaries drawn
After dataset collection, the CNNs can be trained landmark-wise. Detailed
information about all the mentioned concepts can be found in [33, 34]. Here, only
the most important intuitions are given, which can serve as a refresher. CNNs
are compositions of non-linear functions that map an input to an output. They
are structured in layers of neurons that are connected to each other. Those
connections are parameters, called weights, which are optimized based on the given
data, an error tracing method called backpropagation, a learning algorithm like
stochastic gradient descent and an objective function.
CNNs in particular consist of convolutional (conv) and, most often, fully-connected
(FC) layers. A conv layer in turn consists of several feature maps with equally
sized 2D shapes. They learn to respond to data-specific spatial intensity
patterns (features). In the first conv layer, those patterns can be
horizontal/vertical/diagonal edges or other pixel arrangements that might not be
interpretable by humans. Every feature map comprises neurons. Each neuron is
connected to a certain part of the input image (more generally: the previous layer)
called its local receptive field. Neurons respond more strongly when exposed to
patterns that correspond to their filter (weight matrix). Weights are called shared
because they form equal-valued connections between all local receptive fields and
neurons of a specific feature map. This enables a feature map to find the same
pattern anywhere in the image. Feature maps therefore account for translations/shifts,
which occur frequently in the hip dataset.
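Weight sharing can be illustrated with a small pure-Python sketch. It assumes a hand-set 2x2 vertical edge filter and a toy 6x6 image; in a CNN the filter values would be learned, and all names here are hypothetical.

```python
def feature_map(image, kernel):
    """Valid cross-correlation: the same kernel is applied to every local receptive field."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

# Vertical edge filter (hand-set for illustration)
kernel = [[-1, 1],
          [-1, 1]]

def image_with_edge(col, size=6):
    """Dark image whose pixels are bright from column `col` onwards."""
    return [[1 if c >= col else 0 for c in range(size)] for _ in range(size)]

def strongest_column(fmap):
    """Column index of the strongest response in the first row of the feature map."""
    row = fmap[0]
    return row.index(max(row))

response_a = strongest_column(feature_map(image_with_edge(2), kernel))
response_b = strongest_column(feature_map(image_with_edge(4), kernel))
```

Shifting the edge by two columns shifts the strongest response by two columns as well, without any change to the weights.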
The next component is the pooling layer, which can be interpreted as the second
part of a conv layer. A pooling layer reduces the dimensionality, and hence
overfitting, by compressing feature maps. It has no parameters (weights) of its own.
The most common pooling method, also used in this thesis, is max-pooling, which
only keeps the maximum values of windows spanned over their associated feature map.
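A minimal sketch of max-pooling on a single feature map is given below. The 2x2 non-overlapping window is an assumption made for illustration; the window size used in the thesis is not stated in this passage.

```python
def max_pool(fmap, size=2):
    """Non-overlapping max-pooling; incomplete border windows are dropped."""
    rows = len(fmap) // size
    cols = len(fmap[0]) // size
    return [[max(fmap[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(cols)]
            for r in range(rows)]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]
pooled = max_pool(fmap)  # 4x4 feature map compressed to 2x2
```

The operation has nothing to learn: it only forwards the strongest response of each window.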
Since feature map sizes decrease in deeper layers, receptive fields cover a larger
part of the original image. In addition, every feature map of the second (and higher)
conv layers has access to all feature maps of the previous layer. The CNN is therefore
able to combine low-level features like edges to form higher-level representations
such as parts of the ischia, trochanters and other pelvis areas. The role
of conv layers in a CNN is to learn patterns encountered in the training set. FC
layers on top of conv layers can combine these arbitrarily to make predictions.
At the beginning of this section, CNNs were defined as non-linear functions. Every
neuron's output o is actually only a linear combination of its input intensities
x, weights w and bias b. To make them non-linear and better predictors, an
activation function like the rectified linear unit (ReLU) is applied on top of the
neuron's prior calculation (see formula 3.7).
\[ o = \max\Big(0, \sum_{e \in rec.field} w_e x_e + b\Big) \qquad (3.7) \]
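The neuron computation can be written out directly. The input values, weights and bias below are made up for illustration; `neuron_output` is a hypothetical helper name.

```python
def neuron_output(inputs, weights, bias):
    """Weighted sum over the receptive field plus bias, followed by ReLU."""
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return max(0.0, pre_activation)

receptive_field = [0.2, 0.8, 0.5]
active = neuron_output(receptive_field, [1.0, -0.5, 2.0], bias=0.1)
silent = neuron_output(receptive_field, [-1.0, -0.5, -2.0], bias=0.1)
```

A negative weighted sum is clipped to zero, which is what makes the neuron non-linear.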
Since landmark detection has here been formulated as a binary classification
problem, the output is given by a single neuron. Its final value is squashed into
the range between 0 and 1, for which the sigmoid activation function (3.8) is used.
A cost function for sigmoid activation that allows fast learning by avoiding small
gradients/weight updates for large differences between GT and prediction is
binary cross entropy CE (3.9), with y, ŷ as GT and prediction over n training instances.
\[ f(x) = \frac{1}{1 + e^{-x}} \qquad (3.8) \]
\[ CE = -\frac{1}{n} \sum_{i=1}^{n} \big( y_i \cdot \ln(\hat{y}_i) + (1 - y_i) \cdot \ln(1 - \hat{y}_i) \big) \qquad (3.9) \]
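Both functions are short enough to evaluate by hand. The sketch below uses made-up labels and pre-activations; the clamping constant `eps` is an added safeguard against log(0) and is not part of the formulas.

```python
import math

def sigmoid(x):
    """Squash a real-valued pre-activation into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross entropy over all instances."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

labels = [1, 0, 1]
confident = binary_cross_entropy(labels, [sigmoid(4.0), sigmoid(-4.0), sigmoid(4.0)])
wrong = binary_cross_entropy(labels, [sigmoid(-4.0), sigmoid(4.0), sigmoid(-4.0)])
```

Confidently wrong predictions are penalized far more heavily than confidently correct ones are rewarded, which keeps the gradient large when the error is large.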
Training a CNN consists of a forward and a backward pass. The forward pass can be
seen as the executing step, calculating predictions from given inputs. The backward
pass via backpropagation is, intuitively, a reflection by each neuron in the network
on its share of the prediction error. This reflection makes the neurons adapt in
such a way that the prediction becomes more similar to GT, averaged over (parts of)
the training set.
Since ideally every pixel is evaluated in the prediction stage, several pixels might
be declared positive (ŷ > 0.5). There are two reasons for this: First, subimages
defined by nearby pixels around the true landmark location are almost identical.
Second, the visual appearance around the true landmark location might occur
similarly in other image parts. This is the same problem as described in section 3.1
and, depending on the settings, in section 3.2, where only a narrow area around GT
is considered. To save computation time, the source image size has been reduced
in [15]. Another option is to apply a stride to the original image. This results in
evaluating fewer pixels ($\lceil \#pixels / stride \rceil^2$ if the image is square)
but can cause performance loss. Figure 3.7 shows a typical classification output with
the 50 highest-scoring coordinates per landmark.
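The effect of the stride on the workload can be checked with a few lines. The side length of 2021 pixels is the resized image size given in chapter 4 and the strides are those of figure 3.7; the helper names and the dictionary-based score format are assumptions.

```python
import math

def evaluated_positions(side, stride):
    """Number of pixels evaluated in a square image with `side` pixels per axis."""
    return math.ceil(side / stride) ** 2

def top_candidates(scores, k):
    """scores maps (x, y) to a predicted probability; returns the k best coordinates."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

full = evaluated_positions(2021, 1)
strided_10 = evaluated_positions(2021, 10)
strided_20 = evaluated_positions(2021, 20)
best_two = top_candidates({(1, 1): 0.2, (2, 2): 0.99, (3, 3): 0.7}, k=2)
```

Doubling the stride roughly quarters the number of network evaluations per landmark.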
Figure 3.7: Typical CNN classification output with 50 candidates for each land-
mark 1 to 6 with colors red, purple, blue, green, light blue, white (best viewed
digital and/or colored); left: stride = 10px, right: stride = 20px results in higher
Displaying many outputs means that some might have a low probability value
assigned by the network. The candidates form clusters near GTs but also around
image features that resemble the landmark, as shown in figure 3.8. In this zoomed
version, lower numbers represent higher probabilities. Candidates 1, 2 and 3 are
all near GT. However, candidate 5 lies in a different area. All estimates up to number
10 for this image have probabilities of more than 0.99 of being landmark 3. More
drastic confusions occur between landmarks 1 and 2. Averaging the candidates or
choosing the coordinate with the highest score as final prediction is hence not
suitable. How to decide this is the topic of the next subchapter.
Figure 3.8: Zoomed and ranked version of figure 3.7 right, candidate 5 erroneously
has a high probability of representing GT 3
3.4 Shape Model
From all previous methods (sections 3.1-3.3), we receive a number of candidates
per landmark. Now, a decision needs to be made which candidate to consider as
final output. While the number of candidates can be controlled with CNNC such
that all landmarks have an equal quantity of suggestions, CCP and RROI do not
possess this property. We need to determine which candidate combinations make
most sense given the spatial relationships (distances, angles) between them, including
the respective candidate intensity appearance likelihoods. Besides the classification
CNN for ALD, the authors of [16, 35, 15] make another contribution by introducing
a shape model and a game-theoretic framework to solve this problem. These
methods are described briefly in the following. For more details, refer to the
original publications.
The first step is to compute an intensity appearance likelihood map a_l for l ∈ L
and c ∈ C_l (3.10) in test image i_test. The fraction (i_test(n) − μ_{N_c}) / σ_{N_c}
describes the intensity-normalized pixel n from neighborhood N_c. The measures
μ_l, σ_l are mean and standard deviation matrices built from normalized patches of
the same size as N_c centered at l over D_train.
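\[ a_l(c) = \sum_{n \in N_c} \exp\left( -\frac{\left( \frac{i_{test}(n) - \mu_{N_c}}{\sigma_{N_c}} - \mu_l(n) \right)^2}{2\sigma_l(n)^2} \right) \qquad (3.10) \]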
It is calculated with Gaussian kernel density estimation (KDE). Intuitively, KDE
can be described as adding up the likelihood landscape of a variable, compared
between training set and test instance. Instead of using it for detection as e.g.
in [16], it is taken here as part of an evaluator. It indicates the degree to which
a candidate can be considered the respective landmark based on its neighborhood
pixels n ∈ N_c within an image. The values are subsequently standardized to [0,1]
to obtain probabilities. A classification CNN already outputs probabilities, so
(3.10) only needs to be applied to candidates received from CCP and RROI.
The next concept is shape likelihood maps s_{l_p,l_q}, computed for all candidates of
each possible landmark pair (l_p, l_q) with l_p ≠ l_q in a test image. They are
constructed with (3.11) as a combination of distance D (3.12) and angle Φ (3.13) KDEs
comparing test image measures to all training image measures. Hyperparameter
λ controls the relative importance of distance and angle information (set to 0.5),
while Δ and Θ (initially 1 and 0) set scaling and rotation changes of a test image's
preliminary candidate combination compared to the means in the training set.
Standard deviations σ_{D_{l_p,l_q}}, σ_{φ_{l_p,l_q}} are computed on D_train. The results
of 3.11 are, as in 3.10, standardized.
\[ s_{l_p,l_q}(d_{c_{l_p},c_{l_q}}, \varphi_{c_{l_p},c_{l_q}}, \Delta, \Theta) = \lambda D_{l_p,l_q}(\Delta d_{c_{l_p},c_{l_q}}) + (1 - \lambda) \Phi_{l_p,l_q}(\varphi_{c_{l_p},c_{l_q}} + \Theta) \qquad (3.11) \]
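\[ D_{l_p,l_q}(d_{c_{l_p},c_{l_q}}) = \sum_{i \in I_{train}} \exp\left( -\frac{(d_{l_p,l_q}(i) - d_{c_{l_p},c_{l_q}})^2}{2\sigma_{D_{l_p,l_q}}^2} \right) \qquad (3.12) \]
\[ \Phi_{l_p,l_q}(\varphi_{c_{l_p},c_{l_q}}) = \sum_{i \in I_{train}} \exp\left( -\frac{(\varphi_{l_p,l_q}(i) - \varphi_{c_{l_p},c_{l_q}})^2}{2\sigma_{\varphi_{l_p,l_q}}^2} \right) \qquad (3.13) \]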
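The following sketch shows how a pairwise shape likelihood of this kind could be evaluated. It assumes unnormalized Gaussian kernels with a 2σ² denominator, omits the final standardization to [0,1] and ignores angle wrap-around; the training measurements and all function names are made up for illustration.

```python
import math

def gaussian_kde(value, training_values, sigma):
    """Sum of Gaussian kernels centered on the training measurements (unnormalized)."""
    return sum(math.exp(-((t - value) ** 2) / (2 * sigma ** 2)) for t in training_values)

def shape_likelihood(cand_p, cand_q, train_dists, train_angles,
                     sigma_d, sigma_phi, lam=0.5, delta=1.0, theta=0.0):
    """Weighted combination of distance and angle KDEs for one candidate pair."""
    dx, dy = cand_q[0] - cand_p[0], cand_q[1] - cand_p[1]
    d = math.hypot(dx, dy)
    phi = math.atan2(dy, dx)
    return (lam * gaussian_kde(delta * d, train_dists, sigma_d)
            + (1 - lam) * gaussian_kde(phi + theta, train_angles, sigma_phi))

# Hypothetical training statistics for one landmark pair
train_dists = [98.0, 100.0, 102.0]
train_angles = [-0.02, 0.0, 0.02]
plausible = shape_likelihood((0, 0), (100, 0), train_dists, train_angles, 2.0, 0.02)
implausible = shape_likelihood((0, 0), (0, 300), train_dists, train_angles, 2.0, 0.02)
```

A candidate pair whose distance and angle match the training set receives a high value, while a pair at an unusual distance or orientation receives a value close to zero.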
With both appearance and shape values, partial payoff matrices W_{l_p,l_q} are
calculated (3.14) for each landmark pair l_p, l_q. They hold the likelihoods of all
candidates c_{l_p} being the respective landmark l_p when paired with each candidate
c_{l_q} in a test image. Both the appearance likelihood of a c_{l_p} and the shape
likelihood between each of the candidates of two distinct landmarks are considered.
The relative importance of both is controlled by τ, which is set to 0.7 here. Shape
information is weighted more heavily because candidate predictions are more
error-prone due to image symmetries, local view and previously discussed
hyperparameters like stride.
\[ W_{l_p,l_q}(c_{l_p}, c_{l_q}, \Delta, \Theta) = (1 - \tau) \cdot a_{l_p}(c_{l_p}) + \tau \cdot s_{l_p,l_q}(d_{c_{l_p},c_{l_q}}, \varphi_{c_{l_p},c_{l_q}}, \Delta, \Theta) \qquad (3.14) \]
With 3.15, the optimal candidate combination is determined as the one maximizing
the sum of landmark candidate payoffs w. The function h(x) = x / (x + α), with α
corresponding to the average payoff, prefers combinations with more similar outcomes
over strong-weak arrangements. The latter would force one landmark to be represented
by a weak candidate if a candidate of another landmark has outstandingly high
likelihoods. The authors describe this concept as equalization.
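\[ \psi(L, C, W, \Delta, \Theta) = \operatorname{argmax}_{c_{l_1} \in C_{l_1}, \dots, c_{l_{|L|}} \in C_{l_{|L|}}} \sum_{l_p \in L} \sum_{l_q \in L, l_q \neq l_p} h\big(W_{l_p,l_q}(c_{l_p}, c_{l_q}, \Delta, \Theta)\big) \qquad (3.15) \]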
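The equalization effect can be demonstrated with a brute-force search over candidate combinations. This sketch assumes that h is applied to each pairwise payoff before summation and uses exhaustive enumeration instead of the game-theoretic framework of the original publications; the payoff table and all names are hypothetical.

```python
from itertools import product

def equalize(x, alpha):
    """h(x) = x / (x + alpha): saturating, so balanced payoffs score higher in sum."""
    return x / (x + alpha)

def best_combination(candidates, payoff, alpha):
    """candidates: {landmark: [candidate, ...]}; payoff(lp, cp, lq, cq) returns W >= 0."""
    landmarks = sorted(candidates)
    best, best_score = None, float("-inf")
    for combo in product(*(candidates[l] for l in landmarks)):
        chosen = dict(zip(landmarks, combo))
        score = sum(equalize(payoff(lp, chosen[lp], lq, chosen[lq]), alpha)
                    for lp in landmarks for lq in landmarks if lp != lq)
        if score > best_score:
            best, best_score = chosen, score
    return best

# Toy payoffs: candidate 'a' is strong for landmark 1 but makes landmark 2 weak,
# candidate 'b' gives both landmarks a medium payoff. Raw sums are equal (1.0).
table = {(1, "a", 2, "c"): 0.9, (2, "c", 1, "a"): 0.1,
         (1, "b", 2, "c"): 0.5, (2, "c", 1, "b"): 0.5}
choice = best_combination({1: ["a", "b"], 2: ["c"]},
                          lambda lp, cp, lq, cq: table[(lp, cp, lq, cq)],
                          alpha=0.5)  # alpha = average payoff of the table
```

Although both combinations have the same raw payoff sum, the balanced one wins after equalization. Exhaustive enumeration grows exponentially with the number of landmarks and is shown only to make the objective concrete.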
The resulting candidate combination can be reevaluated (formula 3.15) with Δ, Θ
adjusted by comparing rotation and scale of the current best combination
with the training set average. Finally, random forest regression is applied
repeatedly following the leave-one-out principle. The candidate nearest to this
prediction is considered the final estimate and helps to eventually refine the
estimates for the other landmarks. For both procedures, candidates are included
from the complete pool C.
3.5 Cascaded Convolutional Neural Network Regression
The last tested method is regression using CNNs. Regression produces continuous
output. In the medical domain, examples are the prediction of care expenses, a
person's age when contracting a particular illness based on lifestyle and
preconditions, or the hospital return rate of patients with chronic diseases. Since
landmarks are represented as continuous pixel coordinates, regression can be applied,
too. The input to such a network is the base image itself. The training set thus
does not need to be constructed from subimages. A disadvantage is that the training
set size is restricted to these input images and is therefore much smaller than in
section 3.3. The chance of overfitting increases accordingly. The output, in contrast
to the binary decision in the classification chapter, encompasses the axis coordinates
of each landmark, resulting in 12 continuous outputs for 6 landmarks. As cost
function, the mean squared error (MSE) is used (for the general form see 3.16). Apart
from input and output, the configuration is similar to the classification network.
\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (3.16) \]
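For six landmarks the network output is a vector of twelve coordinates, and the MSE is taken over that vector. The coordinates below are made up for illustration, and the flattening order (x, y per landmark) is an assumption.

```python
def mean_squared_error(y_true, y_pred):
    """General form of the MSE over n outputs."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def flatten(landmarks):
    """[(x1, y1), (x2, y2), ...] -> [x1, y1, x2, y2, ...]"""
    return [coord for point in landmarks for coord in point]

truth = [(100, 200), (300, 200), (120, 400), (280, 400), (90, 500), (310, 500)]
prediction = [(x + 3, y - 4) for x, y in truth]  # every landmark is off by 5 px
loss = mean_squared_error(flatten(truth), flatten(prediction))
```

Because all landmarks share one loss value, an error in one coordinate influences the gradient seen by the shared layers for every landmark.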
In the prediction phase, a classification network needs to evaluate subimages for
each pixel and landmark. A regression network, in contrast, only processes a single
image of original size, as during training. However, the task at hand is more difficult
because the output range is much broader. Furthermore, a classification network
must be trained for each landmark. This makes training laborious when many
landmarks have to be detected, which is e.g. needed for subsequent segmentation.
In contrast, only a single regression network is used irrespective of landmark quantity.
A nice property of modeling all landmarks with a single network is that they
learn together, i.e. landmarks influence each other during training. This is
the desired effect of the complex shape model (section 3.4) but comes for free and
implicitly with this architecture choice. The mutual influence can be explained
by the dependency of the weight gradients on equal neuron inputs in
the last layer. Modeling related tasks in parallel with a shared representation to
support the training of individual tasks is called multi-task learning [36].
Figure 3.9: First regression prediction (red) vs. truth (gold) with ROI bounding
boxes centered at output coordinates, left: typical scenario on a validation image,
right: uncommon output on a validation image
Figure 3.9 depicts a typical and a rare prediction on two validation images. The
predictions are far off from the true locations. There are three reasons for this
behavior:
1. Too few training images are available.
2. Input images have a high resolution causing overfitting.
3. The output format is continuous which is harder to fit than binary. In
addition, compared to classification in which the combination of high-level
features make up the decision for a class, the intuition for regression is not
easy to restore.
It is a valuable property of multi-task learning to model landmark positions not
only from image patterns but also based on adjacent landmarks. It causes the
network to preserve shape. If e.g. one landmark were not located near GT or its
distance to other landmarks differed significantly, we would not be able to
consider subsequent improvements. A CNN regression network hence lays the
basis for creating regions of interest to reduce dimensionality. On these ROIs,
additional regression CNNs with smaller input can be applied repeatedly. Already
in [23], the authors created several CNNs in a row, called cascades, to make coarse
to fine-grained landmark predictions on a big dataset of small (39x39 px) face
images. In their paper, even the first network makes reasonable guesses, and only
minor gains from the following CNNs can be observed due to the smaller image size.
With cascades, we lose the property of learning a single network. In fact, networks
for each landmark must be trained after the first regression. New training
and validation sets need to be composed. An important hyperparameter is the
input image (ROI) size s_ROI of every cascade level. One way to determine it is
to look more closely at the distances between predicted and true positions on
validation data. More specifically, we plot these distances together with the rate
of GTs that lie within them for successive regressions in the cascade (figure 3.10).
The function's steepness increases with the number of regressions, meaning that
more true locations are in the prediction's vicinity. The more alike the functions
become (CCNNR2 & 3), the fewer improvements can be expected.
Figure 3.10: Distances from prediction to GT for all regressions in the cascade
Ideally, we favor a ROI size that ensures the inclusion of all GTs. At least
on validation data, this can be guaranteed by taking the maximum Euclidean
distance d_{l\hat{l}max} between ground truth l and prediction \hat{l} (3.17). If
d_{l\hat{l}max} is far away from a distance comprising e.g. 99% of true coordinates
(e.g. for CNNR, it is beyond 150 px), then it is sensible to neglect the noise in
favor of a smaller region size.
\[ d_{l\hat{l}max} = \max_{i \in I_{val}} \sqrt{(l_x - \hat{l}_x)^2 + (l_y - \hat{l}_y)^2} \qquad (3.17) \]
A safety factor d_s of desired value is added (3.18) because test data can differ from
previous considerations and to leave some space for better detection of landmarks
based on their surroundings. This in fact yields only half the size of a ROI, since
GTs can be located in any direction. s_ROI as in (3.19) adds 1 to have a real
center pixel.
\[ d_h = d_{l\hat{l}max} + d_s \qquad (3.18) \]
\[ s_{ROI} = 2 d_h + 1 \qquad (3.19) \]
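The ROI size calculation can be sketched as follows. The `coverage` parameter (taking a quantile instead of the maximum) and the rounding of the distance to whole pixels are assumptions added for illustration; the coordinates are made up.

```python
import math

def roi_size(truths, predictions, safety, coverage=1.0):
    """ROI edge length from prediction-to-GT distances on validation data.

    coverage=1.0 uses the maximum distance; a lower value (e.g. 0.99) ignores
    the most distant outliers in favor of a smaller region.
    """
    dists = sorted(math.hypot(t[0] - p[0], t[1] - p[1])
                   for t, p in zip(truths, predictions))
    index = max(0, math.ceil(coverage * len(dists)) - 1)
    d_half = math.ceil(dists[index]) + safety  # half size including safety factor
    return 2 * d_half + 1                      # odd size, so a center pixel exists

truths = [(0, 0)] * 4
predictions = [(3, 4), (6, 8), (0, 10), (30, 40)]  # distances 5, 10, 10, 50
size_all = roi_size(truths, predictions, safety=5)
size_without_outlier = roi_size(truths, predictions, safety=5, coverage=0.75)
```

Dropping the single outlier shrinks the ROI considerably, which is the trade-off discussed for the first regression.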
In order to train a new CNN to find landmarks within ROIs, a new training
set is created per landmark. One acquisition option is to create subimages (of
size s_ROI) around GT positions once for every training image. Another, better
option is to augment the data by random shifts from GT within some range
[0, shift_max]. Several ROI images are sampled per base image to create a much
bigger training set, which reduces overfitting and hence yields a more stable
predictor. This is reasonable since the random shifts repeatedly provide not only
altered (as any affine transformation does) but new pattern information.
This introduces another hyperparameter called padding p, which restricts the shift
range. As can be seen in figure 3.10, the distance difference between e.g. 90%
and 100% of GTs reached is fairly large. If we were to allow shifts of size
d_h, the built training set would not represent validation data well because most
prediction-to-GT distances are far below this value. In addition, features in
the neighborhood necessary for accurate model development would no longer be
visible. Padding ensures that ROIs still include enough space around the GT
to learn the underlying structure for solving the problem. Therefore, we reduce
d_h by p to get the maximum allowed shift shift_max (3.20).
\[ shift_{max} = d_h - p \qquad (3.20) \]
Figure 3.11 left shows an example of these values. Figure 3.11 right is an example
of a newly generated training image for landmark 3 with a shift of almost shift_max.
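A randomly shifted ROI sample could be generated as sketched below. Sampling the shift independently per axis is an assumption, as are the function name, the example values and the omission of image boundary handling.

```python
import random

def shifted_roi(landmark, d_half, padding, rng):
    """Return ((left, top, size), target) for one randomly shifted ROI.

    target is the landmark position relative to the ROI and becomes the
    regression label of the next cascade level.
    """
    shift_max = d_half - padding
    dx = rng.randint(-shift_max, shift_max)
    dy = rng.randint(-shift_max, shift_max)
    center = (landmark[0] + dx, landmark[1] + dy)
    left, top = center[0] - d_half, center[1] - d_half
    size = 2 * d_half + 1
    target = (landmark[0] - left, landmark[1] - top)
    return (left, top, size), target

rng = random.Random(0)
rois = [shifted_roi((1000, 800), d_half=50, padding=10, rng=rng) for _ in range(200)]
```

With these settings the landmark always stays at least `padding` pixels away from the ROI border, so some surrounding structure remains visible in every sample.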
Figure 3.11: Constructing a new training set for landmark 3 based on first
regression results, left: ROI measures based on validation data, right: randomly
shifted sample for next regression dataset
Along the cascade, the number of layers and feature maps can decrease since the
important features become low-level. A remarkable property of CNN cascades is
that even with little training data, good results can be achieved by dividing the
problem into two tasks. Depending on the image size and data available, the
first CNNs create ROIs to simplify the problem while the last CNNs are
responsible for accurate prediction. Figure 3.12 shows example prediction paths for
a complete image. It has been chosen since it visualizes continuous improvement
well. More predictions on different case types can be found in appendix B. Figure
3.13 shows the prediction path of a single landmark with its corresponding ROIs.
On the hip dataset, most improvement can generally be seen between the first and
the directly following regression.
Figure 3.12: Regression paths example with more accurate outcomes along the
cascade, red=1st estimate, purple=2nd, blue=3rd, green=final, gold=true position
Figure 3.13: Regression paths for landmark 3, left: ROI areas, right: Zoomed
into third ROI, enumerating regression outputs
4 Data Processing
The hip dataset has been extracted under ethical approval number 2017-444-31
with anonymization according to established practices pursuant to the Swedish
data protection act (Personuppgiftslagen). Anonymized image data was obtained
from the Linköping university hospital, a Sectra PACS and orthopedic package
The dataset consists of around 4000 postoperative and preoperative cases (with
and without prostheses) of both genders, stored in DICOM files. Postoperative
images were excluded because hip replacement surgery planning with
the orthopedic package is mostly conducted for preoperative cases. Furthermore,
this simplifies the landmark detection task. The data is raw, i.e. images might be
duplicates and are not filtered to always have clear/unambiguous landmark
positions. Besides duplicates, only x-rays in which guided manual planning would not
have been technically possible (e.g. landmark concealed/not within image boundaries)
have been excluded. Due to time limitations, a dataset of size 1054 remained after
removal. Many outliers that make landmark placement challenging even for humans
are part of the kept set. Those high variances originate from:
• bad contrast/illumination (e.g. consistently too high brightness)
• patient gender and physique
• arthritis (caput center and acetabular roof barely or not visible)
• flat ischia (placement of landmarks ambiguous without contour changes)
• invisible trochanters due to patient rotation
• artifacts (words, objects of unknown type)
Appendix A shows some examples of these conditions.
DICOM files possess the property Photometric Interpretation, which specifies the
intended interpretation of pixel values. It is changed so that all images share
the same convention (0 = black). Some images were upside-down and needed a rotation.
Labels are not included and were obtained with the help of employees of Sectra's
orthopedics department acting as experts. For this purpose, a labeling tool has been
developed. The design goal was to resemble the click guide process. This includes
color choice, click order and visual aids like points representing landmarks, the
line between right/left ischium as well as a circle to facilitate capturing the caput
(see figure 4.1). Furthermore, a right click reverts selections to allow changes,
and a console provides an overview of the current state. Packages of 10 files
each were stored on the network file system, and every expert could work on these
depending on available time. All experts were instructed about the procedure and
landmark definitions (see chapter 1) prior to labeling. Obtained landmarks were
reinspected and obvious errors corrected.
Figure 4.1: Label clicker: every click draws a new point
The files have varying numbers of pixels, aspect ratios and spacings (mm per
pixel). Data modeled by a standard convolutional neural network need an equal
number of input dimensions, while CCP and RROI do not. In order to avoid the
creation of different datasets, the original is altered so that all files have equal
pixel quantities (neural network input dimensions). To this end, the format is first
changed to TIFF for better access to image processing libraries and easier file
handling. For ratio equalization (1:1), the shorter image side is zero-padded at
its upper extremum to match the longer side. Thereafter, images and landmark
coordinates are resized to the size of the smallest image (2021 pixels) in the set
for pixel equalization. Lastly, image intensities are rescaled to the range [0, 1]
to account e.g. for varying brightness and because the original files have differing
minimum and maximum intensities (DICOM tag: Bits Allocated).
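The effect of these steps on landmark coordinates and intensities can be sketched without any imaging library. The sketch assumes that zero-padding is appended at the high-index end of the shorter side, so coordinates are unchanged by padding and only scaled by the resize; the function names and example values are hypothetical.

```python
def square_and_resize(width, height, landmarks, target=2021):
    """Map landmark coordinates from the original image to the padded, resized image."""
    side = max(width, height)  # padding extends the shorter side to this length
    scale = target / side
    return [(x * scale, y * scale) for x, y in landmarks]

def rescale_intensities(pixels):
    """Linearly map intensities to the range [0, 1]."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]

scaled = square_and_resize(3000, 2500, [(1500, 1250)])
normalized = rescale_intensities([100, 600, 1100])
```

If padding were instead added at the low-index end, the padding offset would have to be added to the coordinates before scaling.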
While many training instances are available for CNNC due to multisampling
within a single image, the quantity of training examples for all other techniques
is restricted by dataset size. With data augmentation, the training data is
expanded by small random changes of the original data, which results in more robust
models that are less prone to overfitting. Therefore, every training image (and its
corresponding set of landmarks) is augmented several times, equally often
by translation and rotation, which results in 3000 training images (4 augmented
images per original + original). Rotations of up to 4° are allowed in either direction,
with the exception of 0°. Similarly, maximum translations of 8% of the image
dimension are allowed. The offset is filled with zeros. The bounds were chosen
to still represent realistic cases. If the resulting landmarks lie outside the image
dimensions, a new random constellation is computed until they fit. For every
part of the cascade, 6000 training and 2000 validation images were generated (10
augmentations per original).
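The landmark side of this augmentation can be sketched as follows. Rotation about the image center and uniform sampling of angle and offsets are assumptions; the image itself would have to be transformed with the same parameters, which is omitted here. All names and coordinates are made up.

```python
import math
import random

def augment_landmarks(landmarks, size, kind, rng, max_angle_deg=4.0, max_shift=0.08):
    """Return landmarks after one random rotation or translation.

    The random parameters are redrawn until all landmarks stay inside the image.
    """
    center = (size - 1) / 2.0
    while True:
        angle, tx, ty = 0.0, 0.0, 0.0
        if kind == "rotate":
            angle = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
            if angle == 0.0:
                continue  # a rotation by exactly 0 degrees is excluded
        else:
            tx = rng.uniform(-max_shift, max_shift) * size
            ty = rng.uniform(-max_shift, max_shift) * size
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        moved = [(center + cos_a * (x - center) - sin_a * (y - center) + tx,
                  center + sin_a * (x - center) + cos_a * (y - center) + ty)
                 for x, y in landmarks]
        if all(0 <= x < size and 0 <= y < size for x, y in moved):
            return moved

rng = random.Random(0)
original = [(500.0, 600.0), (1500.0, 1400.0)]
# Four augmentations per original: two rotations and two translations
augmented = [augment_landmarks(original, 2021, kind, rng)
             for kind in ("rotate", "rotate", "translate", "translate")]
```

Together with the untouched original this gives five landmark sets per training image, matching the factor between 600 and 3000 images.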
The dataset is split into training, validation and test set consisting of 600, 200
and 254 images, respectively. Models consisting of parameters (weights in the
neural network case) are built from the training set. Hyperparameters (values
set manually affecting the model performance) are chosen based on the validation
set. Evaluation is done on the test set.
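The split sizes can be reproduced with a short sketch. Shuffling before splitting and the fixed seed are assumptions, since the text does not state how images were assigned to the three sets.

```python
import random

def split_dataset(items, n_train=600, n_val=200, seed=0):
    """Shuffle once, then cut into training, validation and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset(range(1054))
```

Keeping the three sets disjoint is what allows hyperparameters to be tuned on validation data without biasing the final test evaluation.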
5 Implementation
Python 3.5 is chosen as general programming language allowing fast experimenta-
tion. In addition, many third-party packages can be leveraged. Keras with Ten-
sorflow backend is used for neural network implementations. A Nvidia GeForce
GTX 1060 6GB has been provided for faster neural network learning. In this
chapter, implementation details in form of pseudocodes are given for CCP and
RROI. Furthermore, applied CNN architectures are depicted. Hyperparameters
are mentioned in both subchapters.
5.1 Pseudocodes
5.1.1 CCP Feature Extraction
OpenCV implementations are used to preprocess images (histogram equalization,
Gaussian blurring, Canny, closing). Gaussian blurring is applied with kernel
size (5,5). Standard deviations are calculated from the kernel. Canny thresholds
are t_min = 0.7 and t_max = 0.3, and a Sobel kernel shape of (5,5) is chosen.
Morphological closing is implemented with a (5,5) kernel, too. Finally, scikit-image's
implementation of skeletonization [37] is used.
Figure 5.1 shows the learning process pseudocode of CCP, which is merely the
gathering of feature objects in a landmark dictionary. Several hyperparameter
settings have been tested, with t_c = 15 px, r_N = 15 px and k = 1 (deactivating
all other variation settings) achieving the best results. Contours as a collection of
change coordinates (in the sense of every direction change) are calculated with
OpenCV's implementation of [38]. The calcChains function returns a list of
chains as numerical strings and their respective start coordinates for tracking. It
does this by finding a starting point from each contour's coordinates and traversing
it with binary filtering. Already crossed points are added to a blacklist. New
chains are initialized when passing junctions. Based on chains, CPs composed
of type and coordinate are extracted. Finally, features are calculated from CPs
occurring near the training images' landmarks.
1: Init FDB as empty mapping
2: for all (i, L_i) ∈ D_train do
3:   i = preprocess(i)
4:   contours = findContours(i)
5:   chains, startPoints = calcChains(contours)
6:   CPs = calcCPs(chains, startPoints)
7:   FDB := calcFeatures(CPs, L_i)
8: end for
Figure 5.1: CCP learning procedure
The process of calculating changes is depicted in figure 5.2. If a chain string is
too short, it is ignored. The threshold is set to 2t_c here, with at least 1t_c
needed to find a starting direction and at most another 1t_c to obtain a target
direction. The initial starting direction cpFirst is calculated with the function
findInitDominantDirec, which counts CC occurrences and outputs the one with the
highest count as soon as t_c is reached for that particular CC. In case t_c is not
passed after full traversal, no CP is found for this chain.
Variables changeContribs and firstCodeCoords store the number of occurrences
and the coordinate of first occurrence for every CC within the extraction of a CP.
During chain string traversal, these variables are updated. Variable sumOfChanges,
representing the change magnitude, is incremented/decremented based on the
current CC and its compass step distance to cpFirst. On surpassing t_c, the target
direction cpSecond is determined as the CC with most contributions and a new CP
is assembled. The starting direction is set to the target direction and all observer
variables are reset. The observer variables are also reset when sumOfChanges
becomes lower than 0.
The function calcFeatures in figure 5.3 returns a dictionary that maps from
landmarks to lists of features F_i for a particular image i. A change coordinate
near a ground truth landmark and its k nearest neighbors compose a feature
representing that landmark. The prediction is eventually a simple FDB lookup,
as depicted in figure 5.4, where line 2 is a shortened version of the previously
described functions.
1: function calcCPs(chains, startPoints)
2:   Init CPs as list
3:   for all chain ∈ chains, startPoint ∈ startPoints do
4:     if |chain| < 2t_c then
5:       nextIteration
6:     end if
7:     cpFirst, currentCoord = findInitDominantDirec(chain, startPoint, t_c)
8:
9:     Init firstCodeCoords, changeContribs as empty/zeroed CC mapping
10:    Init sumOfChanges as 0
11:    while traverse(chain) with cc ∈ chain do
12:      updateObservers(firstCodeCoords, changeContribs, sumOfChanges, cc)
13:      if sumOfChanges > t_c then
14:        cpSecond = getMaxContributor(changeContribs)
15:        type = (cpFirst, cpSecond)
16:        coord = firstCodeCoords[cpSecond]
17:        CPs := (type, coord)
18:        cpFirst = cpSecond
19:        reset(changeContribs, sumOfChanges, firstCodeCoords)
20:      else if sumOfChanges < 0 then
21:        reset(changeContribs, sumOfChanges, firstCodeCoords)
22:      end if
23:    end while
24:  end for
25:  Return: CPs
26: end function
Figure 5.2: CCP algorithm: Change point calculation
1: function calcFeatures(CPs, L_i)
2:   Init F_i as empty mapping
3:   for all cp ∈ CPs do
4:     for all l_id ∈ L_i do
5:       if euclDist(cp.coord, l_id) < r_N then
6:         N = getNeighborChanges(cp.coord, CPs, k)
7:         F_i[id] := (cp, N)
8:       end if
9:     end for
10:  end for
11:  Return: F_i
12: end function
Figure 5.3: CCP algorithm: Feature extraction
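The feature extraction of figure 5.3 might look as follows in Python. The data layout (change points as `(type, (x, y))` tuples, landmarks as a dictionary) and the helper names are assumptions made for this sketch; r_N = 15 and k = 1 are the settings reported above.

```python
import math
from collections import defaultdict

def eucl_dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def calc_features(change_points, landmarks, r_n=15, k=1):
    """change_points: [(type, (x, y)), ...]; landmarks: {id: (x, y)}.

    Returns {landmark id: [(change point, its k nearest change points), ...]}.
    """
    features = defaultdict(list)
    for cp in change_points:
        for lm_id, lm in landmarks.items():
            if eucl_dist(cp[1], lm) < r_n:
                neighbors = sorted((other for other in change_points if other is not cp),
                                   key=lambda other: eucl_dist(other[1], cp[1]))[:k]
                features[lm_id].append((cp, neighbors))
    return dict(features)

# Hypothetical change points: type is (start direction, target direction)
cps = [((0, 2), (100, 100)), ((2, 4), (105, 100)), ((4, 6), (300, 300))]
features = calc_features(cps, {1: (102, 101), 2: (500, 500)})
```

Landmarks without any change point within r_N simply receive no feature for that image.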
1: for all i ∈ I_test do
2:   F_test = getFeatures(i)
3:   \hat{L}_i = lookup(F_test, FDB)
4: end for
Figure 5.4: CCP algorithm: Prediction as a lookup
5.1.2 RROI Prediction
Figure 5.5 shows the candidate prediction pseudocode for all landmarks of a single
test image. Means over I_train with sizes s_a, a = 1..A are taken in line 4 and
embody model learning. The reference coordinate refCrd corresponds initially
to the point of origin (top-left corner) and is later adjusted to the respective
ROI starting position in the global coordinate system. Coordinates of all ROI
starting positions are stored in ROICrds, a dictionary mapping each layer to a
coordinate list. Finally, the center positions of the deepest ROI are considered as
landmark candidates. If a = A and s_a = 1, then candidates correspond to ROICrds[1].
ROI calculation is done horizontally first with a recursive function calcROIs,
as depicted in figure 5.6. Variable patchDiffs holds the SSD values between the mean
patch and the sliding-window-sampled patches of size s_a, calculated in line 9.
Starting coordinates relative to the current search area, of quantity spawn_a and
taken from the lowest-SSD patches, are stored in relCrds. The resulting global
coordinates are stored in ROICrds[a], taking the current reference coordinate into
account. The newly encountered ROIs are used to calculate ROIs one layer deeper
until maximum depth A is reached.
The following hyperparameters, found through grid search, produce the best results on validation data:
A = 9
s_1..A = {1000, 200, 100, 50, 10, 6, 4, 2, 1}
stride_1..A = {100, 50, 20, 10, 1, 1, 1, 1, 1}
spawn_1..A = {3, 7, 2, 1, 1, 1, 1, 1, 1}
Note that the hyperparameters depend on the image size. Generally, the first layers are used to obtain distinct rough estimates (large size, stride and spawn values), while deep layers become more specific because fewer errors are made due to similar visual features.
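As a small worked example of how these settings interact: every ROI of one layer can spawn at most spawn_a ROIs in the next layer, so the running product of the spawn values gives an upper bound on the number of ROIs per layer. The variable names below are illustrative only.

```python
from itertools import accumulate
from operator import mul

sizes = [1000, 200, 100, 50, 10, 6, 4, 2, 1]
strides = [100, 50, 20, 10, 1, 1, 1, 1, 1]
spawns = [3, 7, 2, 1, 1, 1, 1, 1, 1]

# Upper bound on the number of ROIs in each layer (duplicate ROIs may reduce it)
max_rois = list(accumulate(spawns, mul))
print(max_rois)  # [3, 21, 42, 42, 42, 42, 42, 42, 42]
```

With these values, at most 42 landmark candidates can result from the deepest layer.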
1: for all l ∈ L_i do
2:   Init refCrd as (0,0)
3:   Init ROICrds as empty mapping
4:   means = getMeanPatchesLm(I_train, s_1..A, l)
5:   ROICrds := calcROIs(i_test, refCrd, ROICrds, means, a = 1)
6:   C_i,l = getLastLayerCenters(ROICrds)
7: end for
Figure 5.5: RROI algorithm: Landmark candidate prediction
1: function calcROIs(i, refCrd, ROICrds, means, a)
2:   if a > A then
3:     Return: ROICrds
4:   end if
6:   Init patchDiffs as empty list
7:   while slidingWindow(i, s_a, stride_a) do
8:     patch = generateNext()
9:     patchDiffs := calcPatchDiff(patch, means[a])
10:  end while
11:  relCrds = getMinDiffCoords(patchDiffs, spawn_a)
12:  for all crd ∈ relCrds do
13:    ROICrds[a] := refCrd + crd
14:    refCrd := ROICrds[a]
15:    refImg = getSubImg(i, crd, s_a)
16:    ROICrds := calcROIs(refImg, refCrd, ROICrds, means, a + 1)
17:  end for
19:  Return: ROICrds
20: end function
Figure 5.6: RROI algorithm: Recursive function for ROI coordinate calculation
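The recursion of figure 5.6 can be illustrated with a small pure-Python sketch on nested lists. It is a simplified reading of the pseudocode under stated assumptions: layers are 0-indexed, coordinates are (row, column), and the global coordinate of each ROI is passed down as the new reference instead of overwriting refCrd in place. All function names are chosen for this example.

```python
def ssd(patch, mean_patch):
    """Sum of squared differences between two equally sized 2D lists."""
    return sum((p - m) ** 2
               for p_row, m_row in zip(patch, mean_patch)
               for p, m in zip(p_row, m_row))


def sub_image(img, top, left, size):
    return [row[left:left + size] for row in img[top:top + size]]


def calc_rois(img, ref, roi_crds, means, sizes, strides, spawns, a=0):
    """Recursively collect global ROI start coordinates (row, col) per layer."""
    if a >= len(sizes):
        return roi_crds
    size, stride = sizes[a], strides[a]
    diffs = []
    for top in range(0, len(img) - size + 1, stride):
        for left in range(0, len(img[0]) - size + 1, stride):
            patch = sub_image(img, top, left, size)
            diffs.append((ssd(patch, means[a]), (top, left)))
    diffs.sort(key=lambda d: d[0])          # lowest SSD first
    for _, (top, left) in diffs[:spawns[a]]:
        glob = (ref[0] + top, ref[1] + left)
        roi_crds.setdefault(a, []).append(glob)
        calc_rois(sub_image(img, top, left, size), glob,
                  roi_crds, means, sizes, strides, spawns, a + 1)
    return roi_crds


# Toy data: a 6x6 image with one bright pixel at row 4, column 3
image = [[0] * 6 for _ in range(6)]
image[4][3] = 1
means = [[[0, 0, 0], [0, 1, 0], [0, 0, 0]], [[1]]]
rois = calc_rois(image, (0, 0), {}, means,
                 sizes=[3, 1], strides=[1, 1], spawns=[1, 1])
print(rois)  # {0: [(3, 2)], 1: [(4, 3)]}
```

The first layer locks onto the 3x3 window centered on the bright pixel, and the second layer, with a patch size of 1, returns the pixel itself in global coordinates.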
5.2 CNN Architectures and Learning
A generic CNN architecture is illustrated in figure 5.7. It gives some intuition about how CNNs learn representations: here, the first weight matrix has learned to recognize vertical, the second diagonal and the third horizontal edges. Feature maps exemplify reactions to the input in an illustrative manner, with black regions denoting high activations. The receptive field windows of conv 2 view a larger area of the input because of the compression by pooling, and they have access to all maps from pooling 1, so that higher-level feature compositions are learned. When counting layers, a conv layer and its max-pooling layer are treated as a single conv layer. The max-pooling window width and height are set to (2,2). All CNNs have 2 FC layers, the first with 512 neurons. All hidden layers are ReLU activated. All networks are trained with Adam using the standard parameters from the original paper [39].
Figure 5.7: Generic CNN architecture with intuition of low to high-level feature
Further settings are listed in table 5.1 as a comparison between all CNN variants. Different settings for #Conv, #FeatureMaps, Rec.Field and Regularization were tuned with a reduced training/validation set and a small number of epochs. The base image size for sampling in CNNC is kept at 2021 x 2021 px with stride = 10 px and radii = (10, 30, 550) px. For CCNNRs, input dimensions are chosen based on the heuristic presented in chapter 3.5 with d_h = 250, 150, 50 px and shift_max = 100, 50, 30 px for the 3 refinement CNNs.
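The half-sizes of 250, 150 and 50 px match the cascade input dimensions of table 5.1 if each ROI is a square of side 2 * d_h + 1 centered on the previous estimate. A minimal sketch of cascade inference under that assumption is given below. The predictor callables stand in for the trained refinement CNNs; their signature and the oracle used for the demonstration are hypothetical, and border handling is omitted.

```python
def roi_side(d_h):
    """Side length of a square ROI with half-size d_h around a center pixel."""
    return 2 * d_h + 1


def run_cascade(initial, predictors, half_sizes):
    """Refine an (x, y) estimate; each predictor returns ROI-relative coordinates."""
    x, y = initial
    path = [(x, y)]
    for predict, d_h in zip(predictors, half_sizes):
        left, top = x - d_h, y - d_h            # ROI origin in global coordinates
        rx, ry = predict(left, top, roi_side(d_h))
        x, y = left + rx, top + ry              # back to global coordinates
        path.append((x, y))
    return path


def make_oracle(true_xy):
    """Stand-in for a trained CNN: returns the true landmark, clipped to the ROI."""
    def predict(left, top, side):
        def clip(v):
            return min(max(v, 0), side - 1)
        return clip(true_xy[0] - left), clip(true_xy[1] - top)
    return predict


half_sizes = [250, 150, 50]
oracle = make_oracle((1200, 800))
path = run_cascade((1000, 900), [oracle] * 3, half_sizes)
print([roi_side(d) for d in half_sizes])  # [501, 301, 101]
print(path[-1])                            # (1200, 800)
```

The non-negative ROI-relative outputs in this sketch are in line with the ReLU output activation used for the regression networks.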
The mini-batch size for CNNC is set to 64, which its smaller input dimensions permit, with batch normalization after every hidden activation. For CNNR, the choice of batch size is severely restricted by the high input dimension, the choice of architecture and limited GPU memory. For that reason, a batch size of 1 must be used. Within the cascade (CCNNRs), it is set to 10 throughout in order to have fewer hyperparameters to optimize. For the same reason, architecture depth and feature map quantities are restricted. Increasing them for CNNR and CCNNR 1 (e.g. with more GPUs/GPU memory) might further improve results because the outcome depends on high-level features.
The number of feature maps increases for deeper layers, as many more complex combinations of previous features can be composed to account for high-level variations (edge vs. pelvis). Receptive fields are reduced over layers as their area of view increases. Zero-padding is applied to equalize output and input maps, with one exception. In conv 1, it is unlikely that image structures near landmarks are neglected at the boundaries, so that larger receptive fields can be used to shrink dimensionality in the following layers. Strides are set such that the output shape shrinks consistently and the overall number of parameters within the FC layers is kept small. For binary classification, sigmoid is a proper output activation function, with a single neuron in the last layer. ReLU can be applied for continuous landmark coordinates ≥ 0, with 2|L| output neurons in the multi-task setting and only 2 outputs for cascaded networks.
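How strides, padding and pooling shrink the maps can be checked with a few lines of arithmetic. The sketch below assumes the common convention that a zero-padded layer yields ceil(n / stride) outputs, an unpadded layer yields floor((n - kernel) / stride) + 1, and every conv layer is followed by (2,2) max pooling. The kernel size of 5 is a placeholder, since not all receptive field sizes are stated here.

```python
import math


def conv_out(n, kernel, stride, padded):
    """Output side length of a conv layer applied to an n x n map."""
    if padded:                           # zero-padding: only the stride shrinks the map
        return math.ceil(n / stride)
    return (n - kernel) // stride + 1    # no padding


def pool_out(n, window=2):
    """Non-overlapping max pooling without padding."""
    return n // window


def trace_shapes(n, layers):
    """layers: list of (kernel, stride, padded); side length after each conv + pool."""
    sides = []
    for kernel, stride, padded in layers:
        n = pool_out(conv_out(n, kernel, stride, padded))
        sides.append(n)
    return sides


# CNNC-like setting: 111 px input, strides 2,1,1, zero-padding in every layer
print(trace_shapes(111, [(5, 2, True), (5, 1, True), (5, 1, True)]))  # [28, 14, 7]
```

Under these assumptions, the 111 x 111 px input is reduced to 7 x 7 px maps before the first FC layer.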
CNN variant     CNNC        CNNR                   CCNNR 1     CCNNR 2     CCNNR 3
Input dim.      111x111     2021x2021              501x501     301x301     101x101
|Batch|         64          1                      10          10          10
Norm.           yes         no                     yes         yes         yes
#Conv           3           4                      4           4           3
#FC             2           2                      2           2           2
#Feature maps   …
Rec. field      (5,5),(5,5),…
Padding         yes         Conv1: no, rest: yes   yes         yes         yes
Pooling         Max(2,2)    Max(2,2)               Max(2,2)    Max(2,2)    Max(2,2)
Strides         2,1,1       4,2,1,1                2,2,1,1     2,1,1,1     2,1,1
#Output         1           12                     2           2           2
Output act.     Sigmoid     ReLU                   ReLU        ReLU        ReLU
Loss            bin. cross…
Regularization  L2(0.01)    L2(0.01),…
Table 5.1: Architecture settings for all CNN variants using the hip dataset
6 Evaluation
In the first subchapter, all used metrics are defined. Thereafter, all results are
mentioned for both hip and skull datasets. Further reasoning and explanations
are given in chapter 7.
6.1 Metrics
Two general and two application-specific metrics are calculated for method comparison over D_test with corresponding predictions L̂. The first general metric is the mean accuracy within certain distance bounds, MADB (6.3). It uses the Euclidean distance (6.1), with ∆ describing the difference along the specified image dimension, as well as a counter cnt (6.2) which returns 1 whenever l̂ lies within distance bound b, with b ∈ bounds = {2, 3.5, 5, 7.5, 10} mm.
\[ ED(l, \hat{l}) = \sqrt{\Delta x_l^2 + \Delta y_l^2} \tag{6.1} \]
\[ cnt(x) = \begin{cases} 0 & \text{if } x > 0 \\ 1 & \text{if } x \le 0 \end{cases} \tag{6.2} \]
\[ MADB = \frac{\sum_{l_p \in L,\, \hat{l}_p \in \hat{L}} \sum_{i \in I_{test}} cnt\left(ED(l_p^{(i)}, \hat{l}_p^{(i)}) - b\right)}{|I_{test}| \cdot |L|} \tag{6.3} \]
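The second metric is the mean radial error MRE (6.4), which sums over distances instead of counts.
\[ MRE = \frac{\sum_{l_p \in L,\, \hat{l}_p \in \hat{L}} \sum_{i \in I_{test}} ED(l_p^{(i)}, \hat{l}_p^{(i)})}{|I_{test}| \cdot |L|} \tag{6.4} \]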
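The two general metrics can be computed in a few lines. The following sketch assumes that coordinates are already given in mm, that ground truth and predictions are stored as one list of (x, y) points per test image, and that the counter is applied to the distance minus the bound; the function names are chosen for this example.

```python
import math


def ed(l, l_hat):
    """Euclidean distance (6.1) between two (x, y) points given in mm."""
    return math.hypot(l[0] - l_hat[0], l[1] - l_hat[1])


def cnt(x):
    """Counter (6.2): 1 if x <= 0, else 0."""
    return 1 if x <= 0 else 0


def madb(gt, pred, bound):
    """Share of landmarks whose prediction lies within the distance bound."""
    hits = sum(cnt(ed(l, lh) - bound)
               for g_img, p_img in zip(gt, pred)
               for l, lh in zip(g_img, p_img))
    return hits / (len(gt) * len(gt[0]))


def mre(gt, pred):
    """Mean radial error over all test images and landmarks."""
    total = sum(ed(l, lh)
                for g_img, p_img in zip(gt, pred)
                for l, lh in zip(g_img, p_img))
    return total / (len(gt) * len(gt[0]))


gt = [[(0.0, 0.0), (10.0, 0.0)]]
pred = [[(0.0, 1.0), (10.0, 6.0)]]
print(madb(gt, pred, 2), madb(gt, pred, 10), mre(gt, pred))  # 0.5 1.0 3.5
```

In the toy data the two predictions are 1 mm and 6 mm off, so half of them fall within the 2 mm bound and the mean radial error is 3.5 mm.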
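The first application-specific metric is the mean leg length discrepancy error MLLDE (6.7), with LLD (6.6) as the difference between the perpendicular distances d of the trochanter landmarks l_t ∈ {l_3, l_4} from the ischium line defined by l_1, l_2 (6.5).
\[ d(l_1, l_2, l_t) = \frac{\left|\Delta y_{l_2,l_1} \cdot l_{t,x} - \Delta x_{l_2,l_1} \cdot l_{t,y} + l_{2,x} \cdot l_{1,y} - l_{2,y} \cdot l_{1,x}\right|}{ED(l_1, l_2)} \tag{6.5} \]
\[ LLD(L) = \left|d(l_1, l_2, l_3) - d(l_1, l_2, l_4)\right|, \quad l_1 \text{ to } l_4 \in L \tag{6.6} \]
\[ MLLDE = \frac{\sum_{i \in I_{test}} \left|LLD(L) - LLD(\hat{L})\right|}{|I_{test}|} \tag{6.7} \]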
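A short sketch of the leg length discrepancy computation follows. It assumes that each image provides the landmarks l1 to l4 as (x, y) points in mm, that the differences in (6.5) are taken as l2 minus l1, and that MLLDE averages over the test images; function names are illustrative.

```python
import math


def line_dist(l1, l2, lt):
    """Perpendicular distance (6.5) of point lt from the line through l1 and l2."""
    dx, dy = l2[0] - l1[0], l2[1] - l1[1]
    num = abs(dy * lt[0] - dx * lt[1] + l2[0] * l1[1] - l2[1] * l1[0])
    return num / math.hypot(dx, dy)


def lld(landmarks):
    """Leg length discrepancy (6.6) from the first four landmarks."""
    l1, l2, l3, l4 = landmarks[:4]
    return abs(line_dist(l1, l2, l3) - line_dist(l1, l2, l4))


def mllde(gt, pred):
    """Mean absolute difference between ground-truth and predicted LLD."""
    return sum(abs(lld(g) - lld(p)) for g, p in zip(gt, pred)) / len(gt)


gt = [[(0, 0), (10, 0), (2, 3), (8, 5)]]
pred = [[(0, 0), (10, 0), (2, 4), (8, 5)]]
print(lld(gt[0]), lld(pred[0]), mllde(gt, pred))  # 2.0 1.0 1.0
```

Because only the distances to the ischium line enter the LLD, a landmark error along the line direction does not change the result, which is why this metric complements MRE.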
Finally, the mean ischium line angle difference MIAE (6.8) is measured. It gives information about the correctness of the estimated landmark combination l_1, l_2, which both defines the orientation of the cup and is part of the LLD calculation. It is measured based upon angles ∠l_1 l_2 ≤ 180°, here starting from a horizontal line to the right of l_2 in counterclockwise direction.
\[ MIAE = \frac{\sum_{i \in I_{test}} \left|\angle l_1^{(i)} l_2^{(i)} - \angle \hat{l}_1^{(i)} \hat{l}_2^{(i)}\right|}{|I_{test}|} \tag{6.8} \]
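One possible reading of the ischium line angle error is sketched below. The angle is measured at l2, counterclockwise from a horizontal ray pointing to the right; the sign flip for the downward image y axis and the wrap-around handling of the difference are assumptions of this example and not taken from the thesis.

```python
import math


def ischium_angle(l1, l2):
    """Angle in degrees of the ray l2 -> l1, counterclockwise from the horizontal
    ray pointing right from l2. Image y grows downwards, hence the sign flip."""
    return math.degrees(math.atan2(-(l1[1] - l2[1]), l1[0] - l2[0])) % 360.0


def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def miae(gt, pred):
    """Mean ischium line angle difference; landmarks l1, l2 come first per image."""
    return sum(angle_diff(ischium_angle(g[0], g[1]), ischium_angle(p[0], p[1]))
               for g, p in zip(gt, pred)) / len(gt)


gt = [[(0.0, 0.0), (10.0, 0.0)]]
pred = [[(0.0, 1.0), (10.0, 0.0)]]
print(round(miae(gt, pred), 3))  # 5.711
```

Shifting l1 by 1 mm perpendicular to a 10 mm long ischium line tilts the line by about 5.7 degrees, which illustrates how sensitive the angle is to errors in the first two landmarks.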
6.2 Results
At the beginning of this project, only 143 images were available. These are used to determine the inter-expert MRE and to address two important questions: first, how well humans perform on this task, and second, approximately how accurate the system can become given the training data received later. Two rounds with the setting described in chapter 4 are conducted, with experts labeling an image collection in the first round and taking over another collection from a different expert in the second. The MREs in mm for landmarks 1 to 6 are 4.76, 4.07, 1.49, 1.81, 2.99 and 2.7; over all landmarks, the MRE is 2.97 ± 3.24. At that time, the labeling tool did not include an ischium line helper, which could explain the higher errors for the first two landmarks. These results should be interpreted as an indicator for the two questions raised above and by no means as comprehensive answers, because the sample size is too small. Figure 6.1 shows the landmark-wise MRE, while figure 6.2 exemplifies the bounds used for MADB on a hip image.
Figure 6.1: MRE for all landmarks compared between experts on 143 images.
In figures 6.3 and 6.4, the MADB metric is displayed for all 6 and for the first 4 landmarks. CCP is only considered in the latter plot since it does not always find candidates for landmarks 5 and 6; it was not designed to handle landmarks distant from edges. Another reason for separating the plots is that the main task is to find the first 4 landmarks only, due to the subjectiveness of placing cups. For CNNC, it is worth mentioning the theoretically possible results, obtained by taking the candidates nearest to the GTs as final estimations, because large differences to figure 6.3 were encountered with a stride of 10 px. The MADB values in % for distance bounds 2 to 10 mm are 75.8, 85.8, 89.3, 91.5 and 92.6.
Table 6.1 lists the MRE for every landmark, together with standard deviations for the groups of the first 4 and all 6 landmarks. Figures 6.5 and 6.6 show the application metrics separated by performance (be aware of the different scales). For the best performing method, CCNNR with 3 successive refinements after the initial regression, histograms of the LLD and ischium angle errors are depicted in figures 6.7 and 6.8.
Figure 6.2: Used distance bounds (2, 3.5, 5, 7.5, 10) in mm on example
Algorithm   1      2      3      4      5      6      first 4         all
CCP         49.70  42.81  48.82  29.83  -      -      42.79 ± 35.61   -
RROI        24.83  27.00  20.51  33.68  23.64  32.64  26.51 ± 39.24   27.05 ± 41.45
CNNC        7.26   10.00  12.84  19.61  12.72  19.60  12.43 ± 29.59   13.67 ± 32.39
CNNR        9.22   10.07  8.60   9.21   7.33   7.54   9.27 ± 5.47     8.66 ± 5.23
CCNNR 1     3.76   3.87   2.53   2.56   2.98   3.52   3.18 ± 2.98     3.20 ± 2.80
CCNNR 2     2.54   3.16   1.79   1.69   2.81   2.83   2.30 ± 2.73     2.47 ± 2.52
CCNNR 3     2.46   2.80   1.59   1.60   3.06   2.51   2.12 ± 2.70     2.34 ± 2.50
Table 6.1: MRE for all algorithms, reported on all landmarks
Finally, CCNNRs are also evaluated on the IEEE ISBI 2015 Challenge dataset [1] for finding 19 skull landmarks needed for automated quantitative cephalometry. They are compared to the previously best performing methods as in [15]. The MADB results are shown in figures 6.9 and 6.10 for two different test sets in the bounds 2, 2.5, 3 and 4 mm. The MRE in mm is 1.51 ± 1.29 for test set 1 and 1.79 ± 1.08 for test set 2. More information about the data and implementation as well as example images can be found in appendix C.
Figure 6.3: MADB on test set including all landmarks, reported on all methods
except CCP
Figure 6.4: MADB on test set including the first 4 landmarks, reported on all methods
Figure 6.5: MLLDE and MIAE with standard deviations for image processing
methods and CNNC
Figure 6.6: MLLDE and MIAE with standard deviations for CNN regressions
Figure 6.7: Histogram showing distribution of LLD error over the hip test set
with CCNNR 3
Figure 6.8: Histogram showing distribution of ischium angle error over the hip
test set with CCNNR 3
Figure 6.9: Comparison of CCNNR 4 result to other papers on IEEE ISBI 2015
Challenge test set 1
Figure 6.10: Comparison of CCNNR 4 result to other papers on IEEE ISBI 2015
Challenge test set 2
7 Discussion
In this chapter, methods are compared based on the previous evaluations, and hypotheses are made about why some methods work better than others. To be usable in production, algorithm performance is not the only thing that matters. Therefore, implementation and UX aspects are discussed in a second subchapter.
7.1 Method Comparison
The CCP algorithm performed worst in all evaluations and finds only 2.56% of all landmarks within the 2 mm bound. Edges can often not be made visible. If the Canny thresholds are set too low, detailed but unnecessary contours become visible. Setting them too high, on the other hand, leaves major gaps far away from landmark coordinates. Low-contrast landmark neighborhoods remain undetected. Furthermore, landmarks do not necessarily lie near change points, which is best illustrated by extreme cases like hidden trochanters, a flat ischium or the caput center. CCP is also difficult to tweak. Training on many images generates an abundance of test instance features and thus candidates. In general, the number of candidates is not adjustable. It is not even guaranteed that a candidate is returned if too few training images are used or if the number of required neighbors yields an underrepresentation of features. The only advantage is easy intuition but in conclusion lacks
RROI performs better than CCP with 10.71% within 2 mm (6 landmarks). It always finds candidates, even for landmarks 5 and 6, and is again easy to understand. Still, the algorithm has several drawbacks which are reflected in its results. The speed advantage gained from applying high strides comes at the cost of high translation variance. This leads to inconsistent behavior, causes high standard deviations and prevents the hyperparameters from generalizing. Another drawback is rotation sensitivity, as the mean shape used for comparison mainly represents unrotated instances. Candidates can end up on the wrong side of the symmetric pelvis shape if the initial ROI size is too small (local view problem). The exact number of candidates cannot be controlled due to ROI redundancy in deeper layers. To mitigate this risk and to find distinct candidates, spawns should be > 1 only in early layers for a wide spread. Finally, even though high strides are applied, sliding window approaches are still slow in prediction since no model is learned.
It is generally a good idea to apply binary classification with a CNN. The decision space is restricted to 2 classes, many training instances can be generated from a single image, and CNNs are well known for their good results on classification tasks. CNNC achieves the best theoretically possible accuracies with 75.8% within 2 mm (6 landmarks), even with a stride of 10 px. Compared to that, the final results drop sharply to 14.57%. The discrepancies shrink with higher bounds. This can be explained by the stride, which forces candidates to have a minimum distance of 1.5 mm (dependent on pixel spacing) to each other. However, the shape model does not account for such changes appropriately and will still assign high likelihoods to (relatively) near candidates. One could then argue for increasing the intensity likelihood factor. This, on the other hand, results in more side confusions due to pelvis symmetry (ischium, caput). The only ways to achieve better results are hence to decrease the stride to 1, which results in excessive evaluation times as every pixel must be checked, or to decrease the image size. One advantage over the previously discussed methods is the predictive power of CNNs. In addition, a certainty score is assigned to every pixel and hence need not be computed in retrospect when calculating payoffs. Finally, the number of candidates can be controlled.
Preceding methods rely on a hand-constructed complex shape model. It cannot
account well for high variance cases like high LLDs. In addition, it adds time of
polynomial complexity (see [35]) to select best candidate combinations. CNNR
tackles the speed problem of CNNC as only one prediction must be conducted.
Unfortunately, this effect diminishes again as we add a cascade and every land-
mark needs its own network. Including model loading and unoptimized code,
a GPU needs on average 1.14s for CCNNR 3. A CPU without model loading
requires 2.81s on average. A single image with CPU takes 3.52s. Timings for
CCNNR 2 and 1 on a single image are 3.11 and 2.27s.
MADB after the initial regression is low with 5.25% within 2 mm (6 landmarks). CNNR cannot compete with CNNC in any bound because the images are too big and the dataset size does not suffice to compensate for it. Interestingly though, it preserves shape better with its implicit shape model that comes at no time cost (compare MLLDE/MIAE), which is the principal argument for using cascades. After the first refinement, results improve considerably to 37.4% within 2 mm, as many more training instances with new information can be sampled through shifting augmentations. Furthermore, patch sizes become smaller. Smaller improvements can be seen further along the cascade.
Comparing the theoretically possible MADB values of CNNC with CCNNR 3, CNNC is better by 16.61 and 3.06% for bounds 2 and 3.5 mm but worse by 2.63, 5.15 and 5.83% for bounds 5, 7.5 and 10 mm. This suggests that CNNC was not able to approach the correct landmark region on some occasions, caused by single-task learning. Estimating only the first 4 landmarks tends to work slightly better than estimating all 6, irrespective of method (e.g. 8.23% difference for CCNNR 3, 2 mm), except for CNNR. This could mean that landmarks 5 and 6 are harder to learn. It also leads to the hypothesis that CNNR performance increases the more landmarks are available for multi-task learning. Intuitively, knowing the positions of landmarks 5 and 6 makes predicting the first 4 landmarks simpler.
MRE and MADB, particularly at the lower bounds, should be viewed with caution. These results are not always meaningful, especially for ischium and caput landmarks, as these often do not possess a clear definition (flat ischium, caput not circle-like). More meaningful are the LLD and ischium angle errors, which set the first 4 and the first 2 landmarks, respectively, in relation to each other. The best results are achieved with CCNNR 3, with a mean LLD error approaching the minimum mean of post-surgery LLDs, which varies from 1 to 15.9 mm in the literature (see [40]), and a mean ischium angle error below 1°. The log distribution of the application-specific metrics reveals that most test instances are even below the already low means.
Comparing MREs between experts and CCNNR 3, the total mean of the machine
is lower. On a landmark basis, experts achieved a slightly better mean value for
landmarks 3 and 5. However, as the same experts have labeled the evaluated test
data and the sample size of expert comparison is low, no meaningful conclusions
can be drawn.
One disadvantage of cascades is that landmarks are not learned together after the initial regression, which can lead to shape drifting. This is rather unproblematic as long as landmarks are clearly defined from a local viewpoint. This is not the case for the caput center, for which small ROIs offer no contrast (imagine a zoomed version with landmark 5 as center). Table 6.1 confirms this hypothesis for landmark 5. From cascade 1 to 2, only minimal improvements can be observed. From cascade 2 to 3, the MRE even increases slightly (and is higher than in CCNNR 1), in contrast to all other landmarks.
The results on skull data show that this approach is also suitable for small datasets of only 150 images. In this case, the cascade is preferably deeper, with a softer transition from complex to simple. The prediction jumps from the first to the second regression are larger but become smaller again afterwards (see the typical example in appendix C), as more meaningful shifting augmentations can be performed. In line with this argumentation, dataset size and image size are unproblematic as long as continuous improvement can be observed on successively smaller ROIs, which is a trade-off between architecture depth and prediction time. The refinements of the cascade can hence be interpreted as the system's ability to recover from its previous errors. In any case, a validation set with separate instances would further help to approximate the number of necessary ROIs, their sizes and the maximal shifts for training with the presented heuristic.
Except for the 2 mm bound on test set 2, CCNNR 4 yields better results than previous approaches (as reported in [15]). The 90% mark is reached at 3 mm on test set 1 and at 4 mm on test set 2. As the clinically acceptable bound for cephalometry is merely 2 mm, it is argued in [1] that automation is an unsolved problem. With optimization of the architecture, which is currently almost identical to the one used for hip data, and/or more data, performance is expected to increase further, and the approach can definitely be used to semi-automate if not even fully automate cephalometry.
7.2 Product Integration
The goal is to make hip surgery planning easier, faster and more accurate. Placing
landmarks is one of the few last manual steps in planning. The achieved results
and computational speed of the CCNNR approach enable discussions about in-
tegration into Sectra’s orthopedic package. In this section, some technicalities as
well as user experience aspects with (dis-)/advantages are discussed.
Results should be reported on, and models conceivably trained on, data from different hospitals in order to draw conclusions about the model's ability to generalize. Optimally, new images are automatically included for retraining existing models. As datasets grow, techniques that allow for adjustment instead of complete retraining should be researched. Conceptually, new images must currently be scaled to the input size of the first CNN. Another option is to change the architecture to use techniques like global average pooling [41] in order to be independent of scaling. Implementation-wise, the tensorflow/keras models and parameters need to be loaded/converted with tensorflow's C++ API, which can then be integrated into C#. At present, only a python executable has been successfully called from C#.
From a product point of view, several aspects should be addressed:
1. Automation grade
2. Automation control
3. Communication
4. Interface
Regarding the first point, the main question is whether landmarks should still be visible to users. To answer it, more data should be gathered about how surgeons place landmarks and to what extent implants are moved after their position has been calculated from those landmarks. If it turns out that diverse adjustments are made irrespective of the placed landmarks, then it is more sensible to hide them. Implants are then placed based on automatic labeling, and surgeons adjust their sizes and positions. This implies that adjustments can be made to any extent, even after severely faulty detections. If users normally accept the implant placement calculated from landmarks, it is more sensible to keep the option to manually alter landmark suggestions.
Control over automation raises two questions. Will users be able to turn off automation? Do they need to activate it, or is it activated automatically? If the feature is enabled manually with the user's consent, the user must be informed about its existence in a proper way. An advantage of this path is that nothing in the software changes for the user immediately. With automatic activation, users need to be educated about how to adjust landmarks/implants appropriately. An option for turning off automation might be provided to give customers more control over what happens within the software.
Third, planning can have strong effects on the patient's wellbeing. After a couple of correct landmark suggestions, surgeons' trust in the system's automatisms might increase. They subsequently become more susceptible to accepting wrong suggestions. Neural networks are black boxes. Even though a CNN's internal processes can be visualized, it is still difficult to explain them to non-experts. Any combination of jump distances (e.g. sum, product) within the cascade can give an indication of the model's certainty. Surgeons should still be informed that automatisms do not guarantee optimal planning. ALD should hence be communicated as solely giving a default from which to perform manual adjustments if necessary.
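The idea of deriving a certainty indication from jump distances can be sketched as follows. The prediction path (initial regression followed by refinements) and the function names are hypothetical, and how the resulting scores are thresholded or presented to surgeons is left open here.

```python
import math


def jump_distances(path):
    """Euclidean distances between consecutive predictions of one landmark."""
    return [math.dist(a, b) for a, b in zip(path, path[1:])]


def uncertainty_scores(path):
    """Two simple combinations of the jump distances within the cascade."""
    jumps = jump_distances(path)
    return {"sum": sum(jumps), "product": math.prod(jumps)}


# Hypothetical prediction path: initial regression followed by two refinements
path = [(100.0, 100.0), (103.0, 104.0), (103.0, 105.0)]
print(jump_distances(path))      # [5.0, 1.0]
print(uncertainty_scores(path))  # {'sum': 6.0, 'product': 5.0}
```

A plausible interpretation, to be validated on data, is that unusually large jumps in late refinements point to a less certain prediction.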
Lastly, manual labeling normally starts after entering hip planning. If landmark detection is activated by consent, a right click that opens the context menu to choose automatic labeling suffices and provides fast access. If manual landmark adjustments are possible, the user must be able to easily identify landmark ids and be informed via affordances about the possibility to change their locations. One option is to select landmarks and drag them to the corrected location (with additional finer adjustments via the keyboard arrow keys). Graphics representing landmarks must in this case be big enough to allow selection. Another possibility is to register clicks in the vicinity of points.
8 Conclusion & Future Work
The goal of this thesis was to detect landmarks automatically on hip x-rays to support orthopedists in total hip replacement surgery planning. To this end, 4 methods have been presented and discussed in terms of accuracy, computational efficiency and production usage. Among these methods, CCP and RROI as image processing techniques exhibit high prediction errors and slow performance.
A binary classification convolutional neural network shows high theoretical accuracies but requires tremendous parallelization to meet the production demand of a computation time within a few seconds. Unfortunately, the authors using CNNC [15] for skull landmark detection did not mention any times. Average times for previous methods reviewed in [1] range from 18 to 146 seconds for detecting 19 landmarks.
The investigated cascaded convolutional neural network regression approach, together with the presented heuristic for ROI size and training set composition, proves to be fast and accurate. On average, 2.81 s are needed on a CPU (unoptimized) to detect 6 landmarks. Even though this efficiency is promising, the required time increases with cascade depth and the number of landmarks. Tasks like landmark-based segmentation, which might require hundreds of landmarks, hence result in higher computation times, and task-specific as well as enhanced CNN techniques as e.g. in [13] might be considered. Subsequent research should therefore be conducted on accelerating cascade-based neural networks that simplify challenging problems continuously in a multi-task/non-task-separating setting.
The cascaded approach achieved a mean leg length discrepancy error of 2.2 ± 2.55 mm and is thus below any LLD necessary to affect patients negatively [42]. Given both the timing and the evaluation results for CCNNRs, one can conclude that the above-mentioned thesis goal has been fulfilled. However, only x-rays from people treated at the Linköping university hospital were used for training and evaluation. It would be interesting to investigate how well the model generalizes to x-rays from disparate sources and to expand the training set to possibly account for them. Some hospitals might use different x-ray machines/acquisition processes, while people from other regions might exhibit distinct pelvis variances. Furthermore, the inter-observer performance of surgeons should be assessed to allow stronger judgments about how the machine performs compared to them. Data acquisition might become easier in the future, and models might outperform humans in most cases. Surgeons will still predict rare cases much better when machines err on them. For that reason, more sophisticated data augmentation like elastic distortions or even generative adversarial networks might be utilized to expand data beyond what humans can naturally provide.
CCNNRs are applicable to all kinds of 2D automatic landmark detection and have been shown to work on big images and small dataset sizes, as opposed to [23]. On the skull dataset (for automatic cephalometry), they outperform the previously best approaches in all but one distance bound by up to 6.46% without fine-tuning (adaptation from hip data). Previous approaches [35, 43, 44, 45, 46] competing in
the IEEE ISBI 2015 challenge rely in some form on hand-crafted features (e.g.
Haar-like, Histogram of Oriented Gradient, Zernike). Even [15], in which visual
appearance features are learned by a CNN, relies on a hand-crafted shape model.
In the introductory text of chapter 3, landmark detection has been divided in the
3 tasks ROI extraction, landmark candidate selection based on visual features
and final estimation by modeling of geometric relationships. While RROI, CCP
and the described explicit shape model focus only on one of these steps, CCNNR
unites them without hand-crafting features.
Comparing dataset sizes, the skull training set is 4 times smaller than the one
for the hip and therefore utilized 4 refinements instead of 3. In general, it would
be interesting to learn more about needed dataset sizes for specific problems in
machine learning. For cascades, 4 research questions result:
1. Which function emerges when the training set size is assessed in relation to
the number of cascades needed to achieve a desired performance?
2. Given a large training set, how would the prediction quality differ when comparing few vs. arbitrarily many cascades?
3. Given a fixed number of cascades, how would results improve when adding
more data to learn on?
4. Finally and most interestingly, would it be possible to learn on a single instance¹ and achieve good evaluation results?
Cascades simplify landmark detection as they are applied to successively smaller regions. With deeper cascades, it is assumed that small datasets (e.g. skull data) can be tackled. Generalizing from these two hypotheses, it might be possible to solve landmark detection based on a single training image with the most extreme case of applying cascades whose ROI sizes shrink by 1 pixel per step. More broadly, techniques that continuously simplify arbitrary regression as well as classification problems in order to reduce the required training data would be highly desirable in the machine learning community (especially in the health care sector, in which data acquisition is still a major problem). The collection of those techniques could be termed deep simplification learning.
As next steps specifically for Sectra, the keras/tensorflow models need to be trans-
ferred to C# or C++ language for production integration. More test data from
disparate sources should be used to prove wider applicability of trained networks.
Training on more data leads to higher accuracy of the first regression, which requires fewer refinements and hence yields a computational speedup. X-rays have been
filtered to only include preop cases and could be expanded to postop. Better
hardware and architecture refinements (gradient checkpointing [48] and synthetic
gradients [49] for deeper CNNs, global average pooling [41], lime [50]/class ac-
tivation maps [51] for visualization) can further improve results with a lower
memory footprint, foster general applicability on different image sizes and pro-
vide better reasoning capabilities about a network’s output. After functionality
has been proven on hips, expansion to knee, shoulder and spine is appropriate.
Research on the applicability for 3D planning can furthermore increase surgeon
¹ This is called one-shot learning; see [47] for more.
A Challenging X-Rays
Landmark detection on the pelvis is challenging because anatomy changes be-
tween patients. In some cases, it is even hard for humans to estimate these posi-
tions. In this appendix, several cases demonstrate various difficulties encountered
in the dataset.
Figure A.1: Left: Rotations, hidden left trochanter minor, objects like rulers,
right: weak contrast for landmarks 1 and 2, caput center and acetabular roof
difficult to label
Figure A.2: Left: Protective gonad shield, right: Dark as well as very bright
Figure A.3: Left: Translation showing the upper pelvis which is underrepresented
in the dataset, right: Right trochanter minor and ischium rub against each other,
no clear space between them as in most other cases.
Figure A.4: Left: Top and bottom without visual information (black), flat ischium
(no clear landmark definition), right: Rare case in which landmark 4 is above
landmark 5 (no geometrical constraints and rules applicable)
Figure A.5: Left: Pelvis symmetry lost, again a translation showing different
anatomy compared to most other x-rays, right: Gender variances, also in this
case, one trochanter minor is invisible.
B CCNNR Prediction Paths
All 254 test images, including prediction paths as presented in chapter 3.5,
have been plotted to a PDF file. This appendix shows the most interesting
cases: the few bad predictions as well as good predictions on difficult cases
(good and bad are judged visually, not by metrics).
Figure B.1: Worst performing examples; left: Failed to spot hidden trochanter
minor and ischium due to low contrast, right: Almost all predictions are far off
because only few images in the dataset show the upper pelvis part.
Figure B.2: Again rather bad predictions, left: Flat ischium causes wrong
position, LLD not affected, right: Rubbing bones are rare cases. It is hence
difficult to account for them.
Figure B.3: Good examples, left: Gonad shield at the bottom does not interfere
with predictions, right: Method accounts for rotations.
Figure B.4: Difficult examples, left: Hidden trochanter minor and caput without
joint space, right: Rotated hidden trochanter detected.
Figure B.5: Left: Acetabular roof landmark prediction better than ground truth,
right: Another example of correct hidden trochanter estimation.
Figure B.6: Left: The hidden trochanter might be estimated more accurately
than the true coordinate considering the symmetric shape of the pelvis, right:
Remarkable performance on low contrast that is difficult to spot even for humans.
Figure B.7: Left: Comparably big refinement necessary due to uncommon object
in the x-ray. right: Handling dark and bright images is unproblematic.
C Cephalometric Landmark Detection Details
The two test sets comprise 150 and 100 images. Every image measures 2400x1935
pixels with a pixel spacing of 0.1 mm. Networks have been trained only on the
training data (also 150 images), and 4 refinement regressions were applied.
This is one regression more than on the hip data, because no validation data is
accessible and the training set is small.
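Because the pixel spacing is given, pixel distances can be converted directly
to millimetres. The following is a minimal sketch of that conversion for the
radial error and its mean (MRE). The function names and the sample coordinates
are hypothetical and only serve as illustration; this is not the evaluation
code used for the reported results.

```python
import math

PIXEL_SPACING_MM = 0.1  # pixel spacing stated for the skull x-rays


def radial_error_mm(pred_px, true_px, spacing_mm=PIXEL_SPACING_MM):
    """Euclidean distance between two (x, y) pixel coordinates, in millimetres."""
    dx = pred_px[0] - true_px[0]
    dy = pred_px[1] - true_px[1]
    return math.hypot(dx, dy) * spacing_mm


def mean_radial_error_mm(preds_px, trues_px, spacing_mm=PIXEL_SPACING_MM):
    """Average the radial error over all (prediction, ground truth) pairs."""
    errors = [radial_error_mm(p, t, spacing_mm) for p, t in zip(preds_px, trues_px)]
    return sum(errors) / len(errors)


# Hypothetical predictions and ground truth for three landmarks (pixels).
preds = [(1000, 1200), (530, 640), (1800, 900)]
trues = [(1003, 1204), (530, 640), (1790, 900)]
print(mean_radial_error_mm(preds, trues))  # MRE in mm
```

In this toy example the three errors are 5, 0 and 10 pixels, which correspond
to 0.5, 0 and 1 mm at a spacing of 0.1 mm.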
Figure C.1: Skull prediction paths; left: uncommon example with no distant
jumps, right: common example with big jumps, 3 predictions even originate
outside the image boundary
The first regression is, as before, run on whole images. In general, at most 50
epochs were used, selected by the best mean absolute error on the training
data. Augmentations are used as before, resulting in 750 instances. For the
first region, sROI is set to 1001 px (dh = 500 px) with shiftmax = 300 px. No
real heuristic is applied because no validation data is available. For the
subsequent ROIs, the same hyperparameters as for the hip are chosen. As
validation data, additional random samples are drawn from the actual training
set. Training and validation set sizes for all refinements are (3000, 1500).
Only for the first refinement, (1050, 300) has been chosen to save time and
disk space.
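The ROI parameters above can be illustrated with a small sampling sketch. It
assumes that the ROI center is shifted uniformly by up to shiftmax pixels per
axis, that the regression target is the landmark offset from the ROI center,
and that pixels outside the image are zero-padded. None of these details is
confirmed by the text above, and the function name is hypothetical.

```python
import random


def sample_shifted_roi(image, landmark, dh, shift_max, rng=random):
    """Crop a square ROI of side 2*dh+1 whose center is randomly shifted.

    image: 2D list of rows (image[y][x]); landmark: (x, y) in pixels.
    Returns the crop and the landmark offset relative to the ROI center,
    which serves as the regression target in this sketch.
    """
    lx, ly = landmark
    cx = lx + rng.randint(-shift_max, shift_max)  # uniform shift (assumption)
    cy = ly + rng.randint(-shift_max, shift_max)
    height, width = len(image), len(image[0])
    crop = []
    for y in range(cy - dh, cy + dh + 1):
        row = []
        for x in range(cx - dh, cx + dh + 1):
            inside = 0 <= y < height and 0 <= x < width
            row.append(image[y][x] if inside else 0)  # zero-padding (assumption)
        crop.append(row)
    return crop, (lx - cx, ly - cy)


# Toy 10x10 image; the text uses dh = 500 px and shiftmax = 300 px instead.
toy_image = [[x + 10 * y for x in range(10)] for y in range(10)]
roi, target = sample_shifted_roi(toy_image, (5, 5), dh=2, shift_max=2,
                                 rng=random.Random(0))
print(len(roi), len(roi[0]), target)
```

With dh = 500 px and shiftmax = 300 px, the shift is always smaller than half
the ROI side, so the true landmark stays inside every sampled ROI.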
CNN settings are almost identical to the hip setting; no fine-tuning to this
particular dataset has been done. Since the initial skull images are larger,
strides are set to (4,2,2,1). The first refinement CNN uses the same settings
as the first hip refinement CNN but has strides (4,2,1,1) and a batch size of 1.
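To see what these stride settings imply for the spatial resolution, the
following sketch computes the feature map size after the strided layers. It
assumes that each strided layer maps a side length n to ceil(n / s), as with
"same"-style padding. The padding scheme, the layers to which the strides
belong, and the input resolution of the refinement CNN are not specified here,
so the numbers are illustrative only.

```python
import math


def feature_map_size(size, strides):
    """Spatial size after a stack of strided layers, assuming ceil(n / s) each."""
    a, b = size
    for s in strides:
        a, b = math.ceil(a / s), math.ceil(b / s)
    return a, b


# Whole skull image with the first-regression strides.
print(feature_map_size((2400, 1935), (4, 2, 2, 1)))
# A 1001 px ROI with the first-refinement strides (assumes full-resolution input).
print(feature_map_size((1001, 1001), (4, 2, 1, 1)))
```

The product of the strides gives the total downsampling factor: 16 for
(4,2,2,1) and 8 for (4,2,1,1).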
Two example predictions are shown in figure C.1. Big jumps from the first
regression to the first refinement can often be observed. Adding more
(different, not only augmented) training instances will likely reduce these
jumps, as with the hip data.
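The prediction paths discussed above result from running the cascade one
regression after another. A minimal sketch of that loop follows, assuming each
refinement regressor predicts the landmark offset from the center of an ROI
placed at the current estimate. All callables and the toy values are
hypothetical placeholders, not the thesis implementation.

```python
def run_cascade(image, first_regressor, refinements, crop_roi):
    """Return the prediction path [first estimate, refinement 1, ...].

    first_regressor(image) -> (x, y) estimated on the whole image.
    refinements: list of (regressor, dh) pairs; regressor(roi) -> (dx, dy),
    the predicted landmark offset from the ROI center (assumption).
    crop_roi(image, center, dh) -> ROI centered at the current estimate.
    """
    x, y = first_regressor(image)
    path = [(x, y)]
    for regressor, dh in refinements:
        roi = crop_roi(image, (round(x), round(y)), dh)
        dx, dy = regressor(roi)
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


# Toy stand-ins: each "refinement" halves the remaining error to TRUE_POS.
TRUE_POS = (100.0, 80.0)


def toy_crop(image, center, dh):
    return center  # a real crop would return pixel data


def toy_refine(center):
    return ((TRUE_POS[0] - center[0]) / 2, (TRUE_POS[1] - center[1]) / 2)


path = run_cascade(None, lambda img: (130.0, 60.0),
                   [(toy_refine, 500), (toy_refine, 250)], toy_crop)
print(path)
```

Each entry of the returned path corresponds to one point of a prediction path
as drawn in figure C.1, with the first entry being the whole-image estimate.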
List of Abbreviations
CNN Convolutional Neural Network
CCP Contour Change Point
ROI Region of Interest
RROI Recursive Region of Interest
CNNC Convolutional Neural Network Classification
CNNR Convolutional Neural Network Regression
CCNNR Cascaded Convolutional Neural Network Regression
CC Change Code
CP Change Point
FDB Feature Database
conv convolutional
FC fully-connected
ReLU Rectified Linear Unit
THR Total Hip Replacement
LLD Leg Length Discrepancy
MRE Mean Radial Error
MSE Mean Squared Error
DO Dropout
ED Euclidean Distance
API Application Programming Interface
IEEE Institute of Electrical and Electronics Engineers
ISBI International Symposium on Biomedical Imaging
MADB Mean Accuracy within Distance Bound
MLLDE Mean Leg Length Discrepancy Error
MIAE Mean Ischium Angle Error
ALD Automatic Landmark Detection
LM Landmark
CT Computerized Tomography
MR Magnetic Resonance
PACS Picture Archiving and Communication System
DICOM Digital Imaging and Communications in Medicine
TIFF Tagged Image File Format
ML Machine Learning
RF Random Forest
GT Ground Truth
kNN k-Nearest Neighbor
SSD Sum of Squared Differences
CE Cross Entropy
KDE Kernel Density Estimation
MTL Multi-Task Learning
UX User Experience
CPU Central Processing Unit
GPU Graphics Processing Unit
List of Figures
1.1 Two different pelvis x-rays; left: Landmarks and corresponding anatomy, right: Surgical outcome showing prostheses
3.1 Contour change point approach overview
3.2 Preprocessing steps exemplified on left ischium, top left: Gaussian smoothing blurs histogram equalized image and hence reduces noise, top right: canny edge detection, bottom left: morphological closing, bottom right: skeletonization
3.3 Landmark prediction with recursive ROIs; left: successful detection of landmark 1, middle: deep layers detect similar coordinates for landmark 3, right: shallow layers detect different locations for landmark 3
3.4 Mean image intensities over training set for landmarks 1 to 6
3.5 Mean image intensities over training set around landmark 1 with reducing sa from top-left to bottom-right
3.6 Illustrative sampling procedure for landmark 1, left: negative sample center points in red, positives in green, middle: zoomed to true landmark location, non-sampling area represented by blue circle, right: sample boundaries drawn
3.7 Typical CNN classification output with 50 candidates for each landmark 1 to 6 with colors red, purple, blue, green, light blue, white (best viewed digital and/or colored); left: stride = 10px, right: stride = 20px results in higher spread
3.8 Zoomed and ranked version of figure 3.7 right, candidate 5 has erroneously a high probability of representing GT 3
3.9 First regression prediction (red) vs. truth (gold) with ROI bounding boxes centered at output coordinates, left: typical scenario on a validation image, right: uncommon output on a validation image
3.10 Distances from prediction to GT for all regressions in the cascade
3.11 Constructing a new training set for landmark 3 based on first regression results, left: ROI measures based on validation data, right: randomly shifted sample for next regression dataset
3.12 Regression paths example with more accurate outcomes along the cascade, red=1st estimate, purple=2nd, blue=3rd, green=final, gold=true position
3.13 Regression paths for landmark 3, left: ROI areas, right: Zoomed into third ROI, enumerating regression outputs
4.1 Label clicker: every click draws a new point
5.1 CCP learning procedure
5.2 CCP algorithm: Change point calculation
5.3 CCP algorithm: Feature extraction
5.4 CCP algorithm: Prediction as a lookup
5.5 RROI algorithm: Landmark candidate prediction
5.6 RROI algorithm: Recursive function for ROI coordinate calculation
5.7 Generic CNN architecture with intuition of low to high-level feature learning
6.1 MRE for all landmarks compared between experts on 143 images.
6.2 Used distance bounds (2, 3.5, 5, 7.5, 10) in mm on example
6.3 MADB on test set including all landmarks, reported on all methods except CCP
6.4 MADB on test set including the first 4 landmarks, reported on all methods
6.5 MLLDE and MIAE with standard deviations for image processing methods and CNNC
6.6 MLLDE and MIAE with standard deviations for CNN regressions
6.7 Histogram showing distribution of LLD error over the hip test set with CCNNR 3
6.8 Histogram showing distribution of ischium angle error over the hip test set with CCNNR 3
6.9 Comparison of CCNNR 4 result to other papers on IEEE ISBI 2015 Challenge test set 1
6.10 Comparison of CCNNR 4 result to other papers on IEEE ISBI 2015 Challenge test set 2
A.1 Left: Rotations, hidden left trochanter minor, objects like rulers, right: weak contrast for landmarks 1 and 2, caput center and acetabular roof difficult to label
A.2 Left: Protective gonad shield, right: Dark as well as very bright images
A.3 Left: Translation showing the upper pelvis which is underrepresented in the dataset, right: Right trochanter minor and ischium rub against each other, no clear space between them as in most other cases.
A.4 Left: Top and bottom without visual information (black), flat ischium (no clear landmark definition), right: Rare case in which landmark 4 is above landmark 5 (no geometrical constraints and rules applicable)
A.5 Left: Pelvis symmetry lost, again a translation showing different anatomy compared to most other x-rays, right: Gender variances, also in this case, one trochanter minor is invisible.
B.1 Worst performing examples; left: Failed to spot hidden trochanter minor and ischium due to low contrast, right: Almost all predictions are far off because only few images in the dataset show the upper pelvis part.
B.2 Again rather bad predictions, left: Flat ischium causes wrong position, LLD not affected, right: Rubbing bones are rare cases. It is hence difficult to account for them.
B.3 Good examples, left: Gonad shield at the bottom does not interfere with predictions, right: Method accounts for rotations.
B.4 Difficult examples, left: Hidden trochanter minor and caput without joint space, right: Rotated hidden trochanter detected.
B.5 Left: Acetabular roof landmark prediction better than ground truth, right: Another example of correct hidden trochanter estimation.
B.6 Left: The hidden trochanter might be estimated more accurately than the true coordinate considering the symmetric shape of the pelvis, right: Remarkable performance on low contrast that is difficult to spot even for humans.
B.7 Left: Comparably big refinement necessary due to uncommon object in the x-ray, right: Handling dark and bright images is unproblematic.
C.1 Skull prediction paths; left: uncommon example with no distant jumps, right: common example with big jumps, 3 predictions even originate outside the image boundary
Bibliography
[1] Wang, Ching-Wei, Cheng-Ta Huang, Meng-Che Hsieh, Chung-Hsing Li, Sheng-Wei Chang, Wei-Cheng Li, Rémy Vandaele, Raphaël Marée, Sébastien Jodogne, Pierre Geurts et al.: Evaluation and comparison of anatomical landmark detection methods for cephalometric X-ray images: a grand challenge. IEEE transactions on medical imaging, 34(9):1890–1900, 2015.
[2] OECD: Health at a Glance 2017.
[3] Jared R. H. Foran, MD: Total Hip Replacement. Web, 08 2015. Accessed:
[4] Eggli, S, M Pisan and ME Müller: The value of preoperative planning for total hip arthroplasty. J Bone Joint Surg Br, 80(3):382–390, 1998.
[5] Chu, Chengwen, Cheng Chen, Li Liu and Guoyan Zheng: FACTS: fully automatic CT segmentation of a hip joint. Annals of biomedical engineering, 43(5):1247–1259, 2015.
[6] Lindner, Claudia, S Thiagarajah, J Mark Wilkinson, Gillian A Wallis, Timothy F Cootes, arcOGEN Consortium et al.: Accurate fully automatic femur segmentation in pelvic radiographs using regression voting. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 353–360. Springer, 2012.
[7] Ehrhardt, Jan, Heinz Handels, Bernd Strathmann, Thomas Malina, Werner Plötz and Siegfried J Pöppl: Atlas-based recognition of anatomical structures and landmarks to support the virtual three-dimensional planning of hip operations. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 17–24. Springer,
[8] Ehrhardt, J, H Handels, W Plötz, SJ Pöppl et al.: Atlas-based recognition of anatomical structures and landmarks and the automatic computation of orthopedic parameters. Methods Archive, 43(4):391–397, 2004.
[9] Wei, Guo-Qing, Jianzhong Qian and Helmuth Schramm: System and method for the detection of anatomic landmarks for total hip replacement, January 6 2004. US Patent 6,674,883.
[10] Betke, Margrit, Harrison Hong, Deborah Thomas, Chekema Prince and Jane P Ko: Landmark detection in the chest and registration of lung surfaces with an application to nodule registration. Medical Image Analysis, 7(3):265–281, 2003.
[11] Han, Dong, Yaozong Gao, Guorong Wu, Pew-Thian Yap and Dinggang Shen: Robust anatomical landmark detection with application to MR brain image registration. Computerized Medical Imaging and Graphics, 46:277–290, 2015.
[12] Zheng, Yefeng, Matthias John, Rui Liao, Jan Boese, Uwe Kirschstein, Bogdan Georgescu, S Kevin Zhou, Jörg Kempfert, Thomas Walther, Gernot Brockmann et al.: Automatic aorta segmentation and valve landmark detection in C-arm CT: application to aortic valve implantation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 476–483. Springer, 2010.
[13] Wu, Hongbo, Chris Bailey, Parham Rasoulinejad and Shuo Li: Automatic Landmark Estimation for Adolescent Idiopathic Scoliosis Assessment Using BoostNet. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 127–135. Springer, 2017.
[14] Wang, Ching-Wei, Cheng-Ta Huang, Jia-Hong Lee, Chung-Hsing Li, Sheng-Wei Chang, Ming-Jhih Siao, Tat-Ming Lai, Bulat Ibragimov, Tomaž Vrtovec, Olaf Ronneberger et al.: A benchmark for comparison of dental radiography analysis algorithms. Medical image analysis, 31:63–76, 2016.
[15] Arık, Sercan Ö, Bulat Ibragimov and Lei Xing: Fully automated quantitative cephalometry using convolutional neural networks. Journal of Medical Imaging, 4(1):014501–014501, 2017.
[16] Ibragimov, Bulat, Boštjan Likar, Franjo Pernuš et al.: A game-theoretic framework for landmark-based image segmentation. IEEE Transactions on Medical Imaging, 31(9):1761–1776, 2012.
[17] Litjens, Geert, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen AWM van der Laak, Bram van Ginneken and Clara I Sánchez: A survey on deep learning in medical image analysis. arXiv preprint arXiv:1702.05747, 2017.
[18] Zhu, Xiangxin and Deva Ramanan: Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2879–2886. IEEE, 2012.
[19] Pantic, Maja and Leon J. M. Rothkrantz: Automatic analysis of facial expressions: The state of the art. IEEE Transactions on pattern analysis and machine intelligence, 22(12):1424–1445, 2000.
[20] Wang, Patricia, Xiaofeng Tong, Yangzhou Du, Jianguo Li, Wei Hu and Yimin Zhang: Augmented makeover based on 3D morphable model. In Proceedings of the 19th ACM international conference on Multimedia, pages 1569–1572. ACM, 2011.
[21] Azevedo, Pedro, Thiago Oliveira Dos Santos and Edilson De Aguiar: An Augmented Reality Virtual Glasses Try-On System. In Virtual and Augmented Reality (SVR), 2016 XVIII Symposium on, pages 1–9. IEEE, 2016.
[22] Ramanathan, Narayanan and Rama Chellappa: Modeling age progression in young faces. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 1, pages 387–394. IEEE, 2006.
[23] Sun, Yi, Xiaogang Wang and Xiaoou Tang: Deep convolutional network cascade for facial point detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3476–3483, 2013.
[24] Zhang, Zhanpeng, Ping Luo, Chen Change Loy and Xiaoou Tang: Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision, pages 94–108. Springer, 2014.
[25] Zhang, Cha and Zhengyou Zhang: Improving multiview face detection with multi-task deep convolutional neural networks. In Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, pages 1036–1041. IEEE, 2014.
[26] Ranjan, Rajeev, Vishal M Patel and Rama Chellappa: Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. arXiv preprint arXiv:1603.01249, 2016.
[27] Yu, Xiang, Feng Zhou and Manmohan Chandraker: Deep deformation network for object landmark localization. In European Conference on Computer Vision, pages 52–70. Springer, 2016.
[28] Chen, Yu, Chunhua Shen, Xiu-Shen Wei, Lingqiao Liu and Jian Yang: Adversarial Learning of Structure-Aware Fully Convolutional Networks for Landmark Localization. arXiv preprint arXiv:1711.00253, 2017.
[29] Canny, John: A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence, (6):679–698, 1986.
[30] Gonzalez, Rafael C. and Paul Wintz: Digital Image Processing (2nd Ed.). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA,
[31] Trucco, Emanuele and Alessandro Verri: Introductory techniques for 3-D computer vision, volume 201. Prentice Hall Englewood Cliffs, 1998.
[32] Esteva, Andre, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau and Sebastian Thrun: Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118, 2017.
[33] Nielsen, Michael A: Neural networks and deep learning, 2015. http:
[34] Goodfellow, Ian, Yoshua Bengio and Aaron Courville: Deep Learning. MIT Press, 2016.
[35] Ibragimov, Bulat, Boštjan Likar, Franjo Pernuš and Tomaž Vrtovec: Automatic cephalometric X-ray landmark detection by applying game theory and random forests. In Proceedings ISBI, pages 1–8, 2014.
[36] Caruana, Rich: Multitask learning. In Learning to learn, pages 95–133. Springer, 1998.
[37] Zhang, TY and Ching Y. Suen: A fast parallel algorithm for thinning digital patterns. Communications of the ACM, 27(3):236–239, 1984.
[38] Suzuki, Satoshi et al.: Topological structural analysis of digitized binary images by border following. Computer vision, graphics, and image processing, 30(1):32–46, 1985.
[39] Kingma, Diederik P. and Jimmy Ba: Adam: A Method for Stochastic Optimization. CoRR, abs/1412.6980, 2014.
[40] Konyves, A and GC Bannister: The importance of leg length discrepancy after total hip arthroplasty. Bone & Joint Journal, 87(2):155–157, 2005.
[41] Lin, Min, Qiang Chen and Shuicheng Yan: Network in network. arXiv preprint arXiv:1312.4400, 2013.
[42] Gurney, Burke: Leg length discrepancy. Gait & posture, 15(2):195–206,
[43] Chu, Chengwen, Cheng Chen, LP Nolte and G Zheng: Fully automatic cephalometric x-ray landmark detection using random forest regression and sparse shape composition. submitted to Automatic Cephalometric X-ray Landmark Detection Challenge, 2014.
[44] Chen, C and G Zheng: Fully automatic landmark detection in cephalometric x-ray images by data-driven image displacement estimation. In Proc. ISBI Int. Symp. Biomed. Imag. 2014, Automat. Cephalometric X-Ray Landmark Detection Challenge, pages 17–24, 2014.
[45] Mirzaalian, Hengameh and Ghassan Hamarneh: Automatic globally-optimal pictorial structures with random decision forest based likelihoods for cephalometric x-ray landmark detection. In Proc. ISBI Int. Symp. Biomed. Imag. 2014, Automat. Cephalometric X-Ray Landmark Detection Challenge, pages 25–36. Citeseer, 2014.
[46] Vandaele, Rémy, Raphaël Marée, Sébastien Jodogne and Pierre Geurts: Automatic cephalometric x-ray landmark detection challenge 2014: A tree-based algorithm. ISBI, 2014.
[47] Fei-Fei, Li, Rob Fergus and Pietro Perona: One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006.
[48] Chen, Tianqi, Bing Xu, Chiyuan Zhang and Carlos Guestrin: Training Deep Nets with Sublinear Memory Cost. CoRR, abs/1604.06174, 2016.
[49] Czarnecki, Wojciech Marian, Grzegorz Swirszcz, Max Jaderberg, Simon Osindero, Oriol Vinyals and Koray Kavukcuoglu: Understanding Synthetic Gradients and Decoupled Neural Interfaces. CoRR, abs/1703.00522, 2017.
[50] Ribeiro, Marco Tulio, Sameer Singh and Carlos Guestrin: "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135–1144, 2016.
[51] Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva and Antonio Torralba: Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2921–2929. IEEE, 2016.
ResearchGate has not been able to resolve any citations for this publication.
Full-text available
Quantitative cephalometry plays an essential role in clinical diagnosis, treatment, and surgery. Development of fully automated techniques for these procedures is important to enable consistently accurate computerized analyses. We study the application of deep convolutional neural networks (CNNs) for fully automated quantitative cephalometry for the first time. The proposed framework utilizes CNNs for detection of landmarks that describe the anatomy of the depicted patient and yield quantitative estimation of pathologies in the jaws and skull base regions. We use a publicly available cephalometric x-ray image dataset to train CNNs for recognition of landmark appearance patterns. CNNs are trained to output probabilistic estimations of different landmark locations, which are combined using a shape-based model. We evaluate the overall framework on the test set and compare with other proposed techniques. We use the estimated landmark locations to assess anatomically relevant measurements and classify them into different anatomical types. Overall, our results demonstrate high anatomical landmark detection accuracy (∼1% to 2% higher success detection rate for a 2-mm range compared with the top benchmarks in the literature) and high anatomical type classification accuracy (∼76% average classification accuracy for test set). We demonstrate that CNNs, which merely input raw image patches, are promising for accurate quantitative cephalometry. © 2017 Society of Photo-Optical Instrumentation Engineers (SPIE).
Conference Paper
Full-text available
This paper presents a virtual try-on system to correctly visualize 3D objects (e.g., glasses) in the face of a given user. By capturing the image and depth information of a user through a low-cost RGB-D camera, we apply a face tracking technique to detect specific landmarks in the facial image. These landmarks and the point cloud reconstructed from the depth information are combined to optimize a 3D facial morphable model that fits as good as possible to the user's head and face. At the end, we deform the chosen 3D objects from its rest shape to a deformed shape matching the specific facial shape of the user. The last step projects and renders the 3D object into the original image, with enhanced precision and in proper scale, showing the selected object in the user's face. We validate the performance of our system on eight different subjects (four male and four female) and show results numerically and visually. Our results demonstrate that, by fitting a facial model to the user's face, the rendered virtual 3D objects look more realistic.
Full-text available
Dental radiography plays an important role in clinical diagnosis, treatment and surgery. In recent years, efforts have been made on developing computerized dental X-ray image analysis systems for clinical usages. A novel framework for objective evaluation of automatic dental radiography analysis algorithms has been established under the auspices of the IEEE International Symposium on Biomedical Imaging 2015 Bitewing Radiography Caries Detection Challenge and Cephalometric X-ray Image Analysis Challenge. In this article, we present the datasets, methods and results of the challenge and lay down the principles for future uses of this benchmark. The main contributions of the challenge include the creation of the dental anatomy data repository of bitewing radiographs, the creation of the anatomical abnormality classification data repository of cephalometric radiographs, and the definition of objective quantitative evaluation for comparison and ranking of the algorithms. With this benchmark, seven automatic methods for analysing cephalometric X-ray image and two automatic methods for detecting bitewing radiography caries have been compared, and detailed quantitative evaluation results are presented in this paper. Based on the quantitative evaluation results, we believe automatic dental radiography analysis is still a challenging and unsolved problem. The datasets and the evaluation software will be made available to the research community, further encouraging future developments in this field. (
To analyse the value and accuracy of preoperative planning for total hip replacement (THR) we digitised electronically and compared the hand-sketched preoperative plans with the pre- and postoperative radiographs of 100 consecutive primary THRs. The correct type of prosthesis was planned in 98%; the agreement between planned and actually used components was 92% on the femoral side and 90% on the acetabular side. The mean (± SD) absolute difference between the planned and actual position of the centre of rotation of the hip was 2.5 ± 1.1 mm vertically and 4.4 ± 2.1 mm horizontally. On average, the inclination of the acetabular component differed by 7 ± 2° and anteversion by 9 ± 3° from the preoperative plans. The mean postoperative leg-length difference was 0.3 ± 0.1 cm clinically and 0.2 ± 0.1 cm radiologically. More than 80% of intraoperative difficulties were anticipated. Preoperative planning is of significant value for the successful performance of THR.
Conference Paper
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks. Despite the apparent simplicity of global average pooling, we are able to achieve 37.1% top-5 error for object localization on ILSVRC 2014, which is remarkably close to the 34.2% top-5 error achieved by a fully supervised CNN approach. We demonstrate that our network is able to localize the discriminative image regions on a variety of tasks despite not being trained for them
Conference Paper
Adolescent Idiopathic Scoliosis (AIS) exhibits as an abnormal curvature of the spine in teens. Conventional radiographic assessment of scoliosis is unreliable due to the need for manual intervention from clinicians as well as high variability in images. Current methods for automatic scoliosis assessment are not robust due to reliance on segmentation or feature engineering. We propose a novel framework for automated landmark estimation for AIS assessment by leveraging the strength of our newly designed BoostNet, which creatively integrates the robust feature extraction capabilities of Convolutional Neural Networks (ConvNet) with statistical methodologies in order to adapt to the variability in X-ray images. In contrast to traditional ConvNets, our BoostNet introduces two novel concepts: 1) a BoostLayer for robust discriminatory feature embedding by removing outlier features, which essentially minimizes the intra-class variance of the feature space and 2) a spinal structured multi-output regression layer for compact modelling of landmark coordinate correlation. The BoostNet architecture estimates required spinal landmarks within a mean squared error (MSE) rate of 0.0046 in 431 subjects, demonstrating its potential for robust automated scoliosis assessment in the clinical setting.
Skin cancer, the most common human malignancy, is primarily diagnosed visually, beginning with an initial clinical screening and followed potentially by dermoscopic analysis, a biopsy and histopathological examination. Automated classification of skin lesions using images is a challenging task owing to the fine-grained variability in the appearance of skin lesions. Deep convolutional neural networks (CNNs) show potential for general and highly variable tasks across many fine-grained object categories. Here we demonstrate classification of skin lesions using a single CNN, trained end-to-end from images directly, using only pixels and disease labels as inputs. We train a CNN using a dataset of 129,450 clinical images-two orders of magnitude larger than previous datasets-consisting of 2,032 different diseases. We test its performance against 21 board-certified dermatologists on biopsy-proven clinical images with two critical binary classification use cases: keratinocyte carcinomas versus benign seborrheic keratoses; and malignant melanomas versus benign nevi. The first case represents the identification of the most common cancers, the second represents the identification of the deadliest skin cancer. The CNN achieves performance on par with all tested experts across both tasks, demonstrating an artificial intelligence capable of classifying skin cancer with a level of competence comparable to dermatologists. Outfitted with deep neural networks, mobile devices can potentially extend the reach of dermatologists outside of the clinic. It is projected that 6.3 billion smartphone subscriptions will exist by the year 2021 (ref. 13) and can therefore potentially provide low-cost universal access to vital diagnostic care.
Conference Paper
We propose a novel cascaded framework, namely deep deformation network (DDN), for localizing landmarks in non-rigid objects. The hallmarks of DDN are its incorporation of geometric constraints within a convolutional neural network (CNN) framework, ease and efficiency of training, as well as generality of application. A novel shape basis network (SBN) forms the first stage of the cascade, whereby landmarks are initialized by combining the benefits of CNN features and a learned shape basis to reduce the complexity of the highly nonlinear pose manifold. In the second stage, a point transformer network (PTN) estimates local deformation parameterized as thin-plate spline transformation for a finer refinement. Our framework does not incorporate either handcrafted features or part connectivity, which enables an end-to-end shape prediction pipeline during both training and testing. In contrast to prior cascaded networks for landmark localization that learn a mapping from feature space to landmark locations, we demonstrate that the regularization induced through geometric priors in the DDN makes it easier to train, yet produces superior results. The efficacy and generality of the architecture is demonstrated through state-of-the-art performances on several benchmarks for multiple tasks such as facial landmark localization, human body pose estimation and bird part localization.
Conference Paper
Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust in a model. Trust is fundamental if one plans to take action based on a prediction, or when choosing whether or not to deploy a new model. Such understanding further provides insights into the model, which can be used to turn an untrustworthy model or prediction into a trustworthy one. In this work, we propose LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. We further propose a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem. We demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). The usefulness of explanations is shown via novel experiments, both simulated and with human subjects. Our explanations empower users in various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and detecting why a classifier should not be trusted.
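The idea of "learning an interpretable model locally around the prediction" can be illustrated with a minimal sketch. It is not the LIME implementation: it assumes numeric features, Gaussian perturbations, an exponential proximity kernel, and plain weighted least squares without the sparsity step. The names `explain_locally`, `solve_linear_system`, `sigma`, and `kernel_width` are hypothetical and chosen for illustration only.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is non-singular (no zero-pivot handling).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def explain_locally(black_box, instance, num_samples=500,
                    sigma=0.5, kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear surrogate around `instance`.

    black_box: callable mapping a list of floats to a float (assumed).
    Returns (intercept, coefficients) of the local linear model.
    """
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(num_samples):
        # Perturb the instance and query the black box at the new point.
        sample = [v + rng.gauss(0.0, sigma) for v in instance]
        dist2 = sum((s - v) ** 2 for s, v in zip(sample, instance))
        weights.append(math.exp(-dist2 / kernel_width ** 2))
        rows.append([1.0] + sample)  # leading 1.0 models the intercept
        targets.append(black_box(sample))

    # Weighted normal equations: (X^T W X) beta = X^T W y
    p = len(instance) + 1
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows))
             for j in range(p)] for i in range(p)]
    xtwy = [sum(w * r[i] * t for w, r, t in zip(weights, rows, targets))
            for i in range(p)]
    beta = solve_linear_system(xtwx, xtwy)
    return beta[0], beta[1:]
```

The returned coefficients indicate how strongly each feature influences the black-box output near the chosen instance; samples far from the instance contribute little because of the kernel weights.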
Comparison of human brain MR images is often challenged by large inter-subject structural variability. To determine correspondences between MR brain images, most existing methods typically perform a local neighborhood search, based on certain morphological features. They are limited in two aspects: (1) pre-defined morphological features often have limited power in characterizing brain structures, thus leading to inaccurate correspondence detection, and (2) correspondence matching is often restricted within local small neighborhoods and fails to cater to images with large anatomical differences. To address these limitations, we propose a novel method to detect distinctive landmarks for effective correspondence matching. Specifically, we first annotate a group of landmarks in a large set of training MR brain images. Then, we use regression forests to simultaneously learn (1) the optimal sets of features to best characterize each landmark and (2) the non-linear mappings from the local patch appearances of image points to their 3D displacements towards each landmark. The learned regression forests are used as landmark detectors to predict the locations of these landmarks in new images. Because each detector is learned based on features that best distinguish the landmark from other points and also landmark detection is performed in the entire image domain, our method can address the limitations in conventional methods. The deformation field estimated based on the alignment of these detected landmarks can then be used as initialization for image registration. Experimental results show that our method is capable of providing good initialization even for images with large deformation differences, thus improving registration accuracy.
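The displacement-based detection described above can be sketched as a voting step. The abstract does not state how the per-point displacement predictions are combined, so the majority vote over rounded voxel positions below is an assumption, and the function name `vote_for_landmark` is hypothetical. The displacements would come from the trained regression forest; here they are passed in as plain tuples.

```python
from collections import Counter


def vote_for_landmark(points, predicted_displacements):
    """Combine per-point displacement predictions into one landmark position.

    points: iterable of (x, y, z) sample positions in the image.
    predicted_displacements: iterable of (dx, dy, dz), one per point,
        assumed to be the regressor's estimate of the offset from that
        point to the landmark.
    Returns the voxel position that receives the most votes.
    """
    votes = Counter()
    for (x, y, z), (dx, dy, dz) in zip(points, predicted_displacements):
        # Each sample point votes for the position it believes the landmark is at.
        target = (round(x + dx), round(y + dy), round(z + dz))
        votes[target] += 1
    return votes.most_common(1)[0][0]
```

Because every sampled point in the image casts a vote, a few inaccurate predictions are outvoted by the consistent ones, which matches the abstract's point that detection is performed over the entire image domain rather than a small neighborhood.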