Published in IET Computer Vision
Received on 12th July 2010
Revised on 17th February 2011
doi: 10.1049/iet-cvi.2010.0098
ISSN 1751-9632
Contour-based iterative pose estimation of
3D rigid object
D.W. Leng, W.D. Sun
Department of Electronic Engineering, Tsinghua University, Beijing 100084, People’s Republic of China
E-mail: jimdavid126@gmail.com
Abstract: Estimating pose parameters of a 3D rigid object based on a 2D monocular image is a fundamental problem in computer
vision. State-of-the-art methods usually assume that certain feature correspondences are available a priori between the input image
and object’s 3D model. This presumption makes the problem more algebraically tractable. However, when there is no feature
correspondence available a priori, how to estimate the pose of a truly 3D object using just one 2D monocular image is still
not well solved. In this article, a new contour-based method which solves both the pose estimation problem and the feature
correspondence problem simultaneously and iteratively is proposed. The outer contour of the object is firstly extracted from
the input 2D grey-level image; then a tentative point correspondence relationship is established between the extracted contour and the object's 3D model, based on which the object's pose parameters are estimated; the newly estimated pose parameters are then used to revise the tentative point correspondence relationship, and the process is iterated until convergence. Experimental results are promising, showing that the authors' method has a fast convergence speed and a good convergence radius.
1 Introduction
Pose estimation of a 3D rigid object based on monocular
observation is a fundamental problem in computer vision,
computer graphics, photogrammetry, robotics etc.
Conventional methods often assume that certain feature
correspondence relationship is available a priori between
the input 2D grey-level image and the 3D model of the
object, for example, point correspondence [1–5], which is most commonly utilised; line correspondence [6–8]; plane correspondence [9, 10] and other feature correspondences [11–13]. These presumed feature correspondence
relationships provide much algebraic ease for the problem
treatment.
However, under actual application situations, the
presumption that certain feature correspondence should be
given a priori is not always tenable. In fact, determining the 2D–3D feature correspondence is an even harder problem than pose estimation itself (when certain feature correspondences are given a priori). To handle
this problem, different methods have been proposed and
mainly fall into three categories: (i) The main idea of the
first category’s methods is direct, that is, to determine the
feature correspondence first, then estimate the pose
parameters based on it. These methods heavily rely on
certain features’ extraction, for example, point feature
extraction, such as SIFT [14], SURF [15], SUSAN [16],
line feature extraction [6–8] and stable region extraction [12]. Reviews of this category's methods are given in [17, 18]. However, as already pointed out in [19–21], no feature is stable and reliable enough in general 3D situations. This highly restricts the application range of these methods; besides, how to extract the corresponding features on the object's 3D model is still an open
problem. (ii) The second category's methods resort to image recognition techniques to bypass the problem of determining the 2D–3D feature correspondence [20, 22]. A gallery of profile images of the object's 3D model under different view angles is created beforehand; the input image is then compared against the gallery, and the pose parameters of the most similar profile image are claimed to be the object's pose. The major disadvantage of these methods is that obtaining a finer parameter approximation requires a much larger profile image gallery, whose size grows exponentially. (iii) The third category's methods adopt an iterative mechanism [21, 23, 24]: a holistic cost function is defined over the input image and the object's 3D model, and the pose parameters are estimated along with the iterative optimisation of this cost function. The major strengths of these methods are that no feature correspondence is required a priori, no profile image gallery is needed, and they are more stable than the methods of the first category. The major weakness is that these methods often have a limited convergence radius, and a relatively good initialisation is necessary for them to converge successfully.
The method proposed in this article generally belongs to
this third category, that is, no feature correspondence
between the input 2D image and object’s 3D model is
required a priori and object’s pose is estimated iteratively.
This method is motivated by the work of Iwashita et al. [23], which uses only the object's contour to estimate pose parameters. Iwashita et al. [23] demonstrated the feasibility of estimating an object's pose using only the image contour, but the
pose estimation algorithm proposed in [23] still needs
improvements. To bypass the problem of determining
feature correspondence between the input 2D image and
object’s 3D model, Iwashita et al. [23] resorted to defining
a holistic ensemble cost function on all contour pixels:
complicated forces and torsion moments are introduced to
drive the object’s model towards aligning with the input
image. The ensemble cost function is highly non-linear
which casts a heavy burden on the subsequent optimisation
process and makes the algorithm prone to being trapped
in a local minimum. Besides the non-linearity, the ensemble
cost function of [23] did not differentiate between different
pose parameters but treated them equally. However, the six
pose parameters (three rotation angle components and three
translation components) are heterogeneous and affect the
object's imaging process in different ways. To achieve good pose estimation results, the algorithm should embody these differences [1, 25]. The third problem is that [23] utilised the gradient descent method to optimise its cost function, but
owing to the high non-linearity of the cost function, explicit
partial differential equations cannot be derived, and
numerical approximation had to be used, which is
computationally expensive and error prone.
In this article, we propose a new iterative pose estimation scheme for 3D rigid objects based on the object's contour. As in [23], we also use the object's contour as the algorithm's starting point, but instead of defining an ensemble cost function and then estimating pose parameters by optimising it, we exploit the information provided by the object's contour in a different way. After extracting the object's outer contour from the input image, we extract contour pixel features and try to establish a tentative 2D–3D point correspondence relationship between the input 2D grey-level image and the object's 3D model. This tentative point correspondence relationship is then used to estimate the object's pose parameters with the well-developed point-based pose estimation algorithm of [1], and the newly estimated pose parameters are then fed back to revise the previously established tentative point correspondence. This process is iterated until the correct point correspondence relationship is established and the pose parameters of the 3D object are successfully retrieved. The main feature of our method is that both the pose estimation problem and the feature correspondence problem are solved simultaneously and iteratively. None of the three problems of [23] mentioned above exists in our method, making it computationally more efficient, faster and more stable. Experiments show that our method has a faster convergence speed and a wider convergence radius.
In a very recently published conference paper [26], Cui and
Hildenbrand proposed an iterative method with an idea
similar to that of ours, that is, iteratively establishing the
point correspondence relationship between the image
contour and object’s 3D model and estimating the pose
parameters based on it. The major difference between our
work and [26] lies in the way of establishing the contour
point correspondence. The way of [26] is 3D–3D-wise:
directly retrieve the nearest vertex point of object’s 3D
model to the 3D sight line of a contour pixel. For a
complex object whose 3D model contains thousands of
vertices, the computational burden of this 3D –3D-wise way
is rather heavy. To decrease the computational complexity,
[26] proposed to simplify the object’s 3D model with 3D
Fourier transform, but this in turn lowers the accuracy of
the established point correspondence relationship. Instead, we adopt a 2D–2D–3D approach: first establish the 2D–2D point correspondence between the input image contour and the projection image contour (see Section 2.2), then back-project the 2D–2D correspondence onto the surface of the object's 3D model to obtain the final 2D–3D point correspondence relationship. The advantage of this indirect approach is that both the 2D–2D point correspondence
establishing step and the back-projecting step can be
accelerated by fast algorithms, which are described in this
paper. Besides, the point correspondence established by Cui and Hildenbrand [26] will never be exactly accurate since only the vertex information of the object's 3D model is used there, causing poor convergence performance for an unevenly triangulated object model. This problem is also solved by our 2D–2D–3D correspondence establishing method.
The remainder of this article is organised as follows: in
Section 2 the details of our method are described and
analysed theoretically; in Section 3 the performance of our
method is studied experimentally, and as will be shown, the
results of convergence speed and convergence radius of our
method are very promising; Section 4 concludes the article.
2 Iterative pose estimation based on the
object’s contour
2.1 Method description
In this section, we address the details of our method and
analyse it theoretically. Before entering into algorithm
specifics, we first describe the outline of the processing
flow of our method, as illustrated in Fig. 1. The whole
processing flow divides into two major stages:
preprocessing stage and iterative stage. The preprocessing
stage (Figs. 1a–c) receives a monocular 2D grey-level
image as input, then the outer contour of the object and
contour pixel features are extracted. The iterative stage
(Figs. 1d–f) first establishes a tentative 2D –3D point
correspondence relationship between the extracted contour
and object’s 3D model with pose parameters estimated at
the last iteration. New values of object’s pose parameters
are then re-estimated based on the established tentative
2D–3D point correspondence. This process is iterated until
correct 2D–3D point correspondence is obtained and the
object’s 3D pose parameters are successfully retrieved.
There are two presumptions made by our iterative pose
estimation method which readers should keep in mind:
1. The object’s 3D pose and its outer contour (on the image
plane) are in one-to-one correspondence. This means, for a
specific contour, there exists only one determinate
corresponding pose of the object;
2. When the object’s pose varies continuously, its
corresponding contour on the image plane also changes
continuously.
Presumption (1) legitimises the object's outer contour as a viable feature for pose estimation, and presumption (2) ensures that the iterative mechanism possesses a reasonable convergence radius. Note that presumption (1) is an idealisation: for artificial 3D objects that are self-symmetric in certain ways, presumption (1) cannot be satisfied strictly. For example, the contours in the top view and the bottom view of an aircraft are identical, so a contour-based pose estimation algorithm cannot differentiate between these two cases. Fortunately, such cases are mathematically rare, having zero measure in the 3D pose space, so we can still apply the contour-based pose estimation algorithm to these objects;
and for the odd cases in which the object's contours are identical, since the possible solutions are known a priori, they do not pose a big problem.
In this article, the perspective camera model is used to describe the camera's imaging process. The camera model and the corresponding coordinate frame configuration are shown in Fig. 2, in which the subscripts $u$, $v$ and $p$ indicate the camera coordinate frame, the object self-centred coordinate frame and the image plane coordinate frame, respectively. The object self-centred coordinate frame
and camera coordinate frame are related by the rigid
transformation as
$$x_u = R x_v + t \qquad (1)$$
or in matrix form as
$$X_u = R X_v + t h^{\mathrm{T}} \qquad (2)$$
in which $h$ is an all-ones vector, $R$ is the rotation matrix and $t$ is the translation vector. The image plane coordinate frame
and object self-centred coordinate frame are related by the
imaging equation
$$\begin{pmatrix} x_p \\ 1 \end{pmatrix} \simeq K\,(R \mid t) \begin{pmatrix} x_v \\ 1 \end{pmatrix} \qquad (3)$$
The symbol '$\simeq$' means equal in the homogeneous sense, and $K$ is the inner camera parameter matrix.
In the remainder of this article, a calibrated camera is assumed, that is, the inner camera parameter matrix $K$ is known. This allows us to leave $K$ out of consideration and makes the subsequent derivations more compact.
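As a concrete illustration of the imaging model (1)–(3), the following minimal Python sketch projects object-frame points into the image plane for a calibrated camera; the values of $K$, $R$ and $t$ are toy assumptions for illustration only, not the authors' Matlab implementation.

```python
# Illustrative sketch of equations (1)-(3): perspective projection of
# object-frame points with a calibrated camera. K, R, t below are toy values.
import numpy as np

def project_points(Xv, R, t, K):
    """Project (N, 3) object-frame points Xv into pixel coordinates.

    R, t realise the rigid transformation of equation (1);
    K is the inner camera parameter matrix of equation (3).
    """
    Xu = Xv @ R.T + t              # camera-frame coordinates, equation (1)
    xh = Xu @ K.T                  # homogeneous image coordinates
    return xh[:, :2] / xh[:, 2:3]  # divide out the homogeneous scale

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])
print(project_points(np.array([[0.1, -0.2, 0.3]]), R, t, K))
```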
2.2 Contour extraction and tentative point
correspondence establishing
Our method is based on the outer contour of the object. A
clean, continuous, single-pixel wide contour is necessary
for the method to produce good performance. To extract
object’s contour from the input image, various methods are
available. In our current implementation, we use a modified
active contour model proposed by the authors, which produces a closed, noise-robust contour while still being very fast. The contour of the projection image, which is
generated by projecting the object's 3D model onto the image plane, can be extracted by simple binary operations,
which can be implemented efficiently with modern central
processing unit–graphics processing unit (CPU–GPU)
computing architecture. For ease of statement, in the remainder of this article we will refer to the contour extracted from the input 2D grey-level image as the 'extracted contour', and to the contour obtained from the projection image as the 'projected contour'.
After extracting the object’s contour from the input image,
the next step is to extract contour pixel features and establish a tentative point correspondence relationship between the extracted contour and the object's 3D model. How to determine
the point correspondence between a 2D image and a 3D
model directly is still an open problem in computer vision.
Here we solve this problem indirectly, that is, firstly project
Fig. 2 Camera model and coordinate frames’ configuration
Fig. 1 Algorithm flowchart of the proposed method
a Input 2D grey-level image
b Extracted contour from the input image
c Extract contour pixel features
d Establish tentative point correspondence between the extracted contour and the object's 3D model
e Model wireframe overlaid on the 2D grey-level image with the estimated pose
f Final pose estimation result
a–c belong to the preprocessing stage, and d–f belong to the iterative stage
the object’s 3D model onto the image plane to obtain the
projected contour; secondly, determine the 2D–2D point correspondence between the extracted contour and the projected contour; and thirdly, back-project the points on the projected contour that have corresponding points on the extracted contour onto the surface of the object's 3D model, carrying the established 2D–2D point correspondence forward to the final 2D–3D point correspondence.
The contour pixel feature extracted in our current
implementation is simply the pixel's position in the image. This
is the simplest feature ever possible for contour pixels. As
will be shown in the experiment section, even this simple
feature can provide good convergence performance. More
complex features may further improve our method's convergence performance, but this is not the focus of the current article; here we concentrate on how far the proposed scheme can go with this simplest feature.
2.2.1 Establishing 2D–2D point correspondence: To
determine the 2D –2D point correspondence between the
extracted contour from the input image and the projected
contour from the projection image, we resort to a very
intuitive observation, that is, when two poses of an object are close to each other, the correct corresponding points on the two contours should also be geometrically close to each other. Thus, we can first establish the
point correspondence tentatively by searching geometrically
nearest point pairs, and revise it iteratively with updated
pose parameter estimates. Finding a geometrically nearest point pair is not a difficult mathematical concept; however, finding all nearest point pairs between two contours, each of which may contain hundreds of points, can be rather computationally expensive if not implemented properly.
Here we propose to use a distance map to establish the 2D–2D point correspondence between the two contours efficiently. Note that a distance map is also used in [23], but in a very different way: there the distance map lies at the kernel of the method and is used to define the ensemble cost function, whereas in our method the distance map is used to accelerate the 2D–2D point correspondence establishment.
A distance map describes the shortest distance of image
pixels to the given contour. More mathematically, given the
extracted contour C, for an image pixel x, its distance map
value is given as
$$D(x) = \min_{y \in C} \| x - y \| \qquad (4)$$
This definition is very similar to the signed distance function
commonly used in level set theory except that no sign
determination is necessary here. So the fast marching method [27], which is often used to compute the signed distance function, can be used to build the distance map [23].
However, the fast marching method is designed for general distance measures, and its $O(N \log N)$ computational complexity ($N$ is the total number of image pixels) is still rather high even for a medium-sized image. In our implementation, since the distance map is built only with the Euclidean distance measure, we adopt the distance transform method proposed in [28], whose computational complexity is only $O(N)$ and which is much faster than the fast marching method for large $N$. It is worth emphasising that the distance map needs to be computed only once during the whole pose estimation process.
As defined in (4), the distance map value of a given image pixel is the shortest distance from this position to the given contour; thus, similar to the signed distance function of level sets, pixels that have the same distance map value form a closed and continuous isocontour, see Fig. 3. To retrieve the nearest contour pixel for a given position, we can simply trace along the negative gradient direction from this position, and the first contour pixel met along this path is the required nearest contour pixel. Fig. 3 illustrates how the nearest contour pixel can be retrieved efficiently by taking advantage of the built distance map. The greatest advantage of utilising a distance map for nearest-point retrieval is that the map needs to be built only once; once it is built, the computational complexity of retrieving the geometrically nearest point is reduced to only O(1) per query.
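To make the distance-map machinery concrete, the sketch below builds the map and answers nearest-contour-pixel queries in O(1). It is an illustrative Python fragment only: SciPy's Euclidean distance transform stands in for the O(N) transform of [28], and its optional index output replaces the gradient tracing of Fig. 3 by returning, for every pixel, the coordinates of its nearest contour pixel directly.

```python
# Illustrative sketch: build the distance map of equation (4) once, then answer
# nearest-contour-pixel queries in O(1). SciPy's EDT is used here as a stand-in
# for the O(N) distance transform of [28].
import numpy as np
from scipy.ndimage import distance_transform_edt

def build_distance_map(contour_mask):
    """contour_mask: (H, W) boolean array, True on extracted-contour pixels."""
    # The EDT measures the distance of every pixel to the nearest zero element,
    # so the contour pixels are passed in as zeros.
    out = distance_transform_edt(~contour_mask, return_indices=True)
    D, idx = out[0], out[1]
    return D, idx  # idx[:, r, c] holds the nearest contour pixel of (r, c)

def nearest_contour_pixel(idx, r, c):
    """O(1) lookup of the nearest contour pixel for image position (r, c)."""
    return idx[0, r, c], idx[1, r, c]

# Toy usage: a 5x5 image whose 'contour' is the middle column
mask = np.zeros((5, 5), dtype=bool)
mask[:, 2] = True
D, idx = build_distance_map(mask)
# Pixel (2, 0) is two pixels away from the contour; its nearest contour pixel is (2, 2)
print(D[2, 0], nearest_contour_pixel(idx, 2, 0))
```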
2.2.2 Establishing 2D–3D point correspondence:
After the 2D–2D point correspondence relationship
between the extracted contour and the projected contour is
established, the next step is to carry the established 2D–2D
point correspondence relationship forward to the final 2D –
3D point correspondence required by the subsequent point-
based pose estimation sub-procedure. What we need to do
is to back-project those points of the projected contour that have corresponding points on the extracted contour onto the surface of the object's 3D model to obtain the
corresponding 3D points’ coordinates. To accomplish this
task, we propose the following two-step method.
The first step is to retrieve the triangular patches which
correspond to the projected contour from the 3D model. To
fully utilise the power of a modern GPU, we dye each triangular patch of the 3D model with a different colour; then, using this colour attribute as a hash index, we can efficiently retrieve the required triangular patches from the object's 3D model, which usually consists of thousands of triangular patches.
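The colour-coding idea reduces to a reversible mapping between patch indices and RGB triples; the sketch below shows only that mapping (the ID rendering itself is assumed to be done on the GPU elsewhere).

```python
# Illustrative sketch of the colour-as-hash-index idea: a reversible mapping
# between a triangular patch index and the RGB colour used in the ID rendering.
import numpy as np

def index_to_rgb(i):
    """Encode patch index i (0 .. 2^24 - 1) into a unique RGB colour."""
    return np.array([(i >> 16) & 0xFF, (i >> 8) & 0xFF, i & 0xFF], dtype=np.uint8)

def rgb_to_index(rgb):
    """Decode the colour read under a projected-contour pixel back to a patch index."""
    r, g, b = (int(v) for v in rgb)
    return (r << 16) | (g << 8) | b

assert rgb_to_index(index_to_rgb(123456)) == 123456  # round trip, ~16.7M patches supported
```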
Since the retrieved triangular patch can be relatively large
with respect to the scale of the whole 3D model, it would be
too coarse to be used in the subsequent pose estimation sub-
procedure. The next step is to obtain the precise 3D points’
coordinates with the retrieved triangular patches. Assume
that there is no rotation and translation transform between
camera coordinate frame and object self-centred coordinate
frame, and let the three vertices of a triangular patch be
represented as $X_v = \{x_{v1}, x_{v2}, x_{v3}\}$; then the plane defined
Fig. 3 Retrieving the nearest contour pixel for a given position
The black dot is the start position; with the help of the distance map, the required nearest contour pixel (shown as a grey dot) can be found very efficiently
by this triangular patch will be
$$P = \begin{pmatrix} (x_{v1} - x_{v3}) \times (x_{v2} - x_{v3}) \\ -\,x_{v3} \cdot (x_{v1} \times x_{v2}) \end{pmatrix} \qquad (5)$$
Let $x_{vg}$ represent the corresponding 3D point's coordinates and $x_g$ the sight-line direction of the contour pixel in the camera frame; then $x_{vg}$ is given as
$$x_{vg} = \left( \frac{-P(4)}{x_g \cdot P(1{:}3)} \right) x_g \qquad (6)$$
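The two formulas translate directly into a short ray-plane intersection routine. The sketch below is illustrative and assumes that $x_g$ is the sight-line direction of the contour pixel in the camera frame (e.g. $K^{-1}[u, v, 1]^{\mathrm{T}}$).

```python
# Illustrative sketch of equations (5)-(6): build the plane of a triangular
# patch and intersect the contour pixel's sight line with it.
import numpy as np

def patch_plane(xv1, xv2, xv3):
    """Plane 4-vector P of equation (5) through the three patch vertices."""
    normal = np.cross(xv1 - xv3, xv2 - xv3)
    d = -xv3 @ np.cross(xv1, xv2)
    return np.concatenate([normal, [d]])

def back_project(xg, P):
    """Point x_vg of equation (6): scale the sight-line direction xg onto plane P."""
    lam = -P[3] / (xg @ P[:3])
    return lam * xg

# Toy usage: a patch lying in the plane z = 2, sight line through its interior
P = patch_plane(np.array([1.0, 0.0, 2.0]),
                np.array([0.0, 1.0, 2.0]),
                np.array([0.0, 0.0, 2.0]))
print(back_project(np.array([0.1, 0.1, 1.0]), P))  # -> [0.2, 0.2, 2.0]
```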
2.3 Iterative pose estimation and convergence
specifics
After the tentative 2D–3D point correspondence relationship
is established between the input image and object’s 3D
model, the next step is to execute the point-based pose
estimation sub-procedure. In our current implementation,
we adopt the orthogonal iteration (OI) algorithm proposed in [1], which is fast and numerically precise. With the tentative 2D–3D point correspondence and the OI algorithm, new values can be estimated for the 3D object's pose parameters. These updated pose parameters are then fed back to generate a new projection image, establish a new tentative point correspondence and then estimate new values for the object's 3D pose parameters. This process is
then iterated until the correct point correspondence relationship is established and the correct pose estimation result is retrieved; if the process does not converge within a preassigned number of iterations, we abort it and return a failure.
To measure the fitness of the pose estimation result
returned by the iterative method, we perform the 'XOR' operation between the binary image extracted from the input grey-level image and the binary projection image. If the pose estimation result is close to the object's actual pose, then the area of the regions remaining after the XOR operation will be small. So we define the ratio
$$A_{\text{ratio}} = \frac{\text{area}(BI_{ob} \oplus BI_{pr})}{\text{area}(BI_{pr})} \qquad (7)$$
in which $BI_{ob}$ represents the binary image extracted from the input grey-level image and $BI_{pr}$ represents the binary projection image obtained by projecting the object's 3D model onto the image plane. For a good pose estimation result, the value of $A_{\text{ratio}}$ will be small. This measure will be used for convergence determination in the experiment section.
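For completeness, evaluating (7) from the two binary masks takes only a few lines; the masks in the sketch below are toy examples.

```python
# Illustrative sketch of the fitness measure (7): XOR the object mask from the
# input image with the binary projection mask and normalise by the latter's area.
import numpy as np

def a_ratio(bi_ob, bi_pr):
    """bi_ob, bi_pr: (H, W) boolean masks (input-image object and projection)."""
    xor_area = np.count_nonzero(np.logical_xor(bi_ob, bi_pr))
    return xor_area / np.count_nonzero(bi_pr)

# Toy masks: the projection misses one of the four object pixels
ob = np.zeros((4, 4), dtype=bool); ob[1:3, 1:3] = True
pr = ob.copy(); pr[1, 1] = False
print(a_ratio(ob, pr))  # 1 mismatching pixel / 3 projection pixels ~ 0.33
```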
The last problem left undiscussed so far is how to obtain a good initialisation for the iterative method so as to ensure satisfactory pose estimation results. Since it is not the focus of this article, we give only a brief discussion of this problem. To fulfil the initialisation task, many template-based methods can be used [29–31]. Our approach is similar to the work of [20]: a small gallery of normalised profile images is built beforehand, with a rather coarse sampling of the possible poses; normalisation is necessary to remove the effect of the translation parameters. Then, after the input grey-level image is segmented, the result image is normalised and compared against the profile image gallery, and the pose of the best-matching profile image is chosen to initialise our iterative pose estimation method.
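A minimal sketch of this coarse initialisation is given below; it assumes the gallery is available as a list of (pose, silhouette) pairs, and the bounding-box normalisation and pixel-overlap score shown are simple illustrative choices rather than a prescribed implementation.

```python
# Hedged sketch of the template-based initialisation: normalise the segmented
# input silhouette and pick the pose of the best-matching gallery silhouette.
import numpy as np
from scipy.ndimage import zoom

def normalise(mask, size=64):
    """Crop a binary silhouette to its bounding box and rescale to size x size."""
    ys, xs = np.nonzero(mask)
    crop = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1].astype(float)
    return zoom(crop, (size / crop.shape[0], size / crop.shape[1]), order=0) > 0.5

def initial_pose(input_mask, gallery):
    """gallery: list of (pose, silhouette_mask) pairs built beforehand."""
    q = normalise(input_mask)
    scores = [np.count_nonzero(normalise(m) == q) for _, m in gallery]
    return gallery[int(np.argmax(scores))][0]
```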
Table 1 summarises the algorithm flow of the proposed
contour-based iterative pose estimation method for a 3D
rigid object.
3 Experiment results
In this section, we test the performance of our iterative pose
estimation method in convergence speed, convergence
radius and robustness to noise with various-shaped 3D rigid
objects. As a comparison, the pose estimation results
returned by Iwashita’s method [23] and Cui’s method [26]
are also presented. All the code involved is implemented in Matlab scripts and run on a PC with a 1.8 GHz CPU and 1 GB RAM.
3.1 Convergence speed performance
In this subsection, the convergence speed performance of our
new method is tested with a small model gallery which
consists of an aircraft, a racecar, a house, a desk lamp and
a grand piano. These models are of various shapes, size
scales and detail complexity, providing good sample
diversity for the following experiments.
We run the pose estimation methods five times for each
model, with different pose initialisation each time. For the
pose initialisation required by the iterative methods, we
pollute the true pose values with Gaussian noise, which produces a total deviation of about 30° in the 3D rotation
angles. For Iwashita's method [23], to guarantee the best convergence radius, at each iteration the Armijo rule [32] is used to search for an optimal time step for each of the six pose parameters. The time cost of this optimal step search varies dramatically with the parameter tuning, so for fairness we do not count the absolute time in the convergence speed comparison for Iwashita's method. Also, for Cui's method [26], to guarantee the best convergence radius, the model simplification procedure proposed in [26] is not used here; for fairness the absolute convergence time is likewise not counted for Cui's method.
Table 1 Contour-based iterative pose estimation of a 3D rigid object
Preprocessing stage:
1. receive a monocular 2D grey-level image as input
2. extract the object's outer contour and the contour pixel features
3. build the distance map based on the contour pixel features extracted in step 2
Iterative stage:
1. generate the projection image using the last estimated pose parameters and extract the object's outer contour from the projection image
2. establish the tentative 2D–2D point correspondence relationship between the extracted contour and the projected contour
3. back-project the projected contour onto the 3D model's surface and establish the tentative 2D–3D point correspondence relationship between the extracted contour and the object's 3D model
4. solve the point-based pose estimation problem using the OI algorithm [1]
5. check the convergence condition (7); if it is not satisfied, go to step 1
To measure the fitness of the results returned by the methods, we adopt $A_{\text{ratio}}$ defined in Section 2. If $A_{\text{ratio}}$ is smaller than a given threshold, the pose estimation task is claimed to be successfully accomplished; otherwise it is claimed as a failure.
Fig. 4 illustrates the pose estimation results returned by our
method for the house, desk lamp and piano models. The left-
column images show the initial pose configurations and the
right-column images present the final pose estimation results.
For ease of inspection, the effect of translation is removed
for all the images. For all the 5 × 5 test runs, our method registers the object's 3D model (represented by a wire mesh in the image) precisely with the input 2D grey-level image.
Even though some parts of the initially posed model are far
from their counterparts in the 2D grey-level image, a good
estimation result can still be achieved. This also reveals that
our method has good error-tolerance performance.
Table 2 summarises the convergence speed statistics of the three methods for the test gallery. The threshold of $A_{\text{ratio}}$ is set to a rule-of-thumb value of 4e-2. For our method, it usually takes tens of iterations to converge successfully, and the time cost per iteration is less than 1/3 s in most cases. Iwashita's method usually takes hundreds of iterations before convergence, which is much slower than ours; Cui's method takes fewer iterations than Iwashita's method, but is still slower than ours.
Apart from convergence speed, Iwashita's method failed in several cases for the aircraft, house and desk lamp models. It appeared to be trapped in local optima because of its complicated, highly non-linear cost function and error-prone gradient descent-based optimisation. Cui's method also failed in several cases for the house, desk lamp and piano models, because it failed to establish the correct 2D–3D point correspondence for these unevenly triangulated test models. In contrast, the performance of our method is more consistent and stable. These statistics also indicate that our method has a wider convergence radius, which will be examined further in the following subsection.
In Fig. 5, six frames from one pose estimation run are presented to give a clearer view of the convergence process of our method. These six frames are extracted from a total of 72 iterations (to make the iteration last longer, we intentionally initialise the method poorly and set a much smaller threshold value for $A_{\text{ratio}}$ than before). Fig. 6 presents two curves: (i) the evolution curve of $A_{\text{ratio}}$ and (ii) the evolution curve of the residual error of the point-based pose estimation sub-procedure. As can be seen, the two curves are consistent with each other, and together they demonstrate how the object's 3D model becomes registered to the input 2D grey-level image along the iterations.
3.2 Convergence radius performance
For iterative methods, their convergence radii are always
expected to be as wide as possible. The wider the
convergence radius is, the easier it is to find a plausible initial
solution. There are two main factors influencing the
convergence radius of our method: the model's shape and the camera's view angle. For different model shapes and camera
view angles, the actual convergence radius is usually different.
In this subsection, the convergence radius performance of our
method against these two factors will be examined. For
comparison, results obtained with Iwashita’s method and
Cui’s method are also presented.
Fig. 4 Illustration of pose estimation results returned by our
method for the house, desk lamp and piano models
The images in the left column show the initial pose configurations, and the images in the right column show the final pose estimation results. The projection of the object's 3D model is represented by a wire mesh. The effect of translation is removed for ease of inspection
a, c and e Initial pose configurations
b, d and f Final pose estimation results
Table 2 Convergence speed statistics for the methods of [23, 26] and our method

Model       | Method of [23]                                    | Method of [26]                                | Our method
            | Avg. iter. no.  Time/iter., s (a)  Cases failed (b) | Avg. iter. no.  Time/iter., s  Cases failed  | Avg. iter. no.  Time/iter., s  Cases failed
aircraft    | 192.6           –                  1                | 53.6            –              0             | 26.4            0.0786         0
race car    | 65.8            –                  0                | 16.0            –              0             | 8.4             0.2246         0
house       | 124.4           –                  2                | 38.6            –              1             | 12.6            0.2165         0
desk lamp   | 221.2           –                  2                | 51.6            –              1             | 22.4            0.1710         0
piano       | 126.8           –                  0                | 24.2            –              1             | 12.8            0.2291         0

(a) For fairness, the time cost per iteration is not counted for the methods of [23, 26] here
(b) The threshold of $A_{\text{ratio}}$ is set to 4e-2
The biggest problem in executing this subsection's experiments is that the number of possible poses for a 3D object is extremely large; for example, with a 1° sample interval for each of the three pose angles, the total number of pose samples would be 360 × 360 × 180 ≈ 2.3 × 10^7.
Owing to computing power limitations, we compromise on the number of test models and the sample density of possible poses as follows: for the test models we choose two very different objects, an aircraft and a dolphin toy, which are shown in Figs. 7a and b; for the pose sample density, we set the sample intervals for the first two pose angles (yaw angle $\alpha$ and pitch angle $\beta$) both to 20°; the third pose angle (roll angle) is omitted, since its effect is an in-plane rotation that can be recovered simply by a 2D rotation. The initial pose solution required by the iterative methods is obtained by offsetting the true pose within the range
$$\Delta\alpha \times \Delta\beta = [-45^\circ, 45^\circ] \times [-45^\circ, 45^\circ] \qquad (8)$$
with a 5° interval for each rotation angle. Thus, for one test model, the number of sampled poses (i.e. different camera view angles) is 18 × 9, and for each sampled pose there are 19 × 19 test runs to explore its neighbourhood. This results in a total of 1.17 × 10^5 runs for each method.
To display the experiment results, we calculate the number
of successful runs for each sampled pose and map it onto a
unit sphere, see Fig. 7. Each point on the sphere indicates the width of the convergence radius for the corresponding view angle: the brighter the colour, the wider the convergence radius. Owing to the left-right symmetry of the test models, the front view and the back view of the result sphere are approximately identical, so only the front view of the sphere is shown.
To quantify the convergence radius results, we introduce the following equation to convert the number of successful runs into an equivalent radius value
$$\text{Radius} = \sqrt{\frac{N_{\text{success}}}{N_{\text{total}}}} \cdot R_{\text{total}} \qquad (9)$$
in which $N_{\text{success}}$ is the number of successful runs for a given sampled pose, $N_{\text{total}}$ is the total number of test runs for that sampled pose and $R_{\text{total}}$ is the radius of the offset range. This equation is derived from the fact that the area of a circle is proportional to the square of its radius. In our case, we have
$$N_{\text{total}} = 361, \quad R_{\text{total}} = 45^\circ \qquad (10)$$
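As a quick numeric check of (9) and (10), the following one-liner converts a success count into an equivalent radius, assuming the square-root form implied by the circle-area argument above.

```python
# Illustrative check of equations (9)-(10): convert a success count into an
# equivalent convergence radius (square-root form from the circle-area argument).
import math

def equivalent_radius(n_success, n_total=361, r_total=45.0):
    return math.sqrt(n_success / n_total) * r_total

print(equivalent_radius(300))  # about 41 degrees
```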
Table 3 summarises the convergence radius statistics of the three methods. Several conclusions can be drawn from Fig. 7 and Table 3:
1. For different-shaped models and camera view angles, the convergence radius performance can vary dramatically for all three methods;
2. All three methods exhibit similar convergence radius variation trends;
3. Our method has a much wider convergence radius than Iwashita's method and Cui's method, both for the complex aircraft model and for the relatively simple dolphin toy model.
3.3 Noise robustness
Experiments in the previous two subsections postulate ideal
contour extraction result. In actual applications, the contour
extracted from the input grey-level image is seldom ideal
because of noise, complex image background, lighting
variation etc. This imperfectness will surely bring down the
pose estimation method’s performance. In this subsection,
we study the noise robustness performance of our method
under various signal to noise ratio (SNR) condition.
The test models used in this subsection are the same as in Section 3.2, see Figs. 7a and b. To generate the simulation test images, we set the sample intervals of the yaw angle $\alpha$ and
Fig. 5 Convergence illustration of our method
a–f are the 1st, 2nd, 4th, 10th, 20th and 72nd frames out of a total of 72 iterations
Fig. 6 Convergence curves
a Evolution curve of $A_{\text{ratio}}$
b Evolution curve of the residual error of the point-based pose estimation sub-procedure
the pitch angle $\beta$ both to 10°. Compared with total randomisation, this setting ensures more accurate and comprehensive performance evaluation results. The extracted contour is then contaminated with different levels of Gaussian noise to simulate real-world defects, see Fig. 8. The initial pose required by the method is obtained by polluting the true value with Gaussian noise at 20 dB SNR. The iteration number is fixed to 100 for each run. To evaluate the
Fig. 7 Convergence radius experiment results mapped onto a unit sphere
a and b Test models: an aircraft and a dolphin toy
c and d Front views of the result spheres for Iwashita's method
e and f Front views of the result spheres for Cui's method
g and h Front views of the result spheres for our method
Table 3 Convergence radius statistics for the methods of [23, 26] and our method (offset range (8))

Model        | Method of [23]                                    | Method of [26]                                    | Our method
             | Min radius  Max radius  Mean radius  Std variance | Min radius  Max radius  Mean radius  Std variance | Min radius  Max radius  Mean radius  Std variance
aircraft     | 10.86       28          18.9         3.61         | 12.7        37.1        23.6         4.34         | 22.8        44.5        33.3         4.71
dolphin toy  | 13.5        31.6        23.7         3.71         | 17.2        36.2        28.6         4.09         | 27.1        45          41.1         3.87
noise robustness performance, we analyse the statistics of the final $A_{\text{ratio}}$ returned by our method.
Results are shown in Fig. 9. As can be seen, for both test models the inflection point of the mean $A_{\text{ratio}}$ curve occurs at 30 dB. When SNR > 30 dB, the mean values of $A_{\text{ratio}}$ are both lower than 1e-1, an empirical value indicating that an acceptable pose estimation result has been obtained; however, when SNR < 30 dB, both curves rise rapidly. This indicates that for our method to work properly, an SNR above 30 dB is required. Referring to Fig. 8b, which illustrates what a noise-contaminated contour looks like at 30 dB SNR, we conclude that the noise robustness of our method is satisfactory.
3.4 Experiments with real-world image data
In this subsection, the effectiveness of our new method is verified with experiments using real-world aircraft images. Aircraft are highly manoeuvrable and can be arbitrarily posed, and thus fit well the 'general 3D rigid object' setting addressed in this article. Experiment data and results are presented in Fig. 10. The 1st and 3rd columns of Fig. 10 are the input monocular images, and the 2nd and 4th columns are the corresponding pose estimation results returned by our method. The result images are rendered with the estimated pose parameters and the object's 3D model. The aircraft in Figs. 10a and b is the F16 Fighting Falcon, and that in Figs. 10c and d is the F22 Raptor. From Fig. 10, we can see that our new method successfully accomplishes the pose estimation tasks under various object poses, sizes and image qualities.
4 Conclusions
In this article, we focus on pose estimation of a general 3D rigid object when no feature correspondence between the input monocular image and the object's 3D model is available a priori, and a new contour-based iterative method is proposed which is fast and has a wide convergence radius. Our new method solves the feature correspondence problem and the pose estimation problem simultaneously and iteratively, that is, not only the pose parameters of the 3D object but also the 2D–3D point correspondence between the input grey-level image and the object's 3D model are retrieved. The tentative point correspondence establishing scheme frees our method from defining highly non-linear ensemble cost functions, making it computationally more efficient and stable. Experimental results show that the performance of our method is promising in convergence speed, convergence radius and noise robustness.
In the present work, we adopt only 2D geometric properties to build the tentative 2D–3D point correspondence between the input grey-level image and the object's 3D model. It would be sensible to incorporate other image properties to improve the accuracy of the tentative
Fig. 8 Noise-contaminated contour extraction results
a SNR = 60 dB
b SNR = 30 dB
c SNR = 10 dB
Fig. 9 Results of noise robustness experiments
Fig. 10 Pose estimation results for real-world image data by our method
The 1st and 3rd columns are the input monocular images, and the 2nd and 4th columns are the corresponding pose estimation results. The result image is rendered with the estimated pose parameters and the object's 3D model
a and b Pose estimation results for the F16 Fighting Falcon
c and d Pose estimation results for the F22 Raptor
point correspondence establishing procedure, provided it does not incur too much extra computational cost. This will be future work.
5 Acknowledgment
We would like to acknowledge Prof. Iwashita for her help in
understanding their algorithm.
6 References
1 Lu, C.P., Hager, G.D., Mjolsness, E.: ‘Fast and globally convergent pose
estimation from video images’, IEEE Trans. Pattern Anal. Mach. Intell.,
2000, 22, (6), pp. 610– 622
2 Burschka, D., Mair, E.: ‘Direct pose estimation with a monocular
camera’, Robot Vis., 2008, (LNCS,4931), pp. 440– 453
3 Haralick, R.M., Lee, C.N., Ottenberg, K., Nölle, M.: 'Review and
analysis of solution of the three point perspective pose estimation
problem’, IJCV, 1994, 13, (3), pp. 331– 356
4 Moreno-Noguer, F., Lepetit, V., Fua, P.: ‘Accurate non-iterative O(n)
solution to the PnP problem’. IEEE ICCV’07, Rio de Janeiro,
pp. 2168– 2175
5 Leng, D.W., Sun, W.D.: ‘Finding all the solutions of PnP problem’.
IEEE IST’09, Shenzhen, pp. 348– 352
6 Ansar, A., Daniilidis, K.: ‘Linear pose estimation from points or lines’,
IEEE Trans. Pattern Anal. Mach. Intell., 2003, 25, (5), pp. 578– 589
7 David, P., DeMenthon, D., Duraiswami, R., Samet, H.: ‘Simultaneous
pose and correspondence determination using line features’.
CVPR’03, 2003, vol. 2, pp. 424– 431
8 Christy, S., Horaud, R.: ‘Iterative pose computation from line
correspondences’, CVIU, 1999, 73, (1), pp. 137– 144
9 Hanning, T., Schoene, R., Graf, S.: ‘A closed form solution for
monocular re-projective 3D pose estimation of regular planar
patterns’, ICIP, 2006, 1–7, pp. 2197– 2200
10 Jacobs, D., Basri, R.: ‘3D to 2D pose determination with regions’, IJCV,
1999, 34, (2– 3), pp. 123 –145
11 Tahri, O., Chaumette, F.: ‘Complex objects pose estimation based on
image moment invariants’. Proc. IEEE Int. Conf. on Robotics and
Automation, Barcelona, Spain, April 2005, pp. 436–441
12 Donoser, M., Bischof, H.: ‘Efficient maximally stable extremal region
(MSER) tracking’. CVPR’06, 2006, vol. 1, pp. 553– 560
13 Kyriakoulis, N., Gasteratos, A.: ‘Color-based monocular visuoinertial
3-D pose estimation of a Volant robot’, IEEE Trans. Instrum. Meas.,
2010, 59, (10), pp. 2706–2715
14 Lowe, D.G.: ‘Distinctive image features from scale-invariant keypoints’,
IJCV, 2004, 60, (2), pp. 91– 110
15 Bay, H., Tuytelaars, T., Gool, L.V., Zurich, E.: ‘SURF: speeded up
robust features’, CVIU, 2008, 110, (3), pp. 346– 359
16 Smith, S.M., Brady, J.M.: ‘SUSAN – a new approach to low level
image processing’, IJCV, 1997, 23, (1), pp. 45– 78
17 Viksten, F., Forssén, P.E., Johansson, B., Moe, A.: 'Comparison of local
image descriptors for full 6 degree-of-freedom pose estimation’. IEEE
Int. Conf. on Robotics and Automation, Kobe, Japan, 2009,
pp. 1139– 1146
18 Shan, G.L., Ji, B., Zhou, Y.F.: ‘A review of 3D pose estimation from a
monocular image sequence’. CISP’09, 2009, Tianjin, pp. 1 –5
19 Lee, T.K., Drew, M.S.: ‘3D object recognition by eigen-scale-space of
contours’. SSVM ’07, 2007, vol. 4485, pp. 883–894
20 Dunker, J., Hartmann, G., Stöhr, M.: 'Single view recognition and
pose estimation of 3D objects using sets of prototypical views and
spatially tolerant contour representations’. ICPR’96, 1996, vol. 4,
pp. 14–18
21 Dambreville, S., Sandhu, R., Yezzi, A., Tannenbaum, A.: 'A geometric
approach to joint 2D region-based segmentation and 3D pose estimation
using a 3D shape prior’, SIAM J. Imaging Sci., 2010, 3, (1),
pp. 110–132
22 Poggio, T., Edelman, S.: ‘A network that learns to recognize three-
dimensional objects’, Nature, 1990, 343, pp. 263– 266
23 Iwashita, Y., Kurazume, R., Konishi, K., Nakamoto, M., Hashizume,
M., Hasegawa, T.: ‘Fast alignment of 3D geometrical models and 2D
grayscale images using 2D distance maps’, Syst. Comput. Jpn., 2007,
38, (14), pp. 1889–1899
24 Chetverikov, D., Stepanov, D., Krsek, P.: ‘Robust Euclidean alignment
of 3D point sets: the trimmed iterative closest point algorithm’, Image
Vis. Comput., 2005, 23, (3), pp. 299– 309
25 DeMenthon, D.F., Davis, L.S.: ‘Model-based object pose in 25 lines of
code’, IJCV, 1995, 15, (1– 2), pp. 123–141
26 Cui, Y., Hildenbrand, D.: ‘Pose estimation based on geometric algebra’.
GraVisMa, 2009, pp. 17– 24
27 Sethian, J.A.: ‘A fast marching level set method for monotonically
advancing fronts’, Proc. Natl. Acad. Sci. USA, 1996, 93, pp. 1591– 1595
28 Felzenszwalb, P.F., Huttenlocher, D.P.: ‘Distance transforms of sampled
functions’, Cornell Computing and Information Science TR2004-1963,
available at: http://ecommons.library.cornell.edu/handle/1813/5663
29 Horaud, R.: ‘New methods for matching 3D objects with single
perspective views’, IEEE Trans. Pattern Anal. Mach. Intell., 1987, 9,
(3), pp. 401–412
30 Dhome, M., Richetin, M., Lapresté, J.T., Rives, G.: 'Determination
of the attitude of 3D objects from a single perspective view’,
IEEE Trans. Pattern Anal. Mach. Intell., 1989, 11,(12),
pp. 1265– 1278
31 González, J.M., Sebastián, J.M., García, D., Sánchez, F., Angel, L.:
‘Recognition of 3D object from one image based on projective
and permutative invariants’. ICIAR’04, 2004, vol. 3211,
pp. 705–712
32 Bertsekas, D.P.: ‘Constrained optimization and Lagrange multiplier
methods’ (Academic Press, 1982)