
Published in IET Computer Vision

Received on 12th July 2010

Revised on 17th February 2011

doi: 10.1049/iet-cvi.2010.0098

ISSN 1751-9632

Contour-based iterative pose estimation of 3D rigid object

D.W. Leng, W.D. Sun

Department of Electronic Engineering, Tsinghua University, Beijing 100084, People's Republic of China

E-mail: jimdavid126@gmail.com

Abstract: Estimating the pose parameters of a 3D rigid object from a 2D monocular image is a fundamental problem in computer vision. State-of-the-art methods usually assume that certain feature correspondences are available a priori between the input image and the object's 3D model. This presumption makes the problem more algebraically tractable. However, when no feature correspondence is available a priori, how to estimate the pose of a truly 3D object using just one 2D monocular image is still not well solved. In this article, a new contour-based method is proposed which solves both the pose estimation problem and the feature correspondence problem simultaneously and iteratively. The outer contour of the object is firstly extracted from the input 2D grey-level image; then a tentative point correspondence relationship is established between the extracted contour and the object's 3D model, based on which the object's pose parameters are estimated; the newly estimated pose parameters are then used to revise the tentative point correspondence relationship, and the process is iterated until convergence. Experimental results are promising, showing that the authors' method has fast convergence speed and a good convergence radius.

1 Introduction

Pose estimation of a 3D rigid object from monocular observation is a fundamental problem in computer vision, computer graphics, photogrammetry, robotics etc. Conventional methods often assume that a certain feature correspondence relationship is available a priori between the input 2D grey-level image and the 3D model of the object, for example, point correspondence [1-5], which is most commonly utilised; line correspondence [6-8]; plane correspondence [9, 10]; and other feature correspondences [11-13]. These presumed feature correspondence relationships greatly simplify the algebraic treatment of the problem.

However, in actual application situations, the presumption that a certain feature correspondence is given a priori is not always tenable. In fact, determining the 2D-3D feature correspondence is an even harder problem than pose estimation (when a certain feature correspondence is given a priori) itself. To handle this problem, different methods have been proposed, which mainly fall into three categories. (i) The idea of the first category is direct: determine the feature correspondence first, then estimate the pose parameters based on it. These methods rely heavily on the extraction of certain features, for example, point features such as SIFT [14], SURF [15] and SUSAN [16], line features [6-8] and stable regions [12]. Viksten et al. [17, 18] give reviews of this category's methods. However, as already pointed out by Lee et al. [19-21], no feature is stable and reliable enough in general 3D situations. This highly restricts the application range of these methods; besides, how to extract the corresponding feature on the object's 3D model is still an open problem. (ii) The second category resorts to image recognition techniques to bypass the problem of determining the 2D-3D feature correspondence [20, 22]. A gallery of profile images of the object's 3D model under different view angles is created beforehand; the input image is then compared within the gallery, and the pose parameters of the most similar profile image are claimed to be the object's pose. The major disadvantage of these methods is that obtaining a finer parameter approximation requires a much larger profile image gallery, whose size grows exponentially. (iii) The third category adopts an iterative mechanism [21, 23, 24]: a certain holistic cost function is defined with the input image and the object's 3D model, and the pose parameters are estimated along with the iterative optimisation of the cost function. The major strengths of these methods are that no feature correspondence is required a priori, no profile image gallery is needed, and they are more stable than the methods of the first category. The major weakness is that these methods often have a limited convergence radius, so a relatively good initialisation is necessary for them to converge successfully.

The method proposed in this article generally belongs to the third category, that is, no feature correspondence between the input 2D image and the object's 3D model is required a priori and the object's pose is estimated iteratively. The method is motivated by the work presented in [23] by Iwashita, which uses only the object's contour to estimate the pose parameters. Iwashita [23] demonstrated the feasibility of estimating an object's pose using only the image contour, but the pose estimation algorithm proposed in [23] still needs improvement. To bypass the problem of determining the feature correspondence between the input 2D image and the object's 3D model, Iwashita et al. [23] resorted to defining a holistic ensemble cost function on all contour pixels: complicated forces and torsion moments are introduced to drive the object's model towards alignment with the input image. The ensemble cost function is highly non-linear, which casts a heavy burden on the subsequent optimisation process and makes the algorithm prone to being trapped in a local minimum. Besides non-linearity, the ensemble cost function of [23] did not differentiate between the pose parameters but treated them equally. However, the six pose parameters (three rotation angle components and three translation components) are heterogeneous and affect the object's imaging process in different ways; to achieve a good pose estimation result, the algorithm should embody these differences [1, 25]. The third problem is that [23] utilised the gradient descent method to optimise its cost function, but owing to the high non-linearity of the cost function, explicit partial differential equations cannot be derived, and numerical approximation had to be used, which is computationally expensive and error prone.

In this article, we propose a new iterative pose estimation scheme for a 3D rigid object based on the object's contour. As in [23], we also use the object's contour as the algorithm's starting point, but instead of defining an ensemble cost function and then estimating the pose parameters by optimising it, we exploit the information provided by the object's contour in a different way. After extracting the object's outer contour from the input image, we extract the contour pixel feature and establish a tentative 2D-3D point correspondence relationship between the input 2D grey-level image and the object's 3D model. This tentative point correspondence relationship is then used to estimate the object's pose parameters with a well-developed point-based pose estimation algorithm [1], and the newly estimated pose parameters are fed back to revise the previously established tentative point correspondence. This process is iterated until the correct point correspondence relationship is established and the pose parameters of the 3D object are successfully retrieved. The main feature of our method is that both the pose estimation problem and the feature correspondence problem are solved simultaneously and iteratively. None of the three problems of [23] mentioned above exists in our method, making it computationally much more efficient, faster and more stable. Experiments show that our method has faster convergence speed and a wider convergence radius.

In a very recently published conference paper [26], Cui and Hildenbrand proposed an iterative method with an idea similar to ours, that is, iteratively establishing the point correspondence relationship between the image contour and the object's 3D model and estimating the pose parameters based on it. The major difference between our work and [26] lies in the way the contour point correspondence is established. The approach of [26] is 3D-3D: directly retrieve the vertex of the object's 3D model nearest to the 3D sight line of a contour pixel. For a complex object whose 3D model contains thousands of vertices, the computational burden of this 3D-3D approach is rather heavy. To decrease the computational complexity, [26] proposed to simplify the object's 3D model with the 3D Fourier transform, but this in turn lowers the accuracy of the established point correspondence relationship. Instead, we adopt a 2D-2D-3D approach: first establish the 2D-2D point correspondence between the input image contour and the projection image contour (see Section 2.2), then back-project the 2D-2D correspondence onto the surface of the object's 3D model to obtain the final 2D-3D point correspondence relationship. The advantage of this indirect approach is that both the 2D-2D point correspondence establishing step and the back-projecting step can be accelerated by fast algorithms, which are described in this paper. Besides, the point correspondence established by Cui and Hildenbrand [26] will never be exactly accurate, since only the vertex information of the object's 3D model is used there, causing poor convergence performance for an unevenly triangulated object model. This problem is also solved by our 2D-2D-3D correspondence establishing method.

The remainder of this article is organised as follows: in Section 2 the details of our method are described and analysed theoretically; in Section 3 the performance of our method is studied experimentally, and as will be shown, the convergence speed and convergence radius results of our method are very promising; Section 4 concludes the article.

2 Iterative pose estimation based on the object's contour

2.1 Method description

In this section, we address the details of our method and analyse it theoretically. Before entering into algorithm specifics, we first describe the outline of the processing flow of our method, as illustrated in Fig. 1. The whole processing flow divides into two major stages: the preprocessing stage and the iterative stage. The preprocessing stage (Figs. 1a-c) receives a monocular 2D grey-level image as input; then the outer contour of the object and the contour pixel features are extracted. The iterative stage (Figs. 1d-f) first establishes a tentative 2D-3D point correspondence relationship between the extracted contour and the object's 3D model with the pose parameters estimated at the last iteration. New values of the object's pose parameters are then re-estimated based on the established tentative 2D-3D point correspondence. This process is iterated until the correct 2D-3D point correspondence is obtained and the object's 3D pose parameters are successfully retrieved.

There are two presumptions made by our iterative pose estimation method which readers should keep in mind:

1. The object's 3D pose and its outer contour (on the image plane) are in one-to-one correspondence. This means that, for a specific contour, there exists only one determinate corresponding pose of the object;
2. When the object's pose varies continuously, its corresponding contour on the image plane also changes continuously.

Presumption (1) justifies the object's outer contour as a viable feature for pose estimation, and presumption (2) ensures that the iterative mechanism possesses a reasonable convergence radius. Note that presumption (1) is an ideal condition: for artificial 3D objects that are self-symmetric in certain ways, presumption (1) cannot be satisfied strictly. For example, the contours in the top view and the bottom view of an aircraft are identical, and thus a contour-based pose estimation algorithm would not be able to differentiate between these two cases. Fortunately, such cases are mathematically rare, with an integral measure of zero in the possible 3D pose space, so we can still apply the contour-based pose estimation algorithm to these objects; and for the odd cases in which the object's contours are identical, since the possible solutions are known a priori, they would not be a big problem.

In this article, the perspective camera model is used to describe the camera's imaging process. The camera model and the corresponding coordinate frame configuration are shown in Fig. 2, in which the subscripts u, v and p indicate the camera coordinate frame, the object self-centred coordinate frame and the image plane coordinate frame, respectively. The object self-centred coordinate frame and the camera coordinate frame are related by the rigid transformation

$$x_u = R x_v + t \quad (1)$$

or in matrix form as

$$X_u = R X_v + t h^{\mathrm{T}} \quad (2)$$

in which $h$ is an all-ones vector, $R$ is the rotation matrix and $t$ is the translation vector. The image plane coordinate frame and the object self-centred coordinate frame are related by the imaging equation

$$\begin{pmatrix} x_p \\ 1 \end{pmatrix} \simeq K (R \mid t) \begin{pmatrix} x_v \\ 1 \end{pmatrix} \quad (3)$$

The symbol '$\simeq$' means equal in the homogeneous sense, and $K$ is the inner camera parameter matrix.

In the remainder of this article, a calibrated camera is assumed, that is, the inner camera parameter matrix $K$ is known. This provides us the convenience of leaving $K$ unconsidered, making the subsequent derivations more compact.
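To make the imaging model concrete, here is a minimal numpy sketch of (1)-(3); the intrinsic matrix, pose and sample points below are illustrative placeholders, not values from the paper.

```python
import numpy as np

def project_points(X_v, R, t, K):
    """Project 3D points from the object self-centred frame to image pixels.

    X_v : (3, N) points in the object frame
    R, t: rotation (3x3) and translation (3,) of the rigid transform (1)
    K   : (3, 3) inner camera parameter matrix
    Returns (2, N) pixel coordinates, per the imaging equation (3).
    """
    X_u = R @ X_v + t[:, None]   # (2): camera-frame coordinates
    x_h = K @ X_u                # homogeneous image coordinates
    return x_h[:2] / x_h[2]      # perspective division

# Illustrative values only: an assumed camera and a small rigid transform.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])
X_v = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])  # two sample points
print(project_points(X_v, R, t, K))
```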

2.2 Contour extraction and tentative point correspondence establishing

Our method is based on the outer contour of the object. A clean, continuous, single-pixel-wide contour is necessary for the method to produce good performance. Various methods are available to extract the object's contour from the input image. In our current implementation, we use a modified active contour model proposed by the authors, which produces a closed and noise-robust contour and is still very fast. The contour of the projection image, which is generated by projecting the object's 3D model onto the image plane, can be extracted by simple binary operations, which can be implemented efficiently on a modern central processing unit-graphics processing unit (CPU-GPU) computing architecture. For ease of statement, in the remainder of this article we will refer to the contour extracted from the input 2D grey-level image as the 'extracted contour', and to the contour obtained from the projection image as the 'projected contour'.
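The 'simple binary operations' for the projected contour are not spelled out in the paper; one common realisation, given here as an assumed sketch, is to erode the binary projection mask and keep the pixels that disappear, which yields a single-pixel-wide boundary:

```python
import numpy as np
from scipy.ndimage import binary_erosion

def binary_contour(mask):
    """Return the one-pixel-wide boundary of a binary silhouette mask.

    Pixels that belong to the mask but not to its erosion form the contour.
    """
    eroded = binary_erosion(mask, structure=np.ones((3, 3)))
    return mask & ~eroded

# Toy example: a filled square whose 1-pixel border is recovered.
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True
print(binary_contour(mask).astype(int))
```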

After extracting the object's contour from the input image, the next step is to extract the contour pixel feature and establish a tentative point correspondence relationship between the extracted contour and the object's 3D model. How to directly determine the point correspondence between a 2D image and a 3D model is still an open problem in computer vision. Here we solve this problem indirectly: firstly, project the object's 3D model onto the image plane to obtain the projected contour; secondly, determine the 2D-2D point correspondence between the extracted contour and the projected contour; and thirdly, back-project the points on the projected contour which have a corresponding point on the extracted contour onto the surface of the object's 3D model, carrying the established 2D-2D point correspondence relationship forward to the final 2D-3D point correspondence.

Fig. 1 Algorithm flowchart of the proposed method. a Input 2D grey-level image; b Contour extracted from the input image; c Contour pixel feature extraction; d Tentative point correspondence established between the extracted contour and the object's 3D model; e Model wireframe overlaid on the 2D grey-level image with the estimated pose; f Final pose estimation result. Panels a-c belong to the preprocessing stage, and d-f belong to the iterative stage.

Fig. 2 Camera model and coordinate frames' configuration

The contour pixel feature extracted in our current implementation is simply the pixel's position in the image. This is the simplest feature possible for contour pixels. As will be shown in the experiment section, even this simple feature can provide good convergence performance. More complex features may further improve our method's convergence performance, but this is not the focus of the current article; here we concentrate on how far the proposed scheme can go with this simplest feature.

2.2.1 Establishing 2D-2D point correspondence: To determine the 2D-2D point correspondence between the contour extracted from the input image and the contour projected from the projection image, we resort to a very intuitive observation: when two poses of an object are close to each other, the correct corresponding points should also be geometrically near between the two corresponding contours. Thus, we can first establish the point correspondence tentatively by searching for geometrically nearest point pairs, and then revise it iteratively with the updated pose parameter estimates. The geometrically nearest point pair is not a difficult mathematical concept; however, finding all nearest point pairs between two contours, which may contain hundreds of points, could be rather computationally expensive if not implemented properly. Here we propose to use a distance map to establish the 2D-2D point correspondence between the two contours efficiently. Note that a distance map is also used in [23], but in a very different way: the distance map lies at the kernel of the method of [23], where it is used to define the ensemble cost function; in our method, the distance map is used to accelerate the 2D-2D point correspondence establishment.

A distance map describes the shortest distance from each image pixel to the given contour. More mathematically, given the extracted contour $C$, for an image pixel $x$ its distance map value is given as

$$D(x) = \min_{y \in C} \| x - y \| \quad (4)$$

This definition is very similar to the signed distance function commonly used in level set theory, except that no sign determination is necessary here. Hence the fast marching method [27], which is often used to compute the signed distance function, can be used to build the distance map [23]. However, the fast marching method is designed for general distance measures, and its $O(N \log N)$ computational complexity ($N$ is the total number of image pixels) is still rather high even for a medium-sized image. In our implementation, since the distance map is built only with the Euclidean distance measure, we adopt the distance transform method proposed in [28], whose computational complexity is only $O(N)$ and which is much faster than the fast marching method for large $N$. It is worth emphasising that the distance map needs to be computed only once during the whole pose estimation process.
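As an illustration, the exact Euclidean distance transform in scipy can play the role of the $O(N)$ transform of [28] (it computes the same exact distance map, though not necessarily with the same algorithm); the input array is zero exactly on the contour pixels:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def build_distance_map(contour_mask):
    """Distance map (4): shortest Euclidean distance of every pixel to the contour.

    distance_transform_edt measures the distance to the nearest zero element,
    so the contour pixels are passed in as zeros (the inverted mask).
    """
    return distance_transform_edt(~contour_mask)

contour = np.zeros((5, 5), dtype=bool)
contour[2, :] = True              # a horizontal line as the contour C
D = build_distance_map(contour)
print(D)                          # rows at distance 2, 1, 0, 1, 2 from the line
```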

By definition (4), the distance map value of a given image pixel is the shortest distance from that position to the given contour; thus, similar to the signed distance function of level sets, pixels that have the same distance map value form a closed and continuous isocontour, see Fig. 3. To retrieve the nearest contour pixel for a given position, we can simply trace down the negative gradient direction from this position, and the first contour pixel met along this path will be the required nearest contour pixel. Fig. 3 illustrates how the nearest contour pixel can be retrieved efficiently by taking advantage of the built distance map. The greatest advantage of utilising the distance map for nearest point retrieval is that the map needs to be built only once, and once it is built, the computational complexity of retrieving the geometrically nearest point is remarkably reduced to only O(1) per query.

Fig. 3 Retrieving the nearest contour pixel for a given position. The black dot is the start position; with the help of the distance map, the required nearest contour pixel (shown as a grey dot) can be found very efficiently.
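In an off-the-shelf implementation, this retrieval can even be precomputed wholesale: with return_indices=True, scipy's distance transform returns, for every pixel, the coordinates of its nearest contour pixel, which is exactly the end point of the gradient tracing described above. A minimal sketch:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

contour = np.zeros((5, 5), dtype=bool)
contour[2, :] = True

# indices[:, i, j] holds the (row, col) of the contour pixel nearest to (i, j),
# i.e. the result of tracing down the distance map's negative gradient.
D, indices = distance_transform_edt(~contour, return_indices=True)

query = (0, 3)                                  # an arbitrary start position
nearest = (indices[0][query], indices[1][query])
print(nearest)                                  # -> (2, 3)
```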

2.2.2 Establishing 2D-3D point correspondence: After the 2D-2D point correspondence relationship between the extracted contour and the projected contour is established, the next step is to carry it forward to the final 2D-3D point correspondence required by the subsequent point-based pose estimation sub-procedure. What we need to do is back-project the points of the projected contour which have a corresponding point on the extracted contour onto the surface of the object's 3D model, to obtain the corresponding 3D points' coordinates. To accomplish this task, we propose the following two-step method.

The first step is to retrieve from the 3D model the triangular patches which correspond to the projected contour. To fully utilise the great power of modern GPUs, we dye each triangular patch of the 3D model with a different colour; then, using this colour attribute as a hash index, we can efficiently retrieve the required triangular patches from the object's 3D model, which usually consists of thousands of triangular patches.

Since a retrieved triangular patch could be relatively large with respect to the scale of the whole 3D model, it would be too coarse to be used directly in the subsequent pose estimation sub-procedure. The second step is therefore to obtain the precise 3D points' coordinates from the retrieved triangular patches. Assume that there is no rotation and translation transform between the camera coordinate frame and the object self-centred coordinate frame, and let the three vertices of a triangular patch be represented as $X_v = \{x_{v1}, x_{v2}, x_{v3}\}$; then the plane defined by this triangular patch will be

$$P = \begin{pmatrix} (x_{v1} - x_{v3}) \times (x_{v2} - x_{v3}) \\ -x_{v3} \cdot (x_{v1} \times x_{v2}) \end{pmatrix} \quad (5)$$

Let $x_g$ represent the sight-line direction of the contour pixel; then the corresponding 3D point's coordinates are given as

$$x_{vg} = \frac{-P(4)}{x_g \cdot P(1{:}3)} \, x_g \quad (6)$$
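As an illustration of (5) and (6), the following numpy sketch intersects a sight line with a triangle's supporting plane; here $x_g$ is assumed to be the sight-line direction of a contour pixel (e.g. $K^{-1}(u, v, 1)^{\mathrm{T}}$ for a calibrated camera), and the toy patch and ray are made-up values.

```python
import numpy as np

def triangle_plane(xv1, xv2, xv3):
    """Plane 4-vector P of (5): P[:3] is the normal, P[3] the offset."""
    normal = np.cross(xv1 - xv3, xv2 - xv3)
    offset = -xv3 @ np.cross(xv1, xv2)   # equals -normal . xv3
    return np.append(normal, offset)

def back_project(x_g, P):
    """3D point of (6): intersection of the sight line lambda*x_g with plane P."""
    return (-P[3] / (x_g @ P[:3])) * x_g

# Toy patch lying in the plane z = 4, and an assumed sight-line direction x_g.
xv1 = np.array([1.0, 0.0, 4.0])
xv2 = np.array([0.0, 1.0, 4.0])
xv3 = np.array([0.0, 0.0, 4.0])
P = triangle_plane(xv1, xv2, xv3)
x_g = np.array([0.1, 0.05, 1.0])
print(back_project(x_g, P))          # -> [0.4, 0.2, 4.0], a point with z = 4
```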

2.3 Iterative pose estimation and convergence specifics

After the tentative 2D-3D point correspondence relationship is established between the input image and the object's 3D model, the next step is to execute the point-based pose estimation sub-procedure. In our current implementation, we adopt the orthogonal iteration (OI) algorithm proposed in [1], which is fast and numerically precise. With the tentative 2D-3D point correspondence and the OI algorithm, new values can be estimated for the 3D object's pose parameters. These updated pose parameters are then fed back to generate a new projection image, establish a new tentative point correspondence and then estimate new values for the object's 3D pose parameters. This process is iterated until the correct point correspondence relationship is established and the correct pose estimation result is retrieved; if the process does not converge within a preassigned iteration number, we simply abort it and return a failure.

To measure the fitness of the pose estimation result returned by the iterative method, we perform the 'XOR' operation on the binary image extracted from the input grey-level image and the binary projection image. If the pose estimation result is close to the object's actual pose, then the area of the resulting regions after the XOR operation will be small. So we define a ratio

$$A_{\mathrm{ratio}} = \frac{\mathrm{area}(BI_{ob} \oplus BI_{pr})}{\mathrm{area}(BI_{pr})} \quad (7)$$

in which $BI_{ob}$ represents the binary image extracted from the input grey-level image and $BI_{pr}$ represents the binary projection image obtained by projecting the object's 3D model onto the image plane. For a good pose estimation result, the value of $A_{\mathrm{ratio}}$ will be small. This measure is used for convergence determination in the experiment section.
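Measure (7) reduces to a few array operations; a minimal numpy sketch with $BI_{ob}$ and $BI_{pr}$ as boolean silhouette masks (the toy masks below are illustrative):

```python
import numpy as np

def a_ratio(bi_ob, bi_pr):
    """Fitness measure (7): XOR area over projection area (smaller is better)."""
    return np.count_nonzero(bi_ob ^ bi_pr) / np.count_nonzero(bi_pr)

# Toy masks: a 4x4 square against a copy shifted right by one pixel.
bi_ob = np.zeros((8, 8), dtype=bool); bi_ob[2:6, 2:6] = True
bi_pr = np.zeros((8, 8), dtype=bool); bi_pr[2:6, 3:7] = True
print(a_ratio(bi_ob, bi_pr))         # 8 mismatched pixels / 16 -> 0.5
```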

The last problem left undiscussed is how to obtain a good initialisation for the iterative method so as to ensure satisfactory pose estimation results. Since it is not the focus of this article, we give only a brief discussion of this problem. Many template-based methods can be used to fulfil the initialisation task [29-31]. Our approach is similar to the work of [20]: a small gallery consisting of normalised profile images is built beforehand, with a rather coarse sampling of the possible poses. Normalisation is necessary to remove the effect of the translation parameters. After the input grey-level image is segmented, the resulting image is normalised and compared within the profile image gallery, and the pose of the best matching profile image is chosen to initialise our iterative pose estimation method.

Table 1 summarises the algorithm flow of the proposed contour-based iterative pose estimation method for a 3D rigid object.
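To make the overall flow concrete, here is a compact Python sketch of the stages of Table 1. It is a structural outline only: render_silhouette, extract_contour, match_nearest, back_project_to_model and oi_pose are hypothetical stand-ins for the sub-procedures described in Sections 2.2 and 2.3, not functions from the paper.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def estimate_pose(bi_ob, contour_ob, model, R, t, K, max_iter=100, tol=4e-2):
    """Structural sketch of Table 1; all helper functions are hypothetical.

    bi_ob, contour_ob: binary silhouette and outer contour from the input image
    R, t: initial pose guess; K: calibrated camera matrix; tol: threshold on (7)
    """
    # Preprocessing step 3: the distance map is built only once (Section 2.2.1).
    _, nearest_idx = distance_transform_edt(~contour_ob, return_indices=True)

    for _ in range(max_iter):
        bi_pr = render_silhouette(model, R, t, K)          # projection image
        contour_pr = extract_contour(bi_pr)                # projected contour
        pairs_2d = match_nearest(contour_pr, nearest_idx)  # 2D-2D (Section 2.2.1)
        pts_3d, pts_2d = back_project_to_model(pairs_2d, model, R, t, K)  # 2D-3D
        R, t = oi_pose(pts_3d, pts_2d, K)                  # OI algorithm [1]
        # Convergence check with the XOR fitness measure (7).
        if np.count_nonzero(bi_ob ^ bi_pr) / np.count_nonzero(bi_pr) < tol:
            return R, t
    raise RuntimeError('pose estimation did not converge within max_iter')
```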

3 Experiment results

In this section, we test the performance of our iterative pose estimation method in terms of convergence speed, convergence radius and robustness to noise, with variously shaped 3D rigid objects. For comparison, the pose estimation results returned by Iwashita's method [23] and Cui's method [26] are also presented. All the code involved is implemented in Matlab scripts and run on a PC with a 1.8 GHz CPU and 1 GB RAM.

3.1 Convergence speed performance

In this subsection, the convergence speed performance of our new method is tested with a small model gallery which consists of an aircraft, a racecar, a house, a desk lamp and a grand piano. These models are of various shapes, size scales and detail complexities, providing good sample diversity for the following experiments.

We run the pose estimation methods five times for each model, with a different pose initialisation each time. For the pose initialisation required by the iterative methods, we pollute the true pose values with Gaussian noise, which produces a total deviation of about 30° for the 3D rotation angles. For Iwashita's method [23], to guarantee the best convergence radius, at each iteration the Armijo rule [32] is used to search for an optimal time step for each of the six pose parameters. The time cost of this optimal step search varies dramatically with different parameter tunings, so for fairness the absolute time is not counted in the convergence speed comparison for Iwashita's method. Likewise for Cui's method [26], to guarantee the best convergence radius, the model simplification procedure proposed in [26] is not utilised here, and for fairness the absolute convergence time is also not counted for Cui's method.

Table 1 Contour-based iterative pose estimation of a 3D rigid object

Preprocessing stage:
1. receive a monocular 2D grey-level image as input
2. extract the object's outer contour and contour pixel features
3. build the distance map based on the contour pixel features extracted in step 2

Iterative stage:
1. generate the projection image using the last estimated pose parameters and extract the object's outer contour from the projection image
2. establish the tentative 2D-2D point correspondence relationship between the extracted contour and the projected contour
3. back-project the projected contour onto the 3D model's surface to establish the tentative 2D-3D point correspondence relationship between the extracted contour and the object's 3D model
4. solve the point-based pose estimation problem using the OI algorithm [1]
5. check the convergence condition (7); if not converged, go to step 1


To measure the fitness of the results returned by the methods, we adopt the $A_{\mathrm{ratio}}$ measure defined in Section 2. If $A_{\mathrm{ratio}}$ is smaller than a given threshold, the pose estimation task is claimed to be successfully accomplished; otherwise it is claimed as a failure.

Fig. 4 illustrates the pose estimation results returned by our method for the house, desk lamp and piano models. The left-column images show the initial pose configurations and the right-column images present the final pose estimation results. For ease of inspection, the effect of translation is removed in all the images. In all the 5 × 5 test runs, our method registered the object's 3D model (represented by a wire mesh in the image) precisely with the input 2D grey-level image. Even though some parts of the initially posed model are far from their counterparts in the 2D grey-level image, a good estimation result can still be achieved. This also reveals that our method has good error-tolerance performance.

Table 2 summarises the convergence speed statistics of the three methods for the test gallery. The threshold of $A_{\mathrm{ratio}}$ is set to a rule-of-thumb value of 4e-2. Our method usually takes tens of iterations to converge successfully, and the time cost per iteration is less than 1/3 s in most cases. Iwashita's method usually takes hundreds of iterations to converge, which is much slower than ours; Cui's method takes fewer iterations than Iwashita's method, but is still slower than ours.

Beyond being slower, Iwashita's method failed in several cases for the aircraft, house and desk lamp models. It appears to have been trapped in local optimal solutions because of its complicated, highly non-linear cost function and error-prone gradient-descent-based optimisation. Cui's method also failed in several cases for the house, desk lamp and piano models, because it failed to establish the correct 2D-3D point correspondence for the unevenly triangulated test models. In contrast, the performance of our method is more consistent and stable. These statistics also indicate that our method has a wider convergence radius, which will be examined further in the following subsection.

In Fig. 5, six frames from one pose estimation run are presented to give a clearer view of the convergence process of our method. These six frames are extracted from a total of 72 iterations (to make the iteration last longer, we intentionally initialise the method poorly and set a much smaller threshold value for $A_{\mathrm{ratio}}$ than before). Fig. 6 presents two curves: (i) the evolution curve of $A_{\mathrm{ratio}}$ and (ii) the evolution curve of the residual error of the point-based pose estimation sub-procedure. As can be seen, the two curves are consistent with each other, and also demonstrate how well the object's 3D model is registered to the input 2D grey-level image along the iteration.

3.2 Convergence radius performance

For iterative methods, the convergence radius is always expected to be as wide as possible: the wider the convergence radius, the easier it is to find a plausible initial solution. Two main factors influence the convergence radius of our method: the model's shape and the camera's view angle. For different model shapes and camera view angles, the actual convergence radius is usually different. In this subsection, the convergence radius performance of our method against these two factors is examined. For comparison, results obtained with Iwashita's method and Cui's method are also presented.

Fig. 4 Illustration of pose estimation results returned by our method for the house, desk lamp and piano models. The image in the left column shows the initial pose configuration, and the image in the right column shows the final pose estimation result. The projection of the object's 3D model is represented by a wire mesh. The effect of translation is removed for ease of inspection. a, c and e Initial pose configurations; b, d and f Final pose estimation results.

Table 2 Convergence speed statistics for the methods of [23, 26] and our method

             Method of [23]                          Method of [26]                          Our method
Model        avg. iter.  time/iter., s(a)  failed(b)  avg. iter.  time/iter., s(a)  failed(b)  avg. iter.  time/iter., s  failed(b)
aircraft     192.6       -                 1          53.6        -                 0          26.4        0.0786         0
racecar      65.8        -                 0          16.0        -                 0          8.4         0.2246         0
house        124.4       -                 2          38.6        -                 1          12.6        0.2165         0
desk lamp    221.2       -                 2          51.6        -                 1          22.4        0.1710         0
piano        126.8       -                 0          24.2        -                 1          12.8        0.2291         0

(a) For fairness, the time cost per iteration is not counted for the methods of [23, 26].
(b) The threshold of A_ratio is set to 4e-2.


The biggest problem in executing this subsection's experiments is that the number of possible poses for a 3D object is extremely large; for example, with a 1° sample interval for each of the three pose angles, the total number of pose samples would be 360 × 360 × 180 ≈ 2.3 × 10^7. Owing to computing power limitations, we compromised on the number of test models and the sample density of possible poses as follows. For the test models we chose two very different objects, an aircraft and a dolphin toy, which are shown in Figs. 7a and b. For the pose sample density, we set the sample intervals of the first two pose angles (yaw angle α and pitch angle β) both to 20°; the third pose angle (roll angle) is omitted, since its effect is an in-plane rotation and it can be recovered simply by a 2D rotation. The initial pose solution required by the iterative methods is obtained by offsetting the true pose within the range

$$\Delta\alpha \times \Delta\beta = [-45°, 45°] \times [-45°, 45°] \quad (8)$$

with a 5° interval for each rotation angle. Thus, for one test model, the number of sampled poses (i.e. different camera view angles) is 18 × 9, and for each sampled pose there are 19 × 19 test runs to explore its neighbourhood. These result in a total of 1.17 × 10^5 runs for each method.

To display the experiment results, we calculate the number of successful runs for each sampled pose and map it onto a unit sphere, see Fig. 7. Each point on the sphere indicates the width of the convergence radius for the corresponding view angle: the brighter the colour, the wider the convergence radius. Owing to the left-right symmetry of the test models, the front view and the back view of the result sphere are approximately identical, so only the front view of the sphere is shown.

To quantify the convergence radius results, we introduce the following equation to convert the number of successful runs into an equivalent radius value

$$\mathrm{Radius} = \sqrt{\frac{N_{\mathrm{success}}}{N_{\mathrm{total}}}} \cdot R_{\mathrm{total}} \quad (9)$$

in which $N_{\mathrm{success}}$ is the number of successful runs for a given sampled pose, $N_{\mathrm{total}}$ is the total number of test runs for the sampled pose and $R_{\mathrm{total}}$ is the radius of the offset range. This equation is derived from the fact that the area of a circle is proportional to the square of its radius. In our case, we have

$$N_{\mathrm{total}} = 361, \quad R_{\mathrm{total}} = 45° \quad (10)$$
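As a quick sanity check of (9) under (10): a sampled pose for which 90 of the 361 perturbed initialisations converge is credited with an equivalent radius of

$$\mathrm{Radius} = \sqrt{90/361} \times 45° \approx 0.499 \times 45° \approx 22.5°$$

and a pose for which all 361 runs converge attains the full 45°.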

Table 3 summarises the convergence radius statistics of the three methods. Several conclusions can be drawn from Fig. 7 and Table 3:

1. For different model shapes and camera view angles, the convergence radius performance can vary dramatically for all three methods;
2. All three methods exhibit similar convergence radius variation trends;
3. Our method has a much wider convergence radius than Iwashita's method and Cui's method, both for the complex aircraft model and for the relatively simple dolphin toy model.

3.3 Noise robustness

The experiments in the previous two subsections postulate ideal contour extraction results. In actual applications, the contour extracted from the input grey-level image is seldom ideal, because of noise, complex image backgrounds, lighting variations etc. This imperfection will surely bring down the pose estimation method's performance. In this subsection, we study the noise robustness of our method under various signal-to-noise ratio (SNR) conditions.

The test models used in this subsection are the same as in Section 3.2, see Figs. 7a and b. To generate simulated test images, we set the sample intervals of yaw angle α and pitch angle β both to 10°. Compared with total randomisation, this setting ensures more accurate and comprehensive performance evaluation results. The extracted contour is then contaminated with different levels of Gaussian noise to simulate real-world defects, see Fig. 8. The initial pose required by the method is obtained by polluting the true value with Gaussian noise at 20 dB SNR. The iteration number is fixed to 100 for each run. To evaluate the noise robustness, we analyse the statistics of the final $A_{\mathrm{ratio}}$ returned by our method.

Fig. 5 Convergence illustration of our method. a-f are the 1st, 2nd, 4th, 10th, 20th and 72nd frames out of a total of 72 iterations.

Fig. 6 Convergence curves. a Evolution curve of $A_{\mathrm{ratio}}$; b Evolution curve of the residual error of the point-based pose estimation sub-procedure.

Fig. 7 Convergence radius experiment results mapped onto a unit sphere. a and b Test models: an aircraft and a dolphin toy; c and d Front views of the result spheres for Iwashita's method; e and f Front views of the result spheres for Cui's method; g and h Front views of the result spheres for our method.

Table 3 Convergence radius statistics for the methods of [23, 26] and our method over the offset range (8); radii in degrees

              Method of [23]                  Method of [26]                  Our method
Model         min     max    mean   std dev   min     max    mean   std dev   min     max    mean   std dev
aircraft      10.86   28     18.9   3.61      12.7    37.1   23.6   4.34      22.8    44.5   33.3   4.71
dolphin toy   13.5    31.6   23.7   3.71      17.2    36.2   28.6   4.09      27.1    45     41.1   3.87



The results are shown in Fig. 9. For both test models, the inflection point of the mean $A_{\mathrm{ratio}}$ curve occurs at 30 dB. When SNR > 30 dB, the mean values of $A_{\mathrm{ratio}}$ are both lower than 1e-1, an empirical value indicating that an acceptable pose estimation result has been obtained; when SNR < 30 dB, both curves rise rapidly. This indicates that for our method to work properly, an SNR above 30 dB is required. Referring to Fig. 8b, which illustrates what a noise-contaminated contour looks like at 30 dB SNR, we can conclude that the noise robustness of our method is satisfactory.

3.4 Experiments with real-world image data

In this subsection, the effectiveness of our new method is verified with experiments on real-world aircraft images. Aircraft are highly manoeuvrable and can be arbitrarily posed, and thus fit well the 'general 3D rigid object' concept that this article concentrates on. The experiment data and results are presented in Fig. 10. The 1st and 3rd columns of Fig. 10 are the input monocular images, and the 2nd and 4th columns are the corresponding pose estimation results returned by our method. Each result image is rendered with the estimated pose parameters and the object's 3D model. The aircraft in Figs. 10a and b is the F16 Fighting Falcon, and that in Figs. 10c and d is the F22 Raptor. From Fig. 10, we can see that our new method successfully accomplished the pose estimation tasks under various object poses, sizes and image qualities.

4 Conclusions

In this article, we focus on the pose estimation of a general 3D rigid object when no feature correspondence between the input monocular image and the object's 3D model is available a priori, and a new contour-based iterative method is proposed which is fast and has a wide convergence radius. Our new method solves the feature correspondence problem and the pose estimation problem simultaneously and iteratively; that is, not only the pose parameters of the 3D object but also the 2D-3D point correspondence between the input grey-level image and the object's 3D model are retrieved. The tentative point correspondence establishing scheme keeps our method free of highly non-linear ensemble cost functions, making it computationally more efficient and stable. Experimental results show that the performance of our method is promising in convergence speed, convergence radius and noise robustness.

In the present work, we adopt only 2D geometrical properties to build the tentative 2D-3D point correspondence between the input grey-level image and the object's 3D model. It would be sensible to incorporate other image properties to improve the accuracy of the tentative point correspondence establishing procedure, provided they do not incur too much extra computation cost. This will be future work.

Fig. 8 Noise-contaminated contour extraction results. a SNR = 60 dB; b SNR = 30 dB; c SNR = 10 dB.

Fig. 9 Results of the noise robustness experiments.

Fig. 10 Pose estimation results for real-world image data by our method. The 1st and 3rd columns are the input monocular images, and the 2nd and 4th columns are the corresponding pose estimation results. Each result image is rendered with the estimated pose parameters and the object's 3D model. a and b Pose estimation results for the F16 Fighting Falcon; c and d Pose estimation results for the F22 Raptor.

5 Acknowledgment

We would like to acknowledge Prof. Iwashita for her help in

understanding their algorithm.

6 References

1 Lu, C.P., Hager, G.D., Mjolsness, E.: 'Fast and globally convergent pose estimation from video images', IEEE Trans. Pattern Anal. Mach. Intell., 2000, 22, (6), pp. 610-622
2 Burschka, D., Mair, E.: 'Direct pose estimation with a monocular camera', Robot Vis., 2008, (LNCS, 4931), pp. 440-453
3 Haralick, R.M., Lee, C.N., Ottenberg, K., Nölle, M.: 'Review and analysis of solutions of the three point perspective pose estimation problem', IJCV, 1994, 13, (3), pp. 331-356
4 Moreno-Noguer, F., Lepetit, V., Fua, P.: 'Accurate non-iterative O(n) solution to the PnP problem'. IEEE ICCV'07, Rio de Janeiro, pp. 2168-2175
5 Leng, D.W., Sun, W.D.: 'Finding all the solutions of PnP problem'. IEEE IST'09, Shenzhen, pp. 348-352
6 Ansar, A., Daniilidis, K.: 'Linear pose estimation from points or lines', IEEE Trans. Pattern Anal. Mach. Intell., 2003, 25, (5), pp. 578-589
7 David, P., DeMenthon, D., Duraiswami, R., Samet, H.: 'Simultaneous pose and correspondence determination using line features'. CVPR'03, 2003, vol. 2, pp. 424-431
8 Christy, S., Horaud, R.: 'Iterative pose computation from line correspondences', CVIU, 1999, 73, (1), pp. 137-144
9 Hanning, T., Schoene, R., Graf, S.: 'A closed form solution for monocular re-projective 3D pose estimation of regular planar patterns'. ICIP, 2006, 1-7, pp. 2197-2200
10 Jacobs, D., Basri, R.: '3D to 2D pose determination with regions', IJCV, 1999, 34, (2-3), pp. 123-145
11 Tahri, O., Chaumette, F.: 'Complex objects pose estimation based on image moment invariants'. Proc. IEEE Int. Conf. on Robotics and Automation, Barcelona, Spain, April 2005, pp. 436-441
12 Donoser, M., Bischof, H.: 'Efficient maximally stable extremal region (MSER) tracking'. CVPR'06, 2006, vol. 1, pp. 553-560
13 Kyriakoulis, N., Gasteratos, A.: 'Color-based monocular visuoinertial 3-D pose estimation of a Volant robot', IEEE Trans. Instrum. Meas., 2010, 59, (10), pp. 2706-2715
14 Lowe, D.G.: 'Distinctive image features from scale-invariant keypoints', IJCV, 2004, 60, (2), pp. 91-110
15 Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: 'SURF: speeded up robust features', CVIU, 2008, 110, (3), pp. 346-359
16 Smith, S.M., Brady, J.M.: 'SUSAN - a new approach to low level image processing', IJCV, 1997, 23, (1), pp. 45-78
17 Viksten, F., Forssén, P.E., Johansson, B., Moe, A.: 'Comparison of local image descriptors for full 6 degree-of-freedom pose estimation'. IEEE Int. Conf. on Robotics and Automation, Kobe, Japan, 2009, pp. 1139-1146
18 Shan, G.L., Ji, B., Zhou, Y.F.: 'A review of 3D pose estimation from a monocular image sequence'. CISP'09, Tianjin, 2009, pp. 1-5
19 Lee, T.K., Drew, M.S.: '3D object recognition by eigen-scale-space of contours'. SSVM'07, 2007, vol. 4485, pp. 883-894
20 Dunker, J., Hartmann, G., Stöhr, M.: 'Single view recognition and pose estimation of 3D objects using sets of prototypical views and spatially tolerant contour representations'. ICPR'96, 1996, vol. 4, pp. 14-18
21 Dambreville, S., Sandhu, R., Yezzi, A., Tannenbaum, A.: 'A geometric approach to joint 2D region-based segmentation and 3D pose estimation using a 3D shape prior', SIAM J. Imaging Sci., 2010, 3, (1), pp. 110-132
22 Poggio, T., Edelman, S.: 'A network that learns to recognize three-dimensional objects', Nature, 1990, 343, pp. 263-266
23 Iwashita, Y., Kurazume, R., Konishi, K., Nakamoto, M., Hashizume, M., Hasegawa, T.: 'Fast alignment of 3D geometrical models and 2D grayscale images using 2D distance maps', Syst. Comput. Jpn., 2007, 38, (14), pp. 1889-1899
24 Chetverikov, D., Stepanov, D., Krsek, P.: 'Robust Euclidean alignment of 3D point sets: the trimmed iterative closest point algorithm', Image Vis. Comput., 2005, 23, (3), pp. 299-309
25 DeMenthon, D.F., Davis, L.S.: 'Model-based object pose in 25 lines of code', IJCV, 1995, 15, (1-2), pp. 123-141
26 Cui, Y., Hildenbrand, D.: 'Pose estimation based on geometric algebra'. GraVisMa, 2009, pp. 17-24
27 Sethian, J.A.: 'A fast marching level set method for monotonically advancing fronts', Proc. Natl. Acad. Sci. USA, 1996, 93, pp. 1591-1595
28 Felzenszwalb, P.F., Huttenlocher, D.P.: 'Distance transforms of sampled functions', Cornell Computing and Information Science TR2004-1963, available at: http://ecommons.library.cornell.edu/handle/1813/5663
29 Horaud, R.: 'New methods for matching 3D objects with single perspective views', IEEE Trans. Pattern Anal. Mach. Intell., 1987, 9, (3), pp. 401-412
30 Dhome, M., Richetin, M., Lapresté, J.T., Rives, G.: 'Determination of the attitude of 3D objects from a single perspective view', IEEE Trans. Pattern Anal. Mach. Intell., 1989, 11, (12), pp. 1265-1278
31 González, J.M., Sebastián, J.M., García, D., Sánchez, F., Angel, L.: 'Recognition of 3D object from one image based on projective and permutative invariants'. ICIAR'04, 2004, vol. 3211, pp. 705-712
32 Bertsekas, D.P.: 'Constrained optimization and Lagrange multiplier methods' (Academic Press, 1982)
