
Silhouette and Stereo Fusion for 3D Object Modeling

Carlos Hernández Esteban and Francis Schmitt

Signal and Image Processing Department, CNRS UMR 5141

École Nationale Supérieure des Télécommunications, France

Abstract

In this paper, we present a new approach to high quality 3D object reconstruction. Starting from a calibrated sequence of color images, the algorithm is able to reconstruct both the 3D geometry and the texture. The core of the method is based on a deformable model, which defines the framework where texture and silhouette information can be fused. This is achieved by defining two external forces based on the images: a texture driven force and a silhouette driven force. The texture force is computed in two steps: a multi-stereo correlation voting approach and a gradient vector flow diffusion. Due to the high resolution of the voting approach, a multi-grid version of the gradient vector flow has been developed. Concerning the silhouette force, a new formulation of the silhouette constraint is derived. It provides a robust way to integrate the silhouettes in the evolution algorithm. As a consequence, we are able to recover the contour generators of the model at the end of the iteration process. Finally, a texture map is computed from the original images for the reconstructed 3D model.

Key words: 3D reconstruction, deformable model, multigrid gradient vector flow, visual hull, texture.

1 Introduction

As computer graphics and technology become more powerful, attention is being focused on the creation or acquisition of high quality 3D models. As a result, a great effort is being made to exploit the biggest source of 3D models: the real world. Among all the possible 3D acquisition techniques, one is especially attractive: image-based modeling. In this kind of approach, the only input data to the algorithm is a set of images, possibly calibrated. Its main advantages are the low cost of the system and the possibility of directly capturing the object's color.

Email addresses: carlos.hernandez@enst.fr, francis.schmitt@enst.fr

(Carlos Hernández Esteban and Francis Schmitt).

Preprint submitted to Computer Vision and Image Understanding


The main disadvantage is the quality of the reconstructions compared to that of more active techniques (range scanning or encoded-light techniques). In this paper we present an image-based modeling approach which affords high quality reconstructions by mixing two complementary kinds of image data in a single framework: silhouette information and texture information. Our two main contributions are a new approach to the silhouette constraint definition and the high quality of the overall system.

2 Related Work

Acquiring 3D models is not an easy task and abundant literature exists on this subject. There are three major approaches to the problem of 3D real model representation: pure image-based rendering techniques, hybrid image-based techniques, and 3D scanning techniques. Pure image-based rendering techniques such as [1,2] try to generate synthetic views from a given set of original images. They do not estimate the real 3D structure behind the images; they only interpolate the given set of images to generate a synthetic view. Hybrid methods such as [3–6] make a rough estimation of the 3D geometry and mix it with a traditional image-based rendering algorithm in order to obtain more accurate results. In both types of methods, the goal is to generate coherent views of the real scene, rather than to obtain metric measures of it. In contrast to these techniques, the third class of algorithms tries to recover the full 3D structure.

Among the 3D scanning techniques, two main groups are to be distinguished: active methods and passive ones. Active methods use a controlled source of light such as a laser or coded light in order to recover the 3D information [7–9]. Passive methods use only the information contained in the images of the scene [10]. They can be classified according to the type of information they use. A first class consists of the shape from silhouette methods [11–16]. They obtain an initial estimation of the 3D model known as the visual hull. They are robust and fast, but because of the type of information used, they are limited to simple shaped objects. Commercial products based on this technique already exist. Another approach includes the shape from shading methods. They are based on the diffusing properties of Lambertian surfaces. They mainly work for 2.5D surfaces and are very dependent on the lighting conditions. A third class of methods uses the color information of the scene. The color information can be exploited in different ways, depending on the type of scene we try to reconstruct. A first way is to measure color consistency to carve a voxel volume [17,18]. However, these methods only provide an output model composed of a set of voxels, which makes it difficult to obtain a good 3D mesh representation. In order to solve this problem, the authors of [19] and [20] propose to use the color consistency measure to guide a deformable model. An additional problem of color consistency algorithms is that they compare absolute color values, which makes them sensitive to variations in lighting conditions. A different way of exploiting color is to compare local variations of the texture, as done in cross-correlation methods [21,22].

As a specialization of the color-based group, there are specific methods that use another type of information at the same time, such as silhouettes [17,23,24], radiance [25] or shading [26]. Although very good results are obtained, the quality is still limited, and the main problem is the way the fusion of different data is done. Some authors, such as [17,23], use a volume grid for the fusion. Others, like [26,25,24], use a deformable model framework. The algorithm we present in this paper can be classified in this latter group. We perform the fusion of both silhouette and texture information through a deformable model evolution. The main difference with the methods mentioned above is the way the fusion is accomplished, which enables us to obtain very high quality reconstructions. A similar approach to our work has recently been proposed in [27]. A deformable model is also used to fuse texture and silhouette information. However, the objectives of their work are not the same as ours. They are interested in dynamic 3D shape reconstruction of moving persons, while our specific aim is high quality 3D and color reconstruction of museological objects.

3 Algorithm Overview

The goal of the system is to reconstruct a 3D object from a sequence of geometrically calibrated images. To do so, we have at our disposal several types of information contained in the images. Among all the information available, shading, silhouettes and features of the object are the most useful for shape retrieval. Shading information needs a calibration of the light sources, which implies an even more controlled environment for the acquisition. The use of silhouettes requires a good extraction of the object from the background, which is not always easy to accomplish. Finally, of all the features available from an object, such as texture, points, contours, or more complicated forms, we are mainly interested in texture, whenever it exists. Since exploiting shading imposes heavy constraints on the acquisition process, the information we will use consists of silhouettes and texture. The next step is to decide how to make these two types of information work together. As we will see, this is not an easy task because these types of information are very different, almost "orthogonal".

3.1 Classical Snake vs. Level-Set Methods

Deformable models offer a well-known framework to optimize a surface under several kinds of information. Two related techniques can be used depending on the way the problem is posed: a classical snake approach [28] or a level-set approach [29]. The main advantage of the snake approach is its simplicity of implementation and parameter tuning. Its main drawback is the constant topology constraint. Level-set based algorithms have the advantage of an intrinsic capability to overcome this problem, but their main disadvantages are the computation time and the difficulty in controlling the topology. Computation time can be addressed using a narrow band implementation [30]. Controlling the topology is a more difficult problem, but the authors of [31] have recently proposed an interesting way of avoiding topology changes in level-set methods. Despite these improvements, level-set methods remain complex and expensive when dealing with high resolution deformable models (9 to 11 grid levels). Since high resolution is our principal objective, we have chosen to use the classical snake as the framework for the fusion of silhouette and stereo data. This implies that the topology has to be completely recovered before the snake evolution occurs, as discussed in Section 4. Since the proposed way to recover the right topology is the visual hull concept, the topology recovery will depend on the intrinsic limitations of the visual hull. This implies that there exist objects for which we are unable to recover the correct topology (no silhouette seeing a hole) that could potentially be reconstructed using a level-set method (the correct topology being recovered with the stereo information). We observe that, in practice, if enough views are available, the visual hull provides the correct topology for most common objects; therefore, this is not a severe handicap.

3.2 The Classical Snake Approach

The deformable model framework allows us to define an optimal surface which minimizes a global energy E. In general, this energy will be non-convex with possible local minima. In our case, the minimization problem is posed as follows: find the surface S of R^3 that minimizes the energy E(S) defined as:

E(S) = E_tex(S) + E_sil(S) + E_int(S),    (1)

where E_tex is the energy term related to the texture of the object, E_sil the term related to the silhouettes, and E_int is a regularization term of the surface model. Minimizing Eq. (1) means finding S_opt such that:

∇E(S_opt) = ∇E_tex(S_opt) + ∇E_sil(S_opt) + ∇E_int(S_opt) = 0,
          = F_tex(S_opt) + F_sil(S_opt) + F_int(S_opt) = 0,    (2)

where ∇ is the gradient operator, and F_tex, F_sil and F_int represent the forces that drive the snake. Equation (2) establishes the equilibrium condition for an optimal solution, where the three forces cancel each other out. A solution to Eq. (2) can be found by introducing a time variable t for the surface S and solving the following differential equation:

S_t = F_tex(S) + F_sil(S) + F_int(S).    (3)


The discrete version becomes:

S^(k+1) = S^k + Δt (F_tex(S^k) + F_sil(S^k) + F_int(S^k)).    (4)
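The explicit update of Eq. (4) can be sketched as a simple Euler iteration over the mesh vertices. This is a minimal illustration, not the authors' implementation: the force terms F_tex, F_sil and F_int are stood in for by arbitrary callables returning per-vertex force vectors, and the toy force below is our own placeholder.

```python
import numpy as np

def evolve_snake(vertices, forces, dt=0.1, n_iters=100):
    """Explicit Euler evolution of Eq. (4): S_{k+1} = S_k + dt * (sum of forces).

    vertices : (N, 3) array of mesh vertex positions.
    forces   : list of callables, each mapping an (N, 3) vertex array to an
               (N, 3) array of per-vertex force vectors (placeholders for
               F_tex, F_sil and F_int).
    """
    S = vertices.copy()
    for _ in range(n_iters):
        total = np.zeros_like(S)
        for F in forces:
            total += F(S)
        S += dt * total
    return S

# Toy example: a single "force" pulling every vertex toward the origin,
# so the surface contracts at each iteration.
S0 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
S = evolve_snake(S0, [lambda S: -S], dt=0.1, n_iters=50)
```

In practice the time step Δt must be kept small enough for the explicit scheme to remain stable, which is one reason the mesh evolution needs the control machinery described later in Section 7.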

Once we have sketched the energies that will drive the process, we need to choose a representation for the surface S. This representation defines the way the deformation of the snake is performed at each iteration. Among all the possible surface representations, we have chosen the triangular mesh because of its simplicity and well known properties.
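As a small illustration of working with a triangular mesh representation (vertex and face index arrays), the sketch below computes area-weighted vertex normals, a quantity typically needed to apply per-vertex forces along the surface normal. The function name and array layout are our own conventions, not taken from the paper.

```python
import numpy as np

def vertex_normals(V, F):
    """Area-weighted vertex normals for a triangle mesh.

    V : (N, 3) float array of vertex positions.
    F : (M, 3) int array of triangle vertex indices.
    """
    N = np.zeros_like(V)
    tri = V[F]  # (M, 3, 3): the three corners of each triangle
    # Cross product of two edges gives a normal whose length is twice
    # the triangle area, so summing it weights normals by face area.
    fn = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    for i in range(3):
        np.add.at(N, F[:, i], fn)  # accumulate onto each incident vertex
    lens = np.linalg.norm(N, axis=1, keepdims=True)
    return N / np.clip(lens, 1e-12, None)
```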

To completely define the deformation framework, we need an initial value of S, i.e., an initial surface S_0 that will evolve under the different energies until convergence. In this paper, we describe the snake initialization in Section 4, the force driven by the texture of the object in Section 5, the force driven by the silhouettes in Section 6, how we control the mesh evolution in Section 7, and the texture mapping procedure in Section 8. We finally discuss our results in Section 9.

4 Snake Initialization

The first step in our minimization problem is to find an initial surface close enough to the object surface to guarantee a good convergence of the algorithm. Close has to be understood in both a geometric and a topological sense. The geometric distance between the initial surface and the object surface has to be small in order to limit the number of iterations in the surface mesh evolution process, and thereby the computation time. The topology of the initial surface is also very important since classical deformable models maintain the topology of the mesh during its evolution. On the one hand, this imposes a strong constraint that makes the initialization a very important step, since the initial surface must capture the topology of the object surface. On the other hand, the constant-topology property of a classical snake provides more robustness to the evolution process.

If we list possible initializations in order of increasing fidelity, the first and simplest is the bounding box of the object. The next simplest surface is the convex hull of the object. Both the bounding box and the convex hull are unable to represent surfaces with a genus greater than 0. A more refined initialization, which lies between the convex hull and the real object surface, is the visual hull [32]. The visual hull can be defined as the intersection of all the possible cones containing the object. In practice, a discrete version is usually obtained by intersecting the cones generated by back-projecting the object silhouettes of a given set of views. Unlike the convex hull, it can represent surfaces with an arbitrary number of holes. However, this does not imply that it is able to completely recover the topology of the object and, what is even worse, the topology of the visual hull depends on the discretization of the views (see Fig. 1).
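The discrete visual hull construction described above can be sketched as a point-membership test: a world point belongs to the hull only if its projection falls inside every silhouette. This is a simplified illustration under assumed inputs (binary silhouette masks and hypothetical projection callables standing in for the calibrated cameras); a full implementation would intersect the silhouette cones over a voxel grid and extract a surface mesh from the result.

```python
import numpy as np

def visual_hull(grid_points, silhouettes, projections):
    """Keep the points whose projection lies inside every silhouette.

    grid_points : (N, 3) float array of candidate world points.
    silhouettes : list of (H, W) boolean masks, True inside the object.
    projections : list of callables mapping (N, 3) world points to (N, 2)
                  integer (u, v) pixel coordinates (hypothetical cameras).
    """
    inside = np.ones(len(grid_points), dtype=bool)
    for sil, proj in zip(silhouettes, projections):
        uv = proj(grid_points)
        h, w = sil.shape
        # A point projecting outside the image lies outside this view's cone.
        valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        inside &= valid
        inside[valid] &= sil[uv[valid, 1], uv[valid, 0]]
    return grid_points[inside]
```

With two toy orthographic "cameras" (dropping one coordinate each) and all-white silhouettes, a point inside both image footprints survives while a point projecting out of bounds is carved away.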
