Silhouette and Stereo Fusion for 3D Object Modeling
Carlos Hernández Esteban and Francis Schmitt
Signal and Image Processing Department, CNRS UMR 5141
École Nationale Supérieure des Télécommunications, France
In this paper, we present a new approach to high quality 3D object reconstruction. Start-
ing from a calibrated sequence of color images, the algorithm is able to reconstruct both
the 3D geometry and the texture. The core of the method is based on a deformable model,
which defines the framework where texture and silhouette information can be fused. This
is achieved by defining two external forces based on the images: a texture driven force and
a silhouette driven force. The texture force is computed in two steps: a multi-stereo corre-
lation voting approach and a gradient vector flow diffusion. Due to the high resolution of
the voting approach, a multi-grid version of the gradient vector flow has been developed.
Concerning the silhouette force, a new formulation of the silhouette constraint is derived.
It provides a robust way to integrate the silhouettes in the evolution algorithm. As a conse-
quence, we are able to recover the contour generators of the model at the end of the iteration
process. Finally, a texture map is computed from the original images for the reconstructed surface.
Key words: 3D reconstruction, deformable model, multigrid gradient vector flow, visual hull
1 Introduction
As computer graphics and technology become more powerful, attention is being
focused on the creation or acquisition of high quality 3D models. As a result, a
great effort is being made to exploit the biggest source of 3D models: the real
world. Among all the possible techniques of 3D acquisition, one is especially
attractive: image-based modeling. In this kind of approach, the only input data to
the algorithm is a set of images, possibly calibrated. Its main advantages are the
low cost of the system and the possibility of immediately recovering color.
Email addresses: email@example.com, firstname.lastname@example.org
(Carlos Hernández Esteban and Francis Schmitt).
Preprint submitted to Computer Vision and Image Understanding
The main disadvantage is the quality of the reconstructions compared to the quality
obtained by active techniques (range scanning or encoded-light techniques). In this
paper we present an image-based modeling approach which affords high quality
reconstructions by mixing two complementary types of image data within the same framework:
silhouette information and texture information. Our two main contributions are a
new approach to the silhouette constraint definition and the high quality of the resulting reconstructions.
2 Related Work
Acquiring 3D models is not an easy task and abundant literature exists on this
subject. There are three major approaches to the problem of real 3D model
representation: pure image-based rendering techniques, hybrid image-based techniques, and
3D scanning techniques. Pure image-based rendering techniques as [1,2] try to gen-
erate synthetic views from a given set of original images. They do not estimate the
real 3D structure behind the images, they only interpolate the given set of images
to generate a synthetic view. Hybrid methods as [3–6] make a rough estimation of
the 3D geometry and mix it with a traditional image-based rendering algorithm in
order to obtain more accurate results. In both types of methods, the goal is to gen-
erate coherent views of the real scene rather than to obtain metric measurements of it. In
contrast to these techniques, the third class of algorithms tries to recover the full
3D structure. Among the 3D scanning techniques, two main groups are to be distin-
guished: active methods and passive ones. Active methods use a controlled source
of light such as a laser or a coded light in order to recover the 3D information [7–9].
Passive methods use only the information contained in the images of the scene [10].
They can be classified according to the type of information they use. A first class
consists of the shape from silhouette methods [11–16]. They obtain an initial esti-
mation of the 3D model known as visual hull. They are robust and fast, but because
of the type of information used, they are limited to simply shaped objects. We can
find commercial products based on this technique. Another approach includes the
shape from shading methods. They are based on the diffusing properties of Lam-
bertian surfaces. They mainly work for 2.5D surfaces and are very dependent on the
light conditions. A third class of methods uses the color information of the scene.
The color information can be used in different ways, depending on the type of scene
we try to reconstruct. A first way is to measure color consistency to carve a voxel
volume [17,18]. However, such methods only provide an output model composed of a set of
voxels, which makes it difficult to obtain a good 3D mesh representation. In order to
solve this problem, the authors of [19] and [20] propose to use the color consistency
measure to guide a deformable model. An additional problem of color consistency
algorithms is that they compare absolute color values, which makes them sensitive
to light condition variations. A different way of exploiting color is to compare local
variations of the texture, as done in cross-correlation methods [21,22]. As a spe-
cialization of the color-based group, there are specific methods that try to use at
the same time another type of information such as silhouettes [17,23,24], radiance
[25] or shading [26]. Although very good results are obtained, the quality is still
limited, and the main problem is the way the fusion of different data is done. Some
authors, such as [17,23], use a volume grid for the fusion. Others, like [26,25,24],
use a deformation model framework. The algorithm we present in this paper can be
classified in this latter group. We perform the fusion of both silhouettes and texture
information by a deformation model evolution. The main difference with the meth-
ods mentioned above is the way the fusion is accomplished, which enables us to
obtain very high quality reconstructions. A similar approach to our work has been
recently proposed in [27]. A deformable model is also used to fuse texture and
silhouette information. However, the objectives of their work are not the same as
ours. They are interested in dynamic 3D shape reconstruction of moving persons
while our specific aim is high quality 3D and color reconstruction of museological objects.
3 Algorithm Overview
The goal of the system is to be able to reconstruct a 3D object from a sequence
of geometrically calibrated images. To do so, we have several types of information
at our disposal in the images. Among all the information available, shading, sil-
houettes and features of the object are the most useful for shape retrieval. Shading
information needs a calibration of the light sources, which implies an even more
controlled environment for the acquisition. The use of the silhouettes requires a
good extraction of the object from the background, which is not always easy to
accomplish. Finally, of all the features available from an object, such as texture,
points, contours, or more complicated forms, we are mainly interested in texture,
whenever it exists. Since exploiting shading imposes heavy constraints on the
acquisition process, the information we will use consists of silhouettes and texture.
The next step is to decide how to mix these two types of information to work to-
gether. As we will see, this is not an easy task because those types of information
are very different, almost "orthogonal".
3.1 Classical Snake vs. Level-Set Methods
Deformation models offer a well-known framework to optimize a surface under
several kinds of information. Two different related techniques can be used depend-
ing on the way the problem is posed: a classical snake approach [28] or a level-set
approach [29]. The main advantage of the snake approach is its simplicity of
implementation and parameter tuning. Its main drawback is the constant topology
constraint. Level-set based algorithms have the advantage of an intrinsic capability
to overcome this problem, but their main disadvantages are the computation time
and the difficulty in controlling the topology. Computation time can be addressed
using a narrow band implementation [30]. Controlling the topology is a more
difficult problem, but the authors of [31] have recently proposed an interesting way
of avoiding topology changes in level set methods. Despite these improvements,
level-set methods remain complex and expensive when dealing with high resolu-
tion deformable models (9 to 11 grid levels). Since this is our principal objective,
we have chosen to use the classical snake as the framework for the fusion of silhouette
and stereo data. This implies that the topology has to be completely recovered be-
fore the snake evolution occurs as discussed in Section 4. Since the proposed way
to recover the right topology is the visual hull concept, the topology recovery will
depend on the intrinsic limitations of the visual hull. This implies that there ex-
ist objects for which we are unable to recover the correct topology (no silhouettes
seeing a hole) that could be potentially reconstructed using a level-set method (the
correct topology being recovered with the stereo information). We observe that, in
practice, if enough views are available, the visual hull provides the correct topology
for most common objects; therefore, this is not a severe handicap.
3.2 The Classical Snake Approach
The deformable model framework allows us to define an optimal surface which
minimizes a global energy E. In general, this energy will be non-convex with pos-
sible local minima. In our case, the minimization problem is posed as follows: find
the surface S of R^3 that minimizes the energy E(S) defined as:

E(S) = E_tex(S) + E_sil(S) + E_int(S),    (1)
where E_tex is the energy term related to the texture of the object, E_sil the term related
to the silhouettes, and E_int is a regularization term of the surface model. Minimizing
Eq. (1) means finding S_opt such that:

∇E(S_opt) = F_tex(S_opt) + F_sil(S_opt) + F_int(S_opt) = 0,    (2)
where ∇ is the gradient operator, and F_tex, F_sil and F_int represent the forces that
drive the snake. Equation (2) establishes the equilibrium condition for an optimal
solution, where the three forces cancel each other out. A solution to Eq. (2) can be
found by introducing a time variable t for the surface S and solving the following
evolution equation:

∂S/∂t = F_tex(S) + F_sil(S) + F_int(S).    (3)

The discrete version becomes:

S^{k+1} = S^k + Δt (F_tex(S^k) + F_sil(S^k) + F_int(S^k)),    (4)

where S^k is the surface at iteration k and Δt is the time step.
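As an illustration, the discrete evolution step S^{k+1} = S^k + Δt (F_tex + F_sil + F_int) can be sketched on a triangle mesh stored as a vertex array. This is a minimal sketch under our own assumptions, not the authors' implementation: the texture and silhouette forces are abstracted as callables, the internal force is approximated by umbrella-operator Laplacian smoothing, and all function names and weights are ours.

```python
# Illustrative sketch (not the paper's code): one explicit Euler step of the
# snake evolution on a triangle mesh represented as an (n,3) vertex array.
import numpy as np

def internal_force(vertices, neighbors):
    """Umbrella-operator Laplacian: pulls each vertex toward the centroid
    of its 1-ring neighbors (a common choice of regularization force)."""
    force = np.zeros_like(vertices)
    for i, ring in enumerate(neighbors):
        force[i] = vertices[ring].mean(axis=0) - vertices[i]
    return force

def evolve_step(vertices, neighbors, f_tex, f_sil, dt=0.1,
                w_tex=1.0, w_sil=1.0, w_int=0.3):
    """One iteration of Eq. (4). f_tex and f_sil map an (n,3) vertex
    array to an (n,3) array of per-vertex forces."""
    total = (w_tex * f_tex(vertices)
             + w_sil * f_sil(vertices)
             + w_int * internal_force(vertices, neighbors))
    return vertices + dt * total
```

Iterating this step until the vertex displacements fall below a threshold corresponds to reaching the equilibrium of Eq. (2), where the three forces cancel each other out.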
Once we have sketched the energies that will drive the process, we need to make a
choice for the representation of the surface S. This representation defines the way
the deformation of the snake is done at each iteration. Among all the possible sur-
face representations, we have chosen the triangular mesh because of its simplicity
and well known properties.
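As a minimal illustration of this representation (our own sketch, not code from the paper), a triangle mesh can be stored as a vertex-position array plus an array of vertex-index triples, from which the 1-ring vertex neighborhoods typically needed by a deformation step are easily derived:

```python
# Illustrative sketch: derive 1-ring vertex neighborhoods from a triangle
# list. Each face (i, j, k) makes its three vertices mutual neighbors.
def one_ring_neighbors(num_vertices, faces):
    """faces: iterable of (i, j, k) vertex-index triples.
    Returns a list of sorted neighbor-index lists, one per vertex."""
    rings = [set() for _ in range(num_vertices)]
    for i, j, k in faces:
        rings[i].update((j, k))
        rings[j].update((i, k))
        rings[k].update((i, j))
    return [sorted(r) for r in rings]
```

The triangle list is the only connectivity information required; no half-edge or winged-edge structure is needed for forces that act vertex-wise.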
To completely define the deformation framework, we need an initial value of S, i.e.,
an initial surface S_0 that will evolve under the different energies until convergence.
In this paper, we describe the snake initialization in Section 4, the texture driven
force in Section 5, the silhouette driven force in Section 6, how we control the mesh
evolution in Section 7, and the texture mapping procedure
in Section 8. We finally discuss our results in Section 9.
4 Snake Initialization
The first step in our minimization problem is to find an initial surface close enough
to the object surface in order to guarantee a good convergence of the algorithm.
Close has to be considered in a geometrical and topological sense. The geometric
distance between the initial and the object surfaces has to be reduced in order to
limit the number of iterations in the surface mesh evolution process and thereby the
computation time. The topology of the initial surface is also very important since
classical deformable models maintain the topology of the mesh during its evolu-
tion. On the one hand, this imposes a strong constraint that makes the initialization
a very important step since the initial surface must capture the topology of the ob-
ject surface. On the other hand, the topology-constant property of a classical snake
provides more robustness to the evolution process.
Among the possible initializations, the simplest is the bounding box of the object.
The next simplest surface is the convex hull of the object. Both the bounding box and the convex
hull are unable to represent surfaces with a genus greater than 0. A more refined
initialization, which lies between the convex hull and the real object surface, is the
visual hull [32]. The visual hull can be defined as the intersection of all the possi-
ble cones containing the object. In practice, a discrete version is usually obtained
by intersecting the cones generated by back projecting the object silhouettes of a
given set of views. Unlike the convex hull, it can represent surfaces
with an arbitrary number of holes. However, this does not imply that it is able to
completely recover the topology of the object and, what is even worse, the topology
of the visual hull depends on the discretization of the views (see Fig. 1).
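The discrete cone-intersection construction described above can be sketched on a regular voxel grid. This is an illustrative sketch under assumed conventions (pinhole cameras given as 3x4 projection matrices, silhouettes as boolean image masks, nearest-pixel sampling), not the actual implementation: a voxel center belongs to the visual hull only if it projects inside the object silhouette in every view.

```python
# Illustrative sketch: keep a voxel only if its center projects inside the
# silhouette of every view, i.e. intersect the back-projected cones.
import numpy as np

def carve_visual_hull(voxel_centers, projections, silhouettes):
    """voxel_centers: (n,3) array; projections: list of 3x4 matrices;
    silhouettes: list of boolean HxW masks (True = inside the silhouette).
    Returns a boolean mask over the voxels belonging to the visual hull."""
    n = len(voxel_centers)
    keep = np.ones(n, dtype=bool)
    homog = np.hstack([voxel_centers, np.ones((n, 1))])
    for P, sil in zip(projections, silhouettes):
        uvw = homog @ P.T                           # project all voxel centers
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        h, w = sil.shape
        ok = np.zeros(n, dtype=bool)
        in_img = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        ok[in_img] = sil[v[in_img], u[in_img]]      # row = v, column = u
        keep &= ok                                  # outside any view -> carved
    return keep
```

The grid resolution and the number of views determine how faithfully the hull, and in particular its topology, is captured, echoing the view-discretization dependence noted above.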
Fig. 17. Twins model reconstruction using only 12 equally spaced cameras. From left to
right: visual hull initialization, final model, texture mapping and concavity recovery.
Fig. 18. Roman statues reconstructed from 36 images of 16 Mpixels. The octree used to
store the correlation hits has 11 levels of depth. From left to right, models have respectively
223308, 211714 and 204220 vertices.
already very close to the real surface, only a few iterations would suffice to converge.
The second drawback is the topology-constant evolution. It guarantees the topology
of the final model, but it is also a limitation for some kinds of objects whose
topology cannot be captured by the visual hull. A feasible solution would be to
detect self-collisions of the snake [43], and to launch a local level-set based method
in order to recover the correct topology. Further work includes: i) the self calibra-
tion of the image sequence using both the silhouettes and traditional methods, ii)
an improved strategy to detect local convergence of the snake in order to freeze
optimized regions and to accelerate the evolution in the empty concavity regions,
iii) the possible use of the surface curvatures to allow a multi-resolution evolution
of the mesh, iv) some more advanced work in the generation of the texture.
This work has been partly supported by the SCULPTEUR European project IST-
2001-35372. We thank the Thomas Henry museum at Cherbourg for the image
sequences corresponding to Figures 12, 13 and 16.
[1] S. Chen, L. Williams, View interpolation for image synthesis, in: SIGGRAPH ’93, 1993, pp. 279–288.
[2] L. McMillan, G. Bishop, Plenoptic modeling: An image-based rendering system, in: SIGGRAPH ’95, 1995, pp. 39–46.
[3] P. E. Debevec, C. J. Taylor, J. Malik, Modeling and rendering architecture from photographs: A hybrid geometry and image-based approach, in: SIGGRAPH ’96, 1996, pp. 11–20.
[4] W. Matusik, C. Buehler, R. Raskar, S. Gortler, L. McMillan, Image-based visual hulls, in: SIGGRAPH 2000, 2000, pp. 369–374.
[5] G. Slabaugh, R. Schafer, M. Hans, Image-based photo hulls, in: 3DPVT ’02, 2002.
[6] M. Li, H. Schirmacher, M. Magnor, H. Seidel, Combining stereo and visual hull information for on-line reconstruction and rendering of dynamic scenes, in: Proceedings of IEEE 2002 Workshop on Multimedia and Signal Processing, 2002.
[7] F. Schmitt, B. Barsky, W. Du, An adaptive subdivision method for surface-fitting from sampled data, in: SIGGRAPH ’86, 1986, pp. 179–188.
[8] B. Curless, M. Levoy, A volumetric method for building complex models from range images, in: SIGGRAPH ’96, 1996, pp. 303–312.
[9] M. Levoy, K. Pulli, B. Curless, S. Rusinkiewicz, D. Koller, L. Pereira, M. Ginzton, 3D scanning of large statues, in: SIGGRAPH 2000, 2000, pp. 131–144.
[10] G. Slabaugh, W. B. Culbertson, T. Malzbender, R. Shafer, A survey of methods for volumetric scene reconstruction from photographs, in: International Workshop on Volume Graphics 2001, 2001.
[11] B. G. Baumgart, Geometric modelling for computer vision, Ph.D. thesis, Stanford University.
[12] M. Potmesil, Generating octree models of 3D objects from their silhouettes in a sequence of images, CVGIP 40 (1987) 1–29.
[13] R. Vaillant, O. Faugeras, Using extremal boundaries for 3D object modelling, IEEE Trans. Pattern Analysis and Machine Intelligence 14 (2) (1992) 157–173.
[14] W. Niem, J. Wingbermuhle, Automatic reconstruction of 3D objects using a mobile monoscopic camera, in: Int. Conf. on Recent Advances in 3D Imaging and Modeling, 1997, pp. 173–181.
[15] Y. Matsumoto, H. Terasaki, K. Sugimoto, T. Arakawa, A portable three-dimensional digitizer, in: Int. Conf. on Recent Advances in 3D Imaging and Modeling, 1997.
[16] S. Sullivan, J. Ponce, Automatic model construction, pose estimation, and object recognition from photographs using triangular splines, IEEE Trans. Pattern Analysis and Machine Intelligence 20 (10) (1998) 1091–1096.
[17] Y. Matsumoto, K. Fujimura, T. Kitamura, Shape-from-silhouette/stereo and its application to 3-D digitizer, in: Proceedings of Discrete Geometry for Computer Imagery, 1999, pp. 177–190.
[18] S. Seitz, C. Dyer, Photorealistic scene reconstruction by voxel coloring, International Journal of Computer Vision 38 (3) (2000) 197–216.
[19] L. Zhang, S. M. Seitz, Image-based multiresolution shape recovery by surface deformation, in: Proc. of SPIE: Videometrics and Optical Methods for 3D Shape Measurement, 2001, pp. 51–61.
[20] A. Yezzi, G. Slabaugh, R. Cipolla, R. Schafer, A surface evolution approach of probabilistic space carving, in: 3DPVT ’02, 2002, pp. 618–621.
[21] R. Keriven, O. Faugeras, Variational principles, surface evolution, PDEs, level set methods, and the stereo problem, IEEE Transactions on Image Processing 7 (3) (1998).
[22] A. Sarti, S. Tubaro, Image based multiresolution implicit object modeling, EURASIP Journal on Applied Signal Processing 2002 (10) (2002) 1053–1066.
[23] G. Cross, A. Zisserman, Surface reconstruction from multiple views using apparent contours and surface texture, in: A. Leonardis, F. Solina, R. Bajcsy (Eds.), NATO Advanced Research Workshop on Confluence of Computer Vision and Computer Graphics, Ljubljana, Slovenia, 2000, pp. 25–47.
[24] J. Isidoro, S. Sclaroff, Stochastic refinement of the visual hull to satisfy photometric and silhouette consistency constraints, in: Proc. ICCV, 2003, pp. 1335–1342.
[25] S. Soatto, A. J. Yezzi, H. Jin, Tales of shape and radiance in multi-view stereo, in: Proc. ICCV, 2003, pp. 974–981.
[26] P. Fua, Y. Leclerc, Object-centered surface reconstruction: Combining multi-image stereo and shading, International Journal of Computer Vision 16 (1995) 35–56.
[27] S. Nobuhara, T. Matsuyama, Dynamic 3D shape from multi-viewpoint images using deformable mesh models, in: Proc. of 3rd International Symposium on Image and Signal Processing and Analysis, 2003, pp. 192–197.
[28] M. Kass, A. Witkin, D. Terzopoulos, Snakes: Active contour models, International Journal of Computer Vision 1 (1988) 321–332.
[29] J. Sethian, Level Set Methods: Evolving Interfaces in Geometry, Fluid Mechanics, Computer Vision and Materials Sciences, Cambridge University Press, 1996.
[30] D. Adalsteinsson, J. Sethian, A fast level set method for propagating interfaces, Journal of Computational Physics 118 (1995) 269–277.
[31] X. Han, C. Xu, J. L. Prince, A topology preserving level set method for geometric deformable models, IEEE Transactions on PAMI 25 (2003) 755–768.
[32] A. Laurentini, The visual hull concept for silhouette based image understanding, IEEE Trans. on PAMI 16 (2) (1994) 150–162.
[33] W. E. Lorensen, H. E. Cline, Marching cubes: A high resolution 3D surface construction algorithm, in: Proceedings of SIGGRAPH ’87, Vol. 21, 1987, pp. 163–169.
[34] G. Medioni, M.-S. Lee, C.-K. Tang, A Computational Framework for Segmentation and Grouping, Elsevier, 2000.
[35] A. Broadhurst, T. Drummond, R. Cipolla, A probabilistic framework for the Space Carving algorithm, in: Proc. 8th ICCV, IEEE Computer Society Press, Vancouver, Canada, 2001, pp. 388–393.
[36] C. Hernández, F. Schmitt, Multi-stereo 3D object reconstruction, in: 3DPVT ’02, 2002.
[37] C. Xu, J. L. Prince, Snakes, shapes, and gradient vector flow, IEEE Transactions on Image Processing (1998) 359–369.
[38] B. Fornberg, Generation of finite difference formulas on arbitrarily spaced grids, Mathematics of Computation 51 (1988) 699–706.
[39] L. Kobbelt, √3-subdivision, in: SIGGRAPH 2000, 2000, pp. 103–112.
[40] F. Schmitt, Y. Yemez, 3D color object reconstruction from 2D image sequences, in: IEEE International Conference on Image Processing, Vol. 3, 1999, pp. 65–69.
[41] H. Lensch, W. Heidrich, H. P. Seidel, A silhouette-based algorithm for texture registration and stitching, Journal of Graphical Models (2001) 245–262.
[42] J. M. Lavest, M. Viala, M. Dhome, Do we really need an accurate calibration pattern to achieve a reliable camera calibration?, in: Proc. ECCV, Vol. 1, 1998, pp. 158–174.
[43] J. O. Lachaud, A. Montanvert, Deformable meshes with automated topology changes for coarse-to-fine 3D surface extraction, Medical Image Analysis 3 (2) (1999) 187–