International Conference on Computer Vision (ICCV) 2017
SurfaceNet: An End-to-end 3D Neural Network for Multiview Stereopsis
Mengqi Ji, Juergen Gall, Haitian Zheng, Yebin Liu, Lu Fang
mji@connect.ust.hk
Contribution
The first end-to-end learning framework for MVS:
•The network directly learns photo-consistency and the geometric relationship among views.
•A new 3D voxel representation encodes the camera poses.
Related Works
MVS takes multiple images with known camera poses as input.
Standard pipelines and their drawbacks:
•Volumetric methods [5]: manually designed graph-based cost functions.
•Depth-map fusion methods [1,2,3,4,6]: hand-engineered processing steps.
(Figure: qualitative comparison of reconstructions: reference, ours, Camp [1], Furu [2], Tola [3], Gipuma [4]; SurfaceNet with 2 views vs. N views.)
Method
➢Problem: how to embed the camera parameters into an end-to-end network?
➢Solution: a 3D voxel representation for each view, the colored voxel cube (CVC):
1. The scene is divided into overlapping cube volumes, each discretized into a voxel grid.
2. Each pixel corresponds to a ray of voxels.
3. All voxels on the same voxel ray are colorized with the color of that pixel (see the sketch below).
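A minimal NumPy sketch of these three steps (the function name, argument layout, and nearest-pixel sampling are illustrative assumptions, not the authors' code):

```python
# Sketch of colored voxel cube (CVC) construction for one view.
import numpy as np

def colored_voxel_cube(image, P, cube_origin, voxel_size, s=32):
    """Colorize an s x s x s voxel cube from one view.

    image       : (H, W, 3) uint8 image of this view
    P           : (3, 4) camera projection matrix (intrinsics @ [R|t])
    cube_origin : (3,) world coordinate of the cube corner
    voxel_size  : edge length of one voxel in world units
    """
    H, W, _ = image.shape
    # World coordinates of all voxel centers in the cube.
    idx = np.stack(np.meshgrid(np.arange(s), np.arange(s), np.arange(s),
                               indexing="ij"), axis=-1)            # (s,s,s,3)
    centers = cube_origin + (idx + 0.5) * voxel_size               # (s,s,s,3)
    homog = np.concatenate([centers, np.ones((s, s, s, 1))], -1)   # (s,s,s,4)

    # Project every voxel center into the image plane.
    proj = homog @ P.T                                             # (s,s,s,3)
    u = proj[..., 0] / proj[..., 2]
    v = proj[..., 1] / proj[..., 2]
    # Clipping is a simplification; voxels projecting outside the image
    # would need proper masking in a full implementation.
    u = np.clip(np.round(u).astype(int), 0, W - 1)
    v = np.clip(np.round(v).astype(int), 0, H - 1)

    # All voxels that project to the same pixel (i.e. lie on the same viewing
    # ray) receive the same color; this is what encodes the camera pose.
    cvc = image[v, u].astype(np.float32) / 255.0                   # (s,s,s,3)
    return cvc
```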
3D SurfaceNet:
1. Takes 2 colored voxel cubes from 2 different views as input, each of size (s, s, s). s = 32 for training; s can vary during inference because the network is fully convolutional.
2. Predicts for each voxel a binary occupancy attribute indicating whether the voxel is on the surface or not.
3. Consists of multiple network layer groups l_i, s_i, each of which includes several convolutional and pooling layers (a sketch follows this list).
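A minimal PyTorch sketch of such a fully convolutional design (an assumption for illustration, not the authors' implementation; the number of layer groups, channel widths, and the way the s_i responses are fused are placeholders):

```python
# Fully convolutional 3D network sketch in the spirit of SurfaceNet.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SurfaceNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Two CVCs (3 channels each) are concatenated -> 6 input channels.
        self.l1 = self._group(6, 16)
        self.l2 = self._group(16, 32)
        self.l3 = self._group(32, 64)
        # Side layers s_i map each group's features to a 1-channel response.
        self.s1 = nn.Conv3d(16, 1, kernel_size=1)
        self.s2 = nn.Conv3d(32, 1, kernel_size=1)
        self.s3 = nn.Conv3d(64, 1, kernel_size=1)

    @staticmethod
    def _group(c_in, c_out):
        # A layer group l_i: several convolutions followed by pooling.
        return nn.Sequential(
            nn.Conv3d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(2),
        )

    def forward(self, cvc_a, cvc_b):
        # cvc_a, cvc_b: (B, 3, s, s, s) colored voxel cubes from two views.
        x = torch.cat([cvc_a, cvc_b], dim=1)
        size = x.shape[2:]
        f1 = self.l1(x)
        f2 = self.l2(f1)
        f3 = self.l3(f2)
        # Upsample each side response back to input resolution and fuse.
        up = lambda f: F.interpolate(f, size=size, mode="trilinear",
                                     align_corners=False)
        fused = up(self.s1(f1)) + up(self.s2(f2)) + up(self.s3(f3))
        # Per-voxel probability of lying on the surface.
        return torch.sigmoid(fused)
```

Because every operation is a convolution or a pooling step, the cube side length s is not baked into the weights; at inference it only needs to be divisible by the total pooling factor.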
[1] N. D. Campbell, et al. Using multiple hypotheses to improve depth-maps for multi-view stereo. ECCV 2008.
[2] Y. Furukawa, et al. Accurate, dense, and robust multi-view stereopsis. PAMI 2010.
[3] E. Tola, et al. Efficient large-scale multi-view stereo for ultra high-resolution image sets. MVA 2012.
[4] S. Galliani, et al. Massively parallel multiview stereopsis by surface normal diffusion. ICCV 2015.
[5] http://www.ctralie.com/PrincetonUGRAD/Projects/SpaceCarving/
[6] https://www.cse.wustl.edu/~furukawa/newimages/fnt_mvs.png
[8] H. Aanæs, et al. Large-scale data for multiple-view stereopsis. IJCV 2016.
[9] S. M. Seitz, et al. A comparison and evaluation of multi-view stereo reconstruction algorithms. CVPR 2006.
➢Problem: fusing the predictions of all view pairs is not feasible, since 50 views yield 1000+ view pairs.
➢Solution: use only the most valuable view pairs.
•A relative importance w is computed for each view pair based on its baseline and the image appearance in both views (see the sketch below).
(Left) Randomly select 5 view pairs out of 1000+. (Right) Select the 5 view pairs with the top w values.
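A hedged sketch of this selection-and-fusion step (Python/NumPy; score_pair and predict_pair are hypothetical placeholders for the pair-importance model and the trained per-pair network, respectively):

```python
# Select the most informative view pairs and fuse their predictions.
import itertools
import numpy as np

def select_and_fuse(views, cvcs, predict_pair, score_pair, n_pairs=5):
    """views: list of view indices; cvcs[v]: CVC of view v, shape (s, s, s, 3).

    predict_pair(cvc_a, cvc_b) -> (s, s, s) per-voxel surface probability
    score_pair(a, b)           -> scalar relative importance w of pair (a, b)
    """
    pairs = list(itertools.combinations(views, 2))
    weights = np.array([score_pair(a, b) for a, b in pairs])
    # Keep only the n_pairs most valuable pairs instead of all O(N^2) pairs.
    top = np.argsort(weights)[::-1][:n_pairs]

    # Fuse the per-pair predictions by a w-weighted average.
    num, den = 0.0, 0.0
    for i in top:
        a, b = pairs[i]
        num = num + weights[i] * predict_pair(cvcs[a], cvcs[b])
        den = den + weights[i]
    return num / den
```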
Evaluation
➢DTU dataset [8]: evaluation on 22 randomly selected models; the remaining models are used for training.
➢Comparable results:
➢How many view pairs are needed:
➢Another dataset: use only 6 images of the dinoSparseRing model from the Middlebury dataset [9].
Training:
•Crop volumes from a subset of the scenes in the DTU dataset [8]; each training sample takes 2 colored voxel cubes of the same volume from 2 different views as input (a minimal training-step sketch follows).
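A minimal training-step sketch (assumptions: a per-voxel binary cross-entropy loss on the occupancy labels, and a hypothetical helper sample_crop that crops a cube from a DTU scene and returns the two CVCs plus the ground-truth occupancy):

```python
# One optimization step for the per-voxel occupancy prediction.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, sample_crop, s=32):
    # cvc_a, cvc_b: (B, 3, s, s, s); gt_occupancy: (B, 1, s, s, s) floats in {0, 1}.
    cvc_a, cvc_b, gt_occupancy = sample_crop(s)
    optimizer.zero_grad()
    pred = model(cvc_a, cvc_b)          # (B, 1, s, s, s) probabilities in [0, 1]
    loss = F.binary_cross_entropy(pred, gt_occupancy)
    loss.backward()
    optimizer.step()
    return loss.item()
```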