

This paper proposes an end-to-end learning framework for multiview stereopsis. We term the network SurfaceNet. It takes a set of images and their corresponding camera parameters as input and directly infers the 3D model. The key advantage of the framework is that both photo-consistency as well as geometric relations of the surface structure can be directly learned for the purpose of multiview stereopsis in an end-to-end fashion. SurfaceNet is a fully 3D convolutional network, which is achieved by encoding the camera parameters together with the images in a 3D voxel representation. We evaluate SurfaceNet on the large-scale DTU benchmark. Code is available in
International Conference on
Computer Vision 2017
SurfaceNet: An End-to-end 3D Neural Network for Multiview Stereopsis
Mengqi Ji, Juergen Gall, Haitian Zheng, Yebin Liu, Lu Fang
Related Works
The first end-to-end learning framework for MVS:
The network directly learns the photo-consistency and
geometric relations.
A new 3D voxel representation encodes the camera parameters.
MVS takes multiple images with camera poses as inputs.
Standard pipelines and drawbacks:
Volumetric methods [5]: Manually designed graph-based
cost functions.
Depth map fusion methods [1,2,3,4,6]: Engineered
processing steps.
[Figure panels: Reference, Ours, Camp [1], Furu [2], Tola [3], Gipuma [4]]
SurfaceNet: 2 views SurfaceNet: N views
Problem: how to embed the camera parameters into an
end-to-end network?
Solution: propose a 3D voxel representation for each
view: the colored voxel cube (CVC).
1. Divide the scene into overlapping volumes, each discretized into a voxel grid.
2. Each pixel corresponds to a voxel ray.
3. Assign the same color to all voxels on the same
voxel ray.
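The three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera given as a 3x4 projection matrix P, nearest-pixel sampling, and a cube described by a hypothetical `origin` and `voxel_size`; the function name `colored_voxel_cube` is also an assumption. Each voxel center is projected into the image and takes the color of the pixel it lands on, so all voxels along one pixel's ray receive the same color.

```python
import numpy as np

def colored_voxel_cube(image, P, origin, voxel_size, s):
    """Build an (s, s, s, 3) colored voxel cube for one view (illustrative sketch).

    image      : (H, W, 3) array of pixel colors.
    P          : (3, 4) projection matrix, world point -> homogeneous pixel (x, y, w).
    origin     : world coordinates of the center of voxel (0, 0, 0) (assumed layout).
    voxel_size : edge length of one voxel in world units.
    """
    H, W = image.shape[:2]
    # Integer indices (i, j, k) of every voxel, flattened to (s^3, 3).
    idx = np.stack(
        np.meshgrid(np.arange(s), np.arange(s), np.arange(s), indexing="ij"),
        axis=-1,
    ).reshape(-1, 3)
    centers = np.asarray(origin, dtype=float) + voxel_size * idx
    homog = np.hstack([centers, np.ones((centers.shape[0], 1))])
    proj = homog @ P.T                      # (s^3, 3) homogeneous pixel coordinates
    depth = proj[:, 2]
    in_front = depth > 1e-9                 # ignore voxels behind the camera
    safe = np.where(in_front, depth, 1.0)   # avoid division by zero
    u = np.round(proj[:, 0] / safe).astype(int)   # pixel column
    v = np.round(proj[:, 1] / safe).astype(int)   # pixel row
    inside = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # Voxels that project outside the image stay black (an assumption).
    cube = np.zeros((s ** 3, 3), dtype=image.dtype)
    cube[inside] = image[v[inside], u[inside]]
    return cube.reshape(s, s, s, 3)
```

Because the camera parameters only enter through the projection, two views of the same volume produce two cubes that share one voxel grid, which is what lets a 3D network compare them voxel by voxel.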
3D SurfaceNet:
1. takes 2 colored voxel cubes from 2 different views as
input, each of size (s,s,s). s=32 for training; s can vary
during inference because the network is fully convolutional.
2. predicts for each voxel a binary occupancy attribute
indicating whether the voxel is on the surface or not.
3. consists of multiple network layer groups l_i, s_i, each of
which includes several convolutional layers and pooling
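The input and output shapes described in this list can be illustrated with a toy stand-in. This is not the SurfaceNet architecture: it replaces the layer groups with a single per-voxel linear map (equivalent to a 1x1x1 convolution) followed by a sigmoid and a threshold, and the names `toy_occupancy`, `weights`, `bias` and `threshold` are all hypothetical. It only shows why a network built purely from convolutions accepts cubes of any size s with the same parameters.

```python
import numpy as np

def toy_occupancy(cvc_a, cvc_b, weights, bias, threshold=0.5):
    """Toy per-voxel classifier over a pair of colored voxel cubes.

    cvc_a, cvc_b : (s, s, s, 3) cubes from two views of the same volume.
    weights      : (6,) weights of a 1x1x1 convolution (hypothetical values).
    bias         : scalar bias.
    Returns an (s, s, s) boolean array: True where the voxel is predicted on the surface.
    """
    # Stack the two views along the channel axis: (s, s, s, 6).
    x = np.concatenate([cvc_a, cvc_b], axis=-1).astype(float)
    logits = x @ weights + bias             # same weights applied at every voxel
    prob = 1.0 / (1.0 + np.exp(-logits))    # per-voxel surface probability
    return prob > threshold

weights = np.zeros(6)                       # placeholder parameters, not trained
for s in (32, 64):                          # the same parameters work for any cube size
    cube = np.zeros((s, s, s, 3))
    print(s, toy_occupancy(cube, cube, weights, bias=1.0).shape)
```

The parameter count does not depend on s, which is the property that allows training on 32x32x32 cubes and running inference on other sizes.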
[1] N. D. Campbell, et al. Using multiple hypotheses to improve depth-maps for multi-view stereo.
ECCV 2008.
[2] Y. Furukawa, et al. Accurate, dense, and robust multiview stereopsis. PAMI 2010.
[3] E. Tola, et al. Efficient large-scale multi-view stereo for ultra high-resolution image sets. MVA
[4] S. Galliani, et al. Massively parallel multiview stereopsis by surface normal diffusion. ICCV 2015.
[8] H. Aanæs, et al. Large-scale data for multiple-view stereopsis. IJCV 2016.
[9] S. M. Seitz, et al. A comparison and evaluation of multi-view stereo reconstruction algorithms.
CVPR 2006.
Problem: fusing the predictions of all view pairs is
not feasible, since 50 views yield 1000+ view pairs.
Solution: only use the valuable view pairs.
Assign a relative importance w to each view pair based on
the baseline and the image appearance in both views.
(Left) Randomly select 5 view pairs out of 1000+.
(Right) Select the 5 view pairs with the top w values.
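The pair count and the top-w selection can be checked with a short sketch. The counting is exact: 50 views give C(50, 2) = 1225 unordered pairs. The selection part is only an illustration: `top_view_pairs` and `weight_fn` are hypothetical names, and the example weight (preferring views with nearby indices) is a placeholder for the importance w, which the poster bases on baseline and image appearance.

```python
from itertools import combinations
from math import comb

def top_view_pairs(num_views, weight_fn, k):
    """Return the k view pairs with the largest weight.

    weight_fn maps a pair (i, j) to a relative importance; here it is a
    caller-supplied placeholder, not the weighting used by SurfaceNet.
    """
    pairs = list(combinations(range(num_views), 2))
    # Sort by descending weight; ties keep their enumeration order.
    ranked = sorted(pairs, key=weight_fn, reverse=True)
    return ranked[:k]

print(comb(50, 2))  # 1225 unordered pairs from 50 views

# Placeholder weight: prefer views that are close in index.
selected = top_view_pairs(50, lambda p: -abs(p[0] - p[1]), k=5)
print(selected)
```

Restricting fusion to a handful of highly weighted pairs replaces more than a thousand pairwise predictions per volume with only k of them.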
DTU dataset [8]: evaluate on 22 randomly selected
models. The remaining models are used for training.
Comparable results:
How many view pairs are needed:
Another dataset: use only 6
images of the dinoSparseRing
model in the Middlebury dataset.
Crop volumes from a subset of the scenes in the DTU
dataset [8]. The network takes 2 colored voxel cubes from 2 different
views as input.
