SOS: Stereo Matching in O(1) with Slanted Support Windows
Vladimir Tankovich1, Michael Schoenberg1, Sean Ryan Fanello1, Adarsh Kowdle1,
Christoph Rhemann1, Maksym Dzitsiuk1, Mirko Schmidt1, Julien Valentin1, Shahram Izadi1
Abstract - Depth cameras have accelerated research in many
areas of computer vision. Most triangulation-based depth
cameras, whether structured light systems like the Kinect or
active (assisted) stereo systems, are based on the principle
of stereo matching. Depth from stereo is an active research
topic dating back 30 years. Despite recent advances, algorithms
usually trade off accuracy for speed. In particular, efficient
methods rely on fronto-parallel assumptions to reduce the
search space and keep computation low. We present SOS
(Slanted O(1) Stereo), the first algorithm capable of leveraging
slanted support windows without sacrificing speed or accuracy.
We use an active stereo configuration, where an illuminator
textures the scene. Under this setting, local methods - such as
PatchMatch Stereo - obtain state of the art results by jointly
estimating disparities and slant, but at a large computational
cost. We observe that these methods typically exploit local
smoothness to simplify their initialization strategies. Our key
insight is that local smoothness can in fact be used to amortize
the computation not only within initialization, but across the
entire stereo pipeline. Building on these insights, we propose
a novel hierarchical initialization that is able to efficiently
perform search over disparity and slants. We then show how
this structure can be leveraged to provide high quality depth
maps. Extensive quantitative evaluations demonstrate that the
proposed technique yields significantly more precise results than
current state of the art, but at a fraction of the computational
cost. Our prototype implementation runs at 4000 fps on modern
GPU architectures.
I. INTRODUCTION
Since the release of the Microsoft Kinect, 3D cameras
have revolutionized the way we tackle challenging computer
vision problems such as body part classification [43], hand
pose estimation [28], [44], action recognition [13], [11],
3D scanning [25], nonrigid reconstruction [9], [8], and 3D
scene understanding [32], [45], [6]. Depth sensors have also
enabled challenging scenarios in robotics [7], [12], [22] as
well as augmented and virtual reality [38].
The simplest way to estimate the 3D information of a
scene makes use of ‘passive’ sensors and algorithms (e.g.
RGB cameras, Structure from Motion (SfM)). These methods
typically perform poorly when there is no texture in the
scene, or rely on strong prior assumptions about the scene
that fail in general cases [20]. To overcome these limitations,
‘active’ sensors rely on projectors that augment the scene
with additional texture. Such sensors can be categorized into
two areas: time of flight and triangulation based, also known
as structured light. Time of flight systems have a compelling
form factor (no baseline is required), but they suffer from
multipath interference, which renders the measurements un-
usable for precise reconstructions [26], [19], [3], [36].
1Authors are with the Augmented Perception group at Google.
Structured light systems [41], [21] fall into temporal and
spatial categories. Temporal algorithms rely on multiple
shots of the same scene under differing illumination
patterns. Such approaches are typically efficient, as their
depth is directly encoded into a lookup table, but they
have serious limitations: short range, motion artifacts, ex-
pensive hardware, and significant power requirements. A
recent example of such an approach is the Intel RealSense
SR300, which works up to 120cm. Spatial structured light
systems have been popular since Kinect V1. Such methods
encode information spatially into a pattern that is projected
into the scene, recovering depth in a single shot. Despite
commercialization, this technology also suffers from major
drawbacks: interference from multiple sources, need for an a-
priori known pattern, and online calibration requirements to
compensate for drift from that pattern. Moreover, structured
light algorithms demand non-trivial computational resources,
with many examples of the technology limited to low
framerates (30 fps) and VGA resolution.
Active (or assisted) stereo [30], [37] represents a solution
for most of these challenges: they do not require a known
pattern, perform well in textureless regions, and do not
suffer from multipath interference. However, these systems
must solve a correspondence problem per-pixel, making the
computation intractable for many real-time scenarios.
Considerable effort has been spent to design efficient
algorithms for this stereo correspondence problem. The goal
is to infer a disparity d for each pixel in the image, and such
a problem can be cast as a nearest neighbor problem between
the left and the right image in some cost space. Given the
(fixed) geometry of the cameras, runtime cost can be reduced
by searching along an appropriate epipolar line. A stereo
matching algorithm usually depends on two main variables:
the size of the patch p × p used to compute a correlation
function between image patches, and the number of disparity
hypotheses L that need to be tested during the search. Many
related works have tried to remove the dependency on the
window size [5] or on the range of disparities [40], [34],
[27].
Very recently, many O(1) methods have shown remarkable
results [35], [33], [17], [16] in solving the active stereo
problem in real-time. These approaches do not depend on
the window size or number of disparity hypotheses that
need to be tested. However, they all rely on the so-called
fronto-parallel assumption, which requires that the disparity
be constant for a given patch. In practice, this assumption
is quite restrictive, and represents a compromise between
resulting quality and runtime. Some existing works lift
this constraint [5], [10], but they typically require multiple
seconds to process a single frame.
We present SOS (Slanted O(1) Stereo), the first matching
algorithm with slanted support windows that scales linearly
with the resolution of the image. Our method does not
depend on the window size or disparity space and each
pixel computes 70 intensity differences to predict the
final disparity. By dividing the image into multiple non-
overlapping tiles, we can explore the much-larger cost vol-
ume by amortizing the computation across these tiles. This
permits us to remove the dependency on an explicit window
size used to compute the correlation between left and right
patches. Random initialization followed by multiple hierar-
chical aggregation steps creates an algorithm independent of
the size of the disparity search space. Finally, we explicitly
estimate the patch slant and perform subpixel refinement at
each step of the pipeline, leading to very precise reconstruc-
tions. Multiple experiments show high quality results that
outperform other state of the art O(1) methods. Our GPU
implementation runs at over 4000 fps on an NVIDIA Titan X
GPU using 1.3 megapixel images.
II. RELATED WORK
Stereo matching pipelines are traditionally composed of
three principal steps: matching cost computation, dispar-
ity optimization, and refinement [42]. The matching cost
computation defines the distance function used to compare
two p × p patches between the left and right image. These
correlation functions usually depend on an explicit window
size, making the algorithm challenging to scale when the
resolution increases. Some common correlation functions
used are sum of absolute differences (SAD), normalized
cross-correlation (NCC), and the census transform - see [24]
for a comprehensive evaluation. The second step, disparity
optimization, consists of finding the best disparity hypothe-
sis. A naive exhaustive search approach (e.g. block matching)
would be infeasible for high resolution images (a typical
disparity range for 1.3 megapixel images is 0–256).
Depending on the optimization strategy, methods can be
categorized as local [47], [40], [5], [35], global [18], [2],
[31], [33] or semiglobal [23], [4].
Most of these works have inherent dependencies on the
number of disparity hypotheses or the window size used to
compute the cost. For instance, methods relying on so-called
cost volume filtering [40], [34], [27] have a running
time independent of the patch size, since they can efficiently
compute the cost using integral images. They trade this
benefit for a linear dependency on the number of disparities
L, which limits their usefulness for realtime applications
to low resolution images or small disparity ranges. Other
methods [5], [2] leverage a PatchMatch-like scheme [1] to
avoid searching the full disparity space. Unfortunately, their
complexity depends on the patch size used to compute the
matching cost. Recently, O(1) methods have reduced the
overall computation required for stereo by making use of
super-pixels and PatchMatch optimization [35], [33]. Very
recent works have used machine learning to derive an O(1)
depth estimation algorithm [14], [15], [46], [17], [16]. For
example, [15] uses a random forest per scanline to predict
depth in O(1) - but the method requires a tedious calibration
and offline learning for each camera. Similarly, in [17], the
authors use a decision tree to learn a mapping from image
patches to a binary space. The learned mapping is sparse,
therefore independent of the window size. A PatchMatch
framework is then used to optimize over the disparities. In
[16] a per-pixel parallel scheme is proposed to solve a CRF
model in O(1).
These O(1) systems make a strong fronto-parallel as-
sumption, which means that the disparity stays constant in
image patches. However, in real applications this constraint
is typically violated. Previous work that drops the fronto-
parallel assumption [5] performs randomized search for both
disparity and plane normals, but this radically increases the
size of the search space. This increased search space renders
that method suitable only for off-line applications. Others,
such as [10], test multiple per-pixel correlations to estimate
the right slant, and then perform disparity optimization using
standard block matching search. Recent deep learning archi-
tectures [29], [48] can implicitly deal with slanted surfaces
using 3D convolutions in the cost volume, however their
computational requirements are still too demanding even for
high-end GPUs.
III. SOS ALGORITHM
In this section we detail our approach to depth estimation
from triangulation systems. We start from an image pair I_L
and I_R, and we assume that these images are calibrated and
rectified. We use a similar setup to that used in [17], [16]: we
use a custom Kinect-like DOE projector to generate texture
in the scene. We select Ximea monochrome cameras with a
spatial resolution of 1280 ×1024 pixels. These cameras are
capable of running at 210 fps full resolution.
Since our images are rectified, for each pixel (x, y) in
the left image I_L, there is a corresponding pixel (x − d, y)
in the right image, where d is the so-called disparity. In a
triangulation system, the disparity is inversely proportional
to the depth: Z = bf / d, with b and f the baseline and focal
length of the system, respectively.
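For illustration, a minimal NumPy sketch of this disparity-to-depth relation (the baseline and focal length below are placeholder values, not our rig's calibration):

```python
import numpy as np

def disparity_to_depth(disparity, baseline_mm, focal_px):
    """Z = b * f / d, applied per pixel; zero or negative disparities are
    treated as invalid and mapped to depth 0."""
    depth = np.zeros_like(disparity, dtype=np.float64)
    valid = disparity > 0
    depth[valid] = baseline_mm * focal_px / disparity[valid]
    return depth

# Hypothetical calibration values, for illustration only:
d = np.array([[64.0, 32.0], [0.0, 16.0]])
print(disparity_to_depth(d, baseline_mm=55.0, focal_px=1100.0))
```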
In the remainder of the section, we detail our approach to
efficient disparity estimation from stereo. The core insight is
that we can amortize the computation across local image tiles
in a fine-to-coarse pyramid, while estimating both disparity
and slant. We additionally perform multiple subpixel refine-
ment steps across a continuous cost space, producing high
quality results. See Fig. 1 for an overview of the method.
A. Initialization
As mentioned previously, performing exhaustive search
in order to initialize our solution is problematic. In addi-
tion to being computationally prohibitive, exhaustive search
methods typically evaluate only a unary cost - that is, they
do not take solution smoothness into account, and so can
end up in incorrect minima.

Fig. 1. Overview of the proposed algorithm. The input to our method is a pair of calibrated and rectified images. First, we perform a hierarchical
search that estimates an initial tiled disparity map, where the disparity values within each tile follow a planar equation. These initial estimates are then
refined using an efficient inference that encourages local smoothness across tiles. Finally, these refined per-tile estimates are used to infer precise per-pixel
disparity. We refer the reader to Sec. III for details.

Some works [16], [39] indicate that exhaustive search
is not required to obtain state of the art results. Based
on these intuitions, we propose a
hierarchical approach which differs from those proposed in
literature. While it is common to perform a coarse-to-fine
search of the disparity space in order to improve performance
(by restricting cost volume search intervals based on the
result of a higher level in the pyramid), such methods
operate by finding a single local minimum for each patch
and refining it as they recurse down to the pixel level.
Consequently, such techniques can miss even relatively large
features in the input. Instead, we propose an inverted, fine-
to-coarse ranking and aggregation scheme. Starting at a
per-pixel level, we evaluate several (in our implementation,
4) weak hypotheses per pixel using a pixel-wise absolute
difference, storing the winning hypothesis as the input to
our initialization algorithm. We then recursively examine
2×2 non-overlapping elements of the previous level. Each
such element has one winning hypothesis, which we evaluate
across the entire 2×2 tile using sum of absolute differences
(SAD) in pixel space. Candidates are ranked, and the winning
hypothesis per tile is provided as input to the next level in the
recursion, which continues in our implementation until these
tiles are 16×16 pixels wide. Consequently, each level of the
recursion doubles the width and height of tiles, but halves the
number of tiles in each dimension that must be processed,
leading to an O(N) runtime. See Sec. III-E for details on
the computational analysis of the proposed algorithm. The
key insight behind this procedure is that reconstruction error
(here, SAD) for a ‘correct’ disparity is low for all tile sizes
- true negative candidates (i.e. poor reconstruction error and
incorrect disparity) can be quickly rejected at low cost,
and we can sample much of the disparity space in this
manner at the finer levels of the pyramid. False positive (good
reconstruction error, but poor disparity) and true positive
(good reconstruction error, and correct disparity) candidates
are propagated up the hierarchy, where larger and larger tiles
permit removal of false positives while retaining true
positives. In practice, due to the sparsity of the structured
light illuminator used in this paper, we found that 16×16 tiles
provide enough spatial support to filter out the vast majority
of false positives. It is interesting to note that for 16 × 16
tiles, a total of 1024 hypotheses (16 × 16 × 4) are evaluated
(including duplicates), which is well above the maximum
disparity (350) that our hardware system supports. Once this
procedure is complete, we have coarse fronto-parallel depth
tiles for the full image. The next section describes how
the fronto-parallel constraint is relaxed to obtain a refined
initialization.
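For concreteness, the following is a minimal NumPy sketch of this fine-to-coarse scheme, restricted to integer, fronto-parallel disparities (slant and subpixel refinement come later); the Python loops stand in for what is a fully parallel GPU kernel, and the border clamping and sign convention are simplifications rather than the exact implementation:

```python
import numpy as np

def hierarchical_init(left, right, max_disp=350, levels=4, seed=0):
    """Fine-to-coarse initialization sketch. Returns one winning integer
    disparity per 2**levels-wide tile (16x16 tiles here)."""
    rng = np.random.default_rng(seed)
    h, w = left.shape
    L, R = left.astype(np.int32), right.astype(np.int32)

    def tile_sad(ty, tx, size, d):
        """SAD of the size x size tile (ty, tx) at integer disparity d,
        using the x_R = x_L - d convention."""
        y0, x0 = ty * size, tx * size
        x0r = x0 - d
        if x0r < 0 or x0r + size > w:
            return np.inf
        return np.abs(L[y0:y0+size, x0:x0+size]
                      - R[y0:y0+size, x0r:x0r+size]).sum()

    # Level 0: four random hypotheses per pixel; keep the per-pixel winner
    # under a single-pixel absolute difference.
    xs = np.arange(w)[None, :]
    rows = np.arange(h)[:, None]
    best_cost = np.full((h, w), np.iinfo(np.int32).max)
    winner = np.zeros((h, w), dtype=np.int64)
    for _ in range(4):
        cand = rng.integers(0, max_disp, size=(h, w))
        xr = np.clip(xs - cand, 0, w - 1)  # border clamp, for simplicity
        cost = np.abs(L - R[rows, xr])
        better = cost < best_cost
        best_cost[better], winner[better] = cost[better], cand[better]

    # Levels 1..k: each 2x2 group of child winners is re-ranked with SAD
    # over the full (larger) tile; the best candidate survives upward.
    for lvl in range(1, levels + 1):
        size = 2 ** lvl
        th, tw = h // size, w // size
        nxt = np.zeros((th, tw), dtype=np.int64)
        for ty in range(th):
            for tx in range(tw):
                kids = winner[2*ty:2*ty+2, 2*tx:2*tx+2].ravel()
                costs = [tile_sad(ty, tx, size, int(d)) for d in kids]
                nxt[ty, tx] = kids[int(np.argmin(costs))]
        winner = nxt
    return winner
```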
B. Slant Estimation and Subpixel Refinement
At the end of the previous step, each 16 × 16 tile was
assigned a single disparity d. We now refine d via a standard
parabola fit in the cost space. That is, we evaluate the cost
using SAD of the full 16 × 16 tile at disparities d − 1,
d, and d + 1. We fit a parabola to the
local cost space described by those reconstruction errors.
Using the standard closed form expression, the minimum of
this quadratic polynomial is extracted and used as a refined
estimation of d. This refined estimate still corresponds to a
fronto-parallel planar solution, but in the vast majority of
scenes, this fronto-parallel assumption does not hold and
hence must be lifted. We propose increasing the degree of
the model by one, such that geometric structures are
represented by planes in disparity space, rather
than by constant integer values. In fact, since depth and
disparity are inversely proportional to each other, a plane
equation in disparity space corresponds to a smooth quadratic
surface in depth. When considering fronto-parallel depth, the
following relationship defines how pixels in the left image
x_L are related to pixels in the right image x_R:

    x_L = x_R − d                                      (1)

If we instead describe a plane l = [d, d_x, d_y] in disparity
space, the relationship becomes:

    x_L = x_R + S(x_L, l)
    S(x_L, l) = k_x d_x + k_y d_y − d                  (2)

where k_x, k_y are any offset from the patch center and d_x, d_y
are the coefficients controlling the orientation of the plane.
Similarly to the refinement of d, the values of d_x and d_y
are optimized by fitting a parabola to costs computed by
evaluating 3 plane hypotheses on the tile (fronto-parallel,
+30° slant, and −30° slant). The assumption is that
the minimum of the quadratic function is close to the ‘true’
minimum for the vast majority of tiles, which is validated
by the experiments discussed in Section IV-A. Once this
refinement is complete, each tile is associated with a disparity
model that follows a plane equation. The output of this stage
is shown in Fig. 1, initialization column. One can observe
that this coarse initialization provides a solution that already
greatly resembles the final solution in the rightmost column.
Nevertheless, as illustrated in the top right corner of the
16 × 16 result of Fig. 1, some tiles may have arrived at
an incorrect local minimum, an issue which we address with
the following regularization scheme.
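The closed-form parabola refinement used throughout this section can be sketched as follows; the clamping and the flat/non-convex fallback are robustness assumptions, not necessarily the exact handling of our implementation:

```python
def parabola_refine(c_minus, c_zero, c_plus):
    """Subpixel offset of the minimum of the parabola through the costs
    sampled at hypotheses -1, 0, +1 (e.g. disparities d-1, d, d+1); the
    same fit is reused for d_x and d_y with rescaled sample spacing."""
    denom = c_minus - 2.0 * c_zero + c_plus
    if denom <= 0:          # flat or non-convex samples: keep the center
        return 0.0
    offset = 0.5 * (c_minus - c_plus) / denom
    return max(-0.5, min(0.5, offset))   # clamp to the sampled interval

# Costs 10, 4, 6 at d-1, d, d+1 place the minimum at d + 0.25:
print(parabola_refine(10.0, 4.0, 6.0))
```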
C. Propagation and Inference
To tackle the issue of having a few tiles with incoherent
solutions, we resort to using a Conditional Random Field
(CRF). CRFs are in general NP-hard to solve exactly, and
most solvers are iterative by nature, making their compu-
tational requirements de-facto not attractive for real-time
applications. Recently, [16] presented a fast and fully par-
allel inference technique that provides high quality solutions
under the condition that the initial solution are high quality,
which we meet. For completeness, a short description of
their method follows. In a nutshell, the problem is cast in
a probabilistic framework:

    P(Y|D) = (1 / Z(D)) exp(−E(Y|D))                   (3)
and minimized in the log-space:

    E(Y|D) = Σ_i ψ_u(l_i) + Σ_i Σ_{j∈N_i} ψ_p(l_i, l_j)    (4)
In our case, the data term ψ_u(l_i) corresponds to the recon-
struction error for tile i under the planar hypothesis l_i, and
Z(D) is the partition function. In more detail, we define

    ψ_u(l_i) = Σ_{p∈T_i} |I_L(p) − I_R(p_x − S(p_x, l_i), p_y)|    (5)
where the summation is performed over all pixels p from
the set of pixels T_i contained in tile i. The function S(p_x, l_i)
(Eq. 2) estimates the disparity of pixel p under the planar
hypothesis l_i. Finally, I_L(·) and I_R(·) respectively return
the intensity values stored in the left and right images for
the queried pixels. Concretely, the proposed unary potential
evaluates the reconstruction error (SAD) under the planar
hypothesis l_i. Note that this is different from the reconstruc-
tion error evaluated previously, where the error was evaluated
exclusively using fronto-parallel planes.
Our new pairwise potential ψ_p is evaluated over the
neighbors N_i, which correspond to the tiles above, below,
left and right of tile i, and is defined as

    ψ_p(l_i, l_j) = λ min(|l_i^d − S(c(i)_x, l_j)|, 3)        (6)
where c(i) returns the position of the pixel at the center of
tile i, and l_i^d corresponds to the disparity component of the
planar hypothesis l_i. Concretely, this function first evaluates
what the disparity of the center of tile i would be if it
were to belong to the plane l_j. A truncated ℓ1-norm between
that estimated disparity and the current candidate disparity is
then computed. In order not to over-penalize large disparity
changes (e.g. transitions from foreground to background), this
distance is truncated. The parameter λ controls the degree
of smoothness in the solution. The authors of [16] show that
Eq. 4 can be efficiently minimized through an approximation
of mean-field inference, where each minimization step corresponds
to taking the union of the labels associated with the current tile
and its |N_i| neighbors. Each of these 5 candidates is ranked
by evaluating:

    E(Y_i|D) = ψ_u(l_i) + Σ_{j∈N_i} ψ_p(l_i, l_j)             (7)
The minimizer is used as the new disparity value for tile i. In
practice, we found that performing two steps of minimization
is enough to converge to good solutions. Note that each
iteration of the minimization is extremely fast as it is
performed on only 5120 (1280 × 1024 / 16²) variables. Once
the minimization is performed, the disparity at each tile is
refined using another parabola fit leveraging the estimated l_i.
We have found empirically that re-estimating the first derivative
hypotheses via subpixel fitting is not required for high
quality results.
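A minimal sketch of this inference step, assuming planes stored as (d, d_x, d_y) per tile: the unary evaluates the reconstruction error in the spirit of Eq. 5 (nearest-neighbor sampling, standard x_R = x_L − d convention) and the pairwise term follows Eq. 6 with simplified sign conventions, so this illustrates the scheme rather than reproducing the exact GPU kernel:

```python
import numpy as np

TILE = 16       # tile width in pixels
LAMBDA = 1.0    # smoothness weight lambda (placeholder value)
TRUNC = 3.0     # truncation of the pairwise term (Eq. 6)

def unary(left, right, ty, tx, l):
    """psi_u in the spirit of Eq. 5: SAD of tile (ty, tx) under the plane
    l = (d, dx, dy)."""
    h, w = left.shape
    y0, x0 = ty * TILE, tx * TILE
    cy, cx = y0 + (TILE - 1) / 2.0, x0 + (TILE - 1) / 2.0
    err = 0.0
    for py in range(y0, y0 + TILE):
        for px in range(x0, x0 + TILE):
            disp = l[0] + (px - cx) * l[1] + (py - cy) * l[2]
            xr = int(round(px - disp))
            err += (abs(float(left[py, px]) - float(right[py, xr]))
                    if 0 <= xr < w else 255.0)  # penalize out-of-bounds
    return err

def pairwise(d_i, l_j, dty, dtx):
    """psi_p (Eq. 6): truncated L1 between the candidate disparity d_i and
    the disparity the neighbor plane l_j predicts at this tile's center."""
    pred = l_j[0] + dtx * TILE * l_j[1] + dty * TILE * l_j[2]
    return LAMBDA * min(abs(d_i - pred), TRUNC)

def propagation_step(left, right, labels):
    """One parallel sweep of the approximate mean-field minimization: each
    tile ranks its own plane plus its 4 neighbors' planes (Eq. 7) and keeps
    the minimizer; the paper runs two such sweeps."""
    th, tw, _ = labels.shape
    offs = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    out = labels.copy()
    for ty in range(th):
        for tx in range(tw):
            best_cost, best_l = np.inf, labels[ty, tx]
            for oy, ox in offs:                    # candidate label set
                ny, nx = ty + oy, tx + ox
                if not (0 <= ny < th and 0 <= nx < tw):
                    continue
                l = labels[ny, nx]
                cost = unary(left, right, ty, tx, l)
                for ey, ex in offs[1:]:            # pairwise vs. neighbors
                    qy, qx = ty + ey, tx + ex
                    if 0 <= qy < th and 0 <= qx < tw:
                        cost += pairwise(l[0], labels[qy, qx], -ey, -ex)
                if cost < best_cost:
                    best_cost, best_l = cost, l
            out[ty, tx] = best_l
    return out
```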
D. Per-pixel estimation
After propagation is complete, we have a robust estimate
of the disparity and slant for each 16×16 tile. The quadratic
approximation used to estimate the slant of tiles is robust
in the [−30°, 30°] range, and this range of angles provides in-
ferred solutions of much higher quality than assuming
fronto-parallel tiles. Unfortunately, real surfaces can be much
steeper, and the quadratic approximation becomes weaker
for angles much larger than 30°, as demonstrated in the
experiments. To circumvent this limitation, we compute the
final slant of each patch using central differences built from
the disparity at the center of neighboring tiles.
We now leverage the above initialization to obtain precise
per-pixel results. First, each tile is ‘expanded’ by 50% in
both x and y directions - causing any given pixel (except
at the image boundaries) to overlap 4 expanded tiles. For
each expanded tile, we build an integral ‘tile’ of the re-
construction error (SAD) obtained using the corresponding
plane hypothesis l_i. We build two other integral ‘tiles’ per
expanded tile, which capture the reconstruction error with a
small delta added to the disparity component of l_i. For every
pixel, we can now perform 4 parabola fits of their respective
cost volumes using the integral tiles described above. The
cost of each pixel is again defined as the reconstruction
error, but is computed over 11 × 11 patches centered on the
pixel in question. The solution with the smallest interpolated
reconstruction error is used as the final disparity estimate.
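The integral-image trick that makes these 11 × 11 window costs O(1) per query can be sketched as follows (the random error map is a stand-in for the per-pixel reconstruction error of one plane hypothesis):

```python
import numpy as np

def integral_image(err):
    """Summed-area table with a zero top row / left column for easy queries."""
    ii = np.zeros((err.shape[0] + 1, err.shape[1] + 1))
    ii[1:, 1:] = err.cumsum(0).cumsum(1)
    return ii

def window_cost(ii, y, x, radius=5):
    """Sum of the error map over the (2*radius+1)^2 window centered at
    (y, x), in O(1) regardless of the window size."""
    y0, y1 = max(y - radius, 0), min(y + radius + 1, ii.shape[0] - 1)
    x0, x1 = max(x - radius, 0), min(x + radius + 1, ii.shape[1] - 1)
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

# Stand-in for |I_L - warp(I_R)| of one expanded tile under one hypothesis:
per_pixel_err = np.random.rand(48, 48)
ii = integral_image(per_pixel_err)
print(window_cost(ii, 24, 24))  # cost of the 11x11 patch centered at (24, 24)
```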
a) Invalidation: Given the inherent physical limitations
of an active stereo system (e.g. poor signal to noise ratio
on dark or far surfaces, occlusions), for some pixels not
enough data is available to perform robust estimates.
Traditional methods of invalidation involve left-right consis-
tency checks, filtering, and/or connected component analysis
- all computationally expensive methods. Instead, we make
use of byproducts of our method to perform fast and precise
invalidation. More precisely, we do not consider tiles that
have slants greater than 75 degrees, and we invalidate pixels
after refinement when the final SAD cost is higher than a pre-
defined threshold θ. We found empirically that this simple
invalidation strategy produces clean results and requires
minimal computation.
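In code, this invalidation amounts to two cheap per-pixel tests; a sketch with same-shape NumPy arrays as inputs, and with θ a placeholder value rather than the tuned threshold:

```python
def invalidate(disparity, tile_slant_deg, final_sad, max_slant_deg=75.0,
               theta=8.0):
    """Drop pixels whose tile slant exceeds 75 degrees or whose final SAD
    exceeds theta; invalid pixels are marked with disparity 0."""
    out = disparity.copy()
    out[(tile_slant_deg > max_slant_deg) | (final_sad > theta)] = 0.0
    return out
```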
E. Computational Analysis
In this section, we evaluate the overall computational com-
plexity of our depth algorithm. Let us assume an input image
containing N pixels, and suppose that a set L of discrete integer
disparity labels is permitted. Typical values in practice for
such algorithms are N = 1280 × 1024 and |L| = 512. We
proceed through all stages of our algorithm and justify why
each is O(N).
Initialization evaluates 4 random hypotheses per pixel at a
cost of O(1) per pixel. Next, we ascend a k-level hierarchy,
accumulating and re-evaluating hypotheses as we go. At each
level l, we process the image in tiles of 2^l × 2^l pixels. Because
the tiles are non-overlapping, and the cost of the cost function
at each level is exactly proportional to the size of the tile,
the computational cost for each tile scales up at the same
rate that the number of tiles scales down. Consequently, we
require O(kN) to perform this step. In our implementation,
we use k = 4 levels, but it is technically the case that a
proper value of k is dependent on log₂|L|. This dependency
occurs because we are sampling disparities randomly (and is
the same dependency incurred by any method which samples
the cost space at random [39]): in order to obtain a fixed
probability of having the correct integer disparity within a
tile, we must increase the number of samples as we increase
the size of the disparity range. Fortunately, for practical
implementations, log₂|L| is essentially a small constant (e.g.
if |L| = 512, then log₂|L| = 9). Once this initialization is
complete, the highest-level cost function is evaluated again
for fitting d_x and d_y, a process which is O(1) per tile, and
thus O(N) overall.
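As a worked check of this analysis, each hierarchy level performs roughly the same total work, since per-tile SAD cost grows exactly as fast as the tile count shrinks (the 4 candidates per tile come from the 2×2 merge of child winners):

```python
# Work per hierarchy level: tile count shrinks exactly as fast as the cost
# of one full-tile SAD grows, so every level costs the same ~4N operations.
N = 1280 * 1024                      # pixels in the input image
for lvl in range(1, 5):              # k = 4 levels of 2^l x 2^l tiles
    tile_px = (2 ** lvl) ** 2        # cost of one full-tile SAD evaluation
    n_tiles = N // tile_px           # number of tiles at this level
    candidates = 4                   # winners of the four merged children
    print(lvl, n_tiles * candidates * tile_px)  # 4N = 5242880 at every level
```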
Propagation occurs over two passes, each performed on the
coarsest tiles. Here, too, our cost function is proportional
in cost to tile size. We evaluate 4+1 hypotheses per tile and
select the best - again at a constant cost per tile. Consequently,
this step requires O(5N) operations.
Per-pixel prediction and invalidation examine four hy-
potheses per pixel, and evaluate the cost for each using SAD
with an 11 × 11 patch per pixel. While naively this approach
would induce an additional patch size dependency, we can
amortize the computation required by generating an integral
image for the plane hypothesis of each tile. Suppose we have
per-pixel patches of width p and tiles of width w. We need
the integral image to be usable by all pixels that count this
tile as one of their four nearest tiles, so we must scale it
to overlap half of each of the nearby tiles. Consequently,
the integral image width per tile is 2w + p, so under the
assumption (which holds for our implementation) that p is
of the same order as w, we remain O(N) for the full image.
Invalidation is a single pass over the image that removes
patches with high-magnitude slants or high matching cost,
and is therefore O(N).

Fig. 2. Distribution of angular error in performing plane fits for the first
derivative of tiles. See text for details.

Fig. 3. Qualitative comparisons with other state of the art O(1) methods
(UltraStereo [17], HashMatch [16]) and slow slanted windows approaches
(PatchMatch Stereo [5]). Note how we provide smoother results, better
invalidation and less noise.
IV. EVALUATION
In this section we evaluate the algorithm under various
challenging conditions, and compare the results with state of
the art methods. For all our experiments we use the active
stereo setup described in Sec. III.
A. Cost Space Analysis
Here we validate the use of a quadratic model to estimate
the first derivative dx, dyin each tile. Quadratic model
minimization is well-established in the stereo literature for
subpixel refinement on fronto-parallel disparities d (for in-
stance, in [24]), but it is not immediately obvious that such an
approach would also work for estimating the first derivatives
of the disparity space.
We recall that our method works by evaluating the cost
function on 16×16 tiles, coarsely estimating the first derivatives
of the cost space, and then fitting a quadratic model per-
pixel to estimate the solution with minimum cost. Such
an approach inherently makes the assumption that the cost
space is locally smooth. While a naive approach to this
fitting would result in blocky artifacts around tile boundaries,
our refinement step (Sec. III-D) prevents such artifacts by
permitting multiple local models to be explored per-pixel.
Fig. 4. Qualitative comparisons of single frame reconstruction with state of
the art local methods [5] and [16]. Notice how our method exhibits smoother
single shot reconstruction, less edge fattening and higher level of details at
over 4000 fps.
Intuitively, assigning a slant value to a patch in the
image acts as a shear/stretch on the patch. We find that in
our system, typical shears/stretches are subpixel in size -
which suggests that a cost that is directly associated with
reconstruction error (such as SAD) should perform well.
To evaluate the quality of this approximation, we sweep
either d_x or d_y across the full valid range K = [−k, k]
to exhaustively sample the cost space. We evaluate the
approximation loss (i.e. the difference between the result
of our parabola fit and the minimum reached by exhaustive
search) for k = 30° and k = 75°, as shown in Fig. 2. We
can observe that the distribution of angular error improves
when restricting local plane fits to k = 30° as opposed to
k = 75°, with 90% of tiles tested within 6.3° of the true
minimum. While increasing k to 75° allows steeper planes to
be initially estimated, this risks attempting to match a highly
distorted patch - such highly oblique surfaces may sample the
same row or column repeatedly, or skip pixels when reading
the associated patch. We find that restricting k to 30° does
not significantly damage result quality and, because of the
use of central differences in the refinement step later in the
process, does not prevent generation of steeper planes. Since
any deviation from a convex cost space impacts the quality
of the fit, we additionally examine the convexity of such cost
spaces. We find that true convexity is rare, with only 5.5% of
tiles having perfectly convex cost spaces. We instead observe
empirically that when sampled every 2.5°, 82% of tiles have
quasiconvex cost spaces.
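This measurement can be sketched as follows, with a synthetic quasiconvex curve standing in for a real tile's cost slice (the functional form is invented for illustration):

```python
import numpy as np

def approximation_loss(cost_fn, k_deg=30.0, step_deg=2.5):
    """Angular gap between the 3-sample parabola-fit minimum (samples at
    -k, 0, +k) and an exhaustive sweep of one slant coefficient."""
    sweep = np.arange(-k_deg, k_deg + step_deg, step_deg)
    exhaustive_min = sweep[np.argmin([cost_fn(a) for a in sweep])]
    c = [cost_fn(-k_deg), cost_fn(0.0), cost_fn(k_deg)]
    denom = c[0] - 2.0 * c[1] + c[2]
    fit_min = 0.0 if denom <= 0 else k_deg * 0.5 * (c[0] - c[2]) / denom
    return abs(fit_min - exhaustive_min)

# Synthetic quasiconvex cost slice with its true minimum near +12 degrees:
print(approximation_loss(lambda a: (a - 12.0) ** 2 + 3.0 * abs(a - 12.0)))
```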
B. Qualitative Evaluation
In this section, we provide qualitative comparisons with
state of the art O(1) methods [17], [16], as well as with
algorithms that use slanted support windows [5]. We use the
same data used in [16] in order to provide a fair comparison.
The data consists of multiple shots of complex scenes
including people, objects, furniture and slanted planes. In
Fig. 3, we show results for the baseline methods. Notice
how O(1) methods such as UltraStereo [17] and HashMatch
[16] produce noisier results, as they explicitly make fronto-
parallel assumptions. The state of the art method that uses
slanted windows, PatchMatch Stereo [5], does not use an
explicit smoothness term that exploits non-fronto-parallel
surfaces. In contrast, our method is able to model slanted
windows up to 75° while producing results containing less
noise. For instance, examine the sofa in the first row of Fig.
3, the floor in the second, and the panels behind the person
in the third.

Fig. 5. Quantitative comparisons. Depth bias (average absolute error) and
depth jitter (standard deviation) of different depth algorithms with respect
to the distance from a flat target. Note how the proposed method provides
stronger results across the whole range of depth values.
a) Single Shot Reconstruction: traditional methods typ-
ically require accumulation of observations over time [25],
and also usually rely on a moving camera or moving objects in
order to average to the correct depth value. This needs an
additional tracking step, which ultimately may lead to failure
in the reconstruction. To better demonstrate the quality of
our method, we perform a single shot reconstruction of an
object at approximately 1m away from the camera. In Fig.
4, note how SOS generates a mesh that preserves details and
generates smooth results. In comparison, other algorithms
suffer from noisier predictions.
C. Quantitative Analysis
In this section we design quantitative experiments in order
to precisely evaluate the error of SOS. We first analyze
depth bias and jitter, then we design a more sophisticated ex-
periment using ground truth generated via a LIDAR scanner.
We compare our method with a state of the art O(1) method
(HashMatch [16]) as well as with PatchMatch Stereo [5],
which explicitly models slanted surfaces. Note that the fastest
of these techniques is HashMatch, which runs at 1000fps on
GPU, and that the proposed method runs at 4000fps on the
same GPU.
a) Bias and Jitter: using our calibrated stereo setup, we
recorded multiple shots of a flat target at multiple distances.
We start from 500 mm up to 3500 mm, which is a reasonable
range for indoor scenarios. We can compute ground truth by
robust plane fitting of the depth data, and use the equation
of the fitted planes to compute metrics. We define the depth
bias as the average absolute error over multiple frames, and
the depth jitter as the standard deviation over these frames. This
is similar to the setup used in [16] and [17].

Fig. 6. Single shot scans with LIDAR ground truth. Error maps and RMSE
are reported for each method.
In Fig. 5, we show the results of this experiment. Notice
how our error is nearly half that obtained by the baseline
state of the art methods. In addition, our algorithm exhibits
significantly lower levels of noise.
b) Single Shot Analysis: in this experiment we eval-
uate single shot depthmaps using groundtruth generated
with a LIDAR scanner. We recorded 4 objects placed at
approximately 800 mm from the camera. We align the
groundtruth depthmaps generated with the LIDAR sensor to
our depthmaps using rigid ICP. We compute the root mean
square error (RMSE) between groundtruth and predictions
generated with PatchMatch Stereo [5], Hashmatch [16] and
SOS. We calculate the error only in a small ROI around the
objects. In Fig. 6 we report the results. Our method exhibits
the lowest error on most of the objects. This demonstrates that
the plane model we use does not compromise the reconstruction
of complex structures, while being 4× faster than the current
state of the art method.
Fig. 7. Slant Experiment. We recorded a sequence of a plane with multiple
orientations and computed the average error. SOS achieves the best results
not only quantitatively but also visually.
TABLE I
ERRORS IN MM FOR DIFFERENT SLANT SURFACES.

Slant / Algorithm    PM Stereo [5]   HashMatch [16]   SOS (no slant)   SOS
Fronto-Parallel      0.45            0.63             0.44             0.31
25° Horizontal       0.53            0.56             0.53             0.28
45° Horizontal       0.56            0.47             0.45             0.22
60° Horizontal       0.81            0.54             0.61             0.24
75° Horizontal       0.82            0.73             0.65             0.52
25° Vertical         0.47            0.38             0.45             0.21
45° Vertical         0.61            0.43             0.46             0.17
60° Vertical         1.1             0.7              0.71             0.26
75° Vertical         1.48            1.01             1.1              0.56
D. Slanted Surfaces Analysis
Our main contribution is the capability of estimating
slanted surfaces in O(1). Here we design an experiment
specifically to evaluate this component. We record multiple
frames of a planar checkerboard at about 500 mm distance
from the camera. We rotate the checkerboard to cover the
full space: from fronto-parallel up to 75slant in both
horizontal and vertical orientations. We use robust plane
fitting to estimate the groundtruth plane selecting a small
ROI around the object. We show some qualitative results in
Fig. 7: we compare the algorithm with the other baselines.
Notice how we are able to support even extreme slants,
whereas the baseline techniques suffer from increased error.
We additionally computed the average error with respect
to the groundtruth planes and report results in Tab. I. Our
algorithm consistently outperforms the other approaches.
Finally, notice the contribution of the explicit slant estimation
(i.e. d_x and d_y) in Tab. I, third column. One can observe that
optimizing for the slant coefficients significantly improves
the precision of the results obtained by the proposed method,
while only costing a negligible amount of computation.
V. CONCLUSION
In this paper, we presented SOS, the first low computation
algorithm capable of leveraging slanted support windows.
By using the proposed hierarchical initialization scheme, our
technique is capable of quickly extracting high quality initial
disparity estimates per pixel. These initial candidates are
then improved through continuous refinement, followed by an
invalidation step. Each of these steps is extremely efficient,
allowing the whole pipeline to run at 4000 fps on GPU.
Through extensive experiments, we have demonstrated that
the proposed method yields solutions that are superior to the
state of the art, while requiring significantly less compute.
REFERENCES
[1] C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman. Patch-
Match: A randomized correspondence algorithm for structural image
editing. ACM SIGGRAPH and Transaction On Graphics, 2009.
[2] F. Besse, C. Rother, A. Fitzgibbon, and J. Kautz. PMBP: Patchmatch
belief propagation for correspondence field estimation. IJCV, 2014.
[3] A. Bhandari, A. Kadambi, R. Whyte, C. Barsi, M. Feigin, A. Dorring-
ton, and R. Raskar. Resolving multi-path interference in time-of-flight
imaging via modulation frequency diversity and sparse regularization.
CoRR, 2014.
[4] M. Bleyer and M. Gelautz. Simple but effective tree structures for
dynamic programming-based stereo matching. In VISAPP, 2008.
[5] M. Bleyer, C. Rhemann, and C. Rother. PatchMatch Stereo - Stereo
Matching with Slanted Support Windows. In BMVC, 2011.
[6] M. Bleyer, C. Rhemann, and C. Rother. Extracting 3d scene-consistent
object proposals and depth from stereo images. In ECCV, 2012.
[7] C. Ciliberto, S. R. Fanello, L. Natale, and G. Metta. A heteroscedastic
approach to independent motion detection for actuated visual sensors.
In IROS, 2012.
[8] M. Dou, P. Davidson, S. R. Fanello, S. Khamis, A. Kowdle, C. Rhe-
mann, V. Tankovich, and S. Izadi. Motion2fusion: Real-time volumet-
ric performance capture. SIGGRAPH Asia, 2017.
[9] M. Dou, S. Khamis, Y. Degtyarev, P. Davidson, S. R. Fanello,
A. Kowdle, S. O. Escolano, C. Rhemann, D. Kim, J. Taylor, P. Kohli,
V. Tankovich, and S. Izadi. Fusion4d: Real-time performance capture
of challenging scenes. SIGGRAPH, 2016.
[10] N. Einecke and J. Eggert. Block-matching stereo with relaxed
fronto-parallel assumption. In IEEE Intelligent Vehicles Symposium
Proceedings, 2014.
[11] S. Fanello, I. Gori, G. Metta, and F. Odone. One-shot learning for
real-time action recognition. In IbPRIA, 2013.
[12] S. Fanello, U. Pattacini, I. Gori, V. Tikhanoff, M. Randazzo, A. Ron-
cone, F. Odone, and G. Metta. 3d stereo estimation and fully automated
learning of eye-hand coordination in humanoid robots. In IEEE-RAS
International Conference on Humanoid Robots, 2014.
[13] S. R. Fanello, I. Gori, G. Metta, and F. Odone. Keep it simple and
sparse: Real-time action recognition. JMLR, 2013.
[14] S. R. Fanello, C. Keskin, S. Izadi, P. Kohli, D. Kim, D. Sweeney,
A. Criminisi, J. Shotton, S. Kang, and T. Paek. Learning to be a
depth camera for close-range human capture and interaction. ACM
SIGGRAPH and Transaction On Graphics, 2014.
[15] S. R. Fanello, C. Rhemann, V. Tankovich, A. Kowdle, S. Orts Es-
colano, D. Kim, and S. Izadi. Hyperdepth: Learning depth from
structured light without matching. In CVPR, 2016.
[16] S. R. Fanello, J. Valentin, A. Kowdle, C. Rhemann, V. Tankovich,
C. Ciliberto, P. Davidson, and S. Izadi. Low compute and fully parallel
computer vision with hashmatch. ICCV, 2017.
[17] S. R. Fanello, J. Valentin, C. Rhemann, A. Kowdle, V. Tankovich,
P. Davidson, and S. Izadi. Ultrastereo: Efficient learning-based
matching for active stereo systems. CVPR, 2017.
[18] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient belief propagation
for early vision. IJCV, 2006.
[19] D. Freedman, E. Krupka, Y. Smolin, I. Leichter, and M. Schmidt.
SRA: fast removal of general multipath for tof sensors. ECCV, 2014.
[20] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski. Manhattan-
world stereo. In 2009 IEEE Conference on Computer Vision and
Pattern Recognition, 2009.
[21] J. Geng. Structured-light 3d surface imaging: a tutorial. Advances in
Optics and Photonics, 3(2):128–160, 2011.
[22] I. Gori, U. Pattacini, V. Tikhanoff, and G. Metta. Ranking the
good points: A comprehensive method for humanoid robots to grasp
unknown objects. In IEEE ICAR, 2013.
[23] H. Hirschmüller. Stereo processing by semiglobal matching and
mutual information. PAMI, 2008.
[24] H. Hirschmuller and D. Scharstein. Evaluation of stereo matching
costs on images with radiometric differences. IEEE PAMI, 2009.
[25] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli,
J. Shotton, S. Hodges, D. Freeman, A. Davison, and A. Fitzgibbon.
KinectFusion: Real-time 3D reconstruction and interaction using a
moving depth camera. In ACM UIST, 2011.
[26] D. Jimenez, D. Pizarro, M. Mazo, and S. Palazuelos. Modelling and
correction of multipath interference in time of flight cameras. In
CVPR, 2012.
[27] M. Ju and H. Kang. Constant time stereo matching. In IMVIP, 2009.
[28] C. Keskin, F. Kıraç, Y. Kara, and L. Akarun. Hand pose estimation
and hand shape classification using multi-layered randomized decision
forests. In ECCV, 2012.
[29] S. Khamis, S. Fanello, C. Rhemann, J. Valentin, A. Kowdle, and
S. Izadi. Stereonet: Guided hierarchical refinement for real-time edge-
aware depth prediction. In ECCV, 2018.
[30] K. Konolige. Projected texture stereo. In ICRA. IEEE, 2010.
[31] P. Krähenbühl and V. Koltun. Efficient inference in fully connected
CRFs with Gaussian edge potentials. NIPS, 2011.
[32] Y. Li, A. Dai, L. Guibas, and M. Nießner. Database-assisted object
retrieval for real-time 3d reconstruction. In Computer Graphics Forum,
volume 34. Wiley Online Library, 2015.
[33] Y. Li, D. Min, M. S. Brown, M. N. Do, and J. Lu. Spm-bp: Sped-up
patchmatch belief propagation for continuous mrfs. In ICCV, 2015.
[34] J. Lu, K. Shi, D. Min, L. Lin, and M. Do. Cross-based local multipoint
filtering. In Proc. CVPR, 2012.
[35] J. Lu, H. Yang, D. Min, and M. Do. Patchmatch filter: Efficient edge-
aware filtering meets randomized search for fast correspondence field
estimation. In Proc. CVPR, 2013.
[36] N. Naik, A. Kadambi, C. Rhemann, S. Izadi, R. Raskar, and S. Kang.
A light transport model for mitigating multipath interference in TOF
sensors. CVPR, 2015.
[37] H. Nishihara. PRISM: A practical real-time imaging stereo matcher.
MIT AI Memo No. 780, Cambridge, Mass., USA, 1984.
[38] S. Orts-Escolano, C. Rhemann, S. Fanello, W. Chang, A. Kowdle,
Y. Degtyarev, D. Kim, P. L. Davidson, S. Khamis, M. Dou, et al.
Holoportation: Virtual 3d teleportation in real-time. In UIST. ACM,
2016.
[39] V. Pradeep, C. Rhemann, S. Izadi, C. Zach, M. Bleyer, and S. Bathiche.
Monofusion: Real-time 3d reconstruction of small scenes with a single
web camera. In ISMAR, 2013.
[40] C. Rhemann, A. Hosni, M. Bleyer, C. Rother, and M. Gelautz. Fast
cost-volume filtering for visual correspondence and beyond. In CVPR,
pages 3017–3024, 2011.
[41] J. Salvi, S. Fernandez, T. Pribanic, and X. Llado. A state of the art
in structured light patterns for surface profilometry. PR, 2010.
[42] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense
two-frame stereo correspondence algorithms. IJCV, 2002.
[43] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore,
A. Kipman, and A. Blake. Real-time human pose recognition in parts
from single depth images. In CVPR, 2011.
[44] J. Taylor, L. Bordeaux, T. Cashman, B. Corish, C. Keskin, T. Sharp,
E. Soto, D. Sweeney, J. Valentin, B. Luff, A. Topalian, E. Wood,
S. Khamis, P. Kohli, S. Izadi, R. Banks, A. Fitzgibbon, and J. Shotton.
Efficient and precise interactive hand tracking through joint, continu-
ous optimization of pose and correspondences. SIGGRAPH, 2016.
[45] J. Valentin, V. Vineet, M.-M. Cheng, D. Kim, J. Shotton, P. Kohli,
M. Nießner, A. Criminisi, S. Izadi, and P. Torr. Semanticpaint: Inter-
active 3d labeling and learning at your fingertips. ACM Transactions
on Graphics (TOG), 2015.
[46] S. Wang, S. R. Fanello, C. Rhemann, S. Izadi, and P. Kohli. The
global patch collider. CVPR, 2016.
[47] K.-J. Yoon and I.-S. Kweon. Locally adaptive support-weight approach
for visual correspondence search. In Computer Vision and Pattern
Recognition, 2005. CVPR 2005. IEEE Computer Society Conference
on, volume 2, pages 924–931. IEEE, 2005.
[48] Y. Zhang, S. Khamis, C. Rhemann, J. Valentin, A. Kowdle,
V. Tankovich, M. Schoenberg, S. Izadi, T. Funkhouser, and S. Fanello.
Activestereonet: End-to-end self-supervised learning for active stereo
systems. In ECCV, 2018.
... Recently, the slanted O(1) stereo (SOS) algorithm is proposed for ultrafast stereo matching [19]. SOS involves grid cell structure like LocalExp. ...
... Moreover, HashMatch [20] based propagation and inference are used to efficiently fix the matching errors in initialization. The computational complexity of SOS is proved to be O(1) and according to [19], the theoretical throughput can be as high as 4000 frames per second (FPS) on modern high-end GPU. However, the algorithm is tested on a self-made dataset with a small region of interest (ROI) in original paper [19]. ...
... The computational complexity of SOS is proved to be O(1) and according to [19], the theoretical throughput can be as high as 4000 frames per second (FPS) on modern high-end GPU. However, the algorithm is tested on a self-made dataset with a small region of interest (ROI) in original paper [19]. It is found that the SOS algorithm generates relatively poor accuracy performance on public datasets such as KITTI2015 [21]. ...
... However, the downsampling of the cost volume comes at the price of sacrificing accuracy. Multiple recent stereo matching methods [49,11,27] have increased the efficiency of disparity estimation for active stereo while maintaining a high level of accuracy. These methods are mainly built on three intuitions: 1 the use of compact/sparse features for fast high resolution matching cost computation; 2 very efficient disparity optimization schemes that do not rely on the full cost volume; 3 iterative image warps using slanted planes to achieve high accuracy by minimizing image dissimilarity. ...
... Our method is highly inspired by classical stereo matching methods, which aim at propagating good sparse matches [11,12,49]. In particular, Tankovich et al. [49] proposed a hierarchical algorithm that makes use of slanted support windows to amortize the matching cost computation in tiles. ...
... Our method is highly inspired by classical stereo matching methods, which aim at propagating good sparse matches [11,12,49]. In particular, Tankovich et al. [49] proposed a hierarchical algorithm that makes use of slanted support windows to amortize the matching cost computation in tiles. Inspired by this work, we propose an end-to-end approach that overcomes the issues of the hand-crafted algorithms, while maintaining computational efficiency. ...
Preprint
This paper presents HITNet, a novel neural network architecture for real-time stereo matching. Contrary to many recent neural network approaches that operate on a full cost volume and rely on 3D convolutions, our approach does not explicitly build a volume and instead relies on a fast multi-resolution initialization step, differentiable 2D geometric propagation and warping mechanisms to infer disparity hypotheses. To achieve a high level of accuracy, our network not only geometrically reasons about disparities but also infers slanted plane hypotheses allowing to more accurately perform geometric warping and upsampling operations. Our architecture is inherently multi-resolution allowing the propagation of information at different levels. Multiple experiments prove the effectiveness of the proposed approach at a fraction of the computation required by recent state-of-the-art methods. At time of writing, HITNet ranks 1st-3rd on all the metrics published on the ETH3D website for two view stereo and ranks 1st on the popular KITTI 2012 and 2015 benchmarks among the published methods faster than 100ms.
... By comparing the left and the warped right feature map or 2D image, 2D convolution predict the current errors, which is subsequently refined [23]. HITNet [30] adopts the slanted-plane principle [31] enhancing the quality during up-sampling through a coarse-to-fine design. Alternatively, a full-cost volume is transformed into a thinner volume, by limiting the disparity of each pixel, which in the end is equivalent to image warping [19,28]. ...
Preprint
Several leading methods on public benchmarks for depth-from-stereo rely on memory-demanding 4D cost volumes and computationally intensive 3D convolutions for feature matching. We suggest a new way to process the 4D cost volume where we merge two different concepts in one deeply integrated framework to achieve a symbiotic relationship. A feature matching part is responsible for identifying matching pixels pairs along the baseline while a concurrent image volume part is inspired by depth-from-mono CNNs. However, instead of predicting depth directly from image features, it provides additional context to resolve ambiguities during pixel matching. More technically, the processing of the 4D cost volume is separated into a 2D propagation and a 3D propagation part. Starting from feature maps of the left image, the 2D propagation assists the 3D propagation part of the cost volume at different layers by adding visual features to the geometric context. By combining both parts, we can safely reduce the scale of 3D convolution layers in the matching part without sacrificing accuracy. Experiments demonstrate that our end-to-end trained CNN is ranked 2nd on KITTI2012 and ETH3D benchmarks while being significantly faster than the 1st-ranked method. Furthermore, we notice that the coupling of image and matching-volume improves fine-scale details as demonstrated by our qualitative analysis.
... There are many planes in urban remote sensing images. Some studies using 'Patch-Match' have considered plane constraints [35][36][37]. These methods estimated the plane using sparse matching in advance, and then the plane constraints were then incorporated into the cost function. ...
Article
Full-text available
Objects in satellite remote sensing image sequences often have large deformations, and the stereo matching of this kind of image is so difficult that the matching rate generally drops. A disparity refinement method is needed to correct and fill the disparity. A method for disparity refinement based on the results of plane segmentation is proposed in this paper. The plane segmentation algorithm includes two steps: Initial segmentation based on mean-shift and alpha-expansion-based energy minimization. According to the results of plane segmentation and fitting, the disparity is refined by filling missed matching regions and removing outliers. The experimental results showed that the proposed plane segmentation method could not only accurately fit the plane in the presence of noise but also approximate the surface by plane combination. After the proposed plane segmentation method was applied to the disparity refinement of remote sensing images, many missed matches were filled, and the elevation errors were reduced. This proved that the proposed algorithm was effective. For difficult evaluations resulting from significant variations in remote sensing images of different satellites, the edge matching rate and the edge matching map are proposed as new stereo matching evaluation and analysis tools. Experiment results showed that they were easy to use, intuitive, and effective.
Article
Matching cost aggregation plays a critical role for the stereo matching task. Existing CNN-based methods commonly use 3D convolutions to aggregate matching costs from a local 3D space. However, the high computational cost of 3D convolutions limits their applications. Traditional methods show that a slanted support window in 3D space can help to aggregate matching costs from informative regions, i.e., the surfaces of objects. Motivated by this idea, we propose a SPNet with differentiable slanted plane aggregation. Our slanted plane aggregation layers aggregate matching costs from a learnable slanted plane in a local 3D space to reduce computational and memory costs. Experimental results show that our slanted plane aggregation layers can learn to fit the surfaces of objects and effectively aggregate matching costs. Comparison with previous stereo matching methods shows that our network achieves competitive performance with higher efficiency.
Article
This paper presents a field programmable gate array (FPGA)-based, high-performance and energy-resource efficient stereo matching processor. The proposed processor executes block-level PatchMatch-based stereo matching algorithm with a random search strategy to avoid estimation of all disparity levels. To take advantages of different block scales, a coarse-to-fine multi-scale propagation (MSP) scheme is proposed for label update. Based on that, a dedicated hardware architecture is further proposed to explore the benefit of the algorithm. Experimental results show that the proposed FPGA-based processor, running at 350MHz, achieves a peak performance of $1920\times 1080.165$ .7 frame per second (FPS) at 128 disparity levels with 3.35W power dissipation. The energy and resource efficiency of the proposed design outperforms state-of-the-art FPGA-based stereo matching processors. When disparity level increases to 256, the computing resource increment of the proposed design is much less than existing designs because random search instead of winner-takes-all (WTA) is utilized. Moreover, unlike existing dedicated stereo matching processors which output only disparity information, the proposed design is also capable of deriving plane slant. This information can be beneficial for follow-up tasks like 3D reconstruction.
Article
The increasing demand for 3D content in augmented and virtual reality has motivated the development of volumetric performance capture systemsnsuch as the Light Stage. Recent advances are pushing free viewpoint relightable videos of dynamic human performances closer to photorealistic quality. However, despite significant efforts, these sophisticated systems are limited by reconstruction and rendering algorithms which do not fully model complex 3D structures and higher order light transport effects such as global illumination and sub-surface scattering. In this paper, we propose a system that combines traditional geometric pipelines with a neural rendering scheme to generate photorealistic renderings of dynamic performances under desired viewpoint and lighting. Our system leverages deep neural networks that model the classical rendering process to learn implicit features that represent the view-dependent appearance of the subject independent of the geometry layout, allowing for generalization to unseen subject poses and even novel subject identity. Detailed experiments and comparisons demonstrate the efficacy and versatility of our method to generate high-quality results, significantly outperforming the existing state-of-the-art solutions.
Chapter
Computational stereo has reached a high level of accuracy, but degrades in the presence of occlusions, repeated textures, and correspondence errors along edges. We present a novel approach based on neural networks for depth estimation that combines stereo from dual cameras with stereo from a dual-pixel sensor, which is increasingly common on consumer cameras. Our network uses a novel architecture to fuse these two sources of information and can overcome the above-mentioned limitations of pure binocular stereo matching. Our method provides a dense depth map with sharp edges, which is crucial for computational photography applications like synthetic shallow-depth-of-field or 3D Photos. Additionally, we avoid the inherent ambiguity due to the aperture problem in stereo cameras by designing the stereo baseline to be orthogonal to the dual-pixel baseline. We present experiments and comparisons with state-of-the-art approaches to show that our method offers a substantial improvement over previous works.
Conference Paper
In this paper we present ActiveStereoNet, the first deep learning solution for active stereo systems. Due to the lack of ground truth, our method is fully self-supervised, yet it produces precise depth with a subpixel precision of $1/30$th of a pixel; it does not suffer from the common over-smoothing issues; it preserves edges; and it explicitly handles occlusions. We introduce a novel reconstruction loss that is more robust to noise and texture-less patches, and is invariant to illumination changes. The proposed loss is optimized using window-based cost aggregation with an adaptive support-weight scheme. This cost aggregation is edge-preserving and smooths the loss function, which is key to allowing the network to reach compelling results. Finally, we show how the task of predicting invalid regions, such as occlusions, can be trained end-to-end without ground truth. This component is crucial to reduce blur and particularly improves predictions along depth discontinuities. Extensive quantitative and qualitative evaluations on real and synthetic data demonstrate state of the art results in many challenging scenes.
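As a rough illustration of window-based cost aggregation with adaptive support weights, here is a minimal NumPy sketch; the Gaussian intensity weighting, the parameter names, and the window size are assumptions, not the paper's exact formulation:

```python
import numpy as np

def adaptive_aggregate(cost, guide, win=4, sigma_i=0.1):
    """Aggregate a per-pixel cost over a (2*win+1)^2 window, weighting each
    neighbor by its intensity similarity to the window center (an adaptive
    support weight), so aggregation does not smooth across edges."""
    h, w = cost.shape
    agg = np.zeros_like(cost)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - win), min(h, y + win + 1)
            x0, x1 = max(0, x - win), min(w, x + win + 1)
            # Neighbors with similar intensity to the center get high weight.
            weights = np.exp(-(guide[y0:y1, x0:x1] - guide[y, x]) ** 2
                             / (2.0 * sigma_i ** 2))
            agg[y, x] = (weights * cost[y0:y1, x0:x1]).sum() / weights.sum()
    return agg
```

In the paper this kind of aggregation is applied to a self-supervised reconstruction loss, so the aggregated cost stays sharp at depth edges while becoming smooth enough to optimize.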
Article
We present Motion2Fusion, a state-of-the-art 360° performance capture system that enables real-time reconstruction of arbitrary non-rigid scenes. We provide three major contributions over prior work: 1) a new non-rigid fusion pipeline allowing far more faithful reconstruction of high-frequency geometric details, avoiding the over-smoothing and visual artifacts observed previously; 2) a high-speed pipeline coupled with a machine learning technique for 3D correspondence field estimation, reducing tracking errors and artifacts attributed to fast motions; 3) a backward and forward non-rigid alignment strategy that deals more robustly with topology changes yet remains free from scene priors. Our novel performance capture system demonstrates real-time results with nearly a 3x speed-up over previous state-of-the-art work on the exact same GPU hardware. Extensive quantitative and qualitative comparisons show more precise geometric and texturing results, with fewer artifacts due to fast motions or topology changes, than prior art.
Conference Paper
We present an end-to-end system for augmented and virtual reality telepresence, called Holoportation. Our system demonstrates high-quality, real-time 3D reconstruction of an entire space, including people, furniture, and objects, using a set of new depth cameras. These 3D models can also be transmitted in real time to remote users. This allows users wearing virtual or augmented reality displays to see, hear, and interact with remote participants in 3D, almost as if they were present in the same physical space. From an audio-visual perspective, communicating and interacting with remote users comes closer to face-to-face communication. This paper describes the Holoportation technical system in full, its key interactive capabilities, the application scenarios it enables, and an initial qualitative study of using this new communication medium.
Chapter
This paper presents StereoNet, the first end-to-end deep architecture for real-time stereo matching that runs at 60 fps on an NVIDIA Titan X, producing high-quality, edge-preserved, quantization-free disparity maps. A key insight of this paper is that the network achieves a sub-pixel matching precision that is an order of magnitude higher than that of traditional stereo matching approaches. This allows us to achieve real-time performance by using a very low-resolution cost volume that encodes all the information needed to achieve high disparity precision. Spatial precision is achieved by employing a learned edge-aware upsampling function. Our model uses a Siamese network to extract features from the left and right images. A first estimate of the disparity is computed in a very low-resolution cost volume; the model then hierarchically re-introduces high-frequency details through a learned upsampling function that uses compact pixel-to-pixel refinement networks. Leveraging the color input as a guide, this function is capable of producing high-quality, edge-aware output. We achieve compelling results on multiple benchmarks, showing how the proposed method offers extreme flexibility at an acceptable computational budget.
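The low-resolution cost volume plus sub-pixel readout can be sketched in a few lines of NumPy; the L1 feature distance and the soft-argmin temperature `beta` are illustrative assumptions rather than StereoNet's exact formulation:

```python
import numpy as np

def cost_volume(feat_l, feat_r, max_disp):
    """Low-resolution cost volume: for each candidate disparity d, compare
    left features with right features shifted by d (L1 distance over the
    channel axis). feat_* have shape (C, H, W) at, e.g., 1/8 resolution."""
    c, h, w = feat_l.shape
    vol = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        vol[d, :, d:] = np.abs(feat_l[:, :, d:] - feat_r[:, :, :w - d]).mean(0)
    return vol

def soft_argmin(vol, beta=1.0):
    """Differentiable disparity estimate: a softmax-weighted average over
    disparity levels, which yields continuous (sub-pixel) values."""
    p = np.exp(-beta * (vol - vol.min(0, keepdims=True)))
    p /= p.sum(0, keepdims=True)
    d = np.arange(vol.shape[0]).reshape(-1, 1, 1)
    return (p * d).sum(0)
```

Because the soft argmin is a weighted average over disparity levels rather than a hard argmax, the estimate is continuous, which is what allows high disparity precision even from a cost volume computed at very low resolution.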
Chapter
Sparsity has been shown to be one of the most important properties for visual recognition purposes. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. We start from RGBD images, combine motion and appearance cues, and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from data. We then propose simultaneous on-line video segmentation and action recognition using linear SVMs. The main contribution of the paper is an effective real-time system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques in representing 3D actions. We obtain very good results on three different datasets: a benchmark dataset for one-shot action learning (the ChaLearn Gesture Dataset), an in-house dataset acquired by a Kinect sensor including complex actions and gestures differing by small details, and a dataset created for human-robot interaction purposes. Finally, we demonstrate that our system is also effective in a human-robot interaction setting and propose a memory game, “All Gestures You Can”, to be played against a humanoid robot.
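A hypothetical scikit-learn sketch of the sparse-coding-plus-linear-SVM recognition stage follows; the random placeholder features stand in for the 3DHOF and GHOG descriptors, and all dimensions and hyperparameters are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.svm import LinearSVC

# Placeholder for the paper's 3DHOF + GHOG descriptors: random frame-level
# feature vectors (n_frames x n_dims) with made-up action labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 64))
y_train = rng.integers(0, 5, size=500)

# Learn an overcomplete dictionary, then represent each descriptor by its
# sparse code; sparsity is what captures higher-level patterns in the data.
dico = MiniBatchDictionaryLearning(n_components=128,
                                   transform_algorithm="lasso_lars",
                                   transform_alpha=0.1, random_state=0)
codes = dico.fit(X_train).transform(X_train)

# A linear SVM on the sparse codes is fast enough for on-line recognition.
clf = LinearSVC(C=1.0).fit(codes, y_train)
print(clf.predict(dico.transform(X_train[:3])))
```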
Article
Fully articulated hand tracking promises to enable fundamentally new interactions with virtual and augmented worlds, but the limited accuracy and efficiency of current systems has prevented widespread adoption. Today's dominant paradigm uses machine learning for initialization and recovery, followed by iterative model-fitting optimization to achieve a detailed pose fit. We follow this paradigm, but make several changes to the model fitting, namely: (1) a more discriminative objective function; (2) a smooth-surface model that provides gradients for non-linear optimization; and (3) joint optimization over both the model pose and the correspondences between observed data points and the model surface. While each of these changes may increase the cost per fitting iteration, we find a compensating decrease in the number of iterations. Further, the wide basin of convergence means that fewer starting points are needed for successful model fitting. Our system runs in real time on the CPU only, which frees up the commonly over-burdened GPU for experience designers. The hand tracker is efficient enough to run on low-power devices such as tablets. We can track up to several meters from the camera, providing a large working volume for interaction, even using the noisy data from current-generation depth cameras. Quantitative assessments on standard datasets show that the new approach exceeds the state of the art in accuracy. Qualitative results take the form of live recordings of a range of interactive experiences enabled by this new approach.
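The third change, joint optimization over pose and correspondences on a smooth surface, can be illustrated with a toy SciPy example: a circle of known radius stands in for the hand model, its center for the pose, and a continuous angle per data point for the correspondence, all optimized together by non-linear least squares. Everything here is a simplified stand-in, not the paper's hand model or solver:

```python
import numpy as np
from scipy.optimize import least_squares

# Synthetic "observations": noisy 2D points sampled from a circle of known
# radius R whose center (the "pose") we want to recover.
rng = np.random.default_rng(1)
R = 1.0
true_c = np.array([2.0, -1.0])
theta_true = rng.uniform(0, 2 * np.pi, 40)
pts = true_c + R * np.c_[np.cos(theta_true), np.sin(theta_true)]
pts += rng.normal(scale=0.05, size=pts.shape)

def residuals(params):
    """Residuals are differentiable in BOTH the pose (center) and the
    per-point correspondence angles, so they are optimized jointly."""
    c, thetas = params[:2], params[2:]
    surf = c + R * np.c_[np.cos(thetas), np.sin(thetas)]
    return (pts - surf).ravel()

# Initialize the pose at the centroid and each correspondence at the angle
# of its data point relative to that centroid.
x0 = np.concatenate([pts.mean(0),
                     np.arctan2(*(pts - pts.mean(0)).T[::-1])])
fit = least_squares(residuals, x0)
print("estimated center:", fit.x[:2])
```

Because the surface is smooth, the residuals provide usable gradients in both the pose and the correspondence parameters, which is what makes this joint formulation amenable to standard non-linear least-squares solvers.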