
SOS: Stereo Matching in O(1) with Slanted Support Windows

Vladimir Tankovich1, Michael Schoenberg1, Sean Ryan Fanello1, Adarsh Kowdle1, Christoph Rhemann1, Maksym Dzitsiuk1, Mirko Schmidt1, Julien Valentin1, Shahram Izadi1

Abstract— Depth cameras have accelerated research in many areas of computer vision. Most triangulation-based depth cameras, whether structured light systems like the Kinect or active (assisted) stereo systems, are based on the principle of stereo matching. Depth from stereo is an active research topic dating back 30 years. Despite recent advances, algorithms usually trade off accuracy for speed. In particular, efficient methods rely on fronto-parallel assumptions to reduce the search space and keep computation low. We present SOS (Slanted O(1) Stereo), the first algorithm capable of leveraging slanted support windows without sacrificing speed or accuracy. We use an active stereo configuration, where an illuminator textures the scene. Under this setting, local methods such as PatchMatch Stereo obtain state-of-the-art results by jointly estimating disparities and slant, but at a large computational cost. We observe that these methods typically exploit local smoothness to simplify their initialization strategies. Our key insight is that local smoothness can in fact be used to amortize the computation not only within initialization, but across the entire stereo pipeline. Building on these insights, we propose a novel hierarchical initialization that is able to efficiently perform search over disparities and slants. We then show how this structure can be leveraged to provide high-quality depth maps. Extensive quantitative evaluations demonstrate that the proposed technique yields significantly more precise results than the current state of the art, at a fraction of the computational cost. Our prototype implementation runs at 4000 fps on modern GPU architectures.

I. INTRODUCTION

Since the release of the Microsoft Kinect, 3D cameras have revolutionized the way we tackle challenging computer vision problems such as body part classification [43], hand pose estimation [28], [44], action recognition [13], [11], 3D scanning [25], nonrigid reconstruction [9], [8], and 3D scene understanding [32], [45], [6]. Depth sensors have also enabled challenging scenarios in robotics [7], [12], [22] as well as augmented and virtual reality [38].

The simplest way to estimate the 3D information of a scene makes use of 'passive' sensors and algorithms (e.g. RGB cameras, Structure from Motion (SfM)). These methods typically perform poorly when there is no texture in the scene, or rely on strong prior assumptions about the scene that fail in general cases [20]. To overcome these limitations, 'active' sensors rely on projectors that augment the scene with additional texture. Such sensors can be categorized into two areas: time of flight and triangulation based, the latter also known as structured light. Time of flight systems have a compelling form factor (no baseline is required), but they suffer from multipath interference, which renders the measurements unusable for precise reconstructions [26], [19], [3], [36].

1Authors are with the Augmented Perception group at Google.

Structured light systems [41], [21] fall into temporal and spatial categories. Temporal algorithms rely on multiple shots of the same scene under differing illumination patterns. Such approaches are typically efficient, as their depth is directly encoded into a lookup table, but they have serious limitations: short range, motion artifacts, expensive hardware, and significant power requirements. A recent example of such an approach is the Intel RealSense SR300, which works up to 120 cm. Spatial structured light systems have been popular since the Kinect V1. Such methods encode information spatially into a pattern that is projected into the scene, recovering depth in a single shot. Despite commercialization, this technology also suffers from major drawbacks: interference from multiple sources, need for an a-priori known pattern, and online calibration requirements to compensate for drift from that pattern. Moreover, structured light algorithms demand non-trivial computational resources, with many examples of the technology limited to low frame rates (30 fps) and VGA resolution.

Active (or assisted) stereo systems [30], [37] address most of these challenges: they do not require a known pattern, perform well in textureless regions, and do not suffer from multipath interference. However, these systems must solve a correspondence problem per pixel, making the computation intractable for many real-time scenarios.

Considerable effort has been spent on designing efficient algorithms for this stereo correspondence problem. The goal is to infer a disparity d for each pixel in the image, and such a problem can be cast as a nearest neighbor problem between the left and the right image in some cost space. Given the (fixed) geometry of the cameras, runtime cost can be reduced by searching across the appropriate epipolar line. A stereo matching algorithm usually depends on two main variables: the size of the patch p × p used to compute a correlation function between image patches, and the number of disparity hypotheses L that need to be tested during the search. Many related works have tried to remove the dependency on the window size [5] or on the range of disparities [40], [34], [27].

Very recently, many O(1) methods have shown remarkable results [35], [33], [17], [16] in solving the active stereo problem in real time. These approaches do not depend on the window size or the number of disparity hypotheses that need to be tested. However, they all rely on the so-called fronto-parallel assumption, which requires that the disparity be constant within a given patch. In practice, this assumption is quite restrictive, and represents a compromise between resulting quality and runtime. Some existing works lift this constraint [5], [10], but they typically require multiple seconds to process a single frame.

We present SOS (Slanted O(1) Stereo), the first matching algorithm with slanted support windows that scales linearly with the resolution of the image. Our method does not depend on the window size or the disparity space, and each pixel computes ∼70 intensity differences to predict the final disparity. By dividing the image into multiple non-overlapping tiles, we can explore the much larger cost volume by amortizing the computation across these tiles. This permits us to remove the dependency on an explicit window size used to compute the correlation between left and right patches. Random initialization followed by multiple hierarchical aggregation steps creates an algorithm independent of the size of the disparity search space. Finally, we explicitly estimate the patch slant and perform subpixel refinement at each step of the pipeline, leading to very precise reconstructions. Multiple experiments show high quality results that outperform other state-of-the-art O(1) methods. Our GPU implementation runs at over 4000 fps on an Nvidia Titan X GPU using 1.3-megapixel images.

II. RELATED WORK

Stereo matching pipelines are traditionally composed of three principal steps: matching cost computation, disparity optimization, and refinement [42]. The matching cost computation defines the distance function used to compare two p × p patches between the left and right images. These correlation functions usually depend on an explicit window size, making the algorithm challenging to scale when the resolution increases. Common correlation functions include the sum of absolute differences (SAD), normalized cross-correlation (NCC), and the census transform; see [24] for a comprehensive evaluation. The second step, disparity optimization, consists of finding the best disparity hypothesis. A naive exhaustive search approach (e.g. block matching) would be infeasible for high resolution images (a typical disparity range for 1.3-megapixel images is between 0 and 256). Depending on the optimization strategy, methods can be categorized as local [47], [40], [5], [35], global [18], [2], [31], [33] or semiglobal [23], [4].

Most of these works have inherent dependencies on the number of disparity hypotheses or on the window size used to compute the cost. For instance, methods relying on so-called cost volume filtering [40], [34], [27] have a running time independent of the patch size, since they can efficiently compute the cost using integral images. They trade this benefit for a linear dependency on the number of disparities L, which limits their usefulness for real-time applications to low resolution images or small disparity ranges. Other methods [5], [2] leverage a PatchMatch-like scheme [1] to avoid searching the full disparity space. Unfortunately, their complexity depends on the patch size used to compute the matching cost. Recently, O(1) methods have reduced the overall computation required for stereo by making use of superpixels and PatchMatch optimization [35], [33]. Very recent works have used machine learning to derive O(1) depth estimation algorithms [14], [15], [46], [17], [16]. For example, [15] uses a random forest per scanline to predict depth in O(1), but the method requires a tedious calibration and offline learning for each camera. Similarly, in [17], the authors use a decision tree to learn a mapping from image patches to a binary space. The learned mapping is sparse, and therefore independent of the window size. A PatchMatch framework is then used to optimize over the disparities. In [16], a per-pixel parallel scheme is proposed to solve a CRF model in O(1).

These O(1) systems make a strong fronto-parallel assumption, which means that the disparity stays constant within image patches. However, in real applications this constraint is typically violated. Previous work that drops the fronto-parallel assumption [5] performs randomized search over both disparity and plane normals, but this radically increases the size of the search space, rendering that method suitable only for off-line applications. Others, such as [10], test multiple per-pixel correlations to estimate the right slant, and then perform disparity optimization using a standard block matching search. Recent deep learning architectures [29], [48] can implicitly deal with slanted surfaces using 3D convolutions in the cost volume; however, their computational requirements are still too demanding even for high-end GPUs.

III. SOS ALGORITHM

In this section we detail our approach to depth estimation from triangulation systems. We start from an image pair IL and IR, and we assume that these images are calibrated and rectified. We use a similar setup to that used in [17], [16]: a custom Kinect-like DOE projector generates texture in the scene. We select Ximea monochrome cameras with a spatial resolution of 1280 × 1024 pixels. These cameras are capable of running at 210 fps at full resolution.

Since our images are rectified, for each pixel (x, y) in the left image IL, there is a corresponding pixel (x − d, y) in the right image, where d is the so-called disparity. In a triangulation system, the disparity is inversely proportional to the depth: Z = bf/d, with b and f the baseline and focal length of the system, respectively.
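As a quick sanity check, the triangulation relation Z = bf/d can be sketched directly; the calibration numbers below are illustrative values, not our setup's calibration:

```python
def disparity_to_depth(d_px, baseline_m, focal_px):
    """Triangulation: depth is inversely proportional to disparity, Z = b*f/d."""
    if d_px <= 0:
        raise ValueError("disparity must be positive")
    return baseline_m * focal_px / d_px

# Hypothetical calibration: 50 mm baseline, 1000 px focal length.
# A 25 px disparity then corresponds to Z = 0.05 * 1000 / 25 = 2.0 m.
print(disparity_to_depth(25, 0.05, 1000))  # -> 2.0
```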

In the remainder of the section, we detail our approach to efficient disparity estimation from stereo. The core insight is that we can amortize the computation across local image tiles in a fine-to-coarse pyramid, while estimating both disparity and slant. We additionally perform multiple subpixel refinement steps across a continuous cost space, producing high quality results. See Fig. 1 for an overview of the method.

A. Initialization

As mentioned previously, performing exhaustive search in order to initialize our solution is problematic. In addition to being computationally prohibitive, exhaustive search methods typically evaluate only a unary cost; that is, they do not take solution smoothness into account, and so can end up in incorrect minima. Some works [16], [39] indicate that exhaustive search is not required to obtain state-of-the-art results.

Fig. 1. Overview of the proposed algorithm. The input to our method is a pair of calibrated and rectified images. First, we perform a hierarchical search that estimates an initial tiled disparity map, where the disparity values within each tile follow a planar equation. These initial estimates are then refined using an efficient inference that encourages local smoothness across tiles. Finally, these refined per-tile estimates are used to infer precise per-pixel disparity. We refer the reader to Sec. III for details.

Based on these intuitions, we propose a

hierarchical approach which differs from those proposed in the literature. While it is common to perform a coarse-to-fine search of the disparity space in order to improve performance (by restricting cost volume search intervals based on the result of a higher level in the pyramid), such methods operate by finding a single local minimum for each patch and refining it as they recurse down to the pixel level. Consequently, such techniques can miss even relatively large features in the input. Instead, we propose an inverted, fine-to-coarse ranking and aggregation scheme. Starting at the per-pixel level, we evaluate several (in our implementation, 4) weak hypotheses per pixel using a pixel-wise absolute difference, storing the winning hypothesis as the input to our initialization algorithm. We then recursively examine 2 × 2 non-overlapping elements of the previous level. Each such element has one winning hypothesis, which we evaluate across the entire 2 × 2 tile using the sum of absolute differences (SAD) in pixel space. Candidates are ranked, and the winning hypothesis per tile is provided as input to the next level in the recursion, which continues in our implementation until these tiles are 16 × 16 pixels wide. Consequently, each level of the recursion doubles the width and height of tiles, but halves the number of tiles in each dimension that must be processed, leading to an O(N) runtime. See Sec. III-E for details on the computational analysis of the proposed algorithm. The key insight behind this procedure is that the reconstruction error (here, SAD) for a 'correct' disparity is low for all tile sizes. True negative candidates (i.e. poor reconstruction error and incorrect disparity) can be quickly rejected at low cost, and we can sample much of the disparity space in this manner at the finer levels of the pyramid. False positive (good reconstruction error, but poor disparity) and true positive (good reconstruction error, and correct disparity) candidates are propagated up the hierarchy, where larger and larger tiles permit removal of false positives while retaining true positives. In practice, due to the sparsity of the structured light illuminator used in this paper, we found that 16 × 16 tiles provide enough spatial support to filter out the vast majority of false positives. It is interesting to note that for 16 × 16 tiles, a total of 1024 hypotheses (16 × 16 × 4) are evaluated (including duplicates), which is well above the maximum disparity (350) that our hardware system supports. Once this procedure is complete, we have coarse fronto-parallel depth tiles for the full image. The next section describes how the fronto-parallel constraint is relaxed to obtain a refined initialization.
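The fine-to-coarse winner propagation above can be sketched in a few lines of NumPy. This is a toy sketch, not the GPU implementation: function names, the tiny hypothesis counts, and the assumption that image dimensions divide evenly by the final tile size are all ours.

```python
import numpy as np

def shift_cost(left, right, d):
    """Per-pixel |IL(x, y) - IR(x - d, y)| for an integer disparity map d."""
    h, w = left.shape
    xs = np.clip(np.arange(w)[None, :] - d, 0, w - 1)
    return np.abs(left - right[np.arange(h)[:, None], xs])

def fine_to_coarse_init(left, right, num_hyp=4, max_disp=64, levels=4, seed=0):
    """Fine-to-coarse winner propagation: sample random per-pixel disparities,
    then repeatedly merge 2x2 blocks of tiles, re-scoring each block's four
    winning hypotheses by SAD over the whole merged tile and keeping the best."""
    rng = np.random.default_rng(seed)
    h, w = left.shape
    # Level 0: a few weak random hypotheses per pixel, keep the cheapest.
    best_d = rng.integers(0, max_disp, size=(h, w))
    best_c = shift_cost(left, right, best_d)
    for _ in range(num_hyp - 1):
        d = rng.integers(0, max_disp, size=(h, w))
        c = shift_cost(left, right, d)
        keep = c < best_c
        best_d = np.where(keep, d, best_d)
        best_c = np.where(keep, c, best_c)
    for lvl in range(1, levels + 1):      # tile width doubles at each level
        t = 2 ** lvl
        th, tw = h // t, w // t
        # Gather the 4 winning hypotheses of each 2x2 block of smaller tiles.
        cands = best_d.reshape(th, 2, tw, 2).transpose(0, 2, 1, 3).reshape(th, tw, 4)
        errs = np.empty((th, tw, 4))
        for k in range(4):
            d_full = np.repeat(np.repeat(cands[:, :, k], t, axis=0), t, axis=1)
            e = shift_cost(left, right, d_full)
            errs[:, :, k] = e.reshape(th, t, tw, t).sum(axis=(1, 3))  # tile SAD
        best_d = np.take_along_axis(cands, errs.argmin(2)[..., None], 2)[..., 0]
    return best_d  # one winning integer disparity per (2**levels)-wide tile
```

Note how the per-tile cost grows by 4x per level while the tile count shrinks by 4x, which is exactly the amortization argument made in Sec. III-E.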

B. Slant Estimation and Subpixel Reﬁnement

At the end of the previous step, each 16 × 16 tile was assigned a single disparity d. We now refine d via a standard parabola fit in the cost space. That is, we evaluate the SAD cost of the full 16 × 16 tile at disparities d − 1, d, and d + 1, and fit a parabola to the local cost space described by those reconstruction errors. Using the standard closed form expression, the minimum of this quadratic polynomial is extracted and used as a refined estimate of d. This refined estimate still corresponds to a fronto-parallel planar solution, but in the vast majority of scenes the fronto-parallel assumption does not hold and hence must be lifted. We propose increasing the degree of the model by one, such that geometric structures are represented by planes in disparity space, rather than by constant integer values. In fact, since depth and disparity are inversely proportional to each other, a plane equation in disparity space corresponds to a smooth quadratic surface in depth. When considering fronto-parallel depth, the following relationship defines how pixels in the left image xL are related to pixels in the right image xR:

xL = xR − d    (1)

If we instead describe a plane l = [d, dx, dy] in disparity space, the relationship becomes:

xL = xR + S(xL, l),    S(xL, l) = kx·dx + ky·dy − d    (2)

where kx, ky are the offsets from the patch center and dx, dy are the coefficients controlling the orientation of the plane.
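In code, the correspondence of Eq. 2 is simply a linear function of the in-patch offset. A minimal sketch (the function name and the example plane are ours):

```python
def slanted_offset(kx, ky, plane):
    """S(xL, l) from Eq. 2: the signed x-offset between left and right pixels
    under plane l = (d, dx, dy), at offset (kx, ky) from the patch centre."""
    d, dx, dy = plane
    return kx * dx + ky * dy - d

# At the patch centre (kx = ky = 0) this reduces to Eq. 1: xL = xR - d.
print(slanted_offset(0, 0, (10.0, 0.1, -0.05)))   # -> -10.0
# Away from the centre, the effective disparity varies linearly over the patch.
print(slanted_offset(4, 2, (10.0, 0.1, -0.05)))   # -> -9.7
```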

Similarly to the refinement of d, the values of dx and dy are optimized by fitting a parabola to costs computed by evaluating 3 plane hypotheses on the tile (fronto-parallel, +30 degree slant and −30 degree slant). The assumption is that the minimum of the quadratic function is close to the 'true' minimum for the vast majority of tiles, which is validated by the experiments discussed in Section IV-A. Once this refinement is complete, each tile is associated with a disparity model that follows a plane equation. The output of this stage is shown in Fig. 1, initialization column. One can observe that this coarse initialization provides a solution that already greatly resembles the final solution in the rightmost column. Nevertheless, as illustrated in the top right corner of the 16 × 16 result of Fig. 1, some tiles may have arrived at an incorrect local minimum, an issue which we address with the following regularization scheme.
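The closed-form parabola step used above for the disparity, and for the slant coefficients probed at −30/0/+30 degrees, is the standard three-point fit. A generic sketch (not the authors' code):

```python
def parabola_refine(c_minus, c0, c_plus, step=1.0):
    """Offset of the minimum of the parabola through costs sampled at
    -step, 0 and +step around the current estimate, clamped to [-step, step].
    The same fit serves d (step = 1 disparity) and dx, dy (step = 30 degrees)."""
    denom = c_minus - 2.0 * c0 + c_plus
    if denom <= 0.0:        # flat or concave samples: keep the current estimate
        return 0.0
    offset = 0.5 * (c_minus - c_plus) / denom * step
    return max(-step, min(step, offset))

# Costs 4, 1, 2 around disparity d: the minimum lies slightly to the right of d.
print(parabola_refine(4.0, 1.0, 2.0))  # -> 0.25
```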

C. Propagation and Inference

To tackle the issue of having a few tiles with incoherent solutions, we resort to a Conditional Random Field (CRF). CRFs are in general NP-hard to solve exactly, and most solvers are iterative by nature, making their computational requirements unattractive for real-time applications. Recently, [16] presented a fast and fully parallel inference technique that provides high quality solutions under the condition that the initial solution is of high quality, a condition which we meet. For completeness, a short description of their method follows. In a nutshell, the problem is cast in a probabilistic framework:

P(Y|D) = (1/Z(D)) exp(−E(Y|D))    (3)

and minimized in log-space:

E(Y|D) = Σi ψu(li) + Σi Σj∈Ni ψp(li, lj),    (4)

In our case, the data term ψu(li) corresponds to the reconstruction error for tile i under the planar hypothesis li, and Z(D) is the partition function. In more detail, we define

ψu(li) = Σp∈Ti |IL(p) − IR(px − S(px, li), py)|    (5)

where the summation is performed over all pixels p from the set of pixels Ti contained in tile i. The function S(px, li) (Eq. 2) estimates the disparity of pixel p under the planar hypothesis li. Finally, IL(·) and IR(·) respectively return the intensity values stored in the left and right images for the queried pixels. Concretely, the proposed unary potential evaluates the reconstruction error (SAD) under the planar hypothesis li. Note that this is different from the reconstruction error evaluated previously, where the error was evaluated exclusively using fronto-parallel planes.

Our new pairwise potential ψp is evaluated over the neighbors Ni, which correspond to the tiles above, below, left and right of tile i, and is defined as

ψp(li, lj) = λ min(|li^d − S(c(i), lj)|, 3),    (6)

where c(i) returns the position of the pixel at the center of tile i, and li^d corresponds to the disparity component of the planar hypothesis li. Concretely, this function first evaluates what the disparity of the center of tile i would be if it were to belong to the plane lj. A truncated ℓ1-norm between that estimated disparity and the current candidate disparity is then computed. In order not to over-penalize large disparity changes (e.g. transitions from foreground to background), this distance is truncated. The parameter λ controls the degree of smoothness in the solution. The authors of [16] show that Eq. 4 can be efficiently minimized through an approximation of mean-field inference, where each minimization step corresponds to taking the union of the labels associated with the current tile and its |Ni| neighbors. Each of these 5 candidates is ranked by evaluating:

E(Yi|D) = ψu(li) + Σj∈Ni ψp(li, lj)    (7)

The minimizer is used as the new disparity value for tile i. In practice, we found that performing two steps of minimization is enough to converge to good solutions. Note that each iteration of the minimization is extremely fast, as it is performed on only 5120 (1280 × 1024 / 16²) variables. Once the minimization is performed, the disparity at each tile is refined using another parabola fit leveraging the estimated li. We have found empirically that re-estimating the first derivative hypotheses via sub-pixel fitting is not required for high quality results.
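One such label-union minimization step can be sketched as follows. The grid layout, the `unary` callback, and the sign convention in `disp_at` are our assumptions for illustration; the actual system evaluates all tiles in parallel on the GPU.

```python
import numpy as np

T = 16  # tile width in pixels

def disp_at(plane, kx, ky):
    """Disparity predicted by plane (d, dx, dy) at pixel offset (kx, ky)
    from the centre of the tile owning the plane (sign convention assumed)."""
    d, dx, dy = plane
    return d + kx * dx + ky * dy

def propagate_step(planes, unary, lam=1.0, trunc=3.0):
    """One parallel update in the spirit of Eqs. 6-7: each tile ranks its own
    plane plus its 4 neighbours' planes by unary + truncated-L1 pairwise cost
    and keeps the minimizer. `unary(i, j, plane)` is a caller-supplied SAD
    evaluator for tile (i, j); its signature is our choice, not the paper's."""
    H, W, _ = planes.shape
    out = planes.copy()
    for i in range(H):
        for j in range(W):
            nbrs = [(i + a, j + b) for a, b in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= i + a < H and 0 <= j + b < W]
            cands = [planes[i, j]] + [planes[n] for n in nbrs]
            best, best_e = planes[i, j], float("inf")
            for li in cands:
                e = unary(i, j, li)
                for (ni, nj) in nbrs:
                    # Disparity tile (i, j)'s centre would take on the
                    # neighbour's plane, vs. the candidate's own d component.
                    pred = disp_at(planes[ni, nj], (j - nj) * T, (i - ni) * T)
                    e += lam * min(abs(li[0] - pred), trunc)
                if e < best_e:
                    best, best_e = li, e
            out[i, j] = best
    return out
```

With a toy unary that favors disparity 5, a single outlier tile initialized at 50 is pulled back to its neighbors' plane in one step, which mirrors the regularization behavior described above.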

D. Per-pixel estimation

After propagation is complete, we have a robust estimate of the disparity and slant for each 16 × 16 tile. The quadratic approximation used to estimate the slant of tiles is robust in the [−30°, 30°] range, and this range of angles provides inferred solutions of much higher quality than assuming fronto-parallel tiles. Unfortunately, real surfaces can be much steeper, and the quadratic approximation becomes weaker for angles much bigger than 30°, as demonstrated in the experiments. To circumvent this limitation, we compute the final slant of each patch using central differences built from the disparity at the center of neighboring tiles.

We now leverage the above initialization to obtain precise per-pixel results. First, each tile is 'expanded' by 50% in both x and y directions, causing any given pixel (except at the image boundaries) to overlap 4 expanded tiles. For each expanded tile, we build an integral 'tile' of the reconstruction error (SAD) obtained using the corresponding plane hypothesis li. We build two other integral 'tiles' per expanded tile, which capture the reconstruction error with a small delta added to the disparity component of li. For every pixel, we can now perform 4 parabola fits of their respective cost volumes using the integral tiles described above. The cost of each pixel is again defined as the reconstruction error, but is computed over 11 × 11 patches centered on the pixel in question. The solution with the smallest interpolated reconstruction error is used as the final disparity estimate.
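The O(1)-per-query window costs above come from standard summed-area tables. A minimal sketch (function names ours):

```python
import numpy as np

def integral_image(err):
    """Summed-area table with a zero top row and left column, so any
    rectangular sum over `err` becomes four lookups."""
    ii = np.zeros((err.shape[0] + 1, err.shape[1] + 1))
    ii[1:, 1:] = err.cumsum(axis=0).cumsum(axis=1)
    return ii

def window_sad(ii, y, x, p):
    """SAD over the p x p window centred at (y, x), in O(1) per query.
    Here `err` would hold |IL - IR| under one expanded tile's plane hypothesis."""
    h = p // 2
    y0, x0, y1, x1 = y - h, x - h, y + h + 1, x + h + 1
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
```

Building one such table per expanded tile, plus two more with a small disparity delta, yields the three cost samples needed for the per-pixel parabola fit at constant cost per pixel.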

a) Invalidation: Given the inherent physical limitations of an active stereo system (e.g. poor signal to noise ratio on dark or far surfaces, occlusions), there may not be enough data to produce a robust estimate at every pixel. Traditional methods of invalidation involve left-right consistency checks, filtering, and/or connected component analysis, all computationally expensive. Instead, we make use of byproducts of our method to perform fast and precise invalidation. More precisely, we do not consider tiles that have slants greater than 75 degrees, and we invalidate pixels after refinement when the final SAD cost is higher than a predefined threshold θ. We found empirically that this simple invalidation strategy produces clean results and requires minimal computation.

E. Computational Analysis

In this section, we evaluate the overall computational complexity of our depth algorithm. Let us assume an input image containing N pixels, and suppose that L discrete integer disparity labels are permitted. Typical values in practice for such algorithms are N = 1280 × 1024 and |L| = 512. We proceed through all stages of our algorithm and justify why each is O(N).

Initialization evaluates 4 random hypotheses per pixel at a cost of O(1) per pixel. Next, we ascend a k-level hierarchy, accumulating and re-evaluating hypotheses as we go. At each level l, we process the image in tiles of 2^l × 2^l pixels. Because the tiles are non-overlapping, and the cost of the cost function at each level is exactly proportional to the size of the tile, the computational cost for each tile scales up at the same rate that the number of tiles scales down. Consequently, we require O(kN) to perform this step. In our implementation, we use k = 4 levels, but strictly speaking a proper value of k depends on log2 |L|. This dependency occurs because we are sampling disparities randomly (and is the same dependency incurred by any method which samples the cost space at random [39]): in order to obtain a fixed probability of having the correct integer disparity within a tile, we must increase the number of samples as we increase the size of the disparity range. Fortunately, for practical implementations, log2 |L| is essentially a small constant (e.g. if |L| = 512, then log2 |L| = 9). Once this initialization is complete, the highest-level cost function is evaluated again for fitting dx and dy, a process which is O(1) per tile, and thus O(N) overall.
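The constant-work-per-level argument can be checked numerically; the accounting below uses our image size and counts one operation per pixel of each scored tile:

```python
# Tiles at level l are 2^l x 2^l pixels; scoring a tile costs O(tile area),
# and the number of tiles shrinks by 4x per level, so each level performs the
# same total work: exactly N = H * W operations in this accounting.
H, W = 1024, 1280
per_level = []
for l in range(1, 5):                       # k = 4 levels
    tile_area = (2 ** l) ** 2               # cost of scoring one tile
    n_tiles = (H // 2 ** l) * (W // 2 ** l)
    per_level.append(tile_area * n_tiles)
print(per_level)  # every entry equals H * W = 1310720, hence O(kN) overall
```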

Propagation occurs over two passes, each performed on the coarsest tiles. Here, too, our cost function is proportional to tile size. We evaluate 4+1 hypotheses per tile and select the best, again at a constant cost per tile. Consequently, this step requires O(5N) operations.

Per-pixel prediction and invalidation examine four hypotheses per pixel, and evaluate the cost for each using SAD with an 11 × 11 patch per pixel. While naively this approach would induce an additional patch size dependency, we can amortize the computation required by generating an integral image for the plane hypothesis of each tile. Suppose we have per-pixel patches of width p and tiles of width w. We need the integral image to be usable by all pixels that count this tile as one of their four nearest tiles, so we must scale it to overlap half of each of the nearby tiles. Consequently, the integral image width per tile is 2w + p, so under the assumption (which holds for our implementation) that p is of the same order as w, we remain O(N) for the full image. Invalidation is a single pass over the image that removes patches with high-magnitude slants or high matching cost, and is therefore O(N).

Fig. 2. Distribution of angular error in performing plane fits for the first derivative of tiles. See text for details.

Fig. 3. Qualitative comparisons with other state-of-the-art O(1) methods (UltraStereo [17], HashMatch [16]) and slow slanted-window approaches (PatchMatch Stereo [5]). Note how we provide smoother results, better invalidation and less noise.

IV. EVALUATION

In this section we evaluate the algorithm under various challenging conditions, and compare the results with state-of-the-art methods. For all our experiments we use the active stereo setup described in Sec. III.

A. Cost Space Analysis

Here we validate the use of a quadratic model to estimate the first derivatives dx, dy in each tile. Quadratic model minimization is well established in the stereo literature for subpixel refinement of fronto-parallel disparities d (for instance, in [24]), but it is not immediately obvious that such an approach would also work for estimating the first derivatives of the disparity space.

We recall that our method works by evaluating the cost function on 0° tiles, coarsely estimating the first derivatives of the cost space, and then fitting a quadratic model per pixel to estimate the solution with minimum cost. Such an approach inherently assumes that the cost space is locally smooth. While a naive approach to this fitting would result in blocky artifacts around tile boundaries, our refinement step (Sec. III-D) prevents such artifacts by permitting multiple local models to be explored per pixel.

Fig. 4. Qualitative comparisons of single frame reconstruction with state-of-the-art local methods [5] and [16]. Notice how our method exhibits smoother single shot reconstruction, less edge fattening and a higher level of detail at over 4000 fps.

Intuitively, assigning a slant value to a patch in the image acts as a shear/stretch on the patch. We find that in our system, typical shears/stretches are subpixel in size, which suggests that a cost directly associated with reconstruction error (such as SAD) should perform well. To evaluate the quality of this approximation, we sweep either dx or dy across the full valid range K = [−k, k] to exhaustively sample the cost space. We evaluate the approximation loss (i.e. the difference between the result of our parabola fit and the minimum reached by exhaustive search) for k = 30° and k = 75°, as shown in Fig. 2. We can observe that the distribution of angular error improves when restricting local plane fits to k = 30° as opposed to k = 75°, with 90% of tiles tested falling within 6.3° of the true minimum. While increasing k to 75° allows steeper planes to be initially estimated, it risks attempting to match a highly distorted patch: such highly oblique surfaces may sample the same row or column repeatedly, or skip pixels when reading the associated patch. We find that restricting k to 30° does not significantly damage result quality and, because of the use of central differences in the refinement step later in the process, does not prevent the generation of steeper planes. Since any deviation from a convex cost space impacts the quality of the fit, we additionally examine the convexity of such cost spaces. We find that true convexity is rare, with only 5.5% of tiles having perfectly convex cost spaces. We instead observe empirically that when sampled every 2.5°, 82% of tiles have quasiconvex cost spaces.

B. Qualitative Evaluation

In this section, we provide qualitative comparisons with state-of-the-art O(1) methods [17], [16], as well as with algorithms that use slanted support windows [5]. We use the same data used in [16] in order to provide a fair comparison. The data consists of multiple shots of complex scenes including people, objects, furniture and slanted planes. In Fig. 3, we show results for the baseline methods. Notice how O(1) methods such as UltraStereo [17] and HashMatch [16] produce noisier results, as they explicitly make fronto-parallel assumptions. The state-of-the-art method that uses slanted windows, PatchMatch Stereo [5], does not use an explicit smoothness term that exploits non-fronto-parallel surfaces. In contrast, our method is able to model slanted windows up to 75° while producing results containing less noise. For instance, examine the sofa in the first row of Fig. 3, the floor in the second, and the panels behind the person in the third.

Fig. 5. Quantitative comparisons. Depth bias (average absolute error) and depth jitter (standard deviation) of different depth algorithms with respect to the distance from a flat target. Note how the proposed method provides stronger results across the whole range of depth values.

a) Single Shot Reconstruction: traditional methods typically require accumulation of observations over time [25], and usually rely on a moving camera or moving objects in order to average to the correct depth value. This requires an additional tracking step, which may ultimately lead to failure in the reconstruction. To better demonstrate the quality of our method, we perform a single shot reconstruction of an object approximately 1 m away from the camera. In Fig. 4, note how SOS generates a mesh that preserves details and produces smooth results. In comparison, other algorithms suffer from noisier predictions.

C. Quantitative Analysis

In this section we design quantitative experiments in order to precisely evaluate the error of SOS. We first analyze

depth bias and jitter, then we design a more sophisticated ex-

periment using ground truth generated via a LIDAR scanner.

We compare our method with a state of the art O(1) method

(HashMatch [16]) as well as with PatchMatch Stereo [5],

which explicitly models slanted surfaces. Note that the fastest of these techniques is HashMatch, which runs at 1000 fps on GPU, while the proposed method runs at 4000 fps on the same GPU.

a) Bias and Jitter: using our calibrated stereo setup, we

recorded multiple shots of a ﬂat target at multiple distances.

We start from 500 mm up to 3500 mm, which is a reasonable

range for indoor scenarios. We compute ground truth by robust plane fitting of the depth data, and use the equation of the fitted planes to compute the metrics. We define the depth bias as the average absolute error over multiple frames, and the depth jitter as the standard deviation over these frames. This is similar to the setup used in [16] and [17].

Fig. 6. Single shot scans with LIDAR ground truth. Error maps and RMSE are reported for each method.
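As a concrete sketch of these metrics, the following illustrates the bias and jitter computation under our own simplifying assumptions: a plain least-squares fit stands in for the robust plane fit, and the function names and array shapes are hypothetical.

```python
import numpy as np

def fit_plane(points):
    """Least-squares fit of z = a*x + b*y + c to an (N, 3) point cloud.
    (A plain fit for brevity; the paper uses robust plane fitting.)"""
    A = np.c_[points[:, 0], points[:, 1], np.ones(len(points))]
    coeffs, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    return coeffs  # (a, b, c)

def bias_and_jitter(frames):
    """frames: list of (N, 3) point clouds of the same flat target.
    Bias = mean absolute point-to-plane error per frame, averaged;
    jitter = standard deviation of the per-frame errors."""
    a, b, c = fit_plane(np.vstack(frames))
    per_frame = [np.abs(f[:, 2] - (a * f[:, 0] + b * f[:, 1] + c)).mean()
                 for f in frames]
    return float(np.mean(per_frame)), float(np.std(per_frame))
```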

In Fig. 5, we show the results of this experiment. Notice

how our error is nearly half that obtained by the baseline

state of the art methods. In addition, our algorithm exhibits significantly lower levels of noise.

b) Single Shot Analysis: in this experiment we evaluate single shot depthmaps using ground truth generated with a LIDAR scanner. We recorded 4 objects placed at approximately 800 mm from the camera. We align the ground truth depthmaps generated with the LIDAR sensor to our depthmaps using rigid ICP. We compute the root mean square error (RMSE) between ground truth and predictions generated with PatchMatch Stereo [5], HashMatch [16] and SOS. We calculate the error only in a small ROI around the objects. In Fig. 6 we report the results. Our method exhibits the lowest error for most of the objects. This demonstrates that the plane model we use does not hinder the reconstruction of complex structures, while being 4× faster than the current state of the art method.
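A minimal sketch of the ROI-restricted RMSE described above (the ROI layout and the invalid-pixel sentinel are our assumptions; the depthmaps are assumed to be already aligned via rigid ICP):

```python
import numpy as np

def roi_rmse(pred, gt, roi, invalid=0.0):
    """RMSE between a predicted depthmap and an (already ICP-aligned)
    ground truth depthmap, restricted to a region of interest.
    roi = (row0, row1, col0, col1); pixels equal to `invalid` in either
    map are excluded from the error."""
    r0, r1, c0, c1 = roi
    p, g = pred[r0:r1, c0:c1], gt[r0:r1, c0:c1]
    mask = (p != invalid) & (g != invalid)
    return float(np.sqrt(np.mean((p[mask] - g[mask]) ** 2)))
```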

Fig. 7. Slant Experiment. We recorded a sequence of a plane at multiple orientations and computed the average error. SOS achieves the best results not only quantitatively but also visually.

TABLE I
ERRORS IN MM FOR DIFFERENT SLANT SURFACES.

Slant / Algorithm    PM Stereo [5]   HashMatch [16]   SOS (no slant)   SOS
Fronto-Parallel      0.45            0.63             0.44             0.31
25° Horizontal       0.53            0.56             0.53             0.28
45° Horizontal       0.56            0.47             0.45             0.22
60° Horizontal       0.81            0.54             0.61             0.24
75° Horizontal       0.82            0.73             0.65             0.52
25° Vertical         0.47            0.38             0.45             0.21
45° Vertical         0.61            0.43             0.46             0.17
60° Vertical         1.1             0.7              0.71             0.26
75° Vertical         1.48            1.01             1.1              0.56

D. Slanted Surfaces Analysis

Our main contribution is the capability of estimating

slanted surfaces in O(1). Here we design an experiment

speciﬁcally to evaluate this component. We record multiple

frames of a planar checkerboard at about 500 mm distance

from the camera. We rotate the checkerboard to cover the

full space: from fronto-parallel up to 75° slant in both

horizontal and vertical orientations. We use robust plane fitting to estimate the ground truth plane, selecting a small ROI around the object. We show some qualitative results in Fig. 7, where we compare our algorithm against the other baselines.

Notice how we are able to support even extreme slants,

whereas the baseline techniques suffer from increased error.

We additionally computed the average error with respect

to the ground truth planes and report the results in Tab. I. Our

algorithm consistently outperforms the other approaches.

Finally, notice the contribution of the explicit slant estimation

(i.e., dx and dy) in Tab. I, third column. One can observe that

optimizing for the slant coefﬁcients signiﬁcantly improves

the precision of the results obtained by the proposed method,

while only costing a negligible amount of computation.
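As an aside, the slant values used throughout this experiment can be measured from the fitted ground truth plane's normal. The sketch below is our own illustration of that measurement (the coordinate conventions are an assumption, not the paper's code):

```python
import numpy as np

def slant_angle_deg(normal, optical_axis=(0.0, 0.0, 1.0)):
    """Slant of a fitted plane: the angle between its normal and the
    camera's optical axis. 0 deg = fronto-parallel."""
    n = np.asarray(normal, dtype=float)
    v = np.asarray(optical_axis, dtype=float)
    cos = abs(n @ v) / (np.linalg.norm(n) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))
```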

V. CONCLUSION

In this paper, we presented SOS, the ﬁrst low computation

algorithm capable of leveraging slanted support windows.

By using the proposed hierarchical initialization scheme, our

technique is capable of quickly extracting high quality initial

disparity estimates per pixel. These initial candidates are then improved through a continuous refinement stage, followed by an invalidation step. Each of these steps is extremely efficient,

allowing the whole pipeline to run at 4000 fps on GPU.

Through extensive experiments, we have demonstrated that

the proposed method yields solutions that are superior to the

state of the art, while requiring signiﬁcantly less compute.

REFERENCES

[1] C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman. PatchMatch: A randomized correspondence algorithm for structural image

editing. ACM Transactions on Graphics (SIGGRAPH), 2009.

[2] F. Besse, C. Rother, A. Fitzgibbon, and J. Kautz. PMBP: Patchmatch

belief propagation for correspondence ﬁeld estimation. IJCV, 2014.

[3] A. Bhandari, A. Kadambi, R. Whyte, C. Barsi, M. Feigin, A. Dorrington, and R. Raskar. Resolving multi-path interference in time-of-flight

imaging via modulation frequency diversity and sparse regularization.

CoRR, 2014.

[4] M. Bleyer and M. Gelautz. Simple but effective tree structures for

dynamic programming-based stereo matching. In VISAPP, 2008.

[5] M. Bleyer, C. Rhemann, and C. Rother. PatchMatch Stereo - Stereo

Matching with Slanted Support Windows. In BMVC, 2011.

[6] M. Bleyer, C. Rhemann, and C. Rother. Extracting 3d scene-consistent

object proposals and depth from stereo images. In ECCV, 2012.

[7] C. Ciliberto, S. R. Fanello, L. Natale, and G. Metta. A heteroscedastic

approach to independent motion detection for actuated visual sensors.

In IROS, 2012.

[8] M. Dou, P. Davidson, S. R. Fanello, S. Khamis, A. Kowdle, C. Rhe-

mann, V. Tankovich, and S. Izadi. Motion2fusion: Real-time volumet-

ric performance capture. SIGGRAPH Asia, 2017.

[9] M. Dou, S. Khamis, Y. Degtyarev, P. Davidson, S. R. Fanello,

A. Kowdle, S. O. Escolano, C. Rhemann, D. Kim, J. Taylor, P. Kohli,

V. Tankovich, and S. Izadi. Fusion4d: Real-time performance capture

of challenging scenes. SIGGRAPH, 2016.

[10] N. Einecke and J. Eggert. Block-matching stereo with relaxed

fronto-parallel assumption. In IEEE Intelligent Vehicles Symposium

Proceedings, 2014.

[11] S. Fanello, I. Gori, G. Metta, and F. Odone. One-shot learning for

real-time action recognition. In IbPRIA, 2013.

[12] S. Fanello, U. Pattacini, I. Gori, V. Tikhanoff, M. Randazzo, A. Ron-

cone, F. Odone, and G. Metta. 3d stereo estimation and fully automated

learning of eye-hand coordination in humanoid robots. In IEEE-RAS

International Conference on Humanoid Robots, 2014.

[13] S. R. Fanello, I. Gori, G. Metta, and F. Odone. Keep it simple and

sparse: Real-time action recognition. JMLR, 2013.

[14] S. R. Fanello, C. Keskin, S. Izadi, P. Kohli, D. Kim, D. Sweeney,

A. Criminisi, J. Shotton, S. Kang, and T. Paek. Learning to be a

depth camera for close-range human capture and interaction. ACM

Transactions on Graphics (SIGGRAPH), 2014.

[15] S. R. Fanello, C. Rhemann, V. Tankovich, A. Kowdle, S. Orts Es-

colano, D. Kim, and S. Izadi. Hyperdepth: Learning depth from

structured light without matching. In CVPR, 2016.

[16] S. R. Fanello, J. Valentin, A. Kowdle, C. Rhemann, V. Tankovich,

C. Ciliberto, P. Davidson, and S. Izadi. Low compute and fully parallel

computer vision with hashmatch. ICCV, 2017.

[17] S. R. Fanello, J. Valentin, C. Rhemann, A. Kowdle, V. Tankovich,

P. Davidson, and S. Izadi. Ultrastereo: Efﬁcient learning-based

matching for active stereo systems. CVPR, 2017.

[18] P. F. Felzenszwalb and D. P. Huttenlocher. Efﬁcient belief propagation

for early vision. IJCV, 2006.

[19] D. Freedman, E. Krupka, Y. Smolin, I. Leichter, and M. Schmidt.

SRA: fast removal of general multipath for tof sensors. ECCV, 2014.

[20] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski. Manhattan-

world stereo. In 2009 IEEE Conference on Computer Vision and

Pattern Recognition, 2009.

[21] J. Geng. Structured-light 3d surface imaging: a tutorial. Advances in

Optics and Photonics, 3(2):128–160, 2011.

[22] I. Gori, U. Pattacini, V. Tikhanoff, and G. Metta. Ranking the

good points: A comprehensive method for humanoid robots to grasp

unknown objects. In IEEE ICAR, 2013.

[23] H. Hirschmüller. Stereo processing by semiglobal matching and mutual information. PAMI, 2008.

[24] H. Hirschmuller and D. Scharstein. Evaluation of stereo matching

costs on images with radiometric differences. IEEE PAMI, 2009.

[25] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli,

J. Shotton, S. Hodges, D. Freeman, A. Davison, and A. Fitzgibbon.

KinectFusion: Real-time 3D reconstruction and interaction using a

moving depth camera. In ACM UIST, 2011.

[26] D. Jimenez, D. Pizarro, M. Mazo, and S. Palazuelos. Modelling and

correction of multipath interference in time of ﬂight cameras. In

CVPR, 2012.

[27] M. Ju and H. Kang. Constant time stereo matching. In IMVIP, 2009.

[28] C. Keskin, F. Kıraç, Y. Kara, and L. Akarun. Hand pose estimation

and hand shape classiﬁcation using multi-layered randomized decision

forests. In ECCV, 2012.

[29] S. Khamis, S. Fanello, C. Rhemann, J. Valentin, A. Kowdle, and

S. Izadi. Stereonet: Guided hierarchical reﬁnement for real-time edge-

aware depth prediction. In ECCV, 2018.

[30] K. Konolige. Projected texture stereo. In ICRA. IEEE, 2010.

[31] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. NIPS, 2011.

[32] Y. Li, A. Dai, L. Guibas, and M. Nießner. Database-assisted object

retrieval for real-time 3d reconstruction. In Computer Graphics Forum,

volume 34. Wiley Online Library, 2015.

[33] Y. Li, D. Min, M. S. Brown, M. N. Do, and J. Lu. Spm-bp: Sped-up

patchmatch belief propagation for continuous mrfs. In ICCV, 2015.

[34] J. Lu, K. Shi, D. Min, L. Lin, and M. Do. Cross-based local multipoint

ﬁltering. In Proc. CVPR, 2012.

[35] J. Lu, H. Yang, D. Min, and M. Do. Patchmatch ﬁlter: Efﬁcient edge-

aware ﬁltering meets randomized search for fast correspondence ﬁeld

estimation. In Proc. CVPR, 2013.

[36] N. Naik, A. Kadambi, C. Rhemann, S. Izadi, R. Raskar, and S. Kang.

A light transport model for mitigating multipath interference in TOF

sensors. CVPR, 2015.

[37] H. Nishihara. PRISM: A practical real-time imaging stereo matcher. MIT AI Memo No. 780, Cambridge, Mass., USA, 1984.

[38] S. Orts-Escolano, C. Rhemann, S. Fanello, W. Chang, A. Kowdle,

Y. Degtyarev, D. Kim, P. L. Davidson, S. Khamis, M. Dou, et al.

Holoportation: Virtual 3d teleportation in real-time. In UIST. ACM,

2016.

[39] V. Pradeep, C. Rhemann, S. Izadi, C. Zach, M. Bleyer, and S. Bathiche.

Monofusion: Real-time 3d reconstruction of small scenes with a single

web camera. 2013.

[40] C. Rhemann, A. Hosni, M. Bleyer, C. Rother, and M. Gelautz. Fast

cost-volume ﬁltering for visual correspondence and beyond. In CVPR,

pages 3017–3024, 2011.

[41] J. Salvi, S. Fernandez, T. Pribanic, and X. Llado. A state of the art

in structured light patterns for surface proﬁlometry. PR, 2010.

[42] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense

two-frame stereo correspondence algorithms. IJCV, 2002.

[43] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore,

A. Kipman, and A. Blake. Real-time human pose recognition in parts

from single depth images. In CVPR, 2011.

[44] J. Taylor, L. Bordeaux, T. Cashman, B. Corish, C. Keskin, T. Sharp,

E. Soto, D. Sweeney, J. Valentin, B. Luff, A. Topalian, E. Wood,

S. Khamis, P. Kohli, S. Izadi, R. Banks, A. Fitzgibbon, and J. Shotton.

Efﬁcient and precise interactive hand tracking through joint, continu-

ous optimization of pose and correspondences. SIGGRAPH, 2016.

[45] J. Valentin, V. Vineet, M.-M. Cheng, D. Kim, J. Shotton, P. Kohli,

M. Nießner, A. Criminisi, S. Izadi, and P. Torr. Semanticpaint: Inter-

active 3d labeling and learning at your ﬁngertips. ACM Transactions

on Graphics (TOG), 2015.

[46] S. Wang, S. R. Fanello, C. Rhemann, S. Izadi, and P. Kohli. The

global patch collider. CVPR, 2016.

[47] K.-J. Yoon and I.-S. Kweon. Locally adaptive support-weight approach

for visual correspondence search. In Computer Vision and Pattern

Recognition, 2005. CVPR 2005. IEEE Computer Society Conference

on, volume 2, pages 924–931. IEEE, 2005.

[48] Y. Zhang, S. Khamis, C. Rhemann, J. Valentin, A. Kowdle,

V. Tankovich, M. Schoenberg, S. Izadi, T. Funkhouser, and S. Fanello.

Activestereonet: End-to-end self-supervised learning for active stereo

systems. In ECCV, 2018.