StereoNet: Guided Hierarchical Reﬁnement for
Real-Time Edge-Aware Depth Prediction
Sameh Khamis, Sean Fanello, Christoph Rhemann,
Adarsh Kowdle, Julien Valentin, Shahram Izadi
Abstract. This paper presents StereoNet, the ﬁrst end-to-end deep ar-
chitecture for real-time stereo matching that runs at 60fps on an NVidia
Titan X, producing high-quality, edge-preserved, quantization-free disparity
maps. A key insight of this paper is that the network achieves a sub-pixel
matching precision that is an order of magnitude higher than that of
traditional stereo matching approaches. This allows us to achieve real-
time performance by using a very low resolution cost volume that en-
codes all the information needed to achieve high disparity precision. Spa-
tial precision is achieved by employing a learned edge-aware upsampling
function. Our model uses a Siamese network to extract features from
the left and right image. A ﬁrst estimate of the disparity is computed
in a very low resolution cost volume, then hierarchically the model re-
introduces high-frequency details through a learned upsampling function
that uses compact pixel-to-pixel reﬁnement networks. Leveraging color
input as a guide, this function is capable of producing high-quality edge-
aware output. We achieve compelling results on multiple benchmarks,
showing how the proposed method oﬀers extreme ﬂexibility at an ac-
ceptable computational budget.
Keywords: Stereo matching, Depth estimation, Edge-aware reﬁnement,
Cost volume ﬁltering, Deep learning
1 Introduction
Stereo matching is a classical computer vision problem that is concerned with
estimating depth from two slightly displaced images. Depth estimation has re-
cently been projected to the center stage with the rising interest in virtual and
augmented reality . It is at the heart of many tasks from 3D reconstruction to
localization and tracking . Its applications span otherwise disparate research
and product areas including indoor mapping and architecture, autonomous cars,
and human body and face tracking.
Active depth sensors like the Microsoft Kinect provide high quality depth-
maps and have not only revolutionized computer vision research [12, 11, 41, 16,
55], but also play an important role in consumer level applications. These active
depth sensors have become very popular over the recent years with the release of
many other consumer devices, such as the Intel RealSense series, the structured
2 Khamis et al.
light sensor on iPhone X, as well as time-of-ﬂight cameras such as Kinect V2.
With the rise of Augmented Reality (AR) applications on mobile devices, there
is a growing need for algorithms capable of predicting precise depth under a tight
computational budget. With the exception of the iPhone X, all smartphones on
the market can only rely on single or dual RGB streams. The release of sparse
tracking and mapping tools like ARKit and ARCore impressively demonstrates
coarse and sparse geometry estimation on mobile devices. However, they lack
dense depth estimation and therefore cannot enable exciting AR applications
such as occlusion handling or precise interaction of virtual objects with the
real world. Depth estimation using a single moving camera, akin to , or
dual cameras naturally became a requirement from the industry to scale AR to
millions of users.
The state of the art in passive depth relies on stereo triangulation between
two (rectiﬁed) RGB images. This has historically been dominated by CRF-based
approaches. These techniques obtain very good results but are computationally
slow. Inference in these models amounts to solving a generally NP-hard problem,
forcing practitioners in many cases to use solvers whose runtime is in the range
of seconds  or resort to approximate solutions [14, 15, 56, 54]. Additionally,
these techniques typically suﬀer in the presence of textureless regions, occlusions,
repetitive patterns, thin-structures, and reﬂective surfaces. The ﬁeld is slowly
transitioning and since , it started to use deep features, mostly as unary
potentials, to further advance the state of the art.
Recently, deep-architectures demonstrated a high level of accuracy at pre-
dicting depth from passive stereo data [37, 26, 29, 42]. Despite these signiﬁcant
advances, the proposed methods require vast amounts of processing power and
memory. For instance,  have 3.5 million parameters in their network and
reach a throughput of about 0.95 image per second on 960 ×540 images, and
 takes 0.5 sec to produce a single disparity on a high end GPU.
In this paper we present StereoNet, a novel deep architecture that generates
state of the art 720p depth maps at 60Hz on high end GPUs. Based on our
insight that deep architectures are very good at inferring matches with extremely
high subpixel precision, we demonstrate that a very low resolution cost volume is
suﬃcient to achieve a depth precision that is comparable to a traditional stereo
matching system that operates at full resolution. To achieve spatial precision we
apply edge-aware ﬁltering stages in a multi-scale manner to deliver a high quality
output. In summary the main contributions of this work are the following:
1. We show that the subpixel matching precision of a deep architecture is an
order of magnitude higher than that of “traditional” stereo approaches.
2. We demonstrate that the high subpixel precision of the network allows us to
achieve the depth precision of traditional stereo matching with a very low
resolution cost volume, resulting in an extremely efficient algorithm.
3. We show that previous work that introduced cost volumes in deep architectures
was over-parameterized for the task, and that removing this excess significantly
reduces the run-time and memory footprint of the system at little cost in
accuracy.
4. A new hierarchical depth-reﬁnement layer that is capable of performing high-
quality up-sampling that preserves edges.
5. Finally, we demonstrate that the proposed system reaches compelling results
on several benchmarks while being real-time on high end GPU architectures.
2 Related Work
Depth from stereo has been studied for a long time and we refer the interested
reader to [49, 22] for a survey. Correspondence search for stereo is a challenging
problem and has been traditionally divided into global and local approaches.
Global approaches formulate a cost function over the image that is traditionally
optimized using approaches such as Belief Propagation or Graph Cuts [3,17, 30,
31]. Instead, local stereo matching methods (e.g. ) center a support window
on a pixel in the reference frame and then displace this window in the second
image until the point of highest correlation is found. A major challenge for local
stereo matching is to deﬁne the optimal size for the support window. On the one
hand the window needs to be large to capture a suﬃcient amount of texture but
needs to be small at the same time to avoid aggregating wrong disparity values
that can lead to the well-known edge fattening eﬀect at disparity discontinuities.
To avoid this trade-oﬀ, adaptive support approaches weigh the inﬂuence of each
pixel inside the support region based on e.g. its color similarity to the central
pixel.
Interestingly, adaptive support weight approaches were cast as cost volume
ﬁltering in : a three-dimensional cost volume is constructed by computing the
per-pixel matching costs at all possible disparity levels. This cost volume is then
ﬁltered with a weighted average ﬁlter. This ﬁltering propagates local information
in the spatial and depth domains producing a depth map that preserves edges
across object discontinuities.
For a triangulation-based stereo matching system, the accuracy of depth is
directly linked to the precision with which the corresponding pixel in the other
image can be located. Therefore, previous work strives to do matching with
sub-pixel precision. The complexity of most algorithms scales linearly with the
number of disparities evaluated, so while one approach is to build a large cost
volume with very fine-grained disparity steps, this is computationally infeasible.
Many algorithms therefore start with discrete matching and then reﬁne these
matches by ﬁtting a local curve such as a parabolic ﬁt to the cost function
between the discrete disparity candidates (see e.g. [59,39]). Other works are
based on continuous optimization strategies  or on phase correlation . It
was shown in  that under realistic conditions the bound for subpixel precision
is 1/10th of a pixel while the theoretical limit under noise free conditions was
found to be 10 times lower . We demonstrate that this traditional wisdom
does not hold true for learning-based approaches and we can achieve a subpixel
precision of 1/30th of a pixel.
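The parabolic refinement used by traditional methods can be illustrated with a minimal sketch. The function name and the toy cost curve below are hypothetical and only serve to show the three-point fit around the best integer disparity; this is not taken from any of the cited implementations.

```python
def parabolic_subpixel(costs, d):
    """Refine the integer disparity d using the costs at d - 1, d and d + 1."""
    # At the border of the disparity range there is no neighbour to fit against.
    if d <= 0 or d >= len(costs) - 1:
        return float(d)
    c_prev, c_mid, c_next = costs[d - 1], costs[d], costs[d + 1]
    # Second difference of the cost curve; it is positive at a proper minimum.
    curvature = c_prev - 2.0 * c_mid + c_next
    if curvature <= 0.0:
        return float(d)
    # Vertex of the parabola through the three samples, relative to d.
    return d + 0.5 * (c_prev - c_next) / curvature


# Toy cost curve (hypothetical data) whose true minimum lies at 2.3 pixels.
costs = [(x - 2.3) ** 2 for x in range(6)]
best = min(range(len(costs)), key=costs.__getitem__)
refined = parabolic_subpixel(costs, best)
print(best, refined)
```

On a cost curve that is exactly quadratic the fit recovers the true minimum; on real matching costs the curve is only approximately quadratic, which is one source of the subpixel error discussed above.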
Recent work has progressed to using end-to-end learning for stereo match-
ing. Various approaches combined a learned patch embedding or matching cost
with global optimization approaches like semiglobal matching (SGM) for reﬁne-
ment .  learn a multi-scale embedding model followed by an MRF. [62,
61] learn to match image patches followed by SGM.  learn to match patches
using a Siamese feature network and optimize globally with SGM as well. 
uses a multi-stage approach where a highway network architecture is ﬁrst used to
compute the matching costs and then another network is used in postprocessing
to aggregate and pool costs.
Other works attempted to solve the stereo matching problem end-to-end
without postprocessing. [37, 26] train end-to-end an encoder-decoder network
for disparity and ﬂow estimation achieving state-of-the-art results on existing
and new benchmarks. Other end-to-end approaches used multiple reﬁnement
stages that converge to the right disparity hypotheses.  proposed a generic
architecture for labeling problems, including depth estimation, that is trained
end-to-end to predict and reﬁne the output.  proposed a cascaded approach to
reﬁne predicted depth iteratively. Iterative reﬁnement approaches, while showing
good performance on various benchmarks, tend to require a considerable amount
of computational resources.
More closely related to our work is  who used the concept of cost volume
ﬁltering but trained both the features and the ﬁlters end-to-end achieving im-
pressive results. DeepStereo  used a plane-sweep volume to synthesize novel
views from multi-view stereo input. Contrary to prior work, we are interested
in an end-to-end learning stereo pipeline that can run in real-time, therefore
we start from a very low resolution cost volume, which is then upsampled with
learned, edge aware ﬁlters.
3 StereoNet algorithm
Given pairs of input images we aim to train an end-to-end disparity prediction
pipeline. One approach to train such a pipeline is to leverage a generic encoder-
decoder network. An encoder distills the input through a series of contracting
layers to a bottleneck that captures the details most relevant to the task in
training, and the decoder reconstructs the output from the representation captured
in the bottleneck layer through a series of expanding layers. While this approach
is widely successful across various problems, including depth prediction [37, 26,
42], it lacks several qualities we care about in a stereo algorithm.
First of all, this approach does not capture any geometric intuition about
the stereo matching problem. Stereo prediction is ﬁrst-and-foremost a corre-
spondence matching problem, so we aimed to design an algorithm that can be
adapted without retraining to diﬀerent stereo cameras with varying resolutions
and baselines. Secondly, we note that similar approaches are evidently
overparameterized for problems where the prediction is a pixel-to-pixel mapping that
does not involve any warping of the input, and are thus likely to overfit.
Our approach to stereo matching incorporates a design that leverages the
problem structure and classical approaches to tackle it, akin to , while
producing edge-preserving output using compact context-aware pixel-to-pixel
refinement networks. An overview of the architecture of our model is illustrated
in Figure 1 and detailed in the following sections.
Fig. 1. Model architecture. A two stage approach is proposed: first we extract image
features at a lower resolution using a Siamese network. We then build a cost volume
at that resolution by matching the features along the scanlines, giving us a coarse
disparity estimate. We finally refine the results hierarchically to recover small details
and thin structures.
3.2 Coarse Prediction: Cost Volume Filtering
Stereo systems in general solve a correspondence problem. The problem
classically boils down to forming a disparity map by ﬁnding a pixel-to-pixel
match between two rectiﬁed images along their scanlines. The desire for a smooth
and edge-preserving solution led to approaches like cost volume ﬁltering ,
which explicitly model the matching problem by forming and processing a 3D
volume that jointly solves across all candidate disparities at each pixel. While 
directly used color values for the matching, we compute a feature representation
at each pixel that is used for matching.
Feature Network The ﬁrst step of the pipeline ﬁnds a meaningful represen-
tation of image patches that can be accurately matched in the later stages.
We recall that stereo matching suffers in textureless regions and traditional methods
solve this issue by aggregating the cost using large windows. We replicate the
same behavior in the network by making sure the features are extracted from a
big receptive ﬁeld. In particular, we use a feature network with shared weights
between the two input images (also known as a Siamese network). We ﬁrst ag-
gressively downsample the input images using K 5×5 convolutions with a stride
of 2, keeping the number of channels at 32 throughout the downsampling. In
our experiments we set K to 3 or 4. We then apply 6 residual blocks  that
employ 3 ×3 convolutions, batch-normalization , and leaky ReLu activations
(α= 0.2) . Finally, this is processed using a ﬁnal layer with a 3 ×3 con-
volution that does not use batch-normalization or activation. The output is a
32-dimensional feature vector at each pixel in the downsampled image. This low
resolution representation is important for two reasons: 1) it has a big receptive
ﬁeld, useful for textureless regions. 2) It keeps the feature vectors compact.
Cost Volume At this point, we form a cost volume at the coarse resolution
by taking the diﬀerence between the feature vector of a pixel and the feature
vectors of the matching candidates. We noted that asymmetric representations
in general performed well, and concatenating the two vectors achieved similar
results in our experiments.
At this stage, a traditional stereo method would use a winner-takes-all (WTA)
approach that picks the disparity with the lowest Euclidean distance between
the two feature vectors. Instead, here we let the network learn the right metric
by running multiple convolutions followed by non-linearities.
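To make the baseline concrete, the following is a minimal sketch of a difference-based cost volume on a single scanline followed by the traditional WTA selection. It assumes that a left pixel at column x matches the right pixel at column x - d; the function names and the toy features are hypothetical, and the learned 3D filtering described next is not modeled.

```python
def build_cost_volume(left_feats, right_feats, max_disp):
    """Feature differences for one scanline.

    Returns volume[x][d], the difference vector between left pixel x and
    right pixel x - d, or None when x - d falls outside the image.
    """
    volume = []
    for x, f_left in enumerate(left_feats):
        row = []
        for d in range(max_disp + 1):
            if x - d < 0:
                row.append(None)
            else:
                f_right = right_feats[x - d]
                row.append([a - b for a, b in zip(f_left, f_right)])
        volume.append(row)
    return volume


def winner_takes_all(volume):
    """Pick, per pixel, the disparity with the smallest Euclidean distance."""
    disparities = []
    for row in volume:
        best_d, best_cost = 0, float("inf")
        for d, diff in enumerate(row):
            if diff is None:
                continue
            cost = sum(v * v for v in diff) ** 0.5
            if cost < best_cost:
                best_d, best_cost = d, cost
        disparities.append(best_d)
    return disparities


# Toy scanline (hypothetical features): the left view shows the right view
# shifted by two pixels, so the expected disparity away from the border is 2.
right = [[float(i), float(i * i % 7)] for i in range(8)]
left = [[0.0, 0.0], [0.0, 0.0]] + right[:-2]
disp = winner_takes_all(build_cost_volume(left, right, max_disp=3))
print(disp)
```

WTA yields integer disparities only, which is why the learned metric and the differentiable selection discussed below matter for subpixel precision.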
In particular, to aggregate context across the spatial domain as well as the
disparity domain, we ﬁlter the cost volume with four 3D convolutions with a
ﬁlter size of 3 ×3×3, batch-normalization, and leaky ReLu activations. A ﬁnal
3×3×3 convolutional layer that does not use batch-normalization or activation
is then applied, and the ﬁltering layers produce a 1-dimensional output at each
pixel and candidate disparity.
For an input image of size $W \times H$ and evaluating a maximum of $D$ candidate
disparities, our cost volume is of size $W/2^K \times H/2^K \times (D+1)/2^K$ for $K$
downsampling layers. In our design of StereoNet we targeted a compact approach with
a small memory footprint that can be potentially deployed to mobile platforms.
Unlike  who form a feature representation at quarter resolution and aggregate
cost volumes across multiple levels, we note that most of the time and compute is
spent matching at higher resolutions, while most of the performance gain comes
from matching at lower resolutions. We validate this claim in our experiments
and show that the performance loss is not signiﬁcant in light of the speed gain.
The reason for this is that the network achieves an order of magnitude higher sub-pixel
precision than traditional stereo matching approaches. Therefore, matching at
higher resolutions is not needed.
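The effect of K on the size of the cost volume can be checked with a short calculation. This is a sketch under stated assumptions: the disparity range D = 191 is a hypothetical value (chosen so that D + 1 = 192), and fractional sizes are rounded up, which the text does not specify.

```python
import math


def cost_volume_shape(width, height, max_disp, k):
    """Coarse cost volume size W/2^K x H/2^K x (D+1)/2^K, rounded up (assumption)."""
    scale = 2 ** k
    return (
        math.ceil(width / scale),
        math.ceil(height / scale),
        math.ceil((max_disp + 1) / scale),
    )


def num_cells(shape):
    """Number of (pixel, disparity) entries in a volume of the given shape."""
    w, h, d = shape
    return w * h * d


# 960 x 540 is the Scene Flow resolution quoted in the text; D = 191 is hypothetical.
for k in (3, 4):
    shape = cost_volume_shape(960, 540, 191, k)
    print(k, shape, num_cells(shape))
```

Each additional downsampling layer halves all three dimensions, so the number of entries the 3D convolutions must process drops by roughly a factor of eight.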
Differentiable arg min We typically would select the disparity with the
minimum cost at each pixel in the filtered cost volume using arg min. For a pixel $i$
and a cost function over disparity values $C_i(d)$, the selected disparity value $d_i$ is
$$d_i = \arg\min_d C_i(d).$$
This however fails to learn since arg min is a non-diﬀerentiable function. We
considered two diﬀerentiable variants in our approach. The ﬁrst of which is soft
arg min, which was originally proposed in  and was used in . Eﬀectively, the
selected disparity is a softmax-weighted combination of all the disparity values:
$$d_i = \sum_{d} d \cdot \frac{\exp(-C_i(d))}{\sum_{d'} \exp(-C_i(d'))}.$$
The second diﬀerentiable variant is a probabilistic selection that samples from
the softmax distribution over the costs:
$$d_i = d, \quad \text{where } d \sim \frac{\exp(-C_i(d))}{\sum_{d'} \exp(-C_i(d'))}.$$
Diﬀerentiating through the sampling process uses gradient estimation techniques
to learn the distribution of disparities by minimizing the expected loss of the
stochastic process. While this technique has roots in policy gradient approaches
in reinforcement learning , it was recently formulated as stochastic compu-
tation graphs in  and applied to RANSAC-based camera localization in .
Additionally, the parallel between the two diﬀerentiable variants we discussed is
akin to that between soft and hard attention networks .
Unfortunately the probabilistic approach signiﬁcantly underperformed in our
experiments, even with various variance reduction techniques . We expect
that this is because it preserves hard selections. This trait is arguably critical
in many applications, but in our model it is superseded by the ability of soft
arg min to regress subpixel-accurate values. This conclusion is supported by the
literature on continuous action spaces in reinforcement learning . The soft
arg min selection was consequently faster to converge and easier to optimize, and
it is what we chose to use in our experiments.
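As a minimal sketch of the soft arg min selection for a single pixel, the snippet below computes the softmax-weighted disparity from a list of costs and compares it with the hard arg min. It assumes candidate disparities are indexed 0, 1, 2, ...; the function names and the example costs are hypothetical.

```python
import math


def softmax_over_negated_costs(costs):
    """Softmax of -C(d): a low cost maps to a high probability."""
    # Subtracting the minimum cost keeps the exponentials numerically stable
    # and does not change the resulting distribution.
    c_min = min(costs)
    weights = [math.exp(-(c - c_min)) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]


def soft_arg_min(costs):
    """Softmax-weighted combination of the disparity values (differentiable)."""
    probs = softmax_over_negated_costs(costs)
    return sum(d * p for d, p in enumerate(probs))


def hard_arg_min(costs):
    """Integer disparity with the minimum cost (not differentiable)."""
    return min(range(len(costs)), key=costs.__getitem__)


# Hypothetical costs: the minimum is at d = 2, with d = 3 a closer runner-up
# than d = 1, so the soft estimate is pulled slightly above 2.
costs = [6.0, 1.0, 0.0, 0.5, 6.0]
print(hard_arg_min(costs), soft_arg_min(costs))
```

The soft estimate is a continuous value, which is what lets the network regress subpixel disparities directly from a coarse set of candidates.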
3.3 Hierarchical Reﬁnement: Edge-Aware Upsampling
The downside to relying on coarse matching is that the resulting myopic output
lacks ﬁne details. To maintain our compact design, we approach this problem
by learning an edge-preserving reﬁnement network. We note that the network’s
job at this stage is to dilate or erode the disparity values to blend in high-
frequency details using the color input as guide, so a compact network that learns
a pixel-to-pixel mapping, similar to networks employed in recent computational
photography work [8, 7, 20], is an appropriate approach. Specifically, we task the
refinement network with finding only a residual (or a delta disparity) to add to or
subtract from the coarse prediction.
Our reﬁnement network takes as input the disparity bilinearly upsampled to
the output size as well as the color resized to the same dimensions. Recently
deconvolutions were shown to produce checkerboard artifacts, so we opted to
use bilinear upsampling and convolutions instead . The concatenated color
and disparity ﬁrst pass through a 3 ×3 convolutional layer that outputs a 32-
dimensional representation. This is then passed through 6 residual blocks that,
again, employ 3 ×3 convolutions, batch-normalization, and leaky ReLu activa-
tions (α= 0.2). We use atrous convolutions in these blocks to sample from a
larger context without increasing the network size . We set the dilation fac-
tors for the residual blocks to 1, 2, 4, 8, 1, and 1 respectively. This output is then
processed using a 3×3 convolutional layer that does not use batch-normalization
or activation. The output of this network is a 1-dimensional disparity residual
that is then added to the previous prediction. We apply a ReLu to the sum to
constrain disparities to be positive.
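The benefit of the dilation factors can be illustrated by computing the receptive field of the refinement stack. This is a rough sketch that assumes a single 3×3 convolution per residual block (the text does not state how many each block contains) and stride 1 throughout; the helper names are hypothetical. The last function shows the residual update followed by the ReLU.

```python
def receptive_field(dilations, kernel=3):
    """Receptive field (in pixels, per axis) of a stack of stride-1 convolutions."""
    rf = 1
    for dilation in dilations:
        # A kernel of size k with dilation r spans (k - 1) * r extra pixels.
        rf += (kernel - 1) * dilation
    return rf


# First 3x3 convolution, six residual blocks (one convolution each is an
# assumption), and the final 3x3 convolution.
dilations = [1] + [1, 2, 4, 8, 1, 1] + [1]
rf_dilated = receptive_field(dilations)
rf_plain = receptive_field([1] * len(dilations))


def apply_residual(upsampled_disparity, residual):
    """Add the predicted residual and clamp at zero, as the ReLU on the sum does."""
    return [max(0.0, d + r) for d, r in zip(upsampled_disparity, residual)]


print(rf_dilated, rf_plain)
print(apply_residual([1.0, 0.2], [0.5, -0.7]))
```

Under these assumptions the dilated stack sees more than twice the context of an undilated stack with the same number of parameters.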
In our experiments we evaluated hierarchically refining the output with a
cascade of the described network, as well as applying a single refinement that
upsamples the coarse output to the full resolution in one-shot. Figure 2 illustrates
the output of the refinement layer at each level of the hierarchy as well as the
residuals added at each level to recover the high-frequency details. The behavior
of this network is reminiscent of joint bilateral upsampling , and indeed we
believe this network is a learned edge-aware upsampling function that leverages
a guide image.
Fig. 2. Hierarchical refinement results. The result at each stage (top row), starting
with the cost volume output in the top left corner, is updated with the output of
the corresponding refinement network (bottom row). The refinement network output
expectedly dilates and erodes around the edges using the color input as guide. The
groundtruth is shown in the lower right corner. The average endpoint error at each
stage for this example is: 3.27, 2.34, 1.80, and 1.26 respectively. Zoom in for details.
3.4 Loss Function
We train StereoNet in a fully supervised manner using groundtruth-labeled
stereo data. We minimize the hierarchical loss function:
$$L = \sum_k \rho(d_i^k - \hat{d}_i),$$
where $d_i^k$ is the predicted disparity at pixel $i$ at the $k$-th refinement level, with
$k = 0$ denoting the output pre-refinement, and $\hat{d}_i$ is the groundtruth disparity
at the same pixel. The predicted disparity map is always bilinearly upsampled
to match the groundtruth resolution. Finally, $\rho(\cdot)$ is the two-parameter robust
function from  with its parameters set as $\alpha = 1$ and $c = 2$, approximating a
smoothed L1 loss.
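A minimal sketch of such a loss is given below. It assumes that the two-parameter robust function has the general form $\rho(x) = \frac{|\alpha - 2|}{\alpha}\left[\left(\frac{(x/c)^2}{|\alpha - 2|} + 1\right)^{\alpha/2} - 1\right]$, which for $\alpha = 1$ reduces to $\sqrt{(x/c)^2 + 1} - 1$; summing over pixels as well as refinement levels is also an assumption, and the function names are hypothetical.

```python
import math


def robust_loss(x, alpha=1.0, c=2.0):
    """Two-parameter robust loss (assumed general form), for alpha not in {0, 2}."""
    if alpha in (0.0, 2.0):
        raise ValueError("this sketch does not handle the limits alpha = 0 or 2")
    b = abs(alpha - 2.0)
    return (b / alpha) * (((x / c) ** 2 / b + 1.0) ** (alpha / 2.0) - 1.0)


def hierarchical_loss(predictions, target):
    """Sum rho(d_i^k - d_hat_i) over refinement levels k and pixels i.

    predictions: one list of disparities per level, already upsampled to the
    groundtruth resolution. target: groundtruth disparities.
    """
    return sum(
        robust_loss(d - t)
        for level in predictions
        for d, t in zip(level, target)
    )


# For alpha = 1 the loss is quadratic near zero and close to linear for large |x|.
for err in (0.0, 0.5, 2.0, 20.0):
    print(err, robust_loss(err))
```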
3.5 Implementation details
We implemented and trained StereoNet using Tensorﬂow . All our experiments
were optimized using RMSProp  with an exponentially-decaying learning rate
initially set to 1e−3. Input data is ﬁrst normalized to the range [−1,1]. We use a
batch size of 1 and we do not crop because of the smaller model size, unlike .
Our network needs around 150k iterations to reach convergence. We found
that, intuitively, training with the left and right disparity maps for an image pair
at the same time significantly sped up the training time. On smaller datasets
where training from scratch would be futile, we fine-tuned the pre-trained model
for an additional 50k iterations.
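The input normalization and the learning rate schedule can be sketched as follows. The decay rate and decay interval are hypothetical, since only the initial rate of 1e-3 is given, and 8-bit input intensities are assumed.

```python
def normalize_pixel(value, max_value=255.0):
    """Map an intensity in [0, max_value] to the range [-1, 1]."""
    return 2.0 * value / max_value - 1.0


def learning_rate(step, initial=1e-3, decay_rate=0.9, decay_steps=10000):
    """Exponentially decaying learning rate.

    decay_rate and decay_steps are hypothetical values chosen for illustration.
    """
    return initial * decay_rate ** (step / decay_steps)


print(normalize_pixel(0), normalize_pixel(255))
print(learning_rate(0), learning_rate(150000))
```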
4 Experiments
Here, we evaluate our system on several datasets and demonstrate that we
achieve high quality results at a fraction of the computational cost required
by the state of the art.
4.1 Datasets and Setup
We evaluated StereoNet quantitatively and qualitatively on three datasets: Scene
Flow , KITTI 2012  and KITTI 2015 . Scene Flow is a large synthetic
stereo dataset suitable for deep learning models. However, the other two KITTI
datasets, while more comparable to a real-world setting, are too small for full
end-to-end training. We followed previous end-to-end approaches by initially
training on Scene Flow and then individually ﬁne-tuning the resulting model on
the KITTI datasets [29, 42]. Finally, we compare against prominent state-of-the-
art methods in terms of both accuracy and runtime to show the viability of our
approach in real-time scenarios.
Additionally, we performed an ablation study on the Scene Flow dataset using
four variants of our model. We evaluated setting the number of downsampling
convolutions K (detailed in Section 3.2) to 3 and 4. This controls the resolution at
which the cost volume is formed. The cost volume ﬁltering is exponentially faster
with more aggressive downsampling, but comes at the expense of increasingly
losing details around thin structures and small objects. The reﬁnement layer
can bring in a lot of the ﬁne details, but if the signal is completely missing
from the cost volume, it is unlikely to recover them. Additionally we evaluated
using K refinement layers to hierarchically recover the details at the different
scales versus using a single reﬁnement layer to upsample the cost volume output
directly to the desired ﬁnal resolution.
4.2 Subpixel Precision
The precision of a depth system is usually a crucial variable when choosing the
right technology for a given application. A triangulation system with a baseline
$b$, a focal length $f$ and a subpixel precision $\delta$ has an error $\epsilon$ which increases
quadratically with the distance $Z$: $\epsilon = \frac{\delta Z^2}{b f}$. Competitive technologies such as
Time-of-Flight do not suﬀer from this issue, which makes them appealing for long
range applications such as room scanning and reconstruction. Despite this it has
been demonstrated that multipath effects in ToF systems can distort geometry
even in close-up tasks such as object scanning . Long range precision remains
one of the main arguments against a stereo system and in favor of ToF.
Fig. 3. Subpixel precision in stereo matching. We demonstrate that StereoNet achieves
a subpixel precision of 0.03, which is one order of magnitude lower than traditional
stereo approaches. The lower bound of traditional approaches was found to be 1/10th
under realistic conditions (see ) which we indicate by the black line. Moreover, our
method can run in real-time on 720p images.
Here we show that deep architectures are a breakthrough in terms of subpixel
precision and therefore they can compete with other technologies not only
at short distances but also at long range. Traditional stereo matching
methods perform a discrete search and then a parabola interpolation to retrieve
the accurate disparity. These methods usually lead to a subpixel precision of ∼0.25
pixels, which roughly corresponds to a 4.5 cm error at 3 m distance for a system with
a 55 mm baseline such as the Intel Realsense D415.
To assess the precision of our method, we used the evaluation set of Scene
Flow and we computed the average error only for those pixels that were correctly
matched at integer locations. Results correspond to the average of over a hundred
million pixels and are reported in Figure 3. From this ﬁgure, it is important to
note that: (1) the proposed method achieves a subpixel precision of 0.03 which is
one order of magnitude lower than traditional stereo matching approaches such
as [4, 14, 15]; (2) the reﬁnement layers are performing very similarly irrespective
of the resolution of the cost volume; (3) without any reﬁnement the downsampled
cost volume can still achieve a subpixel precision of 0.03 in the low resolution
output. However, the error increases, almost linearly, with the downsampling
factor.
Note that a subpixel precision of 0.03 means that the expected error is less
than 5mm at 3m distance from the camera (Intel Realsense D415). This result
makes triangulation systems very appealing and comparable with ToF technol-
ogy without suﬀering from multi-path eﬀects.
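The relation between subpixel precision and depth error can be evaluated with a short sketch of the triangulation error formula, error = δZ²/(bf). The baseline of 55 mm and the focal length of 900 pixels are assumptions made for illustration (the focal length is not given in the text), so the printed values are indicative only.

```python
def depth_error(subpixel_precision, distance_m, baseline_m, focal_px):
    """Triangulation depth error delta * Z^2 / (b * f), in metres."""
    return subpixel_precision * distance_m ** 2 / (baseline_m * focal_px)


BASELINE_M = 0.055  # assumed 55 mm baseline
FOCAL_PX = 900.0    # hypothetical focal length in pixels

for delta in (0.25, 0.03):
    for z in (1.0, 3.0):
        err_mm = 1000.0 * depth_error(delta, z, BASELINE_M, FOCAL_PX)
        print(f"delta={delta} px, Z={z} m: {err_mm:.1f} mm")
```

The error is linear in the subpixel precision and quadratic in the distance, so improving the precision from 0.25 to 0.03 pixels reduces the error by the same factor at every range.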
Fig. 4. Qualitative results on the FlyingThings3D test set. The proposed two-stage
architecture is able to recover very ﬁne details despite the low resolution at which we
form the cost volume.
4.3 Quantitative Results
We now evaluate the model on standard benchmarks proving the eﬀectiveness
of the proposed methods and the diﬀerent trade-oﬀs between the resolution of
the cost volume and the precision obtained.
SceneFlow. Although this data is synthetically generated, the evaluation
sequences are very challenging due to the presence of occlusions, thin structures and
large disparities. We evaluated our model reporting the end point error (EPE)
in Table 1.
A single, unreﬁned model, i.e. using only the cost volume output at 1/8
of the resolution, achieves an EPE of 2.48 which is better than the full model
presented in , which reaches an EPE of 2.51. Notice that our unreﬁned model
is composed of 360k parameters and runs in 12 msec at the 960 ×540 input
resolution, whereas  uses 3.5 million parameters with a runtime of 950 msec
on the same resolution. Our best, multi-scale architecture achieves the state-
of-the-art error of 1.1, which is also lower than the one reported in very recent
methods such as . Qualitative examples can be found in Figure 4. Notice how
the method recovers very challenging ﬁne details.
One last consideration regards the resolution of the cost volume. On one hand
we proved that a coarse cost volume already carries all the information needed
to retrieve a very high subpixel precision, i.e. high disparity resolution. On the
other hand, downsampling the image may lead to a loss in spatial resolution,
therefore thin structures cannot be reconstructed if the output of the cost vol-
ume is very coarse. Here we demonstrate that a volume at 1/16 of the resolution
is powerful enough to recover very challenging small objects. Indeed in Figure
5, we compare the output of the three cost volumes at 1/4, 1/8, 1/16 resolutions
where we also applied the refinement layers. We can observe that the fine
structures that are missed in the 1/16 resolution disparity map are correctly recovered
by the upsampling strategy we propose. The cost volume at 1/4 is not
necessary to achieve compelling results and this is an important finding for mobile
applications. As shown in the previous subsection, even at low resolution the
network achieves a subpixel precision of 1/30th pixel. However, we want to also
highlight that to achieve state of the art precision on multiple benchmarks, the
cost volume resolution becomes an important factor as demonstrated in Table 1.
Fig. 5. Cost volume comparisons. A cost volume at 1/16 resolution already has the
information required to produce high quality disparity maps. This is evident in that
post refinement we recover challenging thin structures and the overall end point error
(EPE) is below one pixel.
Method         EPE all   EPE nocc   EPE all, unref   EPE nocc, unref
8x, multi      1.101     0.768      2.512            1.795
8x, single     1.532     1.058      2.486            1.784
16x, multi     1.525     1.140      3.764            2.912
16x, single    1.974     1.476      3.558            2.773
CG-Net Fast    7.27      -          -                -
CG-Net Full    2.51      -          -                -
CRL            1.32      -          -                -
Table 1. Quantitative evaluation on SceneFlow. We achieve state of the art results
compared to recent deep learning methods. We compare four variants of our model
which vary in the resolution at which the cost volume is formed (8x vs 16x) and the
number of refinement layers (multiple vs single).
Kitti. Kitti is a prominent stereo benchmark that was captured by driving a car
equipped with cameras and a laser scanner . The dataset is very challenging
due to the huge variability, reﬂections, overexposed areas and more importantly,
the lack of a big training set. Despite this, we provide the results on Kitti 2012
in Table 2. Our model uses a downsampling factor of 8 for the cost volume and
3 reﬁnement steps. Among the top-performing methods, we compare to three
signiﬁcant ones. Current state of the art , achieves an EPE of 0.6, but it
has a running time of 0.9 seconds per image and uses a multi-scale cost volume
and several 3D deconvolutions. The earlier deep learning-based stereo matching
approach of  takes 67 seconds per image and has higher error (0.9) compared
to our method that runs at 0.015s per stereo pair. The SGM-net  has an error
comparable to ours.
Fig. 6. Qualitative Results on Kitti 2012 and Kitti 2015. Notice how our method
preserves edges and recovers details compared to the fast . State of the art methods
are one order of magnitude slower than the proposed approach.
Method       Out-Noc   Out-All   Avg-Noc   Avg-All   Runtime
StereoNet    4.91      6.02      0.8       0.9       0.015s
CG-Net       2.71      3.46      0.6       0.7       0.9s
MC-CNN       3.9       5.45      0.7       0.9       67s
SGM-Net      3.6       5.15      0.7       0.9       67s
Table 2. Quantitative evaluation on Kitti 2012. For StereoNet we used a model with
a downsampling factor of 8 and 3 refinement levels. We report the percentage of pixels
with error bigger than 2, as well as the overall EPE in both non occluded (Noc) and
all the pixels (All).
Although we do not reach state of the art results, we believe
that the produced disparity maps are very compelling as shown in Figure 6,
bottom. We analyzed the source of errors in our model and we found that most
of the wrong estimates are around reﬂections, which result in a wrong disparity
prediction, as well as occluded regions, which do not have a correspondence in
the other view. These areas cannot be explained by the data and the problem
can then be formulated as an inpainting task, which our model is not trained for.
State of the art  uses a hour-glass like architecture in their reﬁnement step,
that has been shown to be really eﬀective for inpainting purposes . This
is certainly a valid solution to handle those invalid areas, however it requires
signiﬁcant additional computational resources. We believe that the simplicity of
the proposed architecture shows important insights and it can lead the way to
interesting directions to overcome the current limitations.
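The two kinds of numbers reported for Kitti 2012 (percentage of outlier pixels and average end-point error, each over non-occluded or all pixels) can be sketched as follows. This is a minimal illustration with NumPy, not the official benchmark code: the function name `stereo_errors`, the mask names and the toy disparity values are all hypothetical.

```python
import numpy as np

def stereo_errors(pred, gt, valid, threshold=2.0):
    """Return (outlier percentage, mean end-point error) over the valid pixels.

    pred, gt : float arrays of disparities in pixels, same shape.
    valid    : boolean mask of the pixels to evaluate, e.g. non-occluded
               pixels for "Noc" or every labelled pixel for "All".
    """
    err = np.abs(pred - gt)[valid]                 # per-pixel absolute disparity error
    outlier_pct = 100.0 * float(np.mean(err > threshold))
    epe = float(np.mean(err))
    return outlier_pct, epe

# Toy 2x3 example with made-up values (not taken from the benchmark).
gt = np.array([[10.0, 12.0, 14.0], [20.0, 22.0, 24.0]])
pred = np.array([[10.5, 12.0, 17.0], [20.0, 27.0, 24.5]])
labelled = np.ones_like(gt, dtype=bool)                               # "All"
non_occluded = np.array([[True, True, True], [True, False, True]])    # "Noc"

out_all, epe_all = stereo_errors(pred, gt, labelled)
out_noc, epe_noc = stereo_errors(pred, gt, non_occluded)
print(f"All: {out_all:.2f}% outliers, EPE {epe_all:.2f}")
print(f"Noc: {out_noc:.2f}% outliers, EPE {epe_noc:.2f}")
```

Masking out the occluded pixel removes the largest error in this toy case, which is why the "Noc" figures are lower than the "All" figures, as they are in the table.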
Similarly, we evaluated our algorithm on Kitti 2015 and report the results
in Table 3, where similar considerations can be made. In Figure 6, top, we show
some examples from the test data.
Method       D1-bg  D1-fg  D1-all  Runtime
StereoNet    4.30   7.45   4.83    0.015s
CRL          2.48   3.59   2.67    0.5s
CG-Net Full  2.21   6.16   2.87    0.9s
MC-CNN       2.89   8.88   3.89    67s
SGM-Net      2.66   8.64   3.66    67s
Table 3. Quantitative evaluation on Kitti 2015. For StereoNet we used a model with
a downsampling factor of 8 and 3 reﬁnement levels. We report the percentage of pixels
with error bigger than 1 in background regions (bg), foreground areas (fg), and all.
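The Runtime columns of Tables 2 and 3 can be converted into frame rates and slowdown factors relative to StereoNet. The short sketch below only does this arithmetic on the runtimes listed in the tables; the dictionary and helper names are hypothetical.

```python
# Seconds per stereo pair, copied from the Runtime columns of Tables 2 and 3.
runtimes = {
    "StereoNet": 0.015,
    "CRL": 0.5,
    "CG-Net": 0.9,
    "MC-CNN": 67.0,
    "SGM-Net": 67.0,
}

def frames_per_second(seconds_per_pair):
    """Frame rate implied by a per-pair runtime."""
    return 1.0 / seconds_per_pair

def slowdown_vs(reference, other):
    """How many times slower `other` is than `reference`."""
    return runtimes[other] / runtimes[reference]

fps = {name: frames_per_second(t) for name, t in runtimes.items()}
slowdown = {name: slowdown_vs("StereoNet", name) for name in runtimes}

for name in runtimes:
    print(f"{name:10s} {fps[name]:8.3f} fps  {slowdown[name]:8.1f}x StereoNet runtime")
```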
Fig. 7. Runtime analysis of StereoNet. Breakdown of the running time. Notice how
most of the time is spent at the last level of reﬁnement.
4.4 Running Time Analysis
We conclude this section with a breakdown of the running time of our algorithm.
Readers interested in real-time applications will find it useful to understand
where the bottlenecks are. The current algorithm runs at 60fps on an NVidia
Titan X, and Fig. 7 shows the breakdown of the whole running time. Notice how
feature extraction, volume formation and filtering take less than half of the
whole computation (41%), and the most time-consuming step is the refinement
stage: the last level of refinement, done at full resolution, uses 38% of the
computation.
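To make the percentages concrete, the sketch below converts them into approximate per-stage times, assuming the 0.015s per stereo pair reported in Tables 2 and 3. The remaining share is not itemized in the text; attributing it to the coarser refinement levels and other overhead is an assumption of this example, and all variable names are hypothetical.

```python
total_ms = 15.0  # 0.015 s per stereo pair, as listed in Tables 2 and 3

shares = {
    # Feature extraction, cost volume formation and filtering (41% in the text).
    "matching": 0.41,
    # Last refinement level, run at full resolution (38% in the text).
    "last_refinement": 0.38,
}
# ASSUMPTION: everything not itemized above is lumped into one bucket
# (coarser refinement levels plus any remaining overhead).
shares["remaining"] = 1.0 - sum(shares.values())

stage_ms = {stage: share * total_ms for stage, share in shares.items()}

for stage, ms in stage_ms.items():
    print(f"{stage:16s} {100 * shares[stage]:5.1f}%  ~{ms:.2f} ms")
```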
We presented StereoNet, the first real-time, high quality end-to-end architecture
for passive stereo matching. We started from the insight that a low resolution
cost volume contains most of the information needed to generate high-precision
disparity maps and to recover thin structures given enough training data. We
demonstrated a subpixel precision of 1/30th of a pixel, surpassing limits published
in the literature. Our refinement approach hierarchically recovers high-frequency
details using the color input as a guide, drawing parallels to a data-driven joint
bilateral upsampling operator. The main limitation of our approach is due to
the lack of supervised training data: indeed, we showed that when enough examples
are available, our method reaches state of the art results. To mitigate this
effect, our future work involves a combination of supervised and self-supervised
learning to augment the training set.
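For readers unfamiliar with the operator the refinement is compared to, the following is a minimal sketch of classic (hand-crafted, not learned) joint bilateral upsampling of a low-resolution disparity map guided by a full-resolution grayscale image. It is not the StereoNet refinement network. The function name, the parameters `radius`, `sigma_s` and `sigma_r`, and the toy data are hypothetical, and the guide is assumed to be scaled to [0, 1].

```python
import numpy as np

def joint_bilateral_upsample(disp_lo, guide_hi, factor,
                             radius=1, sigma_s=1.0, sigma_r=0.1):
    """Upsample disp_lo (h x w) to the size of guide_hi (H x W, values in [0, 1])."""
    H, W = guide_hi.shape
    h, w = disp_lo.shape
    out = np.zeros((H, W), dtype=np.float64)
    for y in range(H):
        for x in range(W):
            yl, xl = y / factor, x / factor               # pixel position in low-res coordinates
            cy, cx = min(int(yl), h - 1), min(int(xl), w - 1)
            num = den = 0.0
            for qy in range(cy - radius, cy + radius + 1):
                for qx in range(cx - radius, cx + radius + 1):
                    if not (0 <= qy < h and 0 <= qx < w):
                        continue
                    # Guide value at the full-resolution location of the low-res sample.
                    gy, gx = min(qy * factor, H - 1), min(qx * factor, W - 1)
                    # Spatial weight: distance measured in the low-res grid.
                    w_s = np.exp(-((qy - yl) ** 2 + (qx - xl) ** 2) / (2 * sigma_s ** 2))
                    # Range weight: similarity of guide intensities (this preserves edges).
                    diff = guide_hi[y, x] - guide_hi[gy, gx]
                    w_r = np.exp(-(diff ** 2) / (2 * sigma_r ** 2))
                    num += w_s * w_r * disp_lo[qy, qx]
                    den += w_s * w_r
            out[y, x] = num / den
    return out

# Toy example: a vertical intensity edge in the guide and two disparity levels.
guide = np.array([[0.0, 0.0, 1.0, 1.0]] * 4)
disp_lo = np.array([[10.0, 50.0], [10.0, 50.0]])
disp_hi = joint_bilateral_upsample(disp_lo, guide, factor=2)
print(np.round(disp_hi, 3))
```

Because the range weight suppresses low-resolution samples whose guide intensity differs from that of the output pixel, the upsampled disparity edge stays aligned with the intensity edge instead of being blurred across it.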
1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S.,
Davis, A., Dean, J., Devin, M., et al.: Tensorﬂow: Large-scale machine learning on
heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
2. Barron, J.T.: A more general robust loss function. arXiv preprint arXiv:1701.03077
3. Besse, F., Rother, C., Fitzgibbon, A., Kautz, J.: Pmbp: Patchmatch belief propaga-
tion for correspondence ﬁeld estimation. International Journal of Computer Vision
110(1), 2–13 (2014)
4. Bleyer, M., Rhemann, C., Rother, C.: Patchmatch stereo-stereo matching with
slanted support windows. In: Bmvc. vol. 11, pp. 1–11 (2011)
5. Brachmann, E., Krull, A., Nowozin, S., Shotton, J., Michel, F., Gumhold, S.,
Rother, C.: Dsac-diﬀerentiable ransac for camera localization. In: IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR). vol. 3 (2017)
6. Chapelle, O., Wu, M.: Gradient descent optimization of smoothed information
retrieval metrics. Information retrieval 13(3), 216–235 (2010)
7. Chen, Q., Koltun, V.: Photographic image synthesis with cascaded reﬁnement
networks. In: The IEEE International Conference on Computer Vision (ICCV).
vol. 1 (2017)
8. Chen, Q., Xu, J., Koltun, V.: Fast image processing with fully-convolutional net-
works. In: IEEE International Conference on Computer Vision. vol. 9 (2017)
9. Chen, Z., Sun, X., Wang, L., Yu, Y., Huang, C.: A deep visual correspondence em-
bedding model for stereo matching costs. In: Proceedings of the IEEE International
Conference on Computer Vision. pp. 972–980 (2015)
10. Delon, J., Rougé, B.: Small baseline stereovision. J. Math. Imaging Vis. (2007)
11. Dou, M., Davidson, P., Fanello, S.R., Khamis, S., Kowdle, A., Rhemann, C.,
Tankovich, V., Izadi, S.: Motion2fusion: Real-time volumetric performance cap-
ture. SIGGRAPH Asia (2017)
12. Dou, M., Khamis, S., Degtyarev, Y., Davidson, P., Fanello, S.R., Kowdle, A., Es-
colano, S.O., Rhemann, C., Kim, D., Taylor, J., Kohli, P., Tankovich, V., Izadi,
S.: Fusion4d: Real-time performance capture of challenging scenes. SIGGRAPH
13. Fanello, S.R., Rhemann, C., Tankovich, V., Kowdle, A., Orts Escolano, S., Kim,
D., Izadi, S.: Hyperdepth: Learning depth from structured light without matching.
In: CVPR (2016)
14. Fanello, S.R., Valentin, J., Kowdle, A., Rhemann, C., Tankovich, V., Ciliberto,
C., Davidson, P., Izadi, S.: Low compute and fully parallel computer vision with
hashmatch. In: ICCV (2017)
15. Fanello, S.R., Valentin, J., Rhemann, C., Kowdle, A., Tankovich, V., Davidson, P.,
Izadi, S.: Ultrastereo: Eﬃcient learning-based matching for active stereo systems.
In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on.
pp. 6535–6544. IEEE (2017)
16. Fanello, S., Gori, I., Metta, G., Odone, F.: One-shot learning for real-time action
recognition. In: IbPRIA (2013)
17. Felzenszwalb, P.F., Huttenlocher, D.P.: Eﬃcient belief propagation for early vision.
International journal of computer vision 70(1), 41–54 (2006)
18. Flynn, J., Neulander, I., Philbin, J., Snavely, N.: Deepstereo: Learning to predict
new views from the world’s imagery. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. pp. 5515–5524 (2016)
19. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti
vision benchmark suite. In: Computer Vision and Pattern Recognition (CVPR),
2012 IEEE Conference on. pp. 3354–3361. IEEE (2012)
20. Gharbi, M., Chen, J., Barron, J.T., Hasinoﬀ, S.W., Durand, F.: Deep bilateral
learning for real-time image enhancement. ACM Transactions on Graphics (TOG)
36(4), 118 (2017)
21. Gidaris, S., Komodakis, N.: Detect, replace, reﬁne: Deep structured prediction for
pixel wise labeling. In: Proc. of the IEEE Conference on Computer Vision and
Pattern Recognition. pp. 5248–5257 (2017)
22. Hamzah, R.A., Ibrahim, H.: Literature survey on stereo vision disparity map al-
gorithms. Journal of Sensors 2016 (2016)
23. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 770–778 (2016)
24. Hinton, G., Srivastava, N., Swersky, K.: Neural networks for machine learning-
lecture 6a-overview of mini-batch gradient descent (2012)
25. Hosni, A., Rhemann, C., Bleyer, M., Rother, C., Gelautz, M.: Fast cost-volume ﬁl-
tering for visual correspondence and beyond. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence 35(2), 504–511 (2013)
26. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0:
Evolution of optical ﬂow estimation with deep networks. In: IEEE Conference on
Computer Vision and Pattern Recognition (CVPR). vol. 2 (2017)
27. Ioﬀe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In: International conference on machine learning.
pp. 448–456 (2015)
28. Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton,
J., Hodges, S., Freeman, D., Davison, A., Fitzgibbon, A.: Kinectfusion: Real-time
3d reconstruction and interaction using a moving depth camera. In: UIST (2011)
29. Kendall, A., Martirosyan, H., Dasgupta, S., Henry, P., Kennedy, R., Bachrach, A.,
Bry, A.: End-to-end learning of geometry and context for deep stereo regression.
CoRR, vol. abs/1703.04309 (2017)
30. Klaus, A., Sormann, M., Karner, K.: Segment-based stereo matching using belief
propagation and a self-adapting dissimilarity measure. In: Pattern Recognition,
2006. ICPR 2006. 18th International Conference on. vol. 3, pp. 15–18. IEEE (2006)
31. Kolmogorov, V., Zabih, R.: Computing visual correspondence with occlusions using
graph cuts. In: Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE
International Conference on. vol. 2, pp. 508–515. IEEE (2001)
32. Kopf, J., Cohen, M.F., Lischinski, D., Uyttendaele, M.: Joint bilateral upsampling.
ACM Transactions on Graphics (ToG) 26(3), 96 (2007)
33. Krähenbühl, P., Koltun, V.: Efficient inference in fully connected crfs with gaussian
edge potentials. In: NIPS (2011)
34. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D.,
Wierstra, D.: Continuous control with deep reinforcement learning. arXiv preprint
35. Luo, W., Schwing, A.G., Urtasun, R.: Eﬃcient deep learning for stereo matching.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition. pp. 5695–5703 (2016)
36. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectiﬁer nonlinearities improve neural net-
work acoustic models. In: Proc. icml. vol. 30, p. 3 (2013)
37. Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox,
T.: A large dataset to train convolutional networks for disparity, optical ﬂow, and
scene ﬂow estimation. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition. pp. 4040–4048 (2016)
38. Menze, M., Geiger, A.: Object scene ﬂow for autonomous vehicles. In: Conference
on Computer Vision and Pattern Recognition (CVPR) (2015)
39. Nehab, D., Rusinkiewicz, S., Davis, J.: Improved sub-pixel stereo correspondences
through symmetric reﬁnement. In: International Conference on Computer Vision
40. Odena, A., Dumoulin, V., Olah, C.: Deconvolution and checkerboard artifacts. Dis-
till (2016). https://doi.org/10.23915/distill.00003, http://distill.pub/2016/deconv-
41. Orts-Escolano, S., Rhemann, C., Fanello, S., Chang, W., Kowdle, A., Degtyarev,
Y., Kim, D., Davidson, P.L., Khamis, S., Dou, M., Tankovich, V., Loop, C., Cai, Q.,
Chou, P.A., Mennicken, S., Valentin, J., Pradeep, V., Wang, S., Kang, S.B., Kohli,
P., Lutchyn, Y., Keskin, C., Izadi, S.: Holoportation: Virtual 3d teleportation in
real-time. In: UIST (2016)
42. Pang, J., Sun, W., Ren, J., Yang, C., Yan, Q.: Cascade residual learning: A two-
stage convolutional neural network for stereo matching. In: International Conf. on
Computer Vision-Workshop on Geometry Meets Deep Learning (ICCVW 2017).
vol. 3 (2017)
43. Papandreou, G., Kokkinos, I., Savalle, P.A.: Modeling local and global deforma-
tions in deep learning: Epitomic convolution, multiple instance learning, and sliding
window detection. In: Computer Vision and Pattern Recognition (CVPR), 2015
IEEE Conference on. pp. 390–399. IEEE (2015)
44. Park, E., Yang, J., Yumer, E., Ceylan, D., Berg, A.C.: Transformation-grounded
image generation network for novel 3d view synthesis. CoRR (2017)
45. Pinggera, P., Pfeiﬀer, D., Franke, U., Mester, R.: Know your limits: Accuracy of
long range stereoscopic object measurements in practice. In: European Conference
on Computer Vision. pp. 96–111. Springer (2014)
46. Pradeep, V., Rhemann, C., Izadi, S., Zach, C., Bleyer, M., Bathiche, S.: Mono-
fusion: Real-time 3d reconstruction of small scenes with a single web camera. In:
47. Ranftl, R., Gehrig, S., Pock, T., Bischof, H.: Pushing the limits of stereo using
variational stereo estimation. In: 2012 IEEE Intelligent Vehicles Symposium (2012)
48. Sanger, T.D.: Stereo disparity computation using gabor ﬁlters. In: Biological Cy-
49. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms. International journal of computer vision 47(1-3), 7–42
50. Schulman, J., Heess, N., Weber, T., Abbeel, P.: Gradient estimation using stochas-
tic computation graphs. In: Advances in Neural Information Processing Systems.
pp. 3528–3536 (2015)
51. Seki, A., Pollefeys, M.: Sgm-nets: Semi-global matching with neural networks. In:
52. Shaked, A., Wolf, L.: Improved stereo matching with constant highway networks
and reﬂective conﬁdence learning. CoRR, vol. abs/1701.00165 (2017)
53. Szeliski, R.: Computer Vision: Algorithms and Applications. Springer-Verlag New
York, Inc., New York, NY, USA, 1st edn. (2010)
54. Tankovich, V., Schoenberg, M., Fanello, S.R., Kowdle, A., Rhemann, C., Dzitsiuk,
M., Schmidt, M., Valentin, J., Izadi, S.: Sos: Stereo matching in o(1) with slanted
support windows. IROS (2018)
55. Taylor, J., Tankovich, V., Tang, D., Keskin, C., Kim, D., Davidson, P., Kowdle,
A., Izadi, S.: Articulated distance ﬁelds for ultra-fast tracking of hands interacting.
Siggraph Asia (2017)
56. Wang, S., Fanello, S.R., Rhemann, C., Izadi, S., Kohli, P.: The global patch collider.
57. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist
reinforcement learning. In: Reinforcement Learning, pp. 5–32. Springer (1992)
58. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R.,
Bengio, Y.: Show, attend and tell: Neural image caption generation with visual
attention. In: International Conference on Machine Learning. pp. 2048–2057 (2015)
59. Yang, Q., Yang, R., Davis, J., Nister, D.: Spatial-depth super resolution for range
images. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition
60. Zagoruyko, S., Komodakis, N.: Learning to compare image patches via convolu-
tional neural networks. In: Computer Vision and Pattern Recognition (CVPR),
2015 IEEE Conference on. pp. 4353–4361. IEEE (2015)
61. Zbontar, J., LeCun, Y.: Computing the stereo matching cost with a convolutional
neural network. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. pp. 1592–1599 (2015)
62. Zbontar, J., LeCun, Y.: Stereo matching by training a convolutional neural network
to compare image patches. Journal of Machine Learning Research 17(1-32), 2
63. Zhang, Y., Khamis, S., Rhemann, C., Valentin, J., Kowdle, A., Tankovich, V.,
Schoenberg, M., Izadi, S., Funkhouser, T., Fanello, S.: Activestereonet: End-to-
end self-supervised learning for active stereo systems. In: ECCV (2018)