Superpixel Segmentation with Fully Convolutional Networks
Fengting Yang Qian Sun
The Pennsylvania State University
fuy34@psu.edu, uestcqs@gmail.com
Hailin Jin
Adobe Research
hljin@adobe.com
Zihan Zhou
The Pennsylvania State University
zzhou@ist.psu.edu
Abstract
In computer vision, superpixels have been widely used as
an effective way to reduce the number of image primitives
for subsequent processing. But only a few attempts have
been made to incorporate them into deep neural networks.
One main reason is that the standard convolution operation
is defined on regular grids and becomes inefficient when
applied to superpixels. Inspired by an initialization strategy
commonly adopted by traditional superpixel algorithms, we
present a novel method that employs a simple fully convo-
lutional network to predict superpixels on a regular image
grid. Experimental results on benchmark datasets show that
our method achieves state-of-the-art superpixel segmenta-
tion performance while running at about 50fps. Based on
the predicted superpixels, we further develop a downsam-
pling/upsampling scheme for deep networks with the goal
of generating high-resolution outputs for dense prediction
tasks. Specifically, we modify a popular network architec-
ture for stereo matching to simultaneously predict super-
pixels and disparities. We show that improved disparity es-
timation accuracy can be obtained on public datasets.
1. Introduction
In recent years, deep neural networks (DNNs) have
achieved great success in a wide range of computer vision
applications. The advance of novel neural architecture de-
sign and training schemes, however, often comes with a greater
demand for computational resources in terms of both mem-
ory and time. Consider the stereo matching task as an ex-
ample. It has been empirically shown that, compared to
traditional 2D convolution, 3D convolution on a 4D vol-
ume (height×width×disparity×feature channels) [17] can
better capture context information and learn representations
for each disparity level, resulting in superior disparity es-
timation results. But due to the extra feature dimension, 3D convolution typically operates at spatial resolutions lower than the original input image size because of time and memory concerns. For example, CSPN [8], the top-1 method on the KITTI 2015 benchmark, conducts 3D convolution at 1/4 of the input size and uses bilinear in-
terpolation to upsample the predicted disparity volume for
final disparity regression. To handle high resolution images
(e.g., 2000 × 3000), HSM [42], the top-1 method on the Middlebury-v3 benchmark, uses a multi-scale approach to compute disparity volumes at 1/8, 1/16, and 1/32 of the in-
put size. Bilinear upsampling is again applied to generate
disparity maps at the full resolution. In both cases, object
boundaries and fine details are often not well preserved in
final disparity maps due to the upsampling operation.
In computer vision, superpixels provide a compact rep-
resentation of image data by grouping perceptually similar
pixels together. As a way to effectively reduce the num-
ber of image primitives for subsequent processing, super-
pixels have been widely adopted in vision problems such as
saliency detection [41], object detection [32], tracking [37],
and semantic segmentation [12]. However, superpixels are yet to be widely adopted in DNNs for dimension reduction. One main reason is that the standard convolution operation in convolutional neural networks (CNNs) is defined on a regular image grid. While a few attempts have been made to modify deep architectures to incorporate superpixels [14, 11, 20, 34], performing convolution over an
irregular superpixel grid remains challenging.
To overcome this difficulty, we propose a deep learning
method to learn superpixels on the regular grid. Our key in-
sight is that it is possible to associate each superpixel with a
regular image grid cell, a strategy commonly used by tradi-
tional superpixel algorithms [22,36,10,1,23,25,2] as an
initialization step (see Figure 2). Consequently, we cast su-
perpixel segmentation as a task that aims to find association
scores between image pixels and regular grid cells, and use
a fully convolutional network (FCN) to directly predict such
scores. Note that recent work [16] also proposes an end-to-
end trainable network for this task, but this method uses a
deep network to extract pixel features, which are then fed to
a soft K-means clustering module to generate superpixels.
The key motivation for us to choose a standard FCN ar-
chitecture is its simplicity as well as its ability to gener-
ate outputs on the regular grid. With the predicted super-
pixels, we further propose a general framework for downsampling/upsampling in DNNs.
Figure 1. An illustration of our superpixel-based downsampling/upsampling scheme for deep networks. In this figure, we choose PSMNet [7] for stereo matching as our task network. The high-res input images are first downsampled using the superpixel association matrix Q predicted by our superpixel segmentation network. To generate a high-res disparity map, we use the same matrix Q to upsample the low-res disparity volume predicted by PSMNet for final disparity regression.
As illustrated in Figure 1,
we replace the conventional operations for downsampling
(e.g., stride-2 convolutions) and upsampling (e.g., bilinear
upsampling) in the task network (PSMNet in the figure)
with a superpixel-based downsampling/upsampling scheme
to effectively preserve object boundaries and fine details.
Further, the resulting network is end-to-end trainable. One
advantage of our joint learning framework is that superpixel
segmentation is now directly influenced by the downstream
task, and that the two tasks can naturally benefit from each
other. In this paper, we take stereo matching as an exam-
ple and show how the popular network PSMNet [7], upon
which many of the newest methods such as CSPN [8] and
HSM [42] are built, can be adapted into our framework.
We have conducted extensive experiments to evaluate the
proposed methods. For superpixel segmentation, experi-
ment results on public benchmarks such as BSDS500 [3]
and NYUv2 [28] demonstrate that our method is competi-
tive with or better than the state-of-the-art w.r.t. a variety
of metrics, and is also fast (running at about 50fps). For
disparity estimation, our method outperforms the original
PSMNet on SceneFlow [27] as well as high-res datasets
HR-VS [42] and Middlebury-v3 [30], verifying the benefit
of incorporating superpixels into downstream vision tasks.
In summary, the main contributions of the paper are: 1.
We propose a simple fully convolutional network for super-
pixel segmentation, which achieves state-of-the-art perfor-
mance on benchmark datasets. 2. We introduce a general
superpixel-based downsampling/upsampling framework for
DNNs. We demonstrate improved accuracy in disparity es-
timation by incorporating superpixels into a popular stereo
matching network. To the best of our knowledge, we are the first to develop a learning-based method that simultaneously performs superpixel segmentation and dense prediction.
2. Related Work
Superpixel segmentation. There is a long line of research
on superpixel segmentation, now a standard tool for many
vision tasks. For a thorough survey on existing methods,
we refer readers to the recent paper [33]. Here we focus on
methods which use a regular grid in the initialization step.
Turbopixels [22] places initial seeds at regular intervals
based on the desired number of superpixels, and grows them
into regions until superpixels are formed. [36] grows the su-
perpixels by clustering pixels using a geodesic distance that
embeds structure and compactness constraints. SEEDS [10]
initializes the superpixels on a grid, and continuously re-
fines the boundaries by exchanging pixels between neigh-
boring superpixels.
The SLIC algorithm [1] employs K-means clustering to group nearby pixels into superpixels based on a 5-dimensional feature of position and CIELAB color. Vari-
ants of SLIC include LSC [23] which maps each pixel into
a 10-dimensional feature space and performs weighted K-
means, Manifold SLIC [25] which maps the image to a 2-
dimensional manifold to produce content-sensitive super-
pixels, and SNIC [2] which replaces the iterative K-means
clustering with a non-iterative region growing scheme.
While the above methods rely on hand-crafted features,
recent work [35] proposes to learn pixel affinity from large
data using DNNs. In [16], the authors propose to learn pixel
features which are then fed to a differentiable K-means clus-
tering module for superpixel segmentation. The resulting
method, SSN, is the first end-to-end trainable network for
superpixel segmentation. Different from these methods, we
train a deep neural network to directly predict the pixel-
superpixel association map.
The use of superpixels in deep neural networks. Several
methods propose to integrate superpixels into deep learning
pipelines. These works typically use pre-computed super-
pixels to manipulate learnt features so that important image
properties (e.g., boundaries) can be better preserved. For
example, [14] uses superpixels to convert 2D image patterns
into 1D sequential representations, which allows a DNN
to efficiently explore long-range context for saliency detec-
tion. [11] introduces a “bilateral inception” module which
can be inserted into existing CNNs and perform bilateral fil-
tering across superpixels, and [20,34] employ superpixels
to pool features for semantic segmentation. Instead, we use
superpixels as an effective way to downsample/upsample.
Further, none of these works has attempted to jointly learn
superpixels with the downstream tasks.
Besides, our method is also similar to the deformable convolutional network (DCN) [9, 47] in that both can realize an adaptive receptive field. However, DCN is mainly designed to better handle geometric transformations and capture contextual information for feature extraction. Thus, unlike superpixels, a deformable convolution layer does not require that every pixel contribute to (and thus be represented by) the output features.
Stereo matching. Superpixel- or segmentation-based approaches to stereo matching were first introduced in [4] and have since been widely used [15, 5, 19, 38, 6, 13]. These methods first segment the images into regions and fit a parametric model (typically a plane) to each region. In [39, 40], Yamaguchi et al. propose an optimization framework that jointly segments the reference image into superpixels and estimates the disparity map. [26] trains a CNN to predict
initial pixel-wise disparities, which are refined using the
slanted-plane MRF model. [21] develops an efficient algo-
rithm which computes photoconsistency for only a random
subset of pixels. Our work is fundamentally different from
these optimization-based methods. Instead of fitting para-
metric models to the superpixels, we use superpixels to de-
velop a new downsampling/upsampling scheme for DNNs.
In the past few years, deep networks [45,31,29,44]
taking advantage of large-scale annotated data have gen-
erated impressive stereo matching results. Recent meth-
ods [17,7,8] employing 3D convolution achieve the state-
of-the-art performance on public benchmarks. However,
due to the memory constraints, these methods typically
compute disparity volumes at a lower resolution. [18] bilinearly upsamples the disparity to the output size and refines it using an edge-preserving refinement network. Re-
cent work [42] has also explored efficient high-res process-
ing, but its focus is on generating coarse-to-fine results to
meet the need for anytime on-demand depth sensing in au-
tonomous driving applications.
3. Superpixel Segmentation Method
In this section, we introduce our CNN-based superpixel
segmentation method. We first present our idea of directly
predicting pixel-superpixel association on a regular grid in
Section 3.1, followed by a description of our network de-
sign and loss functions in Section 3.2. We further draw a
connection between our superpixel learning regime and the
recent convolutional spatial propagation (CSP) network [8]
for learning pixel affinity in Section 3.3. Finally, in Sec-
tion 3.4, we systematically evaluate our method on public
benchmark datasets.
3.1. Learning Superpixels on a Regular Grid
In the literature, a common strategy adopted [22, 36, 10, 1, 23, 25, 2, 16] for superpixel segmentation is to first partition the H×W image using a regular grid of size h×w and consider each grid cell as an initial superpixel (i.e., a "seed"). Then, the final superpixel segmentation is obtained by finding a mapping which assigns each pixel p = (u, v) to one of the seeds s = (i, j). Mathematically, we can write the mapping as g_s(p) = g_{i,j}(u, v) = 1 if the (u, v)-th pixel belongs to the (i, j)-th superpixel, and 0 otherwise.
Figure 2. Illustration of N_p. For each pixel p in the green box, we consider the 9 grid cells in the red box for assignment.
In practice, however, it is unnecessary and computationally expensive to compute g_{i,j}(u, v) for all pixel-superpixel pairs. Instead, for a given pixel p, we constrain the search to the set of surrounding grid cells N_p. This is illustrated in Figure 2: for each pixel p in the green box, we only consider the 9 grid cells in the red box for assignment. Consequently, we can write the mapping as a tensor G ∈ Z^{H×W×|N_p|}, where |N_p| = 9.
While several approaches [22, 36, 10, 1, 23, 25, 2, 16] have been proposed to compute G, we take a different route in this paper. Specifically, we directly learn the mapping using a deep neural network. To make our objective function differentiable, we replace the hard assignment G with a soft association map Q ∈ R^{H×W×|N_p|}. Here, the entry q_s(p) represents the probability that a pixel p is assigned to grid cell s ∈ N_p, such that Σ_{s∈N_p} q_s(p) = 1. Finally, the superpixels are obtained by assigning each pixel to the grid cell with the highest probability: s* = argmax_s q_s(p).
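For concreteness, this hard assignment can be recovered from Q with a few lines of PyTorch. The sketch below is ours, not the released code, and it assumes that the 9 channels of Q index the 3×3 neighborhood of grid cells in row-major order:

```python
import torch

def assign_superpixels(Q, cell_size):
    """Convert a soft association map Q of shape (H, W, 9) into a hard
    superpixel label map of shape (H, W). Channel k is assumed to address
    the neighboring grid cell at row-major offset (k // 3 - 1, k % 3 - 1)."""
    H, W, _ = Q.shape
    h, w = H // cell_size, W // cell_size                  # grid size
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    ci, cj = ys // cell_size, xs // cell_size              # cell containing each pixel

    k = Q.argmax(dim=-1)                                   # most likely of the 9 cells
    di, dj = k // 3 - 1, k % 3 - 1                         # offsets in {-1, 0, 1}
    si = (ci + di).clamp(0, h - 1)                         # clamp at the image border
    sj = (cj + dj).clamp(0, w - 1)
    return si * w + sj                                     # global superpixel index
```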
Although it may seem a strong constraint that a pixel can only be associated with one of the 9 nearby cells, which makes it difficult to generate long or large superpixels, we want to emphasize the importance of compactness. Superpixel segmentation is inherently an over-segmentation method. As one of the main purposes of our superpixels is to perform detail-preserving downsampling/upsampling to assist the downstream network, it is more important to capture spatial coherence in the local region. For information that extends beyond the 9-cell area, it is acceptable to segment it into pieces and leave them for the downstream network to aggregate with convolution operations.
Our method vs. SSN [16]. Recently, [16] proposes SSN,
an end-to-end trainable deep network for superpixel seg-
mentation. Similar to our method, SSN also computes a
soft association map Q. However, unlike our method, SSN
uses the CNN as a means to extract pixel features, which are
then fed to a soft K-means clustering module to compute Q.
We illustrate the algorithmic schemes of the two methods in Figure 3.
Figure 3. Comparison of algorithmic schemes. SSN trains a CNN to extract pixel features, which are fed to an iterative K-means clustering module for superpixel segmentation. We train a CNN to directly generate superpixels by predicting a pixel-superpixel association map.
Figure 4. Our simple encoder-decoder architecture for superpixel segmentation. Please refer to the supplementary materials for detailed specifications.
Both SSN and our method can take advantage
of CNN to learn complex features using task-specific loss
functions. But unlike SSN, we combine feature extraction
and superpixel segmentation into a single step. As a result,
our network runs faster and can be easily integrated into ex-
isting CNN frameworks for downstream tasks (Section 4).
3.2. Network Design and Loss Functions
As shown in Figure 4, we use a standard encoder-
decoder design with skip connections to predict superpixel
association map Q. The encoder takes a color image as
input and produces high-level feature maps via a convo-
lutional network. The decoder then gradually upsamples
the feature maps via deconvolutional layers to make final
prediction, taking into account also the features from corre-
sponding encoder layers. We use leaky ReLU for all layers
except for the prediction layer, where softmax is applied.
Similar to SSN [16], one of the main advantages of
our end-to-end trainable superpixel network is its flexibility
w.r.t. the loss functions. Recall that the idea of superpixels
is to group similar pixels together. For different applica-
tions, one may wish to define similarity in different ways.
Generally, let f(p) be the pixel property we want the superpixels to preserve. Examples of f(p) include a 3-dimensional CIELAB color vector, an N-dimensional one-hot encoding vector of semantic labels (where N is the number of classes), and many others. We further represent a pixel's position by its image coordinates p = [x, y]^T.
Given the predicted association map Q, we can compute the center of any superpixel s, c_s = (u_s, l_s), where u_s is the property vector and l_s is the location vector, as follows:
\[
\mathbf{u}_s = \frac{\sum_{p: s \in N_p} f(p) \cdot q_s(p)}{\sum_{p: s \in N_p} q_s(p)}, \qquad
\mathbf{l}_s = \frac{\sum_{p: s \in N_p} \mathbf{p} \cdot q_s(p)}{\sum_{p: s \in N_p} q_s(p)}. \tag{1}
\]
Here, recall that N_p is the set of surrounding superpixels of p, and q_s(p) is the network-predicted probability of p being associated with superpixel s. In Eq. (1), each sum is taken over all the pixels that could possibly be assigned to s.
Then, the reconstructed property and location of any pixel p are given by:
\[
f'(p) = \sum_{s \in N_p} \mathbf{u}_s \cdot q_s(p), \qquad
\mathbf{p}' = \sum_{s \in N_p} \mathbf{l}_s \cdot q_s(p). \tag{2}
\]
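To make Eqs. (1) and (2) concrete, here is a minimal PyTorch-style sketch of the soft center computation and reconstruction. It is not the authors' released implementation: the channel ordering of Q follows the sketch in Section 3.1, and border pixels are handled by clamping cell indices rather than renormalizing Q, which is a simplification.

```python
import torch

def _cells(H, W, cell_size):
    # index of the grid cell containing each pixel
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    return ys // cell_size, xs // cell_size

def superpixel_centers(f, Q, cell_size):
    """Eq. (1): soft centers. f: (H, W, C) pixel property (e.g., CIELAB color
    or XY coordinates); Q: (H, W, 9) soft association. Returns (h*w, C)."""
    H, W, C = f.shape
    h, w = H // cell_size, W // cell_size
    ci, cj = _cells(H, W, cell_size)
    num = torch.zeros(h * w, C)
    den = torch.zeros(h * w, 1)
    for k in range(9):                                      # 9 surrounding cells
        si = (ci + k // 3 - 1).clamp(0, h - 1)
        sj = (cj + k % 3 - 1).clamp(0, w - 1)
        idx = (si * w + sj).reshape(-1)                     # target cell per pixel
        q = Q[..., k].reshape(-1, 1)
        num.index_add_(0, idx, f.reshape(-1, C) * q)        # sum of f(p) q_s(p)
        den.index_add_(0, idx, q)                           # sum of q_s(p)
    return num / den.clamp_min(1e-8)

def reconstruct(u, Q, cell_size, H, W):
    """Eq. (2): f'(p) = sum over s in N_p of u_s q_s(p). Returns (H, W, C)."""
    h, w = H // cell_size, W // cell_size
    ci, cj = _cells(H, W, cell_size)
    out = torch.zeros(H, W, u.shape[-1])
    for k in range(9):
        si = (ci + k // 3 - 1).clamp(0, h - 1)
        sj = (cj + k % 3 - 1).clamp(0, w - 1)
        out += u[si * w + sj] * Q[..., k:k + 1]
    return out
```

In practice the same sums can be expressed with unfold/pooling operations to avoid the Python loop; the scatter form above is kept for readability.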
Finally, the general formulation of our loss function has two terms. The first term encourages the trained model to group pixels with a similar property of interest, and the second term enforces the superpixels to be spatially compact:
\[
L(Q) = \sum_{p} \mathrm{dist}\big(f(p), f'(p)\big) + \frac{m}{S}\, \|\mathbf{p} - \mathbf{p}'\|_2, \tag{3}
\]
where dist(·, ·) is the task-specific distance metric depending on the pixel property f(p), S is the superpixel sampling interval, and m is a weight balancing the two terms.
In this paper, we consider two different choices of f(p). First, we choose the CIELAB color vector and use the ℓ2 norm as the distance measure. This leads to an objective function similar to the original SLIC method [1]:
\[
L_{SLIC}(Q) = \sum_{p} \|f_{col}(p) - f'_{col}(p)\|_2 + \frac{m}{S}\, \|\mathbf{p} - \mathbf{p}'\|_2. \tag{4}
\]
Second, following [16], we choose the one-hot encoding vector of semantic labels and use the cross-entropy E(·, ·) as the distance measure:
\[
L_{sem}(Q) = \sum_{p} E\big(f_{sem}(p), f'_{sem}(p)\big) + \frac{m}{S}\, \|\mathbf{p} - \mathbf{p}'\|_2. \tag{5}
\]
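Assuming the helper functions sketched after Eq. (2), the SLIC-style loss of Eq. (4) can be written as follows; the function name and argument layout are ours, and a single un-batched image is assumed:

```python
import torch

def slic_loss(img_lab, Q, cell_size, m):
    """SLIC loss of Eq. (4). img_lab: (H, W, 3) CIELAB image; Q: (H, W, 9);
    m weights the compactness term; the sampling interval S is cell_size."""
    H, W, _ = img_lab.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pos = torch.stack([xs, ys], dim=-1)                     # pixel coordinates p

    u_col = superpixel_centers(img_lab, Q, cell_size)       # color part of c_s
    u_pos = superpixel_centers(pos, Q, cell_size)           # location part of c_s
    f_rec = reconstruct(u_col, Q, cell_size, H, W)          # f'(p)
    p_rec = reconstruct(u_pos, Q, cell_size, H, W)          # p'

    color_term = (img_lab - f_rec).norm(dim=-1)             # ||f_col(p) - f'_col(p)||
    pos_term = (pos - p_rec).norm(dim=-1)                   # ||p - p'||
    return (color_term + (m / cell_size) * pos_term).sum()
```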
3.3. Connection to Spatial Propagation Network
Recently, [8] proposes the convolutional spatial propagation (CSP) network, which learns an affinity matrix to propagate information to nearby spatial locations. By integrating the CSP module into existing deep neural networks, [8] has demonstrated improved performance on affinity-based vision tasks such as depth completion and refinement. In this section, we show that the computation of superpixel centers using the learnt association map Q can be written mathematically in the form of CSP, thus drawing a connection between learning Q and learning the affinity matrix as in [8].
Given an input feature volume X ∈ R^{H×W×C}, the convolutional spatial propagation (CSP) with a kernel size K and stride S can be written as:
\[
\mathbf{y}_{i,j} = \sum_{a,b=-K/2+1}^{K/2} \kappa_{i,j}(a, b) \odot \mathbf{x}_{i\cdot S+a,\, j\cdot S+b}, \tag{6}
\]
where Y ∈ R^{h×w×C} is an output volume such that h = H/S and w = W/S, κ_{i,j} is the output from an affinity network such that Σ_{a,b=−K/2+1}^{K/2} κ_{i,j}(a, b) = 1, and ⊙ is the element-wise product.
In the meantime, as illustrated in Figure 2, to compute the superpixel center associated with the (i, j)-th grid cell, we consider all pixels in the surrounding 3S×3S region:
\[
\mathbf{c}_{i,j} = \sum_{a,b=-3S/2+1}^{3S/2} \hat{q}_{i,j}(a, b)\, \mathbf{x}_{i\cdot S+a,\, j\cdot S+b}, \tag{7}
\]
where
\[
\hat{q}_{i,j}(a, b) = \frac{q_{i,j}(u, v)}{\sum_{a,b=-3S/2+1}^{3S/2} q_{i,j}(u, v)}, \tag{8}
\]
and u = i·S + a, v = j·S + b.
Comparing Eq. (6) with Eq. (7), we can see that computing the center of a superpixel of size S×S is equivalent to performing CSP with a 3S×3S kernel derived from Q. Furthermore, both κ_{i,j}(a, b) and q_{i,j}(u, v) represent the learnt weight between the spatial location (u, v) in the input volume and (i, j) in the output volume. In this regard, predicting Q in our work can be viewed as learning an affinity matrix as in [8].
Nevertheless, we point out that, while the techniques pre-
sented in this work and [8] share the same mathematical
form, they are developed for very different purposes. In [8],
Eq. (6) is employed repeatedly (with S = 1) to propagate
information to nearby locations, whereas in this work, we
use Eq. (7) to compute superpixel centers (with S > 1).
3.4. Experiments
We train our model with segmentation labels on the stan-
dard benchmark BSDS500 [3] and compare it with state-of-
the-art superpixel methods. To further evaluate the method's generalizability, we also report its performance without fine-tuning on another benchmark dataset, NYUv2 [28].
All evaluations are conducted using the protocols and codes provided by [33].1 We run LSC [23], ERS [24], SNIC [2], SEAL [35], and SSN [16] with the original implementations from the authors, and run SLIC [1] and ETPS [43] with the codes provided in [33]. For LSC, ERS, SLIC and ETPS, we use the best parameters reported in [33], and for the rest, we use the default parameters recommended by the original authors.
1https://github.com/davidstutz/superpixel-benchmark
Implementation details. Our model is implemented in PyTorch and optimized using Adam with β1 = 0.9 and β2 = 0.999. We use L_sem in Eq. (5) for this experiment, with m = 0.003. During training, we randomly crop the images to size 208 × 208 as input, and perform horizontal/vertical flipping for data augmentation. The initial learning rate is set to 5 × 10^-5, and is reduced by half after 200k iterations. Convergence is reached at about 300k iterations.
For training, we use a grid with cell size 16 × 16, which is equivalent to setting the desired number of superpixels to 169. At test time, to generate a varying number of superpixels, we simply resize the input image to the appropriate size. For example, by resizing the image to 480 × 320, our network will generate about 600 superpixels. Furthermore, for fair comparison, most evaluation protocols expect superpixels to be spatially connected. To enforce that, we apply an off-the-shelf component connection algorithm to our output, which merges superpixels that are smaller than a certain threshold with the surrounding ones.2
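As a concrete example of this test-time resizing, the small helper below (our own, not part of the released code) picks an input size whose 16 × 16 grid yields roughly a desired number of superpixels; asking for 600 superpixels on a 481 × 321 BSDS image returns 480 × 320, matching the example above:

```python
import math

def resize_for_superpixels(H, W, n_spix, cell_size=16):
    """Choose a resized (height, width), both multiples of cell_size, so that
    the regular grid contains roughly n_spix cells while roughly preserving
    the aspect ratio. The rounding scheme is our choice, not the paper's."""
    scale = math.sqrt(n_spix * cell_size ** 2 / (H * W))
    new_h = max(cell_size, round(H * scale / cell_size) * cell_size)
    new_w = max(cell_size, round(W * scale / cell_size) * cell_size)
    return new_h, new_w

# e.g., resize_for_superpixels(481, 321, 600) == (480, 320) -> 30 * 20 = 600 cells
```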
Evaluation metrics. We evaluate the superpixel methods
using the popular metrics including achievable segmenta-
tion accuracy (ASA), boundary recall and precision (BR-
BP), and compactness (CO). ASA quantifies the achiev-
able accuracy for segmentation using the superpixels as pre-
processing step, BR and BP measure the boundary adher-
ence of superpixels given the ground truth, whereas CO as-
sesses the compactness of superpixels. The higher these
scores are, the better the segmentation result is. As in [33],
for BR and BP evaluation, we set the boundary tolerance
as 0.0025 times the image diagonal rounded to the closest
integer. We refer readers to [33] for the precise definitions.
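For reference, a small NumPy sketch of ASA under its common definition (each superpixel is credited with its largest overlap with a ground-truth segment) is shown below. All reported numbers in this paper come from the benchmark code of [33]; the sketch is only illustrative and assumes 0-based integer label maps:

```python
import numpy as np

def asa(sp_labels, gt_labels):
    """Achievable segmentation accuracy: fraction of pixels covered when every
    superpixel is assigned to the ground-truth segment it overlaps most."""
    sp, gt = sp_labels.ravel(), gt_labels.ravel()
    hist = np.zeros((sp.max() + 1, gt.max() + 1), dtype=np.int64)
    np.add.at(hist, (sp, gt), 1)                 # joint superpixel/segment histogram
    return hist.max(axis=1).sum() / sp.size
```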
Results on BSDS500. BSDS500 contains 200 training, 100
validation, and 200 test images. As multiple labels are avail-
able for each image, we follow [16,35] and treat each anno-
tation as an individual sample, which results in 1633 train-
ing/validation samples and 1063 testing samples. We train
our model using both the training and validation samples.
Figure 5 reports the performance of all methods on the BSDS500 test set. Our method outperforms all traditional methods on all evaluation metrics, except SLIC in terms of CO. Compared to the other deep learning-based methods, SEAL and SSN, our method achieves competitive or better results in terms of ASA and BR-BP, and significantly higher scores in terms of CO. Figure 8 further shows example results of different methods. Note that, as discussed in [33], there is a well-known trade-off between boundary adherence and compactness. Although our method does not outperform existing methods on all the metrics, it appears to strike a better balance among them. It is also worth noting that by achieving a higher CO score, our method is able to better capture spatially coherent information and avoids paying too much attention to image details and noise. This characteristic tends to lead to better generalizability, as shown in the NYUv2 experiment results.
2Code and models are available at https://github.com/fuy34/superpixel_fcn.
Figure 5. Superpixel segmentation results on BSDS500. From left to right: ASA, BR-BP, and CO.
Figure 6. Superpixel segmentation results on NYUv2. From left to right: ASA, BR-BP, and CO.
Figure 7. Average runtime of different DL methods w.r.t. the number of superpixels. Note that the y-axis is plotted on a logarithmic scale.
We also compare the runtime difference among deep learning-based (DL) methods. Figure 7 reports the average runtime w.r.t. the number of generated superpixels on an NVIDIA GTX 1080Ti GPU. Our method runs about
3 to 8 times faster than SSN and more than 50 times faster
than SEAL. This is expected as our method uses a simple
encoder-decoder network to directly generate superpixels,
whereas SEAL and SSN first use deep networks to predict
pixel affinity or features, and then apply traditional cluster-
ing methods (i.e., graph cuts or K-means) to get superpixels.
Results on NYUv2. NYUv2 is an RGB-D dataset originally
proposed for indoor scene understanding tasks, which con-
tains 1,449 images with object instance labels. By remov-
ing the unlabelled regions near the image boundary, [33]
has developed a benchmark on a subset of 400 test images
with size 608 ×448 for superpixel evaluation. To test the
generalizability of the learning-based methods, we directly
apply the models of SEAL, SSN, and our method trained on
BSDS500 to this dataset without any fine-tuning.
Figure 6 shows the performance of all methods on NYUv2. Generally, all deep learning-based methods perform well, as they continue to achieve competitive or better performance than the traditional methods. Further, our method is shown to generalize better than SEAL and SSN, as is evident by comparing the corresponding curves in Figures 5 and 6. Specifically, our method outperforms SEAL and SSN in terms of BR-BP and CO, and is one of the best in terms of ASA. Visual results are shown in Figure 8.
4. Application to Stereo Matching
Stereo matching is a classic computer vision task which
aims to find pixel correspondences between a pair of recti-
fied images. Recent literature has shown that deep networks can boost the matching accuracy by building a 4D cost volume (height×width×disparity×feature channels) and aggregating the information using 3D convolution [7, 8, 46]. However, such a design consumes large amounts of memory because of the extra "disparity" dimension, limiting the ability to generate high-res outputs. A common remedy is to bilinearly upsample the predicted low-res disparity volumes for final disparity regression. As a result, object boundaries often become blurred and fine details get lost.
In this section, we propose a downsampling/upsampling scheme based on the predicted superpixels and show how to integrate it into existing stereo matching pipelines to generate high-res outputs that better preserve the object boundaries and fine details.
Figure 8. Example superpixel segmentation results. Compared to SEAL and SSN, our method is competitive or better in terms of object boundary adherence while generating more compact superpixels. Top rows: BSDS500. Bottom rows: NYUv2.
4.1. Network Design and Loss Function
Figure 1provides an overview of our method design. We
choose PSMNet [7] as our task network. In order to in-
corporate our new downsampling/upsampling scheme, we
change all the stride-2 convolutions in its feature extractor
to stride-1, and remove the bilinear upsampling operations
in the spatial dimensions. Given a pair of input images, we use our superpixel network to predict association maps Q_l and Q_r and compute the superpixel center maps using Eq. (1). The center maps (i.e., downsampled images) are then fed into the modified PSMNet to get the low-res disparity volume. Next, the low-res volume is upsampled to the original resolution with Q_l according to Eq. (2), and the final disparity is computed using disparity regression. We refer readers to the supplementary materials for detailed specifications.
Same as PSMNet [7], we use the 3-stage smooth L1 loss with the weights α1 = 0.5, α2 = 0.7, and α3 = 1.0 for disparity prediction, and we use the SLIC loss (Eq. (4)) for superpixel segmentation. The final loss function is:
\[
L = \sum_{s=1}^{3} \alpha_s \frac{1}{N} \sum_{p=1}^{N} \mathrm{smooth}_{L_1}\!\big(d_p - \hat{d}_p\big) + \frac{\lambda}{N} L_{SLIC}(Q), \tag{9}
\]
where N is the total number of pixels and λ is a weight to balance the two terms. We set λ = 0.1 for all experiments.
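For completeness, the joint objective of Eq. (9) can be sketched as below, reusing the slic_loss sketch from Section 3.2. The function and argument names are ours, a single un-batched image is assumed, and the three stage outputs of the modified PSMNet are passed as a list:

```python
import torch.nn.functional as F

def joint_loss(disp_preds, disp_gt, img_lab, Q_l, cell_size, lam=0.1, m=30):
    """Eq. (9): 3-stage smooth-L1 disparity loss plus the SLIC loss on the
    association map Q_l of the left image."""
    alphas = [0.5, 0.7, 1.0]                                # stage weights from the paper
    N = disp_gt.numel()                                     # total number of pixels
    disp_term = sum(a * F.smooth_l1_loss(d, disp_gt, reduction="mean")
                    for a, d in zip(alphas, disp_preds))
    return disp_term + lam * slic_loss(img_lab, Q_l, cell_size, m) / N
```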
4.2. Experiments
We have conducted experiments on three public datasets, SceneFlow [27], HR-VS [42], and Middlebury-v3 [30], to compare our model with PSMNet. To further verify the benefit of jointly learning superpixels and disparity estimation, we train two different models for our method. In the first model, Ours fixed, we fix the parameters of the superpixel network and train the rest of the network (i.e., the modified PSMNet) for disparity estimation. In the second
model Ours joint, we jointly train all networks in Figure 1.
For both models, the superpixel network is pre-trained on
SceneFlow using the SLIC loss. The experiments are con-
ducted on 4 Nvidia TITAN Xp GPUs.
Results on SceneFlow. SceneFlow is a synthetic dataset containing 35,454 training and 4,370 test frames with dense ground truth disparity. Following [7], we exclude pixels with disparities greater than 192 at training and test time.
During training, we set m = 30 in the SLIC loss and randomly crop the input images to size 512 × 256. To conduct 3D convolution at 1/4 of the input resolution as PSMNet does, we predict superpixels with grid cell size 4 × 4 to perform the 4× downsampling/upsampling. We train the model for 13 epochs with batch size 8. The initial learning rate is 1 × 10^-3, and is reduced to 5 × 10^-4 and 1 × 10^-4 after 11 and 12 epochs, respectively. For PSMNet, we use the authors' implementation and train it with the same learning schedule as our methods.
We use the standard end-point-error (EPE) as the evalua-
tion metric, which measures the mean pixel-wise Euclidean
distance between the predicted disparity and the ground
truth. As shown in Table 1, Ours joint achieves the low-
est EPE. Also note that Ours fixed performs worse than
the original PSMNet, which demonstrates the importance
of joint training. Qualitative results are shown in Figure 9.
One can see that both Ours fixed and Ours joint preserve
fine details better than the original PSMNet.
Figure 9. Qualitative results on SceneFlow and HR-VS. Our method is able to better preserve fine details, such as the wires and mirror frameworks in the highlighted regions. Top rows: SceneFlow. Bottom rows: HR-VS.

Table 1. End-point-error (EPE) on SceneFlow and HR-VS.
Dataset      PSMNet [7]   Ours fixed   Ours joint
SceneFlow    1.04         1.07         0.93
HR-VS        3.83         3.70         2.77

Results on HR-VS. HR-VS is a synthetic dataset with urban driving views. It contains 780 images at 2056 × 2464 resolution. The valid disparity range is [9.66, 768]. Because no test set is released, we randomly choose 680 frames for training, and use the rest for testing. Due to the relatively
small data size, we fine-tune all three models trained on
SceneFlow in the previous experiment on this dataset.
Because of the high resolution and large disparity, the original PSMNet cannot be directly applied to the full-size images. We follow the common practice of downsampling both the input images and disparity maps to 1/4 size for training, and upsampling the result to full resolution for evaluation. For our method, we predict superpixels with grid cell size 16 × 16 to perform 16× downsampling/upsampling. During training, we set m = 30, and randomly crop the images to size 2048 × 1024. We train all methods for 200 epochs with batch size 4. The initial learning rate is 1 × 10^-3 and is reduced to 1 × 10^-4 after 150 epochs.
As shown in Table 1, our models outperform the original PSMNet, and a significantly lower EPE is achieved by joint training. Note that, compared to SceneFlow, we observe a larger performance gain on this high-res dataset, as we perform 16× upsampling on HR-VS but only 4× upsampling on SceneFlow. Qualitative results are shown in Figure 9.
Results on Middlebury-v3. Middlebury-v3 is a high-res real-world dataset with 10 training frames, 13 validation frames3, and 15 test frames. We use both the training and validation frames to fine-tune the Ours joint model pre-trained on SceneFlow with 16 × 16 superpixels. We set m = 60 and train the model for 30 epochs with batch size 4. The initial learning rate is 1 × 10^-3 and is divided by 10 after 20 epochs.
3Named the "additional dataset" on the official website.
Note that our goal in this experiment is not to achieve the highest rank on the official Middlebury-v3 leaderboard, but rather to verify the effectiveness of the proposed superpixel-based downsampling/upsampling scheme. Based on the leaderboard, our model outperforms PSMNet across all metrics, some of which are presented in Table 2. The results again verify the benefit of the proposed superpixel-based downsampling/upsampling scheme.
Table 2. Results on Middlebury-v3 benchmark.
Method avgerr rms bad-4.0 A90
PSMNet ROB [7] 8.78 23.3 29.2 22.8
Ours joint 7.11 19.1 27.5 13.8
5. Conclusion
This paper has presented a simple fully convolutional
network for superpixel segmentation. Experiments on
benchmark datasets show that the proposed model is com-
putationally efficient, and can consistently achieve the state-
of-the-art performance with good generalizability. Further,
we have demonstrated that higher disparity estimation ac-
curacy can be obtained by using superpixels to preserve ob-
ject boundaries and fine details in a popular stereo match-
ing network. In the future, we plan to apply the pro-
posed superpixel-based downsampling/upsampling scheme
to other dense prediction tasks, such as object segmentation
and optical flow estimation, and explore different ways to
use superpixels in these applications.
Acknowledgement. This work is supported in part by NSF
award #1815491 and a gift from Adobe.
References
[1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurélien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell., 34(11):2274–2282, 2012. 1, 2, 3, 4, 5
[2] Radhakrishna Achanta and Sabine Süsstrunk. Superpixels and polygons using simple non-iterative clustering. In CVPR, pages 4895–4904, 2017. 1, 2, 3, 5
[3] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Ji-
tendra Malik. Contour detection and hierarchical image
segmentation. IEEE Trans. Pattern Anal. Mach. Intell.,
33(5):898–916, 2010. 2,5
[4] Stan Birchfield and Carlo Tomasi. Multiway cut for stereo
and motion with slanted surfaces. In ICCV, pages 489–495,
1999. 3
[5] Michael Bleyer and Margrit Gelautz. A layered stereo al-
gorithm using image segmentation and global visibility con-
straints. In ICIP, pages 2997–3000, 2004. 3
[6] Michael Bleyer, Carsten Rother, and Pushmeet Kohli. Sur-
face stereo with soft segmentation. In CVPR, pages 1570–
1577, 2010. 3
[7] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo
matching network. In CVPR, pages 5410–5418, 2018. 2,
3,6,7,8
[8] Xinjing Cheng, Peng Wang, and Ruigang Yang. Learn-
ing depth with convolutional spatial propagation network.
CoRR, abs/1810.02695, 2018. 1,2,3,4,5,6
[9] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong
Zhang, Han Hu, and Yichen Wei. Deformable convolutional
networks. In ICCV, pages 764–773, 2017. 2
[10] Michael Van den Bergh, Xavier Boix, Gemma Roig, and
Luc J. Van Gool. SEEDS: superpixels extracted via energy-
driven sampling. International Journal of Computer Vision,
111(3):298–314, 2015. 1,2,3
[11] Raghudeep Gadde, Varun Jampani, Martin Kiefel, Daniel
Kappler, and Peter V. Gehler. Superpixel convolutional net-
works using bilateral inceptions. In ECCV, pages 597–613,
2016. 1,2
[12] Stephen Gould, Jim Rodgers, David Cohen, Gal Elidan,
and Daphne Koller. Multi-class segmentation with relative
location prior. International Journal of Computer Vision,
80(3):300–316, 2008. 1
[13] Fatma Güney and Andreas Geiger. Displets: Resolving stereo ambiguities using object knowledge. In CVPR, pages 4165–4175, 2015. 3
[14] Shengfeng He, Rynson W. H. Lau, Wenxi Liu, Zhe Huang,
and Qingxiong Yang. Supercnn: A superpixelwise convolu-
tional neural network for salient object detection. Interna-
tional Journal of Computer Vision, 115(3):330–344, 2015.
1,2
[15] Li Hong and George Chen. Segment-based stereo matching
using graph cuts. In CVPR, pages 74–81, 2004. 3
[16] Varun Jampani, Deqing Sun, Ming-Yu Liu, Ming-Hsuan
Yang, and Jan Kautz. Superpixel sampling networks. In
ECCV, pages 363–380, 2018. 1,2,3,4,5
[17] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, and
Peter Henry. End-to-end learning of geometry and context
for deep stereo regression. In ICCV, pages 66–75, 2017. 1,
3
[18] Sameh Khamis, Sean Ryan Fanello, Christoph Rhemann,
Adarsh Kowdle, Julien P. C. Valentin, and Shahram Izadi.
Stereonet: Guided hierarchical refinement for real-time
edge-aware depth prediction. In ECCV, pages 596–613,
2018. 3
[19] Andreas Klaus, Mario Sormann, and Konrad F. Karner.
Segment-based stereo matching using belief propagation and
a self-adapting dissimilarity measure. In ICPR, pages 15–18,
2006. 3
[20] Suha Kwak, Seunghoon Hong, and Bohyung Han. Weakly
supervised semantic segmentation using superpixel pooling
network. In AAAI, pages 4111–4117, 2017. 1,2
[21] Chloe LeGendre, Konstantinos Batsos, and Philippos Mor-
dohai. High-resolution stereo matching based on sampled
photoconsistency computation. In BMVC, 2017. 3
[22] Alex Levinshtein, Adrian Stere, Kiriakos N. Kutulakos,
David J. Fleet, Sven J. Dickinson, and Kaleem Siddiqi. Tur-
bopixels: Fast superpixels using geometric flows. IEEE
Trans. Pattern Anal. Mach. Intell., 31(12):2290–2297, 2009.
1,2,3
[23] Zhengqin Li and Jiansheng Chen. Superpixel segmentation
using linear spectral clustering. In CVPR, pages 1356–1363,
2015. 1,2,3,5
[24] Ming-Yu Liu, Oncel Tuzel, Srikumar Ramalingam, and
Rama Chellappa. Entropy rate superpixel segmentation. In
CVPR, pages 2097–2104. IEEE, 2011. 5
[25] Yong-Jin Liu, Cheng-Chi Yu, Minjing Yu, and Ying He.
Manifold SLIC: A fast method to compute content-sensitive
superpixels. In CVPR, pages 651–659, 2016. 1,2,3
[26] Wenjie Luo, Alexander G. Schwing, and Raquel Urtasun. Ef-
ficient deep learning for stereo matching. In CVPR, pages
5695–5703, 2016. 3
[27] Nikolaus Mayer, Eddy Ilg, Philip Häusser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, pages 4040–4048, 2016. 2, 7
[28] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012. 2, 5
[29] Jiahao Pang, Wenxiu Sun, Jimmy S. J. Ren, Chengxi Yang,
and Qiong Yan. Cascade residual learning: A two-stage
convolutional neural network for stereo matching. In ICCV
Workshops, pages 878–886, 2017. 3
[30] Daniel Scharstein, Heiko Hirschmüller, York Kitajima, Greg Krathwohl, Nera Nešić, Xi Wang, and Porter Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In German Conference on Pattern Recognition, pages 31–42. Springer, 2014. 2, 7
[31] Amit Shaked and Lior Wolf. Improved stereo matching with
constant highway networks and reflective confidence learn-
ing. In CVPR, pages 6901–6910, 2017. 3
[32] Guang Shu, Afshin Dehghan, and Mubarak Shah. Improving
an object detector and extracting regions using superpixels.
In CVPR, pages 3721–3727, 2013. 1
[33] David Stutz, Alexander Hermans, and Bastian Leibe. Su-
perpixels: An evaluation of the state-of-the-art. Computer
Vision and Image Understanding, 166:1–27, 2018. 2,5,6
[34] Teppei Suzuki, Shuichi Akizuki, Naoki Kato, and Yoshim-
itsu Aoki. Superpixel convolution for segmentation. In ICIP,
pages 3249–3253, 2018. 1,2
[35] Wei-Chih Tu, Ming-Yu Liu, Varun Jampani, Deqing Sun,
Shao-Yi Chien, Ming-Hsuan Yang, and Jan Kautz. Learning
superpixels with segmentation-aware affinity loss. In CVPR,
pages 568–576, 2018. 2,5
[36] Peng Wang, Gang Zeng, Rui Gan, Jingdong Wang, and
Hongbin Zha. Structure-sensitive superpixels via geodesic
distance. International Journal of Computer Vision,
103(1):1–21, 2013. 1,2,3
[37] Shu Wang, Huchuan Lu, Fan Yang, and Ming-Hsuan Yang.
Superpixel tracking. In ICCV, pages 1323–1330, 2011. 1
[38] Zeng-Fu Wang and Zhi-Gang Zheng. A region based stereo
matching algorithm using cooperative optimization. In
CVPR, 2008. 3
[39] Koichiro Yamaguchi, Tamir Hazan, David A. McAllester,
and Raquel Urtasun. Continuous markov random fields for
robust stereo estimation. In ECCV, pages 45–58, 2012. 3
[40] Koichiro Yamaguchi, David A. McAllester, and Raquel Ur-
tasun. Efficient joint segmentation, occlusion labeling, stereo
and flow estimation. In ECCV, pages 756–771, 2014. 3
[41] Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and
Ming-Hsuan Yang. Saliency detection via graph-based man-
ifold ranking. In CVPR, pages 3166–3173, 2013. 1
[42] Gengshan Yang, Joshua Manela, Michael Happold, and
Deva Ramanan. Hierarchical deep stereo matching on high-
resolution images. In CVPR, pages 5515–5524, 2019. 1,2,
3,7
[43] Jian Yao, Marko Boben, Sanja Fidler, and Raquel Urtasun.
Real-time coarse-to-fine topologically preserving segmenta-
tion. In CVPR, pages 2947–2955, 2015. 5
[44] Lidong Yu, Yucheng Wang, Yuwei Wu, and Yunde Jia.
Deep stereo matching with explicit cost aggregation sub-
architecture. In AAAI, pages 7517–7524, 2018. 3
[45] Jure Zbontar and Yann LeCun. Stereo matching by training
a convolutional neural network to compare image patches.
Journal of Machine Learning Research, 17:65:1–65:32,
2016. 3
[46] Feihu Zhang, Victor Prisacariu, Ruigang Yang, and
Philip HS Torr. Ga-net: Guided aggregation net for end-
to-end stereo matching. In CVPR, pages 185–194, 2019. 6
[47] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. De-
formable convnets V2: more deformable, better results. In
CVPR, pages 9308–9316. Computer Vision Foundation /
IEEE, 2019. 2
Supplementary Materials for
Superpixel Segmentation with Fully Convolutional Networks
Fengting Yang Qian Sun
The Pennsylvania State University
fuy34@psu.edu, uestcqs@gmail.com
Hailin Jin
Adobe Research
hljin@adobe.com
Zihan Zhou
The Pennsylvania State University
zzhou@ist.psu.edu
In Section 1 and Section 2, we provide the detailed architecture designs for the superpixel segmentation network and the stereo matching network, respectively. In Section 3, we report additional qualitative results for superpixel segmentation on BSDS500 and NYUv2, disparity estimation on SceneFlow, HR-VS, and Middlebury-v3, and superpixel segmentation on HR-VS.
1. Superpixel Segmentation Network
Table 1 shows the specific design of our superpixel segmentation network. We use a standard encoder-decoder design with skip connections to predict the superpixel association map Q. Batch normalization and leaky ReLU with negative slope 0.1 are used for all the convolution layers, except for the association prediction layer (assoc), where softmax is applied.
Table 1. Specification of our superpixel segmentation network architecture.

Name     Kernel   Str.   Ch I/O     InpRes     OutRes     Input
cnv0a    3×3      1      3/16       208×208    208×208    image
cnv0b    3×3      1      16/16      208×208    208×208    cnv0a
cnv1a    3×3      2      16/32      208×208    104×104    cnv0b
cnv1b    3×3      1      32/32      104×104    104×104    cnv1a
cnv2a    3×3      2      32/64      104×104    52×52      cnv1b
cnv2b    3×3      1      64/64      52×52      52×52      cnv2a
cnv3a    3×3      2      64/128     52×52      26×26      cnv2b
cnv3b    3×3      1      128/128    26×26      26×26      cnv3a
cnv4a    3×3      2      128/256    26×26      13×13      cnv3b
cnv4b    3×3      1      256/256    13×13      13×13      cnv4a
upcnv3   4×4      2      256/128    13×13      26×26      cnv4b
icnv3    3×3      1      256/128    26×26      26×26      upcnv3+cnv3b
upcnv2   4×4      2      128/64     26×26      52×52      icnv3
icnv2    3×3      1      128/64     52×52      52×52      upcnv2+cnv2b
upcnv1   4×4      2      64/32      52×52      104×104    icnv2
icnv1    3×3      1      64/32      104×104    104×104    upcnv1+cnv1b
upcnv0   4×4      2      32/16      104×104    208×208    icnv1
icnv0    3×3      1      32/16      208×208    208×208    upcnv0+cnv0b
assoc    3×3      1      16/9       208×208    208×208    icnv0
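For readers who prefer code, the encoder-decoder of Table 1 can be sketched in PyTorch as follows. This is our re-implementation from the table, not the released model; padding choices and the grouping into blocks are our assumptions, and the output layout is (B, 9, H, W):

```python
import torch
import torch.nn as nn

def conv(in_c, out_c, stride=1):
    # 3x3 convolution + batch norm + leaky ReLU (negative slope 0.1)
    return nn.Sequential(nn.Conv2d(in_c, out_c, 3, stride, padding=1),
                         nn.BatchNorm2d(out_c), nn.LeakyReLU(0.1))

def upconv(in_c, out_c):
    # 4x4 stride-2 deconvolution + batch norm + leaky ReLU
    return nn.Sequential(nn.ConvTranspose2d(in_c, out_c, 4, 2, padding=1),
                         nn.BatchNorm2d(out_c), nn.LeakyReLU(0.1))

class SpixelFCN(nn.Module):
    def __init__(self):
        super().__init__()
        chs = [16, 32, 64, 128, 256]
        self.enc = nn.ModuleList()
        in_c = 3
        for i, c in enumerate(chs):                       # cnv{i}a + cnv{i}b
            self.enc.append(nn.Sequential(conv(in_c, c, 1 if i == 0 else 2),
                                          conv(c, c)))
            in_c = c
        self.dec = nn.ModuleList()
        for c in chs[-2::-1]:                             # 128, 64, 32, 16
            self.dec.append(nn.ModuleDict({"up": upconv(in_c, c),      # upcnv*
                                           "icnv": conv(2 * c, c)}))   # icnv*
            in_c = c
        self.assoc = nn.Sequential(nn.Conv2d(16, 9, 3, padding=1),
                                   nn.Softmax(dim=1))     # assoc layer, no BN/ReLU

    def forward(self, x):                                 # x: (B, 3, H, W)
        skips = []
        for blk in self.enc:
            x = blk(x)
            skips.append(x)
        for d, skip in zip(self.dec, skips[-2::-1]):      # skip connections
            x = d["icnv"](torch.cat([d["up"](x), skip], dim=1))
        return self.assoc(x)                              # (B, 9, H, W)
```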
2. Stereo Matching Network
Table 2 shows the architecture design of our stereo matching network, in which we modify PSMNet [1] to perform superpixel-based downsampling/upsampling operations. We name it superpixel-based PSMNet (SPPSMNet).
Table 2. Specification of our stereo matching network (SPPSMNet) architecture.

Name        Kernel                              Str.   Input       OutDim
Input
Img 1/2     -                                   -      -           H×W×3
Superpixel segmentation and superpixel-based downsampling
assoc 1/2   see Table 1                         -      Img 1/2     H×W×9
sImg 1/2    assoc 1/2                           4      Img 1/2     1/4H × 1/4W × 3
PSMNet feature extractor
cnv0_1      3×3, 32                             1      sImg 1/2    1/4H × 1/4W × 32
cnv0_2      3×3, 32                             1      cnv0_1      1/4H × 1/4W × 32
cnv0_3      3×3, 32                             1      cnv0_2      1/4H × 1/4W × 32
cnv1_x      [3×3, 32; 3×3, 32] ×3               1      cnv0_3      1/4H × 1/4W × 32
conv2_x     [3×3, 64; 3×3, 64] ×16              1      cnv1_x      1/4H × 1/4W × 64
cnv3_x      [3×3, 128; 3×3, 128] ×3             1      cnv2_x      1/4H × 1/4W × 128
cnv4_x      [3×3, 128; 3×3, 128] ×3, dila = 2   1      cnv3_x      1/4H × 1/4W × 128
PSMNet SPP module, cost volume, and 3D CNN
output_1    please refer to [1] for details     -      -           1/4H × 1/4W × 1/4D × 1
output_2                                                           1/4H × 1/4W × 1/4D × 1
output_3                                                           1/4H × 1/4W × 1/4D × 1
Superpixel-based upsampling
disp_prb1   bilinear upsampling                 N.A.   output_1    1/4H × 1/4W × D
            assoc 1                             4                  H×W×D
disp_prb2   bilinear upsampling                 N.A.   output_2    1/4H × 1/4W × D
            assoc 1                             4                  H×W×D
disp_prb3   bilinear upsampling                 N.A.   output_3    1/4H × 1/4W × D
            assoc 1                             4                  H×W×D
PSMNet disparity regression
disp_1      disparity regression                N.A.   disp_prb1   H×W
disp_2      disparity regression                N.A.   disp_prb2   H×W
disp_3      disparity regression                N.A.   disp_prb3   H×W
The layers which are different from the original PSMNet have been highlighted in bold face. In Table 2, we use input image size 256 × 512 with maximum disparity D = 192, which is the same as the original PSMNet, and we set the superpixel grid cell size to 4 × 4 to perform 4× downsampling/upsampling.
For stereo matching tasks with high-resolution images (i.e., HR-VS and Middlebury-v3), we use input image size 1024 × 2048 with maximum disparity D = 768, and we set the superpixel grid cell size to 16 × 16 to perform 16× downsampling/upsampling. To further reduce the GPU memory usage in the high-res stereo matching tasks, we reduce the channel number of the layers "cnv4a" and "cnv4b" in the superpixel segmentation network from 256 to 128, remove the batch normalization operation in the superpixel segmentation network, and perform the superpixel-based spatial upsampling after the disparity regression.
3. Additional Qualitative Results
3.1. Superpixel Segmentation
Figure 1 and Figure 2 show additional qualitative results for superpixel segmentation on BSDS500 and NYUv2. The three learning-based methods, SEAL, SSN, and ours, can recover more detailed boundaries than SLIC, such as the hub of the windmill in the second row of Figure 1 and the pillow on the right bed in the fourth row of Figure 2. Compared to SEAL and SSN, our method usually generates more compact superpixels.
3.2. Application to Stereo Matching
Figure 3, Figure 4, and Figure 6 show the disparity prediction results on SceneFlow, HR-VS, and Middlebury-v3, respectively. Compared to PSMNet, our methods are able to better preserve fine details, such as the headset wire (the seventh row of Figure 3), the street lamp post (the first row of Figure 4), and the leaves (the fifth row of Figure 6). We also observe that our method can better handle textureless areas, such as the car back in the seventh row of Figure 4. This is probably because our method directly downsamples the images by a factor of 16 before sending them to the modified PSMNet, while the original PSMNet only downsamples the image by a factor of 4 and uses stride-2 convolutions to perform another 4× downsampling later. The input receptive field (w.r.t. the original image) of our method is actually larger than that of the original PSMNet, which enables our method to better leverage context information around the textureless area.
Figure 5 visualizes the superpixel segmentation results of the Ours fixed and Ours joint methods on the HR-VS dataset. In general, the superpixels generated by Ours joint are more compact and pay more attention to the disparity boundaries. The color boundaries that are not aligned with the disparity boundaries, such as the water pit on the road in the second row of Figure 5, are often ignored by Ours joint. This phenomenon reflects the influence of disparity estimation on the superpixels in the joint training.
References
[1] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo match-
ing network. In CVPR, pages 5410–5418, 2018. 1
Figure 1. Additional superpixel segmentation results on BSDS500.
Figure 2. Additional superpixel segmentation results on NYUv2.
Figure 3. Disparity prediction results on SceneFlow. For each method, we show both the predicted disparity map (top) and the error map
(bottom). For the error map, the darker the color, the lower the end point error (EPE).
Figure 4. Disparity prediction results on HR-VS. For each method, we show both the predicted disparity map (top) and the error map
(bottom). For the error map, the darker the color, the lower the end point error (EPE).
Figure 5. Comparison of superpixel segmentation results on HR-VS. Note that we do not enforce the superpixel connectivity here.
Figure 6. Disparity estimation results on Middlebury-v3. For each method, we show both the predicted disparity map (top) and the error
map (bottom). For the error map, the darker the color, the lower the error. All the images are from Middlebury-v3 leaderboard.