
Superpixel Segmentation with Fully Convolutional Networks

Fengting Yang, Qian Sun
The Pennsylvania State University
fuy34@psu.edu, uestcqs@gmail.com

Hailin Jin
Adobe Research
hljin@adobe.com

Zihan Zhou
The Pennsylvania State University
zzhou@ist.psu.edu

Abstract

In computer vision, superpixels have been widely used as an effective way to reduce the number of image primitives for subsequent processing. But only a few attempts have been made to incorporate them into deep neural networks. One main reason is that the standard convolution operation is defined on regular grids and becomes inefficient when applied to superpixels. Inspired by an initialization strategy commonly adopted by traditional superpixel algorithms, we present a novel method that employs a simple fully convolutional network to predict superpixels on a regular image grid. Experimental results on benchmark datasets show that our method achieves state-of-the-art superpixel segmentation performance while running at about 50 fps. Based on the predicted superpixels, we further develop a downsampling/upsampling scheme for deep networks with the goal of generating high-resolution outputs for dense prediction tasks. Specifically, we modify a popular network architecture for stereo matching to simultaneously predict superpixels and disparities. We show that improved disparity estimation accuracy can be obtained on public datasets.

1. Introduction

In recent years, deep neural networks (DNNs) have achieved great success in a wide range of computer vision applications. The advance of novel neural architecture designs and training schemes, however, often comes with a greater demand for computational resources in terms of both memory and time. Consider the stereo matching task as an example. It has been empirically shown that, compared to traditional 2D convolution, 3D convolution on a 4D volume (height × width × disparity × feature channels) [17] can better capture context information and learn representations for each disparity level, resulting in superior disparity estimation results. But due to the extra feature dimension, 3D convolution typically operates at spatial resolutions lower than the original input image size because of time and memory concerns. For example, CSPN [8], the top-1 method on the KITTI 2015 benchmark, conducts 3D convolution at 1/4 of the input size and uses bilinear interpolation to upsample the predicted disparity volume for final disparity regression. To handle high-resolution images (e.g., 2000×3000), HSM [42], the top-1 method on the Middlebury-v3 benchmark, uses a multi-scale approach to compute disparity volumes at 1/8, 1/16, and 1/32 of the input size. Bilinear upsampling is again applied to generate disparity maps at the full resolution. In both cases, object boundaries and fine details are often not well preserved in the final disparity maps due to the upsampling operation.

In computer vision, superpixels provide a compact representation of image data by grouping perceptually similar pixels together. As a way to effectively reduce the number of image primitives for subsequent processing, superpixels have been widely adopted in vision problems such as saliency detection [41], object detection [32], tracking [37], and semantic segmentation [12]. However, superpixels are yet to be widely adopted in DNNs for dimension reduction. One main reason is that the standard convolution operation in convolutional neural networks (CNNs) is defined on a regular image grid. While a few attempts have been made to modify deep architectures to incorporate superpixels [14, 11, 20, 34], performing convolution over an irregular superpixel grid remains challenging.

To overcome this difficulty, we propose a deep learning method to learn superpixels on the regular grid. Our key insight is that it is possible to associate each superpixel with a regular image grid cell, a strategy commonly used by traditional superpixel algorithms [22, 36, 10, 1, 23, 25, 2] as an initialization step (see Figure 2). Consequently, we cast superpixel segmentation as a task that aims to find association scores between image pixels and regular grid cells, and use a fully convolutional network (FCN) to directly predict such scores. Note that recent work [16] also proposes an end-to-end trainable network for this task, but that method uses a deep network to extract pixel features, which are then fed to a soft K-means clustering module to generate superpixels.

The key motivation for us to choose a standard FCN architecture is its simplicity as well as its ability to generate outputs on the regular grid. With the predicted superpixels, we further propose a general framework for downsampling/upsampling in DNNs. As illustrated in Figure 1, we replace the conventional operations for downsampling (e.g., stride-2 convolutions) and upsampling (e.g., bilinear upsampling) in the task network (PSMNet in the figure) with a superpixel-based downsampling/upsampling scheme to effectively preserve object boundaries and fine details. Further, the resulting network is end-to-end trainable. One advantage of our joint learning framework is that superpixel segmentation is now directly influenced by the downstream task, and that the two tasks can naturally benefit from each other. In this paper, we take stereo matching as an example and show how the popular network PSMNet [7], upon which many of the newest methods such as CSPN [8] and HSM [42] are built, can be adapted into our framework.

Figure 1. An illustration of our superpixel-based downsampling/upsampling scheme for deep networks. In this figure, we choose PSMNet [7] for stereo matching as our task network. The high-res input images are first downsampled using the superpixel association matrix Q predicted by our superpixel segmentation network. To generate a high-res disparity map, we use the same matrix Q to upsample the low-res disparity volume predicted by PSMNet for final disparity regression.

We have conducted extensive experiments to evaluate the proposed methods. For superpixel segmentation, experimental results on public benchmarks such as BSDS500 [3] and NYUv2 [28] demonstrate that our method is competitive with or better than the state-of-the-art w.r.t. a variety of metrics, and is also fast (running at about 50 fps). For disparity estimation, our method outperforms the original PSMNet on SceneFlow [27] as well as the high-res datasets HR-VS [42] and Middlebury-v3 [30], verifying the benefit of incorporating superpixels into downstream vision tasks.

In summary, the main contributions of the paper are: 1. We propose a simple fully convolutional network for superpixel segmentation, which achieves state-of-the-art performance on benchmark datasets. 2. We introduce a general superpixel-based downsampling/upsampling framework for DNNs. We demonstrate improved accuracy in disparity estimation by incorporating superpixels into a popular stereo matching network. To the best of our knowledge, we are the first to develop a learning-based method that simultaneously performs superpixel segmentation and dense prediction.

2. Related Work

Superpixel segmentation. There is a long line of research on superpixel segmentation, now a standard tool for many vision tasks. For a thorough survey of existing methods, we refer readers to the recent paper [33]. Here we focus on methods which use a regular grid in the initialization step. TurboPixels [22] places initial seeds at regular intervals based on the desired number of superpixels, and grows them into regions until superpixels are formed. [36] grows the superpixels by clustering pixels using a geodesic distance that embeds structure and compactness constraints. SEEDS [10] initializes the superpixels on a grid, and continuously refines the boundaries by exchanging pixels between neighboring superpixels.

The SLIC algorithm [1] employs K-means clustering to group nearby pixels into superpixels based on 5-dimensional positional and CIELAB color features. Variants of SLIC include LSC [23], which maps each pixel into a 10-dimensional feature space and performs weighted K-means; Manifold SLIC [25], which maps the image to a 2-dimensional manifold to produce content-sensitive superpixels; and SNIC [2], which replaces the iterative K-means clustering with a non-iterative region growing scheme.

While the above methods rely on hand-crafted features, recent work [35] proposes to learn pixel affinity from large data using DNNs. In [16], the authors propose to learn pixel features which are then fed to a differentiable K-means clustering module for superpixel segmentation. The resulting method, SSN, is the first end-to-end trainable network for superpixel segmentation. Different from these methods, we train a deep neural network to directly predict the pixel-superpixel association map.

The use of superpixels in deep neural networks. Several methods propose to integrate superpixels into deep learning pipelines. These works typically use pre-computed superpixels to manipulate learnt features so that important image properties (e.g., boundaries) can be better preserved. For example, [14] uses superpixels to convert 2D image patterns into 1D sequential representations, which allows a DNN to efficiently explore long-range context for saliency detection. [11] introduces a "bilateral inception" module which can be inserted into existing CNNs to perform bilateral filtering across superpixels, and [20, 34] employ superpixels to pool features for semantic segmentation. Instead, we use superpixels as an effective way to downsample/upsample. Further, none of these works has attempted to jointly learn superpixels with the downstream tasks.

Besides, our method is also similar to the deformable convolutional network (DCN) [9, 47] in that both can realize adaptive receptive fields. However, DCN is mainly designed to better handle geometric transformations and capture contextual information for feature extraction. Thus, unlike superpixels, a deformable convolution layer does not require that every pixel contribute to (and thus be represented by) the output features.

Stereo matching. Superpixel- or segmentation-based approaches to stereo matching were first introduced in [4], and have since been widely used [15, 5, 19, 38, 6, 13]. These methods first segment the images into regions and fit a parametric model (typically a plane) to each region. In [39, 40], Yamaguchi et al. propose an optimization framework to jointly segment the reference image into superpixels and estimate the disparity map. [26] trains a CNN to predict initial pixel-wise disparities, which are refined using a slanted-plane MRF model. [21] develops an efficient algorithm which computes photoconsistency for only a random subset of pixels. Our work is fundamentally different from these optimization-based methods. Instead of fitting parametric models to the superpixels, we use superpixels to develop a new downsampling/upsampling scheme for DNNs.

In the past few years, deep networks [45, 31, 29, 44] taking advantage of large-scale annotated data have generated impressive stereo matching results. Recent methods [17, 7, 8] employing 3D convolution achieve state-of-the-art performance on public benchmarks. However, due to memory constraints, these methods typically compute disparity volumes at a lower resolution. [18] bilinearly upsamples the disparity to the output size and refines it using an edge-preserving refinement network. Recent work [42] has also explored efficient high-res processing, but its focus is on generating coarse-to-fine results to meet the need for anytime on-demand depth sensing in autonomous driving applications.

3. Superpixel Segmentation Method

In this section, we introduce our CNN-based superpixel segmentation method. We first present our idea of directly predicting pixel-superpixel associations on a regular grid in Section 3.1, followed by a description of our network design and loss functions in Section 3.2. We further draw a connection between our superpixel learning regime and the recent convolutional spatial propagation (CSP) network [8] for learning pixel affinity in Section 3.3. Finally, in Section 3.4, we systematically evaluate our method on public benchmark datasets.

3.1. Learning Superpixels on a Regular Grid

In the literature, a common strategy adopted for superpixel segmentation [22, 36, 10, 1, 23, 25, 2, 16] is to first partition the H×W image using a regular grid of size h×w and consider each grid cell as an initial superpixel (i.e., a "seed"). Then, the final superpixel segmentation is obtained by finding a mapping which assigns each pixel p = (u, v) to one of the seeds s = (i, j). Mathematically, we can write the mapping as g_s(p) = g_{i,j}(u, v) = 1 if the (u, v)-th pixel belongs to the (i, j)-th superpixel, and 0 otherwise.

Figure 2. Illustration of N_p. For each pixel p in the green box, we consider the 9 grid cells in the red box for assignment.

In practice, however, it is unnecessary and computationally expensive to compute g_{i,j}(u, v) for all pixel-superpixel pairs. Instead, for a given pixel p, we constrain the search to the set of surrounding grid cells N_p. This is illustrated in Figure 2: for each pixel p in the green box, we only consider the 9 grid cells in the red box for assignment. Consequently, we can write the mapping as a tensor G ∈ Z^{H×W×|N_p|}, where |N_p| = 9.

While several approaches [22, 36, 10, 1, 23, 25, 2, 16] have been proposed to compute G, we take a different route in this paper. Specifically, we directly learn the mapping using a deep neural network. To make our objective function differentiable, we replace the hard assignment G with a soft association map Q ∈ R^{H×W×|N_p|}. Here, the entry q_s(p) represents the probability that a pixel p is assigned to each s ∈ N_p, such that \sum_{s \in N_p} q_s(p) = 1. Finally, the superpixels are obtained by assigning each pixel to the grid cell with the highest probability: s* = \arg\max_s q_s(p).
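As a concrete illustration, the following sketch converts a predicted soft association map into a hard superpixel label map. It is a minimal PyTorch sketch under assumed conventions (not the authors' released code): Q has shape (B, 9, H, W), channel k indexes the 9 surrounding grid cells in row-major order of relative offsets, and S is the grid cell size.

```python
import torch

def assign_superpixels(Q: torch.Tensor, S: int) -> torch.Tensor:
    """Return a (B, H, W) map of superpixel indices on the h x w seed grid."""
    B, K, H, W = Q.shape
    assert K == 9
    h, w = H // S, W // S
    # Grid cell containing each pixel.
    ys = torch.arange(H).unsqueeze(1).expand(H, W) // S   # (H, W)
    xs = torch.arange(W).unsqueeze(0).expand(H, W) // S
    # Most likely of the 9 candidate cells, as a relative offset.
    k = Q.argmax(dim=1)                  # (B, H, W), values in 0..8
    dy, dx = k // 3 - 1, k % 3 - 1       # offsets in {-1, 0, 1}
    i = (ys.unsqueeze(0) + dy).clamp(0, h - 1)
    j = (xs.unsqueeze(0) + dx).clamp(0, w - 1)
    return i * w + j                     # flatten (i, j) into a single index
```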

Although restricting each pixel to one of the 9 nearby cells may seem a strong constraint, making it difficult to generate long or large superpixels, we want to emphasize the importance of compactness. Superpixel segmentation is inherently an over-segmentation method. As one of the main purposes of our superpixel method is to perform detail-preserving downsampling/upsampling for a downstream network, it is more important to capture spatial coherence in the local region. For structures that extend beyond the 9-cell area, it is acceptable to segment them into pieces and leave the downstream network to aggregate them with convolution operations.

Our method vs. SSN [16]. Recently, [16] proposes SSN, an end-to-end trainable deep network for superpixel segmentation. Similar to our method, SSN also computes a soft association map Q. However, unlike our method, SSN uses the CNN as a means to extract pixel features, which are then fed to a soft K-means clustering module to compute Q. We illustrate the algorithmic schemes of the two methods in Figure 3. Both SSN and our method can take advantage of CNNs to learn complex features using task-specific loss functions. But unlike SSN, we combine feature extraction and superpixel segmentation into a single step. As a result, our network runs faster and can be easily integrated into existing CNN frameworks for downstream tasks (Section 4).

Figure 3. Comparison of algorithmic schemes. (a) SSN trains a CNN to extract pixel features, which are fed to an iterative K-means clustering module for superpixel segmentation. (b) We train a CNN to directly generate superpixels by predicting a pixel-superpixel association map.

Figure 4. Our simple encoder-decoder architecture for superpixel segmentation. Please refer to the supplementary materials for detailed specifications.

3.2. Network Design and Loss Functions

As shown in Figure 4, we use a standard encoder-decoder design with skip connections to predict the superpixel association map Q. The encoder takes a color image as input and produces high-level feature maps via a convolutional network. The decoder then gradually upsamples the feature maps via deconvolutional layers to make the final prediction, also taking into account the features from the corresponding encoder layers. We use leaky ReLU for all layers except for the prediction layer, where softmax is applied.

Similar to SSN [16], one of the main advantages of our end-to-end trainable superpixel network is its flexibility w.r.t. the loss functions. Recall that the idea of superpixels is to group similar pixels together. For different applications, one may wish to define similarity in different ways. Generally, let f(p) be the pixel property we want the superpixels to preserve. Examples of f(p) include a 3-dimensional CIELAB color vector, an N-dimensional one-hot encoding vector of semantic labels (where N is the number of classes), and many others. We further represent a pixel's position by its image coordinates p = [x, y]^T.

Given the predicted association map Q, we can compute the center of any superpixel s, c_s = (u_s, l_s), where u_s is the property vector and l_s is the location vector, as follows:

u_s = \frac{\sum_{p: s \in N_p} f(p) \cdot q_s(p)}{\sum_{p: s \in N_p} q_s(p)}, \qquad l_s = \frac{\sum_{p: s \in N_p} p \cdot q_s(p)}{\sum_{p: s \in N_p} q_s(p)}.    (1)

Here, recall that N_p is the set of superpixels surrounding p, and q_s(p) is the network-predicted probability of p being associated with superpixel s. In Eq. (1), each sum is taken over all pixels that could possibly be assigned to s.

Then, the reconstructed property and location of any pixel p are given by:

f'(p) = \sum_{s \in N_p} u_s \cdot q_s(p), \qquad p' = \sum_{s \in N_p} l_s \cdot q_s(p).    (2)
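To make Eqs. (1) and (2) concrete, here is a minimal PyTorch sketch (shapes and channel ordering are our assumptions, not the released implementation): Q is (B, 9, H, W) with channel k = 3(dy+1) + (dx+1) giving the probability of assigning a pixel to the grid cell offset by (dy, dx); f is a (B, C, H, W) pixel property map; S is the grid cell size, with H and W divisible by S.

```python
import torch
import torch.nn.functional as F

def superpixel_centers(f, Q, S):
    """Eq. (1): q-weighted means over each cell's 3S x 3S support region."""
    B, C, H, W = f.shape
    h, w = H // S, W // S
    num = f.new_zeros(B, C, h, w)
    den = f.new_zeros(B, 1, h, w)
    for k in range(9):
        dy, dx = k // 3 - 1, k % 3 - 1
        q = Q[:, k:k + 1]                        # (B, 1, H, W)
        # Sum q*f and q inside each S x S cell ...
        wf = F.avg_pool2d(q * f, S) * (S * S)    # (B, C, h, w)
        wq = F.avg_pool2d(q, S) * (S * S)
        # ... and credit those sums to the cell offset by (dy, dx).
        num[:, :, max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] += \
            wf[:, :, max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
        den[:, :, max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] += \
            wq[:, :, max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return num / den.clamp_min(1e-8)

def reconstruct(centers, Q, S):
    """Eq. (2): soft upsampling of superpixel centers back to pixels."""
    out = 0
    for k in range(9):
        dy, dx = k // 3 - 1, k % 3 - 1
        # Each pixel reads the center of the cell offset by (dy, dx). Borders
        # wrap around here; in practice Q should be near zero for cells that
        # fall outside the image.
        shifted = torch.roll(centers, shifts=(-dy, -dx), dims=(2, 3))
        up = F.interpolate(shifted, scale_factor=S, mode='nearest')
        out = out + Q[:, k:k + 1] * up
    return out
```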

Finally, the general formulation of our loss function has two terms. The first term encourages the trained model to group pixels with a similar property of interest, and the second term enforces the superpixels to be spatially compact:

L(Q) = \sum_{p} \mathrm{dist}(f(p), f'(p)) + \frac{m}{S} \| p - p' \|_2,    (3)

where dist(·, ·) is a task-specific distance metric depending on the pixel property f(p), S is the superpixel sampling interval, and m is a weight balancing the two terms.

In this paper, we consider two different choices of f(p). First, we choose the CIELAB color vector and use the \ell_2 norm as the distance measure. This leads to an objective function similar to that of the original SLIC method [1]:

L_{SLIC}(Q) = \sum_{p} \| f_{col}(p) - f'_{col}(p) \|_2 + \frac{m}{S} \| p - p' \|_2.    (4)

Second, following [16], we choose the one-hot encoding vector of semantic labels and use the cross-entropy E(·, ·) as the distance measure:

L_{sem}(Q) = \sum_{p} E(f_{sem}(p), f'_{sem}(p)) + \frac{m}{S} \| p - p' \|_2.    (5)
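For illustration, a minimal sketch of the SLIC-style loss in Eq. (4), reusing the superpixel_centers and reconstruct helpers sketched above (again our own reading: the loss is averaged rather than summed over pixels, and the value of m is task-dependent):

```python
import torch

def slic_loss(fcol, Q, S, m):
    """fcol: (B, 3, H, W) CIELAB image; Q: (B, 9, H, W) association map."""
    B, _, H, W = fcol.shape
    ys = torch.arange(H, device=fcol.device, dtype=fcol.dtype)
    xs = torch.arange(W, device=fcol.device, dtype=fcol.dtype)
    xy = torch.stack(torch.meshgrid(ys, xs, indexing='ij'))   # (2, H, W)
    xy = xy.unsqueeze(0).expand(B, -1, -1, -1)
    feat = torch.cat([fcol, xy], dim=1)        # pool color and position together
    centers = superpixel_centers(feat, Q, S)   # Eq. (1)
    recon = reconstruct(centers, Q, S)         # Eq. (2)
    color_term = (recon[:, :3] - fcol).norm(dim=1)   # ||f_col(p) - f'_col(p)||_2
    pos_term = (recon[:, 3:] - xy).norm(dim=1)       # ||p - p'||_2
    return (color_term + (m / S) * pos_term).mean()
```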

3.3. Connection to Spatial Propagation Network

Recently, [8] proposes the convolutional spatial propagation (CSP) network, which learns an affinity matrix to propagate information to nearby spatial locations. By integrating the CSP module into existing deep neural networks, [8] has demonstrated improved performance in affinity-based vision tasks such as depth completion and refinement. In this section, we show that the computation of superpixel centers using the learnt association map Q can be written mathematically in the form of CSP, thus drawing a connection between learning Q and learning the affinity matrix as in [8].

Given an input feature volume X ∈ R^{H×W×C}, the convolutional spatial propagation (CSP) with a kernel size K and stride S can be written as:

y_{i,j} = \sum_{a,b=-K/2+1}^{K/2} \kappa_{i,j}(a, b) \odot x_{i \cdot S + a, \, j \cdot S + b},    (6)

where Y ∈ R^{h×w×C} is an output volume such that h = H/S and w = W/S, κ_{i,j} is the output of an affinity network such that \sum_{a,b=-K/2+1}^{K/2} \kappa_{i,j}(a, b) = 1, and ⊙ is the element-wise product.

In the meantime, as illustrated in Figure 2, to compute the superpixel center associated with the (i, j)-th grid cell, we consider all pixels in the surrounding 3S×3S region:

c_{i,j} = \sum_{a,b=-3S/2+1}^{3S/2} \hat{q}_{i,j}(a, b) \, x_{i \cdot S + a, \, j \cdot S + b},    (7)

where

\hat{q}_{i,j}(a, b) = \frac{q_{i,j}(u, v)}{\sum_{a,b=-3S/2+1}^{3S/2} q_{i,j}(u, v)},    (8)

and u = i·S + a, v = j·S + b.

Comparing Eq. (6) with Eq. (7), we can see that computing the center of a superpixel of size S×S is equivalent to performing CSP with a 3S×3S kernel derived from Q. Furthermore, both κ_{i,j}(a, b) and q_{i,j}(u, v) represent the learnt weight between the spatial location (u, v) in the input volume and (i, j) in the output volume. In this regard, predicting Q in our work can be viewed as learning an affinity matrix as in [8].
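This equivalence is easy to check numerically. The snippet below (our illustration, with arbitrary shapes) verifies for a single output cell that the CSP step of Eq. (7), using the normalized kernel of Eq. (8), reproduces the q-weighted mean that defines the superpixel center:

```python
import torch

S = 4
x = torch.randn(3 * S, 3 * S, 5)   # features of the 3S x 3S pixels around cell (i, j)
q = torch.rand(3 * S, 3 * S)       # q_{i,j}(u, v) for this fixed cell
kernel = q / q.sum()               # normalized weights, Eq. (8)
c_csp = (kernel.unsqueeze(-1) * x).sum(dim=(0, 1))         # CSP step, Eq. (7)
c_mean = (q.unsqueeze(-1) * x).sum(dim=(0, 1)) / q.sum()   # q-weighted center
assert torch.allclose(c_csp, c_mean)
```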

Nevertheless, we point out that, while the techniques presented in this work and [8] share the same mathematical form, they are developed for very different purposes. In [8], Eq. (6) is employed repeatedly (with S = 1) to propagate information to nearby locations, whereas in this work, we use Eq. (7) to compute superpixel centers (with S > 1).

3.4. Experiments

We train our model with segmentation labels on the standard benchmark BSDS500 [3] and compare it with state-of-the-art superpixel methods. To further evaluate the method's generalizability, we also report its performance, without fine-tuning, on another benchmark dataset, NYUv2 [28].

All evaluations are conducted using the protocols and code provided by [33] (https://github.com/davidstutz/superpixel-benchmark). We run LSC [23], ERS [24], SNIC [2], SEAL [35], and SSN [16] with the original implementations from the authors, and run SLIC [1] and ETPS [43] with the code provided in [33]. For LSC, ERS, SLIC, and ETPS, we use the best parameters reported in [33]; for the rest, we use the default parameters recommended by the original authors.

Implementation details. Our model is implemented in PyTorch and optimized using Adam with β1 = 0.9 and β2 = 0.999. We use L_sem in Eq. (5) for this experiment, with m = 0.003. During training, we randomly crop the images to size 208×208 as input, and perform horizontal/vertical flipping for data augmentation. The initial learning rate is set to 5×10^{-5}, and is reduced by half after 200k iterations. Convergence is reached at about 300k iterations. For training, we use a grid with cell size 16×16, which is equivalent to setting the desired number of superpixels to 169. At test time, to generate a varying number of superpixels, we simply resize the input image to the appropriate size. For example, by resizing the image to 480×320, our network will generate about 600 superpixels. Furthermore, for a fair comparison, most evaluation protocols expect superpixels to be spatially connected. To enforce this, we apply an off-the-shelf component connection algorithm to our output, which merges superpixels that are smaller than a certain threshold with the surrounding ones. (Code and models are available at https://github.com/fuy34/superpixel_fcn.)
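The post-processing step can be sketched as follows (a minimal illustration with scikit-image, not the exact off-the-shelf routine used in our experiments): split every superpixel into its connected components, then merge any component smaller than a threshold into its most common neighbor.

```python
import numpy as np
from skimage import measure, segmentation

def enforce_connectivity(labels: np.ndarray, min_size: int) -> np.ndarray:
    """labels: (H, W) superpixel index map; returns a connected label map."""
    comps = measure.label(labels, connectivity=1, background=-1)
    out = comps.copy()
    for region in measure.regionprops(comps):
        if region.area < min_size:
            mask = comps == region.label
            # Pixels just outside the small component.
            border = segmentation.find_boundaries(mask, mode='outer')
            neighbors = out[border]
            neighbors = neighbors[neighbors != region.label]
            if neighbors.size:
                # Absorb the component into its most common neighbor.
                out[mask] = np.bincount(neighbors).argmax()
    return out
```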

Evaluation metrics. We evaluate the superpixel methods using popular metrics including achievable segmentation accuracy (ASA), boundary recall and precision (BR-BP), and compactness (CO). ASA quantifies the achievable accuracy of segmentation when using superpixels as a pre-processing step, BR and BP measure the boundary adherence of superpixels given the ground truth, and CO assesses the compactness of superpixels. The higher these scores are, the better the segmentation result is. As in [33], for BR and BP evaluation, we set the boundary tolerance to 0.0025 times the image diagonal, rounded to the closest integer. We refer readers to [33] for the precise definitions.

Results on BSDS500. BSDS500 contains 200 training, 100 validation, and 200 test images. As multiple labels are available for each image, we follow [16, 35] and treat each annotation as an individual sample, which results in 1633 training/validation samples and 1063 test samples. We train our model using both the training and validation samples.

Figure 5 reports the performance of all methods on the BSDS500 test set. Our method outperforms all traditional methods on all evaluation metrics, except SLIC in terms of CO. Compared to the other deep learning-based methods, SEAL and SSN, our method achieves competitive or better results in terms of ASA and BR-BP, and significantly higher scores in terms of CO. Figure 8 further shows example results of different methods. Note that, as discussed in [33], there is a well-known trade-off between boundary adherence and compactness. Although our method does not outperform existing methods on all the metrics, it appears to strike a better balance among them. It is also worth noting that by achieving a higher CO score, our method is able to better capture spatially coherent information and avoids paying too much attention to image details and noise. This characteristic tends to lead to better generalizability, as shown in the NYUv2 experiment results.

Figure 5. Superpixel segmentation results on BSDS500. From left to right: ASA, BR-BP, and CO.

Figure 6. Superpixel segmentation results on NYUv2. From left to right: ASA, BR-BP, and CO.

Figure 7. Average runtime of different DL methods w.r.t. the number of superpixels. Note that the y-axis is plotted in logarithmic scale.

We also compare the runtime of the deep learning-based (DL) methods. Figure 7 reports the average runtime w.r.t. the number of generated superpixels on an NVIDIA GTX 1080Ti GPU. Our method runs about 3 to 8 times faster than SSN and more than 50 times faster than SEAL. This is expected, as our method uses a simple encoder-decoder network to directly generate superpixels, whereas SEAL and SSN first use deep networks to predict pixel affinities or features, and then apply traditional clustering methods (i.e., graph cuts or K-means) to get superpixels.

Results on NYUv2. NYUv2 is an RGB-D dataset originally proposed for indoor scene understanding tasks, which contains 1,449 images with object instance labels. By removing the unlabelled regions near the image boundary, [33] has developed a benchmark on a subset of 400 test images of size 608×448 for superpixel evaluation. To test the generalizability of the learning-based methods, we directly apply the models of SEAL, SSN, and our method trained on BSDS500 to this dataset without any fine-tuning.

Figure 6 shows the performance of all methods on NYUv2. Generally, all deep learning-based methods perform well, as they continue to achieve competitive or better performance against the traditional methods. Further, our method is shown to generalize better than SEAL and SSN, as is evident from comparing the corresponding curves in Figures 5 and 6. Specifically, our method outperforms SEAL and SSN in terms of BR-BP and CO, and is one of the best in terms of ASA. Visual results are shown in Figure 8.

4. Application to Stereo Matching

Stereo matching is a classic computer vision task which aims to find pixel correspondences between a pair of rectified images. Recent literature has shown that deep networks can boost matching accuracy by building a 4D cost volume (height × width × disparity × feature channels) and aggregating the information using 3D convolution [7, 8, 46]. However, such a design consumes large amounts of memory because of the extra "disparity" dimension, limiting the ability to generate high-res outputs. A common remedy is to bilinearly upsample the predicted low-res disparity volumes for final disparity regression. As a result, object boundaries often become blurred and fine details get lost.

In this section, we propose a downsampling/upsampling scheme based on the predicted superpixels and show how to integrate it into existing stereo matching pipelines to generate high-res outputs that better preserve object boundaries and fine details.

Figure 8. Example superpixel segmentation results. Compared to SEAL and SSN, our method is competitive or better in terms of object boundary adherence while generating more compact superpixels. Top rows: BSDS500. Bottom rows: NYUv2.

4.1. Network Design and Loss Function

Figure 1 provides an overview of our method design. We choose PSMNet [7] as our task network. In order to incorporate our new downsampling/upsampling scheme, we change all the stride-2 convolutions in its feature extractor to stride-1, and remove the bilinear upsampling operations in the spatial dimensions. Given a pair of input images, we use our superpixel network to predict association maps Q_l and Q_r and compute the superpixel center maps using Eq. (1). The center maps (i.e., downsampled images) are then fed into the modified PSMNet to get the low-res disparity volume. Next, the low-res volume is upsampled to the original resolution with Q_l according to Eq. (2), and the final disparity is computed using disparity regression. We refer readers to the supplementary materials for detailed specifications.
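The overall wiring can be sketched as follows (a minimal illustration under our assumptions, reusing the superpixel_centers and reconstruct helpers from Section 3; details such as the extra bilinear upsampling along the disparity dimension are omitted). Here superpixel_net and stereo_net are stand-ins for our segmentation network and the modified PSMNet.

```python
import torch
import torch.nn.functional as F

def stereo_with_superpixels(left, right, superpixel_net, stereo_net, S):
    """left/right: (B, 3, H, W); returns a (B, H, W) disparity map."""
    Ql, Qr = superpixel_net(left), superpixel_net(right)   # (B, 9, H, W)
    # Superpixel-based downsampling of both views, Eq. (1).
    left_s = superpixel_centers(left, Ql, S)
    right_s = superpixel_centers(right, Qr, S)
    # Modified PSMNet predicts a low-res disparity volume.
    volume = stereo_net(left_s, right_s)                   # (B, D, H/S, W/S)
    # Superpixel-based upsampling with the left association map, Eq. (2) ...
    volume_hr = reconstruct(volume, Ql, S)                 # (B, D, H, W)
    # ... followed by soft-argmin disparity regression at full resolution.
    prob = F.softmax(volume_hr, dim=1)
    disps = torch.arange(volume_hr.size(1), device=left.device,
                         dtype=left.dtype).view(1, -1, 1, 1)
    return (prob * disps).sum(dim=1)
```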

Same as PSMNet [7], we use the 3-stage smooth L1 loss with the weights α1 = 0.5, α2 = 0.7, and α3 = 1.0 for disparity prediction, and we use the SLIC loss (Eq. (4)) for superpixel segmentation. The final loss function is:

L = \sum_{s=1}^{3} \alpha_s \frac{1}{N} \sum_{p=1}^{N} \mathrm{smooth}_{L_1}(d_p - \hat{d}_p) + \frac{\lambda}{N} L_{SLIC}(Q),    (9)

where N is the total number of pixels, and λ is a weight to balance the two terms. We set λ = 0.1 for all experiments.
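A minimal sketch of Eq. (9), reusing the slic_loss sketch from Section 3.2 (per-pixel averages stand in for the 1/N sums; stage weights follow the paper, and the default m here matches our SceneFlow setting):

```python
import torch.nn.functional as F

def total_loss(disparities, gt, left_lab, Q, S, lam=0.1, m=30):
    """disparities: list of the 3 stage predictions, each shaped like gt."""
    alphas = (0.5, 0.7, 1.0)
    disp_term = sum(a * F.smooth_l1_loss(d, gt)
                    for a, d in zip(alphas, disparities))
    return disp_term + lam * slic_loss(left_lab, Q, S, m)
```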

4.2. Experiments

We have conducted experiments on three public datasets, SceneFlow [27], HR-VS [42], and Middlebury-v3 [30], to compare our model with PSMNet. To further verify the benefit of joint learning of superpixels and disparity estimation, we trained two different models for our method. In the first model, Ours fixed, we fix the parameters of the superpixel network and train the rest of the network (i.e., the modified PSMNet) for disparity estimation. In the second model, Ours joint, we jointly train all networks in Figure 1. For both models, the superpixel network is pre-trained on SceneFlow using the SLIC loss. The experiments are conducted on 4 Nvidia TITAN Xp GPUs.

Results on SceneFlow. SceneFlow is a synthetic dataset containing 35,454 training and 4,370 test frames with dense ground-truth disparity. Following [7], we exclude pixels with disparities greater than 192 at training and test time. During training, we set m = 30 in the SLIC loss and randomly crop the input images to size 512×256. To conduct 3D convolution at 1/4 of the input resolution as PSMNet does, we predict superpixels with grid cell size 4×4 to perform the 4× downsampling/upsampling. We train the model for 13 epochs with batch size 8. The initial learning rate is 1×10^{-3}, and is reduced to 5×10^{-4} and 1×10^{-4} after 11 and 12 epochs, respectively. For PSMNet, we use the authors' implementation and train it with the same learning schedule as our methods.

We use the standard end-point error (EPE) as the evaluation metric, which measures the mean pixel-wise Euclidean distance between the predicted disparity and the ground truth. As shown in Table 1, Ours joint achieves the lowest EPE. Also note that Ours fixed performs worse than the original PSMNet, which demonstrates the importance of joint training. Qualitative results are shown in Figure 9. One can see that both Ours fixed and Ours joint preserve fine details better than the original PSMNet.

Figure 9. Qualitative results on SceneFlow and HR-VS. Our method is able to better preserve fine details, such as the wires and mirror frames in the highlighted regions. Top rows: SceneFlow. Bottom rows: HR-VS.

Table 1. End-point error (EPE) on SceneFlow and HR-VS.

Dataset   | PSMNet [7] | Ours fixed | Ours joint
SceneFlow | 1.04       | 1.07       | 0.93
HR-VS     | 3.83       | 3.70       | 2.77

Results on HR-VS. HR-VS is a synthetic dataset of urban driving views. It contains 780 images at 2056×2464 resolution, with a valid disparity range of [9.66, 768]. Because no test set is released, we randomly choose 680 frames for training, and use the rest for testing. Due to the relatively small dataset size, we fine-tune all three models trained on SceneFlow in the previous experiment on this dataset.

Because of the high resolution and large disparity range, the original PSMNet cannot be directly applied to the full-size images. We follow the common practice of downsampling both the input images and disparity maps to 1/4 size for training, and upsampling the result to full resolution for evaluation. For our method, we predict superpixels with grid cell size 16×16 to perform 16× downsampling/upsampling. During training, we set m = 30, and randomly crop the images to size 2048×1024. We train all methods for 200 epochs with batch size 4. The initial learning rate is 1×10^{-3} and is reduced to 1×10^{-4} after 150 epochs.

As shown in Table 1, our models outperform the original PSMNet, and a significantly lower EPE is achieved by joint training. Note that, compared to SceneFlow, we observe a larger performance gain on this high-res dataset, as we perform 16× upsampling on HR-VS but only 4× upsampling on SceneFlow. Qualitative results are shown in Figure 9.

Results on Middlebury-v3. Middlebury-v3 is a high-res real-world dataset with 10 training frames, 13 validation frames (named the "additional dataset" on the official website), and 15 test frames. We use both the training and validation frames to tune the Ours joint model pre-trained on SceneFlow with 16×16 superpixels. We set m = 60 and train the model for 30 epochs with batch size 4. The initial learning rate is 1×10^{-3} and is divided by 10 after 20 epochs.

Note that our goal in this experiment is not to achieve the highest rank on the official Middlebury-v3 leaderboard, but to verify the effectiveness of the proposed superpixel-based downsampling/upsampling scheme. Based on the leaderboard, our model outperforms PSMNet across all metrics, some of which are presented in Table 2. The results again verify the benefit of the proposed scheme.

Table 2. Results on the Middlebury-v3 benchmark.

Method         | avgerr | rms  | bad-4.0 | A90
PSMNet ROB [7] | 8.78   | 23.3 | 29.2    | 22.8
Ours joint     | 7.11   | 19.1 | 27.5    | 13.8

5. Conclusion

This paper has presented a simple fully convolutional network for superpixel segmentation. Experiments on benchmark datasets show that the proposed model is computationally efficient, and consistently achieves state-of-the-art performance with good generalizability. Further, we have demonstrated that higher disparity estimation accuracy can be obtained by using superpixels to preserve object boundaries and fine details in a popular stereo matching network. In the future, we plan to apply the proposed superpixel-based downsampling/upsampling scheme to other dense prediction tasks, such as object segmentation and optical flow estimation, and to explore different ways to use superpixels in these applications.

Acknowledgement. This work is supported in part by NSF award #1815491 and a gift from Adobe.

References

[1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurélien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell., 34(11):2274–2282, 2012.
[2] Radhakrishna Achanta and Sabine Süsstrunk. Superpixels and polygons using simple non-iterative clustering. In CVPR, pages 4895–4904, 2017.
[3] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 33(5):898–916, 2010.
[4] Stan Birchfield and Carlo Tomasi. Multiway cut for stereo and motion with slanted surfaces. In ICCV, pages 489–495, 1999.
[5] Michael Bleyer and Margrit Gelautz. A layered stereo algorithm using image segmentation and global visibility constraints. In ICIP, pages 2997–3000, 2004.
[6] Michael Bleyer, Carsten Rother, and Pushmeet Kohli. Surface stereo with soft segmentation. In CVPR, pages 1570–1577, 2010.
[7] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In CVPR, pages 5410–5418, 2018.
[8] Xinjing Cheng, Peng Wang, and Ruigang Yang. Learning depth with convolutional spatial propagation network. CoRR, abs/1810.02695, 2018.
[9] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, pages 764–773, 2017.
[10] Michael Van den Bergh, Xavier Boix, Gemma Roig, and Luc J. Van Gool. SEEDS: Superpixels extracted via energy-driven sampling. International Journal of Computer Vision, 111(3):298–314, 2015.
[11] Raghudeep Gadde, Varun Jampani, Martin Kiefel, Daniel Kappler, and Peter V. Gehler. Superpixel convolutional networks using bilateral inceptions. In ECCV, pages 597–613, 2016.
[12] Stephen Gould, Jim Rodgers, David Cohen, Gal Elidan, and Daphne Koller. Multi-class segmentation with relative location prior. International Journal of Computer Vision, 80(3):300–316, 2008.
[13] Fatma Güney and Andreas Geiger. Displets: Resolving stereo ambiguities using object knowledge. In CVPR, pages 4165–4175, 2015.
[14] Shengfeng He, Rynson W. H. Lau, Wenxi Liu, Zhe Huang, and Qingxiong Yang. SuperCNN: A superpixelwise convolutional neural network for salient object detection. International Journal of Computer Vision, 115(3):330–344, 2015.
[15] Li Hong and George Chen. Segment-based stereo matching using graph cuts. In CVPR, pages 74–81, 2004.
[16] Varun Jampani, Deqing Sun, Ming-Yu Liu, Ming-Hsuan Yang, and Jan Kautz. Superpixel sampling networks. In ECCV, pages 363–380, 2018.
[17] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, and Peter Henry. End-to-end learning of geometry and context for deep stereo regression. In ICCV, pages 66–75, 2017.
[18] Sameh Khamis, Sean Ryan Fanello, Christoph Rhemann, Adarsh Kowdle, Julien P. C. Valentin, and Shahram Izadi. StereoNet: Guided hierarchical refinement for real-time edge-aware depth prediction. In ECCV, pages 596–613, 2018.
[19] Andreas Klaus, Mario Sormann, and Konrad F. Karner. Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure. In ICPR, pages 15–18, 2006.
[20] Suha Kwak, Seunghoon Hong, and Bohyung Han. Weakly supervised semantic segmentation using superpixel pooling network. In AAAI, pages 4111–4117, 2017.
[21] Chloe LeGendre, Konstantinos Batsos, and Philippos Mordohai. High-resolution stereo matching based on sampled photoconsistency computation. In BMVC, 2017.
[22] Alex Levinshtein, Adrian Stere, Kiriakos N. Kutulakos, David J. Fleet, Sven J. Dickinson, and Kaleem Siddiqi. TurboPixels: Fast superpixels using geometric flows. IEEE Trans. Pattern Anal. Mach. Intell., 31(12):2290–2297, 2009.
[23] Zhengqin Li and Jiansheng Chen. Superpixel segmentation using linear spectral clustering. In CVPR, pages 1356–1363, 2015.
[24] Ming-Yu Liu, Oncel Tuzel, Srikumar Ramalingam, and Rama Chellappa. Entropy rate superpixel segmentation. In CVPR, pages 2097–2104, 2011.
[25] Yong-Jin Liu, Cheng-Chi Yu, Minjing Yu, and Ying He. Manifold SLIC: A fast method to compute content-sensitive superpixels. In CVPR, pages 651–659, 2016.
[26] Wenjie Luo, Alexander G. Schwing, and Raquel Urtasun. Efficient deep learning for stereo matching. In CVPR, pages 5695–5703, 2016.
[27] Nikolaus Mayer, Eddy Ilg, Philip Häusser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, pages 4040–4048, 2016.
[28] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[29] Jiahao Pang, Wenxiu Sun, Jimmy S. J. Ren, Chengxi Yang, and Qiong Yan. Cascade residual learning: A two-stage convolutional neural network for stereo matching. In ICCV Workshops, pages 878–886, 2017.
[30] Daniel Scharstein, Heiko Hirschmüller, York Kitajima, Greg Krathwohl, Nera Nešić, Xi Wang, and Porter Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In German Conference on Pattern Recognition, pages 31–42. Springer, 2014.
[31] Amit Shaked and Lior Wolf. Improved stereo matching with constant highway networks and reflective confidence learning. In CVPR, pages 6901–6910, 2017.
[32] Guang Shu, Afshin Dehghan, and Mubarak Shah. Improving an object detector and extracting regions using superpixels. In CVPR, pages 3721–3727, 2013.
[33] David Stutz, Alexander Hermans, and Bastian Leibe. Superpixels: An evaluation of the state-of-the-art. Computer Vision and Image Understanding, 166:1–27, 2018.
[34] Teppei Suzuki, Shuichi Akizuki, Naoki Kato, and Yoshimitsu Aoki. Superpixel convolution for segmentation. In ICIP, pages 3249–3253, 2018.
[35] Wei-Chih Tu, Ming-Yu Liu, Varun Jampani, Deqing Sun, Shao-Yi Chien, Ming-Hsuan Yang, and Jan Kautz. Learning superpixels with segmentation-aware affinity loss. In CVPR, pages 568–576, 2018.
[36] Peng Wang, Gang Zeng, Rui Gan, Jingdong Wang, and Hongbin Zha. Structure-sensitive superpixels via geodesic distance. International Journal of Computer Vision, 103(1):1–21, 2013.
[37] Shu Wang, Huchuan Lu, Fan Yang, and Ming-Hsuan Yang. Superpixel tracking. In ICCV, pages 1323–1330, 2011.
[38] Zeng-Fu Wang and Zhi-Gang Zheng. A region based stereo matching algorithm using cooperative optimization. In CVPR, 2008.
[39] Koichiro Yamaguchi, Tamir Hazan, David A. McAllester, and Raquel Urtasun. Continuous Markov random fields for robust stereo estimation. In ECCV, pages 45–58, 2012.
[40] Koichiro Yamaguchi, David A. McAllester, and Raquel Urtasun. Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In ECCV, pages 756–771, 2014.
[41] Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. Saliency detection via graph-based manifold ranking. In CVPR, pages 3166–3173, 2013.
[42] Gengshan Yang, Joshua Manela, Michael Happold, and Deva Ramanan. Hierarchical deep stereo matching on high-resolution images. In CVPR, pages 5515–5524, 2019.
[43] Jian Yao, Marko Boben, Sanja Fidler, and Raquel Urtasun. Real-time coarse-to-fine topologically preserving segmentation. In CVPR, pages 2947–2955, 2015.
[44] Lidong Yu, Yucheng Wang, Yuwei Wu, and Yunde Jia. Deep stereo matching with explicit cost aggregation sub-architecture. In AAAI, pages 7517–7524, 2018.
[45] Jure Zbontar and Yann LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17:65:1–65:32, 2016.
[46] Feihu Zhang, Victor Prisacariu, Ruigang Yang, and Philip H. S. Torr. GA-Net: Guided aggregation net for end-to-end stereo matching. In CVPR, pages 185–194, 2019.
[47] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In CVPR, pages 9308–9316, 2019.

Supplementary Materials for
Superpixel Segmentation with Fully Convolutional Networks

Fengting Yang, Qian Sun
The Pennsylvania State University
fuy34@psu.edu, uestcqs@gmail.com

Hailin Jin
Adobe Research
hljin@adobe.com

Zihan Zhou
The Pennsylvania State University
zzhou@ist.psu.edu

In Section 1 and Section 2, we provide the detailed architecture designs for the superpixel segmentation network and the stereo matching network, respectively. In Section 3, we report additional qualitative results for superpixel segmentation on BSDS500 and NYUv2, disparity estimation on SceneFlow, HR-VS, and Middlebury-v3, and superpixel segmentation on HR-VS.

1. Superpixel Segmentation Network

Table 1 shows the specific design of our superpixel segmentation network. We use a standard encoder-decoder design with skip connections to predict the superpixel association map Q. Batch normalization and leaky ReLU with negative slope 0.1 are used for all the convolution layers, except for the association prediction layer (assoc), where softmax is applied.

Table 1. Specification of our superpixel segmentation network architecture.

Name   | Kernel | Str. | Ch I/O  | InpRes  | OutRes  | Input
cnv0a  | 3×3    | 1    | 3/16    | 208×208 | 208×208 | image
cnv0b  | 3×3    | 1    | 16/16   | 208×208 | 208×208 | cnv0a
cnv1a  | 3×3    | 2    | 16/32   | 208×208 | 104×104 | cnv0b
cnv1b  | 3×3    | 1    | 32/32   | 104×104 | 104×104 | cnv1a
cnv2a  | 3×3    | 2    | 32/64   | 104×104 | 52×52   | cnv1b
cnv2b  | 3×3    | 1    | 64/64   | 52×52   | 52×52   | cnv2a
cnv3a  | 3×3    | 2    | 64/128  | 52×52   | 26×26   | cnv2b
cnv3b  | 3×3    | 1    | 128/128 | 26×26   | 26×26   | cnv3a
cnv4a  | 3×3    | 2    | 128/256 | 26×26   | 13×13   | cnv3b
cnv4b  | 3×3    | 1    | 256/256 | 13×13   | 13×13   | cnv4a
upcnv3 | 4×4    | 2    | 256/128 | 13×13   | 26×26   | cnv4b
icnv3  | 3×3    | 1    | 256/128 | 26×26   | 26×26   | upcnv3+cnv3b
upcnv2 | 4×4    | 2    | 128/64  | 26×26   | 52×52   | icnv3
icnv2  | 3×3    | 1    | 128/64  | 52×52   | 52×52   | upcnv2+cnv2b
upcnv1 | 4×4    | 2    | 64/32   | 52×52   | 104×104 | icnv2
icnv1  | 3×3    | 1    | 64/32   | 104×104 | 104×104 | upcnv1+cnv1b
upcnv0 | 4×4    | 2    | 32/16   | 104×104 | 208×208 | icnv1
icnv0  | 3×3    | 1    | 32/16   | 208×208 | 208×208 | upcnv0+cnv0b
assoc  | 3×3    | 1    | 16/9    | 208×208 | 208×208 | icnv0
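For reference, a minimal PyTorch sketch of the network in Table 1 follows; it reflects our reading of the table (e.g., "+" as channel concatenation for the skip connections) rather than the released implementation.

```python
import torch
import torch.nn as nn

def conv(cin, cout, stride=1):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                         nn.BatchNorm2d(cout),
                         nn.LeakyReLU(0.1))

def deconv(cin, cout):
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, 2, 1),
                         nn.BatchNorm2d(cout),
                         nn.LeakyReLU(0.1))

class SuperpixelFCN(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc0 = nn.Sequential(conv(3, 16), conv(16, 16))         # cnv0a/b
        self.enc1 = nn.Sequential(conv(16, 32, 2), conv(32, 32))     # cnv1a/b
        self.enc2 = nn.Sequential(conv(32, 64, 2), conv(64, 64))     # cnv2a/b
        self.enc3 = nn.Sequential(conv(64, 128, 2), conv(128, 128))  # cnv3a/b
        self.enc4 = nn.Sequential(conv(128, 256, 2), conv(256, 256))
        self.up3, self.ic3 = deconv(256, 128), conv(256, 128)        # upcnv3/icnv3
        self.up2, self.ic2 = deconv(128, 64), conv(128, 64)
        self.up1, self.ic1 = deconv(64, 32), conv(64, 32)
        self.up0, self.ic0 = deconv(32, 16), conv(32, 16)
        self.assoc = nn.Conv2d(16, 9, 3, 1, 1)                       # prediction head

    def forward(self, x):
        e0 = self.enc0(x)
        e1 = self.enc1(e0)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        e4 = self.enc4(e3)
        d3 = self.ic3(torch.cat([self.up3(e4), e3], dim=1))
        d2 = self.ic2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.ic1(torch.cat([self.up1(d2), e1], dim=1))
        d0 = self.ic0(torch.cat([self.up0(d1), e0], dim=1))
        return torch.softmax(self.assoc(d0), dim=1)  # Q: (B, 9, H, W)
```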

2. Stereo Matching Network

Table 2 shows the architecture design of our stereo matching network, in which we modify PSMNet [1] to perform superpixel-based downsampling/upsampling operations. We name it superpixel-based PSMNet (SPPSMNet).

Table 2. Specification of our stereo matching network (SPPSMNet) architecture. Img_1/2 denotes the left/right input image of size H×W×3, and D is the maximum disparity.

Name | Kernel | Str. | Input | OutDim
-- Input --
Img_1/2 | — | — | — | H×W×3
-- Superpixel segmentation and superpixel-based downsampling --
assoc_1/2 | see Table 1 | — | Img_1/2 | H×W×9
sImg_1/2 | — | — | assoc_1/2, Img_1/2 | (1/4)H × (1/4)W × 3
-- PSMNet feature extractor --
cnv0_1 | 3×3, 32 | 1 | sImg_1/2 | (1/4)H × (1/4)W × 32
cnv0_2 | 3×3, 32 | 1 | cnv0_1 | (1/4)H × (1/4)W × 32
cnv0_3 | 3×3, 32 | 1 | cnv0_2 | (1/4)H × (1/4)W × 32
cnv1_x | [3×3, 32; 3×3, 32] ×3 | 1 | cnv0_3 | (1/4)H × (1/4)W × 32
cnv2_x | [3×3, 64; 3×3, 64] ×16 | 1 | cnv1_x | (1/4)H × (1/4)W × 64
cnv3_x | [3×3, 128; 3×3, 128] ×3 | 1 | cnv2_x | (1/4)H × (1/4)W × 128
cnv4_x | [3×3, 128; 3×3, 128] ×3, dila = 2 | 1 | cnv3_x | (1/4)H × (1/4)W × 128
-- PSMNet SPP module, cost volume, and 3D CNN (please refer to [1] for details) --
output_1 | | | | (1/4)H × (1/4)W × (1/4)D × 1
output_2 | | | | (1/4)H × (1/4)W × (1/4)D × 1
output_3 | | | | (1/4)H × (1/4)W × (1/4)D × 1
-- Superpixel-based upsampling --
disp_prb1 | bilinear upsampling to (1/4)H × (1/4)W × D, then upsampling with assoc_1 | N.A. | output_1 | H×W×D
disp_prb2 | bilinear upsampling to (1/4)H × (1/4)W × D, then upsampling with assoc_1 | N.A. | output_2 | H×W×D
disp_prb3 | bilinear upsampling to (1/4)H × (1/4)W × D, then upsampling with assoc_1 | N.A. | output_3 | H×W×D
-- PSMNet disparity regression --
disp_1 | disparity regression | N.A. | disp_prb1 | H×W
disp_2 | disparity regression | N.A. | disp_prb2 | H×W
disp_3 | disparity regression | N.A. | disp_prb3 | H×W

The layers that differ from the original PSMNet are the superpixel-related layers (assoc_1/2, sImg_1/2, and disp_prb1–3) and the stride-1 convolutions that replace the original stride-2 ones. In Table 2, we use input image size 256×512 with maximum disparity D = 192, which is the same as the original PSMNet, and we set the superpixel grid cell size to 4×4 to perform 4× downsampling/upsampling.

For stereo matching tasks with high-resolution images (i.e., HR-VS and Middlebury-v3), we use input image size 1024×2048 with maximum disparity D = 768, and we set the superpixel grid cell size to 16×16 to perform 16× downsampling/upsampling. To further reduce GPU memory usage in the high-res stereo matching tasks, we reduce the channel number of the layers "cnv4a" and "cnv4b" in the superpixel segmentation network from 256 to 128, remove the batch normalization operations in the superpixel segmentation network, and perform superpixel-based spatial upsampling after the disparity regression.

3. Additional Qualitative Results

3.1. Superpixel Segmentation

Figure 1 and Figure 2 show additional qualitative results for superpixel segmentation on BSDS500 and NYUv2. The three learning-based methods, SEAL, SSN, and ours, can recover more detailed boundaries than SLIC, such as the hub of the windmill in the second row of Figure 1 and the pillow on the right bed in the fourth row of Figure 2. Compared to SEAL and SSN, our method usually generates more compact superpixels.

3.2. Application to Stereo Matching

Figure 3, Figure 4, and Figure 6 show the disparity prediction results on SceneFlow, HR-VS, and Middlebury-v3, respectively. Compared to PSMNet, our methods are able to better preserve fine details, such as the headset wire (the seventh row of Figure 3), the street lamp post (the first row of Figure 4), and the leaves (the fifth row of Figure 6). We also observe that our method can better handle textureless areas, such as the car back in the seventh row of Figure 4. This is probably because our method directly downsamples the images 16× before sending them to the modified PSMNet, while the original PSMNet only downsamples the image 4× and uses stride-2 convolutions to perform another 4× downsampling later. The input receptive field (w.r.t. the original image) of our method is thus larger than that of the original PSMNet, which enables our method to better leverage context information around textureless areas.

Figure 5 visualizes the superpixel segmentation results of the Ours fixed and Ours joint methods on the HR-VS dataset. In general, superpixels generated by Ours joint are more compact and pay more attention to the disparity boundaries. Color boundaries that are not aligned with a disparity boundary, such as the water pit on the road in the second row of Figure 5, are often ignored by Ours joint. This phenomenon reflects the influence of disparity estimation on the superpixels during joint training.

References

[1] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In CVPR, pages 5410–5418, 2018.

Figure 1. Additional superpixel segmentation results on BSDS500.

Figure 2. Additional superpixel segmentation results on NYUv2.

Figure 3. Disparity prediction results on SceneFlow. For each method, we show both the predicted disparity map (top) and the error map (bottom). For the error map, the darker the color, the lower the end-point error (EPE).

Figure 4. Disparity prediction results on HR-VS. For each method, we show both the predicted disparity map (top) and the error map (bottom). For the error map, the darker the color, the lower the end-point error (EPE).

Figure 5. Comparison of superpixel segmentation results on HR-VS. Note that we do not enforce superpixel connectivity here.

Figure 6. Disparity estimation results on Middlebury-v3. For each method, we show both the predicted disparity map (top) and the error map (bottom). For the error map, the darker the color, the lower the error. All images are from the Middlebury-v3 leaderboard.