Segmenting Transparent Object in the Wild with Transformer
Enze Xie1, Wenjia Wang2, Wenhai Wang3, Peize Sun1,
Hang Xu4, Ding Liang2, Ping Luo1
1The University of Hong Kong 2Sensetime Research
3Nanjing University 4Huawei Noah’s Ark Lab
Abstract
This work presents a new fine-grained transpar-
ent object segmentation dataset, termed Trans10K-
v2, extending Trans10K-v1, the first large-scale
transparent object segmentation dataset. Unlike
Trans10K-v1, which has only two coarse categories,
our new dataset has several appealing benefits. (1)
It has 11 fine-grained categories of transparent ob-
jects, commonly occurring in the human domes-
tic environment, making it more practical for real-
world application. (2) Trans10K-v2 brings more
challenges for the current advanced segmentation
methods than its former version. Furthermore,
a novel transformer-based segmentation pipeline
termed Trans2Seg is proposed. Firstly, the trans-
former encoder of Trans2Seg provides the global
receptive field in contrast to CNN’s local recep-
tive field, which shows excellent advantages over
pure CNN architectures. Secondly, by formulat-
ing semantic segmentation as a problem of dic-
tionary look-up, we design a set of learnable pro-
totypes as the query of Trans2Seg’s transformer
decoder, where each prototype learns the statis-
tics of one category in the whole dataset. We
benchmark more than 20 recent semantic segmen-
tation methods, demonstrating that Trans2Seg sig-
nificantly outperforms all the CNN-based methods,
showing the proposed algorithm’s potential ability
to solve transparent object segmentation. Code is
available at github.com/xieenze/Trans2Seg.
1 Introduction
Modern robots, mainly mobile robots and mechanical manipulators, would benefit greatly from efficient perception of transparent objects in residential environments, since these environments vary drastically. The increasing use of glass walls and transparent doors in building interiors, and of glass cups and bottles in residential rooms, leads to incorrect detections by various range sensors. In robotic research, most systems perceive the environment by multi-sensor data fusion with sonars or lidars. These sensors are relatively reliable for detecting opaque objects but still suffer from scan mismatches caused by transparent objects.
Figure 1 – (a) Selected images and corresponding high-quality masks, illustrating the high diversity of our dataset and its high-quality annotations. (b) Performance comparison on Trans10K-v2 (mIoU vs. GFLOPs) between Trans2Seg and other CNN-based semantic segmentation methods (TransLab, DeepLabv3+, DANet, PSPNet, OCNet, DenseASPP, FCN, DUNet, BiSeNet, RefineNet, HardNet, HRNet, FastSCNN, ContextNet, LEDNet, DFANet). All methods are trained on Trans10K-v2 for the same number of epochs and mIoU is the metric; a deeper color indicates a method with larger FLOPs. Our Trans2Seg significantly surpasses the other methods with lower FLOPs.
The unique optical behavior of transparent objects, including reflection, refraction, and light projection, may confuse these sensors. A reliable vision-based method, which is much cheaper and more robust than high-precision sensors, would therefore be valuable.
Although several transparent object datasets [Xu et al., 2015; Chen et al., 2018a; Mei et al., 2020] have been proposed, they share some obvious problems. (1) Limited scale: these datasets often contain fewer than 1K real-world images and fewer than 10 unique objects. (2) Poor diversity: their scenes are monotonous. (3) Few classes: all of these datasets have only two classes, background and transparent object, and their lack of fine-grained categories limits their practicality. Recently, [Xie et al., 2020] proposed a large-scale and highly diverse dataset termed Trans10K, which divides transparent objects into 'Things' and 'Stuff'. Although diverse, Trans10K still lacks fine-grained transparent categories.
In this paper, we propose a fine-grained transparent object segmentation dataset, termed Trans10K-v2, with more elaborately defined categories. The images are inherited from Trans10K-v1 [Xie et al., 2020]. We annotate the 10,428 images with 11 fine-grained categories: shelf, jar, freezer, window, glass door, eyeglass, cup, glass wall, glass bowl, water bottle, and storage box. In Trans10K-v1, transparent things are defined as objects to be grabbed by manipulators, while transparent stuff is relevant for robot navigation. Although these two coarse categories can partially help robots interact with transparent objects, the fine-grained classes in Trans10K-v2 provide much richer information. We analyze the functions of these objects and how robots interact with them in the appendix.
Based on this challenging dataset, we design Trans2Seg, which introduces the Transformer encoder-decoder architecture into the segmentation pipeline. First, the transformer encoder provides a global receptive field via self-attention. A large receptive field is essential for segmenting transparent objects because they often share similar textures and context with their surroundings. Second, the decoder stacks successive layers that let the query embeddings interact with the transformer encoder output. To improve robustness to transparent objects, we carefully design a set of learnable class prototype embeddings as the queries of the transformer decoder, with the feature map from the transformer encoder as the keys. Compared with the convolutional paradigm, where the class prototypes are the fixed parameters of the convolution kernel weights, our design provides a dynamic and context-aware implementation. As shown in Figure 1b, we train and evaluate 20 existing representative segmentation methods on Trans10K-v2 and find that simply applying previous methods to this task is far from sufficient. By successfully introducing the Transformer into this task, our Trans2Seg surpasses the best previous method, TransLab [Xie et al., 2020], by a large margin (72.1 vs. 69.0 mIoU).
In summary, our main contributions are three-fold:
• We propose the largest glass segmentation dataset (Trans10K-v2), with 11 fine-grained glass categories, diverse scenarios, and high-resolution images. All images are elaborately annotated with fine-shaped masks and function-oriented categories.
• We introduce a new transformer-based network for transparent object segmentation with a transformer encoder-decoder architecture. Our method provides a global receptive field and is more dynamic in mask prediction, which shows clear advantages.
• We evaluate more than 20 semantic segmentation methods on Trans10K-v2, and our Trans2Seg significantly outperforms them. Moreover, we show that this task is still largely unsolved, so more research is needed.
2 Related Work
Semantic Segmentation. In the deep learning era, convolutional neural networks (CNNs) have driven the development of semantic segmentation on various datasets, such as ADE20K, CityScapes and PASCAL VOC. One of the pioneering approaches, FCN [Long et al., 2015], casts semantic segmentation as an end-to-end fully convolutional classification network. To improve performance, especially around object boundaries, [Chen et al., 2017; Lin et al., 2016; Zheng et al., 2015] propose to use a structured prediction module, conditional random fields (CRFs) [Chen et al., 2014], to refine the network output. Dramatic improvements in performance and inference speed have been driven by aggregating features at multiple scales, for example, PSPNet [Zhao et al., 2017] and DeepLab [Chen et al., 2017; Chen et al., 2018b], and by propagating structured information across intermediate CNN representations [Gadde et al., 2016; Liu et al., 2017; Wang et al., 2018].
Transparent Object Datasets. [Xu et al., 2015] introduce the TransCut dataset, which contains only 49 images of 7 unique objects. To generate segmentation results, they optimize an energy function based on LF-linearity, which requires light-field cameras. [Chen et al., 2018a] propose TOM-Net, which contains 876 real images and 178K synthetic images generated by POV-Ray; however, only 4 unique objects are used to synthesize the training data. Recently, [Xie et al., 2020] introduced the first large-scale real-world transparent object segmentation dataset, termed Trans10K, with more than 10K images. However, it has only two categories, which limits its practical use. In this work, Trans10K-v2 inherits the data and annotates 11 fine-grained categories.
Transformer in Vision Tasks. The Transformer [Vaswani et al., 2017] has been successfully applied to both high-level and low-level vision [Han et al., 2020]. In ViT [Dosovitskiy et al., 2020], the Transformer is directly applied to sequences of image patches for image classification. In object detection [Carion et al., 2020; Zhu et al., 2020], DETR reasons about the relations of object queries and the global image context via the Transformer and outputs the final set of predictions in parallel, without non-maximum suppression (NMS) or anchor generation. SETR [Zheng et al., 2020] views semantic segmentation from a sequence-to-sequence perspective with Transformers. IPT [Chen et al., 2020] applies the Transformer to low-level vision tasks such as denoising, super-resolution and deraining. In video processing, the Transformer has also received growing attention: VisTR [Wang et al., 2020] accomplishes video instance segmentation with a Transformer, and multiple-object tracking methods [Sun et al., 2020; Meinhardt et al., 2021] employ Transformers to decode object queries and feature queries of the previous frame into bounding boxes of the current frame, which are then merged by the Hungarian algorithm or NMS.
3 Trans10K-v2 Dataset
Dataset Introduction. Our Trans10K-v2 dataset is based on the Trans10K dataset [Xie et al., 2020]. Following Trans10K, we use 5,000, 1,000 and 4,428 images for training, validation and testing, respectively. The images exhibit rich variation in occlusion, spatial scale and perspective distortion. We further annotate the images with more fine-grained categories according to the functional usage of the different objects.
Figure 2 – Images in the Trans10K-v2 dataset are carefully annotated with high quality. The first row shows sample images and the second row shows the segmentation masks. The color scheme that encodes the object categories is listed on the right of the figure. Zoom in for the best view.
Trans10K-v2      shelf  door  wall  box  freezer  window  cup   bottle  jar   bowl  eyeglass
image num        280    1572  3059  603  90       501     3315  1472    997   340   410
CMCC             3.36   5.19  5.61  2.57 3.36     4.27    1.97  1.82    1.99  1.31  2.56
pixel ratio (%)  2.49   9.23  38.42 3.67 1.02     4.28    22.61 6.23    6.75  3.67  0.78
Table 1 – Statistics of Trans10K-v2. 'CMCC' denotes the mean connected components of each category, 'image num' the number of images containing that category, and 'pixel ratio' the fraction of all transparent-object pixels in Trans10K-v2 that belong to it.
The Trans10K-v2 dataset contains 10,428 images with two main categories and 11 fine-grained categories: (1) Transparent Things, containing cup, bottle, jar, bowl and eyeglass; (2) Transparent Stuff, containing window, shelf, box, freezer, glass wall and glass door. With its fine-grained categories and high diversity, Trans10K-v2 is very challenging and has promising potential for both computer vision and robotics research.
Annotation Principle. The transparent objects are manually labeled by expert annotators with a professional labeling tool. The annotators were asked to provide more than 100 points when tracing the boundary of each transparent object, which ensures high-quality mask outlines. The annotation format is largely the same as in semantic segmentation datasets such as ADE20K: the background is labeled 0 and the 11 categories are labeled 1 to 11. We also provide the scene in which each image was taken. The annotators were asked to strictly follow these principles: (I) Only highly transparent pixels are annotated as masks; semi-transparent and non-transparent pixels are ignored. Highly transparent objects are annotated regardless of whether they are made of glass, plastic or crystal. (II) Pixels occluded by opaque objects are removed from the masks. (III) The 11 fine-grained categories are carefully defined from a functional point of view. We first analyze how robots need to deal with transparent objects, e.g., avoiding, grasping or manipulating them, and then group objects similar in shape and function into one fine-grained category. The detailed categorization principle is given in the appendix.
Dataset Statistics. The CMCC, image number and pixel proportion of each category are listed in detail in Table 1. The image numbers in Table 1 sum to more than 10,428 because some images contain multiple categories of objects. CMCC denotes the mean connected components of each category; it is calculated by dividing the number of connected components of a certain category by the corresponding image number. The connected components are counted from the boundaries of the masks, and the value reflects the complexity of the transparent objects.
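For illustration, the CMCC of one category could be computed as in the following sketch, which uses OpenCV connected components. The function and variable names are ours and are not part of the released toolkit:

import cv2
import numpy as np

def mean_connected_components(label_maps, category_id):
    """CMCC of one category: average number of connected components over
    the images that contain it. `label_maps` are HxW arrays with values
    0 (background) to 11 (categories)."""
    total_components, num_images = 0, 0
    for label_map in label_maps:
        binary = (label_map == category_id).astype(np.uint8)
        if binary.sum() == 0:
            continue                      # image does not contain this category
        # cv2.connectedComponents counts the background as one extra label
        num_labels, _ = cv2.connectedComponents(binary, connectivity=8)
        total_components += num_labels - 1
        num_images += 1
    return total_components / max(num_images, 1)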
Evaluation Metrics. Results are reported with three metrics that are widely used in semantic segmentation to benchmark the performance of fine-grained transparent object segmentation: (1) pixel accuracy, the proportion of correctly classified pixels; (2) mean IoU, the intersection over union averaged over categories; and (3) category IoU, the intersection over union of each individual category.
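As an illustration only, these three metrics can be computed from a confusion matrix as sketched below. This is our own NumPy sketch, not the official evaluation code; we assume 12 classes, i.e., background plus the 11 categories:

import numpy as np

def confusion_matrix(pred, gt, num_classes=12):
    """Accumulate a confusion matrix from integer label maps."""
    valid = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def segmentation_metrics(conf):
    pixel_acc = np.diag(conf).sum() / conf.sum()                                # pixel accuracy
    category_iou = np.diag(conf) / (conf.sum(1) + conf.sum(0) - np.diag(conf))  # per-class IoU
    mean_iou = np.nanmean(category_iou)                                         # mean IoU
    return pixel_acc, mean_iou, category_iou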
4 Method
4.1 Overall Pipeline
The overall Trans2Seg architecture contains a CNN backbone, an encoder-decoder transformer, and a small convolutional head, as shown in Figure 3. For an input image of shape (H, W, 3):
• The CNN backbone generates an image feature map of shape (H/16, W/16, C).
• The encoder takes the sum of the flattened feature of shape (H/16 × W/16, C) and a positional embedding of shape (H/16 × W/16, C), and outputs an encoded feature of shape (H/16 × W/16, C).
• The decoder interacts the learned class prototypes of shape (N, C) with the encoded feature, and generates an attention map of shape (N, M, H/16 × W/16), where N is the number of categories and M is the number of heads in multi-head attention.
• The small convolutional head up-samples the attention map to (N, M, H/4, W/4), fuses it with the high-resolution feature map Res2, and outputs an attention map of shape (N, H/4, W/4).
The final segmentation is obtained by a pixel-wise argmax operation on the output attention map.
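To make these shapes concrete, the following is a minimal, shape-level PyTorch sketch of the forward pass. It is our own simplified illustration (random stand-in features, a single decoder attention "head", arbitrary layer counts), not the released Trans2Seg implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

N, C, M = 12, 256, 8                        # categories (incl. background), channels, heads
H, W = 512, 512

feat = torch.randn(1, C, H // 16, W // 16)  # stand-in for the CNN backbone output (1/16 res.)
res2 = torch.randn(1, 256, H // 4, W // 4)  # stand-in for the high-resolution Res2 feature

tokens = feat.flatten(2).transpose(1, 2)    # (1, H/16*W/16, C)
pos = torch.zeros_like(tokens)              # stand-in for the learned positional embedding

enc_layer = nn.TransformerEncoderLayer(d_model=C, nhead=M, batch_first=True)
Fe = nn.TransformerEncoder(enc_layer, num_layers=4)(tokens + pos)    # encoded feature

Ecls = nn.Embedding(N, C).weight            # learnable class prototypes, (N, C)
attn = torch.softmax(Ecls @ Fe[0].T, dim=-1)                  # (N, H/16*W/16) attention
attn = attn.view(N, 1, H // 16, W // 16)                      # single head for brevity

attn = F.interpolate(attn, size=(H // 4, W // 4), mode="bilinear", align_corners=False)
fused = torch.cat([res2.expand(N, -1, -1, -1), attn], dim=1)  # fuse with Res2
logits = nn.Conv2d(fused.shape[1], 1, kernel_size=1)(fused)   # (N, 1, H/4, W/4)
pred = logits.squeeze(1).argmax(dim=0)                        # pixel-wise argmax, (H/4, W/4)

In the full model the decoder has several layers and M attention heads, so the attention map carries an extra head dimension, as described in Section 4.3.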
Figure 3 – The whole pipeline of our hybrid CNN-Transformer architecture. First, the input image is fed to the CNN to extract features F. Second, for the transformer encoder, the features and position embedding are flattened and fed to the transformer for self-attention, which outputs the encoded feature Fe. Third, for the transformer decoder, we define a set of learnable class prototype embeddings Ecls as queries and Fe as keys, and compute the attention map between Ecls and Fe. Each class prototype embedding corresponds to one category of the final prediction. A small conv head fuses the attention map with the Res2 feature from the CNN backbone; details of the transformer decoder and the small conv head are given in Figure 4. Finally, the prediction is obtained by a pixel-wise argmax on the attention map. For example, in this figure, the segmentation masks of two categories (bottle and eyeglass) correspond to the two class prototypes with the same colors.
Pseudo code of the small conv head:
# Attn: attention map, [N, M, H/16, W/16]
# Res2: backbone feature, [1, C, H/4, W/4]
Res2 = Res2.repeat(N, 1, 1, 1)      # [N, C, H/4, W/4]
Attn = Up(Attn, (H/4, W/4))         # [N, M, H/4, W/4]
X = Concat([Res2, Attn])            # [N, M+C, H/4, W/4]
X = Conv_BN_ReLU(X)                 # [N, M+C, H/4, W/4]
X = Conv(X)                         # [N, 1, H/4, W/4]
Res = X.reshape(N, H/4, W/4)        # [N, H/4, W/4]
Figure 4 – Details of the transformer decoder and the small conv head. Input: the learnable category prototypes serve as queries, and the features from the transformer encoder serve as keys and values. The inputs are fed to the transformer decoder, which consists of several decoder layers. The attention map from the last decoder layer and the Res2 feature from the CNN backbone are combined and fed to a small conv head to obtain the final prediction. We also provide the pseudo code of the small conv head for better understanding.
4.2 Encoder
The Transformer encoder takes a sequence as input, so the spatial dimensions of the feature map of shape (H/16, W/16, C) are flattened into one dimension, giving a sequence of shape (H/16 × W/16, C). To compensate for the lost spatial information, a positional embedding [Gehring et al., 2017] is added to the flattened feature to provide information about the relative or absolute position of each element in the sequence. The positional embedding has the same shape (H/16 × W/16, C) as the flattened feature. The encoder is composed of stacked encoder layers, each of which consists of a multi-head self-attention module and a feed-forward network [Vaswani et al., 2017].
4.3 Decoder
The Transformer decoder takes as input a set of learnable class prototype embeddings, denoted by Ecls, as queries, and the encoded feature, denoted by Fe, as keys and values. It outputs an attention map that is passed through the small conv head to obtain the final segmentation result, as shown in Figure 4.
The class prototype embeddings are learned category prototypes, updated iteratively by a series of decoder layers through the multi-head attention mechanism. Denoting the iterative update rule by $\mathcal{K}$, the class prototype at decoder layer $s$ is

$E^{s}_{cls} = \mathop{\mathcal{K}}_{i=0,\dots,s-1} \mathrm{softmax}(E^{i}_{cls} F_e)\, F_e.$   (1)

In the final decoder layer, the attention map is extracted and passed to the small conv head:

$\text{attention map} = E^{s}_{cls} F_e.$   (2)
The pseudo code of the small conv head is shown in Figure 4. The attention map from the transformer decoder has shape (N, M, H/16 × W/16), where N is the number of categories and M is the number of heads in multi-head attention. It is up-sampled to (N, M, H/4, W/4), fused with the high-resolution feature map Res2 along the second dimension to (N, M + C, H/4, W/4), and finally transformed into an output attention map of shape (N, H/4, W/4). The final segmentation is obtained by a pixel-wise argmax over the output attention map.
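The following PyTorch sketch illustrates one way Eqs. (1)-(2) and the small conv head could be realized. It is our own interpretation under simplifying assumptions: a single attention head, a simple residual accumulation standing in for the update operator of Eq. (1), and module names of our choosing; it is not the official Trans2Seg code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeDecoder(nn.Module):
    """Sketch of the prototype update (Eq. 1), the final attention map (Eq. 2)
    and the small conv head. Single-head attention for brevity."""
    def __init__(self, num_classes=12, channels=256, num_layers=4, res2_channels=256):
        super().__init__()
        self.prototypes = nn.Embedding(num_classes, channels)   # E_cls
        self.num_layers = num_layers
        self.fuse = nn.Sequential(                               # small conv head
            nn.Conv2d(res2_channels + 1, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1))

    def forward(self, Fe, res2, hw):
        # Fe: encoded feature, (H/16*W/16, C); res2: (1, C2, H/4, W/4); hw = (H/16, W/16)
        h, w = hw
        E = self.prototypes.weight                               # (N, C)
        for _ in range(self.num_layers):                         # iterative update, cf. Eq. (1)
            attn = torch.softmax(E @ Fe.t(), dim=-1)             # (N, H/16*W/16)
            E = E + attn @ Fe                                    # refine prototypes with F_e
        attn = (E @ Fe.t()).view(-1, 1, h, w)                    # final attention map, Eq. (2)
        attn = F.interpolate(attn, scale_factor=4, mode="bilinear", align_corners=False)
        x = torch.cat([res2.expand(attn.size(0), -1, -1, -1), attn], dim=1)
        return self.fuse(x).squeeze(1)                           # logits, (N, H/4, W/4)

Following the pseudo code in Figure 4, the Res2 feature is repeated along the category dimension, concatenated with the up-sampled attention map, and reduced to one channel per category before the pixel-wise argmax.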
4.4 Discussion
The works most related to Trans2Seg are SETR and DETR [Zheng et al., 2020; Carion et al., 2020]. In this section we discuss the relations and differences in detail.
SETR. Trans2Seg and SETR are both segmentation pipelines. Their key difference lies in the design of the decoder. In SETR, the decoder is simply several convolutional layers, similar to most previous methods. In contrast, the decoder of Trans2Seg is also a transformer, which fully exploits the advantages of the attention mechanism for semantic segmentation.
DETR. Trans2Seg and DETR share similar components in the pipeline, including the CNN backbone and the Transformer encoder and decoder. The biggest difference is the definition of the queries. In DETR, the decoder's queries represent N learnable objects, because DETR is designed for object detection. In Trans2Seg, the queries represent N learnable class prototypes, where each query represents one category. This shows that a minor change in query design can generalize the Transformer architecture to diverse vision tasks, such as object detection and semantic segmentation.
5 Experiments
5.1 Implementation Details.
We implement Trans2Seg in PyTorch. ResNet-50 [He et al., 2016] with dilated convolution in the last stage is adopted as the CNN feature extractor. For optimization, we use the Adam optimizer with epsilon 1e-8 and weight decay 1e-4. The batch size is 8 per GPU. We set the learning rate to 1e-4 and decay it with the poly strategy [Yu et al., 2018] over 50 epochs. We use 8 V100 GPUs for all experiments. For all CNN-based methods, we randomly scale and crop the images to 480×480 during training and resize them to 513×513 during inference, following the common setting on PASCAL VOC [Everingham and Winn, 2011]. For Trans2Seg, the transformer architecture requires the learned position embedding to have the same shape during training and inference, so we directly resize the images to 512×512. Code has been released for the community.
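For reference, the poly decay can be expressed with a LambdaLR scheduler as in the sketch below. The decay power of 0.9 and the iteration count are assumptions for illustration, not values stated in the paper:

import torch

model = torch.nn.Linear(8, 8)                         # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, eps=1e-8, weight_decay=1e-4)

total_iters = 50 * 600                                # 50 epochs x iterations per epoch (placeholder)
poly = lambda it: (1.0 - it / total_iters) ** 0.9     # poly decay; power 0.9 is assumed
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=poly)

for it in range(total_iters):
    # forward pass, loss computation and loss.backward() would go here
    optimizer.step()
    scheduler.step()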
5.2 Ablation Studies.
We use FCN [Long et al., 2015] as our baseline. FCN is a fully convolutional network with a very simple design and is a classic semantic segmentation method. First, we demonstrate that the transformer encoder can build long-range attention between pixels, which yields a much larger receptive field than CNN filters. Second, we remove the CNN decoder in FCN and replace it with our transformer decoder, designing a set of learnable class prototypes as queries, and show that this design further improves accuracy. Third, we study our method with transformers at different scales.
Self-Attention of the Transformer Encoder. As shown in Table 2, the FCN baseline without the transformer encoder achieves 62.7% mIoU; adding the transformer encoder improves mIoU by 6.1%, reaching 68.8%. This demonstrates that the self-attention in the transformer encoder provides a global receptive field, which is better suited to transparent object segmentation than the CNN's local receptive field.
Category Prototypes of the Transformer Decoder. In Table 2, we verify the effectiveness of the learnable category prototypes in the transformer decoder.
id   Trans. Enc.   Trans. Dec.   CNN Dec.   mIoU
0    ×             ×             ✓          62.7
1    ✓             ×             ✓          68.8
2    ✓             ✓             ×          72.1
Table 2 – Effectiveness of the Transformer encoder and decoder. 'Trans.' indicates Transformer; 'Enc.' and 'Dec.' mean encoder and decoder.
Scale hyper-param. GFlops MParams mIoU
small e128-n1-m2 40.9 30.5 69.2
medium e256-n4-m3 49.0 56.2 72.1
large e768-n12-m4 221.8 327.5 70.3
Table 3 – Performance of the Transformer at different scales. 'e{a}-n{b}-m{c}' denotes a transformer with 'a' embedding dimensions, 'b' layers and an MLP ratio of 'c'.
With the traditional CNN decoder (Table 2, id 1), the mIoU is 68.8%, whereas with our transformer decoder it reaches 72.1%, an improvement of 3.3%. This strong performance benefits from the flexible representation in which learnable category prototypes serve as queries that locate the corresponding pixels in the feature map.
Scale of the Transformer. The scale of the transformer is mainly controlled by three hyper-parameters: (1) the embedding dimension of the features, (2) the number of attention layers, and (3) the MLP ratio in the feed-forward layer. We are interested in whether enlarging the model can continuously improve performance, so we evaluate three configurations, as shown in Table 3. As the transformer grows, the mIoU first increases and then decreases. We argue that without massive pre-training data (e.g., BERT [Devlin et al., 2019] used large-scale NLP corpora), a larger transformer is not necessarily better for our task.
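For clarity, the three configurations in Table 3 can be written as the following hypothetical config dictionaries (our own shorthand for the 'e{a}-n{b}-m{c}' notation):

# Transformer scales evaluated in Table 3: embedding dim, number of layers, MLP ratio
TRANSFORMER_SCALES = {
    "small":  {"embed_dim": 128, "num_layers": 1,  "mlp_ratio": 2},   # e128-n1-m2
    "medium": {"embed_dim": 256, "num_layers": 4,  "mlp_ratio": 3},   # e256-n4-m3
    "large":  {"embed_dim": 768, "num_layers": 12, "mlp_ratio": 4},   # e768-n12-m4
}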
5.3 Comparison to the state-of-the-art.
We select more than 20 semantic segmentation methods [Xie
et al., 2020; Chen et al., 2018c; Li et al., 2019a; Zhao et al.,
2017; Yuan and Wang, 2018; Yang et al., 2018; Long et al.,
2015; Ronneberger et al., 2015; Yu et al., 2018; Lin et al.,
2017; Chao et al., 2019; Wang et al., 2019a; Poudel et al.,
2019; Poudel et al., 2018; Wang et al., 2019b; Jin et al., 2019;
Zhao et al., 2018; Li et al., 2019a; Liu and Yin, 2019; Li et
al., 2019b; Fu et al., 2019; Mehta et al., 2019] to evaluate on our Trans10K-v2 dataset; the selection of methods largely follows the benchmark of TransLab [Xie et al., 2020]. For a fair comparison, we train all methods for 50 epochs. Table 4 reports the overall quantitative comparison on the test set. Our Trans2Seg achieves a state-of-the-art 72.15% mIoU and 94.14% pixel accuracy, significantly outperforming all the pure CNN-based methods; for example, it is 3.1% higher in mIoU than TransLab, the previous state-of-the-art method. We also find that our method tends to perform much better on small objects such as 'bottle' and 'eyeglass' (10.0% and 5.0% higher than the previous SOTA). We attribute this to the transformer's long-range attention, which benefits the segmentation of small transparent objects.
In Figure 5, we visualize the mask predictions of Trans2Seg and other CNN-based methods.
Method FLOPs ACC mIoU Category IoU
bg shelf Jar freezer window door eyeglass cup wall bowl bottle box
FPENet 0.76 70.31 10.14 74.97 0.01 0.00 0.02 2.11 2.83 0.00 16.84 24.81 0.00 0.04 0.00
ESPNetv2 0.83 73.03 12.27 78.98 0.00 0.00 0.00 0.00 6.17 0.00 30.65 37.03 0.00 0.00 0.00
ContextNet 0.87 86.75 46.69 89.86 23.22 34.88 32.34 44.24 42.25 50.36 65.23 60.00 43.88 53.81 20.17
FastSCNN 1.01 88.05 51.93 90.64 32.76 41.12 47.28 47.47 44.64 48.99 67.88 63.80 55.08 58.86 24.65
DFANet 1.02 85.15 42.54 88.49 26.65 27.84 28.94 46.27 39.47 33.06 58.87 59.45 43.22 44.87 13.37
ENet 2.09 71.67 8.50 79.74 0.00 0.00 0.00 0.00 0.00 0.00 0.00 22.25 0.00 0.00 0.00
HRNet w18 4.20 89.58 54.25 92.47 27.66 45.08 40.53 45.66 45.00 68.05 73.24 64.86 52.85 62.52 33.02
HardNet 4.42 90.19 56.19 92.87 34.62 47.50 42.40 49.78 49.19 62.33 72.93 68.32 58.14 65.33 30.90
DABNet 5.18 77.43 15.27 81.19 0.00 0.09 0.00 4.10 10.49 0.00 36.18 42.83 0.00 8.30 0.00
LEDNet 6.23 86.07 46.40 88.59 28.13 36.72 32.45 43.77 38.55 41.51 64.19 60.05 42.40 53.12 27.29
ICNet 10.64 78.23 23.39 83.29 2.96 4.91 9.33 19.24 15.35 24.11 44.54 41.49 7.58 27.47 3.80
BiSeNet 19.91 89.13 58.40 90.12 39.54 53.71 50.90 46.95 44.68 64.32 72.86 63.57 61.38 67.88 44.85
DenseASPP 36.20 90.86 63.01 91.39 42.41 60.93 64.75 48.97 51.40 65.72 75.64 67.93 67.03 70.26 49.64
DeepLabv3+ 37.98 92.75 68.87 93.82 51.29 64.65 65.71 55.26 57.19 77.06 81.89 72.64 70.81 77.44 58.63
FCN 42.23 91.65 62.75 93.62 38.84 56.05 58.76 46.91 50.74 82.56 78.71 68.78 57.87 73.66 46.54
OCNet 43.31 92.03 66.31 93.12 41.47 63.54 60.05 54.10 51.01 79.57 81.95 69.40 68.44 78.41 54.65
RefineNet 44.56 87.99 58.18 90.63 30.62 53.17 55.95 42.72 46.59 70.85 76.01 62.91 57.05 70.34 41.32
Translab 61.31 92.67 69.00 93.90 54.36 64.48 65.14 54.58 57.72 79.85 81.61 72.82 69.63 77.50 56.43
DUNet 123.69 90.67 59.01 93.07 34.20 50.95 54.96 43.19 45.05 79.80 76.07 65.29 54.33 68.57 42.64
UNet 124.55 81.90 29.23 86.34 8.76 15.18 19.02 27.13 24.73 17.26 53.40 47.36 11.97 37.79 1.77
DANet 198.00 92.70 68.81 93.69 47.69 66.05 70.18 53.01 56.15 77.73 82.89 72.24 72.18 77.87 56.06
PSPNet 187.03 92.47 68.23 93.62 50.33 64.24 70.19 51.51 55.27 79.27 81.93 71.95 68.91 77.13 54.43
Trans2Seg 49.03 94.14 72.15 95.35 53.43 67.82 64.20 59.64 60.56 88.52 86.67 75.99 73.98 82.43 57.17
Table 4 – Evaluation of state-of-the-art semantic segmentation methods, sorted by FLOPs. Our proposed Trans2Seg surpasses all other methods in pixel accuracy and mean IoU, as well as in most of the category IoUs (8 of 11).
Figure 5 – Visual comparison of Trans2Seg with other CNN-based semantic segmentation methods (columns: Image, GroundTruth, Trans2Seg, DeepLabv3+, FCN, ICNet, PSPNet). Our Trans2Seg clearly outperforms the others thanks to the transformer's global receptive field and attention mechanism, especially in the dashed regions. Zoom in for the best view. Refer to the supplementary material for more visual results.
Benefiting from the transformer's large receptive field and attention mechanism, our method distinguishes the background and transparent objects of different categories much better than other methods, especially when multiple objects of different categories occur in one image. Moreover, our method captures fine details, e.g., object boundaries and tiny transparent objects, much better, while other CNN-based methods fail to do so. More results are shown in the supplementary material.
6 Conclusion
In this paper, we present a new fine-grained transparent object segmentation dataset with 11 common categories, termed Trans10K-v2, whose data is based on the previous Trans10K. We also discuss the challenges and practical value of the proposed dataset. Moreover, we propose a transformer-based pipeline, termed Trans2Seg, to solve this challenging task. In Trans2Seg, the transformer encoder provides a global receptive field, which is essential for transparent object segmentation. In the transformer decoder, we model segmentation as dictionary look-up with a set of learnable queries, where each query represents one category. Finally, we evaluate more than 20 mainstream semantic segmentation methods and show that Trans2Seg clearly surpasses these CNN-based methods.
In the future, we are interested in exploring our Transformer encoder-decoder design on general segmentation tasks, such as Cityscapes and PASCAL VOC. We will also put more effort into solving the transparent object segmentation task.
7 Appendix
7.1 Detailed Dataset Information
More Visualized Demonstration of Trans10K-v2.
In this section we present more visualizations that demonstrate the diversity and quality of Trans10K-v2. Figure 6 and Figure 7 show more cropped objects to illustrate the high diversity of the objects, and Figure 8 shows more images and ground-truth masks. All images and transparent objects in Trans10K-v2 are selected from complex real-world scenarios with large variations in scale, viewpoint, contrast, occlusion, category and transparency. Figure 8 also shows why the dataset is challenging for current semantic segmentation methods.
Figure 6 – Cropped objects of 5 kinds of transparent things: cup,
jar, bottle, bowl, eyeglass. Zoom in for the best view.
Figure 7 – Cropped objects of 6 kinds of transparent stuff: wall,
freezer, box, door, shelf, window. Zoom in for the best view.
Scene information
We also provide each image with a scene label that indicates where the objects are located. As shown in the upper part of Table 5, we list detailed statistics of the scene distribution of each category. The distribution closely follows that of real residential environments; for example, cups, bowls and bottles are mostly placed on desks, while glass walls are often found in mega-malls or office buildings.
The scene distribution is visualized in Figure 9. Trans10K-v2 contains abundant scenarios, which we group into 13 categories: on the desk, mega-mall, store, bedroom, sitting room, kitchen, bathroom, windowsill, office, office building, outdoor, in the vehicle, and study-room. This information mainly demonstrates that our image distribution covers most common real-life scenarios. Each image is provided with a scene label.
How Robots Deal with Transparent Objects
Transparent objects are widespread in human residential environments, so human-aiding robots must find ways to deal with them. Earlier robotic research illustrates the substantial value of solving this problem, mainly for grasping and navigation, and primarily focuses on modifying algorithms to handle the optical signals reflected from transparent objects.
For manipulator grasping, previous work mainly focuses on grabbing water cups. [Klank et al., 2011] propose an approach to reconstruct an approximate surface of transparent cups and bottles from the internal sensory contradiction between two ToF (time-of-flight) images captured with an SR4k camera, so that a robot arm can grasp and manipulate the objects. [Spataro et al., 2015] set up a BCI-robot platform that helps patients suffering from limb muscle paralysis by grasping a glass cup for them. Starting from the observation that common glass materials absorb light at specific wavelengths, [Zhou et al., 2018] propose the Depth Likelihood Volume (DLV), which uses a Monte Carlo object localization algorithm to help the Michigan Progress Fetch robot localize and manipulate translucent objects.
For mobile robot navigation, some work also seeks to remove the side effects of transparent stuff in residential scenarios. [Foster et al., 2013] modify the standard occupancy grid algorithm so that an autonomous mapping robot can localize transparent objects from certain angles. [Kim and Chung, 2016] design a novel scan matching algorithm that compares all candidate distances where the laser range finder beam either penetrates or is reflected from glass walls. [Singh et al., 2018] use information fusion, combining a laser scanner and a sonar on an autonomous mapping mobile robot, to reduce the uncertainty caused by glass.
Based on previous work, we analyze how robots deal with transparent objects and group the interactions into 4 patterns: navigation, grasping, manipulation, and human-aiding. Navigation and grasping are the two fundamental interactions between robots and objects. Manipulation occurs with complex objects such as windows, doors, or bottles with lids. Human-aiding is the highest level of robot mission, and this kind of interaction always involves humans, especially disabled patients. From these 4 patterns, we can then analyze and categorize the transparent objects with respect to their functions.
Figure 8 – More images and corresponding high-quality masks in Trans10K-v2. Our dataset is highly diverse in scale, category, pose, contrast, occlusion, and transparency. Zoom in for the best view.
Categorization Principle
The 11 fine-grained categories are based on how robots need to deal with transparent objects, i.e., by avoiding, grasping or manipulating them. For example, goblets and regular cups are both open-mouthed and mainly used for drinking water; they need to be grasped carefully since they do not have lids, and they require the same interactive actions from robots, so they are both categorized as cup. We describe each category in detail:
(1) Shelf. Includes bookshelves, showcases, cabinets, etc. They mostly have sliding glass doors and are used to store goods. (2) Freezer. Includes vending machines, horizontal freezers, etc. They are electrical equipment used to store drinks and food. (3) Door. Includes automatic glass doors, standard glass doors, etc. The doors are located in mega-malls, bathrooms or office buildings. They are highly transparent and widespread; they can be used for navigation and for helping disabled people pass through. (4) Wall. Glass walls look like doors, but walls cannot be opened. This clue should be perceived during a mobile robot's mapping procedure. Glass walls are common in mega-malls and office buildings. (5) Window. Windows can be opened like glass doors but should not be passed through. (6) Box. Large boxes may not need to be grasped, but a manipulator robot needs to open a box and search for specific items. (7) Cup. We group all open-mouthed cups, such as goblets and regular cups, into this category. Cups are used for drinking water; manipulators need to grasp a cup carefully and be able to assist disabled people in drinking. (8) Bottle. Bottles are also used for drinking water, but bottles have lids, so they need careful manipulation. (9) Eyeglass. Eyeglasses need careful grasping and manipulation to help disabled people wear them. (10) Jar. This category contains jars, kettles and other transparent containers used to hold water, flavoring and food. (11) Bowl. Bowls are usually used to contain water or food; unlike jars, they do not have lids and need careful grasping. Sample objects of these categories can be found in Figures 6 and 7, where we show the most common types of the different categories by cropping the objects with their masks.
As shown in the lower part of Table 5, we analyze and list the interaction patterns of all 11 fine-grained categories. Navigation is the basic interaction pattern for stuff, and grasping is the basic interaction pattern for things. Objects with more complex interactions also need to be manipulated, e.g., a robot helping a person open a shelf or a window. Human-aiding is the highest level of interaction and always involves patients, who may need robots to help open a door or to feed them water with a cup or bottle.
7.2 More Visual Results Comparison.
In this section, we visualize in Figure 11 more test examples produced by our Trans2Seg and other CNN-based methods on the Trans10K-v2 dataset. From these results, we can easily observe that Trans2Seg outputs much higher-quality transparent object segmentation masks than the other methods. Such strong results mainly benefit from successfully introducing the Transformer into transparent object segmentation, which the other CNN-based methods lack.
7.3 Failure Case Analysis
As shown in Figure 10, our method also has some limitations. For instance, in Figure 10 (a), when transparent objects of different categories occlude each other, our method gets confused and fails to segment some of the items. In Figure 10 (b), when objects are extremely transparent, our method also gets confused and outputs wrong segmentation results; in such cases, even humans may fail to distinguish these transparent objects.
Figure 9 – The number of images and selected example images for the different scenes in Trans10K-v2. For better demonstration, the vertical axis (image number) is shown on a logarithmic scale.
Figure 10 – Failure case analysis: (a) occlusion and crowding; (b) extreme transparency. Each group shows the image, the ground truth (GT) and the Trans2Seg prediction. Our Trans2Seg fails to segment transparent objects in some complex scenarios.
Scene / Interaction      shelf freezer door wall window box (Stuff) | cup bottle eyeglass jar bowl (Things)
on the desk 3 0 0 2 4 227 1946 834 239 302 117
mega-mall 219 35 450 1762 76 128 169 36 75 94 14
store 13 36 5 19 3 75 444 111 1 175 57
bedroom 6 0 4 9 23 2 23 33 6 6 1
living room 10 0 7 14 19 52 310 167 25 139 67
kitchen 0 8 6 4 4 19 79 23 0 46 66
bathroom 0 0 33 31 8 4 5 3 4 0 2
windowsill 0 0 0 31 209 4 17 8 8 17 2
office room 15 7 25 43 12 84 298 235 51 158 2
office building 8 3 1021 1107 131 5 1 5 0 2 0
outdoor 0 0 13 20 2 0 0 2 0 0 0
in the vehicle 0 0 2 0 1 0 4 0 0 0 0
study-room 4 0 3 2 4 1 4 1 0 2 0
navigation      ✓ for all six stuff categories
grasping        ✓ for all five things categories
manipulation    ✓ for 8 of the 11 categories
human-aiding    ✓ for 5 of the 11 categories
Table 5 – Upper part: the number of images of each category in each scene. Lower part: the interaction patterns of each category.
Figure 11 – Visual comparison with state-of-the-art methods (columns: Image, GroundTruth, Trans2Seg, DeepLab, FCN, ICNet, PSPNet, UNet, HRNet, BiSeNet). Our Trans2Seg produces the best mask predictions among all methods. Zoom in for the best view.
References
[Carion et al., 2020]Nicolas Carion, Francisco Massa,
Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov,
and Sergey Zagoruyko. End-to-End object detection with
transformers. In ECCV, 2020.
[Chao et al., 2019]Ping Chao, Chao-Yang Kao, Yu-Shan
Ruan, Chien-Hsiang Huang, and Youn-Long Lin. Hard-
net: A low memory traffic network. In ICCV, 2019.
[Chen et al., 2014]Liang-Chieh Chen, George Papandreou,
Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Se-
mantic image segmentation with deep convolutional nets
and fully connected crfs. arXiv, 2014.
[Chen et al., 2017]Liang-Chieh Chen, George Papandreou,
Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille.
Deeplab: Semantic image segmentation with deep convo-
lutional nets, atrous convolution, and fully connected crfs.
TPAMI, 2017.
[Chen et al., 2018a]Guanying Chen, Kai Han, and Kwan-
Yee K. Wong. Tom-net: Learning transparent object mat-
ting from a single image. In CVPR, 2018.
[Chen et al., 2018b]Liang-Chieh Chen, Yukun Zhu, George
Papandreou, Florian Schroff, and Hartwig Adam.
Encoder-decoder with atrous separable convolution for se-
mantic image segmentation. In ECCV, 2018.
[Chen et al., 2018c]Liang-Chieh Chen, Yukun Zhu, George
Papandreou, Florian Schroff, and Hartwig Adam.
Encoder-decoder with atrous separable convolution for se-
mantic image segmentation. In ECCV, 2018.
[Chen et al., 2020]Hanting Chen, Yunhe Wang, Tianyu
Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma,
Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image
processing transformer. arXiv preprint arXiv:2012.00364,
2020.
[Devlin et al., 2019]Jacob Devlin, Ming-Wei Chang, Ken-
ton Lee, and Kristina Toutanova. BERT: pre-training of
deep bidirectional transformers for language understand-
ing. In Jill Burstein, Christy Doran, and Thamar Solorio,
editors, Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational
Linguistics: Human Language Technologies, NAACL-HLT
2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1
(Long and Short Papers), pages 4171–4186. Association
for Computational Linguistics, 2019.
[Dosovitskiy et al., 2020]Alexey Dosovitskiy, Lucas Beyer,
Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Min-
derer, Georg Heigold, Sylvain Gelly, et al. An image is
worth 16x16 words: Transformers for image recognition
at scale. arXiv preprint arXiv:2010.11929, 2020.
[Everingham and Winn, 2011]Mark Everingham and John
Winn. The pascal visual object classes challenge 2012
(voc2012) development kit. Pattern Analysis, Statistical
Modelling and Computational Learning, Tech. Rep, 2011.
[Foster et al., 2013]Paul Foster, Zhenghong Sun, Jong Jin
Park, and Benjamin Kuipers. Visagge: Visible angle grid
for glass environments. In 2013 IEEE International Con-
ference on Robotics and Automation, pages 2213–2220.
IEEE, 2013.
[Fu et al., 2019]Jun Fu, Jing Liu, Haijie Tian, Yong Li,
Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual at-
tention network for scene segmentation. In Proceedings
of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 3146–3154, 2019.
[Gadde et al., 2016]Raghudeep Gadde, Varun Jampani,
Martin Kiefel, Daniel Kappler, and Peter V Gehler. Super-
pixel convolutional networks using bilateral inceptions. In
ECCV, 2016.
[Gehring et al., 2017]Jonas Gehring, Michael Auli, David
Grangier, Denis Yarats, and Yann N Dauphin. Convo-
lutional sequence to sequence learning. arXiv preprint
arXiv:1705.03122, 2017.
[Han et al., 2020]Kai Han, Yunhe Wang, Hanting Chen,
Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang,
An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on vi-
sual transformer. arXiv preprint arXiv:2012.12556, 2020.
[He et al., 2016]Kaiming He, Xiangyu Zhang, Shaoqing
Ren, and Jian Sun. Deep residual learning for image recog-
nition. In CVPR, 2016.
[Jin et al., 2019]Qiangguo Jin, Zhaopeng Meng, Tuan D
Pham, Qi Chen, Leyi Wei, and Ran Su. Dunet:
A deformable network for retinal vessel segmentation.
Knowledge-Based Systems, 2019.
[Kim and Chung, 2016]Jiwoong Kim and Woojin Chung.
Localization of a mobile robot using a laser range finder
in a glass-walled environment. IEEE Transactions on In-
dustrial Electronics, 63(6):3616–3627, 2016.
[Klank et al., 2011]Ulrich Klank, Daniel Carton, and
Michael Beetz. Transparent object detection and recon-
struction on a mobile platform. In IEEE International
Conference on Robotics & Automation, 2011.
[Li et al., 2019a]Gen Li, Inyoung Yun, Jonghyun Kim, and
Joongkyu Kim. Dabnet: Depth-wise asymmetric bottle-
neck for real-time semantic segmentation. arXiv, 2019.
[Li et al., 2019b]Hanchao Li, Pengfei Xiong, Haoqiang
Fan, and Jian Sun. Dfanet: Deep feature aggregation for
real-time semantic segmentation. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 9522–9531, 2019.
[Lin et al., 2016]Guosheng Lin, Chunhua Shen, Anton Van
Den Hengel, and Ian Reid. Efficient piecewise training
of deep structured models for semantic segmentation. In
CVPR, 2016.
[Lin et al., 2017]Guosheng Lin, Anton Milan, Chunhua
Shen, and Ian Reid. Refinenet: Multi-path refinement
networks for high-resolution semantic segmentation. In
CVPR, 2017.
[Liu and Yin, 2019]Mengyu Liu and Hujun Yin. Feature
pyramid encoding network for real-time semantic segmen-
tation. arXiv, 2019.
[Liu et al., 2017]Sifei Liu, Shalini De Mello, Jinwei Gu,
Guangyu Zhong, Ming-Hsuan Yang, and Jan Kautz.
Learning affinity via spatial propagation networks. In
NIPS, 2017.
[Long et al., 2015]Jonathan Long, Evan Shelhamer, and
Trevor Darrell. Fully convolutional networks for seman-
tic segmentation. In CVPR, 2015.
[Mehta et al., 2019]Sachin Mehta, Mohammad Rastegari,
Linda Shapiro, and Hannaneh Hajishirzi. Espnetv2: A
light-weight, power efficient, and general purpose convo-
lutional neural network. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pages
9190–9200, 2019.
[Mei et al., 2020]Haiyang Mei, Xin Yang, Yang Wang,
Yuanyuan Liu, Shengfeng He, Qiang Zhang, Xiaopeng
Wei, and Rynson W.H. Lau. Don’t hit me! glass detec-
tion in real-world scenes. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition
(CVPR), June 2020.
[Meinhardt et al., 2021]Tim Meinhardt, Alexander Kirillov,
Laura Leal-Taixe, and Christoph Feichtenhofer. Track-
former: Multi-object tracking with transformers. arXiv
preprint arXiv:2101.02702, 2021.
[Poudel et al., 2018]Rudra PK Poudel, Ujwal Bonde,
Stephan Liwicki, and Christopher Zach. Contextnet: Ex-
ploring context and detail for semantic segmentation in
real-time. arXiv, 2018.
[Poudel et al., 2019]Rudra PK Poudel, Stephan Liwicki,
and Roberto Cipolla. Fast-scnn: fast semantic segmen-
tation network. arXiv, 2019.
[Ronneberger et al., 2015]Olaf Ronneberger, Philipp Fis-
cher, and Thomas Brox. U-net: Convolutional networks
for biomedical image segmentation. In MICCAI, 2015.
[Singh et al., 2018]Ravinder Singh, Kuldeep Singh Nagla,
John Page, and John Page. Multi-data sensor fusion frame-
work to detect transparent object for the efficient mobile
robot mapping. International Journal of Intelligent Un-
manned Systems, pages 00–00, 2018.
[Spataro et al., 2015]R. Spataro, R. Sorbello, S. Tramonte,
G. Tumminello, M. Giardina, A. Chella, and V. La Bella.
Reaching and grasping a glass of water by locked-in als
patients through a bci-controlled humanoid robot. Fron-
tiers in Human Neuroscience, 357:e48–e49, 2015.
[Sun et al., 2020]Peize Sun, Yi Jiang, Rufeng Zhang,
Enze Xie, Jinkun Cao, Xinting Hu, Tao Kong, Ze-
huan Yuan, Changhu Wang, and Ping Luo. Transtrack:
Multiple-object tracking with transformer. arXiv preprint
arXiv:2012.15460, 2020.
[Vaswani et al., 2017]Ashish Vaswani, Noam Shazeer, Niki
Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you
need. In Advances in neural information processing sys-
tems, pages 5998–6008, 2017.
[Wang et al., 2018]Xiaolong Wang, Ross Girshick, Abhinav
Gupta, and Kaiming He. Non-local neural networks. In
CVPR, 2018.
[Wang et al., 2019a]Jingdong Wang, Ke Sun, Tianheng
Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu,
Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep
high-resolution representation learning for visual recogni-
tion. arXiv, 2019.
[Wang et al., 2019b]Yu Wang, Quan Zhou, Jia Liu, Jian
Xiong, Guangwei Gao, Xiaofu Wu, and Longin Jan Late-
cki. Lednet: A lightweight encoder-decoder network for
real-time semantic segmentation. In ICIP, 2019.
[Wang et al., 2020]Yuqing Wang, Zhaoliang Xu, Xinlong
Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and
Huaxia Xia. End-to-end video instance segmentation with
transformers. arXiv preprint arXiv:2011.14503, 2020.
[Xie et al., 2020]Enze Xie, Wenjia Wang, Wenhai Wang,
Mingyu Ding, Chunhua Shen, and Ping Luo. Seg-
menting transparent objects in the wild. arXiv preprint
arXiv:2003.13948, 2020.
[Xu et al., 2015]Yichao Xu, Hajime Nagahara, Atsushi Shi-
mada, and Rin-ichiro Taniguchi. Transcut: Transparent
object segmentation from a light-field image. In ICCV,
2015.
[Yang et al., 2018]Maoke Yang, Kun Yu, Chi Zhang, Zhi-
wei Li, and Kuiyuan Yang. Denseaspp for semantic seg-
mentation in street scenes. In CVPR, 2018.
[Yu et al., 2018]Changqian Yu, Jingbo Wang, Chao Peng,
Changxin Gao, Gang Yu, and Nong Sang. Bisenet: Bi-
lateral segmentation network for real-time semantic seg-
mentation. In ECCV, 2018.
[Yuan and Wang, 2018]Yuhui Yuan and Jingdong Wang.
Ocnet: Object context network for scene parsing. arXiv,
2018.
[Zhao et al., 2017]Hengshuang Zhao, Jianping Shi, Xiao-
juan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene
parsing network. In CVPR, 2017.
[Zhao et al., 2018]Hengshuang Zhao, Xiaojuan Qi, Xiaoy-
ong Shen, Jianping Shi, and Jiaya Jia. Icnet for real-
time semantic segmentation on high-resolution images. In
ECCV, 2018.
[Zheng et al., 2015]Shuai Zheng, Sadeep Jayasumana,
Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su,
Dalong Du, Chang Huang, and Philip HS Torr. Con-
ditional random fields as recurrent neural networks. In
ICCV, 2015.
[Zheng et al., 2020]Sixiao Zheng, Jiachen Lu, Hengshuang
Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei
Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al.
Rethinking semantic segmentation from a sequence-to-
sequence perspective with transformers. arXiv preprint
arXiv:2012.15840, 2020.
[Zhou et al., 2018]Zheming Zhou, Zhiqiang Sui, and
Odest Chadwicke Jenkins. Plenoptic monte carlo object
localization for robot grasping under layered translucency.
In 2018 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), pages 1–8. IEEE, 2018.
[Zhu et al., 2020]Xizhou Zhu, Weijie Su, Lewei Lu, Bin
Li, Xiaogang Wang, and Jifeng Dai. Deformable detr:
Deformable transformers for end-to-end object detection.
arXiv preprint arXiv:2010.04159, 2020.