MATNet: Motion-Attentive Transition Network for
Zero-Shot Video Object Segmentation
Tianfei Zhou, Jianwu Li, Shunzhou Wang, Ran Tao, Senior Member, IEEE
and Jianbing Shen, Senior Member, IEEE
Abstract—In this paper, we present a novel end-to-end learning
neural network, i.e., MATNet, for zero-shot video object segmen-
tation (ZVOS). Motivated by the human visual attention behavior,
MATNet leverages motion cues as a bottom-up signal to guide the
perception of object appearance. To achieve this, an asymmetric
attention block, named Motion-Attentive Transition (MAT), is
proposed within a two-stream encoder network to firstly identify
moving regions and then attend appearance learning to capture
the full extent of objects. Putting MATs in different convolutional
layers, our encoder becomes deeply interleaved, allowing for
close hierarchical interactions between object appearance and motion. Such a biologically-inspired design proves superior to conventional two-stream structures, which treat motion and
appearance independently in separate streams and often suffer
severe overfitting to object appearance. Moreover, we introduce a
bridge network to modulate multi-scale spatiotemporal features
into more compact, discriminative and scale-sensitive representa-
tions, which are subsequently fed into a boundary-aware decoder
network to produce accurate segmentation with crisp boundaries.
We perform extensive quantitative and qualitative experiments
on four challenging public benchmarks, i.e., DAVIS16, DAVIS17,
FBMS and YouTube-Objects. Results show that our method
achieves compelling performance against current state-of-the-
art ZVOS methods. To further demonstrate the generalization
ability of our spatiotemporal learning framework, we extend
MATNet to another relevant task: dynamic visual attention
prediction (DVAP). The experiments on two popular datasets (i.e.,
Hollywood-2 and UCF-Sports) further verify the superiority of
our model¹.
Index Terms—Video object segmentation, zero-shot, two-
stream, spatiotemporal representation, neural attention, dynamic
visual attention prediction.
I. INTRODUCTION
The task of automatically identifying primary object(s) from
videos has gained significant attention over the past decade,
owing to its academic value and practical significance in many
areas, such as robotics [1], video compression [2], human-
object interaction [3] and autonomous driving [4]. However,
due to the lack of human intervention, in addition to typical
challenging factors posed by video data (e.g., occlusions,
This work was supported in part by the Beijing Natural Science Foundation
(No. L191004) and the National Natural Science Foundation of China (No.
61271374). (Corresponding author: Jianwu Li)
T. Zhou and J. Shen are with Inception Institute of Artificial Intelligence,
Abu Dhabi, UAE. (Email: {ztfei.debug, shenjianbingcg}@gmail.com)
J. Li and S. Wang are with Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology,
Beijing, China. (Email: ljw@bit.edu.cn)
R. Tao is with School of Information and Electronics, Beijing Institute
of Technology, Beijing, China, and also with Beijing Key Laboratory of
Fractional Signals and Systems, Beijing, China.
¹Our code is available at https://github.com/tfzhou/MATNet
motion blur, object deformations, cluttered background), the
task suffers from great difficulties in accurately distinguishing
the most prominent objects throughout a video sequence.
Early non-learning methods are built upon handcrafted
features (e.g., motion boundary [5], saliency [6, 7, 8], point
trajectories [9, 10, 11, 12, 13]) and rely heavily on classic
heuristics in video segmentation (e.g., object proposal rank-
ing [14, 15], spatiotemporal coherency [5], long-term trajectory
clustering [9]). Although these methods can work in a purely
unsupervised way, they suffer the limited representability of
the handcrafted features. More recently, research has turned to-
wards the deep learning paradigm, with several studies [16, 17] casting this problem in a zero-shot setting. These approaches
follow the zero-shot learning paradigm [18] to learn from
large-scale video data and can generalize well to test videos
that never appear in the training set, without any human
involvement. This is different from one-shot video object seg-
mentation (OVOS)[19], which requires first-frame annotations
for model adaption to test data in the inference phase.
Even before the era of deep learning, object motion has
always been considered as one of the most important cues for
automatic video object segmentation. This is largely inspired
by the human vision system (HVS), which has remarkable
motion perception capabilities to quickly orient human at-
tention to moving objects in dynamic scenes [20, 21]. In
fact, it has been demonstrated that infants [22] and newly
sighted congenitally blind people [23] tend to over-segment
static objects, even if they are strongly contrasted against
their surroundings; however, they can easily group things
together once the objects start moving, following the Gestalt
principle of common fate [24, 25]. These abilities enable us
to easily discover never-before-seen moving objects, before
knowing their particular semantic names. Motion surely does
not work alone. Recent studies [20] have revealed that, in
HVS, motion-based perception appears early, while static
perception is acquired later, and possibly bootstrapped by
motion cues to focus more on processing the most salient
objects. In this work, we take these biological mechanisms into
account and design our model to reflect such human behaviors,
i.e., first orienting rough attention to the moving parts of
objects, and then transferring the attention to object appearance
which provides a generic objectness prior for capturing the
whole picture of objects. In this way, our model is able to
learn a more effective spatiotemporal object representation,
encouraging more robust video object segmentation.
By considering knowledge propagation from object motion
to appearance, valuable temporal context can be exploited to
Fig. 1: Pipeline of MATNet. The frame $I_a$ and flow $I_m$ are first input into the interleaved encoder to extract multi-scale spatiotemporal features $\{\hat{U}_i\}_{i=2}^{5}$. At each residual stage $i$, we break the original information flow in ResNet. Instead, a MAT module is proposed to create a new interleaved information flow by simultaneously considering motion $V_{m,i}$ and appearance $V_{a,i}$. $\hat{U}_i$ is further fed into the boundary-aware decoder via the bridge network to obtain boundary results $M_2^b$ to $M_5^b$ and segmentation results $M^s$.
alleviate possible ambiguities in object appearance (e.g., visual
similarity to the background), thus facilitating representation
learning. However, in the context of deep learning, current
segmentation models often overlook this potential. Most prior
works [26, 27, 28, 29] simply treat motion cues as being equal
to appearance and learn to predict segmentation masks from
motion or appearance independently. Some approaches [30,
31] utilize motion cues to enrich object representations, but
they rely on complex heuristics and only work at a single
scale, ignoring the critical hierarchical structure.
Motivated by these observations, we propose a Motion-
Attentive Transition Network (MATNet) for zero-shot video
object segmentation (ZVOS). Fig. 1 illustrates its pipeline,
which has an encoder-bridge-decoder structure. The core of
MATNet is a deeply interleaved two-stream encoder which
not only inherits the advantages of traditional two-stream
networks for multi-modal learning, but also progressively
transfers intermediate motion features to facilitate more robust
appearance learning. The transition is carried out by multiple
Motion-Attentive Transition (MAT) modules. Each MAT takes
as input the intermediate features of both the input image
and optical flow field at a convolutional stage, and produces
informative spatiotemporal features for the following stage.
For each MAT, an asymmetric attention mechanism is built
to first infer regions of interest based on optical flow, and
then transfer the inference to provide better selectivity for
appearance features. The deep interleaved structure captures
the intrinsic characteristics of the human vision system and
brings immediate improvement in segmentation accuracy.
Given the powerful spatiotemporal features from the en-
coder, we design a decoder network to infer pixel-accurate
object segmentation through a top-down refinement process.
The decoder progressively refines high-level semantic fea-
tures using spatially rich low-level features via a cascade
of Boundary-Aware Refinement (BAR) modules. Each BAR
is responsible for generating features with finer structures,
under the assistance of salient boundary detection. Beyond
traditional methods that connect the encoder and decoder via
skip connections, we introduce a lightweight attention module,
i.e., Scale-Sensitive Attention (SSA), to connect each pair
of encoder and decoder layers. SSA adaptively modulates
spatiotemporal convolution features before sending them to
the decoder. More specifically, SSA is a two-level attention
scheme in which the local level serves to highlight the most
informative features and suppress useless information, while
the global level helps to re-calibrate features for objects with
different scales.
MATNet can be easily instantiated with various backbones,
and optimized in an end-to-end manner. We perform exten-
sive experiments on four popular video object segmentation
datasets, i.e., DAVIS16 [32], DAVIS17 [33], FBMS [11] and
YouTube-Objects [34], in which the proposed method yields
consistent performance improvement over the state-of-the-arts.
Additionally, we showcase the advantages and generalizability
of our framework via the task of dynamic visual attention
prediction (DVAP). Our model is proven to generalize well to
the DVAP task and produce reliable dynamic-fixation predic-
tion results over two large-scale benchmarks, i.e., Hollywood-
2 [35] and UCF-Sports [35].
In summary, the main contributions of this paper are
three-fold: First, we propose a novel interleaved two-stream
network architecture to learn powerful spatiotemporal object
representations for ZVOS. This is achieved by an asymmetric
attention module, i.e., MAT, that accounts for object motion
and appearance interactions in a more comprehensive way.
Second, we introduce a boundary-aware decoder to obtain seg-
mentation with crisp object boundaries. The decoder learned
with a novel adapted cross-entropy loss produces accurate
boundaries in regions of primary objects. Third, based on
these designs, our MATNet consistently outperforms state-
of-the-art methods over several ZVOS benchmarks and also
shows superior performance in instance-level segmentation
and the DVAP task.
This paper builds upon our conference paper [36] and sig-
nificantly extends it in various aspects: First, to demonstrate
the effectiveness of our model, we extend it to an instance-
level segmentation setting, which is more challenging and
essential for practical cases in which multiple instances may
appear. Second, we examine our model for the DVAP task,
and it outperforms all specialized methods on two large-
scale benchmarks, demonstrating its generality. Third, we
also provide a more inclusive and insightful overview of the
recent work on video object segmentation, motion-aware video
analysis and dynamic visual attention prediction. Last but not
least, we report much more experimental results and conduct
more ablation studies (e.g., attribute-based analysis, impacts
of different optical flow methods) for thorough and in-depth
examinations of our model.
II. RELATED WORK
Our model is related to four lines of research, i.e., auto-
matic video object segmentation, motion-aware modeling in
video analysis, dynamic visual attention prediction and neural
attention. We will briefly discuss each of them.
A. Automatic Video Object Segmentation
A large number of methods have been proposed for auto-
matic (or unsupervised) video object segmentation, which aims to segment conspicuous and eye-catching objects without any
human intervention. Many non-deep learning methods are
based on hand-crafted features and rely on certain heuristics
(e.g., saliency, object proposal ranking, trajectory clustering).
For instance, [6, 7, 8] take visual saliency as prior knowledge
to guide object segmentation, while [7, 14, 15, 37] infer the
object regions from hundreds of object candidates [38]. Object
motion is also widely used as a reliable cue for identifying
objects. [5] detects motion boundaries to determine foreground
regions. [9, 10, 11, 12, 13] take advantage of long-term point
trajectories for motion segmentation, making them more robust
to occlusions. Please refer to [39] for a more comprehensive
review of these approaches.
In recent years, with the renaissance of neural networks
in computer vision, deep learning based solutions are now
dominant in this field. Many approaches[16, 26, 28, 31, 40, 41,
42, 43, 44, 45, 46, 47] solve the task with zero-shot solutions,
which require no additional annotations during inference and
are thus more flexible for automatic video analysis. For exam-
ple, [43] proposes a dynamic visual attention-driven model for
video object segmentation, and [17, 41] mine higher-order re-
lations between video frames, resulting in more comprehensive
understanding of video content and more accurate foreground
estimation. However, these approaches only rely on object
appearance, and can thus easily fail in cases where objects are
visually similar to the background. To cope with this, many
approaches discover the motion patterns of objects [26] as
complementary cues to object appearance. This is typically
achieved within two-stream network architectures [28, 48],
in which an RGB image and the corresponding flow field
are separately processed by two independent networks and
the results are fused to produce the final segmentation. Some
methods [42, 49] design complex heuristics to fuse motion
and appearance for better segmentation. However, a major
drawback of these approaches is that they fail to consider
the importance of deep interactions between appearance and
motion in learning rich spatiotemporal features. To address
this issue, we propose a deep interleaved two-stream encoder,
in which a motion transition module is leveraged for more
effective representation learning.
B. Motion-Aware Modeling in Video Analysis
Deep learning models have been widely used in various
video-related tasks, such as action recognition [50, 51, 52],
video salient object detection [40, 53, 54] and dynamic visual
attention prediction [55, 56, 57, 58]. The most significant
difference between static images and videos is that objects
in videos are moving, which is a key factor that draws human
attention. Therefore, how to involve object motion into the
design of neural networks has been a critical issue in deep
learning-based video analysis.
Many approaches [40, 59] learn temporal coherence in-
formation by simply feeding consecutive frames into fully
convolutional networks. These methods are computationally
efficient; however, since they do not employ explicit motion
information (e.g., optical flow), they are sensitive to cluttered
and distracting backgrounds. Some other models consider
recurrent neural networks to capture long-range spatiotemporal
features [53, 60, 61]. However, all these models ignore the
complementary roles of spatial and temporal information.
This issue has been well addressed by the famous two-
stream ConvNet architecture proposed in [50], which consists
of spatial and temporal networks to better capture the comple-
mentary information of object appearance and motion. It has
achieved great success on human action recognition in videos.
Along this line, [51] injects residual connections into the
two-stream architecture to allow spatiotemporal interactions
between two modalities, while [44, 52] further improve such
a spatiotemporal residual network with multiplicative gating
functions. These two-stream architectures have also shown
strong performance in video object processing tasks, like video
object segmentation [28, 29, 54] and dynamic video attention
prediction [55, 58]. Despite this, current two-stream networks
tend to fuse motion and appearance features with a simple
gating mechanism and are limited in their use of local context.
In this work, we reconsider the interactions between object
motion and appearance with an asymmetric attention mod-
ule, which utilizes motion-attentive features to enhance the appearance features in a hierarchical manner. The powerful
representation ability of our model is verified in both ZVOS
and DVAP tasks.
C. Dynamic Visual Attention Prediction
Dynamic visual attention prediction, or dynamic fixation
prediction, is a topic closely related to ZVOS. Rather than targeting object-level saliency prediction, DVAP aims to identify
observers’ fixations during dynamic scene viewing. The task is
useful for machines to understand human attentional behaviors
and has shown great potential in many practical applications
(e.g., object segmentation [43], video captioning [62]). Early
DVAP methods [63, 64, 65] largely relied on hand-crafted, biologically-inspired features (e.g., color, optical flow) and theories of visual attention from cognitive science (e.g., guided search [66], attention shift [67]). Recently, deep learning-based methods have become mainstream and generally yield better
performance. Representative works use two-stream networks
to account for multi-modal features [55, 58] or LSTMs for
sequence fixation prediction over consecutive frames [68].
Fig. 2: Computational graph of MAT. The circled symbols indicate matrix multiplication and concatenation operations, respectively.
Although MATNet is originally designed for the object-
aware segmentation task, we show that it also achieves re-
markable performance on the DVAP task (V-C). This can
be largely attributed to the proposed encoder network, which
can provide informative spatiotemporal features to capture the
most important parts of the visual stimuli.
D. Neural Attention
Neural attention mechanisms, which are derived from hu-
man perception, have been widely studied in deep neural
networks and yield significant improvements for various tasks,
e.g., neural machine translation [69], object recognition [70,
71], and visual question answering [54, 72], to name a few
representative ones. Neural attention simulates the human selective attention mechanism, allowing the networks to focus
on the most informative parts of the inputs.
Neural attention mechanisms have also been used in recent
ZVOS approaches [17, 41], which aim to mine consistent
object patterns among video frames. Our idea is fundamentally different from theirs. We propose an asymmetric attention
module (i.e., MAT) to mimic human attention behavior in
dynamic scenarios. It encourages more comprehensive in-
teractions between object motion and appearance, yielding
more powerful spatiotemporal features. Besides, we extend
MAT into a deeper version to conduct multi-step reasoning of
spatiotemporal attention, which can highlight more accurate
target regions, especially for complex scenarios. In addition,
MATs are incorporated into multiple convolutional layers,
leading to an entirely different network architecture, which
is expected to benefit various video analysis tasks.
III. PROPOSED METHOD
A. Network Overview
We propose an end-to-end deep neural network, i.e., MAT-
Net, for ZVOS, which leverages motion cues to effectively
bootstrap the perception of object appearance. More specifi-
cally, our approach is designed as a unified framework of three tightly coupled sub-networks: Interleaved Encoder Network, Bridge Network and Boundary-Aware Decoder Network. The pipeline
is illustrated in Fig. 1.
Fig. 3: Illustration of the effects of the MAT module. (a) and (b) are the input images and optical flow fields. (c) and (d) denote feature maps in $V_a$ and $\hat{U}_a$, respectively. As seen, the MAT module can effectively emphasize important object regions and suppress background responses, benefiting the segmentation.
1) Interleaved Encoder Network: The encoder resorts to a
two-stream structure to jointly capture the spatial and temporal
information, which has been proven effective in many related
video analysis tasks [50, 51, 52]. In contrast to previous works,
which treat the two streams equally, our encoder incorporates
a MAT module (III-B) into each network layer, which offers
a motion-to-appearance pathway for information exchange.
Such a design enables us to learn more powerful spatiotempo-
ral object representations. More technically, we take the first five convolutional blocks of ResNet-101 [73] as the backbone for each stream. Given an RGB frame $I_a \in \mathbb{R}^{w\times h\times 3}$ and its optical flow field $I_m \in \mathbb{R}^{w\times h\times 3}$, the encoder first extracts intermediate appearance and motion features separately at the $i$-th ($i \in \{2,3,4,5\}$) residual stage, denoted as $V_{a,i} \in \mathbb{R}^{W\times H\times C}$ and $V_{m,i} \in \mathbb{R}^{W\times H\times C}$, where $W$, $H$ and $C$ represent the spatial width, height and channel number of the feature tensors, respectively. The features are subsequently enhanced by a MAT module $\mathcal{F}_{\text{MAT}}$ as:
$$\hat{U}_{a,i},\ \hat{U}_{m,i} = \mathcal{F}_{\text{MAT}}(V_{a,i}, V_{m,i}), \qquad (1)$$
where $\hat{U}_{\cdot,i} \in \mathbb{R}^{W\times H\times C}$ represents the enriched features. For the $i$-th stage, the spatiotemporal object representation $\hat{U}_i$ is obtained as $\hat{U}_i = \text{Concat}(\hat{U}_{a,i}, \hat{U}_{m,i}) \in \mathbb{R}^{W\times H\times 2C}$, which is further fed into the downstream decoder via a bridge network.
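To make the interleaving concrete, the following is a minimal PyTorch sketch (not the released implementation) of one encoder stage wrapping Eq. (1). The class name EncoderStage, the way enriched features are forwarded to the next stage, and the module interfaces are illustrative assumptions.

```python
# Illustrative sketch only: one interleaved encoder stage wrapping Eq. (1).
# `app_block`/`mot_block` stand for the i-th ResNet stage of each stream and
# `mat` for a MAT module; how the outputs feed stage i+1 is an assumption.
import torch
import torch.nn as nn


class EncoderStage(nn.Module):
    def __init__(self, app_block: nn.Module, mot_block: nn.Module, mat: nn.Module):
        super().__init__()
        self.app_block = app_block   # appearance-stream residual stage
        self.mot_block = mot_block   # motion-stream residual stage
        self.mat = mat               # Motion-Attentive Transition module

    def forward(self, x_a, x_m):
        v_a = self.app_block(x_a)            # V_{a,i}
        v_m = self.mot_block(x_m)            # V_{m,i}
        u_a, u_m = self.mat(v_a, v_m)        # Eq. (1): enriched features
        u_i = torch.cat([u_a, u_m], dim=1)   # \hat{U}_i, sent to the bridge network
        return u_a, u_m, u_i                 # enriched features continue to the next stage
```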
2) Bridge Network: The bridge network is responsible for
selecting informative spatiotemporal features for the decoder.
It is built upon several SSA modules (III-C), each of which
takes advantage of $\hat{U}_i$ at the $i$-th stage, attending to it both locally and globally to produce the attentive feature $Z_i$, with a
unified attention module. The local attention adopts channel-
wise and spatial-wise attention mechanisms to highlight the
correct object regions and suppress possible noise existing in
the redundant features, while the global attention aims to re-
calibrate the features to account for objects of different sizes.
3) Boundary-Aware Decoder Network: The decoder network adopts a coarse-to-fine scheme to conduct segmentation inference. It consists of four BAR modules (III-D), i.e., $\mathcal{F}_{\text{BAR}_i}$, $i \in \{2,3,4,5\}$, each corresponding to the $i$-th residual block. From $\mathcal{F}_{\text{BAR}_5}$ to $\mathcal{F}_{\text{BAR}_2}$, the resolution of the feature maps gradually increases by compensating high-level coarse features with more low-level details. $\mathcal{F}_{\text{BAR}_2}$ produces the finest feature maps, whose resolution is $1/4$ of the input image size. They are sequentially processed by three additional layers, i.e., conv($3\times 3$, 1), upsampling and sigmoid, to obtain the final mask output $M^s \in \mathbb{R}^{w\times h}$.
In the following, we introduce the proposed modules (i.e., MAT, SSA, BAR) in detail. For simplicity, we omit the subscript $i$.
B. Motion-Attentive Transition Module
Each MAT module is comprised of two soft attention units
and one attention transition unit, as depicted in Fig. 2. The
soft attention units help to emphasize the most informative
regions in the appearance or motion feature maps, while
the transition unit transfers the attentive motion features to
facilitate spatiotemporal feature learning.
1) Soft Attention: This unit softly weights the input feature map $V_m$ (or $V_a$) at each spatial location. Taking $V_m$ as an example, this unit outputs a motion-attentive feature $U_m \in \mathbb{R}^{W\times H\times C}$ as follows:
$$\text{Softmax attention: } A_m = \text{softmax}(W_m(V_m)),$$
$$\text{Attention-enhanced feature: } U_m^c = A_m \odot V_m^c, \qquad (2)$$
where $W_m$ is a $1\times 1$ convolution that transforms $V_m$ into an importance map, which is normalized using a softmax operation to generate an attention map $A_m \in \mathbb{R}^{W\times H}$, where $\sum_{i=1}^{W\times H} A_m^i = 1$. Here, each value $A_m^i$ is the probability with which our model believes the corresponding location is important. $U_m^c$ and $V_m^c$ indicate the 2D feature slices of $U_m$ and $V_m$ at the $c$-th channel, respectively, and $\odot$ denotes the Hadamard product. Similarly, given $V_a$, we can obtain the appearance-attentive feature $U_a$ by Eq. (2).
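For illustration, a minimal PyTorch sketch of this unit is given below. It is an interpretation of Eq. (2) under a (B, C, H, W) tensor layout, not the released code; the same module would be applied separately to the motion and appearance features.

```python
# Sketch of the soft attention unit in Eq. (2): a 1x1 conv produces an
# importance map, softmax-normalized over all H*W locations, applied per channel.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)   # W_m (or W_a)

    def forward(self, v):                                    # v: (B, C, H, W)
        b, c, h, w = v.shape
        scores = self.conv(v).view(b, -1)                    # importance map, (B, H*W)
        attn = F.softmax(scores, dim=1).view(b, 1, h, w)     # A, sums to 1 per sample
        return attn * v                                      # U^c = A (Hadamard) V^c
```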
2) Attention Transition: To transfer the motion-attentive features $U_m$, we first seek the affinity between $U_a$ and $U_m$ in a non-local manner, using the following multi-modal bilinear model:
$$S = U_m^{\top} W U_a \in \mathbb{R}^{(WH)\times(WH)}, \qquad (3)$$
where $W \in \mathbb{R}^{C\times C}$ is a trainable weight matrix. The affinity matrix $S$ can effectively capture pairwise relationships between the two feature spaces. However, it also introduces a huge number of parameters, which increases the computational cost and creates the risk of over-fitting. To overcome this problem, $W$ is approximately factorized into two low-rank matrices $P \in \mathbb{R}^{C\times\frac{C}{d}}$ and $Q \in \mathbb{R}^{C\times\frac{C}{d}}$, where $d$ ($d > 1$) is a reduction ratio. Then, Eq. (3) can be rewritten as:
$$S = U_m^{\top} P Q^{\top} U_a = (P^{\top} U_m)^{\top}(Q^{\top} U_a). \qquad (4)$$
This operation is equal to applying channel-wise feature transformations to $U_m$ and $U_a$ before computing the similarity. Its advantages over Eq. (3) are three-fold: 1) it reduces the number of parameters to $2/d$ of the original; 2) it requires much fewer multiplication operations: Eq. (3) needs $WHC^2 + W^2H^2C$ multiplications, while Eq. (4) only requires $(2WHC^2 + W^2H^2C)/d$; 3) it helps to generate a compact channel-wise feature representation for each modality.
Then, we normalize $S$ row-wise to derive an attention map $S_r$ conditioned on motion features and achieve enhanced appearance features $\hat{U}_a \in \mathbb{R}^{W\times H\times C}$:
$$\text{Motion-conditioned attention: } S_r = \text{softmax}_r(S),$$
$$\text{Attention-enhanced feature: } \hat{U}_a = U_a S_r, \qquad (5)$$
where $\text{softmax}_r$ indicates row-wise softmax.
Fig. 4: Illustration of hard example mining (HEM) for salient object boundary detection. During training, for each training image in (a), our method first estimates an edge map (c) using the off-the-shelf HED [75], and then determines hard pixels (d) to facilitate training. For each test image in (e) with ground truth (f), we see that the boundary results with HEM (h) are more accurate than those without HEM (g).
3) Deep-MAT: For complex videos, using one MAT layer
to predict the attention is sub-optimal due to the noise intro-
duced by distractors which are irrelevant to the target regions.
Therefore, we extend MAT into Deep-MAT for multi-step rea-
soning of spatiotemporal attention. Deep-MAT progressively
refines attention via multiple MAT layers and can pinpoint
more accurate target regions. In particular, our deep MAT
consists of LMAT layers cascaded in depth (denoted by
F(1)
MAT,F(2)
MAT,· · · ,F(L)
MAT). Let ˆ
U(l1)
aand ˆ
U(l1)
mbe the input
features for F(l)
MAT. It then produces outputs ˆ
U(l)
aand ˆ
U(l)
m,
which are further fed to F(l+1)
MAT in a recursive manner:
ˆ
U(l)
a,ˆ
U(l)
m=F(l)
MAT(ˆ
U(l1)
a,ˆ
U(l1)
m),(6)
where ˆ
U(l)
ais computed by Eq. (5) and ˆ
U(l)
m=U(l1)
mfollowing
Eq. (2). In addition, we have ˆ
U(0)
a=Vaand ˆ
U(0)
m=Vm.
It should be noted that stacking MAT layers directly leads to
an obvious drop in performance. Inspired by [74], we propose
to stack multiple MAT layers in a residual form as follows:
ˆ
U(l)
a=ˆ
U(l1)
a+U(l1)
aSr,
ˆ
U(l)
m=ˆ
U(l1)
m+U(l1)
m.
(7)
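The residual stacking of Eq. (7) reduces to a simple loop. The sketch below assumes MAT layers that return the two update terms of Eq. (7); it is an illustration, not the released implementation.

```python
# Sketch of Deep-MAT: L cascaded MAT layers stacked in residual form (Eq. 7).
import torch.nn as nn


class DeepMAT(nn.Module):
    def __init__(self, mat_layers: nn.ModuleList):
        super().__init__()
        self.layers = mat_layers                 # L cascaded MAT layers

    def forward(self, v_a, v_m):
        u_a, u_m = v_a, v_m                      # \hat{U}^(0)_a = V_a, \hat{U}^(0)_m = V_m
        for mat in self.layers:
            delta_a, delta_m = mat(u_a, u_m)     # U_a^(l-1) S_r and U_m^(l-1)
            u_a = u_a + delta_a                  # residual update (Eq. 7)
            u_m = u_m + delta_m
        return u_a, u_m
```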
4) Discussion: In Fig. 3, we show the visual effects of the
MAT module. We can observe that with MAT, the feature maps
in $V_a$ are well refined to produce more effective features in $\hat{U}_a$. The new features show desirable properties, with prominent objects highlighted and distractors suppressed, which is beneficial for accurate segmentation.
C. Scale-Sensitive Attention Module
The SSA module $\mathcal{F}_{\text{SSA}}$ is extended from a simplified CBAM $\mathcal{F}_{\text{CBAM}}$ [71] by adding a global attention $\mathcal{F}_g$. Given a feature map $U \in \mathbb{R}^{W\times H\times 2C}$, our SSA module refines it as follows:
$$Z = \mathcal{F}_{\text{SSA}}(U) = \mathcal{F}_g(\mathcal{F}_{\text{CBAM}}(U)) \in \mathbb{R}^{W\times H\times 2C}. \qquad (8)$$
The CBAM module $\mathcal{F}_{\text{CBAM}}$ consists of two sequential sub-modules, channel and spatial attention, which can be formulated as:
$$\text{Channel attention: } s = \mathcal{F}_s(U),\ e = \mathcal{F}_e(s),\ Z_c = e \star U,$$
$$\text{Spatial attention: } p = \mathcal{F}_p(Z_c),\ Z_{\text{CBAM}} = p \odot Z_c, \qquad (9)$$
where $\mathcal{F}_s$ is a squeeze operator that gathers the global spatial information of $U$ into a vector $s \in \mathbb{R}^{2C}$, while $\mathcal{F}_e$ is an excitation operator that captures channel-wise dependencies and outputs an attention vector $e \in \mathbb{R}^{2C}$. Following [70], $\mathcal{F}_s$ is implemented by applying average pooling on each feature channel, and $\mathcal{F}_e$ is formed by four consecutive operations: $\text{fc}(\frac{2C}{16}) \rightarrow \text{ReLU} \rightarrow \text{fc}(2C) \rightarrow \text{sigmoid}$. $Z_c \in \mathbb{R}^{W\times H\times 2C}$ denotes the channel-wise attentive features, and $\star$ indicates channel-wise multiplication. In the spatial attention, $\mathcal{F}_p$ exploits the inter-spatial relationship of $Z_c$ and produces a spatial attention map $p \in \mathbb{R}^{W\times H}$ by $\text{conv}(7\times 7, 1) \rightarrow \text{sigmoid}$. Then, we achieve the attention glimpse $Z_{\text{CBAM}} \in \mathbb{R}^{W\times H\times 2C}$ as the local-level feature.
The global attention $\mathcal{F}_g$ shares a similar spirit to the channel attention layer in Eq. (9), in that it has the same squeeze layer but modifies the excitation layer as $\text{fc}(\frac{2C}{16}) \rightarrow \text{fc}(1) \rightarrow \text{sigmoid}$ to obtain a scale selection factor $g \in \mathbb{R}^1$. It then obtains the scale-sensitive feature $Z$ as follows:
$$Z = (g \cdot Z_{\text{CBAM}}) + U. \qquad (10)$$
Note that we use identity mapping to avoid losing important information in regions with attention values close to 0.
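A compact PyTorch sketch of Eqs. (8)-(10) is shown below. The reduction factor 16 follows the text; squeezing $Z_{\text{CBAM}}$ for the global gate and the remaining layer details are assumptions rather than the exact implementation.

```python
# Sketch of the SSA module: channel + spatial attention (simplified CBAM),
# then a global scale-selection gate g and an identity shortcut (Eq. 10).
import torch
import torch.nn as nn


class SSA(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):   # channels = 2C
        super().__init__()
        self.excite = nn.Sequential(                           # F_e
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.spatial = nn.Sequential(                          # F_p: 7x7 conv -> sigmoid
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())
        self.scale_gate = nn.Sequential(                       # F_g excitation -> scalar g
            nn.Linear(channels, channels // reduction),
            nn.Linear(channels // reduction, 1), nn.Sigmoid())

    def forward(self, u):                                      # u: (B, 2C, H, W)
        s = u.mean(dim=(2, 3))                                 # squeeze F_s (global avg pool)
        z_c = self.excite(s)[:, :, None, None] * u             # channel attention
        z_cbam = self.spatial(z_c) * z_c                       # spatial attention
        g = self.scale_gate(z_cbam.mean(dim=(2, 3)))[:, :, None, None]
        return g * z_cbam + u                                  # Eq. (10): identity shortcut
```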
D. Boundary-Aware Refinement Module
In the decoder network, each BAR $\mathcal{F}_{\text{BAR}_i}$ accepts two inputs: $Z_i$ from the corresponding SSA module and $F_i$ from the previous BAR. To obtain a sharp mask output, the BAR first performs object boundary estimation using an extra boundary detection module $\mathcal{F}_{\text{BDRY}}$, which compels the network to emphasize finer object details. The predicted boundary map is then combined with the two inputs to produce finer features for the next BAR module. This can be formulated as:
$$M_i^b = \mathcal{F}_{\text{BDRY}}(F_i),$$
$$F_{i-1} = \mathcal{F}_{\text{BAR}_i}(Z_i, F_i, M_i^b), \qquad (11)$$
where $\mathcal{F}_{\text{BDRY}}$ consists of a stack of convolutional layers and a sigmoid layer (see Fig. 5), $M_i^b \in \mathbb{R}^{w\times h}$ indicates the boundary map and $F_{i-1}$ is the output feature map of $\text{BAR}_i$. The full computational graph of $\text{BAR}_i$ is shown in Fig. 5.
BAR benefits from two key factors. The first is that we apply Atrous Spatial Pyramid Pooling (ASPP) [76] on convolutional features to transform them into a multi-scale representation. This helps to enlarge the receptive field and obtain more spatial details for decoding. Technically, ASPP consists of multiple parallel dilated convolutional layers with different sampling rates. In this paper, four dilated convolutional layers are adopted, and the dilation rates are set as $\{2k\}_{k=1}^{4}$. In this way, each BAR module first extracts spatiotemporal features at four scales, which are then concatenated together to emphasize multi-scale features. During decoding, these features are further concatenated with the boundary prediction $M_i^b$, and then progressively processed by a residual block ('Res' in Fig. 5), an element-wise summation with $Z_i$, and another residual block to obtain more fine-grained features $F_{i-1}$, as shown in Fig. 5. Here, the residual block is implemented by two stacked $3\times 3$ convolutions with an identity shortcut [73].
Fig. 5: Computational graph of the $\text{BAR}_i$ module. Here, 'Res' is a residual block [73], while 'UP' denotes bilinear upsampling. The circled symbols indicate concatenation and element-wise addition operations, respectively.
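The following is a rough, self-contained sketch of one BAR module along the lines of Eq. (11) and Fig. 5. It assumes $Z_i$ and $F_i$ already share the same channel and spatial dimensions; the dilation rates (2, 4, 6, 8), the two-conv residual block and the $\mathcal{F}_{\text{BDRY}}$ head follow the text, while the remaining layer widths are illustrative.

```python
# Sketch of a BAR module: boundary head, ASPP, concat with boundary map,
# residual block, fusion with Z_i, second residual block, upsampling.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two stacked 3x3 convolutions with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.body(x))


class BAR(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # ASPP: four parallel dilated 3x3 convolutions (rates 2, 4, 6, 8).
        self.aspp = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in (2, 4, 6, 8)])
        self.boundary_head = nn.Sequential(        # F_BDRY: conv stack + sigmoid
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.reduce = nn.Conv2d(4 * channels + 1, channels, 1)   # fuse ASPP + boundary
        self.res1 = ResidualBlock(channels)
        self.res2 = ResidualBlock(channels)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, z_i, f_i):
        m_b = self.boundary_head(f_i)                            # boundary map M^b_i
        multi = torch.cat([branch(f_i) for branch in self.aspp], dim=1)
        feat = self.res1(self.reduce(torch.cat([multi, m_b], dim=1)))
        feat = self.res2(feat + z_i)                             # element-wise sum with Z_i
        return self.up(feat), m_b                                # F_{i-1} and boundary output
```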
The second benefit is that we introduce a heuristic method for automatically mining hard negative pixels to support the training of $\mathcal{F}_{\text{BDRY}}$. Specifically, for each training frame, we use the popular off-the-shelf HED model [75] to predict a boundary map $E \in [0,1]^{w\times h}$, wherein each value $E_k$ represents the probability of pixel $k$ being an edge pixel. Then, pixel $k$ is regarded as a hard negative pixel if it has a high edge probability (e.g., $E_k > 0.2$) and falls outside the dilated ground-truth region. If pixel $k$ is a hard pixel, its weight is set to $w_k = 1 + E_k$; otherwise, $w_k = 1$.
Then, $w_k$ is used to weight the following adaptive boundary loss so that misclassified hard pixels are penalized heavily:
$$\mathcal{L}_{\text{BDRY}}(M^b, G^b) = -\sum_k w_k \big((1 - G_k^b)\log(1 - M_k^b) + G_k^b\log(M_k^b)\big), \qquad (12)$$
where $M^b$ and $G^b$ are the boundary prediction and ground-truth, respectively.
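Eq. (12) can be sketched directly in PyTorch. The HED edge map and the dilated ground-truth mask are assumed to be precomputed per frame, and the 0.2 threshold follows the text; the summation over pixels mirrors the equation as reconstructed above.

```python
# Sketch of the HEM-weighted boundary loss (Eq. 12).
import torch


def boundary_loss(pred_b, gt_b, hed_edge, dilated_gt, eps=1e-6):
    """All inputs: tensors of shape (B, H, W) with values in [0, 1]."""
    # Hard negatives: strong HED edges that fall outside the dilated ground truth.
    hard = (hed_edge > 0.2) & (dilated_gt < 0.5)
    w = torch.where(hard, 1.0 + hed_edge, torch.ones_like(hed_edge))   # w_k
    ce = gt_b * torch.log(pred_b + eps) + (1.0 - gt_b) * torch.log(1.0 - pred_b + eps)
    return -(w * ce).sum()
```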
Fig. 4 offers an illustration of the above hard example
mining (HEM) scheme. Clearly, by explicitly discovering
hard negative pixels, the network can produce more accurate
boundary predictions with background pixels well suppressed
(see Fig. 4 (g) and (h)).
E. Detailed Network Architecture
Our whole model is end-to-end trainable, because all the
components in MATNet are parameterized by neural networks.
At each stream of the encoder, we use the first five convolu-
tion blocks of ResNet-101 [73] as our backbone for feature
extraction. The spatiotemporal features in the last convolution
stage are fed into a global convolutional layer (GC in Fig. 1)
to enlarge the valid receptive field [77], which is implemented by combining $1\times 7 \rightarrow 7\times 1$ and $7\times 1 \rightarrow 1\times 7$ convolutional layers, followed by a residual block.
Fig. 6: Qualitative results on four sequences. From top to bottom: dance-twirl from DAVIS16, dogs02 from FBMS, cat-0001 from YouTube-Objects and dogs-jump from DAVIS17.
1) Training Phase: Given an input frame $I_a \in \mathbb{R}^{473\times 473\times 3}$, we first compute its optical flow field $I_m \in \mathbb{R}^{473\times 473\times 3}$ using PWC-Net [78] due to its high efficiency and accuracy. Then, our MATNet predicts a segmentation mask $M^s \in [0,1]^{473\times 473}$ and four boundary masks $\{M_i^b \in [0,1]^{473\times 473}\}_{i=1}^{4}$ through the decoder network. Let $G^s \in \{0,1\}^{473\times 473}$ be the binary segmentation ground-truth, and $G^b \in \{0,1\}^{473\times 473}$ be the boundary ground-truth, which can be easily computed from $G^s$. The overall loss function is formulated as:
$$\mathcal{L}_{\text{ZVOS}} = \mathcal{L}_{\text{CE}}(M^s, G^s) + \frac{1}{N}\sum_{i=1}^{N=4}\mathcal{L}_{\text{BDRY}}(M_i^b, G^b), \qquad (13)$$
where $\mathcal{L}_{\text{CE}}$ indicates the classic cross-entropy loss, and $\mathcal{L}_{\text{BDRY}}$ is defined in Eq. (12).
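Putting the pieces together, Eq. (13) reduces to a weighted sum of the mask and boundary terms. The sketch below reuses the hypothetical boundary_loss function sketched earlier and is illustrative only.

```python
# Sketch of the overall ZVOS objective (Eq. 13).
import torch.nn.functional as F


def zvos_loss(mask_pred, mask_gt, boundary_preds, boundary_gt, hed_edge, dilated_gt):
    """boundary_preds: list of N=4 boundary maps produced by the decoder."""
    l_ce = F.binary_cross_entropy(mask_pred, mask_gt)            # L_CE(M^s, G^s)
    l_bdry = sum(boundary_loss(p, boundary_gt, hed_edge, dilated_gt)
                 for p in boundary_preds) / len(boundary_preds)  # averaged L_BDRY
    return l_ce + l_bdry
```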
2) Testing Phase: Once the network is trained, we apply it
to unseen videos. Given a test video, we resize all the frames
to 473×473, and feed each frame, along with its optical flow, to
the network for segmentation. We follow the common protocol
used in previous works [27, 30, 43] and employ CRF to obtain
the final binary segmentation results.
3) Runtime: Our model is implemented in PyTorch and
trained on a single Nvidia RTX 2080Ti GPU and an Intel(R)
Xeon Gold 5120 CPU. Testing is conducted on the same
machine. For each test frame of size 473×473, the forward
inference of our MATNet takes about 0.05s, while optical flow
estimation and CRF-based post-processing take about 0.2s and
0.5s, respectively.
IV. EXTENSION OF MATNET
In this section, we describe two extensions of our MATNet:
zero-shot video instance segmentation and dynamic visual
attention prediction. The former focuses on multi-object un-
supervised video segmentation [79], targeting more fine-
grained results in multi-object scenarios. The latter aims at
predicting where people look over dynamic scenes.
A. Zero-Shot Video Instance Segmentation
To adapt our MATNet into an instance-level segmentation
setting, we modify our model into a saliency-driven instance
selection method. More specifically, for a test video $\mathcal{V} = \{I_t\}_{t=1}^{T}$ with $T$ frames, our approach takes three stages to generate segmentation tracks. 1) Object proposal generation. For each frame $I_t$, we generate a collection of category-agnostic segment proposals $\mathcal{P}_t = \{P_t^i\}_i$ using a COCO-trained Mask R-CNN [83] for detecting generic objects. Our MATNet is also applied to generate an object-level segmentation mask $M_t^s$. Then, we compute a score $S_t^i$ for each proposal:
$$S_t^i = S_{\text{MATNet}}^i \cdot S_{\text{MRCNN}}^i, \qquad S_{\text{MATNet}}^i = \frac{\|P_t^i \cap M_t^s\|}{\|P_t^i\|}, \qquad (14)$$
where $S_{\text{MRCNN}}^i$ denotes the detection score of $P_t^i$ from Mask R-CNN, while $S_{\text{MATNet}}^i$ measures its saliency score. The proposals with small scores ($S_t^i < 0.03$) are discarded. 2) Short-Term Tracklet Generation. Given the remaining proposals, we further connect them temporally in a greedy manner. Firstly, each proposal $P_t^i$ is warped to the next frame using optical flow, and we search for its matched proposal in $\mathcal{P}_{t+1}$ by evaluating the IoU scores. If the maximum IoU score is above 0.1, the corresponding proposal is regarded as being matched with $P_t^i$. 3) Tracklet Merging by Re-Identification (ReID). We further merge short-term tracklets into a set of consistent segmentation tracks using object re-identification. The ReID embedding vector for each proposal is computed using a pretrained ReID network [84]. For each tracklet, its embedding is computed as the average embedding of all proposals belonging to it. We use the $L_2$ distance to measure the similarity between two tracklets and adopt the merging strategy in [85] to obtain the final segmentation tracks.
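The proposal filtering of Eq. (14) can be sketched with plain NumPy. The product of the two scores and the overlap-ratio form follow the reconstruction above and should be read as an interpretation; the 0.03 threshold follows the text.

```python
# Sketch of proposal scoring and filtering (Eq. 14).
import numpy as np


def score_proposals(proposals, det_scores, matnet_mask, thresh=0.03):
    """proposals: list of binary masks (H, W); det_scores: Mask R-CNN scores."""
    kept = []
    for mask, s_mrcnn in zip(proposals, det_scores):
        overlap = float(np.logical_and(mask, matnet_mask > 0.5).sum())
        s_matnet = overlap / max(float(mask.sum()), 1.0)   # ||P ∩ M^s|| / ||P||
        score = s_matnet * s_mrcnn                          # combined score S_t^i
        if score >= thresh:
            kept.append((mask, score))
    return kept
```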
B. Dynamic Visual Attention Prediction
Our MATNet can be flexibly adapted to the DVAP task with modifications in two aspects, as described below.
TABLE I: Quantitative comparison of ZVOS methods on DAVIS16 val. The best result for each metric is boldfaced (this also applies to the other tables). All the results are borrowed from the public leaderboard maintained by the DAVIS16 challenge (https://davischallenge.org/davis2016/soa_compare.html). See V-A for details.

Measure | SFL [29] | FSEG [28] | LVO [48] | ARP [80] | PDB [53] | LSMO [44] | MOT [81] | EPO [49] | AGS [43] | COSNet [41] | AGNN [17] | AnDiff [82] | MATNet
J Mean ↑ | 67.4 | 70.7 | 75.9 | 76.2 | 77.2 | 78.2 | 77.2 | 80.6 | 79.7 | 80.5 | 80.7 | 81.7 | 82.4
J Recall ↑ | 81.4 | 83.5 | 89.1 | 91.1 | 90.1 | 89.1 | 87.8 | 95.2 | 91.1 | 93.1 | 94.0 | 90.9 | 94.5
J Decay ↓ | 6.2 | 1.5 | 0.0 | 7.0 | 0.9 | 4.1 | 5.0 | 2.2 | 1.9 | 4.4 | 0.0 | 2.2 | 5.5
F Mean ↑ | 66.7 | 65.3 | 72.1 | 70.6 | 74.5 | 75.9 | 77.4 | 75.5 | 77.4 | 79.5 | 79.1 | 80.5 | 80.7
F Recall ↑ | 77.1 | 73.8 | 83.4 | 83.5 | 84.4 | 84.7 | 84.4 | 87.9 | 85.8 | 89.5 | 90.5 | 85.1 | 90.2
F Decay ↓ | 5.1 | 1.8 | 1.3 | 7.9 | -0.2 | 3.5 | 3.3 | 2.4 | 1.6 | 5.0 | 0.0 | 0.6 | 4.5
T Mean ↓ | 28.2 | 32.8 | 26.5 | 39.3 | 29.1 | 21.2 | 27.9 | 19.3 | 26.7 | 18.4 | 33.7 | 21.4 | 21.6
TABLE II: Quantitative results for each category on YouTube-Objects over Mean J. See V-A for details.

Category | LVO [48] | SFL [29] | FSEG [28] | PDB [53] | AGS [43] | COSNet [41] | AGNN [17] | MATNet
Airplane (6) | 86.2 | 65.6 | 81.7 | 78.0 | 87.7 | 81.1 | 81.1 | 72.9
Bird (6) | 81.0 | 65.4 | 63.8 | 80.0 | 76.7 | 75.7 | 75.9 | 77.5
Boat (15) | 68.5 | 59.9 | 72.3 | 58.9 | 72.2 | 71.3 | 70.7 | 66.9
Car (7) | 69.3 | 64.0 | 74.9 | 76.5 | 78.6 | 77.6 | 78.1 | 79.0
Cat (16) | 58.8 | 58.9 | 68.4 | 63.0 | 69.2 | 66.5 | 67.9 | 73.7
Cow (20) | 68.5 | 51.2 | 68.0 | 64.1 | 64.6 | 69.8 | 69.7 | 67.4
Dog (27) | 61.7 | 54.1 | 69.4 | 70.1 | 73.3 | 76.8 | 77.4 | 75.9
Horse (14) | 53.9 | 64.8 | 60.4 | 67.6 | 64.4 | 67.4 | 67.3 | 63.2
Motorbike (10) | 60.8 | 52.6 | 62.7 | 58.4 | 62.1 | 67.7 | 68.3 | 62.6
Train (5) | 66.3 | 34.0 | 62.2 | 35.3 | 48.2 | 46.8 | 47.8 | 51.0
Mean J ↑ | 67.5 | 57.1 | 68.4 | 65.5 | 69.7 | 70.5 | 70.8 | 69.0
TABLE III: Quantitative results on FBMS over Mean J (V-A).

Measure | MSTP [42] | FSEG [28] | IET [45] | OBN [31] | PDB [53] | COSNet [41] | MATNet
Mean J ↑ | 60.8 | 68.4 | 71.9 | 73.9 | 74.0 | 75.6 | 76.1
1) Network structure: Since boundary ground-truths are not available in this task, we discard the object boundary constraints so that Eq. (11) becomes $F_{i-1} = \mathcal{F}_{\text{BAR}_i}(Z_i, F_i)$. In this way, for $\text{BAR}_i$, more fine-grained features $F_{i-1}$ are produced by relying only on the features $F_i$ from $\text{BAR}_{i+1}$ as well as the corresponding convolutional feature $Z_i$. Besides, we also remove the unnecessary concatenation operator in Fig. 5. All other modules are kept unchanged. 2) Loss function: We consider the Kullback-Leibler (KL) divergence loss $\mathcal{L}_{\text{KL}}$ as our main learning objective. It is more task-oriented and has been proven effective in [86]. The overall loss function is:
$$\mathcal{L}_{\text{DVAP}} = \mathcal{L}_{\text{KL}}(M^v, G^v) + \lambda\,\mathcal{L}_{\text{CE}}(M^v, G^v), \qquad (15)$$
where $M^v$ and $G^v$ are the attention prediction and ground-truth, respectively, and $\mathcal{L}_{\text{KL}}(M^v, G^v) = \sum_i G_i^v \log\big(\frac{G_i^v}{M_i^v}\big)$. $\lambda = 0.1$ is a weight to balance the contributions of the two losses.
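Eq. (15) can be sketched as follows. Normalizing both maps to probability distributions before the KL term, and the clamping used for numerical stability, are assumptions about preprocessing rather than details taken from the released code.

```python
# Sketch of the DVAP objective (Eq. 15): KL divergence plus weighted cross-entropy.
import torch
import torch.nn.functional as F


def dvap_loss(pred, gt, lam=0.1, eps=1e-8):
    """pred, gt: attention maps of shape (B, H, W) with values in [0, 1]."""
    p = pred / (pred.sum(dim=(1, 2), keepdim=True) + eps)   # normalize prediction
    g = gt / (gt.sum(dim=(1, 2), keepdim=True) + eps)       # normalize ground truth
    l_kl = (g * torch.log(g / (p + eps) + eps)).sum(dim=(1, 2)).mean()
    l_ce = F.binary_cross_entropy(pred.clamp(eps, 1 - eps), gt)
    return l_kl + lam * l_ce
```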
V. EXPERIMENTS
In this section, we first compare MATNet with state-of-
the-art models on our main task of interest, i.e., ZVOS,
on both object-level (V-A) and instance-level (V-B) settings.
Then, we investigate the performance of our model on the
DVAP task (V-C). For each task, we separately introduce the
corresponding standalone datasets and experimental results.
Finally, to gain a deeper insight into our model, we conduct
detailed ablation studies in V-D.
Fig. 7: Attribute-based comparison on DAVIS16 val. We compare MATNet with three top-performing methods, i.e., AnDiff [82], COSNet [41] and AGS [43]. For each method, Mean J is computed over all sequences with the specified attributes.
Fig. 8: Attribute-based ablation study on DAVIS16 val. We compare the Mean J of different network variants under various attributes.
A. Main Task: Zero-Shot Video Object Segmentation
1) Datasets: We carry out comprehensive experiments on
three popular datasets:
DAVIS16 [32] is one of the most popular video object seg-
mentation datasets, which consists of 50 high-quality videos
in total (30 for train and 20 for val). Each frame contains
pixel-wise annotations for foreground objects. For quantitative
evaluation, we use three standard metrics suggested by [32],
namely region similarity J, boundary accuracy F, and time
stability T.
YouTube-Objects [34] is a large dataset of 126 web videos
with 10 semantic object categories and more than 20,000
frames. Following its protocol, we use the region similarity
J metric to measure the performance on the whole dataset
without further training.
FBMS [11] consists of 59 video sequences with ground-truth
annotations provided in a subset of the frames. Following the
standard protocol [48], we do not use any sequence for training
and only evaluate on the val set consisting of 30 sequences.
2) Implementation Details: The training data consist of two
parts: i) all training data from DAVIS16 [32], including 30
videos with about 2K frames; ii) a subset of 12K frames
selected from the training set of YouTube-VOS [87], which is
obtained by sampling images every ten frames in each video.
In total, we use 14K training samples, basically matching
the current top-performing methods, i.e., AGNN [17], COSNet [41] and AGS [43].
TABLE IV: Quantitative comparison of ZVOS methods on DAVIS17 val. All the results are borrowed from the public leaderboard of the DAVIS17 challenge (https://davischallenge.org/davis2017/soa_compare.html). See V-B for details.

Measure | RVOS [16] | PDB [53] | AGS [43] | MATNet
J&F Mean ↑ | 41.2 | 55.1 | 57.5 | 58.6
J Mean ↑ | 36.8 | 53.2 | 55.5 | 56.7
J Recall ↑ | 40.2 | 58.9 | 61.6 | 65.2
J Decay ↓ | 0.5 | 4.9 | 7.0 | -3.6
F Mean ↑ | 45.7 | 57.0 | 59.5 | 60.4
F Recall ↑ | 46.4 | 60.2 | 62.8 | 68.2
F Decay ↓ | 1.7 | 6.8 | 9.0 | 1.8
The entire network is trained using the
SGD optimizer with a learning rate of 1e-4 for the encoder and
the bridge network, and 1e-3 for the decoder. During training,
the batch size, momentum and weight decay are set to 2, 0.9, and 1e-5, respectively. The data are augmented online with horizontal flipping and rotations covering a range of (-10, 10) degrees.
3) Performance on DAVIS16 val: We compare our MAT-
Net with the top performing ZVOS methods in the public
leaderboard of DAVIS16. The detailed results are reported
in Table I. We can observe that our MATNet achieves the
best performance compared to other methods. Specifically,
it outperforms the second-best method (i.e., AnDiff [82]) by
+0.7% and +0.2% in terms of Mean J and Mean F, and +3.6% and +5.1% in terms of Recall J and Recall F.
In Table I, some of the deep learning-based models, e.g.,
FSEG [28], LVO [48], MOT [81], use motion cues to improve
segmentation. Our MATNet outperforms all of these methods
by a large margin. The reason lies in that these methods
learn motion and appearance features independently, without
considering the close interactions between them. In contrast,
our MATNet can learn more effective multi-modal object
representations with the interleaved encoder.
Fig. 7 shows the results of attribute-based study on
DAVIS16 [32] using 15 video attributes provided by the
dataset. Three top-performing ZVOS methods, i.e., An-
Diff [82], COSNet [41] and AGS [43], are selected for
comparison. Our model significantly outperforms them in
terms of many attributes (e.g., low resolution, fast motion, dynamic background, motion blur, heterogeneous object, and
appearance change). This demonstrates the robustness of our
model against various challenges present in videos.
4) Performance on YouTube-Objects: Table II reports the
detailed results on YouTube-Objects. Our model shows
promising performance in most categories. It lags behind some
methods in the Airplane and Boat categories. This is mainly
because sequences in these categories contain slowly-moving
objects, which are often visually similar to their surroundings.
These factors may result in inaccurate estimation of optical
flow, thereby hurting the performance.
5) Performance on FBMS: For completeness, we also eval-
uate our method on FBMS. As shown in Table III, MATNet
produces the best results with 76.1% in Mean J, which
outperforms the second-best result, i.e., PDB, by 2.1%.
6) Qualitative results: Fig. 6 depicts sample results for
representative sequences from these three datasets.
TABLE V: Quantitative DVAP results on the val sets of Hollywood-2 and UCF-Sports. See V-C for details.

Hollywood-2:
Methods | AUC-J ↑ | SIM ↑ | s-AUC ↑ | CC ↑ | NSS ↑
ACLNet [68] | 0.913 | 0.542 | 0.757 | 0.623 | 3.086
SalEMA [88] | 0.919 | 0.487 | 0.708 | 0.613 | 3.186
TASED [89] | 0.918 | 0.507 | 0.768 | 0.646 | 3.302
STRA [58] | 0.923 | 0.536 | 0.774 | 0.662 | 3.478
Ours | 0.915 | 0.539 | 0.797 | 0.674 | 3.486

UCF-Sports:
Methods | AUC-J ↑ | SIM ↑ | s-AUC ↑ | CC ↑ | NSS ↑
ACLNet [68] | 0.897 | 0.406 | 0.744 | 0.510 | 2.567
SalEMA [88] | 0.906 | 0.431 | 0.740 | 0.544 | 2.638
TASED [89] | 0.899 | 0.469 | 0.752 | 0.582 | 2.920
STRA [58] | 0.910 | 0.479 | 0.751 | 0.593 | 3.018
Ours | 0.901 | 0.503 | 0.783 | 0.625 | 3.291
The dance-twirl sequence from DAVIS16 contains many challenging fac-
tors, such as object deformation, motion blur and background
clutter. As seen, our method is robust to these challenges and
delineates the target with accurate contours. The effectiveness
is further proved in cat-0001 from YouTube-Objects, in which
the cat is visually similar to its surroundings and undergoes
large deformation. In addition, our model also works well in
dogs02, in which the target suffers large scale variations.
B. Additional Task: Zero-Shot Video Instance Segmentation
1) Datasets: DAVIS17 [33] extends DAVIS16 with another
70 sequences, leading to 120 videos in total. These videos are
split into 60 for train, 30 for val and 30 for test-dev.
Different from DAVIS16 , this dataset provides instance-level
annotations. Therefore, we use it to evaluate the performance
of our model in instance-level video object segmentation.
Following the standard evaluation setting, we measure the
performance in terms of region similarity J, contour accuracy
F, and their combination J&F.
2) Quantitative and Qualitative Results: Table IV reports
the performance of MATNet against three top-performing
models (i.e., RVOS [16], PDB [53] and AGS [43]). The results
clearly demonstrate that our model outperforms all of them by
a large margin. For instance, in terms of J&F, mean Jand
mean F, our model surpasses the second-best method (i.e.,
AGS), by 1.1%,1.2% and 0.9%, respectively.
Besides, some qualitative results on DAVIS17 are shown in
Fig. 6 (the last row), validating that our model yields high-
quality ZVOS results in the instance-level setting.
C. Additional Task: Dynamic Visual Attention Prediction
1) Datasets: Hollywood-2 [35] consists of 1,707 video
sequences (823 for train and 884 for test) collected from
69 Hollywood movies, covering 12 action categories (e.g.,
eating, kissing and running). The dataset focuses on more task-
driven scenes, e.g., movie scenes and human actions.
UCF-Sports [35] includes 150 videos covering 9 common
sports action categories, such as walking, diving and golfing.
Similar to Hollywood-2, the annotations in this dataset mainly
focus on action behaviors. The dataset is split into 103 videos
for train and 47 for test.
TABLE VI: Ablation study of MATNet on DAVIS16 val, measured by Mean J and Mean F. See V-D for details.

Network Variant | Mean J ↑ | ΔJ | Mean F ↑ | ΔF
MATNet w/o MAT | 79.5 | -2.9 | 77.3 | -3.4
MATNet w/o SSA | 80.7 | -1.7 | 79.7 | -1.0
MATNet w/o HEM | 81.4 | -1.0 | 78.4 | -2.3
MATNet w/ Res50 | 81.1 | -1.3 | 79.3 | -1.4
MATNet w/ Res101 | 82.4 | - | 80.7 | -
Fig. 9: Qualitative results of the ablation study. Panels: (a) Image, (b) Ground truth, (c) w/o MAT, (d) w/o SSA, (e) w/o HEM, (f) MATNet.
2) Implementation Details: For each dataset, we use the
train set to train our model. The network is trained with
the same setting as in V-A, except that the training images
are resized to 360 ×360 for fair comparison with previous
works [58, 68, 89]. λin Eq. 15 is empirically set to 0.1.
3) Metrics: Following previous work [68], we report the
performance of our model using five metrics, namely Nor-
malized Scanpath Saliency (NSS), Similarity (SIM), Linear
Correlation Coefficient (CC), Area Under the Curve by Judd
(AUC-J) and shuffled AUC (s-AUC). NSS and CC measure the
correlation between the prediction and ground-truth saliency
map. SIM computes the similarity between two histograms,
while AUC-J and s-AUC are variants of the well-known
AUC metric. For each metric, higher scores indicate better
performance.
4) Quantitative Results: We compare our model with four
DVAP models, i.e., ACLNet [68], SalEMA [88], STRA [58]
and TASED [89]. The results of these methods are directly
obtained from the authors. As shown in Table V, MATNet
generally outperforms all the competitors across most of the
metrics, in both the Hollywood-2 and UCF-Sports datasets.
This verifies the strong generality of our model.
D. Ablation Study
Table VI summarizes the ablation analysis of MATNet on
DAVIS16 val.
1) Efficacy of MAT: We first study the effects of the MAT
module by comparing our full model to one following the same
architecture without MAT, denoted as MATNet w/o MAT.
The encoder in this network is thus equivalent to a standard
two-stream model, where convolution features from the two
streams are concatenated at each residual stage for object
representation. As shown in Table VI, this model encounters a
huge performance degradation (-2.9% in Mean J and -3.4% in Mean F), which verifies the effectiveness of MAT.
Moreover, we also evaluate the performance of MATNet
with a different number of MAT modules in each deep residual
MAT layer. The results in Table VII show that the performance gradually improves as $L$ increases, reaching saturation at $L = 5$. Based on this analysis, we choose $L = 5$ as the default number of MAT modules in MATNet.
TABLE VII: Performance comparisons with different numbers of MAT blocks cascaded in each MAT layer on DAVIS16 val. See V-D for details.

Metric | L = 0 | L = 1 | L = 3 | L = 5 | L = 7
Mean J ↑ | 79.5 | 80.6 | 81.6 | 82.4 | 82.2
Mean F ↑ | 77.3 | 80.3 | 80.7 | 80.7 | 80.6
TABLE VIII: Impacts of different optical flow methods on DAVIS16 val. See V-D for details.

Flow Method | Mean J ↑ | Mean F ↑ | Mean T ↓
LiteFlowNet [90] | 80.9 | 79.3 | 23.2
SpyNet [91] | 78.4 | 76.8 | 26.6
PWC-Net [78] | 82.4 | 80.7 | 21.6
2) Efficacy of SSA: To measure the effectiveness of the
SSA module, we design another network, MATNet w/o SSA,
by replacing the SSA block with a simple skip layer. As
can be observed, its performance is 1.7% lower than our full model in terms of Mean J, and 1.0% lower in Mean F. The performance drop is mainly caused by the redundant
spatiotemporal features from the encoder. Our SSA module
aims to eliminate the redundancy by only highlighting the
features that are beneficial to segmentation.
3) Efficacy of HEM: We also study the influence of using
HEM during training. HEM is expected to facilitate the learn-
ing of more accurate object boundaries, which should further
boost the segmentation procedure. The results in Table VI
(see MATNet w/o HEM) indicate the importance of HEM.
By directly controlling the loss function in Eq. (12), HEM
helps to improve the contour accuracy by 2.3%.
4) Impact of Backbone: To verify that the high performance
of our network is not mainly due to the powerful backbone, we
replace ResNet-101 with ResNet-50 to build another network,
i.e., MATNet w/ Res50. We see that the performance degrades
slightly, but the model still outperforms previous methods (e.g.,
AGNN [17], COSNet [41], AGS [43]). This further confirms
the effectiveness of the proposed modules.
5) Impact of Optical Flow: Table VIII reports the results
of MATNet on DAVIS16 val with three open-sourced op-
tical flow computation methods, i.e., PWC-Net [78], Lite-
FlowNet [90] and SpyNet [91]. They rank #23, #34 and
#143 in the public MPI Sintel Flow Benchmark (http://sintel.is.tue.mpg.de/results), respectively. Generally, better optical
flow models lead to more accurate segmentation results, but
the performance does not change much, demonstrating the
robustness of our model against optical flow inputs.
6) Attribute Analysis: Fig. 8 illustrates the performance
comparison of different variants in the ablation study under
various video attributes. The performance is consistent with
that reported in Table VI. All three modules (i.e., MAT, SSA
and HEM) are critical for our model to improve performance.
7) Qualitative Comparison: Fig. 9 shows visual results of
the above ablation studies on two sequences. We see that all
of the network variants produce worse results compared with
our full model. It is worth noting that the MAT block has the
greatest visual influence on the performance.
VI. CONCLUSION
In this paper, we proposed a novel MATNet for ZVOS.
We introduced a new way to learn rich spatiotemporal object
representations with an interleaved encoder, which encour-
ages knowledge propagation from motion to appearance in a
hierarchical manner. The spatiotemporal features are further
processed by a bridge network to produce more compact
representations, which are subsequently fed into a boundary-
aware decoder to obtain accurate segmentation in a top-down
fashion. We compared the proposed model with other state-of-
the-art ZVOS methods over four large-scale benchmarks and
the experimental results demonstrated that it achieves favor-
able performance against other contenders. Benefiting from
the powerful interleaved encoder for representation learning
in videos, our model also showed compelling performance
in the DVAP task. In the future, we will further extend it
to other video analysis tasks, such as action recognition and
video classification.
REFERENCES
[1] N. Xu, B. Price, S. Cohen, J. Yang, and T. S. Huang, “Deep
interactive object selection,” in Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, 2016, pp.
373–381.
[2] H. Hadizadeh and I. V. Bajić, “Saliency-aware video compres-
sion,” IEEE Transactions on Image Processing, vol. 23, no. 1,
pp. 19–33, 2013.
[3] T. Zhou, W. Wang, S. Qi, H. Ling, and J. Shen, “Cascaded
human-object interaction recognition,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition,
2020, pp. 4263–4272.
[4] Z. Zhang, S. Fidler, and R. Urtasun, “Instance-level segmenta-
tion for autonomous driving with deep densely connected MRFs,”
in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2016, pp. 669–677.
[5] A. Papazoglou and V. Ferrari, “Fast object segmentation in
unconstrained video,” in Proceedings of the IEEE International
Conference on Computer Vision, 2013, pp. 1777–1784.
[6] W. Wang, J. Shen, and F. Porikli, “Saliency-aware geodesic
video object segmentation,” in Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, 2015, pp.
3395–3402.
[7] A. Faktor and M. Irani, “Video segmentation by non-local
consensus voting,” in Proceedings of the British Machine Vision
Conference, 2014, pp. 8–20.
[8] W. Wang, J. Shen, R. Yang, and F. Porikli, “Saliency-aware
video object segmentation,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, vol. 40, no. 1, pp. 20–33, 2017.
[9] T. Brox and J. Malik, “Object segmentation by long term
analysis of point trajectories,” in European Conference on
Computer Vision, 2010, pp. 282–295.
[10] P. Ochs and T. Brox, “Object segmentation in video: a hierarchi-
cal variational approach for turning point trajectories into dense
regions,” in Proceedings of the IEEE International Conference
on Computer Vision, 2011, pp. 1583–1590.
[11] P. Ochs, J. Malik, and T. Brox, “Segmentation of moving objects
by long term video analysis,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 36, no. 6, pp. 1187–
1200, 2013.
[12] M. Keuper, B. Andres, and T. Brox, “Motion trajectory seg-
mentation via minimum cost multicuts,” in Proceedings of the
IEEE International Conference on Computer Vision, 2015, pp.
3271–3279.
[13] K. Fragkiadaki, G. Zhang, and J. Shi, “Video segmentation by
tracing discontinuities in a trajectory embedding,” in Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2012, pp. 1846–1853.
[14] D. Zhang, O. Javed, and M. Shah, “Video object segmentation
through spatially accurate and temporally dense extraction of
primary object regions,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2013, pp. 628–
635.
[15] Y. J. Lee, J. Kim, and K. Grauman, “Key-segments for video
object segmentation,” in Proceedings of the IEEE International
Conference on Computer Vision, 2011, pp. 1995–2002.
[16] C. Ventura, M. Bellver, A. Girbau, A. Salvador, F. Marques, and
X. Giro-i Nieto, “RVOS: End-to-end recurrent network for video
object segmentation,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2019, pp. 5277–
5286.
[17] W. Wang, X. Lu, J. Shen, D. J. Crandall, and L. Shao,
“Zero-shot video object segmentation via attentive graph neural
networks,” in Proceedings of the IEEE International Conference
on Computer Vision, 2019, pp. 9236–9245.
[18] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell,
“Zero-shot learning with semantic output codes,” in Advances in
Neural Information Processing Systems, 2009, pp. 1410–1418.
[19] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cre-
mers, and L. Van Gool, “One-shot video object segmentation,”
in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2017, pp. 221–230.
[20] L. L. Cloutman, “Interaction between dorsal and ventral pro-
cessing streams: Where, when and how?” Brain and Language,
vol. 127, no. 2, pp. 251 – 263, 2013.
[21] P. Mital, T. J. Smith, S. Luke, and J. Henderson, “Do low-level
visual features have a causal influence on gaze during dynamic
scene viewing?” Journal of Vision, vol. 13, no. 9, pp. 144–144,
2013.
[22] E. S. Spelke, “Principles of object perception,” Cognitive sci-
ence, vol. 14, no. 1, pp. 29–56, 1990.
[23] Y. Ostrovsky, E. Meyers, S. Ganesh, U. Mathur, and P. Sinha,
“Visual parsing after recovery from blindness,” Psychological
Science, vol. 20, no. 12, pp. 1484–1491, 2009.
[24] S. E. Palmer, Vision science: Photons to phenomenology. MIT
press, 1999.
[25] M. Wertheimer, “Laws of organization in perceptual forms.”
1938.
[26] P. Tokmakov, K. Alahari, and C. Schmid, “Learning motion
patterns in videos,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2017, pp. 3386–
3394.
[27] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and
A. Sorkine-Hornung, “Learning video object segmentation from
static images,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2017, pp. 2663–
2672.
[28] S. D. Jain, B. Xiong, and K. Grauman, “Fusionseg: Learning to
combine motion and appearance for fully automatic segmenta-
tion of generic objects in videos,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2017,
pp. 2117–2126.
[29] J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang, “Segflow: Joint
learning for video object segmentation and optical flow,” in
Proceedings of the IEEE International Conference on Computer
Vision, 2017, pp. 686–695.
[30] H. Xiao, J. Feng, G. Lin, Y. Liu, and M. Zhang, “Monet:
Deep motion exploitation for video object segmentation,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 1140–1148.
[31] S. Li, B. Seybold, A. Vorobyov, X. Lei, and C.-C. Jay Kuo,
“Unsupervised video object segmentation with motion-based bi-
lateral networks,” in European Conference on Computer Vision,
2018, pp. 207–223.
[32] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool,
M. Gross, and A. Sorkine-Hornung, “A benchmark dataset
and evaluation methodology for video object segmentation,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2016, pp. 724–732.
[33] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-
Hornung, and L. Van Gool, “The 2017 DAVIS challenge on video
object segmentation,” arXiv preprint arXiv:1704.00675, 2017.
[34] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari,
“Learning object class detectors from weakly annotated video,”
in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2012, pp. 3282–3289.
[35] S. Mathe and C. Sminchisescu, “Actions in the eye: Dynamic
gaze datasets and learnt saliency models for visual recognition,”
IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, vol. 37, no. 7, pp. 1408–1424, 2014.
[36] T. Zhou, S. Wang, Y. Zhou, Y. Yao, J. Li, and L. Shao, “Motion-
attentive transition for zero-shot video object segmentation,”
in Proceedings of AAAI Conference on Artificial Intelligence,
2020.
[37] F. Perazzi, O. Wang, M. Gross, and A. Sorkine-Hornung,
“Fully connected object proposals for video segmentation,” in
Proceedings of the IEEE International Conference on Computer
Vision, 2015, pp. 3227–3234.
[38] I. Endres and D. Hoiem, “Category independent object propos-
als,” in European Conference on Computer Vision, 2010, pp.
575–588.
[39] R. Yao, G. Lin, S. Xia, J. Zhao, and Y. Zhou, “Video object
segmentation and tracking: A survey,” ACM Transactions on
Intelligent Systems and Technology, vol. 11, no. 4, pp. 1–47,
2020.
[40] W. Wang, J. Shen, and L. Shao, “Video salient object detection
via fully convolutional networks,” IEEE Transactions on Image
Processing, vol. 27, no. 1, pp. 38–49, 2018.
[41] X. Lu, W. Wang, C. Ma, J. Shen, L. Shao, and F. Porikli, “See
more, know more: Unsupervised video object segmentation with
co-attention siamese networks,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2019,
pp. 3623–3632.
[42] Y.-T. Hu, J.-B. Huang, and A. G. Schwing, “Unsupervised
video object segmentation using motion saliency-guided spatio-
temporal propagation,” in European Conference on Computer
Vision, 2018, pp. 786–802.
[43] W. Wang, H. Song, S. Zhao, J. Shen, S. Zhao, S. C. H. Hoi,
and H. Ling, “Learning unsupervised video object segmentation
through visual attention,” in Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2019, pp.
3064–3074.
[44] P. Tokmakov, C. Schmid, and K. Alahari, “Learning to segment
moving objects,” International Journal of Computer Vision, vol.
127, no. 3, pp. 282–301, 2019.
[45] S. Li, B. Seybold, A. Vorobyov, A. Fathi, Q. Huang, and C.-C.
Jay Kuo, “Instance embedding transfer to unsupervised video
object segmentation,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2018, pp. 6526–
6535.
[46] T. Zhuo, Z. Cheng, P. Zhang, Y. Wong, and M. Kankanhalli,
“Unsupervised online video object segmentation with motion
property understanding,” IEEE Transactions on Image Process-
ing, vol. 29, pp. 237–249, 2019.
[47] X. Lu, W. Wang, M. Danelljan, T. Zhou, J. Shen, and
L. Van Gool, “Video object segmentation with episodic graph
memory networks,” in European Conference on Computer Vi-
sion, 2020.
[48] P. Tokmakov, K. Alahari, and C. Schmid, “Learning video
object segmentation with visual memory,” in Proceedings of
the IEEE International Conference on Computer Vision, 2017,
pp. 4481–4490.
[49] M. Faisal, I. Akhter, M. Ali, and R. Hartley, “Exploiting
geometric constraints on dense trajectories for motion saliency,”
in Winter Conference on Applications of Computer Vision, 2019.
[50] K. Simonyan and A. Zisserman, “Two-stream convolutional
networks for action recognition in videos,” in Advances in
Neural Information Processing Systems, 2014, pp. 568–576.
[51] C. Feichtenhofer, A. Pinz, and R. Wildes, “Spatiotemporal
residual networks for video action recognition,” in Advances in
Neural Information Processing Systems, 2016, pp. 3468–3476.
[52] C. Feichtenhofer, A. Pinz, and R. P. Wildes, “Spatiotemporal
multiplier networks for video action recognition,” in Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2017, pp. 4768–4777.
[53] H. Song, W. Wang, S. Zhao, J. Shen, and K.-M. Lam, “Pyramid
dilated deeper convlstm for video salient object detection,” in
European Conference on Computer Vision, 2018, pp. 715–731.
[54] H. Li, G. Chen, G. Li, and Y. Yu, “Motion guided attention
for video salient object detection,” in Proceedings of the IEEE
International Conference on Computer Vision, 2019, pp. 7274–
7283.
[55] C. Bak, A. Kocak, E. Erdem, and A. Erdem, “Spatio-temporal
saliency networks for dynamic saliency prediction,” IEEE
Transactions on Multimedia, vol. 20, no. 7, pp. 1688–1698,
2017.
[56] L. Jiang, M. Xu, T. Liu, M. Qiao, and Z. Wang, “Deepvs: A deep
learning based video saliency prediction approach,” in European
Conference on Computer Vision, 2018, pp. 625–642.
[57] W. Wang, J. Shen, J. Xie, M.-M. Cheng, H. Ling, and A. Borji,
“Revisiting video saliency prediction in the deep learning era,”
IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, vol. PP, pp. 1–1, 2019.
[58] Q. Lai, W. Wang, H. Sun, and J. Shen, “Video saliency
prediction using spatiotemporal residual attentive networks,”
IEEE Transactions on Image Processing, vol. 29, pp. 1113–
1126, 2019.
[59] K. Xu, L. Wen, G. Li, L. Bo, and Q. Huang, “Spatiotem-
poral cnn for video object segmentation,” arXiv preprint
arXiv:1904.02363, 2019.
[60] N. Srivastava, E. Mansimov, and R. Salakhudinov, “Unsuper-
vised learning of video representations using lstms,” in Pro-
ceedings of the International Conference on Machine Learning,
2015, pp. 843–852.
[61] G. Li, Y. Xie, T. Wei, K. Wang, and L. Lin, “Flow guided
recurrent neural encoder for video salient object detection,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 3243–3252.
[62] Y. Yu, J. Choi, Y. Kim, K. Yoo, S.-H. Lee, and G. Kim,
“Supervising neural attention models for video captioning by
human gaze data,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2017, pp. 490–498.
[63] O. Le Meur, P. Le Callet, D. Barba, and D. Thoreau, “A
coherent computational approach to model bottom-up visual
attention,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 28, no. 5, pp. 802–817, 2006.
[64] N. Bruce and J. Tsotsos, “Saliency based on information
maximization,” in Advances in Neural Information Processing
Systems, 2006, pp. 155–162.
[65] J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,”
in Advances in Neural Information Processing Systems, 2007,
pp. 545–552.
[66] J. M. Wolfe, K. R. Cave, and S. L. Franzel, “Guided search: an
alternative to the feature integration model for visual search,”
Journal of Experimental Psychology: Human perception and
performance, vol. 15, no. 3, p. 419, 1989.
[67] C. Koch and S. Ullman, “Shifts in selective visual attention:
towards the underlying neural circuitry,” in Matters of intelli-
gence. Springer, 1987, pp. 115–141.
[68] W. Wang, J. Shen, F. Guo, M.-M. Cheng, and A. Borji,
“Revisiting video saliency: A large-scale benchmark and a new
model,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2018, pp. 4894–4903.
[69] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine trans-
lation by jointly learning to align and translate,” Proceedings
of the International Conference on Learning Representations,
2015.
[70] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,”
in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 7132–7141.
[71] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, “CBAM: Con-
volutional block attention module,” in European Conference on
Computer Vision, 2018, pp. 3–19.
[72] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked
attention networks for image question answering,” in Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2016, pp. 21–29.
[73] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[74] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang,
and X. Tang, “Residual attention network for image classifi-
cation,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2017, pp. 3156–3164.
[75] S. Xie and Z. Tu, “Holistically-nested edge detection,” in
Proceedings of the IEEE International Conference on Computer
Vision, 2015, pp. 1395–1403.
[76] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and
A. L. Yuille, “Deeplab: Semantic image segmentation with
deep convolutional nets, atrous convolution, and fully connected
crfs,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[77] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel
matters–improve semantic segmentation by global convolutional
network,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2017, pp. 4353–4361.
[78] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “PWC-Net: Cnns
for optical flow using pyramid, warping, and cost volume,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 8934–8943.
[79] S. Caelles, J. Pont-Tuset, F. Perazzi, A. Montes, K.-K.
Maninis, and L. Van Gool, “The 2019 DAVIS challenge on
VOS: Unsupervised multi-object segmentation,” arXiv preprint
arXiv:1905.00737, 2019.
[80] Y. J. Koh and C.-S. Kim, “Primary object segmentation in
videos based on region augmentation and reduction,” in Pro-
ceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2017, pp. 7417–7425.
[81] M. Siam, C. Jiang, S. Lu, L. Petrich, M. Gamal, M. Elhoseiny,
and M. Jagersand, “Video object segmentation using teacher-
student adaptation in a human robot interaction (hri) setting,”
in International Conference on Robotics and Automation, 2019,
pp. 50–56.
[82] Z. Yang, Q. Wang, L. Bertinetto, W. Hu, S. Bai, and P. H. Torr,
“Anchor diffusion for unsupervised video object segmentation,”
in Proceedings of the IEEE International Conference on Com-
puter Vision, 2019, pp. 931–940.
[83] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-
CNN,” in Proceedings of the IEEE International Conference
on Computer Vision, 2017, pp. 2961–2969.
[84] J. Luiten, P. Voigtlaender, and B. Leibe, “Premvos: Proposal-
generation, refinement and merging for video object segmenta-
tion,” in ACCV, 2018, pp. 565–580.
[85] J. Luiten, I. E. Zulfikar, and B. Leibe, “Unovost: Unsupervised
offline video object segmentation and tracking,” in Winter Con-
ference on Applications of Computer Vision, 2020, pp. 2000–
2009.
[86] X. Huang, C. Shen, X. Boix, and Q. Zhao, “Salicon: Reducing
the semantic gap in saliency prediction by adapting deep neural
networks,” in Proceedings of the IEEE International Conference
on Computer Vision, 2015, pp. 262–270.
[87] N. Xu, L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. Price,
S. Cohen, and T. Huang, “Youtube-vos: Sequence-to-sequence
video object segmentation,” in European Conference on Com-
puter Vision, 2018, pp. 585–601.
[88] P. Linardos, E. Mohedano, J. J. Nieto, N. E. O’Connor, X. Giro-
i Nieto, and K. McGuinness, “Simple vs complex temporal
recurrences for video saliency prediction,” in Proceedings of
the British Machine Vision Conference, 2019.
[89] K. Min and J. J. Corso, “Tased-net: Temporally-aggregating spa-
tial encoder-decoder network for video saliency detection,” in
Proceedings of the IEEE International Conference on Computer
Vision, 2019, pp. 2394–2403.
[90] T.-W. Hui, X. Tang, and C. C. Loy, “Liteflownet: A lightweight
convolutional neural network for optical flow estimation,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 8981–8989.
[91] A. Ranjan and M. J. Black, “Optical flow estimation using a
spatial pyramid network,” in Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2017, pp.
4161–4170.