MATNet: Motion-Attentive Transition Network for
Zero-Shot Video Object Segmentation
Tianfei Zhou, Jianwu Li, Shunzhou Wang, Ran Tao, Senior Member, IEEE
and Jianbing Shen, Senior Member, IEEE
Abstract—In this paper, we present a novel end-to-end learning neural network, i.e., MATNet, for zero-shot video object segmentation (ZVOS). Motivated by the human visual attention behavior, MATNet leverages motion cues as a bottom-up signal to guide the perception of object appearance. To achieve this, an asymmetric attention block, named Motion-Attentive Transition (MAT), is proposed within a two-stream encoder network to first identify moving regions and then attend appearance learning to capture the full extent of objects. By placing MATs in different convolutional layers, our encoder becomes deeply interleaved, allowing for close hierarchical interactions between object appearance and motion. Such a biologically-inspired design is proven to be superior to conventional two-stream structures, which treat motion and appearance independently in separate streams and often suffer severe overfitting to object appearance. Moreover, we introduce a bridge network to modulate multi-scale spatiotemporal features into more compact, discriminative and scale-sensitive representations, which are subsequently fed into a boundary-aware decoder network to produce accurate segmentation with crisp boundaries. We perform extensive quantitative and qualitative experiments on four challenging public benchmarks, i.e., DAVIS16, DAVIS17, FBMS and YouTube-Objects. Results show that our method achieves compelling performance against current state-of-the-art ZVOS methods. To further demonstrate the generalization ability of our spatiotemporal learning framework, we extend MATNet to another relevant task: dynamic visual attention prediction (DVAP). The experiments on two popular datasets (i.e., Hollywood-2 and UCF-Sports) further verify the superiority of our model¹.

Index Terms—Video object segmentation, zero-shot, two-stream, spatiotemporal representation, neural attention, dynamic visual attention prediction.
I. INTRODUCTION
The task of automatically identifying primary object(s) from
videos has gained significant attention over the past decade,
owing to its academic value and practical significance in many
areas, such as robotics [1], video compression [2], human-
object interaction [3] and autonomous driving [4]. However,
due to the lack of human intervention, in addition to typical
challenging factors posed by video data (e.g., occlusions,
This work was supported in part by the Beijing Natural Science Foundation (No. L191004) and the National Natural Science Foundation of China (No. 61271374). (Corresponding author: Jianwu Li)
T. Zhou and J. Shen are with the Inception Institute of Artificial Intelligence, Abu Dhabi, UAE. (Email: {ztfei.debug, shenjianbingcg}@gmail.com)
J. Li and S. Wang are with the Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing, China. (Email: ljw@bit.edu.cn)
R. Tao is with the School of Information and Electronics, Beijing Institute of Technology, Beijing, China, and also with the Beijing Key Laboratory of Fractional Signals and Systems, Beijing, China.
¹Our code is available at https://github.com/tfzhou/MATNet
motion blur, object deformations, cluttered background), the
task suffers from great difficulties in accurately distinguishing
the most prominent objects throughout a video sequence.
Early non-learning methods are built upon handcrafted features (e.g., motion boundary [5], saliency [6, 7, 8], point trajectories [9, 10, 11, 12, 13]) and rely heavily on classic heuristics in video segmentation (e.g., object proposal ranking [14, 15], spatiotemporal coherency [5], long-term trajectory clustering [9]). Although these methods can work in a purely unsupervised way, they suffer from the limited representability of the handcrafted features. More recently, research has turned towards the deep learning paradigm, with several studies [16, 17] casting this problem as a zero-shot solution. These approaches follow the zero-shot learning paradigm [18] to learn from large-scale video data and can generalize well to test videos that never appear in the training set, without any human involvement. This is different from one-shot video object segmentation (OVOS) [19], which requires first-frame annotations for model adaptation to test data in the inference phase.
Even before the era of deep learning, object motion has
always been considered as one of the most important cues for
automatic video object segmentation. This is largely inspired
by the human vision system (HVS), which has remarkable
motion perception capabilities to quickly orient human at-
tention to moving objects in dynamic scenes [20, 21]. In
fact, it has been demonstrated that infants [22] and newly
sighted congenitally blind people [23] tend to over-segment
static objects, even if they are strongly contrasted against
their surroundings; however, they can easily group things
together once the objects start moving, following the Gestalt
principle of common fate [24, 25]. These abilities enable us
to easily discover never-before-seen moving objects, before
knowing their particular semantic names. Motion surely does
not work alone. Recent studies [20] have revealed that, in
HVS, motion-based perception appears early, while static
perception is acquired later, and possibly bootstrapped by
motion cues to focus more on processing the most salient
objects. In this work, we take these biological mechanisms into
account and design our model to reflect such human behaviors,
i.e., first orienting rough attention to the moving parts of
objects, and then transferring the attention to object appearance
which provides a generic objectness prior for capturing the
whole picture of objects. In this way, our model is able to
learn a more effective spatiotemporal object representation,
encouraging more robust video object segmentation.
By considering knowledge propagation from object motion
to appearance, valuable temporal context can be exploited to
Fig. 1: Pipeline of MATNet. The frame $I_a$ and flow $I_m$ are first input into the interleaved encoder to extract multi-scale spatiotemporal features $\{\hat{U}_i\}_{i=2}^{5}$. At each residual stage $i$, we break the original information flow in ResNet. Instead, a MAT module is proposed to create a new interleaved information flow by simultaneously considering motion $V_{m,i}$ and appearance $V_{a,i}$. $\hat{U}_i$ is further fed into the boundary-aware decoder via the bridge network to obtain boundary results $M^b_2$ to $M^b_5$ and segmentation results $M^s$.
alleviate possible ambiguities in object appearance (e.g., visual
similarity to the background), thus facilitating representation
learning. However, in the context of deep learning, current
segmentation models often overlook this potential. Most prior
works [26, 27, 28, 29] simply treat motion cues as being equal
to appearance and learn to predict segmentation masks from
motion or appearance independently. Some approaches [30,
31] utilize motion cues to enrich object representations, but
they rely on complex heuristics and only work at a single
scale, ignoring the critical hierarchical structure.
Motivated by these observations, we propose a Motion-
Attentive Transition Network (MATNet) for zero-shot video
object segmentation (ZVOS). Fig. 1 illustrates its pipeline,
which has an encoder-bridge-decoder structure. The core of
MATNet is a deeply interleaved two-stream encoder which
not only inherits the advantages of traditional two-stream
networks for multi-modal learning, but also progressively
transfers intermediate motion features to facilitate more robust
appearance learning. The transition is carried out by multiple
Motion-Attentive Transition (MAT) modules. Each MAT takes
as input the intermediate features of both the input image
and optical flow field at a convolutional stage, and produces
informative spatiotemporal features for the following stage.
For each MAT, an asymmetric attention mechanism is built
to first infer regions of interest based on optical flow, and
then transfer the inference to provide better selectivity for
appearance features. The deep interleaved structure captures
the intrinsic characteristics of the human vision system and
brings immediate improvement in segmentation accuracy.
Given the powerful spatiotemporal features from the en-
coder, we design a decoder network to infer pixel-accurate
object segmentation through a top-down refinement process.
The decoder progressively refines high-level semantic fea-
tures using spatially rich low-level features via a cascade
of Boundary-Aware Refinement (BAR) modules. Each BAR
is responsible for generating features with finer structures,
under the assistance of salient boundary detection. Beyond
traditional methods that connect the encoder and decoder via
skip connections, we introduce a lightweight attention module,
i.e., Scale-Sensitive Attention (SSA), to connect each pair
of encoder and decoder layers. SSA adaptively modulates
spatiotemporal convolution features before sending them to
the decoder. More specifically, SSA is a two-level attention
scheme in which the local level serves to highlight the most
informative features and suppress useless information, while
the global level helps to re-calibrate features for objects with
different scales.
MATNet can be easily instantiated with various backbones,
and optimized in an end-to-end manner. We perform exten-
sive experiments on four popular video object segmentation
datasets, i.e., DAVIS16 [32], DAVIS17 [33], FBMS [11] and
YouTube-Objects [34], in which the proposed method yields
consistent performance improvement over the state-of-the-arts.
Additionally, we showcase the advantages and generalizability
of our framework via the task of dynamic visual attention
prediction (DVAP). Our model is proven to generalize well to
the DVAP task and produce reliable dynamic-fixation predic-
tion results over two large-scale benchmarks, i.e., Hollywood-
2 [35] and UCF-Sports [35].
In summary, the main contributions of this paper are
three-fold: First, we propose a novel interleaved two-stream
network architecture to learn powerful spatiotemporal object
representations for ZVOS. This is achieved by an asymmetric
attention module, i.e., MAT, that accounts for object motion
and appearance interactions in a more comprehensive way.
Second, we introduce a boundary-aware decoder to obtain seg-
mentation with crisp object boundaries. The decoder learned
with a novel adapted cross-entropy loss produces accurate
boundaries in regions of primary objects. Third, based on
these designs, our MATNet consistently outperforms state-
of-the-art methods over several ZVOS benchmarks and also
shows superior performance in instance-level segmentation
and the DVAP task.
This paper builds upon our conference paper [36] and sig-
nificantly extends it in various aspects: First, to demonstrate
the effectiveness of our model, we extend it to an instance-
level segmentation setting, which is more challenging and
essential for practical cases in which multiple instances may
appear. Second, we examine our model for the DVAP task,
and it outperforms all specialized methods on two large-
scale benchmarks, demonstrating its generality. Third, we
also provide a more inclusive and insightful overview of the
recent work on video object segmentation, motion-aware video
analysis and dynamic visual attention prediction. Last but not
least, we report much more experimental results and conduct
more ablation studies (e.g., attribute-based analysis, impacts
of different optical flow methods) for thorough and in-depth
examinations of our model.
II. RELATED WORK
Our model is related to four lines of research, i.e., auto-
matic video object segmentation, motion-aware modeling in
video analysis, dynamic visual attention prediction and neural
attention. We will briefly discuss each of them.
A. Automatic Video Object Segmentation
A large number of methods have been proposed for automatic (or unsupervised) video object segmentation, aiming to segment conspicuous and eye-catching objects without any human intervention. Many non-deep learning methods are
based on hand-crafted features and rely on certain heuristics
(e.g., saliency, object proposal ranking, trajectory clustering).
For instance, [6, 7, 8] take visual saliency as prior knowledge
to guide object segmentation, while [7, 14, 15, 37] infer the
object regions from hundreds of object candidates [38]. Object
motion is also widely used as a reliable cue for identifying
objects. [5] detects motion boundaries to determine foreground
regions. [9, 10, 11, 12, 13] take advantage of long-term point
trajectories for motion segmentation, making them more robust
to occlusions. Please refer to [39] for a more comprehensive
review of these approaches.
In recent years, with the renaissance of neural networks
in computer vision, deep learning based solutions are now
dominant in this field. Many approaches[16, 26, 28, 31, 40, 41,
42, 43, 44, 45, 46, 47] solve the task with zero-shot solutions,
which require no additional annotations during inference and
are thus more flexible for automatic video analysis. For exam-
ple, [43] proposes a dynamic visual attention-driven model for
video object segmentation, and [17, 41] mine higher-order re-
lations between video frames, resulting in more comprehensive
understanding of video content and more accurate foreground
estimation. However, these approaches only rely on object
appearance, and can thus easily fail in cases where objects are
visually similar to the background. To cope with this, many
approaches discover the motion patterns of objects [26] as
complementary cues to object appearance. This is typically
achieved within two-stream network architectures [28, 48],
in which an RGB image and the corresponding flow field
are separately processed by two independent networks and
the results are fused to produce the final segmentation. Some
methods [42, 49] design complex heuristics to fuse motion
and appearance for better segmentation. However, a major
drawback of these approaches is that they fail to consider
the importance of deep interactions between appearance and
motion in learning rich spatiotemporal features. To address
this issue, we propose a deep interleaved two-stream encoder,
in which a motion transition module is leveraged for more
effective representation learning.
B. Motion-Aware Modeling in Video Analysis
Deep learning models have been widely used in various
video-related tasks, such as action recognition [50, 51, 52],
video salient object detection [40, 53, 54] and dynamic visual
attention prediction [55, 56, 57, 58]. The most significant
difference between static images and videos is that objects
in videos are moving, which is a key factor that draws human
attention. Therefore, how to involve object motion into the
design of neural networks has been a critical issue in deep
learning-based video analysis.
Many approaches [40, 59] learn temporal coherence in-
formation by simply feeding consecutive frames into fully
convolutional networks. These methods are computationally
efficient; however, since they do not employ explicit motion
information (e.g., optical flow), they are sensitive to cluttered
and distracting backgrounds. Some other models consider
recurrent neural networks to capture long-range spatiotemporal
features [53, 60, 61]. However, all these models ignore the
complementary roles of spatial and temporal information.
This issue has been well addressed by the famous two-
stream ConvNet architecture proposed in [50], which consists
of spatial and temporal networks to better capture the comple-
mentary information of object appearance and motion. It has
achieved great success on human action recognition in videos.
Along this line, [51] injects residual connections into the
two-stream architecture to allow spatiotemporal interactions
between two modalities, while [44, 52] further improve such
a spatiotemporal residual network with multiplicative gating
functions. These two-stream architectures have also shown
strong performance in video object processing tasks, like video
object segmentation [28, 29, 54] and dynamic video attention
prediction [55, 58]. Despite this, current two-stream networks
tend to fuse motion and appearance features with a simple
gating mechanism and are limited in their use of local context.
In this work, we reconsider the interactions between object
motion and appearance with an asymmetric attention module, which utilizes motion-attentive features to promote the appearance features in a hierarchical manner. The powerful
representation ability of our model is verified in both ZVOS
and DVAP tasks.
C. Dynamic Visual Attention Prediction
Dynamic visual attention prediction, or dynamic fixation
prediction, is a close topic to ZVOS. Rather than targeting
at object-level saliency prediction, DVAP aims to identify
observers’ fixations during dynamic scene viewing. The task is
useful for machines to understand human attentional behaviors
and has shown great potential in many practical applications
(e.g., object segmentation [43], video captioning [62]). Early
DVAP methods [63, 64, 65] largely relied on hand-crafted,
biologically-inspired features (e.g., color, optical flow), the-
ories of visual attention in the cognitive area (e.g., guided
search [66], attention shift [67]). Recently, deep learning-
based methods become mainstream and generally yield better
performance. Representative works use two-stream networks
to account for multi-modal features [55, 58] or LSTMs for
sequence fixation prediction over consecutive frames [68].
Fig. 2: Computational graph of MAT. ⊗ and ⓒ indicate matrix multiplication and concatenation operations, respectively.
Although MATNet is originally designed for the object-
aware segmentation task, we show that it also achieves re-
markable performance on the DVAP task (V-C). This can
be largely attributed to the proposed encoder network, which
can provide informative spatiotemporal features to capture the
most important parts of the visual stimuli.
D. Neural Attention
Neural attention mechanisms, which are derived from hu-
man perception, have been widely studied in deep neural
networks and yield significant improvements for various tasks,
e.g., neural machine translation [69], object recognition [70,
71], and visual question answering [54, 72], to name a few
representative ones. Neural attention stimulates the human
selective attention mechanism, allowing the networks to focus
on the most informative parts of the inputs.
Neural attention mechanisms have also been used in recent
ZVOS approaches [17, 41], which aim to mine consistent
object patterns among video frames. Our idea is fundamentally
different from theirs. We propose an asymmetric attention
module (i.e., MAT) to mimic human attention behavior in
dynamic scenarios. It encourages more comprehensive in-
teractions between object motion and appearance, yielding
more powerful spatiotemporal features. Besides, we extend
MAT into a deeper version to conduct multi-step reasoning of
spatiotemporal attention, which can highlight more accurate
target regions, especially for complex scenarios. In addition,
MATs are incorporated into multiple convolutional layers,
leading to an entirely different network architecture, which
is expected to benefit various video analysis tasks.
III. PROPOSED METHOD
A. Network Overview
We propose an end-to-end deep neural network, i.e., MAT-
Net, for ZVOS, which leverages motion cues to effectively
bootstrap the perception of object appearance. More specifically, our approach is designed as a unified framework of three tightly coupled sub-networks: the Interleaved Encoder Network, the Bridge Network and the Boundary-Aware Decoder Network. The pipeline
is illustrated in Fig. 1.
Fig. 3: Illustration of the effects of the MAT module. (a) and (b) are the input images and optical flow fields. (c) and (d) denote feature maps in $V_a$ and $\hat{U}_a$, respectively. As seen, the MAT module can effectively emphasize important object regions and suppress background responses, benefiting the segmentation.
1) Interleaved Encoder Network: The encoder resorts to a
two-stream structure to jointly capture the spatial and temporal
information, which has been proven effective in many related
video analysis tasks [50, 51, 52]. In contrast to previous works,
which treat the two streams equally, our encoder incorporates
a MAT module (III-B) into each network layer, which offers
a motion-to-appearance pathway for information exchange.
Such a design enables us to learn more powerful spatiotempo-
ral object representations. More technically, we take the first
five convolutional blocks of ResNet-101 [73] as the backbone
for each stream. Given an RGB frame $I_a \in \mathbb{R}^{w \times h \times 3}$ and its optical flow field $I_m \in \mathbb{R}^{w \times h \times 3}$, the encoder first extracts intermediate appearance and motion features separately at the $i$-th ($i \in \{2,3,4,5\}$) residual stage, denoted as $V_{a,i} \in \mathbb{R}^{W \times H \times C}$ and $V_{m,i} \in \mathbb{R}^{W \times H \times C}$, where $W$, $H$ and $C$ represent the spatial width, height and channel number of the feature tensors, respectively. The features are subsequently enhanced by a MAT module $\mathcal{F}_{\mathrm{MAT}}$ as:

$$\hat{U}_{a,i}, \hat{U}_{m,i} = \mathcal{F}_{\mathrm{MAT}}(V_{a,i}, V_{m,i}), \quad (1)$$

where $\hat{U}_{\cdot,i} \in \mathbb{R}^{W \times H \times C}$ represents the enriched features. For the $i$-th stage, the spatiotemporal object representation $\hat{U}_i$ is obtained as $\hat{U}_i = \mathrm{Concat}(\hat{U}_{a,i}, \hat{U}_{m,i}) \in \mathbb{R}^{W \times H \times 2C}$, which is further fed into the downstream decoder via a bridge network.
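To make the data flow of Eq. (1) concrete, the following PyTorch-style sketch shows how a single encoder stage could interleave the two streams; it is a minimal illustration, and the module names (appearance_block, motion_block, mat) are placeholders rather than the released implementation.

```python
import torch
import torch.nn as nn

class InterleavedStage(nn.Module):
    """One encoder stage: per-stream residual blocks followed by a MAT module (Eq. 1).

    `appearance_block`, `motion_block` and `mat` are placeholders for the
    corresponding ResNet-101 residual block and the MAT module.
    """
    def __init__(self, appearance_block, motion_block, mat):
        super().__init__()
        self.appearance_block = appearance_block
        self.motion_block = motion_block
        self.mat = mat

    def forward(self, va_prev, vm_prev):
        # Per-stream features V_{a,i}, V_{m,i} from the previous stage outputs.
        va = self.appearance_block(va_prev)
        vm = self.motion_block(vm_prev)
        # MAT enriches both streams (Eq. 1).
        ua_hat, um_hat = self.mat(va, vm)
        # Spatiotemporal representation \hat{U}_i sent to the bridge network.
        u_hat = torch.cat([ua_hat, um_hat], dim=1)
        return ua_hat, um_hat, u_hat
```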
2) Bridge Network: The bridge network is responsible for
selecting informative spatiotemporal features for the decoder.
It is built upon several SSA modules (III-C), each of which takes advantage of $\hat{U}_i$ at the $i$-th stage, attending to it both locally and globally to produce an attentive feature $Z_i$, with a
unified attention module. The local attention adopts channel-
wise and spatial-wise attention mechanisms to highlight the
correct object regions and suppress possible noise existing in
the redundant features, while the global attention aims to re-
calibrate the features to account for objects of different sizes.
3) Boundary-Aware Decoder Network: The decoder net-
work adopts a coarse-to-fine scheme to conduct segmentation
inference. It consists of four BAR modules (III-D), i.e., $\mathcal{F}_{\mathrm{BAR}_i}$, $i \in \{2,3,4,5\}$, each corresponding to the $i$-th residual block. From $\mathcal{F}_{\mathrm{BAR}_5}$ to $\mathcal{F}_{\mathrm{BAR}_2}$, the resolution of the feature maps gradually increases by compensating high-level coarse features with more low-level details. $\mathcal{F}_{\mathrm{BAR}_2}$ produces the finest feature maps, whose resolution is $1/4$ of the input image size. They are sequentially processed by three additional layers, i.e., $\mathrm{conv}(3 \times 3, 1)$, upsampling and sigmoid, to obtain the final mask output $M^s \in \mathbb{R}^{w \times h}$.
In the following, we introduce the proposed modules (i.e., MAT, SSA, BAR) in detail. For simplicity, we omit the subscript $i$.
B. Motion-Attentive Transition Module
Each MAT module is comprised of two soft attention units
and one attention transition unit, as depicted in Fig. 2. The
soft attention units help to emphasize the most informative
regions in the appearance or motion feature maps, while
the transition unit transfers the attentive motion features to
facilitate spatiotemporal feature learning.
1) Soft Attention: This unit softly weights the input feature map $V_m$ (or $V_a$) at each spatial location. Taking $V_m$ as an example, this unit outputs a motion-attentive feature $U_m \in \mathbb{R}^{W \times H \times C}$ as follows:

$$\text{Softmax attention: } A_m = \mathrm{softmax}(W_m(V_m)),$$
$$\text{Attention-enhanced feature: } U^c_m = A_m \odot V^c_m, \quad (2)$$

where $W_m$ is a $1 \times 1$ convolution that transforms $V_m$ into an importance map, which is normalized using a softmax operation to generate an attention map $A_m \in \mathbb{R}^{W \times H}$, where $\sum_{i=1}^{W \times H} A^i_m = 1$. Here, each value $A^i_m$ is the probability with which our model believes the corresponding location is important. $U^c_m$ and $V^c_m$ indicate the 2D feature slices of $U_m$ and $V_m$ at the $c$-th channel, respectively. $\odot$ denotes the Hadamard product. Similarly, given $V_a$, we can obtain the appearance-attentive feature $U_a$ by Eq. (2).
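A minimal PyTorch sketch of the soft attention unit in Eq. (2) is given below; the layer and variable names are illustrative, assuming (B, C, H, W) feature tensors.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Spatial soft attention of Eq. (2): a 1x1 conv produces an importance map,
    a softmax over all spatial positions yields A, and each channel of the input
    is re-weighted by A (Hadamard product)."""
    def __init__(self, channels):
        super().__init__()
        self.importance = nn.Conv2d(channels, 1, kernel_size=1)  # plays the role of W_m (or W_a)

    def forward(self, v):                                # v: (B, C, H, W)
        b, c, h, w = v.shape
        a = self.importance(v).view(b, 1, h * w)         # importance map
        a = torch.softmax(a, dim=-1).view(b, 1, h, w)    # attention sums to 1 over H*W
        return v * a                                     # U^c = A ⊙ V^c for every channel c
```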
2) Attention Transition: To transfer the motion-attentive features $U_m$, we first seek the affinity between $U_a$ and $U_m$ in a non-local manner, using the following multi-modal bilinear model:

$$S = U_m^\top W U_a \in \mathbb{R}^{(WH) \times (WH)}, \quad (3)$$

where $W \in \mathbb{R}^{C \times C}$ is a trainable weight matrix. The affinity matrix $S$ can effectively capture pairwise relationships between the two feature spaces. However, it also introduces a huge number of parameters, which increases the computational cost and creates the risk of over-fitting. To overcome this problem, $W$ is approximately factorized into two low-rank matrices $P \in \mathbb{R}^{C \times \frac{C}{d}}$ and $Q \in \mathbb{R}^{C \times \frac{C}{d}}$, where $d$ ($d > 1$) is a reduction ratio. Then, Eq. (3) can be rewritten as:

$$S = U_m^\top P Q^\top U_a = (P^\top U_m)^\top (Q^\top U_a). \quad (4)$$

This operation is equivalent to applying channel-wise feature transformations to $U_m$ and $U_a$ before computing the similarity. Its advantages over Eq. (3) are three-fold: 1) it reduces the number of parameters by $2/d$ times; 2) it requires much fewer multiplication operations: Eq. (3) needs $WHC^2 + W^2H^2C$ multiplications, while Eq. (4) only requires $(2WHC^2 + W^2H^2C)/d$; 3) it helps to generate a compact channel-wise feature representation for each modality.

Then, we normalize $S$ row-wise to derive an attention map $S_r$ conditioned on motion features and obtain the enhanced appearance features $\hat{U}_a \in \mathbb{R}^{W \times H \times C}$:

$$\text{Motion-conditioned attention: } S_r = \mathrm{softmax}_r(S),$$
$$\text{Attention-enhanced feature: } \hat{U}_a = U_a S_r, \quad (5)$$
Fig. 4: Illustration of hard example mining (HEM) for salient object boundary detection. Panels: (a) image, (b) ground-truth, (c) HED edge map, (d) hard example mining, (e) image, (f) ground-truth, (g) boundary w/o HEM, (h) boundary w/ HEM. During training, for each training image in (a), our method first estimates an edge map (c) using the off-the-shelf HED [75], and then determines hard pixels (d) to facilitate training. For each test image in (e), we see that the boundary results with HEM (h) are more accurate than those without HEM (g).
where $\mathrm{softmax}_r$ indicates row-wise softmax.
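The low-rank bilinear affinity of Eq. (4) and the transfer of Eq. (5) can be sketched as follows in PyTorch; this is one plausible reading of the matrix layout (spatial positions as rows), not the official code, and the names P, Q and d mirror the symbols above.

```python
import torch
import torch.nn as nn

class AttentionTransition(nn.Module):
    """Low-rank bilinear affinity of Eq. (4) followed by the row-wise softmax
    and feature transfer of Eq. (5). Names are illustrative."""
    def __init__(self, channels, d=8):
        super().__init__()
        self.P = nn.Linear(channels, channels // d, bias=False)  # P in R^{C x C/d}
        self.Q = nn.Linear(channels, channels // d, bias=False)  # Q in R^{C x C/d}

    def forward(self, ua, um):                    # ua, um: (B, C, H, W)
        b, c, h, w = ua.shape
        ua_flat = ua.flatten(2).transpose(1, 2)   # (B, HW, C)
        um_flat = um.flatten(2).transpose(1, 2)   # (B, HW, C)
        # S = (P^T U_m)^T (Q^T U_a), realized here as a (B, HW, HW) affinity.
        s = torch.bmm(self.P(um_flat), self.Q(ua_flat).transpose(1, 2))
        s_r = torch.softmax(s, dim=-1)            # row-wise softmax -> S_r
        # Enhanced appearance features: each position re-aggregates U_a under S_r.
        ua_hat = torch.bmm(s_r, ua_flat)          # (B, HW, C)
        return ua_hat.transpose(1, 2).view(b, c, h, w)
```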
3) Deep-MAT: For complex videos, using one MAT layer to predict the attention is sub-optimal, due to the noise introduced by distractors that are irrelevant to the target regions. Therefore, we extend MAT into Deep-MAT for multi-step reasoning of spatiotemporal attention. Deep-MAT progressively refines attention via multiple MAT layers and can pinpoint more accurate target regions. In particular, our Deep-MAT consists of $L$ MAT layers cascaded in depth (denoted by $\mathcal{F}^{(1)}_{\mathrm{MAT}}, \mathcal{F}^{(2)}_{\mathrm{MAT}}, \cdots, \mathcal{F}^{(L)}_{\mathrm{MAT}}$). Let $\hat{U}^{(l-1)}_a$ and $\hat{U}^{(l-1)}_m$ be the input features for $\mathcal{F}^{(l)}_{\mathrm{MAT}}$. It then produces the outputs $\hat{U}^{(l)}_a$ and $\hat{U}^{(l)}_m$, which are further fed to $\mathcal{F}^{(l+1)}_{\mathrm{MAT}}$ in a recursive manner:

$$\hat{U}^{(l)}_a, \hat{U}^{(l)}_m = \mathcal{F}^{(l)}_{\mathrm{MAT}}(\hat{U}^{(l-1)}_a, \hat{U}^{(l-1)}_m), \quad (6)$$

where $\hat{U}^{(l)}_a$ is computed by Eq. (5) and $\hat{U}^{(l)}_m = U^{(l-1)}_m$ following Eq. (2). In addition, we have $\hat{U}^{(0)}_a = V_a$ and $\hat{U}^{(0)}_m = V_m$.

It should be noted that stacking MAT layers directly leads to an obvious drop in performance. Inspired by [74], we propose to stack multiple MAT layers in a residual form as follows:

$$\hat{U}^{(l)}_a = \hat{U}^{(l-1)}_a + U^{(l-1)}_a S_r, \qquad \hat{U}^{(l)}_m = \hat{U}^{(l-1)}_m + U^{(l-1)}_m. \quad (7)$$
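The residual stacking of Eq. (7) can be sketched as a simple loop over MAT layers; the sketch below assumes each MAT layer returns the two update terms of Eq. (7) and is only illustrative.

```python
import torch.nn as nn

class DeepMAT(nn.Module):
    """Stack of L MAT layers applied in residual form (Eq. 7): each step adds
    the attention-transferred appearance update and the soft-attended motion
    update to the previous estimates. `mat_layers` is a list of MAT modules
    (placeholders), each returning (appearance_update, motion_update)."""
    def __init__(self, mat_layers):
        super().__init__()
        self.mat_layers = nn.ModuleList(mat_layers)

    def forward(self, va, vm):
        ua_hat, um_hat = va, vm                   # \hat{U}^{(0)}_a = V_a, \hat{U}^{(0)}_m = V_m
        for mat in self.mat_layers:
            ua_update, um_update = mat(ua_hat, um_hat)
            ua_hat = ua_hat + ua_update           # residual update of Eq. (7)
            um_hat = um_hat + um_update
        return ua_hat, um_hat
```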
4) Discussion: In Fig. 3, we show the visual effects of the MAT module. We can observe that, with MAT, the feature maps in $V_a$ are well refined to produce more effective features in $\hat{U}_a$. The new features exhibit desirable properties, with prominent objects highlighted and distractors suppressed, which is beneficial for accurate segmentation.
C. Scale-Sensitive Attention Module
The SSA module $\mathcal{F}_{\mathrm{SSA}}$ is extended from a simplified CBAM $\mathcal{F}_{\mathrm{CBAM}}$ [71] by adding a global attention $\mathcal{F}_g$. Given a feature map $U \in \mathbb{R}^{W \times H \times 2C}$, our SSA module refines it as follows:

$$Z = \mathcal{F}_{\mathrm{SSA}}(U) = \mathcal{F}_g(\mathcal{F}_{\mathrm{CBAM}}(U)) \in \mathbb{R}^{W \times H \times 2C}. \quad (8)$$
The CBAM module $\mathcal{F}_{\mathrm{CBAM}}$ consists of two sequential sub-modules, channel and spatial attention, which can be formulated as:

$$\text{Channel attention: } s = \mathcal{F}_s(U), \; e = \mathcal{F}_e(s), \; Z_c = e \star U,$$
$$\text{Spatial attention: } p = \mathcal{F}_p(Z_c), \; Z_{\mathrm{CBAM}} = p \odot Z_c, \quad (9)$$

where $\mathcal{F}_s$ is a squeeze operator that gathers the global spatial information of $U$ into a vector $s \in \mathbb{R}^{2C}$, while $\mathcal{F}_e$ is an excitation operator that captures channel-wise dependencies and outputs an attention vector $e \in \mathbb{R}^{2C}$. Following [70], $\mathcal{F}_s$ is implemented by applying average pooling on each feature channel, and $\mathcal{F}_e$ is formed by four consecutive operations: $fc(\frac{2C}{16}) \to \mathrm{ReLU} \to fc(2C) \to \mathrm{sigmoid}$. $Z_c \in \mathbb{R}^{W \times H \times 2C}$ denotes the channel-wise attentive features, and $\star$ indicates channel-wise multiplication. In the spatial attention, $\mathcal{F}_p$ exploits the inter-spatial relationship of $Z_c$ and produces a spatial-wise attention map $p \in \mathbb{R}^{W \times H}$ by $\mathrm{conv}(7 \times 7, 1) \to \mathrm{sigmoid}$. Then, we obtain the attention glimpse $Z_{\mathrm{CBAM}} \in \mathbb{R}^{W \times H \times 2C}$ as the local-level feature.

The global attention $\mathcal{F}_g$ shares a similar spirit to the channel attention in Eq. (9), in that it has the same squeeze layer but modifies the excitation layer to $fc(\frac{2C}{16}) \to fc(1) \to \mathrm{sigmoid}$ to obtain a scale selection factor $g \in \mathbb{R}^1$. It can then obtain the scale-sensitive features $Z$ as follows:

$$Z = (g \cdot Z_{\mathrm{CBAM}}) + U. \quad (10)$$
Note that we use identity mapping to avoid losing important
information on the regions with attention values close to 0.
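A compact PyTorch sketch of the SSA module (Eqs. (8)-(10)) is shown below; the exact layer ordering inside the simplified CBAM is an assumption based on the description above, and all names are illustrative.

```python
import torch.nn as nn

class SSA(nn.Module):
    """Scale-Sensitive Attention sketch (Eqs. 8-10): simplified CBAM
    (channel + spatial attention) followed by a global scale gate g and an
    identity shortcut. `channels` corresponds to 2C in the text."""
    def __init__(self, channels):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)        # F_s: global average pooling
        self.excite = nn.Sequential(                  # F_e: fc(2C/16) -> ReLU -> fc(2C) -> sigmoid
            nn.Linear(channels, channels // 16), nn.ReLU(inplace=True),
            nn.Linear(channels // 16, channels), nn.Sigmoid())
        self.spatial = nn.Sequential(                 # F_p: conv(7x7, 1) -> sigmoid
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())
        self.scale_gate = nn.Sequential(              # F_g excitation: fc(2C/16) -> fc(1) -> sigmoid
            nn.Linear(channels, channels // 16),
            nn.Linear(channels // 16, 1), nn.Sigmoid())

    def forward(self, u):                             # u: (B, 2C, H, W)
        b, c, _, _ = u.shape
        s = self.squeeze(u).view(b, c)                # channel descriptor s
        zc = u * self.excite(s).view(b, c, 1, 1)      # channel attention (Eq. 9)
        z_cbam = zc * self.spatial(zc)                # spatial attention (Eq. 9)
        g_desc = self.squeeze(z_cbam).view(b, c)      # same squeeze applied to Z_CBAM
        g = self.scale_gate(g_desc).view(b, 1, 1, 1)  # scale selection factor g
        return g * z_cbam + u                         # Eq. (10) with identity mapping
```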
D. Boundary-Aware Refinement Module
In the decoder network, each BAR $\mathcal{F}_{\mathrm{BAR}_i}$ accepts two inputs, $Z_i$ from the corresponding SSA module and $F_i$ from the previous BAR. To obtain a sharp mask output, the BAR first performs object boundary estimation using an extra boundary detection module $\mathcal{F}_{\mathrm{BDRY}}$, which compels the network to emphasize finer object details. The predicted boundary map is then combined with the two inputs to produce finer features for the next BAR module. This can be formulated as:

$$M^b_i = \mathcal{F}_{\mathrm{BDRY}}(F_i), \qquad F_{i-1} = \mathcal{F}_{\mathrm{BAR}_i}(Z_i, F_i, M^b_i), \quad (11)$$

where $\mathcal{F}_{\mathrm{BDRY}}$ consists of a stack of convolutional layers and a sigmoid layer (see Fig. 5), $M^b_i \in \mathbb{R}^{w \times h}$ indicates the boundary map and $F_{i-1}$ is the output feature map of $\mathrm{BAR}_i$. The full computational graph of $\mathrm{BAR}_i$ is shown in Fig. 5.
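The forward pass of Eq. (11), together with the internal steps of Fig. 5, can be sketched as follows; `aspp`, `boundary_head` and the residual blocks are placeholder modules assumed to be channel-compatible, and the placement of the upsampling follows Fig. 5 only loosely.

```python
import torch
import torch.nn as nn

class BAR(nn.Module):
    """Boundary-Aware Refinement sketch (Eq. 11, Fig. 5): ASPP on the decoder
    feature, a small boundary head, concatenation of the boundary map, a
    residual block, summation with the bridge feature Z_i, and a second
    residual block."""
    def __init__(self, aspp, boundary_head, res_block1, res_block2):
        super().__init__()
        self.aspp = aspp                    # multi-scale dilated convolutions
        self.boundary_head = boundary_head  # convs + sigmoid producing M^b_i
        self.res1 = res_block1
        self.res2 = res_block2
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, z_i, f_i):
        boundary = self.boundary_head(f_i)                    # M^b_i = F_BDRY(F_i)
        feat = self.aspp(f_i)                                 # multi-scale features
        feat = self.res1(torch.cat([feat, boundary], dim=1))  # fuse boundary cue
        feat = self.res2(feat + z_i)                          # add bridge feature Z_i
        return self.up(feat), boundary                        # finer features F_{i-1}
```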
BAR benefits from two key factors: the first is that we apply
Atrous Spatial Pyramid Pooling (ASPP) [76] on convolutional
features to transform them into a multi-scale representation.
This helps to enlarge the receptive field and obtain more
spatial details for decoding. Technically, ASPP consists of
multiple parallel dilated convolutional layers with different
sampling rates. In this paper, four dilated convolutional layers
are adopted, and the dilation rates are set as $\{2^k\}_{k=1}^{4}$. In this way, each BAR module first extracts spatiotemporal features at four scales, which are then concatenated together to emphasize multi-scale features. During decoding, these features are further concatenated with the boundary prediction $M^b_i$, and then progressively processed by a residual block (‘Res’ in Fig. 5), an element-wise summation with $Z_i$, and another residual block to obtain more fine-grained features $F_{i-1}$, as shown in Fig. 5. Here the residual block is implemented by two stacked $3 \times 3$ convolutions with an identity shortcut [73].

Fig. 5: Computational graph of the $\mathrm{BAR}_i$ module. Here, ‘Res’ is a residual block [73], while ‘UP’ denotes bilinear upsampling. ⓒ and ⊕ indicate concatenation and element-wise addition operations, respectively.
The second benefit is that we introduce a heuristic method
for automatically mining hard negative pixels to support the
training of $\mathcal{F}_{\mathrm{BDRY}}$. Specifically, for each training frame, we use the popular off-the-shelf HED model [75] to predict a boundary map $E \in [0,1]^{w \times h}$, wherein each value $E_k$ represents the probability of pixel $k$ being an edge pixel. Then, pixel $k$ is regarded as a hard negative pixel if it has a high edge probability (e.g., $E_k > 0.2$) and falls outside the dilated ground-truth region. If pixel $k$ is a hard pixel, its weight is $w_k = 1 + E_k$; otherwise, $w_k = 1$.

Then, $w_k$ is used to weight the following adaptive boundary loss, so that hard pixels are penalized heavily if they are misclassified:

$$\mathcal{L}_{\mathrm{BDRY}}(M^b, G^b) = -\sum_k w_k \big( (1 - G^b_k) \log(1 - M^b_k) + G^b_k \log(M^b_k) \big), \quad (12)$$

where $M^b$ and $G^b$ are the boundary prediction and ground-truth, respectively.
Fig. 4 offers an illustration of the above hard example
mining (HEM) scheme. Clearly, by explicitly discovering
hard negative pixels, the network can produce more accurate
boundary predictions with background pixels well suppressed
(see Fig. 4 (g) and (h)).
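The adaptive boundary loss of Eq. (12), with the HEM weighting described above, can be written as a short function; the tensor layout and the name `dilated_gt` are assumptions for illustration.

```python
import torch

def boundary_loss(pred, gt, hed_edges, dilated_gt, thresh=0.2, eps=1e-7):
    """Adaptive boundary loss of Eq. (12). `hed_edges` is an off-the-shelf edge
    map in [0, 1] and `dilated_gt` a binary mask of the dilated ground-truth
    boundary region; pixels with a strong edge response outside this region
    are treated as hard negatives and up-weighted by (1 + E_k)."""
    hard_neg = (hed_edges > thresh) & (dilated_gt < 0.5)
    weights = torch.where(hard_neg, 1.0 + hed_edges, torch.ones_like(hed_edges))
    bce = -(gt * torch.log(pred + eps) + (1.0 - gt) * torch.log(1.0 - pred + eps))
    return (weights * bce).sum()
```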
E. Detailed Network Architecture
Our whole model is end-to-end trainable, because all the
components in MATNet are parameterized by neural networks.
At each stream of the encoder, we use the first five convolu-
tion blocks of ResNet-101 [73] as our backbone for feature
extraction. The spatiotemporal features in the last convolution
stage are fed into a global convolutional layer (GC in Fig. 1)
to enlarge the valid receptive field [77], which is implemented
by combining $1 \times 7 \rightarrow 7 \times 1$ and $7 \times 1 \rightarrow 1 \times 7$ convolutional layers, followed by a residual block.
Fig. 6: Qualitative results on four sequences. From top to bottom: dance-twirl from DAVIS16, dogs02 from FBMS, cat-0001 from YouTube-Objects, and dogs-jump from DAVIS17.
1) Training Phase: Given an input frame $I_a \in \mathbb{R}^{473 \times 473 \times 3}$, we first compute its optical flow field $I_m \in \mathbb{R}^{473 \times 473 \times 3}$ using PWC-Net [78], due to its high efficiency and accuracy. Then, our MATNet predicts a segmentation mask $M^s \in [0,1]^{473 \times 473}$ and four boundary masks $\{M^b_i \in [0,1]^{473 \times 473}\}_{i=1}^{4}$ through the decoder network. Let $G^s \in \{0,1\}^{473 \times 473}$ be the binary segmentation ground-truth, and $G^b \in \{0,1\}^{473 \times 473}$ be the boundary ground-truth, which can be easily computed from $G^s$. The overall loss function is formulated as:

$$\mathcal{L}_{\mathrm{ZVOS}} = \mathcal{L}_{\mathrm{CE}}(M^s, G^s) + \frac{1}{N} \sum_{i=1}^{N=4} \mathcal{L}_{\mathrm{BDRY}}(M^b_i, G^b), \quad (13)$$

where $\mathcal{L}_{\mathrm{CE}}$ indicates the classic cross-entropy loss, and $\mathcal{L}_{\mathrm{BDRY}}$ is defined in Eq. (12).
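For reference, the overall objective of Eq. (13) can be assembled as below; binary cross-entropy is used as a concrete stand-in for the "classic cross-entropy loss" on the sigmoid mask output.

```python
import torch
import torch.nn.functional as F

def zvos_loss(mask_pred, mask_gt, boundary_preds, boundary_gt, boundary_loss_fn):
    """Overall objective of Eq. (13): cross-entropy on the segmentation mask plus
    the boundary loss averaged over the N = 4 side outputs. `boundary_loss_fn`
    stands for the adaptive boundary loss of Eq. (12)."""
    seg_loss = F.binary_cross_entropy(mask_pred, mask_gt)
    bdry_loss = torch.stack([boundary_loss_fn(p, boundary_gt) for p in boundary_preds]).mean()
    return seg_loss + bdry_loss
```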
2) Testing Phase: Once the network is trained, we apply it
to unseen videos. Given a test video, we resize all the frames
to 473×473, and feed each frame, along with its optical flow, to
the network for segmentation. We follow the common protocol
used in previous works [27, 30, 43] and employ CRF to obtain
the final binary segmentation results.
3) Runtime: Our model is implemented in PyTorch and
trained on a single Nvidia RTX 2080Ti GPU and an Intel(R)
Xeon Gold 5120 CPU. Testing is conducted on the same
machine. For each test frame of size 473×473, the forward
inference of our MATNet takes about 0.05s, while optical flow
estimation and CRF-based post-processing take about 0.2s and
0.5s, respectively.
IV. EXTENSION OF MATNET
In this section, we describe two extensions of our MATNet:
zero-shot video instance segmentation and dynamic visual
attention prediction. The former focuses on multi-object unsupervised video segmentation [79], targeting more fine-grained results in multi-object scenarios. The latter aims at
predicting where people look over dynamic scenes.
A. Zero-Shot Video Instance Segmentation
To adapt our MATNet into an instance-level segmentation
setting, we modify our model into a saliency-driven instance
selection method. More specifically, for a test video $\mathcal{V} = \{I_t\}_{t=1}^{T}$ with $T$ frames, our approach takes three stages to generate segmentation tracks. 1) Object Proposal Generation. For each frame $I_t$, we generate a collection of category-agnostic segment proposals $\mathcal{P}_t = \{P^i_t\}_i$ using a COCO-trained Mask R-CNN [83] for detecting generic objects. Our MATNet is also applied to generate an object-level segmentation mask $M^s_t$. Then, we compute a score $S^i_t$ for each proposal:

$$S^i_t = S^i_{\mathrm{MATNet}} \cdot S^i_{\mathrm{MRCNN}}, \qquad S^i_{\mathrm{MATNet}} = \frac{\|P^i_t \cap M^s_t\|}{\|P^i_t\|}, \quad (14)$$

where $S^i_{\mathrm{MRCNN}}$ denotes the detection score of $P^i_t$ from Mask R-CNN, while $S^i_{\mathrm{MATNet}}$ measures its saliency score. The proposals with small scores ($S^i_t < 0.03$) are discarded. 2) Short-Term Tracklet Generation. Given the remaining proposals, we further connect them temporally in a greedy manner. First, each proposal $P^i_t$ is warped to the next frame using optical flow, and we search for its matched proposal in $\mathcal{P}_{t+1}$ by evaluating the IoU scores. If the maximum IoU score is above 0.1, the corresponding proposal is regarded as being matched with $P^i_t$. 3) Tracklet Merging by Re-Identification (ReID). We further merge short-term tracklets into a set of consistent segmentation tracks using object re-identification. The ReID embedding vector for each proposal is computed using a pretrained ReID network [84]. For each tracklet, its embedding is computed as the average embedding of all proposals belonging to it. We use the $L_2$ distance to measure the similarity between two tracklets and adopt the merging strategy in [85] to obtain the final segmentation tracks.
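The stage-1 scoring of Eq. (14) amounts to a mask-overlap computation; a hedged sketch is given below, assuming proposals are provided as a stack of binary masks with per-proposal Mask R-CNN scores.

```python
import torch

def score_proposals(proposals, mrcnn_scores, matnet_mask, keep_thresh=0.03):
    """Stage-1 proposal scoring of Eq. (14): the saliency score is the fraction
    of a proposal covered by the MATNet object-level mask, multiplied by the
    Mask R-CNN detection score; low-scoring proposals are discarded.
    `proposals` is an (N, H, W) stack of binary masks (illustrative layout)."""
    masks = proposals.float()
    overlap = (masks * matnet_mask.float()).flatten(1).sum(dim=1)
    area = masks.flatten(1).sum(dim=1).clamp(min=1.0)
    s_matnet = overlap / area                       # |P ∩ M^s| / |P|
    scores = s_matnet * mrcnn_scores                # S = S_MATNet * S_MRCNN
    keep = scores > keep_thresh
    return scores, keep
```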
B. Dynamic Visual Attention Prediction
Our MATNet is flexible to fit the DVAP task with modifications in two aspects: 1) Network structure: Since
TABLE I: Quantitative comparison of ZVOS methods on DAVIS16 val. The best result for each metric is boldfaced (this note also applies to other tables). All the results are borrowed from the public leaderboard maintained by the DAVIS16 challenge (https://davischallenge.org/davis2016/soa_compare.html). See V-A for details.

Measure | SFL [29] | FSEG [28] | LVO [48] | ARP [80] | PDB [53] | LSMO [44] | MOT [81] | EPO [49] | AGS [43] | COSNet [41] | AGNN [17] | AnDiff [82] | MATNet
J Mean ↑ | 67.4 | 70.7 | 75.9 | 76.2 | 77.2 | 78.2 | 77.2 | 80.6 | 79.7 | 80.5 | 80.7 | 81.7 | 82.4
J Recall ↑ | 81.4 | 83.5 | 89.1 | 91.1 | 90.1 | 89.1 | 87.8 | 95.2 | 91.1 | 93.1 | 94.0 | 90.9 | 94.5
J Decay ↓ | 6.2 | 1.5 | 0.0 | 7.0 | 0.9 | 4.1 | 5.0 | 2.2 | 1.9 | 4.4 | 0.0 | 2.2 | 5.5
F Mean ↑ | 66.7 | 65.3 | 72.1 | 70.6 | 74.5 | 75.9 | 77.4 | 75.5 | 77.4 | 79.5 | 79.1 | 80.5 | 80.7
F Recall ↑ | 77.1 | 73.8 | 83.4 | 83.5 | 84.4 | 84.7 | 84.4 | 87.9 | 85.8 | 89.5 | 90.5 | 85.1 | 90.2
F Decay ↓ | 5.1 | 1.8 | 1.3 | 7.9 | -0.2 | 3.5 | 3.3 | 2.4 | 1.6 | 5.0 | 0.0 | 0.6 | 4.5
T Mean ↓ | 28.2 | 32.8 | 26.5 | 39.3 | 29.1 | 21.2 | 27.9 | 19.3 | 26.7 | 18.4 | 33.7 | 21.4 | 21.6
TABLE II: Quantitative results for each category on YouTube-Objects over Mean J. See V-A for details.

Category | LVO [48] | SFL [29] | FSEG [28] | PDB [53] | AGS [43] | COSNet [41] | AGNN [17] | MATNet
Airplane (6) | 86.2 | 65.6 | 81.7 | 78.0 | 87.7 | 81.1 | 81.1 | 72.9
Bird (6) | 81.0 | 65.4 | 63.8 | 80.0 | 76.7 | 75.7 | 75.9 | 77.5
Boat (15) | 68.5 | 59.9 | 72.3 | 58.9 | 72.2 | 71.3 | 70.7 | 66.9
Car (7) | 69.3 | 64.0 | 74.9 | 76.5 | 78.6 | 77.6 | 78.1 | 79.0
Cat (16) | 58.8 | 58.9 | 68.4 | 63.0 | 69.2 | 66.5 | 67.9 | 73.7
Cow (20) | 68.5 | 51.2 | 68.0 | 64.1 | 64.6 | 69.8 | 69.7 | 67.4
Dog (27) | 61.7 | 54.1 | 69.4 | 70.1 | 73.3 | 76.8 | 77.4 | 75.9
Horse (14) | 53.9 | 64.8 | 60.4 | 67.6 | 64.4 | 67.4 | 67.3 | 63.2
Motorbike (10) | 60.8 | 52.6 | 62.7 | 58.4 | 62.1 | 67.7 | 68.3 | 62.6
Train (5) | 66.3 | 34.0 | 62.2 | 35.3 | 48.2 | 46.8 | 47.8 | 51.0
Mean J ↑ | 67.5 | 57.1 | 68.4 | 65.5 | 69.7 | 70.5 | 70.8 | 69.0
TABLE III: Quantitative results on FBMS over Mean J (V-A).

Measure | MSTP [42] | FSEG [28] | IET [45] | OBN [31] | PDB [53] | COSNet [41] | MATNet
Mean J ↑ | 60.8 | 68.4 | 71.9 | 73.9 | 74.0 | 75.6 | 76.1
boundary ground-truths are not available in this task, we discard the object boundary constraints, so that Eq. (11) becomes $F_{i-1} = \mathcal{F}_{\mathrm{BAR}_i}(Z_i, F_i)$. In this way, for $\mathrm{BAR}_i$, the finer-grained features $F_{i-1}$ are produced by relying only on the features $F_i$ from $\mathrm{BAR}_{i+1}$ as well as the corresponding convolutional feature $Z_i$. Besides, we also remove the unnecessary concatenation operator in Fig. 5. All other modules are kept unchanged. 2) Loss function: We consider the Kullback-Leibler (KL) divergence loss $\mathcal{L}_{\mathrm{KL}}$ as our main learning objective. It is more task-oriented and has been proven effective in [86]. The overall loss function is:

$$\mathcal{L}_{\mathrm{DVAP}} = \mathcal{L}_{\mathrm{KL}}(M^v, G^v) + \lambda \mathcal{L}_{\mathrm{CE}}(M^v, G^v), \quad (15)$$

where $M^v$ and $G^v$ are the attention prediction and ground-truth, respectively, $\mathcal{L}_{\mathrm{KL}}(M^v, G^v) = \sum_i G^v_i \log\left(\frac{G^v_i}{M^v_i}\right)$, and $\lambda = 0.1$ is a weight to balance the contributions of the two losses.
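The DVAP objective of Eq. (15) can be sketched as follows; normalizing both maps for the KL term and using binary cross-entropy for $\mathcal{L}_{\mathrm{CE}}$ are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dvap_loss(pred, gt, lam=0.1, eps=1e-7):
    """DVAP objective of Eq. (15): KL divergence between the ground-truth and
    predicted attention maps plus a weighted cross-entropy term (lambda = 0.1).
    Both maps are assumed to be normalized to sum to one for the KL term."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    kl = (g * torch.log(g / (p + eps) + eps)).sum()
    ce = F.binary_cross_entropy(pred.clamp(eps, 1 - eps), gt)
    return kl + lam * ce
```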
V. EXPERIMENTS
In this section, we first compare MATNet with state-of-
the-art models on our main task of interest, i.e., ZVOS,
on both object-level (V-A) and instance-level (V-B) settings.
Then, we investigate the performance of our model on the
DVAP task (V-C). For each task, we separately introduce the
corresponding standalone datasets and experimental results.
Finally, to gain a deeper insight into our model, we conduct
detailed ablation studies in V-D.
Fig. 7: Attribute-based comparison on DAVIS16 val. We compare MATNet with three top-performing methods, i.e., AnDiff [82], COSNet [41] and AGS [43]. For each method, Mean J is computed over all sequences with the specified attributes.

Fig. 8: Attribute-based ablation study on DAVIS16 val. We compare the Mean J of different network variants under various attributes.
A. Main Task: Zero-Shot Video Object Segmentation
1) Datasets: We carry out comprehensive experiments on
three popular datasets:
DAVIS16 [32] is one of the most popular video object seg-
mentation datasets, which consists of 50 high-quality videos
in total (30 for train and 20 for val). Each frame contains
pixel-wise annotations for foreground objects. For quantitative
evaluation, we use three standard metrics suggested by [32],
namely region similarity J, boundary accuracy F, and time
stability T.
YouTube-Objects [34] is a large dataset of 126 web videos
with 10 semantic object categories and more than 20,000
frames. Following its protocol, we use the region similarity J metric to measure the performance on the whole dataset without further training.
FBMS [11] consists of 59 video sequences with ground-truth
annotations provided in a subset of the frames. Following the
standard protocol [48], we do not use any sequence for training
and only evaluate on the val set consisting of 30 sequences.
2) Implementation Details: The training data consist of two
parts: i) all training data from DAVIS16 [32], including 30
videos with about 2K frames; ii) a subset of 12K frames
selected from the training set of YouTube-VOS [87], which is
obtained by sampling images every ten frames in each video.
In total, we use 14K training samples, basically matching the current top-performing methods, i.e., AGNN [17],
TABLE IV: Quantitative comparison of ZVOS methods on DAVIS17 val. All the results are borrowed from the public leaderboard of the DAVIS17 challenge (https://davischallenge.org/davis2017/soa_compare.html). See V-B for details.

Measure | RVOS [16] | PDB [53] | AGS [43] | MATNet
J&F Mean ↑ | 41.2 | 55.1 | 57.5 | 58.6
J Mean ↑ | 36.8 | 53.2 | 55.5 | 56.7
J Recall ↑ | 40.2 | 58.9 | 61.6 | 65.2
J Decay ↓ | 0.5 | 4.9 | 7.0 | -3.6
F Mean ↑ | 45.7 | 57.0 | 59.5 | 60.4
F Recall ↑ | 46.4 | 60.2 | 62.8 | 68.2
F Decay ↓ | 1.7 | 6.8 | 9.0 | 1.8
COSNet [41] and AGS [43]. The entire network is trained using the SGD optimizer with a learning rate of 1e-4 for the encoder and the bridge network, and 1e-3 for the decoder. During training, the batch size, momentum and weight decay are set to 2, 0.9, and 1e-5, respectively. The data are augmented online with horizontal flipping and rotations covering a range of (-10, 10) degrees.
3) Performance on DAVIS16 val: We compare our MATNet with the top-performing ZVOS methods on the public leaderboard of DAVIS16. The detailed results are reported in Table I. We can observe that our MATNet achieves the best performance compared to the other methods. Specifically, it outperforms the second-best method (i.e., AnDiff [82]) by +0.7% and +0.2% in terms of Mean J and Mean F, and by +3.6% and +5.1% in terms of Recall J and Recall F.
In Table I, some of the deep learning-based models, e.g.,
FSEG [28], LVO [48], MOT [81], use motion cues to improve
segmentation. Our MATNet outperforms all of these methods
by a large margin. The reason is that these methods learn motion and appearance features independently, without
considering the close interactions between them. In contrast,
our MATNet can learn more effective multi-modal object
representations with the interleaved encoder.
Fig. 7 shows the results of the attribute-based study on DAVIS16 [32], using 15 video attributes provided by the dataset. Three top-performing ZVOS methods, i.e., AnDiff [82], COSNet [41] and AGS [43], are selected for comparison. Our model significantly outperforms them in terms of many attributes (e.g., low resolution, fast motion, dynamic background, motion blur, heterogeneous object, and
model against various challenges present in videos.
4) Performance on YouTube-Objects: Table II reports the
detailed results on YouTube-Objects. Our model shows
promising performance in most categories. It lags behind some
methods in the Airplane and Boat categories. This is mainly
because sequences in these categories contain slowly-moving
objects, which are often visually similar to their surroundings.
These factors may result in inaccurate estimation of optical
flow, thereby hurting the performance.
5) Performance on FBMS: For completeness, we also eval-
uate our method on FBMS. As shown in Table III, MATNet produces the best result with 76.1% in Mean J, which outperforms the second-best method, i.e., COSNet, by 0.5%.
6) Qualitative results: Fig. 6 depicts sample results for
representative sequences from these three datasets. The dance-
TABLE V: Quantitative DVAP results on the val sets of Hollywood-2 and UCF-Sports. See V-C for details.

Hollywood-2:
Methods | AUC-J ↑ | SIM ↑ | s-AUC ↑ | CC ↑ | NSS ↑
ACLNet [68] | 0.913 | 0.542 | 0.757 | 0.623 | 3.086
SalEMA [88] | 0.919 | 0.487 | 0.708 | 0.613 | 3.186
TASED [89] | 0.918 | 0.507 | 0.768 | 0.646 | 3.302
STRA [58] | 0.923 | 0.536 | 0.774 | 0.662 | 3.478
Ours | 0.915 | 0.539 | 0.797 | 0.674 | 3.486

UCF-Sports:
Methods | AUC-J ↑ | SIM ↑ | s-AUC ↑ | CC ↑ | NSS ↑
ACLNet [68] | 0.897 | 0.406 | 0.744 | 0.510 | 2.567
SalEMA [88] | 0.906 | 0.431 | 0.740 | 0.544 | 2.638
TASED [89] | 0.899 | 0.469 | 0.752 | 0.582 | 2.920
STRA [58] | 0.910 | 0.479 | 0.751 | 0.593 | 3.018
Ours | 0.901 | 0.503 | 0.783 | 0.625 | 3.291
twirl sequence from DAVIS16 contains many challenging fac-
tors, such as object deformation, motion blur and background
clutter. As seen, our method is robust to these challenges and
delineates the target with accurate contours. The effectiveness
is further proved in cat-0001 from YouTube-Objects, in which
the cat is visually similar to its surroundings and undergoes
large deformation. In addition, our model also works well in
dogs02, in which the target undergoes large scale variations.
B. Additional Task: Zero-Shot Video Instance Segmentation
1) Datasets: DAVIS17 [33] extends DAVIS16 with another
70 sequences, leading to 120 videos in total. These videos are
split into 60 for train, 30 for val and 30 for test-dev.
Different from DAVIS16 , this dataset provides instance-level
annotations. Therefore, we use it to evaluate the performance
of our model in instance-level video object segmentation.
Following the standard evaluation setting, we measure the
performance in terms of region similarity J, contour accuracy
F, and their combination J&F.
2) Quantitative and Qualitative Results: Table IV reports
the performance of MATNet against three top-performing
models (i.e., RVOS [16], PDB [53] and AGS [43]). The results
clearly demonstrate that our model outperforms all of them by
a large margin. For instance, in terms of J&F, Mean J and Mean F, our model surpasses the second-best method (i.e., AGS) by 1.1%, 1.2% and 0.9%, respectively.
Besides, some qualitative results on DAVIS17 are shown in
Fig. 6 (the last row), validating that our model yields high-
quality ZVOS results in the instance-level setting.
C. Additional Task: Dynamic Visual Attention Prediction
1) Datasets: Hollywood-2 [35] consists of 1,707 video
sequences (823 for train and 884 for test) collected from
69 Hollywood movies, covering 12 action categories (e.g.,
eating, kissing and running). The dataset focuses on more task-
driven scenes, e.g., movie scenes and human actions.
UCF-Sports [35] includes 150 videos covering 9 common
sports action categories, such as walking, diving and golfing.
Similar to Hollywood-2, the annotations in this dataset mainly
focus on action behaviors. The dataset is split into 103 videos
for train and 47 for test.
TABLE VI: Ablation study of MATNet on DAVIS16 val, measured by Mean J and Mean F. See V-D for details.

Network Variant | Mean J ↑ | ΔJ | Mean F ↑ | ΔF
MATNet w/o MAT | 79.5 | -2.9 | 77.3 | -3.4
MATNet w/o SSA | 80.7 | -1.7 | 79.7 | -1.0
MATNet w/o HEM | 81.4 | -1.0 | 78.4 | -2.3
MATNet w/ Res50 | 81.1 | -1.3 | 79.3 | -1.4
MATNet w/ Res101 | 82.4 | - | 80.7 | -
Fig. 9: Qualitative results of the ablation study. From left to right: (a) image, (b) ground-truth, (c) w/o MAT, (d) w/o SSA, (e) w/o HEM, (f) MATNet.
2) Implementation Details: For each dataset, we use the train set to train our model. The network is trained with the same settings as in V-A, except that the training images are resized to 360 × 360 for fair comparison with previous works [58, 68, 89]. λ in Eq. (15) is empirically set to 0.1.
3) Metrics: Following previous work [68], we report the
performance of our model using five metrics, namely Nor-
malized Scanpath Saliency (NSS), Similarity (SIM), Linear
Correlation Coefficient (CC), Area Under the Curve by Judd
(AUC-J) and shuffled AUC (s-AUC). NSS and CC measure the
correlation between the prediction and ground-truth saliency
map. SIM computes the similarity between two histograms,
while AUC-J and s-AUC are variants of the well-known
AUC metric. For each metric, higher scores indicate better
performance.
4) Quantitative Results: We compare our model with four
DVAP models, i.e., ACLNet [68], SalEMA [88], STRA [58]
and TASED [89]. The results of these methods are directly
obtained from the authors. As shown in Table V, MATNet
generally outperforms all the competitors across most of the
metrics, in both the Hollywood-2 and UCF-Sports datasets.
This verifies the strong generality of our model.
D. Ablation Study
Table VI summarizes the ablation analysis of MATNet on
DAVIS16 val.
1) Efficacy of MAT: We first study the effects of the MAT
module by comparing our full model to one following the same
architecture without MAT, denoted as MATNet w/o MAT.
The encoder in this network is thus equivalent to a standard
two-stream model, where convolution features from the two
streams are concatenated at each residual stage for object
representation. As shown in Table VI, this model encounters a
huge performance degradation (-2.9% in Mean J and -3.4% in Mean F), which verifies the effectiveness of MAT.
Moreover, we also evaluate the performance of MATNet with different numbers of MAT modules in each deep residual MAT layer. The results in Table VII show that the performance gradually improves as L increases, reaching saturation at L = 5. Based on this analysis, we choose L = 5 as the default number of MAT modules in MATNet.
TABLE VII: Performance comparison with different numbers of MAT blocks cascaded in each MAT layer on DAVIS16 val. See V-D for details.

Metric | L = 0 | L = 1 | L = 3 | L = 5 | L = 7
Mean J ↑ | 79.5 | 80.6 | 81.6 | 82.4 | 82.2
Mean F ↑ | 77.3 | 80.3 | 80.7 | 80.7 | 80.6
TABLE VIII: Impact of different optical flow methods on DAVIS16 val. See V-D for details.

Flow Method | Mean J ↑ | Mean F ↑ | Mean T ↓
LiteFlowNet [90] | 80.9 | 79.3 | 23.2
SpyNet [91] | 78.4 | 76.8 | 26.6
PWC-Net [78] | 82.4 | 80.7 | 21.6
2) Efficacy of SSA: To measure the effectiveness of the
SSA module, we design another network, MATNet w/o SSA,
by replacing the SSA block with a simple skip layer. As
can be observed, its performance is -1.7% lower than our
full model in terms of Mean J, and -1.0% lower in Mean
F. The performance drop is mainly caused by the redundant
spatiotemporal features from the encoder. Our SSA module
aims to eliminate the redundancy by only highlighting the
features that are beneficial to segmentation.
3) Efficacy of HEM: We also study the influence of using
HEM during training. HEM is expected to facilitate the learn-
ing of more accurate object boundaries, which should further
boost the segmentation procedure. The results in Table VI
(see MATNet w/o HEM) indicate the importance of HEM.
By directly controlling the loss function in Eq. (12), HEM
helps to improve the contour accuracy by 2.3%.
4) Impact of Backbone: To verify that the high performance
of our network is not mainly due to the powerful backbone, we
replace ResNet-101 with ResNet-50 to build another network,
i.e., MATNet w/ Res50. We see that the performance degrades
slightly, but the model still outperforms previous methods (e.g.,
AGNN [17], COSNet [41], AGS [43]). This further confirms
the effectiveness of the proposed modules.
5) Impact of Optical Flow: Table VIII reports the results
of MATNet on DAVIS16 val with three open-sourced op-
tical flow computation methods, i.e., PWC-Net [78], Lite-
FlowNet [90] and SpyNet [91]. They rank #23, #34 and
#143 in the public MPI Sintel Flow Benchmark (http://sintel.is.tue.mpg.de/results), respectively. Generally, better optical
flow models lead to more accurate segmentation results, but
the performance does not change much, demonstrating the
robustness of our model against optical flow inputs.
6) Attribute Analysis: Fig. 8 illustrates the performance
comparison of different variants in the ablation study under
various video attributes. The performance is consistent with
that reported in Table VI. All three modules (i.e., MAT, SSA
and HEM) are critical for our model to improve performance.
7) Qualitative Comparison: Fig. 9 shows visual results of
the above ablation studies on two sequences. We see that all
of the network variants produce worse results compared with
our full model. It is worth noting that the MAT block has the
greatest visual influence on the performance.
VI. CONCLUSION
In this paper, we proposed a novel MATNet for ZVOS.
We introduced a new way to learn rich spatiotemporal object
representations with an interleaved encoder, which encour-
ages knowledge propagation from motion to appearance in a
hierarchical manner. The spatiotemporal features are further
processed by a bridge network to produce more compact
representations, which are subsequently fed into a boundary-
aware decoder to obtain accurate segmentation in a top-down
fashion. We compared the proposed model with other state-of-
the-art ZVOS methods over four large-scale benchmarks and
the experimental results demonstrated that it achieves favor-
able performance against other contenders. Benefiting from
the powerful interleaved encoder for representation learning
in videos, our model also showed compelling performance
in the DVAP task. In the future, we will further extend it
to other video analysis tasks, such as action recognition and
video classification.
REFERENCES
[1] N. Xu, B. Price, S. Cohen, J. Yang, and T. S. Huang, “Deep
interactive object selection,” in Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, 2016, pp.
373–381.
[2] H. Hadizadeh and I. V. Bajić, “Saliency-aware video compres-
sion,” IEEE Transactions on Image Processing, vol. 23, no. 1,
pp. 19–33, 2013.
[3] T. Zhou, W. Wang, S. Qi, H. Ling, and J. Shen, “Cascaded
human-object interaction recognition,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition,
2020, pp. 4263–4272.
[4] Z. Zhang, S. Fidler, and R. Urtasun, “Instance-level segmentation for autonomous driving with deep densely connected MRFs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 669–677.
[5] A. Papazoglou and V. Ferrari, “Fast object segmentation in
unconstrained video,” in Proceedings of the IEEE International
Conference on Computer Vision, 2013, pp. 1777–1784.
[6] W. Wang, J. Shen, and F. Porikli, “Saliency-aware geodesic
video object segmentation,” in Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, 2015, pp.
3395–3402.
[7] A. Faktor and M. Irani, “Video segmentation by non-local
consensus voting,” in Proceedings of the British Machine Vision
Conference, 2014, pp. 8–20.
[8] W. Wang, J. Shen, R. Yang, and F. Porikli, “Saliency-aware video object segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 1, pp. 20–33, 2017.
[9] T. Brox and J. Malik, “Object segmentation by long term
analysis of point trajectories,” in European Conference on
Computer Vision, 2010, pp. 282–295.
[10] P. Ochs and T. Brox, “Object segmentation in video: a hierarchi-
cal variational approach for turning point trajectories into dense
regions,” in Proceedings of the IEEE International Conference
on Computer Vision, 2011, pp. 1583–1590.
[11] P. Ochs, J. Malik, and T. Brox, “Segmentation of moving objects
by long term video analysis,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 36, no. 6, pp. 1187–
1200, 2013.
[12] M. Keuper, B. Andres, and T. Brox, “Motion trajectory seg-
mentation via minimum cost multicuts,” in Proceedings of the
IEEE International Conference on Computer Vision, 2015, pp.
3271–3279.
[13] K. Fragkiadaki, G. Zhang, and J. Shi, “Video segmentation by
tracing discontinuities in a trajectory embedding,” in Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2012, pp. 1846–1853.
[14] D. Zhang, O. Javed, and M. Shah, “Video object segmentation
through spatially accurate and temporally dense extraction of
primary object regions,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2013, pp. 628–
635.
[15] Y. J. Lee, J. Kim, and K. Grauman, “Key-segments for video
object segmentation,” in Proceedings of the IEEE International
Conference on Computer Vision, 2011, pp. 1995–2002.
[16] C. Ventura, M. Bellver, A. Girbau, A. Salvador, F. Marques, and
X. Giro-i Nieto, “RVOS: End-to-end recurrent network for video
object segmentation,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2019, pp. 5277–
5286.
[17] W. Wang, X. Lu, J. Shen, D. J. Crandall, and L. Shao,
“Zero-shot video object segmentation via attentive graph neural
networks,” in Proceedings of the IEEE International Conference
on Computer Vision, 2019, pp. 9236–9245.
[18] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell,
“Zero-shot learning with semantic output codes,” in Advances in
Neural Information Processing Systems, 2009, pp. 1410–1418.
[19] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool, “One-shot video object segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 221–230.
[20] L. L. Cloutman, “Interaction between dorsal and ventral pro-
cessing streams: Where, when and how?” Brain and Language,
vol. 127, no. 2, pp. 251 – 263, 2013.
[21] P. Mital, T. J. Smith, S. Luke, and J. Henderson, “Do low-level
visual features have a causal influence on gaze during dynamic
scene viewing?” Journal of Vision, vol. 13, no. 9, pp. 144–144,
2013.
[22] E. S. Spelke, “Principles of object perception,” Cognitive sci-
ence, vol. 14, no. 1, pp. 29–56, 1990.
[23] Y. Ostrovsky, E. Meyers, S. Ganesh, U. Mathur, and P. Sinha,
“Visual parsing after recovery from blindness,” Psychological
Science, vol. 20, no. 12, pp. 1484–1491, 2009.
[24] S. E. Palmer, Vision science: Photons to phenomenology. MIT
press, 1999.
[25] M. Wertheimer, “Laws of organization in perceptual forms.”
1938.
[26] P. Tokmakov, K. Alahari, and C. Schmid, “Learning motion
patterns in videos,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2017, pp. 3386–
3394.
[27] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and
A. Sorkine-Hornung, “Learning video object segmentation from
static images,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2017, pp. 2663–
2672.
[28] S. D. Jain, B. Xiong, and K. Grauman, “Fusionseg: Learning to
combine motion and appearance for fully automatic segmenta-
tion of generic objects in videos,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2017,
pp. 2117–2126.
[29] J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang, “Segflow: Joint
learning for video object segmentation and optical flow,” in
Proceedings of the IEEE International Conference on Computer
Vision, 2017, pp. 686–695.
[30] H. Xiao, J. Feng, G. Lin, Y. Liu, and M. Zhang, “Monet:
Deep motion exploitation for video object segmentation,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 1140–1148.
[31] S. Li, B. Seybold, A. Vorobyov, X. Lei, and C.-C. Jay Kuo,
“Unsupervised video object segmentation with motion-based bi-
lateral networks,” in European Conference on Computer Vision,
2018, pp. 207–223.
[32] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool,
M. Gross, and A. Sorkine-Hornung, “A benchmark dataset
and evaluation methodology for video object segmentation,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2016, pp. 724–732.
[33] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool, “The 2017 DAVIS challenge on video object segmentation,” arXiv preprint arXiv:1704.00675, 2017.
[34] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari,
“Learning object class detectors from weakly annotated video,”
in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2012, pp. 3282–3289.
[35] S. Mathe and C. Sminchisescu, “Actions in the eye: Dynamic gaze datasets and learnt saliency models for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 7, pp. 1408–1424, 2014.
[36] T. Zhou, S. Wang, Y. Zhou, Y. Yao, J. Li, and L. Shao, “Motion-attentive transition for zero-shot video object segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[37] F. Perazzi, O. Wang, M. Gross, and A. Sorkine-Hornung,
“Fully connected object proposals for video segmentation,” in
Proceedings of the IEEE International Conference on Computer
Vision, 2015, pp. 3227–3234.
[38] I. Endres and D. Hoiem, “Category independent object propos-
als,” in European Conference on Computer Vision, 2010, pp.
575–588.
[39] R. Yao, G. Lin, S. Xia, J. Zhao, and Y. Zhou, “Video object
segmentation and tracking: A survey,” ACM Transactions on
Intelligent Systems and Technology, vol. 11, no. 4, pp. 1–47,
2020.
[40] W. Wang, J. Shen, and L. Shao, “Video salient object detection via fully convolutional networks,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 38–49, 2018.
[41] X. Lu, W. Wang, C. Ma, J. Shen, L. Shao, and F. Porikli, “See
more, know more: Unsupervised video object segmentation with
co-attention siamese networks,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2019,
pp. 3623–3632.
[42] Y.-T. Hu, J.-B. Huang, and A. G. Schwing, “Unsupervised
video object segmentation using motion saliency-guided spatio-
temporal propagation,” in European Conference on Computer
Vision, 2018, pp. 786–802.
[43] W. Wang, H. Song, S. Zhao, J. Shen, S. Zhao, S. C. H. Hoi,
and H. Ling, “Learning unsupervised video object segmentation
through visual attention,” in Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2019, pp.
3064–3074.
[44] P. Tokmakov, C. Schmid, and K. Alahari, “Learning to segment moving objects,” International Journal of Computer Vision, vol. 127, no. 3, pp. 282–301, 2019.
[45] S. Li, B. Seybold, A. Vorobyov, A. Fathi, Q. Huang, and C.-C.
Jay Kuo, “Instance embedding transfer to unsupervised video
object segmentation,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2018, pp. 6526–
6535.
[46] T. Zhuo, Z. Cheng, P. Zhang, Y. Wong, and M. Kankanhalli,
“Unsupervised online video object segmentation with motion
property understanding,” IEEE Transactions on Image Process-
ing, vol. 29, pp. 237–249, 2019.
[47] X. Lu, W. Wang, M. Danelljan, T. Zhou, J. Shen, and
L. Van Gool, “Video object segmentation with episodic graph
memory networks,” in European Conference on Computer Vi-
sion, 2020.
[48] P. Tokmakov, K. Alahari, and C. Schmid, “Learning video
object segmentation with visual memory,” in Proceedings of
the IEEE International Conference on Computer Vision, 2017,
pp. 4481–4490.
[49] M. Faisal, I. Akhter, M. Ali, and R. Hartley, “Exploiting geometric constraints on dense trajectories for motion saliency,” in Winter Conference on Applications of Computer Vision, 2019.
[50] K. Simonyan and A. Zisserman, “Two-stream convolutional
networks for action recognition in videos,” in Advances in
Neural Information Processing Systems, 2014, pp. 568–576.
[51] C. Feichtenhofer, A. Pinz, and R. Wildes, “Spatiotemporal
residual networks for video action recognition,” in Advances in
Neural Information Processing Systems, 2016, pp. 3468–3476.
[52] C. Feichtenhofer, A. Pinz, and R. P. Wildes, “Spatiotemporal
multiplier networks for video action recognition,” in Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2017, pp. 4768–4777.
[53] H. Song, W. Wang, S. Zhao, J. Shen, and K.-M. Lam, “Pyramid
dilated deeper convlstm for video salient object detection,” in
European Conference on Computer Vision, 2018, pp. 715–731.
[54] H. Li, G. Chen, G. Li, and Y. Yu, “Motion guided attention
for video salient object detection,” in Proceedings of the IEEE
International Conference on Computer Vision, 2019, pp. 7274–
7283.
[55] C. Bak, A. Kocak, E. Erdem, and A. Erdem, “Spatio-temporal saliency networks for dynamic saliency prediction,” IEEE Transactions on Multimedia, vol. 20, no. 7, pp. 1688–1698, 2017.
[56] L. Jiang, M. Xu, T. Liu, M. Qiao, and Z. Wang, “DeepVS: A deep learning based video saliency prediction approach,” in European Conference on Computer Vision, 2018, pp. 625–642.
[57] W. Wang, J. Shen, J. Xie, M.-M. Cheng, H. Ling, and A. Borji, “Revisiting video saliency prediction in the deep learning era,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, pp. 1–1, 2019.
[58] Q. Lai, W. Wang, H. Sun, and J. Shen, “Video saliency prediction using spatiotemporal residual attentive networks,” IEEE Transactions on Image Processing, vol. 29, pp. 1113–1126, 2019.
[59] K. Xu, L. Wen, G. Li, L. Bo, and Q. Huang, “Spatiotem-
poral cnn for video object segmentation,” arXiv preprint
arXiv:1904.02363, 2019.
[60] N. Srivastava, E. Mansimov, and R. Salakhudinov, “Unsuper-
vised learning of video representations using lstms,” in Pro-
ceedings of the International Conference on Machine Learning,
2015, pp. 843–852.
[61] G. Li, Y. Xie, T. Wei, K. Wang, and L. Lin, “Flow guided
recurrent neural encoder for video salient object detection,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 3243–3252.
[62] Y. Yu, J. Choi, Y. Kim, K. Yoo, S.-H. Lee, and G. Kim,
“Supervising neural attention models for video captioning by
human gaze data,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2017, pp. 490–498.
[63] O. Le Meur, P. Le Callet, D. Barba, and D. Thoreau, “A
coherent computational approach to model bottom-up visual
attention,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 28, no. 5, pp. 802–817, 2006.
[64] N. Bruce and J. Tsotsos, “Saliency based on information
maximization,” in Advances in Neural Information Processing
Systems, 2006, pp. 155–162.
[65] J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,” in Advances in Neural Information Processing Systems, 2007, pp. 545–552.
[66] J. M. Wolfe, K. R. Cave, and S. L. Franzel, “Guided search: An alternative to the feature integration model for visual search,” Journal of Experimental Psychology: Human Perception and Performance, vol. 15, no. 3, p. 419, 1989.
[67] C. Koch and S. Ullman, “Shifts in selective visual attention:
towards the underlying neural circuitry,” in Matters of intelli-
gence. Springer, 1987, pp. 115–141.
[68] W. Wang, J. Shen, F. Guo, M.-M. Cheng, and A. Borji,
“Revisiting video saliency: A large-scale benchmark and a new
model,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2018, pp. 4894–4903.
[69] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proceedings of the International Conference on Learning Representations, 2015.
[70] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[71] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, “CBAM: Con-
volutional block attention module,” in European Conference on
Computer Vision, 2018, pp. 3–19.
[72] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked
attention networks for image question answering,” in Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2016, pp. 21–29.
[73] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[74] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang,
and X. Tang, “Residual attention network for image classifi-
cation,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2017, pp. 3156–3164.
[75] S. Xie and Z. Tu, “Holistically-nested edge detection,” in
Proceedings of the IEEE International Conference on Computer
Vision, 2015, pp. 1395–1403.
[76] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and
A. L. Yuille, “Deeplab: Semantic image segmentation with
deep convolutional nets, atrous convolution, and fully connected
crfs,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[77] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel
matters–improve semantic segmentation by global convolutional
network,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2017, pp. 4353–4361.
[78] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8934–8943.
[79] S. Caelles, J. Pont-Tuset, F. Perazzi, A. Montes, K.-K. Maninis, and L. Van Gool, “The 2019 DAVIS challenge on VOS: Unsupervised multi-object segmentation,” arXiv preprint arXiv:1905.00737, 2019.
[80] Y. J. Koh and C.-S. Kim, “Primary object segmentation in
videos based on region augmentation and reduction,” in Pro-
ceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2017, pp. 7417–7425.
[81] M. Siam, C. Jiang, S. Lu, L. Petrich, M. Gamal, M. Elhoseiny,
and M. Jagersand, “Video object segmentation using teacher-
student adaptation in a human robot interaction (hri) setting,”
in International Conference on Robotics and Automation, 2019,
pp. 50–56.
[82] Z. Yang, Q. Wang, L. Bertinetto, W. Hu, S. Bai, and P. H. Torr, “Anchor diffusion for unsupervised video object segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 931–940.
[83] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[84] J. Luiten, P. Voigtlaender, and B. Leibe, “PReMVOS: Proposal-generation, refinement and merging for video object segmentation,” in Asian Conference on Computer Vision, 2018, pp. 565–580.
[85] J. Luiten, I. E. Zulfikar, and B. Leibe, “Unovost: Unsupervised
offline video object segmentation and tracking,” in Winter Con-
ference on Applications of Computer Vision, 2020, pp. 2000–
2009.
[86] X. Huang, C. Shen, X. Boix, and Q. Zhao, “Salicon: Reducing
the semantic gap in saliency prediction by adapting deep neural
networks,” in Proceedings of the IEEE International Conference
on Computer Vision, 2015, pp. 262–270.
[87] N. Xu, L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. Price,
S. Cohen, and T. Huang, “Youtube-vos: Sequence-to-sequence
video object segmentation,” in European Conference on Com-
puter Vision, 2018, pp. 585–601.
[88] P. Linardos, E. Mohedano, J. J. Nieto, N. E. O’Connor, X. Giro-
i Nieto, and K. McGuinness, “Simple vs complex temporal
recurrences for video saliency prediction,” in Proceedings of
the British Machine Vision Conference, 2019.
[89] K. Min and J. J. Corso, “Tased-net: Temporally-aggregating spa-
tial encoder-decoder network for video saliency detection,” in
Proceedings of the IEEE International Conference on Computer
Vision, 2019, pp. 2394–2403.
[90] T.-W. Hui, X. Tang, and C. C. Loy, “Liteflownet: A lightweight
convolutional neural network for optical flow estimation,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 8981–8989.
[91] A. Ranjan and M. J. Black, “Optical flow estimation using a
spatial pyramid network,” in Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2017, pp.
4161–4170.