
MATNet: Motion-Attentive Transition Network for Zero-Shot Video Object Segmentation


Tianfei Zhou, Jianwu Li, Shunzhou Wang, Ran Tao, Senior Member, IEEE
and Jianbing Shen, Senior Member, IEEE
Abstract—In this paper, we present a novel end-to-end learning
neural network, i.e., MATNet, for zero-shot video object
segmentation (ZVOS). Motivated by human visual attention behavior,
MATNet leverages motion cues as a bottom-up signal to guide the
perception of object appearance. To achieve this, an asymmetric
attention block, named Motion-Attentive Transition (MAT), is
proposed within a two-stream encoder network to first identify
moving regions and then attend appearance learning to capture
the full extent of objects. By placing MATs in different
convolutional layers, our encoder becomes deeply interleaved,
allowing for close hierarchical interactions between object
appearance and motion. Such a biologically-inspired design is
shown to be superior to conventional two-stream structures, which
treat motion and appearance independently in separate streams and
often suffer from severe overfitting to object appearance.
Moreover, we introduce a bridge network to modulate multi-scale
spatiotemporal features into more compact, discriminative and
scale-sensitive representations, which are subsequently fed into a
boundary-aware decoder network to produce accurate segmentation
with crisp boundaries. We perform extensive quantitative and
qualitative experiments on four challenging public benchmarks,
i.e., DAVIS16, DAVIS17, FBMS and YouTube-Objects. Results show
that our method achieves compelling performance against current
state-of-the-art ZVOS methods. To further demonstrate the
generalization ability of our spatiotemporal learning framework,
we extend MATNet to another relevant task: dynamic visual
attention prediction (DVAP). The experiments on two popular
datasets (i.e., Hollywood-2 and UCF-Sports) further verify the
superiority of our model.¹
Index Terms—Video object segmentation, zero-shot, two-
stream, spatiotemporal representation, neural attention, dynamic
visual attention prediction.
This work was supported in part by the Beijing Natural Science Foundation
(No. L191004) and the National Natural Science Foundation of China (No.
61271374). (Corresponding author: Jianwu Li)
T. Zhou and J. Shen are with Inception Institute of Artificial Intelligence,
Abu Dhabi, UAE. (Email: {ztfei.debug, shenjianbingcg}
J. Li and S. Wang are with Beijing Laboratory of Intelligent Information
Technology, School of Computer Science, Beijing Institute of Technology,
Beijing, China. (Email:
R. Tao is with School of Information and Electronics, Beijing Institute
of Technology, Beijing, China, and also with Beijing Key Laboratory of
Fractional Signals and Systems, Beijing, China.
¹ Our code is available at tfzhou/MATNet
The task of automatically identifying primary object(s) from
videos has gained significant attention over the past decade,
owing to its academic value and practical significance in many
areas, such as robotics [1], video compression [2],
human-object interaction [3] and autonomous driving [4]. However,
due to the lack of human intervention, in addition to typical
challenging factors posed by video data (e.g., occlusions,
motion blur, object deformations, cluttered background), the
task suffers from great difficulties in accurately distinguishing
the most prominent objects throughout a video sequence.
Early non-learning methods are built upon handcrafted
features (e.g., motion boundary [5], saliency [6, 7, 8], point
trajectories [9, 10, 11, 12, 13]) and rely heavily on classic
heuristics in video segmentation (e.g., object proposal
ranking [14, 15], spatiotemporal coherency [5], long-term
trajectory clustering [9]). Although these methods can work in a
purely unsupervised way, they suffer from the limited
representational power of handcrafted features. More recently,
research has turned towards the deep learning paradigm, with
several studies [16, 17] casting this problem as a zero-shot
solution. These approaches follow the zero-shot learning
paradigm [18] to learn from large-scale video data and can
generalize well to test videos that never appear in the training
set, without any human involvement. This is different from
one-shot video object segmentation (OVOS) [19], which requires
first-frame annotations for model adaptation to test data in the
inference phase.
Even before the era of deep learning, object motion has
always been considered as one of the most important cues for
automatic video object segmentation. This is largely inspired
by the human vision system (HVS), which has remarkable
motion perception capabilities to quickly orient human at-
tention to moving objects in dynamic scenes [20, 21]. In
fact, it has been demonstrated that infants [22] and newly
sighted congenitally blind people [23] tend to over-segment
static objects, even if they are strongly contrasted against
their surroundings; however, they can easily group things
together once the objects start moving, following the Gestalt
principle of common fate [24, 25]. These abilities enable us
to easily discover never-before-seen moving objects, before
knowing their particular semantic names. Motion surely does
not work alone. Recent studies [20] have revealed that, in
HVS, motion-based perception appears early, while static
perception is acquired later, and possibly bootstrapped by
motion cues to focus more on processing the most salient
objects. In this work, we take these biological mechanisms into
account and design our model to reflect such human behaviors,
i.e., first orienting rough attention to the moving parts of
objects, and then transferring the attention to object appearance
which provides a generic objectness prior for capturing the
whole picture of objects. In this way, our model is able to
learn a more effective spatiotemporal object representation,
encouraging more robust video object segmentation.
By considering knowledge propagation from object motion
to appearance, valuable temporal context can be exploited to
alleviate possible ambiguities in object appearance (e.g., visual
similarity to the background), thus facilitating representation
learning. However, in the context of deep learning, current
segmentation models often overlook this potential. Most prior
works [26, 27, 28, 29] simply treat motion cues as being equal
to appearance and learn to predict segmentation masks from
motion or appearance independently. Some approaches [30,
31] utilize motion cues to enrich object representations, but
they rely on complex heuristics and only work at a single
scale, ignoring the critical hierarchical structure.
[Figure 1 labels: Encoder, Decoder, Boundary GT, Boundary HEM, Segmentation GT; input size 473×473]
Fig. 1: Pipeline of MATNet. The frame $I_a$ and flow $I_m$ are first
input into the interleaved encoder to extract multi-scale
spatiotemporal features $\{\hat{U}_i\}_{i=2}^{5}$. At each residual stage
i, we break the original information flow in ResNet. Instead, a MAT
module is proposed to create a new interleaved information flow by
simultaneously considering motion $V_{m,i}$ and appearance $V_{a,i}$.
$\hat{U}_i$ is further fed into the boundary-aware decoder via the bridge
network to obtain boundary results $M^b$ and segmentation results $M^s$.
Motivated by these observations, we propose a Motion-
Attentive Transition Network (MATNet) for zero-shot video
object segmentation (ZVOS). Fig. 1 illustrates its pipeline,
which has an encoder-bridge-decoder structure. The core of
MATNet is a deeply interleaved two-stream encoder which
not only inherits the advantages of traditional two-stream
networks for multi-modal learning, but also progressively
transfers intermediate motion features to facilitate more robust
appearance learning. The transition is carried out by multiple
Motion-Attentive Transition (MAT) modules. Each MAT takes
as input the intermediate features of both the input image
and optical flow field at a convolutional stage, and produces
informative spatiotemporal features for the following stage.
For each MAT, an asymmetric attention mechanism is built
to first infer regions of interest based on optical flow, and
then transfer the inference to provide better selectivity for
appearance features. The deep interleaved structure captures
the intrinsic characteristics of the human vision system and
brings immediate improvement in segmentation accuracy.
Given the powerful spatiotemporal features from the en-
coder, we design a decoder network to infer pixel-accurate
object segmentation through a top-down refinement process.
The decoder progressively refines high-level semantic fea-
tures using spatially rich low-level features via a cascade
of Boundary-Aware Refinement (BAR) modules. Each BAR
is responsible for generating features with finer structures,
under the assistance of salient boundary detection. Beyond
traditional methods that connect the encoder and decoder via
skip connections, we introduce a lightweight attention module,
i.e., Scale-Sensitive Attention (SSA), to connect each pair
of encoder and decoder layers. SSA adaptively modulates
spatiotemporal convolution features before sending them to
the decoder. More specifically, SSA is a two-level attention
scheme in which the local level serves to highlight the most
informative features and suppress useless information, while
the global level helps to re-calibrate features for objects with
different scales.
MATNet can be easily instantiated with various backbones,
and optimized in an end-to-end manner. We perform exten-
sive experiments on four popular video object segmentation
datasets, i.e., DAVIS16 [32], DAVIS17 [33], FBMS [11] and
YouTube-Objects [34], on which the proposed method yields
consistent performance improvements over the state of the art.
Additionally, we showcase the advantages and generalizability
of our framework via the task of dynamic visual attention
prediction (DVAP). Our model is proven to generalize well to
the DVAP task and produce reliable dynamic-fixation predic-
tion results over two large-scale benchmarks, i.e., Hollywood-
2 [35] and UCF-Sports [35].
In summary, the main contributions of this paper are
three-fold: First, we propose a novel interleaved two-stream
network architecture to learn powerful spatiotemporal object
representations for ZVOS. This is achieved by an asymmetric
attention module, i.e., MAT, that accounts for object motion
and appearance interactions in a more comprehensive way.
Second, we introduce a boundary-aware decoder to obtain seg-
mentation with crisp object boundaries. The decoder learned
with a novel adapted cross-entropy loss produces accurate
boundaries in regions of primary objects. Third, based on
these designs, our MATNet consistently outperforms state-
of-the-art methods over several ZVOS benchmarks and also
shows superior performance in instance-level segmentation
and the DVAP task.
This paper builds upon our conference paper [36] and sig-
nificantly extends it in various aspects: First, to demonstrate
the effectiveness of our model, we extend it to an instance-
level segmentation setting, which is more challenging and
essential for practical cases in which multiple instances may
appear. Second, we examine our model for the DVAP task,
and it outperforms all specialized methods on two large-
scale benchmarks, demonstrating its generality. Third, we
also provide a more inclusive and insightful overview of the
recent work on video object segmentation, motion-aware video
analysis and dynamic visual attention prediction. Last but not
least, we report many more experimental results and conduct
more ablation studies (e.g., attribute-based analysis, impacts
of different optical flow methods) for thorough and in-depth
examinations of our model.
Our model is related to four lines of research, i.e., auto-
matic video object segmentation, motion-aware modeling in
video analysis, dynamic visual attention prediction and neural
attention. We will briefly discuss each of them.
A. Automatic Video Object Segmentation
A large number of methods have been proposed for automatic
(or unsupervised) video object segmentation, aiming to
segment conspicuous and eye-catching objects without any
human intervention. Many non-deep learning methods are
based on hand-crafted features and rely on certain heuristics
(e.g., saliency, object proposal ranking, trajectory clustering).
For instance, [6, 7, 8] take visual saliency as prior knowledge
to guide object segmentation, while [7, 14, 15, 37] infer the
object regions from hundreds of object candidates [38]. Object
motion is also widely used as a reliable cue for identifying
objects. [5] detects motion boundaries to determine foreground
regions. [9, 10, 11, 12, 13] take advantage of long-term point
trajectories for motion segmentation, making them more robust
to occlusions. Please refer to [39] for a more comprehensive
review of these approaches.
In recent years, with the renaissance of neural networks
in computer vision, deep learning based solutions are now
dominant in this field. Many approaches[16, 26, 28, 31, 40, 41,
42, 43, 44, 45, 46, 47] solve the task with zero-shot solutions,
which require no additional annotations during inference and
are thus more flexible for automatic video analysis. For exam-
ple, [43] proposes a dynamic visual attention-driven model for
video object segmentation, and [17, 41] mine higher-order re-
lations between video frames, resulting in more comprehensive
understanding of video content and more accurate foreground
estimation. However, these approaches only rely on object
appearance, and can thus easily fail in cases where objects are
visually similar to the background. To cope with this, many
approaches discover the motion patterns of objects [26] as
complementary cues to object appearance. This is typically
achieved within two-stream network architectures [28, 48],
in which an RGB image and the corresponding flow field
are separately processed by two independent networks and
the results are fused to produce the final segmentation. Some
methods [42, 49] design complex heuristics to fuse motion
and appearance for better segmentation. However, a major
drawback of these approaches is that they fail to consider
the importance of deep interactions between appearance and
motion in learning rich spatiotemporal features. To address
this issue, we propose a deep interleaved two-stream encoder,
in which a motion transition module is leveraged for more
effective representation learning.
B. Motion-Aware Modeling in Video Analysis
Deep learning models have been widely used in various
video-related tasks, such as action recognition [50, 51, 52],
video salient object detection [40, 53, 54] and dynamic visual
attention prediction [55, 56, 57, 58]. The most significant
difference between static images and videos is that objects
in videos are moving, which is a key factor that draws human
attention. Therefore, how to incorporate object motion into the
design of neural networks has been a critical issue in deep
learning-based video analysis.
Many approaches [40, 59] learn temporal coherence in-
formation by simply feeding consecutive frames into fully
convolutional networks. These methods are computationally
efficient; however, since they do not employ explicit motion
information (e.g., optical flow), they are sensitive to cluttered
and distracting backgrounds. Some other models consider
recurrent neural networks to capture long-range spatiotemporal
features [53, 60, 61]. However, all these models ignore the
complementary roles of spatial and temporal information.
This issue has been well addressed by the famous two-
stream ConvNet architecture proposed in [50], which consists
of spatial and temporal networks to better capture the comple-
mentary information of object appearance and motion. It has
achieved great success on human action recognition in videos.
Along this line, [51] injects residual connections into the
two-stream architecture to allow spatiotemporal interactions
between two modalities, while [44, 52] further improve such
a spatiotemporal residual network with multiplicative gating
functions. These two-stream architectures have also shown
strong performance in video object processing tasks, like video
object segmentation [28, 29, 54] and dynamic video attention
prediction [55, 58]. Despite this, current two-stream networks
tend to fuse motion and appearance features with a simple
gating mechanism and are limited in their use of local context.
In this work, we reconsider the interactions between object
motion and appearance with an asymmetric attention module,
which utilizes motion-attentive features to promote the
appearance features in a hierarchical manner. The powerful
representation ability of our model is verified on both the ZVOS
and DVAP tasks.
C. Dynamic Visual Attention Prediction
Dynamic visual attention prediction, or dynamic fixation
prediction, is closely related to ZVOS. Rather than targeting
object-level saliency prediction, DVAP aims to identify
observers' fixations during dynamic scene viewing. The task is
useful for machines to understand human attentional behaviors
and has shown great potential in many practical applications
(e.g., object segmentation [43], video captioning [62]). Early
DVAP methods [63, 64, 65] largely relied on hand-crafted,
biologically-inspired features (e.g., color, optical flow) and
on theories of visual attention from the cognitive domain (e.g.,
guided search [66], attention shift [67]). Recently, deep
learning-based methods have become mainstream and generally
yield better performance. Representative works use two-stream
networks to account for multi-modal features [55, 58] or LSTMs
for sequential fixation prediction over consecutive frames [68].
[Figure 2 block labels: Soft Attention (two units), 1×1 conv, softmax + fc]
Fig. 2: Computational graph of MAT. $\otimes$ and the circled "c"
indicate matrix multiplication and concatenation operations,
respectively.
Although MATNet is originally designed for the object-
aware segmentation task, we show that it also achieves re-
markable performance on the DVAP task (V-C). This can
be largely attributed to the proposed encoder network, which
can provide informative spatiotemporal features to capture the
most important parts of the visual stimuli.
D. Neural Attention
Neural attention mechanisms, which are derived from human
perception, have been widely studied in deep neural
networks and yield significant improvements for various tasks,
e.g., neural machine translation [69], object recognition [70,
71], and visual question answering [54, 72], to name a few
representative ones. Neural attention simulates the human
selective attention mechanism, allowing the networks to focus
on the most informative parts of the inputs.
Neural attention mechanisms have also been used in recent
ZVOS approaches [17, 41], which aim to mine consistent
object patterns among video frames. Our idea is fundamentally
different from theirs. We propose an asymmetric attention
module (i.e., MAT) to mimic human attention behavior in
dynamic scenarios. It encourages more comprehensive in-
teractions between object motion and appearance, yielding
more powerful spatiotemporal features. Besides, we extend
MAT into a deeper version to conduct multi-step reasoning of
spatiotemporal attention, which can highlight more accurate
target regions, especially for complex scenarios. In addition,
MATs are incorporated into multiple convolutional layers,
leading to an entirely different network architecture, which
is expected to benefit various video analysis tasks.
A. Network Overview
We propose an end-to-end deep neural network, i.e., MATNet,
for ZVOS, which leverages motion cues to effectively
bootstrap the perception of object appearance. More specifically,
our approach is designed as a unified framework of three tightly
coupled sub-networks: Interleaved Encoder Network, Bridge
Network and Boundary-Aware Decoder Network. The pipeline
is illustrated in Fig. 1.
Fig. 3: Illustration of effects of the MAT module. (a) and (b)
are inputs of images and optical flow fields. (c) and (d) denote
feature maps in $V_a$ and $\hat{U}_a$. As seen, the MAT module can
effectively emphasize important object regions and suppress
background responses, benefiting the segmentation.
1) Interleaved Encoder Network: The encoder resorts to a
two-stream structure to jointly capture the spatial and temporal
information, which has been proven effective in many related
video analysis tasks [50, 51, 52]. In contrast to previous works,
which treat the two streams equally, our encoder incorporates
a MAT module (III-B) into each network layer, which offers
a motion-to-appearance pathway for information exchange.
Such a design enables us to learn more powerful spatiotempo-
ral object representations. More technically, we take the first
five convolutional blocks of ResNet-101 [73] as the backbone
for each stream. Given an RGB frame $I_a \in \mathbb{R}^{w\times h\times 3}$
and its optical flow field $I_m \in \mathbb{R}^{w\times h\times 3}$, the
encoder first extracts intermediate appearance and motion features
separately at the i-th ($i \in \{2,3,4,5\}$) residual stage, denoted
as $V_{a,i} \in \mathbb{R}^{W\times H\times C}$ and
$V_{m,i} \in \mathbb{R}^{W\times H\times C}$, where W, H and C represent
the spatial width, height and channel number of the feature tensors,
respectively. The features are subsequently enhanced by a
MAT module $F_{MAT}$ as:

$\hat{U}_{a,i}, \hat{U}_{m,i} = F_{MAT}(V_{a,i}, V_{m,i})$, (1)

where $\hat{U}_{\cdot,i} \in \mathbb{R}^{W\times H\times C}$ represents the
enriched features. For the i-th stage, the spatiotemporal object
representation $\hat{U}_i$ is obtained as
$\hat{U}_i = \mathrm{Concat}(\hat{U}_{a,i}, \hat{U}_{m,i}) \in \mathbb{R}^{W\times H\times 2C}$, which is
further fed into the down-stream decoder via a bridge network.
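To make the tensor shapes in Eq. (1) concrete, the short sketch below lists the size of the concatenated feature at each stage for a 473×473 input (the input size shown in Fig. 1). The channel widths and strides are the standard ResNet-101 values and are an assumption here, since the text does not list them; the helper name `encoder_feature_shapes` is hypothetical.

```python
import math

# Assumed values (not given in the text): standard ResNet-101 stage widths
# and cumulative strides for residual stages 2 to 5.
STAGE_CHANNELS = {2: 256, 3: 512, 4: 1024, 5: 2048}
STAGE_STRIDES = {2: 4, 3: 8, 4: 16, 5: 32}


def encoder_feature_shapes(w, h):
    """Return {stage: (W, H, 2C)} for the concatenated features of each stage."""
    shapes = {}
    for i in (2, 3, 4, 5):
        stride = STAGE_STRIDES[i]
        c = STAGE_CHANNELS[i]
        # Each stream yields a W x H x C tensor; concatenating the appearance
        # and motion streams along the channel axis gives 2C channels.
        shapes[i] = (math.ceil(w / stride), math.ceil(h / stride), 2 * c)
    return shapes


if __name__ == "__main__":
    for stage, shape in encoder_feature_shapes(473, 473).items():
        print(stage, shape)
```

Under these assumptions the finest stage (i = 2) is roughly 1/4 of the input resolution, which matches the resolution the decoder is described as recovering before its final upsampling.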
2) Bridge Network: The bridge network is responsible for
selecting informative spatiotemporal features for the decoder.
It is built upon several SSA modules (III-C), each of which
takes advantage of $U_i$ at the i-th stage, attending to it both
locally and globally to produce the attentive feature $Z_i$, with a
unified attention module. The local attention adopts channel-
wise and spatial-wise attention mechanisms to highlight the
correct object regions and suppress possible noise existing in
the redundant features, while the global attention aims to re-
calibrate the features to account for objects of different sizes.
3) Boundary-Aware Decoder Network: The decoder net-
work adopts a coarse-to-fine scheme to conduct segmentation
inference. It consists of four BAR modules (III-D), i.e.,
$F_{BAR_i}$, $i \in \{2,3,4,5\}$, each corresponding to the i-th residual
block. From $F_{BAR_5}$ to $F_{BAR_2}$, the resolution of the feature maps
gradually increases by compensating for high-level coarse
features with more low-level details. $F_{BAR_2}$ produces the
finest feature maps, whose resolution is 1/4 of the input
image size. They are sequentially processed by three additional
layers, i.e., conv(3×3, 1), upsampling and sigmoid, to obtain the
final mask output $M^s \in \mathbb{R}^{w\times h}$.
In the following, we introduce the proposed modules (i.e.,
MAT, SSA, BAR) in detail. For simplicity, we omit the
subscript i.
B. Motion-Attentive Transition Module
Each MAT module is comprised of two soft attention units
and one attention transition unit, as depicted in Fig. 2. The
soft attention units help to emphasize the most informative
regions in the appearance or motion feature maps, while
the transition unit transfers the attentive motion features to
facilitate spatiotemporal feature learning.
1) Soft Attention: This unit softly weights the input feature
map $V_m$ (or $V_a$) at each spatial location. Taking $V_m$ as an
example, this unit outputs a motion-attentive feature
$U_m \in \mathbb{R}^{W\times H\times C}$ as follows:

Softmax attention: $A_m = \mathrm{softmax}(W_m(V_m))$,
Attention-enhanced feature: $U_m^c = A_m \odot V_m^c$, (2)

where $W_m$ is a 1×1 convolution that transforms $V_m$ into
an importance map, which is normalized using a softmax
operation to generate an attention map $A_m \in \mathbb{R}^{W\times H}$, where
$\sum_{i=1}^{WH} A_m^i = 1$. Here, each value $A_m^i$ is the probability
with which our model believes the corresponding location
is important. $U_m^c$ and $V_m^c$ indicate the 2D feature slices of
$U_m$ and $V_m$ at the c-th channel, respectively. $\odot$ denotes the
Hadamard product. Similarly, given $V_a$, we can obtain the
appearance-attentive feature $U_a$ by Eq. (2).
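The soft attention unit of Eq. (2) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the released implementation: the 1×1 convolution is modelled as a per-pixel dot product with a hypothetical weight vector `w_m` (bias omitted), and the tensor sizes are arbitrary toy values.

```python
import numpy as np


def soft_attention(v, w_m):
    """Apply Eq. (2)-style soft attention to a feature map.

    v   : array of shape (W, H, C), the input feature map (V_m or V_a).
    w_m : array of shape (C,), weights of a 1x1 convolution that maps each
          C-dimensional pixel feature to one importance score (no bias).
    Returns (u, a): attended features (W, H, C) and attention map (W, H).
    """
    scores = v @ w_m                    # 1x1 conv to a single channel: (W, H)
    scores = scores - scores.max()      # stabilize the exponentials
    a = np.exp(scores)
    a = a / a.sum()                     # softmax over all W*H locations
    u = v * a[..., None]                # Hadamard product, channel by channel
    return u, a


rng = np.random.default_rng(0)
v_m = rng.standard_normal((6, 5, 8))    # toy sizes, chosen arbitrarily
w_m = rng.standard_normal(8)
u_m, a_m = soft_attention(v_m, w_m)
```

Because the softmax runs over all spatial locations jointly, the attention map sums to one over the whole feature map rather than per row or per channel.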
Fig. 4: Illustration of hard example mining (HEM) for salient
object boundary detection. Top row: (a) Image, (b) Groundtruth,
(c) HED, (d) Hard Example Mining. Bottom row: (e) Image,
(f) Groundtruth, (g) Boundary w/o HEM, (h) Boundary w/ HEM.
During training, for each training image in (a), our method first
estimates an edge map (c) using off-the-shelf HED [75], and then
determines hard pixels (d) to facilitate training. For each test
image in (e), we see that the boundary results with HEM (h) are
more accurate than those without HEM (g).
2) Attention Transition: To transfer motion-attentive features
$U_m$, we first seek the affinity between $U_a$ and $U_m$ in
a non-local manner, using the following multi-modal bilinear
model (with $U_a$ and $U_m$ flattened into $C \times (WH)$ matrices):

$S = U_m^\top W U_a \in \mathbb{R}^{(WH)\times(WH)}$, (3)

where $W \in \mathbb{R}^{C\times C}$ is a trainable weight matrix. The affinity
matrix $S$ can effectively capture pairwise relationships between
the two feature spaces. However, it also introduces a huge
number of parameters, which increases the computational cost
and creates the risk of over-fitting. To overcome this problem,
$W$ is approximately factorized into two low-rank matrices
$P \in \mathbb{R}^{C\times\frac{C}{d}}$ and $Q \in \mathbb{R}^{C\times\frac{C}{d}}$, where d (d > 1) is a
reduction ratio. Then, Eq. (3) can be rewritten as:

$S = U_m^\top P Q^\top U_a = (P^\top U_m)^\top (Q^\top U_a)$. (4)

This operation is equal to applying channel-wise feature
transformations to $U_m$ and $U_a$ before computing the similarity.
Its advantages over Eq. (3) are three-fold: 1) It reduces the
number of parameters from $C^2$ to $2C^2/d$; 2) It requires far
fewer multiplication operations. For comparison, Eq. (3) needs
$WHC^2 + W^2H^2C$ multiplications, while Eq. (4) only requires
$(2WHC^2 + W^2H^2C)/d$; 3) It helps to generate a compact
channel-wise feature representation for each modality.
Then, we normalize $S$ row-wise to derive an attention
map $S^r$ conditioned on motion features and obtain enhanced
appearance features $\hat{U}_a$:

Motion conditioned attention: $S^r = \mathrm{softmax}_r(S)$,
Attention-enhanced feature: $\hat{U}_a = U_a S^r$, (5)

where $\mathrm{softmax}_r$ indicates row-wise softmax.
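The attention transition of Eqs. (3)-(5) can be illustrated with a small NumPy sketch. It is a minimal example under stated assumptions rather than the authors' implementation: features are already flattened to C × (WH) matrices, the factors `p` and `q` are random stand-ins for trained weights, and all sizes are arbitrary toy values. The sketch also checks numerically that the factorized form of Eq. (4) agrees with Eq. (3) when W = P Qᵀ.

```python
import numpy as np


def attention_transition(u_a, u_m, p, q):
    """Motion-conditioned attention over appearance features.

    u_a, u_m : arrays of shape (C, N) with N = W*H (flattened feature maps).
    p, q     : arrays of shape (C, C // d), the low-rank factors of W.
    Returns (u_a_hat, s_r): enhanced appearance features (C, N) and the
    row-normalized affinity matrix (N, N).
    """
    # Eq. (4): S = (P^T U_m)^T (Q^T U_a), shape (N, N).
    s = (p.T @ u_m).T @ (q.T @ u_a)
    # Eq. (5): row-wise softmax, then re-weight the appearance features.
    s = s - s.max(axis=1, keepdims=True)      # numerical stability
    s_r = np.exp(s)
    s_r = s_r / s_r.sum(axis=1, keepdims=True)
    u_a_hat = u_a @ s_r
    return u_a_hat, s_r


rng = np.random.default_rng(0)
C, N, d = 8, 12, 4                            # toy sizes, chosen arbitrarily
u_a = rng.standard_normal((C, N))
u_m = rng.standard_normal((C, N))
p = rng.standard_normal((C, C // d))
q = rng.standard_normal((C, C // d))

u_a_hat, s_r = attention_transition(u_a, u_m, p, q)

# Eq. (3) with W = P Q^T versus the factorized Eq. (4).
s_full = u_m.T @ (p @ q.T) @ u_a
s_fact = (p.T @ u_m).T @ (q.T @ u_a)

# Parameter counts: C^2 for W versus 2 * C * (C / d) for P and Q.
params_full = C * C
params_fact = 2 * C * (C // d)
```

With C = 8 and d = 4, the factorized form holds 32 parameters against 64 for the full matrix, i.e., a fraction 2/d of the original.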
3) Deep-MAT: For complex videos, using one MAT layer
to predict the attention is sub-optimal due to the noise
introduced by distractors which are irrelevant to the target regions.
Therefore, we extend MAT into Deep-MAT for multi-step
reasoning of spatiotemporal attention. Deep-MAT progressively
refines attention via multiple MAT layers and can pinpoint
more accurate target regions. In particular, our deep MAT
consists of L MAT layers cascaded in depth (denoted by
$F^{(1)}_{MAT}, \cdots, F^{(L)}_{MAT}$). Let $\hat{U}^{(l-1)}_a$ and $\hat{U}^{(l-1)}_m$ be the input
features for $F^{(l)}_{MAT}$. It then produces outputs $\hat{U}^{(l)}_a$ and $\hat{U}^{(l)}_m$,
which are further fed to $F^{(l+1)}_{MAT}$ in a recursive manner:

$\hat{U}^{(l)}_a, \hat{U}^{(l)}_m = F^{(l)}_{MAT}(\hat{U}^{(l-1)}_a, \hat{U}^{(l-1)}_m)$, (6)

where $\hat{U}^{(l)}_a$ is computed by Eq. (5) and $\hat{U}^{(l)}_m$ by
Eq. (2). In addition, we have $\hat{U}^{(0)}_a = V_a$ and $\hat{U}^{(0)}_m = V_m$.
It should be noted that stacking MAT layers directly leads to
an obvious drop in performance. Inspired by [74], we propose
to stack multiple MAT layers in a residual form as follows:
[Eq. (7): missing in the source text]
4) Discussion: In Fig. 3, we show the visual effects of the
MAT module. We can observe that, with MAT, the feature maps
in $V_a$ are well refined to produce more effective features in
$\hat{U}_a$. The new features show desirable properties, with
prominent objects highlighted and distractors suppressed, which
is beneficial for accurate segmentation.
C. Scale-Sensitive Attention Module
The SSA module $F_{SSA}$ is extended from a simplified CBAM
$F_{CBAM}$ [71] by adding a global attention $F_g$. Given a feature
map $U \in \mathbb{R}^{W\times H\times 2C}$, our SSA module refines it as follows:

$Z = F_{SSA}(U) = F_g(F_{CBAM}(U)) \in \mathbb{R}^{W\times H\times 2C}$. (8)

The CBAM module $F_{CBAM}$ consists of two sequential
sub-modules: channel and spatial attention, which can be
formulated as:

Channel attention: $s = F_s(U)$, $e = F_e(s)$, $Z_c = e \star U$,
Spatial attention: $p = F_p(Z_c)$, $Z_{CBAM} = p \odot Z_c$, (9)

where $F_s$ is a squeeze operator that gathers the global
spatial information of $U$ into a vector $s \in \mathbb{R}^{2C}$, while $F_e$
is an excitation operator that captures channel-wise
dependencies and outputs an attention vector $e \in \mathbb{R}^{2C}$.
Following [70], $F_s$ is implemented by applying average pooling on
each feature channel, and $F_e$ is formed by four consecutive
operations: fc(2C/16) → ReLU → fc(2C) → sigmoid.
$Z_c \in \mathbb{R}^{W\times H\times 2C}$ denotes channel-wise attentive features, and
$\star$ indicates channel-wise multiplication. In the spatial
attention, $F_p$ exploits the inter-spatial relationship of $Z_c$
and produces a spatial attention map $p \in \mathbb{R}^{W\times H}$ by
conv(7×7, 1) → sigmoid. Then, we obtain the attention
glimpse $Z_{CBAM} \in \mathbb{R}^{W\times H\times 2C}$ as the local-level feature.
The global attention $F_g$ shares a similar spirit to the
channel attention layer in Eq. (9), in that it has the
same squeeze layer but modifies the excitation layer as
fc(2C/16) → fc(1) → sigmoid to obtain a scale selection factor
$g \in \mathbb{R}^1$. It then obtains the scale-sensitive features $Z_g$,
i.e., the module output $Z$ of Eq. (8), as:

$Z = (g \cdot Z_{CBAM}) + U$. (10)
Note that we use identity mapping to avoid losing important
information on the regions with attention values close to 0.
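The data flow of Eqs. (8)-(10) can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the released code: all weights are random stand-ins, biases are omitted, the 7×7 convolution is a naive loop with zero padding, and the function and argument names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv2d_single(x, kernel):
    """Naive zero-padded convolution (cross-correlation, as in deep-learning
    frameworks) of x (W, H, K) with kernel (k, k, K), giving a (W, H) map."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.empty(x.shape[:2])
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k, :] * kernel)
    return out


def ssa(u, w1, w2, w_sp, g1, g2):
    """Scale-sensitive attention on u of shape (W, H, K), where K = 2C."""
    # Channel attention: squeeze (average pooling), then excitation.
    s = u.mean(axis=(0, 1))                            # (K,)
    e = sigmoid(np.maximum(s @ w1, 0.0) @ w2)          # fc -> ReLU -> fc -> sigmoid
    z_c = u * e                                        # channel-wise multiplication
    # Spatial attention: 7x7 convolution to one channel, then sigmoid.
    p = sigmoid(conv2d_single(z_c, w_sp))              # (W, H)
    z_cbam = z_c * p[..., None]
    # Global attention: same squeeze, excitation fc -> fc(1) -> sigmoid.
    g = sigmoid((z_cbam.mean(axis=(0, 1)) @ g1) @ g2)  # scalar scale factor
    return g * z_cbam + u, g                           # Eq. (10), identity term kept


rng = np.random.default_rng(0)
W, H, K = 5, 4, 32                                     # toy sizes, K = 2C
u = rng.standard_normal((W, H, K))
w1 = rng.standard_normal((K, K // 16))
w2 = rng.standard_normal((K // 16, K))
w_sp = rng.standard_normal((7, 7, K)) * 0.01
g1 = rng.standard_normal((K, K // 16))
g2 = rng.standard_normal(K // 16)
z, g = ssa(u, w1, w2, w_sp, g1, g2)
```

The identity term in the last line means that even where the local attention is close to zero, the input features still pass through unchanged, which is the behaviour the note on identity mapping describes.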
D. Boundary-Aware Refinement Module
In the decoder network, each BAR FBARiaccepts two in-
puts, Zifrom the corresponding SSA module and Fifrom the
previous BAR. To obtain a sharp mask output, the BAR first
performs object boundary estimation using an extra boundary
detection module FBDRY, which compels the network to em-
phasize finer object details. The predicted boundary map is
then combined with the two inputs to produce finer features
for the next BAR module. This can be formulated as:
where FBDRY consists of a stack of convolutional layers and a
sigmoid layer (see Fig. 5), Mb
iRw×hindicates the boundary
map and Fi1is the output feature map of BARi. The full
computational graph of BARiis shown in Fig. 5.
Fig. 5: Computational graph of the BAR$_i$ module. Here,
‘Res’ is a residual block [73], while ‘UP’ denotes bilinear
upsampling. ⓒ and ⊕ indicate concatenation and element-wise
addition operations, respectively.

BAR benefits from two key factors: the first is that we apply
Atrous Spatial Pyramid Pooling (ASPP) [76] on convolutional
features to transform them into a multi-scale representation.
This helps to enlarge the receptive field and obtain more
spatial details for decoding. Technically, ASPP consists of
multiple parallel dilated convolutional layers with different
sampling rates. In this paper, four dilated convolutional layers
are adopted, and the dilation rates are set as $\{2k\}_{k=1}^{4}$. In this
way, each BAR module first extracts spatiotemporal features
on four scales, which are then concatenated together to emphasize
multi-scale features. During decoding, these features
are further concatenated with the boundary prediction $M^b_i$, and
then progressively processed by a residual block (‘Res’ in
Fig. 5), an element-wise summation with $Z_i$, and another
residual block to obtain more fine-grained features $F_{i-1}$, as
shown in Fig. 5. Here the residual block is implemented by
two stacked 3×3 convolutions with an identity shortcut [73].
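To make the effect of the dilation rates $\{2k\}_{k=1}^{4}$ concrete, the following minimal sketch computes the spatial extent covered by each dilated branch. It assumes a 3×3 kernel in every ASPP branch, which the text does not state, and the helper name `effective_kernel_size` is purely illustrative.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Dilation rates {2k} for k = 1..4, as given in the text.
dilation_rates = [2 * k for k in range(1, 5)]

# Assumption: each branch uses a 3x3 kernel (not stated in the source).
spans = [effective_kernel_size(3, d) for d in dilation_rates]

for d, s in zip(dilation_rates, spans):
    print(f"dilation {d}: covers a {s}x{s} window")
```

Under this assumption the four branches see increasingly wide windows with the same number of weights, which is one way to read the claim that ASPP enlarges the receptive field.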
The second benefit is that we introduce a heuristic method
for automatically mining hard negative pixels to support the
training of $F_{BDRY}$. Specifically, for each training frame, we
use the popular off-the-shelf HED model [75] to predict a
boundary map $E \in [0,1]^{w \times h}$, wherein each value $E_k$ represents
the probability of pixel $k$ being an edge pixel. Then, pixel
$k$ is regarded as a hard negative pixel if it has a high edge
probability (e.g., $E_k > 0.2$) and falls outside the dilated ground-truth
region. If pixel $k$ is a hard pixel, then its weight is
$w_k = 1 + E_k$; otherwise, $w_k = 1$.
Then, $w_k$ is used to weight the following adaptive boundary
loss so that the network is penalized heavily if the hard pixels are
misclassified:
$$L_{BDRY}(M^b, G^b) = -\sum_k w_k \big( (1 - G^b_k) \log(1 - M^b_k) + G^b_k \log M^b_k \big), \tag{12}$$
where $M^b$ and $G^b$ are the boundary prediction and ground-truth,
respectively.
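The weighting rule and the weighted boundary loss described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: maps are small nested lists, the dilated ground-truth region is passed in as a ready-made binary mask, and the function names `hard_negative_weights` and `weighted_boundary_loss` are hypothetical rather than taken from the released code.

```python
import math


def hard_negative_weights(edge_prob, dilated_gt, threshold=0.2):
    """w_k = 1 + E_k for hard negatives (high edge probability outside
    the dilated ground-truth region), and w_k = 1 otherwise."""
    return [
        [1.0 + e if (e > threshold and g == 0) else 1.0
         for e, g in zip(e_row, g_row)]
        for e_row, g_row in zip(edge_prob, dilated_gt)
    ]


def weighted_boundary_loss(pred, target, weights, eps=1e-7):
    """Pixel-weighted binary cross entropy summed over all pixels."""
    total = 0.0
    for p_row, g_row, w_row in zip(pred, target, weights):
        for p, g, w in zip(p_row, g_row, w_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            total -= w * ((1 - g) * math.log(1 - p) + g * math.log(p))
    return total


# Toy 2x2 example: E is a stand-in for an HED edge map.
E = [[0.9, 0.1], [0.5, 0.3]]
dilated_region = [[0, 0], [1, 0]]
w = hard_negative_weights(E, dilated_region)

boundary_gt = [[0, 0], [1, 0]]
prediction = [[0.5, 0.5], [0.5, 0.5]]
loss = weighted_boundary_loss(prediction, boundary_gt, w)
```

In this toy case only the two pixels with edge probability above 0.2 that lie outside the region receive a weight larger than 1, so errors on them contribute more to the loss.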
Fig. 4 offers an illustration of the above hard example
mining (HEM) scheme. Clearly, by explicitly discovering
hard negative pixels, the network can produce more accurate
boundary predictions with background pixels well suppressed
(see Fig. 4 (g) and (h)).
E. Detailed Network Architecture
Our whole model is end-to-end trainable, because all the
components in MATNet are parameterized by neural networks.
At each stream of the encoder, we use the first five convolution
blocks of ResNet-101 [73] as our backbone for feature
extraction. The spatiotemporal features in the last convolution
stage are fed into a global convolutional layer (GC in Fig. 1)
to enlarge the valid receptive field [77], which is implemented
by combining 1×7 + 7×1 and 7×1 + 1×7 convolutional
layers, followed by a residual block.
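As a rough illustration of why a large kernel is often factorized in this way, the sketch below counts convolution weights for a dense 7×7 kernel versus the two factorized branches. The channel width of 256 is a hypothetical value chosen only for the arithmetic, biases are ignored, and both helper names are illustrative.

```python
def conv_weights(kh: int, kw: int, c_in: int, c_out: int) -> int:
    """Number of weights in a kh x kw convolution (bias ignored)."""
    return kh * kw * c_in * c_out


def factorized_weights(k: int, c_in: int, c_out: int) -> int:
    """Two branches: (1xk then kx1) and (kx1 then 1xk)."""
    branch_a = conv_weights(1, k, c_in, c_out) + conv_weights(k, 1, c_out, c_out)
    branch_b = conv_weights(k, 1, c_in, c_out) + conv_weights(1, k, c_out, c_out)
    return branch_a + branch_b


channels = 256  # hypothetical width, not taken from the paper
dense = conv_weights(7, 7, channels, channels)
factorized = factorized_weights(7, channels, channels)
ratio = factorized / dense
```

With equal input and output widths, the two branches together need 28/49 of the weights of a dense 7×7 kernel while still spanning a 7×7 window.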
Fig. 6: Qualitative results on four sequences. From top to bottom: dance-twirl from DAVIS16, dogs02 from FBMS, cat-0001
from YouTube-Objects and dogs-jump from DAVIS17.
1) Training Phase: Given an input frame $I_a \in \mathbb{R}^{473 \times 473 \times 3}$,
we first compute its optical flow field $I_m \in \mathbb{R}^{473 \times 473 \times 3}$ using
PWC-Net [78] due to its high efficiency and accuracy. Then,
our MATNet predicts a segmentation mask $M^s \in [0,1]^{473 \times 473}$
and four boundary masks $\{M^b_i \in [0,1]^{473 \times 473}\}_{i=1}^{4}$ through
the decoder network. Let $G^s \in \{0,1\}^{473 \times 473}$ be the binary
segmentation ground-truth, and $G^b \in \{0,1\}^{473 \times 473}$ be the
boundary ground-truth, which can be easily computed from
$G^s$. The overall loss function is formulated as:
$$L_{ZVOS} = L_{CE}(M^s, G^s) + \frac{1}{4} \sum_{i=1}^{4} L_{BDRY}(M^b_i, G^b), \tag{13}$$
where $L_{CE}$ indicates the classic cross entropy loss, and $L_{BDRY}$
is defined in Eq. (12).
2) Testing Phase: Once the network is trained, we apply it
to unseen videos. Given a test video, we resize all the frames
to 473×473, and feed each frame, along with its optical flow, to
the network for segmentation. We follow the common protocol
used in previous works [27, 30, 43] and employ CRF to obtain
the final binary segmentation results.
3) Runtime: Our model is implemented in PyTorch and
trained on a single Nvidia RTX 2080Ti GPU and an Intel(R)
Xeon Gold 5120 CPU. Testing is conducted on the same
machine. For each test frame of size 473×473, the forward
inference of our MATNet takes about 0.05s, while optical flow
estimation and CRF-based post-processing take about 0.2s and
0.5s, respectively.
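The per-frame timings quoted above can be combined into an end-to-end budget. The short sketch below only restates the reported numbers; the dictionary keys are illustrative labels, and it assumes the three stages run sequentially with no overlap.

```python
# Reported per-frame timings in seconds (473x473 input).
stage_seconds = {"optical_flow": 0.2, "matnet_forward": 0.05, "crf": 0.5}

# Assumption: stages run one after another, without pipelining.
total_seconds = sum(stage_seconds.values())
frames_per_second = 1.0 / total_seconds
share = {name: t / total_seconds for name, t in stage_seconds.items()}
```

Under this assumption a frame takes about 0.75 s end to end, and the network forward pass accounts for well under a tenth of that time, with CRF post-processing dominating.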
In this section, we describe two extensions of our MATNet:
zero-shot video instance segmentation and dynamic visual
attention prediction. The former focuses on multi-object
unsupervised video segmentation [79], targeting more fine-grained
results in multi-object scenarios. The latter aims at
predicting where people look in dynamic scenes.
A. Zero-Shot Video Instance Segmentation
To adapt our MATNet into an instance-level segmentation
setting, we modify our model into a saliency-driven instance
selection method. More specifically, for a test video $V = \{I_t\}_{t=1}^{T}$
with $T$ frames, our approach takes three stages to
generate segmentation tracks for it.
1) Object proposal generation. For each frame $I_t$, we generate
a collection of category-agnostic segment proposals $P_t = \{P^i_t\}_i$
using COCO-trained Mask R-CNN [83] for detecting generic objects.
Our MATNet is also applied to generate an object-level segmentation
mask for frame $I_t$. Then, we compute a score $S^i_t$ for each proposal:
MATNet =kPi
where $S^i_{MRCNN}$ denotes the detection score of $P^i_t$ from Mask
R-CNN, while $S^i_{MATNet}$ measures its saliency score. The proposals
with small scores ($S^i_t < 0.03$) are discarded.
2) Short-Term Tracklet Generation. Given the remaining proposals, we
further connect them temporally in a greedy manner. Firstly,
each proposal $P^i_t$ is warped to the next frame using optical
flow, and we search for its matched proposal in $P_{t+1}$ by
evaluating the IoU scores. If the maximum IoU score is
above 0.1, the corresponding proposal is regarded as being
matched with $P^i_t$.
3) Tracklet Merging by Re-Identification (ReID). We further merge
short-term tracklets into a set of
consistent segmentation tracks using object re-identification.
The ReID embedding vector for each proposal is computed
using a pretrained ReID network [84]. For each tracklet, its
embedding is computed as the average embedding of all
proposals belonging to it. We use the $L_2$ distance to measure
the similarities between two tracklets and adopt the merging
strategy in [85] to obtain final segmentation tracks.
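The greedy matching in stage 2 and the tracklet embedding in stage 3 can be sketched as below. This is a minimal illustration, not the authors' implementation: masks are represented as sets of pixel coordinates, the proposals from frame $t$ are assumed to be already warped by optical flow, and all function names are hypothetical. The source does not say how ties or many-to-one matches are resolved, so the sketch simply keeps the best candidate for each proposal.

```python
import math


def mask_iou(a: set, b: set) -> float:
    """Intersection over union of two masks given as pixel-coordinate sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_proposals(warped_prev, next_props, iou_threshold=0.1):
    """Greedy matching: each warped proposal from frame t is linked to the
    proposal in frame t+1 with the highest IoU, if that IoU exceeds 0.1."""
    matches = {}
    for i, warped in enumerate(warped_prev):
        best_j, best_iou = None, 0.0
        for j, candidate in enumerate(next_props):
            iou = mask_iou(warped, candidate)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None and best_iou > iou_threshold:
            matches[i] = best_j
    return matches


def tracklet_embedding(embeddings):
    """Average of the per-proposal ReID embeddings in one tracklet."""
    n = len(embeddings)
    return [sum(values) / n for values in zip(*embeddings)]


def l2_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


warped = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5)}]
candidates = [{(0, 1), (1, 1), (0, 2), (1, 2)}, {(9, 9)}]
links = match_proposals(warped, candidates)
```

In the toy example the first proposal overlaps its candidate with an IoU of 1/3 and is linked, while the second has no overlapping candidate and starts no link.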
TABLE I: Quantitative comparison of ZVOS methods on DAVIS16 val. The best result for each metric is boldfaced (this
note also applies to the other tables). All the results are borrowed from the public leaderboard maintained by the DAVIS16
challenge ( soa_compare.html). See V-A for details.

| Metric | Measure | SFL [29] | FSEG [28] | LVO [48] | ARP [80] | PDB [53] | LSMO [44] | MOT [81] | EPO [49] | AGS [43] | COSNet [41] | AGNN [17] | AnDiff [82] | MATNet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| J | Mean ↑ | 67.4 | 70.7 | 75.9 | 76.2 | 77.2 | 78.2 | 77.2 | 80.6 | 79.7 | 80.5 | 80.7 | 81.7 | 82.4 |
| J | Recall ↑ | 81.4 | 83.5 | 89.1 | 91.1 | 90.1 | 89.1 | 87.8 | 95.2 | 91.1 | 93.1 | 94.0 | 90.9 | 94.5 |
| J | Decay ↓ | 6.2 | 1.5 | 0.0 | 7.0 | 0.9 | 4.1 | 5.0 | 2.2 | 1.9 | 4.4 | 0.0 | 2.2 | 5.5 |
| F | Mean ↑ | 66.7 | 65.3 | 72.1 | 70.6 | 74.5 | 75.9 | 77.4 | 75.5 | 77.4 | 79.5 | 79.1 | 80.5 | 80.7 |
| F | Recall ↑ | 77.1 | 73.8 | 83.4 | 83.5 | 84.4 | 84.7 | 84.4 | 87.9 | 85.8 | 89.5 | 90.5 | 85.1 | 90.2 |
| F | Decay ↓ | 5.1 | 1.8 | 1.3 | 7.9 | -0.2 | 3.5 | 3.3 | 2.4 | 1.6 | 5.0 | 0.0 | 0.6 | 4.5 |
| T | Mean ↓ | 28.2 | 32.8 | 26.5 | 39.3 | 29.1 | 21.2 | 27.9 | 19.3 | 26.7 | 18.4 | 33.7 | 21.4 | 21.6 |

TABLE II: Quantitative results for each category on YouTube-Objects over Mean J. See V-A for details.

| Category | LVO [48] | SFL [29] | FSEG [28] | PDB [53] | AGS [43] | COSNet [41] | AGNN [17] | MATNet |
|---|---|---|---|---|---|---|---|---|
| Airplane (6) | 86.2 | 65.6 | 81.7 | 78.0 | 87.7 | 81.1 | 81.1 | 72.9 |
| Bird (6) | 81.0 | 65.4 | 63.8 | 80.0 | 76.7 | 75.7 | 75.9 | 77.5 |
| Boat (15) | 68.5 | 59.9 | 72.3 | 58.9 | 72.2 | 71.3 | 70.7 | 66.9 |
| Car (7) | 69.3 | 64.0 | 74.9 | 76.5 | 78.6 | 77.6 | 78.1 | 79.0 |
| Cat (16) | 58.8 | 58.9 | 68.4 | 63.0 | 69.2 | 66.5 | 67.9 | 73.7 |
| Cow (20) | 68.5 | 51.2 | 68.0 | 64.1 | 64.6 | 69.8 | 69.7 | 67.4 |
| Dog (27) | 61.7 | 54.1 | 69.4 | 70.1 | 73.3 | 76.8 | 77.4 | 75.9 |
| Horse (14) | 53.9 | 64.8 | 60.4 | 67.6 | 64.4 | 67.4 | 67.3 | 63.2 |
| Motorbike (10) | 60.8 | 52.6 | 62.7 | 58.4 | 62.1 | 67.7 | 68.3 | 62.6 |
| Train (5) | 66.3 | 34.0 | 62.2 | 35.3 | 48.2 | 46.8 | 47.8 | 51.0 |
| Mean J ↑ | 67.5 | 57.1 | 68.4 | 65.5 | 69.7 | 70.5 | 70.8 | 69.0 |

TABLE III: Quantitative results on FBMS over Mean J (V-A).

| Measure | MSTP [42] | FSEG [28] | IET [45] | OBN [31] | PDB [53] | COSNet [41] | MATNet |
|---|---|---|---|---|---|---|---|
| Mean J ↑ | 60.8 | 68.4 | 71.9 | 73.9 | 74.0 | 75.6 | 76.1 |

B. Dynamic Visual Attention Prediction
Our MATNet is flexible to fit the DVAP task with modifications
in two aspects: 1) Network structure: Since boundary
ground-truths are not available in this task, we discard
the object boundary constraints so that Eq. (11) becomes:
$F_{i-1} = F_{BAR_i}(Z_i, F_i)$. In this way, for BAR$_i$, more fine-grained
features $F_{i-1}$ are produced by relying on only the
features $F_i$ from BAR$_{i+1}$ as well as the corresponding convolutional
feature $Z_i$. Besides, we also remove the unnecessary
concatenation operator in Fig. 5. All other modules are kept
unchanged. 2) Loss function: We consider the Kullback-Leibler
(KL) divergence loss $L_{KL}$ as our main learning objective. It is
more task-oriented and has been proven effective in [86]. The
overall loss function is:
$$L_{DVAP} = L_{KL}(M^v, G^v) + \lambda L_{CE}(M^v, G^v), \tag{15}$$
where $M^v$ and $G^v$ are the attention prediction and ground-truth,
respectively, $L_{KL}(M^v, G^v) = \sum_i G^v_i \log(G^v_i / M^v_i)$, and $\lambda = 0.1$
is a weight to balance the contributions of the two losses.
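A minimal sketch of the combined DVAP objective is given below. Several choices here are assumptions rather than details from the paper: the maps are flattened to lists with non-zero sums, both maps are normalised to sum to one before the KL term, and the cross entropy term is taken to be a per-pixel binary cross entropy averaged over pixels. The function names are illustrative.

```python
import math


def kl_divergence(pred, target, eps=1e-7):
    """sum_i G_i * log(G_i / M_i) on maps normalised to sum to one
    (the normalisation is an assumption; inputs must have non-zero sums)."""
    pred_sum, target_sum = sum(pred), sum(target)
    total = 0.0
    for m, g in zip(pred, target):
        m_n, g_n = m / pred_sum, g / target_sum
        if g_n > 0:
            total += g_n * math.log(g_n / (m_n + eps))
    return total


def binary_cross_entropy(pred, target, eps=1e-7):
    """Per-pixel binary cross entropy, averaged over pixels (assumed form)."""
    total = 0.0
    for m, g in zip(pred, target):
        m = min(max(m, eps), 1.0 - eps)
        total -= g * math.log(m) + (1 - g) * math.log(1 - m)
    return total / len(pred)


def dvap_loss(pred, target, lam=0.1):
    """L_KL + lambda * L_CE with lambda = 0.1 as in the text."""
    return kl_divergence(pred, target) + lam * binary_cross_entropy(pred, target)


attention_pred = [0.5, 0.5]
attention_gt = [0.9, 0.1]
value = dvap_loss(attention_pred, attention_gt)
```

The KL term is close to zero when the prediction matches the ground-truth distribution and grows as the two diverge, while the small weight on the cross entropy term keeps it as a secondary signal.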
In this section, we first compare MATNet with state-of-
the-art models on our main task of interest, i.e., ZVOS,
on both object-level (V-A) and instance-level (V-B) settings.
Then, we investigate the performance of our model on the
DVAP task (V-C). For each task, we separately introduce the
corresponding standalone datasets and experimental results.
Finally, to gain a deeper insight into our model, we conduct
detailed ablation studies in V-D.
Fig. 7: Attribute-based comparison on DAVIS16 val. We
compare MATNet with three top-performing methods, i.e., AnDiff [82],
COSNet [41] and AGS [43]. For each method, Mean
J is computed over all sequences with specified attributes.
Fig. 8: Attribute-based ablation study on DAVIS16 val. We
compare the Mean J of different network variants under
various attributes.
A. Main Task: Zero-Shot Video Object Segmentation
1) Datasets: We carry out comprehensive experiments on
three popular datasets:
DAVIS16 [32] is one of the most popular video object seg-
mentation datasets, which consists of 50 high-quality videos
in total (30 for train and 20 for val). Each frame contains
pixel-wise annotations for foreground objects. For quantitative
evaluation, we use three standard metrics suggested by [32],
namely region similarity J, boundary accuracy F, and time
stability T.
YouTube-Objects [34] is a large dataset of 126 web videos
with 10 semantic object categories and more than 20,000
frames. Following its protocol, we use the region similarity
J metric to measure the performance on the whole dataset
without further training.
FBMS [11] consists of 59 video sequences with ground-truth
annotations provided in a subset of the frames. Following the
standard protocol [48], we do not use any sequence for training
and only evaluate on the val set consisting of 30 sequences.
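For readers unfamiliar with the region similarity J used across these datasets, the sketch below computes it as the intersection-over-union of a predicted and a ground-truth mask, together with the mean and recall statistics reported in the tables. It follows the usual DAVIS conventions as far as we understand them (two empty masks score 1, recall counts frames with J above 0.5); these conventions and the function names are assumptions, and the sketch is not the official evaluation code.

```python
def region_similarity(pred, gt):
    """Jaccard index (IoU) between two flattened binary masks."""
    intersection = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return intersection / union


def mean_and_recall(per_frame_scores, threshold=0.5):
    """Mean score, and the fraction of frames scoring above the threshold."""
    n = len(per_frame_scores)
    mean = sum(per_frame_scores) / n
    recall = sum(1 for s in per_frame_scores if s > threshold) / n
    return mean, recall


j = region_similarity([1, 1, 0, 0], [1, 0, 1, 0])
j_mean, j_recall = mean_and_recall([0.8, 0.6, 0.4, 0.2])
```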
TABLE IV: Quantitative comparison of ZVOS methods on
DAVIS17 val. All the results are borrowed from the public
leaderboard of the DAVIS17 challenge
(https://davischallenge.org/davis2017/soa_compare.html). See V-B for details.

| Metric | Measure | RVOS [16] | PDB [53] | AGS [43] | MATNet |
|---|---|---|---|---|---|
| J&F | Mean ↑ | 41.2 | 55.1 | 57.5 | 58.6 |
| J | Mean ↑ | 36.8 | 53.2 | 55.5 | 56.7 |
| J | Recall ↑ | 40.2 | 58.9 | 61.6 | 65.2 |
| J | Decay ↓ | 0.5 | 4.9 | 7.0 | -3.6 |
| F | Mean ↑ | 45.7 | 57.0 | 59.5 | 60.4 |
| F | Recall ↑ | 46.4 | 60.2 | 62.8 | 68.2 |
| F | Decay ↓ | 1.7 | 6.8 | 9.0 | 1.8 |

2) Implementation Details: The training data consist of two
parts: i) all training data from DAVIS16 [32], including 30
videos with about 2K frames; ii) a subset of 12K frames
selected from the training set of YouTube-VOS [87], which is
obtained by sampling images every ten frames in each video.
In total, we use 14K training samples, basically matching
the current top-performing methods, i.e., AGNN [17], COSNet [41]
and AGS [43]. The entire network is trained using the
SGD optimizer with a learning rate of 1e-4 for the encoder and
the bridge network, and 1e-3 for the decoder. During training,
the batch size, momentum and weight decay are set to 2, 0.9
and 1e-5, respectively. The data are augmented online with
horizontal flipping and rotations covering a range of degrees
3) Performance on DAVIS16 val: We compare our MATNet
with the top-performing ZVOS methods in the public
leaderboard of DAVIS16. The detailed results are reported
in Table I. We can observe that our MATNet achieves the
best Mean J and Mean F among all compared methods. Specifically,
it outperforms the second-best method (i.e., AnDiff [82]) by
+0.7% and +0.2% in terms of Mean J and Mean F, and
+3.6% and +5.1% in terms of Recall J and Recall F.
In Table I, some of the deep learning-based models, e.g.,
FSEG [28], LVO [48], MOT [81], use motion cues to improve
segmentation. Our MATNet outperforms all of these methods
by a large margin. The reason lies in that these methods
learn motion and appearance features independently, without
considering the close interactions between them. In contrast,
our MATNet can learn more effective multi-modal object
representations with the interleaved encoder.
Fig. 7 shows the results of the attribute-based study on
DAVIS16 [32] using 15 video attributes provided by the
dataset. Three top-performing ZVOS methods, i.e., AnDiff [82],
COSNet [41] and AGS [43], are selected for
comparison. Our model significantly outperforms them in
terms of many attributes (e.g., low resolution, fast motion,
dynamic background, motion blur, heterogeneous object, and
appearance change). This demonstrates the robustness of our
model against various challenges present in videos.
4) Performance on YouTube-Objects: Table II reports the
detailed results on YouTube-Objects. Our model shows
promising performance in most categories. It lags behind some
methods in the Airplane and Boat categories. This is mainly
because sequences in these categories contain slowly-moving
objects, which are often visually similar to their surroundings.
These factors may result in inaccurate estimation of optical
flow, thereby hurting the performance.
5) Performance on FBMS: For completeness, we also evaluate
our method on FBMS. As shown in Table III, MATNet
produces the best results with 76.1% in Mean J, which
outperforms the second-best result in Table III, i.e., COSNet [41] (75.6%), by 0.5%.
6) Qualitative results: Fig. 6 depicts sample results for
representative sequences from these three datasets. The dance-twirl
sequence from DAVIS16 contains many challenging factors,
such as object deformation, motion blur and background
clutter. As seen, our method is robust to these challenges and
delineates the target with accurate contours. The effectiveness
is further demonstrated on cat-0001 from YouTube-Objects, in which
the cat is visually similar to its surroundings and undergoes
large deformation. In addition, our model also works well on
dogs02, in which the target undergoes large scale variations.

TABLE V: Quantitative DVAP results on the val sets of
Hollywood-2 and UCF-Sports. See V-C for details.

| Dataset | Methods | AUC-J ↑ | SIM ↑ | s-AUC ↑ | CC ↑ | NSS ↑ |
|---|---|---|---|---|---|---|
| Hollywood-2 | ACLNet [68] | 0.913 | 0.542 | 0.757 | 0.623 | 3.086 |
| Hollywood-2 | SalEMA [88] | 0.919 | 0.487 | 0.708 | 0.613 | 3.186 |
| Hollywood-2 | TASED [89] | 0.918 | 0.507 | 0.768 | 0.646 | 3.302 |
| Hollywood-2 | STRA [58] | 0.923 | 0.536 | 0.774 | 0.662 | 3.478 |
| Hollywood-2 | Ours | 0.915 | 0.539 | 0.797 | 0.674 | 3.486 |
| UCF-Sports | ACLNet [68] | 0.897 | 0.406 | 0.744 | 0.510 | 2.567 |
| UCF-Sports | SalEMA [88] | 0.906 | 0.431 | 0.740 | 0.544 | 2.638 |
| UCF-Sports | TASED [89] | 0.899 | 0.469 | 0.752 | 0.582 | 2.920 |
| UCF-Sports | STRA [58] | 0.910 | 0.479 | 0.751 | 0.593 | 3.018 |
| UCF-Sports | Ours | 0.901 | 0.503 | 0.783 | 0.625 | 3.291 |
B. Additional Task: Zero-Shot Video Instance Segmentation
1) Datasets: DAVIS17 [33] extends DAVIS16 with another
70 sequences, leading to 120 videos in total. These videos are
split into 60 for train, 30 for val and 30 for test-dev.
Different from DAVIS16, this dataset provides instance-level
annotations. Therefore, we use it to evaluate the performance
of our model in instance-level video object segmentation.
Following the standard evaluation setting, we measure the
performance in terms of region similarity J, contour accuracy
F, and their combination J&F.
2) Quantitative and Qualitative Results: Table IV reports
the performance of MATNet against three top-performing
models (i.e., RVOS [16], PDB [53] and AGS [43]). The results
clearly demonstrate that our model outperforms all of them.
For instance, in terms of J&F, Mean J and
Mean F, our model surpasses the second-best method (i.e.,
AGS) by 1.1%, 1.2% and 0.9%, respectively.
Besides, some qualitative results on DAVIS17 are shown in
Fig. 6 (the last row), validating that our model yields high-
quality ZVOS results in the instance-level setting.
C. Additional Task: Dynamic Visual Attention Prediction
1) Datasets: Hollywood-2 [35] consists of 1,707 video
sequences (823 for train and 884 for test) collected from
69 Hollywood movies, covering 12 action categories (e.g.,
eating, kissing and running). The dataset focuses on more task-
driven scenes, e.g., movie scenes and human actions.
UCF-Sports [35] includes 150 videos covering 9 common
sports action categories, such as walking, diving and golfing.
Similar to Hollywood-2, the annotations in this dataset mainly
focus on action behaviors. The dataset is split into 103 videos
for train and 47 for test.
TABLE VI: Ablation study of MATNet on DAVIS16 val,
measured by the Mean J and Mean F. See V-D for details.

| Network Variant | Mean J ↑ | ΔJ | Mean F ↑ | ΔF |
|---|---|---|---|---|
| MATNet w/o MAT | 79.5 | -2.9 | 77.3 | -3.4 |
| MATNet w/o SSA | 80.7 | -1.7 | 79.7 | -1.0 |
| MATNet w/o HEM | 81.4 | -1.0 | 78.4 | -2.3 |
| MATNet w/ Res50 | 81.1 | -1.3 | 79.3 | -1.4 |
| MATNet w/ Res101 | 82.4 | - | 80.7 | - |

Fig. 9: Qualitative results of ablation study. Panels: (a) Image, (b) Ground truth, (c) w/o MAT, (d) w/o SSA, (e) w/o HEM, (f) MATNet.
2) Implementation Details: For each dataset, we use the
train set to train our model. The network is trained with
the same setting as in V-A, except that the training images
are resized to 360×360 for fair comparison with previous
works [58, 68, 89]. λ in Eq. (15) is empirically set to 0.1.
3) Metrics: Following previous work [68], we report the
performance of our model using five metrics, namely Normalized
Scanpath Saliency (NSS), Similarity (SIM), Linear
Correlation Coefficient (CC), Area Under the Curve by Judd
(AUC-J) and shuffled AUC (s-AUC). NSS and CC measure the
correlation between the prediction and ground-truth saliency
map. SIM computes the similarity between two histograms,
while AUC-J and s-AUC are variants of the well-known
AUC metric. For each metric, higher scores indicate better
performance.
4) Quantitative Results: We compare our model with four
DVAP models, i.e., ACLNet [68], SalEMA [88], STRA [58]
and TASED [89]. The results of these methods are directly
obtained from the authors. As shown in Table V, MATNet
generally outperforms all the competitors across most of the
metrics, in both the Hollywood-2 and UCF-Sports datasets.
This verifies the strong generality of our model.
D. Ablation Study
Table VI summarizes the ablation analysis of MATNet on
DAVIS16 val.
1) Efficacy of MAT: We first study the effects of the MAT
module by comparing our full model to one following the same
architecture without MAT, denoted as MATNet w/o MAT.
The encoder in this network is thus equivalent to a standard
two-stream model, where convolution features from the two
streams are concatenated at each residual stage for object
representation. As shown in Table VI, this model encounters a
clear performance degradation (-2.9% in Mean J and -3.4%
in Mean F), which verifies the effectiveness of MAT.
Moreover, we also evaluate the performance of MATNet
with a different number of MAT modules in each deep residual
MAT layer. The results in Table VII show that the performance
gradually improves as L increases, reaching saturation at L = 5.
Based on this analysis, we choose L = 5 as the default
number of MAT modules in MATNet.
TABLE VII: Performance comparisons with different numbers
of MAT blocks cascaded in each MAT layer on DAVIS16 val.
See V-D for details.

| Metric | L = 0 | L = 1 | L = 3 | L = 5 | L = 7 |
|---|---|---|---|---|---|
| Mean J ↑ | 79.5 | 80.6 | 81.6 | 82.4 | 82.2 |
| Mean F ↑ | 77.3 | 80.3 | 80.7 | 80.7 | 80.6 |

TABLE VIII: Impacts of different optical flow methods on
DAVIS16 val. See V-D for details.

| Flow Method | Mean J ↑ | Mean F ↑ | Mean T ↓ |
|---|---|---|---|
| LiteFlowNet [90] | 80.9 | 79.3 | 23.2 |
| SpyNet [91] | 78.4 | 76.8 | 26.6 |
| PWCNet [78] | 82.4 | 80.7 | 21.6 |
2) Efficacy of SSA: To measure the effectiveness of the
SSA module, we design another network, MATNet w/o SSA,
by replacing the SSA block with a simple skip layer. As
can be observed in Table VI, its performance is 1.7% lower than our
full model in terms of Mean J, and 1.0% lower in Mean
F. The performance drop is mainly caused by the redundant
spatiotemporal features from the encoder. Our SSA module
aims to eliminate the redundancy by only highlighting the
features that are beneficial to segmentation.
3) Efficacy of HEM: We also study the influence of using
HEM during training. HEM is expected to facilitate the learn-
ing of more accurate object boundaries, which should further
boost the segmentation procedure. The results in Table VI
(see MATNet w/o HEM) indicate the importance of HEM.
By directly controlling the loss function in Eq. (12), HEM
helps to improve the contour accuracy by 2.3%.
4) Impact of Backbone: To verify that the high performance
of our network is not mainly due to the powerful backbone, we
replace ResNet-101 with ResNet-50 to build another network,
i.e., MATNet w/ Res50. We see that the performance degrades
slightly, but the model still outperforms previous methods (e.g.,
AGNN [17], COSNet [41], AGS [43]). This further confirms
the effectiveness of the proposed modules.
5) Impact of Optical Flow: Table VIII reports the results
of MATNet on DAVIS16 val with three open-sourced op-
tical flow computation methods, i.e., PWC-Net [78], Lite-
FlowNet [90] and SpyNet [91]. They rank #23, #34 and
#143 in the public MPI Sintel Flow Benchmark (http://sintel., respectively. Generally, better optical
flow models lead to more accurate segmentation results, but
the performance does not change much, demonstrating the
robustness of our model against optical flow inputs.
6) Attribute Analysis: Fig. 8 illustrates the performance
comparison of different variants in the ablation study under
various video attributes. The performance is consistent with
that reported in Table VI. All three modules (i.e., MAT, SSA
and HEM) are critical for our model to improve performance.
7) Qualitative Comparison: Fig. 9 shows visual results of
the above ablation studies on two sequences. We see that all
of the network variants produce worse results compared with
our full model. It is worth noting that the MAT block has the
greatest visual influence on the performance.
In this paper, we proposed a novel MATNet for ZVOS.
We introduced a new way to learn rich spatiotemporal object
representations with an interleaved encoder, which encour-
ages knowledge propagation from motion to appearance in a
hierarchical manner. The spatiotemporal features are further
processed by a bridge network to produce more compact
representations, which are subsequently fed into a boundary-
aware decoder to obtain accurate segmentation in a top-down
fashion. We compared the proposed model with other state-of-
the-art ZVOS methods over four large-scale benchmarks and
the experimental results demonstrated that it achieves favor-
able performance against other contenders. Benefiting from
the powerful interleaved encoder for representation learning
in videos, our model also showed compelling performance
in the DVAP task. In the future, we will further extend it
to other video analysis tasks, such as action recognition and
video classification.
[1] N. Xu, B. Price, S. Cohen, J. Yang, and T. S. Huang, “Deep
interactive object selection,” in Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, 2016, pp.
[2] H. Hadizadeh and I. V. Bajić, “Saliency-aware video compression,”
IEEE Transactions on Image Processing, vol. 23, no. 1,
pp. 19–33, 2013.
[3] T. Zhou, W. Wang, S. Qi, H. Ling, and J. Shen, “Cascaded
human-object interaction recognition,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition,
2020, pp. 4263–4272.
[4] Z. Zhang, S. Fidler, and R. Urtasun, “Instance-level segmentation
for autonomous driving with deep densely connected MRFs,”
in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2016, pp. 669–677.
[5] A. Papazoglou and V. Ferrari, “Fast object segmentation in
unconstrained video,” in Proceedings of the IEEE International
Conference on Computer Vision, 2013, pp. 1777–1784.
[6] W. Wang, J. Shen, and F. Porikli, “Saliency-aware geodesic
video object segmentation,” in Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, 2015, pp.
[7] A. Faktor and M. Irani, “Video segmentation by non-local
consensus voting,” in Proceedings of the British Machine Vision
Conference, 2014, pp. 8–20.
[8] W. Wang, J. Shen, R. Yang, and F. Porikli, “Saliency-aware
video object segmentation,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 40, no. 1, pp. 20–33, 2017.
[9] T. Brox and J. Malik, “Object segmentation by long term
analysis of point trajectories,” in European Conference on
Computer Vision, 2010, pp. 282–295.
[10] P. Ochs and T. Brox, “Object segmentation in video: a hierarchi-
cal variational approach for turning point trajectories into dense
regions,” in Proceedings of the IEEE International Conference
on Computer Vision, 2011, pp. 1583–1590.
[11] P. Ochs, J. Malik, and T. Brox, “Segmentation of moving objects
by long term video analysis,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 36, no. 6, pp. 1187–
1200, 2013.
[12] M. Keuper, B. Andres, and T. Brox, “Motion trajectory seg-
mentation via minimum cost multicuts,” in Proceedings of the
IEEE International Conference on Computer Vision, 2015, pp.
[13] K. Fragkiadaki, G. Zhang, and J. Shi, “Video segmentation by
tracing discontinuities in a trajectory embedding,” in Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2012, pp. 1846–1853.
[14] D. Zhang, O. Javed, and M. Shah, “Video object segmentation
through spatially accurate and temporally dense extraction of
primary object regions,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2013, pp. 628–
[15] Y. J. Lee, J. Kim, and K. Grauman, “Key-segments for video
object segmentation,” in Proceedings of the IEEE International
Conference on Computer Vision, 2011, pp. 1995–2002.
[16] C. Ventura, M. Bellver, A. Girbau, A. Salvador, F. Marques, and
X. Giro-i Nieto, “RVOS: End-to-end recurrent network for video
object segmentation,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2019, pp. 5277–
[17] W. Wang, X. Lu, J. Shen, D. J. Crandall, and L. Shao,
“Zero-shot video object segmentation via attentive graph neural
networks,” in Proceedings of the IEEE International Conference
on Computer Vision, 2019, pp. 9236–9245.
[18] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell,
“Zero-shot learning with semantic output codes,” in Advances in
Neural Information Processing Systems, 2009, pp. 1410–1418.
[19] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers,
and L. Van Gool, “One-shot video object segmentation,”
in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2017, pp. 221–230.
[20] L. L. Cloutman, “Interaction between dorsal and ventral pro-
cessing streams: Where, when and how?” Brain and Language,
vol. 127, no. 2, pp. 251 – 263, 2013.
[21] P. Mital, T. J. Smith, S. Luke, and J. Henderson, “Do low-level
visual features have a causal influence on gaze during dynamic
scene viewing?” Journal of Vision, vol. 13, no. 9, pp. 144–144,
[22] E. S. Spelke, “Principles of object perception,” Cognitive sci-
ence, vol. 14, no. 1, pp. 29–56, 1990.
[23] Y. Ostrovsky, E. Meyers, S. Ganesh, U. Mathur, and P. Sinha,
“Visual parsing after recovery from blindness,” Psychological
Science, vol. 20, no. 12, pp. 1484–1491, 2009.
[24] S. E. Palmer, Vision science: Photons to phenomenology. MIT
press, 1999.
[25] M. Wertheimer, “Laws of organization in perceptual forms.”
[26] P. Tokmakov, K. Alahari, and C. Schmid, “Learning motion
patterns in videos,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2017, pp. 3386–
[27] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and
A. Sorkine-Hornung, “Learning video object segmentation from
static images,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2017, pp. 2663–
[28] S. D. Jain, B. Xiong, and K. Grauman, “Fusionseg: Learning to
combine motion and appearance for fully automatic segmenta-
tion of generic objects in videos,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2017,
pp. 2117–2126.
[29] J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang, “Segflow: Joint
learning for video object segmentation and optical flow,” in
Proceedings of the IEEE International Conference on Computer
Vision, 2017, pp. 686–695.
[30] H. Xiao, J. Feng, G. Lin, Y. Liu, and M. Zhang, “Monet:
Deep motion exploitation for video object segmentation,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 1140–1148.
[31] S. Li, B. Seybold, A. Vorobyov, X. Lei, and C.-C. Jay Kuo,
“Unsupervised video object segmentation with motion-based bi-
lateral networks,” in European Conference on Computer Vision,
2018, pp. 207–223.
[32] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool,
M. Gross, and A. Sorkine-Hornung, “A benchmark dataset
and evaluation methodology for video object segmentation,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2016, pp. 724–732.
[33] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung,
and L. Van Gool, “The 2017 DAVIS challenge on video
object segmentation,” arXiv preprint arXiv:1704.00675, 2017.
[34] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari,
“Learning object class detectors from weakly annotated video,”
in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2012, pp. 3282–3289.
[35] S. Mathe and C. Sminchisescu, “Actions in the eye: Dynamic
gaze datasets and learnt saliency models for visual recognition,”
IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 37, no. 7, pp. 1408–1424, 2014.
[36] T. Zhou, S. Wang, Y. Zhou, Y. Yao, J. Li, and L. Shao, “Motion-attentive
transition for zero-shot video object segmentation,”
in Proceedings of AAAI Conference on Artificial Intelligence,
[37] F. Perazzi, O. Wang, M. Gross, and A. Sorkine-Hornung,
“Fully connected object proposals for video segmentation,” in
Proceedings of the IEEE International Conference on Computer
Vision, 2015, pp. 3227–3234.
[38] I. Endres and D. Hoiem, “Category independent object propos-
als,” in European Conference on Computer Vision, 2010, pp.
[39] R. Yao, G. Lin, S. Xia, J. Zhao, and Y. Zhou, “Video object
segmentation and tracking: A survey,” ACM Transactions on
Intelligent Systems and Technology, vol. 11, no. 4, pp. 1–47,
[40] W. Wang, J. Shen, and L. Shao, “Video salient object detection
via fully convolutional networks,” IEEE Transactions on Image
Processing, vol. 27, no. 1, pp. 38–49, 2018.
[41] X. Lu, W. Wang, C. Ma, J. Shen, L. Shao, and F. Porikli, “See
more, know more: Unsupervised video object segmentation with
co-attention siamese networks,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2019,
pp. 3623–3632.
[42] Y.-T. Hu, J.-B. Huang, and A. G. Schwing, “Unsupervised
video object segmentation using motion saliency-guided spatio-
temporal propagation,” in European Conference on Computer
Vision, 2018, pp. 786–802.
[43] W. Wang, H. Song, S. Zhao, J. Shen, S. Zhao, S. C. H. Hoi,
and H. Ling, “Learning unsupervised video object segmentation
through visual attention,” in Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2019, pp.
[44] P. Tokmakov, C. Schmid, and K. Alahari, “Learning to segment
moving objects,” International Journal of Computer Vision, vol.
127, no. 3, pp. 282–301, 2019.
[45] S. Li, B. Seybold, A. Vorobyov, A. Fathi, Q. Huang, and C.-C.
Jay Kuo, “Instance embedding transfer to unsupervised video
object segmentation,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2018, pp. 6526–
[46] T. Zhuo, Z. Cheng, P. Zhang, Y. Wong, and M. Kankanhalli,
“Unsupervised online video object segmentation with motion
property understanding,” IEEE Transactions on Image Process-
ing, vol. 29, pp. 237–249, 2019.
[47] X. Lu, W. Wang, M. Danelljan, T. Zhou, J. Shen, and
L. Van Gool, “Video object segmentation with episodic graph
memory networks,” in European Conference on Computer Vi-
sion, 2020.
[48] P. Tokmakov, K. Alahari, and C. Schmid, “Learning video
object segmentation with visual memory,” in Proceedings of
the IEEE International Conference on Computer Vision, 2017,
pp. 4481–4490.
[49] M. Faisal, I. Akhter, M. Ali, and R. Hartley, “Exploiting
geometric constraints on dense trajectories for motion saliency,”
in Winter Conference on Applications of Computer Vision, 2019.
[50] K. Simonyan and A. Zisserman, “Two-stream convolutional
networks for action recognition in videos,” in Advances in
Neural Information Processing Systems, 2014, pp. 568–576.
[51] C. Feichtenhofer, A. Pinz, and R. Wildes, “Spatiotemporal
residual networks for video action recognition,” in Advances in
Neural Information Processing Systems, 2016, pp. 3468–3476.
[52] C. Feichtenhofer, A. Pinz, and R. P. Wildes, “Spatiotemporal
multiplier networks for video action recognition,” in Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2017, pp. 4768–4777.
[53] H. Song, W. Wang, S. Zhao, J. Shen, and K.-M. Lam, “Pyramid
dilated deeper ConvLSTM for video salient object detection,” in
European Conference on Computer Vision, 2018, pp. 715–731.
[54] H. Li, G. Chen, G. Li, and Y. Yu, “Motion guided attention
for video salient object detection,” in Proceedings of the IEEE
International Conference on Computer Vision, 2019, pp. 7274–
[55] C. Bak, A. Kocak, E. Erdem, and A. Erdem, “Spatio-temporal
saliency networks for dynamic saliency prediction,” IEEE
Transactions on Multimedia, vol. 20, no. 7, pp. 1688–1698,
[56] L. Jiang, M. Xu, T. Liu, M. Qiao, and Z. Wang, “Deepvs: A deep
learning based video saliency prediction approach,” in European
Conference on Computer Vision, 2018, pp. 625–642.
[57] W. Wang, J. Shen, J. Xie, M.-M. Cheng, H. Ling, and A. Borji,
“Revisiting video saliency prediction in the deep learning era,”
IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. PP, pp. 1–1, 2019.
[58] Q. Lai, W. Wang, H. Sun, and J. Shen, “Video saliency
prediction using spatiotemporal residual attentive networks,”
IEEE Transactions on Image Processing, vol. 29, pp. 1113–
1126, 2019.
[59] K. Xu, L. Wen, G. Li, L. Bo, and Q. Huang, “Spatiotem-
poral cnn for video object segmentation,” arXiv preprint
arXiv:1904.02363, 2019.
[60] N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsupervised
learning of video representations using LSTMs,” in Proceedings
of the International Conference on Machine Learning,
2015, pp. 843–852.
[61] G. Li, Y. Xie, T. Wei, K. Wang, and L. Lin, “Flow guided
recurrent neural encoder for video salient object detection,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 3243–3252.
[62] Y. Yu, J. Choi, Y. Kim, K. Yoo, S.-H. Lee, and G. Kim,
“Supervising neural attention models for video captioning by
human gaze data,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2017, pp. 490–498.
[63] O. Le Meur, P. Le Callet, D. Barba, and D. Thoreau, “A
coherent computational approach to model bottom-up visual
attention,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 28, no. 5, pp. 802–817, 2006.
[64] N. Bruce and J. Tsotsos, “Saliency based on information
maximization,” in Advances in Neural Information Processing
Systems, 2006, pp. 155–162.
[65] J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,”
in Advances in Neural Information Processing Systems, 2007,
pp. 545–552.
[66] J. M. Wolfe, K. R. Cave, and S. L. Franzel, “Guided search: an
alternative to the feature integration model for visual search,”
Journal of Experimental Psychology: Human Perception and
Performance, vol. 15, no. 3, p. 419, 1989.
[67] C. Koch and S. Ullman, “Shifts in selective visual attention:
towards the underlying neural circuitry,” in Matters of intelli-
gence. Springer, 1987, pp. 115–141.
[68] W. Wang, J. Shen, F. Guo, M.-M. Cheng, and A. Borji,
“Revisiting video saliency: A large-scale benchmark and a new
model,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2018, pp. 4894–4903.
[69] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine trans-
lation by jointly learning to align and translate,” Proceedings
of the International Conference on Learning Representations,
[70] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,”
in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 7132–7141.
[71] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, “CBAM: Con-
volutional block attention module,” in European Conference on
Computer Vision, 2018, pp. 3–19.
[72] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked
attention networks for image question answering,” in Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2016, pp. 21–29.
[73] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[74] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang,
and X. Tang, “Residual attention network for image classifi-
cation,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2017, pp. 3156–3164.
[75] S. Xie and Z. Tu, “Holistically-nested edge detection,” in
Proceedings of the IEEE International Conference on Computer
Vision, 2015, pp. 1395–1403.
[76] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and
A. L. Yuille, “Deeplab: Semantic image segmentation with
deep convolutional nets, atrous convolution, and fully connected
crfs,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[77] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel
matters–improve semantic segmentation by global convolutional
network,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2017, pp. 4353–4361.
[78] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “PWC-Net: CNNs
for optical flow using pyramid, warping, and cost volume,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 8934–8943.
[79] S. Caelles, J. Pont-Tuset, F. Perazzi, A. Montes, K.-K.
Maninis, and L. Van Gool, “The 2019 DAVIS challenge on
VOS: Unsupervised multi-object segmentation,” arXiv preprint
arXiv:1905.00737, 2019.
[80] Y. J. Koh and C.-S. Kim, “Primary object segmentation in
videos based on region augmentation and reduction,” in Pro-
ceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2017, pp. 7417–7425.
[81] M. Siam, C. Jiang, S. Lu, L. Petrich, M. Gamal, M. Elhoseiny,
and M. Jagersand, “Video object segmentation using teacher-
student adaptation in a human robot interaction (hri) setting,”
in International Conference on Robotics and Automation, 2019,
pp. 50–56.
[82] Z. Yang, Q. Wang, L. Bertinetto, W. Hu, S. Bai, and P. H. Torr,
“Anchor diffusion for unsupervised video object segmentation,”
in Proceedings of the IEEE International Conference on Com-
puter Vision, 2019, pp. 931–940.
[83] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,”
in Proceedings of the IEEE International Conference
on Computer Vision, 2017, pp. 2961–2969.
[84] J. Luiten, P. Voigtlaender, and B. Leibe, “Premvos: Proposal-
generation, refinement and merging for video object segmenta-
tion,” in ACCV, 2018, pp. 565–580.
[85] J. Luiten, I. E. Zulfikar, and B. Leibe, “Unovost: Unsupervised
offline video object segmentation and tracking,” in Winter Con-
ference on Applications of Computer Vision, 2020, pp. 2000–
[86] X. Huang, C. Shen, X. Boix, and Q. Zhao, “Salicon: Reducing
the semantic gap in saliency prediction by adapting deep neural
networks,” in Proceedings of the IEEE International Conference
on Computer Vision, 2015, pp. 262–270.
[87] N. Xu, L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. Price,
S. Cohen, and T. Huang, “Youtube-vos: Sequence-to-sequence
video object segmentation,” in European Conference on Com-
puter Vision, 2018, pp. 585–601.
[88] P. Linardos, E. Mohedano, J. J. Nieto, N. E. O’Connor, X. Giro-
i Nieto, and K. McGuinness, “Simple vs complex temporal
recurrences for video saliency prediction,” in Proceedings of
the British Machine Vision Conference, 2019.
[89] K. Min and J. J. Corso, “Tased-net: Temporally-aggregating spa-
tial encoder-decoder network for video saliency detection,” in
Proceedings of the IEEE International Conference on Computer
Vision, 2019, pp. 2394–2403.
[90] T.-W. Hui, X. Tang, and C. C. Loy, “Liteflownet: A lightweight
convolutional neural network for optical flow estimation,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 8981–8989.
[91] A. Ranjan and M. J. Black, “Optical flow estimation using a
spatial pyramid network,” in Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2017, pp.