MATNet: Motion-Attentive Transition Network for
Zero-Shot Video Object Segmentation
Tianfei Zhou, Jianwu Li, Shunzhou Wang, Ran Tao, Senior Member, IEEE
and Jianbing Shen, Senior Member, IEEE
Abstract—In this paper, we present a novel end-to-end learning
neural network, i.e., MATNet, for zero-shot video object segmen-
tation (ZVOS). Motivated by the human visual attention behavior,
MATNet leverages motion cues as a bottom-up signal to guide the
perception of object appearance. To achieve this, an asymmetric
attention block, named Motion-Attentive Transition (MAT), is
proposed within a two-stream encoder network to firstly identify
moving regions and then attend appearance learning to capture
the full extent of objects. Putting MATs in different convolutional
layers, our encoder becomes deeply interleaved, allowing for
close hierarchical interactions between object appearance and motion. Such a biologically-inspired design proves superior to conventional two-stream structures, which treat motion and
appearance independently in separate streams and often suffer
severe overfitting to object appearance. Moreover, we introduce a
bridge network to modulate multi-scale spatiotemporal features
into more compact, discriminative and scale-sensitive representa-
tions, which are subsequently fed into a boundary-aware decoder
network to produce accurate segmentation with crisp boundaries.
We perform extensive quantitative and qualitative experiments
on four challenging public benchmarks, i.e., DAVIS16, DAVIS17,
FBMS and YouTube-Objects. Results show that our method
achieves compelling performance against current state-of-the-
art ZVOS methods. To further demonstrate the generalization
ability of our spatiotemporal learning framework, we extend
MATNet to another relevant task: dynamic visual attention
prediction (DVAP). The experiments on two popular datasets (i.e.,
Hollywood-2 and UCF-Sports) further verify the superiority of
our model¹.
Index Terms—Video object segmentation, zero-shot, two-
stream, spatiotemporal representation, neural attention, dynamic
visual attention prediction.
I. INTRODUCTION
The task of automatically identifying primary object(s) from
videos has gained significant attention over the past decade,
owing to its academic value and practical significance in many
areas, such as robotics [1], video compression [2], human-
object interaction [3] and autonomous driving [4]. However,
due to the lack of human intervention, in addition to typical
challenging factors posed by video data (e.g., occlusions,
This work was supported in part by the Beijing Natural Science Foundation
(No. L191004) and the National Natural Science Foundation of China (No.
61271374). (Corresponding author: Jianwu Li)
T. Zhou and J. Shen are with Inception Institute of Artificial Intelligence,
Abu Dhabi, UAE. (Email: {ztfei.debug, shenjianbingcg}@gmail.com)
J. Li and S. Wang are with Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology,
Beijing, China. (Email: ljw@bit.edu.cn)
R. Tao is with School of Information and Electronics, Beijing Institute
of Technology, Beijing, China, and also with Beijing Key Laboratory of
Fractional Signals and Systems, Beijing, China.
¹Our code is available at https://github.com/tfzhou/MATNet
motion blur, object deformations, cluttered background), the
task suffers from great difficulties in accurately distinguishing
the most prominent objects throughout a video sequence.
Early non-learning methods are built upon handcrafted
features (e.g., motion boundary [5], saliency [6, 7, 8], point
trajectories [9, 10, 11, 12, 13]) and rely heavily on classic
heuristics in video segmentation (e.g., object proposal rank-
ing [14, 15], spatiotemporal coherency [5], long-term trajectory
clustering [9]). Although these methods can work in a purely
unsupervised way, they suffer the limited representability of
the handcrafted features. More recently, research has turned to-
wards the deep learning paradigm, with several studies [16, 17] casting this problem in a zero-shot setting. These approaches
follow the zero-shot learning paradigm [18] to learn from
large-scale video data and can generalize well to test videos
that never appear in the training set, without any human
involvement. This is different from one-shot video object seg-
mentation (OVOS)[19], which requires first-frame annotations
for model adaption to test data in the inference phase.
Even before the era of deep learning, object motion has
always been considered as one of the most important cues for
automatic video object segmentation. This is largely inspired
by the human vision system (HVS), which has remarkable
motion perception capabilities to quickly orient human at-
tention to moving objects in dynamic scenes [20, 21]. In
fact, it has been demonstrated that infants [22] and newly
sighted congenitally blind people [23] tend to over-segment
static objects, even if they are strongly contrasted against
their surroundings; however, they can easily group things
together once the objects start moving, following the Gestalt
principle of common fate [24, 25]. These abilities enable us
to easily discover never-before-seen moving objects, before
knowing their particular semantic names. Motion surely does
not work alone. Recent studies [20] have revealed that, in
HVS, motion-based perception appears early, while static
perception is acquired later, and possibly bootstrapped by
motion cues to focus more on processing the most salient
objects. In this work, we take these biological mechanisms into
account and design our model to reflect such human behaviors,
i.e., first orienting rough attention to the moving parts of
objects, and then transferring the attention to object appearance
which provides a generic objectness prior for capturing the
whole picture of objects. In this way, our model is able to
learn a more effective spatiotemporal object representation,
encouraging more robust video object segmentation.
By considering knowledge propagation from object motion
to appearance, valuable temporal context can be exploited to
Fig. 1: Pipeline of MATNet. The frame $I_a$ and flow $I_m$ are first input into the interleaved encoder to extract multi-scale spatiotemporal features $\{\hat{U}_i\}_{i=2}^{5}$. At each residual stage $i$, we break the original information flow in ResNet. Instead, a MAT module is proposed to create a new interleaved information flow by simultaneously considering motion $V_{m,i}$ and appearance $V_{a,i}$. $\hat{U}_i$ is further fed into the boundary-aware decoder via the bridge network to obtain boundary results $M_2^b$ to $M_5^b$ and segmentation results $M^s$.
alleviate possible ambiguities in object appearance (e.g., visual
similarity to the background), thus facilitating representation
learning. However, in the context of deep learning, current
segmentation models often overlook this potential. Most prior
works [26, 27, 28, 29] simply treat motion cues as being equal
to appearance and learn to predict segmentation masks from
motion or appearance independently. Some approaches [30,
31] utilize motion cues to enrich object representations, but
they rely on complex heuristics and only work at a single
scale, ignoring the critical hierarchical structure.
Motivated by these observations, we propose a Motion-
Attentive Transition Network (MATNet) for zero-shot video
object segmentation (ZVOS). Fig. 1 illustrates its pipeline,
which has an encoder-bridge-decoder structure. The core of
MATNet is a deeply interleaved two-stream encoder which
not only inherits the advantages of traditional two-stream
networks for multi-modal learning, but also progressively
transfers intermediate motion features to facilitate more robust
appearance learning. The transition is carried out by multiple
Motion-Attentive Transition (MAT) modules. Each MAT takes
as input the intermediate features of both the input image
and optical flow field at a convolutional stage, and produces
informative spatiotemporal features for the following stage.
For each MAT, an asymmetric attention mechanism is built
to first infer regions of interest based on optical flow, and
then transfer the inference to provide better selectivity for
appearance features. The deep interleaved structure captures
the intrinsic characteristics of the human vision system and
brings immediate improvement in segmentation accuracy.
Given the powerful spatiotemporal features from the en-
coder, we design a decoder network to infer pixel-accurate
object segmentation through a top-down refinement process.
The decoder progressively refines high-level semantic fea-
tures using spatially rich low-level features via a cascade
of Boundary-Aware Refinement (BAR) modules. Each BAR
is responsible for generating features with finer structures,
under the assistance of salient boundary detection. Beyond
traditional methods that connect the encoder and decoder via
skip connections, we introduce a lightweight attention module,
i.e., Scale-Sensitive Attention (SSA), to connect each pair
of encoder and decoder layers. SSA adaptively modulates
spatiotemporal convolution features before sending them to
the decoder. More specifically, SSA is a two-level attention
scheme in which the local level serves to highlight the most
informative features and suppress useless information, while
the global level helps to re-calibrate features for objects with
different scales.
MATNet can be easily instantiated with various backbones,
and optimized in an end-to-end manner. We perform exten-
sive experiments on four popular video object segmentation
datasets, i.e., DAVIS16 [32], DAVIS17 [33], FBMS [11] and
YouTube-Objects [34], in which the proposed method yields
consistent performance improvement over the state-of-the-arts.
Additionally, we showcase the advantages and generalizability
of our framework via the task of dynamic visual attention
prediction (DVAP). Our model is proven to generalize well to
the DVAP task and produce reliable dynamic-fixation predic-
tion results over two large-scale benchmarks, i.e., Hollywood-
2 [35] and UCF-Sports [35].
In summary, the main contributions of this paper are
three-fold: First, we propose a novel interleaved two-stream
network architecture to learn powerful spatiotemporal object
representations for ZVOS. This is achieved by an asymmetric
attention module, i.e., MAT, that accounts for object motion
and appearance interactions in a more comprehensive way.
Second, we introduce a boundary-aware decoder to obtain seg-
mentation with crisp object boundaries. The decoder learned
with a novel adapted cross-entropy loss produces accurate
boundaries in regions of primary objects. Third, based on
these designs, our MATNet consistently outperforms state-
of-the-art methods over several ZVOS benchmarks and also
shows superior performance in instance-level segmentation
and the DVAP task.
This paper builds upon our conference paper [36] and sig-
nificantly extends it in various aspects: First, to demonstrate
the effectiveness of our model, we extend it to an instance-
level segmentation setting, which is more challenging and
essential for practical cases in which multiple instances may
appear. Second, we examine our model for the DVAP task,
and it outperforms all specialized methods on two large-
scale benchmarks, demonstrating its generality. Third, we
also provide a more inclusive and insightful overview of the
recent work on video object segmentation, motion-aware video
analysis and dynamic visual attention prediction. Last but not
least, we report much more experimental results and conduct
more ablation studies (e.g., attribute-based analysis, impacts
of different optical flow methods) for thorough and in-depth
examinations of our model.
II. RELATED WORK
Our model is related to four lines of research, i.e., auto-
matic video object segmentation, motion-aware modeling in
video analysis, dynamic visual attention prediction and neural
attention. We will briefly discuss each of them.
A. Automatic Video Object Segmentation
A large number of methods have been proposed for auto-
matic (or unsupervised) video object segmentation, which aims to segment conspicuous and eye-catching objects without any
human intervention. Many non-deep learning methods are
based on hand-crafted features and rely on certain heuristics
(e.g., saliency, object proposal ranking, trajectory clustering).
For instance, [6, 7, 8] take visual saliency as prior knowledge
to guide object segmentation, while [7, 14, 15, 37] infer the
object regions from hundreds of object candidates [38]. Object
motion is also widely used as a reliable cue for identifying
objects. [5] detects motion boundaries to determine foreground
regions. [9, 10, 11, 12, 13] take advantage of long-term point
trajectories for motion segmentation, making them more robust
to occlusions. Please refer to [39] for a more comprehensive
review of these approaches.
In recent years, with the renaissance of neural networks
in computer vision, deep learning based solutions are now
dominant in this field. Many approaches[16, 26, 28, 31, 40, 41,
42, 43, 44, 45, 46, 47] solve the task with zero-shot solutions,
which require no additional annotations during inference and
are thus more flexible for automatic video analysis. For exam-
ple, [43] proposes a dynamic visual attention-driven model for
video object segmentation, and [17, 41] mine higher-order re-
lations between video frames, resulting in more comprehensive
understanding of video content and more accurate foreground
estimation. However, these approaches only rely on object
appearance, and can thus easily fail in cases where objects are
visually similar to the background. To cope with this, many
approaches discover the motion patterns of objects [26] as
complementary cues to object appearance. This is typically
achieved within two-stream network architectures [28, 48],
in which an RGB image and the corresponding flow field
are separately processed by two independent networks and
the results are fused to produce the final segmentation. Some
methods [42, 49] design complex heuristics to fuse motion
and appearance for better segmentation. However, a major
drawback of these approaches is that they fail to consider
the importance of deep interactions between appearance and
motion in learning rich spatiotemporal features. To address
this issue, we propose a deep interleaved two-stream encoder,
in which a motion transition module is leveraged for more
effective representation learning.
B. Motion-Aware Modeling in Video Analysis
Deep learning models have been widely used in various
video-related tasks, such as action recognition [50, 51, 52],
video salient object detection [40, 53, 54] and dynamic visual
attention prediction [55, 56, 57, 58]. The most significant
difference between static images and videos is that objects
in videos are moving, which is a key factor that draws human
attention. Therefore, how to involve object motion into the
design of neural networks has been a critical issue in deep
learning-based video analysis.
Many approaches [40, 59] learn temporal coherence in-
formation by simply feeding consecutive frames into fully
convolutional networks. These methods are computationally
efficient; however, since they do not employ explicit motion
information (e.g., optical flow), they are sensitive to cluttered
and distracting backgrounds. Some other models consider
recurrent neural networks to capture long-range spatiotemporal
features [53, 60, 61]. However, all these models ignore the
complementary roles of spatial and temporal information.
This issue has been well addressed by the famous two-
stream ConvNet architecture proposed in [50], which consists
of spatial and temporal networks to better capture the comple-
mentary information of object appearance and motion. It has
achieved great success on human action recognition in videos.
Along this line, [51] injects residual connections into the
two-stream architecture to allow spatiotemporal interactions
between two modalities, while [44, 52] further improve such
a spatiotemporal residual network with multiplicative gating
functions. These two-stream architectures have also shown
strong performance in video object processing tasks, like video
object segmentation [28, 29, 54] and dynamic video attention
prediction [55, 58]. Despite this, current two-stream networks
tend to fuse motion and appearance features with a simple
gating mechanism and are limited in their use of local context.
In this work, we reconsider the interactions between object
motion and appearance with an asymmetric attention mod-
ule, which utilizes motion-attentive features to enhance the appearance features in a hierarchical manner. The powerful
representation ability of our model is verified in both ZVOS
and DVAP tasks.
C. Dynamic Visual Attention Prediction
Dynamic visual attention prediction, or dynamic fixation
prediction, is a topic closely related to ZVOS. Rather than targeting object-level saliency prediction, DVAP aims to identify
observers’ fixations during dynamic scene viewing. The task is
useful for machines to understand human attentional behaviors
and has shown great potential in many practical applications
(e.g., object segmentation [43], video captioning [62]). Early
DVAP methods [63, 64, 65] largely relied on hand-crafted, biologically-inspired features (e.g., color, optical flow) and theories of visual attention from cognitive science (e.g., guided search [66], attention shift [67]). Recently, deep learning-based methods have become mainstream and generally yield better
performance. Representative works use two-stream networks
to account for multi-modal features [55, 58] or LSTMs for
sequence fixation prediction over consecutive frames [68].
Fig. 2: Computational graph of MAT. The circled symbols indicate matrix multiplication and concatenation operations, respectively.
Although MATNet is originally designed for the object-
aware segmentation task, we show that it also achieves re-
markable performance on the DVAP task (V-C). This can
be largely attributed to the proposed encoder network, which
can provide informative spatiotemporal features to capture the
most important parts of the visual stimuli.
D. Neural Attention
Neural attention mechanisms, which are derived from hu-
man perception, have been widely studied in deep neural
networks and yield significant improvements for various tasks,
e.g., neural machine translation [69], object recognition [70,
71], and visual question answering [54, 72], to name a few
representative ones. Neural attention simulates the human selective attention mechanism, allowing the networks to focus
on the most informative parts of the inputs.
Neural attention mechanisms have also been used in recent
ZVOS approaches [17, 41], which aim to mine consistent
object patterns among video frames. Our idea is fundamentally different from theirs. We propose an asymmetric attention
module (i.e., MAT) to mimic human attention behavior in
dynamic scenarios. It encourages more comprehensive in-
teractions between object motion and appearance, yielding
more powerful spatiotemporal features. Besides, we extend
MAT into a deeper version to conduct multi-step reasoning of
spatiotemporal attention, which can highlight more accurate
target regions, especially for complex scenarios. In addition,
MATs are incorporated into multiple convolutional layers,
leading to an entirely different network architecture, which
is expected to benefit various video analysis tasks.
III. PROPOSED METHOD
A. Network Overview
We propose an end-to-end deep neural network, i.e., MAT-
Net, for ZVOS, which leverages motion cues to effectively
bootstrap the perception of object appearance. More specifi-
cally, our approach is designed as a unified framework of three tightly coupled sub-networks: Interleaved Encoder Network, Bridge Network and Boundary-Aware Decoder Network. The pipeline
is illustrated in Fig. 1.
Fig. 3: Illustration of the effects of the MAT module. (a) and (b) are the input images and optical flow fields. (c) and (d) denote feature maps in $V_a$ and $\hat{U}_a$, respectively. As seen, the MAT module can effectively emphasize important object regions and suppress background responses, benefiting the segmentation.
1) Interleaved Encoder Network: The encoder resorts to a
two-stream structure to jointly capture the spatial and temporal
information, which has been proven effective in many related
video analysis tasks [50, 51, 52]. In contrast to previous works,
which treat the two streams equally, our encoder incorporates
a MAT module (III-B) into each network layer, which offers
a motion-to-appearance pathway for information exchange.
Such a design enables us to learn more powerful spatiotempo-
ral object representations. More technically, we take the first five convolutional blocks of ResNet-101 [73] as the backbone for each stream. Given an RGB frame $I_a \in \mathbb{R}^{w\times h\times 3}$ and its optical flow field $I_m \in \mathbb{R}^{w\times h\times 3}$, the encoder first extracts intermediate appearance and motion features separately at the $i$-th ($i \in \{2,3,4,5\}$) residual stage, denoted as $V_{a,i} \in \mathbb{R}^{W\times H\times C}$ and $V_{m,i} \in \mathbb{R}^{W\times H\times C}$, where $W$, $H$ and $C$ represent the spatial width, height and channel number of the feature tensors, respectively. The features are subsequently enhanced by a MAT module $\mathcal{F}_{\text{MAT}}$ as:
$$\hat{U}_{a,i},\ \hat{U}_{m,i} = \mathcal{F}_{\text{MAT}}(V_{a,i}, V_{m,i}), \qquad (1)$$
where $\hat{U}_{\cdot,i} \in \mathbb{R}^{W\times H\times C}$ represents the enriched features. For the $i$-th stage, the spatiotemporal object representation $\hat{U}_i$ is obtained as $\hat{U}_i = \text{Concat}(\hat{U}_{a,i}, \hat{U}_{m,i}) \in \mathbb{R}^{W\times H\times 2C}$, which is further fed into the downstream decoder via a bridge network.
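To make the interleaving concrete, the following is a minimal PyTorch sketch (not the released implementation) of one encoder stage wrapping Eq. (1). The class name EncoderStage, the way enriched features are forwarded to the next stage, and the module interfaces are illustrative assumptions.

```python
# Illustrative sketch only: one interleaved encoder stage wrapping Eq. (1).
# `app_block`/`mot_block` stand for the i-th ResNet stage of each stream and
# `mat` for a MAT module; how the outputs feed stage i+1 is an assumption.
import torch
import torch.nn as nn


class EncoderStage(nn.Module):
    def __init__(self, app_block: nn.Module, mot_block: nn.Module, mat: nn.Module):
        super().__init__()
        self.app_block = app_block   # appearance-stream residual stage
        self.mot_block = mot_block   # motion-stream residual stage
        self.mat = mat               # Motion-Attentive Transition module

    def forward(self, x_a, x_m):
        v_a = self.app_block(x_a)            # V_{a,i}
        v_m = self.mot_block(x_m)            # V_{m,i}
        u_a, u_m = self.mat(v_a, v_m)        # Eq. (1): enriched features
        u_i = torch.cat([u_a, u_m], dim=1)   # \hat{U}_i, sent to the bridge network
        return u_a, u_m, u_i                 # enriched features continue to the next stage
```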
2) Bridge Network: The bridge network is responsible for
selecting informative spatiotemporal features for the decoder.
It is built upon several SSA modules (III-C), each of which
takes advantage of $\hat{U}_i$ at the $i$-th stage, attending to it both locally and globally to produce the attentive feature $Z_i$, with a
unified attention module. The local attention adopts channel-
wise and spatial-wise attention mechanisms to highlight the
correct object regions and suppress possible noise existing in
the redundant features, while the global attention aims to re-
calibrate the features to account for objects of different sizes.
3) Boundary-Aware Decoder Network: The decoder network adopts a coarse-to-fine scheme to conduct segmentation inference. It consists of four BAR modules (III-D), i.e., $\mathcal{F}_{\text{BAR}_i}$, $i \in \{2,3,4,5\}$, each corresponding to the $i$-th residual block. From $\mathcal{F}_{\text{BAR}_5}$ to $\mathcal{F}_{\text{BAR}_2}$, the resolution of the feature maps gradually increases by compensating high-level coarse features with more low-level details. $\mathcal{F}_{\text{BAR}_2}$ produces the finest feature maps, whose resolution is $1/4$ of the input image size. They are sequentially processed by three additional layers, i.e., conv($3\times 3$, 1), upsampling and sigmoid, to obtain the final mask output $M^s \in \mathbb{R}^{w\times h}$.
In the following, we introduce the proposed modules (i.e., MAT, SSA, BAR) in detail. For simplicity, we omit the subscript $i$.
B. Motion-Attentive Transition Module
Each MAT module is comprised of two soft attention units
and one attention transition unit, as depicted in Fig. 2. The
soft attention units help to emphasize the most informative
regions in the appearance or motion feature maps, while
the transition unit transfers the attentive motion features to
facilitate spatiotemporal feature learning.
1) Soft Attention: This unit softly weights the input feature map $V_m$ (or $V_a$) at each spatial location. Taking $V_m$ as an example, this unit outputs a motion-attentive feature $U_m \in \mathbb{R}^{W\times H\times C}$ as follows:
$$\text{Softmax attention: } A_m = \text{softmax}(W_m(V_m)),$$
$$\text{Attention-enhanced feature: } U_m^c = A_m \odot V_m^c, \qquad (2)$$
where $W_m$ is a $1\times 1$ convolution that transforms $V_m$ into an importance map, which is normalized using a softmax operation to generate an attention map $A_m \in \mathbb{R}^{W\times H}$, where $\sum_{i=1}^{W\times H} A_m^i = 1$. Here, each value $A_m^i$ is the probability with which our model believes the corresponding location is important. $U_m^c$ and $V_m^c$ indicate the 2D feature slices of $U_m$ and $V_m$ at the $c$-th channel, respectively, and $\odot$ denotes the Hadamard product. Similarly, given $V_a$, we can obtain the appearance-attentive feature $U_a$ by Eq. (2).
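For illustration, a minimal PyTorch sketch of this unit is given below. It is an interpretation of Eq. (2) under a (B, C, H, W) tensor layout, not the released code; the same module would be applied separately to the motion and appearance features.

```python
# Sketch of the soft attention unit in Eq. (2): a 1x1 conv produces an
# importance map, softmax-normalized over all H*W locations, applied per channel.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)   # W_m (or W_a)

    def forward(self, v):                                    # v: (B, C, H, W)
        b, c, h, w = v.shape
        scores = self.conv(v).view(b, -1)                    # importance map, (B, H*W)
        attn = F.softmax(scores, dim=1).view(b, 1, h, w)     # A, sums to 1 per sample
        return attn * v                                      # U^c = A (Hadamard) V^c
```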
2) Attention Transition: To transfer the motion-attentive features $U_m$, we first seek the affinity between $U_a$ and $U_m$ in a non-local manner, using the following multi-modal bilinear model:
$$S = U_m^{\top} W U_a \in \mathbb{R}^{(WH)\times(WH)}, \qquad (3)$$
where $W \in \mathbb{R}^{C\times C}$ is a trainable weight matrix. The affinity matrix $S$ can effectively capture pairwise relationships between the two feature spaces. However, it also introduces a huge number of parameters, which increases the computational cost and creates the risk of over-fitting. To overcome this problem, $W$ is approximately factorized into two low-rank matrices $P \in \mathbb{R}^{C\times\frac{C}{d}}$ and $Q \in \mathbb{R}^{C\times\frac{C}{d}}$, where $d$ ($d > 1$) is a reduction ratio. Then, Eq. (3) can be rewritten as:
$$S = U_m^{\top} P Q^{\top} U_a = (P^{\top} U_m)^{\top}(Q^{\top} U_a). \qquad (4)$$
This operation is equal to applying channel-wise feature transformations to $U_m$ and $U_a$ before computing the similarity. Its advantages over Eq. (3) are three-fold: 1) it reduces the number of parameters to $2/d$ of the original; 2) it requires much fewer multiplication operations: Eq. (3) needs $WHC^2 + W^2H^2C$ multiplications, while Eq. (4) only requires $(2WHC^2 + W^2H^2C)/d$; 3) it helps to generate a compact channel-wise feature representation for each modality.
Then, we normalize $S$ row-wise to derive an attention map $S_r$ conditioned on motion features and achieve enhanced appearance features $\hat{U}_a \in \mathbb{R}^{W\times H\times C}$:
$$\text{Motion-conditioned attention: } S_r = \text{softmax}_r(S),$$
$$\text{Attention-enhanced feature: } \hat{U}_a = U_a S_r, \qquad (5)$$
where $\text{softmax}_r$ indicates row-wise softmax.
Fig. 4: Illustration of hard example mining (HEM) for salient object boundary detection. During training, for each training image in (a), our method first estimates an edge map (c) using the off-the-shelf HED [75], and then determines hard pixels (d) to facilitate training. For each test image in (e) with ground truth (f), we see that the boundary results with HEM (h) are more accurate than those without HEM (g).
3) Deep-MAT: For complex videos, using one MAT layer
to predict the attention is sub-optimal due to the noise intro-
duced by distractors which are irrelevant to the target regions.
Therefore, we extend MAT into Deep-MAT for multi-step rea-
soning of spatiotemporal attention. Deep-MAT progressively
refines attention via multiple MAT layers and can pinpoint
more accurate target regions. In particular, our deep MAT
consists of LMAT layers cascaded in depth (denoted by
F(1)
MAT,F(2)
MAT,· · · ,F(L)
MAT). Let ˆ
U(l1)
aand ˆ
U(l1)
mbe the input
features for F(l)
MAT. It then produces outputs ˆ
U(l)
aand ˆ
U(l)
m,
which are further fed to F(l+1)
MAT in a recursive manner:
ˆ
U(l)
a,ˆ
U(l)
m=F(l)
MAT(ˆ
U(l1)
a,ˆ
U(l1)
m),(6)
where ˆ
U(l)
ais computed by Eq. (5) and ˆ
U(l)
m=U(l1)
mfollowing
Eq. (2). In addition, we have ˆ
U(0)
a=Vaand ˆ
U(0)
m=Vm.
It should be noted that stacking MAT layers directly leads to
an obvious drop in performance. Inspired by [74], we propose
to stack multiple MAT layers in a residual form as follows:
ˆ
U(l)
a=ˆ
U(l1)
a+U(l1)
aSr,
ˆ
U(l)
m=ˆ
U(l1)
m+U(l1)
m.
(7)
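The residual stacking of Eq. (7) reduces to a simple loop. The sketch below assumes MAT layers that return the two update terms of Eq. (7); it is an illustration, not the released implementation.

```python
# Sketch of Deep-MAT: L cascaded MAT layers stacked in residual form (Eq. 7).
import torch.nn as nn


class DeepMAT(nn.Module):
    def __init__(self, mat_layers: nn.ModuleList):
        super().__init__()
        self.layers = mat_layers                 # L cascaded MAT layers

    def forward(self, v_a, v_m):
        u_a, u_m = v_a, v_m                      # \hat{U}^(0)_a = V_a, \hat{U}^(0)_m = V_m
        for mat in self.layers:
            delta_a, delta_m = mat(u_a, u_m)     # U_a^(l-1) S_r and U_m^(l-1)
            u_a = u_a + delta_a                  # residual update (Eq. 7)
            u_m = u_m + delta_m
        return u_a, u_m
```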
4) Discussion: In Fig. 3, we show the visual effects of the
MAT module. We can observe that with MAT, the feature maps
in $V_a$ are well refined to produce more effective features in $\hat{U}_a$. The new features show desirable properties, with prominent objects highlighted and distractors suppressed, which is beneficial for accurate segmentation.
C. Scale-Sensitive Attention Module
The SSA module $\mathcal{F}_{\text{SSA}}$ is extended from a simplified CBAM $\mathcal{F}_{\text{CBAM}}$ [71] by adding a global attention $\mathcal{F}_g$. Given a feature map $U \in \mathbb{R}^{W\times H\times 2C}$, our SSA module refines it as follows:
$$Z = \mathcal{F}_{\text{SSA}}(U) = \mathcal{F}_g(\mathcal{F}_{\text{CBAM}}(U)) \in \mathbb{R}^{W\times H\times 2C}. \qquad (8)$$
The CBAM module $\mathcal{F}_{\text{CBAM}}$ consists of two sequential sub-modules, channel and spatial attention, which can be formulated as:
$$\text{Channel attention: } s = \mathcal{F}_s(U),\ e = \mathcal{F}_e(s),\ Z_c = e \star U,$$
$$\text{Spatial attention: } p = \mathcal{F}_p(Z_c),\ Z_{\text{CBAM}} = p \odot Z_c, \qquad (9)$$
where $\mathcal{F}_s$ is a squeeze operator that gathers the global spatial information of $U$ into a vector $s \in \mathbb{R}^{2C}$, while $\mathcal{F}_e$ is an excitation operator that captures channel-wise dependencies and outputs an attention vector $e \in \mathbb{R}^{2C}$. Following [70], $\mathcal{F}_s$ is implemented by applying average pooling on each feature channel, and $\mathcal{F}_e$ is formed by four consecutive operations: $\text{fc}(\frac{2C}{16}) \rightarrow \text{ReLU} \rightarrow \text{fc}(2C) \rightarrow \text{sigmoid}$. $Z_c \in \mathbb{R}^{W\times H\times 2C}$ denotes the channel-wise attentive features, and $\star$ indicates channel-wise multiplication. In the spatial attention, $\mathcal{F}_p$ exploits the inter-spatial relationship of $Z_c$ and produces a spatial attention map $p \in \mathbb{R}^{W\times H}$ by $\text{conv}(7\times 7, 1) \rightarrow \text{sigmoid}$. Then, we achieve the attention glimpse $Z_{\text{CBAM}} \in \mathbb{R}^{W\times H\times 2C}$ as the local-level feature.
The global attention $\mathcal{F}_g$ shares a similar spirit to the channel attention layer in Eq. (9), in that it has the same squeeze layer but modifies the excitation layer as $\text{fc}(\frac{2C}{16}) \rightarrow \text{fc}(1) \rightarrow \text{sigmoid}$ to obtain a scale selection factor $g \in \mathbb{R}^1$. It then obtains the scale-sensitive feature $Z$ as follows:
$$Z = (g \cdot Z_{\text{CBAM}}) + U. \qquad (10)$$
Note that we use identity mapping to avoid losing important information in regions with attention values close to 0.
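A compact PyTorch sketch of Eqs. (8)-(10) is shown below. The reduction factor 16 follows the text; squeezing $Z_{\text{CBAM}}$ for the global gate and the remaining layer details are assumptions rather than the exact implementation.

```python
# Sketch of the SSA module: channel + spatial attention (simplified CBAM),
# then a global scale-selection gate g and an identity shortcut (Eq. 10).
import torch
import torch.nn as nn


class SSA(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):   # channels = 2C
        super().__init__()
        self.excite = nn.Sequential(                           # F_e
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.spatial = nn.Sequential(                          # F_p: 7x7 conv -> sigmoid
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())
        self.scale_gate = nn.Sequential(                       # F_g excitation -> scalar g
            nn.Linear(channels, channels // reduction),
            nn.Linear(channels // reduction, 1), nn.Sigmoid())

    def forward(self, u):                                      # u: (B, 2C, H, W)
        s = u.mean(dim=(2, 3))                                 # squeeze F_s (global avg pool)
        z_c = self.excite(s)[:, :, None, None] * u             # channel attention
        z_cbam = self.spatial(z_c) * z_c                       # spatial attention
        g = self.scale_gate(z_cbam.mean(dim=(2, 3)))[:, :, None, None]
        return g * z_cbam + u                                  # Eq. (10): identity shortcut
```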
D. Boundary-Aware Refinement Module
In the decoder network, each BAR $\mathcal{F}_{\text{BAR}_i}$ accepts two inputs: $Z_i$ from the corresponding SSA module and $F_i$ from the previous BAR. To obtain a sharp mask output, the BAR first performs object boundary estimation using an extra boundary detection module $\mathcal{F}_{\text{BDRY}}$, which compels the network to emphasize finer object details. The predicted boundary map is then combined with the two inputs to produce finer features for the next BAR module. This can be formulated as:
$$M_i^b = \mathcal{F}_{\text{BDRY}}(F_i),$$
$$F_{i-1} = \mathcal{F}_{\text{BAR}_i}(Z_i, F_i, M_i^b), \qquad (11)$$
where $\mathcal{F}_{\text{BDRY}}$ consists of a stack of convolutional layers and a sigmoid layer (see Fig. 5), $M_i^b \in \mathbb{R}^{w\times h}$ indicates the boundary map and $F_{i-1}$ is the output feature map of $\text{BAR}_i$. The full computational graph of $\text{BAR}_i$ is shown in Fig. 5.
BAR benefits from two key factors. The first is that we apply Atrous Spatial Pyramid Pooling (ASPP) [76] on convolutional features to transform them into a multi-scale representation. This helps to enlarge the receptive field and obtain more spatial details for decoding. Technically, ASPP consists of multiple parallel dilated convolutional layers with different sampling rates. In this paper, four dilated convolutional layers are adopted, and the dilation rates are set as $\{2k\}_{k=1}^{4}$. In this way, each BAR module first extracts spatiotemporal features at four scales, which are then concatenated together to emphasize multi-scale features. During decoding, these features are further concatenated with the boundary prediction $M_i^b$, and then progressively processed by a residual block ('Res' in Fig. 5), an element-wise summation with $Z_i$, and another residual block to obtain more fine-grained features $F_{i-1}$, as shown in Fig. 5. Here, the residual block is implemented by two stacked $3\times 3$ convolutions with an identity shortcut [73].
Fig. 5: Computational graph of the $\text{BAR}_i$ module. Here, 'Res' is a residual block [73], while 'UP' denotes bilinear upsampling. The circled symbols indicate concatenation and element-wise addition operations, respectively.
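The following is a rough, self-contained sketch of one BAR module along the lines of Eq. (11) and Fig. 5. It assumes $Z_i$ and $F_i$ already share the same channel and spatial dimensions; the dilation rates (2, 4, 6, 8), the two-conv residual block and the $\mathcal{F}_{\text{BDRY}}$ head follow the text, while the remaining layer widths are illustrative.

```python
# Sketch of a BAR module: boundary head, ASPP, concat with boundary map,
# residual block, fusion with Z_i, second residual block, upsampling.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two stacked 3x3 convolutions with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.body(x))


class BAR(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # ASPP: four parallel dilated 3x3 convolutions (rates 2, 4, 6, 8).
        self.aspp = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in (2, 4, 6, 8)])
        self.boundary_head = nn.Sequential(        # F_BDRY: conv stack + sigmoid
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.reduce = nn.Conv2d(4 * channels + 1, channels, 1)   # fuse ASPP + boundary
        self.res1 = ResidualBlock(channels)
        self.res2 = ResidualBlock(channels)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, z_i, f_i):
        m_b = self.boundary_head(f_i)                            # boundary map M^b_i
        multi = torch.cat([branch(f_i) for branch in self.aspp], dim=1)
        feat = self.res1(self.reduce(torch.cat([multi, m_b], dim=1)))
        feat = self.res2(feat + z_i)                             # element-wise sum with Z_i
        return self.up(feat), m_b                                # F_{i-1} and boundary output
```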
The second benefit is that we introduce a heuristic method for automatically mining hard negative pixels to support the training of $\mathcal{F}_{\text{BDRY}}$. Specifically, for each training frame, we use the popular off-the-shelf HED model [75] to predict a boundary map $E \in [0,1]^{w\times h}$, wherein each value $E_k$ represents the probability of pixel $k$ being an edge pixel. Then, pixel $k$ is regarded as a hard negative pixel if it has a high edge probability (e.g., $E_k > 0.2$) and falls outside the dilated ground-truth region. If pixel $k$ is a hard pixel, its weight is set to $w_k = 1 + E_k$; otherwise, $w_k = 1$.
Then, $w_k$ is used to weight the following adaptive boundary loss so that misclassified hard pixels are penalized heavily:
$$\mathcal{L}_{\text{BDRY}}(M^b, G^b) = -\sum_k w_k \big((1 - G_k^b)\log(1 - M_k^b) + G_k^b\log(M_k^b)\big), \qquad (12)$$
where $M^b$ and $G^b$ are the boundary prediction and ground-truth, respectively.
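Eq. (12) can be sketched directly in PyTorch. The HED edge map and the dilated ground-truth mask are assumed to be precomputed per frame, and the 0.2 threshold follows the text; the summation over pixels mirrors the equation as reconstructed above.

```python
# Sketch of the HEM-weighted boundary loss (Eq. 12).
import torch


def boundary_loss(pred_b, gt_b, hed_edge, dilated_gt, eps=1e-6):
    """All inputs: tensors of shape (B, H, W) with values in [0, 1]."""
    # Hard negatives: strong HED edges that fall outside the dilated ground truth.
    hard = (hed_edge > 0.2) & (dilated_gt < 0.5)
    w = torch.where(hard, 1.0 + hed_edge, torch.ones_like(hed_edge))   # w_k
    ce = gt_b * torch.log(pred_b + eps) + (1.0 - gt_b) * torch.log(1.0 - pred_b + eps)
    return -(w * ce).sum()
```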
Fig. 4 offers an illustration of the above hard example
mining (HEM) scheme. Clearly, by explicitly discovering
hard negative pixels, the network can produce more accurate
boundary predictions with background pixels well suppressed
(see Fig. 4 (g) and (h)).
E. Detailed Network Architecture
Our whole model is end-to-end trainable, because all the
components in MATNet are parameterized by neural networks.
At each stream of the encoder, we use the first five convolu-
tion blocks of ResNet-101 [73] as our backbone for feature
extraction. The spatiotemporal features in the last convolution
stage are fed into a global convolutional layer (GC in Fig. 1)
to enlarge the valid receptive field [77], which is implemented by combining $1\times 7 \rightarrow 7\times 1$ and $7\times 1 \rightarrow 1\times 7$ convolutional layers, followed by a residual block.
Fig. 6: Qualitative results on four sequences. From top to bottom: dance-twirl from DAVIS16, dogs02 from FBMS, cat-0001 from YouTube-Objects and dogs-jump from DAVIS17.
1) Training Phase: Given an input frame $I_a \in \mathbb{R}^{473\times 473\times 3}$, we first compute its optical flow field $I_m \in \mathbb{R}^{473\times 473\times 3}$ using PWC-Net [78] due to its high efficiency and accuracy. Then, our MATNet predicts a segmentation mask $M^s \in [0,1]^{473\times 473}$ and four boundary masks $\{M_i^b \in [0,1]^{473\times 473}\}_{i=1}^{4}$ through the decoder network. Let $G^s \in \{0,1\}^{473\times 473}$ be the binary segmentation ground-truth, and $G^b \in \{0,1\}^{473\times 473}$ be the boundary ground-truth, which can be easily computed from $G^s$. The overall loss function is formulated as:
$$\mathcal{L}_{\text{ZVOS}} = \mathcal{L}_{\text{CE}}(M^s, G^s) + \frac{1}{N}\sum_{i=1}^{N=4}\mathcal{L}_{\text{BDRY}}(M_i^b, G^b), \qquad (13)$$
where $\mathcal{L}_{\text{CE}}$ indicates the classic cross-entropy loss, and $\mathcal{L}_{\text{BDRY}}$ is defined in Eq. (12).
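Putting the pieces together, Eq. (13) reduces to a weighted sum of the mask and boundary terms. The sketch below reuses the hypothetical boundary_loss function sketched earlier and is illustrative only.

```python
# Sketch of the overall ZVOS objective (Eq. 13).
import torch.nn.functional as F


def zvos_loss(mask_pred, mask_gt, boundary_preds, boundary_gt, hed_edge, dilated_gt):
    """boundary_preds: list of N=4 boundary maps produced by the decoder."""
    l_ce = F.binary_cross_entropy(mask_pred, mask_gt)            # L_CE(M^s, G^s)
    l_bdry = sum(boundary_loss(p, boundary_gt, hed_edge, dilated_gt)
                 for p in boundary_preds) / len(boundary_preds)  # averaged L_BDRY
    return l_ce + l_bdry
```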
2) Testing Phase: Once the network is trained, we apply it
to unseen videos. Given a test video, we resize all the frames
to 473×473, and feed each frame, along with its optical flow, to
the network for segmentation. We follow the common protocol
used in previous works [27, 30, 43] and employ CRF to obtain
the final binary segmentation results.
3) Runtime: Our model is implemented in PyTorch and
trained on a single Nvidia RTX 2080Ti GPU and an Intel(R)
Xeon Gold 5120 CPU. Testing is conducted on the same
machine. For each test frame of size 473×473, the forward
inference of our MATNet takes about 0.05s, while optical flow
estimation and CRF-based post-processing take about 0.2s and
0.5s, respectively.
IV. EXTENSION OF MATNET
In this section, we describe two extensions of our MATNet:
zero-shot video instance segmentation and dynamic visual
attention prediction. The former focuses on multi-object un-
supervised video segmentation [79], targeting more fine-
grained results in multi-object scenarios. The latter aims at
predicting where people look over dynamic scenes.
A. Zero-Shot Video Instance Segmentation
To adapt our MATNet into an instance-level segmentation
setting, we modify our model into a saliency-driven instance
selection method. More specifically, for a test video $\mathcal{V} = \{I_t\}_{t=1}^{T}$ with $T$ frames, our approach takes three stages to generate segmentation tracks. 1) Object proposal generation. For each frame $I_t$, we generate a collection of category-agnostic segment proposals $\mathcal{P}_t = \{P_t^i\}_i$ using a COCO-trained Mask R-CNN [83] for detecting generic objects. Our MATNet is also applied to generate an object-level segmentation mask $M_t^s$. Then, we compute a score $S_t^i$ for each proposal:
$$S_t^i = S_{\text{MATNet}}^i \cdot S_{\text{MRCNN}}^i, \qquad S_{\text{MATNet}}^i = \frac{\|P_t^i \cap M_t^s\|}{\|P_t^i\|}, \qquad (14)$$
where $S_{\text{MRCNN}}^i$ denotes the detection score of $P_t^i$ from Mask R-CNN, while $S_{\text{MATNet}}^i$ measures its saliency score. The proposals with small scores ($S_t^i < 0.03$) are discarded. 2) Short-Term Tracklet Generation. Given the remaining proposals, we further connect them temporally in a greedy manner. Firstly, each proposal $P_t^i$ is warped to the next frame using optical flow, and we search for its matched proposal in $\mathcal{P}_{t+1}$ by evaluating the IoU scores. If the maximum IoU score is above 0.1, the corresponding proposal is regarded as being matched with $P_t^i$. 3) Tracklet Merging by Re-Identification (ReID). We further merge short-term tracklets into a set of consistent segmentation tracks using object re-identification. The ReID embedding vector for each proposal is computed using a pretrained ReID network [84]. For each tracklet, its embedding is computed as the average embedding of all proposals belonging to it. We use the $L_2$ distance to measure the similarity between two tracklets and adopt the merging strategy in [85] to obtain the final segmentation tracks.
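The proposal filtering of Eq. (14) can be sketched with plain NumPy. The product of the two scores and the overlap-ratio form follow the reconstruction above and should be read as an interpretation; the 0.03 threshold follows the text.

```python
# Sketch of proposal scoring and filtering (Eq. 14).
import numpy as np


def score_proposals(proposals, det_scores, matnet_mask, thresh=0.03):
    """proposals: list of binary masks (H, W); det_scores: Mask R-CNN scores."""
    kept = []
    for mask, s_mrcnn in zip(proposals, det_scores):
        overlap = float(np.logical_and(mask, matnet_mask > 0.5).sum())
        s_matnet = overlap / max(float(mask.sum()), 1.0)   # ||P ∩ M^s|| / ||P||
        score = s_matnet * s_mrcnn                          # combined score S_t^i
        if score >= thresh:
            kept.append((mask, score))
    return kept
```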
B. Dynamic Visual Attention Prediction
Our MATNet can be flexibly adapted to the DVAP task with modifications in two aspects, as described below.
TABLE I: Quantitative comparison of ZVOS methods on DAVIS16 val. The best result for each metric is boldfaced (this also applies to the other tables). All the results are borrowed from the public leaderboard maintained by the DAVIS16 challenge (https://davischallenge.org/davis2016/soa_compare.html). See V-A for details.

Measure | SFL [29] | FSEG [28] | LVO [48] | ARP [80] | PDB [53] | LSMO [44] | MOT [81] | EPO [49] | AGS [43] | COSNet [41] | AGNN [17] | AnDiff [82] | MATNet
J Mean ↑ | 67.4 | 70.7 | 75.9 | 76.2 | 77.2 | 78.2 | 77.2 | 80.6 | 79.7 | 80.5 | 80.7 | 81.7 | 82.4
J Recall ↑ | 81.4 | 83.5 | 89.1 | 91.1 | 90.1 | 89.1 | 87.8 | 95.2 | 91.1 | 93.1 | 94.0 | 90.9 | 94.5
J Decay ↓ | 6.2 | 1.5 | 0.0 | 7.0 | 0.9 | 4.1 | 5.0 | 2.2 | 1.9 | 4.4 | 0.0 | 2.2 | 5.5
F Mean ↑ | 66.7 | 65.3 | 72.1 | 70.6 | 74.5 | 75.9 | 77.4 | 75.5 | 77.4 | 79.5 | 79.1 | 80.5 | 80.7
F Recall ↑ | 77.1 | 73.8 | 83.4 | 83.5 | 84.4 | 84.7 | 84.4 | 87.9 | 85.8 | 89.5 | 90.5 | 85.1 | 90.2
F Decay ↓ | 5.1 | 1.8 | 1.3 | 7.9 | -0.2 | 3.5 | 3.3 | 2.4 | 1.6 | 5.0 | 0.0 | 0.6 | 4.5
T Mean ↓ | 28.2 | 32.8 | 26.5 | 39.3 | 29.1 | 21.2 | 27.9 | 19.3 | 26.7 | 18.4 | 33.7 | 21.4 | 21.6
TABLE II: Quantitative results for each category on YouTube-Objects over Mean J. See V-A for details.

Category | LVO [48] | SFL [29] | FSEG [28] | PDB [53] | AGS [43] | COSNet [41] | AGNN [17] | MATNet
Airplane (6) | 86.2 | 65.6 | 81.7 | 78.0 | 87.7 | 81.1 | 81.1 | 72.9
Bird (6) | 81.0 | 65.4 | 63.8 | 80.0 | 76.7 | 75.7 | 75.9 | 77.5
Boat (15) | 68.5 | 59.9 | 72.3 | 58.9 | 72.2 | 71.3 | 70.7 | 66.9
Car (7) | 69.3 | 64.0 | 74.9 | 76.5 | 78.6 | 77.6 | 78.1 | 79.0
Cat (16) | 58.8 | 58.9 | 68.4 | 63.0 | 69.2 | 66.5 | 67.9 | 73.7
Cow (20) | 68.5 | 51.2 | 68.0 | 64.1 | 64.6 | 69.8 | 69.7 | 67.4
Dog (27) | 61.7 | 54.1 | 69.4 | 70.1 | 73.3 | 76.8 | 77.4 | 75.9
Horse (14) | 53.9 | 64.8 | 60.4 | 67.6 | 64.4 | 67.4 | 67.3 | 63.2
Motorbike (10) | 60.8 | 52.6 | 62.7 | 58.4 | 62.1 | 67.7 | 68.3 | 62.6
Train (5) | 66.3 | 34.0 | 62.2 | 35.3 | 48.2 | 46.8 | 47.8 | 51.0
Mean J ↑ | 67.5 | 57.1 | 68.4 | 65.5 | 69.7 | 70.5 | 70.8 | 69.0
TABLE III: Quantitative results on FBMS over Mean J (V-A).

Measure | MSTP [42] | FSEG [28] | IET [45] | OBN [31] | PDB [53] | COSNet [41] | MATNet
Mean J ↑ | 60.8 | 68.4 | 71.9 | 73.9 | 74.0 | 75.6 | 76.1
1) Network structure: Since boundary ground-truths are not available in this task, we discard the object boundary constraints so that Eq. (11) becomes $F_{i-1} = \mathcal{F}_{\text{BAR}_i}(Z_i, F_i)$. In this way, for $\text{BAR}_i$, more fine-grained features $F_{i-1}$ are produced by relying only on the features $F_i$ from $\text{BAR}_{i+1}$ as well as the corresponding convolutional feature $Z_i$. Besides, we also remove the unnecessary concatenation operator in Fig. 5. All other modules are kept unchanged. 2) Loss function: We consider the Kullback-Leibler (KL) divergence loss $\mathcal{L}_{\text{KL}}$ as our main learning objective. It is more task-oriented and has been proven effective in [86]. The overall loss function is:
$$\mathcal{L}_{\text{DVAP}} = \mathcal{L}_{\text{KL}}(M^v, G^v) + \lambda\,\mathcal{L}_{\text{CE}}(M^v, G^v), \qquad (15)$$
where $M^v$ and $G^v$ are the attention prediction and ground-truth, respectively, and $\mathcal{L}_{\text{KL}}(M^v, G^v) = \sum_i G_i^v \log\big(\frac{G_i^v}{M_i^v}\big)$. $\lambda = 0.1$ is a weight to balance the contributions of the two losses.
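Eq. (15) can be sketched as follows. Normalizing both maps to probability distributions before the KL term, and the clamping used for numerical stability, are assumptions about preprocessing rather than details taken from the released code.

```python
# Sketch of the DVAP objective (Eq. 15): KL divergence plus weighted cross-entropy.
import torch
import torch.nn.functional as F


def dvap_loss(pred, gt, lam=0.1, eps=1e-8):
    """pred, gt: attention maps of shape (B, H, W) with values in [0, 1]."""
    p = pred / (pred.sum(dim=(1, 2), keepdim=True) + eps)   # normalize prediction
    g = gt / (gt.sum(dim=(1, 2), keepdim=True) + eps)       # normalize ground truth
    l_kl = (g * torch.log(g / (p + eps) + eps)).sum(dim=(1, 2)).mean()
    l_ce = F.binary_cross_entropy(pred.clamp(eps, 1 - eps), gt)
    return l_kl + lam * l_ce
```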
V. EXPERIMENTS
In this section, we first compare MATNet with state-of-
the-art models on our main task of interest, i.e., ZVOS,
on both object-level (V-A) and instance-level (V-B) settings.
Then, we investigate the performance of our model on the
DVAP task (V-C). For each task, we separately introduce the
corresponding standalone datasets and experimental results.
Finally, to gain a deeper insight into our model, we conduct
detailed ablation studies in V-D.
Fig. 7: Attribute-based comparison on DAVIS16 val. We compare MATNet with three top-performing methods, i.e., AnDiff [82], COSNet [41] and AGS [43]. For each method, Mean J is computed over all sequences with the specified attributes.
Fig. 8: Attribute-based ablation study on DAVIS16 val. We compare the Mean J of different network variants under various attributes.
A. Main Task: Zero-Shot Video Object Segmentation
1) Datasets: We carry out comprehensive experiments on
three popular datasets:
DAVIS16 [32] is one of the most popular video object seg-
mentation datasets, which consists of 50 high-quality videos
in total (30 for train and 20 for val). Each frame contains
pixel-wise annotations for foreground objects. For quantitative
evaluation, we use three standard metrics suggested by [32],
namely region similarity J, boundary accuracy F, and time
stability T.
YouTube-Objects [34] is a large dataset of 126 web videos
with 10 semantic object categories and more than 20,000
frames. Following its protocol, we use the region similarity
J metric to measure the performance on the whole dataset
without further training.
FBMS [11] consists of 59 video sequences with ground-truth
annotations provided in a subset of the frames. Following the
standard protocol [48], we do not use any sequence for training
and only evaluate on the val set consisting of 30 sequences.
2) Implementation Details: The training data consist of two
parts: i) all training data from DAVIS16 [32], including 30
videos with about 2K frames; ii) a subset of 12K frames
selected from the training set of YouTube-VOS [87], which is
obtained by sampling images every ten frames in each video.
In total, we use 14K training samples, basically matching
the current top-performing methods, i.e., AGNN [17], COSNet [41] and AGS [43].
TABLE IV: Quantitative comparison of ZVOS methods on DAVIS17 val. All the results are borrowed from the public leaderboard of the DAVIS17 challenge (https://davischallenge.org/davis2017/soa_compare.html). See V-B for details.

Measure | RVOS [16] | PDB [53] | AGS [43] | MATNet
J&F Mean ↑ | 41.2 | 55.1 | 57.5 | 58.6
J Mean ↑ | 36.8 | 53.2 | 55.5 | 56.7
J Recall ↑ | 40.2 | 58.9 | 61.6 | 65.2
J Decay ↓ | 0.5 | 4.9 | 7.0 | -3.6
F Mean ↑ | 45.7 | 57.0 | 59.5 | 60.4
F Recall ↑ | 46.4 | 60.2 | 62.8 | 68.2
F Decay ↓ | 1.7 | 6.8 | 9.0 | 1.8
The entire network is trained using the
SGD optimizer with a learning rate of 1e-4 for the encoder and
the bridge network, and 1e-3 for the decoder. During training,
the batch size, momentum and weight decay are set to 2, 0.9, and 1e-5, respectively. The data are augmented online with horizontal flipping and rotations covering a range of (-10, 10) degrees.
3) Performance on DAVIS16 val: We compare our MAT-
Net with the top performing ZVOS methods in the public
leaderboard of DAVIS16. The detailed results are reported
in Table I. We can observe that our MATNet achieves the
best performance compared to other methods. Specifically,
it outperforms the second-best method (i.e., AnDiff [82]) by
+0.7% and +0.2% in terms of Mean J and Mean F, and +3.6% and +5.1% in terms of Recall J and Recall F.
In Table I, some of the deep learning-based models, e.g.,
FSEG [28], LVO [48], MOT [81], use motion cues to improve
segmentation. Our MATNet outperforms all of these methods
by a large margin. The reason lies in that these methods
learn motion and appearance features independently, without
considering the close interactions between them. In contrast,
our MATNet can learn more effective multi-modal object
representations with the interleaved encoder.
Fig. 7 shows the results of attribute-based study on
DAVIS16 [32] using 15 video attributes provided by the
dataset. Three top-performing ZVOS methods, i.e., An-
Diff [82], COSNet [41] and AGS [43], are selected for
comparison. Our model significantly outperforms them in
terms of many attributes (e.g., low resolution, fast motion, dynamic background, motion blur, heterogeneous object, and
appearance change). This demonstrates the robustness of our
model against various challenges present in videos.
4) Performance on YouTube-Objects: Table II reports the
detailed results on YouTube-Objects. Our model shows
promising performance in most categories. It lags behind some
methods in the Airplane and Boat categories. This is mainly
because sequences in these categories contain slowly-moving
objects, which are often visually similar to their surroundings.
These factors may result in inaccurate estimation of optical
flow, thereby hurting the performance.
5) Performance on FBMS: For completeness, we also eval-
uate our method on FBMS. As shown in Table III, MATNet
produces the best results with 76.1% in Mean J, which
outperforms the second-best result, i.e., PDB, by 2.1%.
6) Qualitative results: Fig. 6 depicts sample results for
representative sequences from these three datasets.
TABLE V: Quantitative DVAP results on the val sets of Hollywood-2 and UCF-Sports. See V-C for details.

Hollywood-2:
Methods | AUC-J ↑ | SIM ↑ | s-AUC ↑ | CC ↑ | NSS ↑
ACLNet [68] | 0.913 | 0.542 | 0.757 | 0.623 | 3.086
SalEMA [88] | 0.919 | 0.487 | 0.708 | 0.613 | 3.186
TASED [89] | 0.918 | 0.507 | 0.768 | 0.646 | 3.302
STRA [58] | 0.923 | 0.536 | 0.774 | 0.662 | 3.478
Ours | 0.915 | 0.539 | 0.797 | 0.674 | 3.486

UCF-Sports:
Methods | AUC-J ↑ | SIM ↑ | s-AUC ↑ | CC ↑ | NSS ↑
ACLNet [68] | 0.897 | 0.406 | 0.744 | 0.510 | 2.567
SalEMA [88] | 0.906 | 0.431 | 0.740 | 0.544 | 2.638
TASED [89] | 0.899 | 0.469 | 0.752 | 0.582 | 2.920
STRA [58] | 0.910 | 0.479 | 0.751 | 0.593 | 3.018
Ours | 0.901 | 0.503 | 0.783 | 0.625 | 3.291
The dance-twirl sequence from DAVIS16 contains many challenging fac-
tors, such as object deformation, motion blur and background
clutter. As seen, our method is robust to these challenges and
delineates the target with accurate contours. The effectiveness
is further proved in cat-0001 from YouTube-Objects, in which
the cat is visually similar to its surroundings and undergoes
large deformation. In addition, our model also works well in
dogs02, in which the target suffers large scale variations.
B. Additional Task: Zero-Shot Video Instance Segmentation
1) Datasets: DAVIS17 [33] extends DAVIS16 with another
70 sequences, leading to 120 videos in total. These videos are
split into 60 for train, 30 for val and 30 for test-dev.
Different from DAVIS16 , this dataset provides instance-level
annotations. Therefore, we use it to evaluate the performance
of our model in instance-level video object segmentation.
Following the standard evaluation setting, we measure the
performance in terms of region similarity J, contour accuracy
F, and their combination J&F.
2) Quantitative and Qualitative Results: Table IV reports
the performance of MATNet against three top-performing
models (i.e., RVOS [16], PDB [53] and AGS [43]). The results
clearly demonstrate that our model outperforms all of them by
a large margin. For instance, in terms of J&F, mean Jand
mean F, our model surpasses the second-best method (i.e.,
AGS), by 1.1%,1.2% and 0.9%, respectively.
Besides, some qualitative results on DAVIS17 are shown in
Fig. 6 (the last row), validating that our model yields high-
quality ZVOS results in the instance-level setting.
C. Additional Task: Dynamic Visual Attention Prediction
1) Datasets: Hollywood-2 [35] consists of 1,707 video
sequences (823 for train and 884 for test) collected from
69 Hollywood movies, covering 12 action categories (e.g.,
eating, kissing and running). The dataset focuses on more task-
driven scenes, e.g., movie scenes and human actions.
UCF-Sports [35] includes 150 videos covering 9 common
sports action categories, such as walking, diving and golfing.
Similar to Hollywood-2, the annotations in this dataset mainly
focus on action behaviors. The dataset is split into 103 videos
for train and 47 for test.
TABLE VI: Ablation study of MATNet on DAVIS16 val, measured by Mean J and Mean F. See V-D for details.

Network Variant | Mean J ↑ | ΔJ | Mean F ↑ | ΔF
MATNet w/o MAT | 79.5 | -2.9 | 77.3 | -3.4
MATNet w/o SSA | 80.7 | -1.7 | 79.7 | -1.0
MATNet w/o HEM | 81.4 | -1.0 | 78.4 | -2.3
MATNet w/ Res50 | 81.1 | -1.3 | 79.3 | -1.4
MATNet w/ Res101 | 82.4 | - | 80.7 | -
Fig. 9: Qualitative results of the ablation study. Panels: (a) Image, (b) Ground truth, (c) w/o MAT, (d) w/o SSA, (e) w/o HEM, (f) MATNet.
2) Implementation Details: For each dataset, we use the
train set to train our model. The network is trained with
the same setting as in V-A, except that the training images
are resized to 360 ×360 for fair comparison with previous
works [58, 68, 89]. λin Eq. 15 is empirically set to 0.1.
3) Metrics: Following previous work [68], we report the
performance of our model using five metrics, namely Nor-
malized Scanpath Saliency (NSS), Similarity (SIM), Linear
Correlation Coefficient (CC), Area Under the Curve by Judd
(AUC-J) and shuffled AUC (s-AUC). NSS and CC measure the
correlation between the prediction and ground-truth saliency
map. SIM computes the similarity between two histograms,
while AUC-J and s-AUC are variants of the well-known
AUC metric. For each metric, higher scores indicate better
performance.
4) Quantitative Results: We compare our model with four
DVAP models, i.e., ACLNet [68], SalEMA [88], STRA [58]
and TASED [89]. The results of these methods are directly
obtained from the authors. As shown in Table V, MATNet
generally outperforms all the competitors across most of the
metrics, in both the Hollywood-2 and UCF-Sports datasets.
This verifies the strong generality of our model.
D. Ablation Study
Table VI summarizes the ablation analysis of MATNet on
DAVIS16 val.
1) Efficacy of MAT: We first study the effects of the MAT
module by comparing our full model to one following the same
architecture without MAT, denoted as MATNet w/o MAT.
The encoder in this network is thus equivalent to a standard
two-stream model, where convolution features from the two
streams are concatenated at each residual stage for object
representation. As shown in Table VI, this model encounters a
huge performance degradation (-2.9% in Mean J and -3.4% in Mean F), which verifies the effectiveness of MAT.
Moreover, we also evaluate the performance of MATNet
with a different number of MAT modules in each deep residual
MAT layer. The results in Table VII show that the performance gradually improves as $L$ increases, reaching saturation at $L = 5$. Based on this analysis, we choose $L = 5$ as the default number of MAT modules in MATNet.
TABLE VII: Performance comparisons with different numbers of MAT blocks cascaded in each MAT layer on DAVIS16 val. See V-D for details.

Metric | L = 0 | L = 1 | L = 3 | L = 5 | L = 7
Mean J ↑ | 79.5 | 80.6 | 81.6 | 82.4 | 82.2
Mean F ↑ | 77.3 | 80.3 | 80.7 | 80.7 | 80.6
TABLE VIII: Impacts of different optical flow methods on DAVIS16 val. See V-D for details.

Flow Method | Mean J ↑ | Mean F ↑ | Mean T ↓
LiteFlowNet [90] | 80.9 | 79.3 | 23.2
SpyNet [91] | 78.4 | 76.8 | 26.6
PWC-Net [78] | 82.4 | 80.7 | 21.6
2) Efficacy of SSA: To measure the effectiveness of the
SSA module, we design another network, MATNet w/o SSA,
by replacing the SSA block with a simple skip layer. As
can be observed, its performance is 1.7% lower than our full model in terms of Mean J, and 1.0% lower in Mean F. The performance drop is mainly caused by the redundant
spatiotemporal features from the encoder. Our SSA module
aims to eliminate the redundancy by only highlighting the
features that are beneficial to segmentation.
3) Efficacy of HEM: We also study the influence of using
HEM during training. HEM is expected to facilitate the learn-
ing of more accurate object boundaries, which should further
boost the segmentation procedure. The results in Table VI
(see MATNet w/o HEM) indicate the importance of HEM.
By directly controlling the loss function in Eq. (12), HEM
helps to improve the contour accuracy by 2.3%.
4) Impact of Backbone: To verify that the high performance
of our network is not mainly due to the powerful backbone, we
replace ResNet-101 with ResNet-50 to build another network,
i.e., MATNet w/ Res50. We see that the performance degrades
slightly, but the model still outperforms previous methods (e.g.,
AGNN [17], COSNet [41], AGS [43]). This further confirms
the effectiveness of the proposed modules.
5) Impact of Optical Flow: Table VIII reports the results
of MATNet on DAVIS16 val with three open-sourced op-
tical flow computation methods, i.e., PWC-Net [78], Lite-
FlowNet [90] and SpyNet [91]. They rank #23, #34 and
#143 in the public MPI Sintel Flow Benchmark (http://sintel.is.tue.mpg.de/results), respectively. Generally, better optical
flow models lead to more accurate segmentation results, but
the performance does not change much, demonstrating the
robustness of our model against optical flow inputs.
6) Attribute Analysis: Fig. 8 illustrates the performance
comparison of different variants in the ablation study under
various video attributes. The performance is consistent with
that reported in Table VI. All three modules (i.e., MAT, SSA
and HEM) are critical for our model to improve performance.
7) Qualitative Comparison: Fig. 9 shows visual results of
the above ablation studies on two sequences. We see that all
of the network variants produce worse results compared with
our full model. It is worth noting that the MAT block has the
greatest visual influence on the performance.
VI. CONCLUSION
In this paper, we proposed a novel MATNet for ZVOS.
We introduced a new way to learn rich spatiotemporal object
representations with an interleaved encoder, which encour-
ages knowledge propagation from motion to appearance in a
hierarchical manner. The spatiotemporal features are further
processed by a bridge network to produce more compact
representations, which are subsequently fed into a boundary-
aware decoder to obtain accurate segmentation in a top-down
fashion. We compared the proposed model with other state-of-
the-art ZVOS methods over four large-scale benchmarks and
the experimental results demonstrated that it achieves favor-
able performance against other contenders. Benefiting from
the powerful interleaved encoder for representation learning
in videos, our model also showed compelling performance
in the DVAP task. In the future, we will further extend it
to other video analysis tasks, such as action recognition and
video classification.
REFERENCES
[1] N. Xu, B. Price, S. Cohen, J. Yang, and T. S. Huang, “Deep
interactive object selection,” in Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, 2016, pp.
373–381.
[2] H. Hadizadeh and I. V. Bajić, “Saliency-aware video compres-
sion,” IEEE Transactions on Image Processing, vol. 23, no. 1,
pp. 19–33, 2013.
[3] T. Zhou, W. Wang, S. Qi, H. Ling, and J. Shen, “Cascaded
human-object interaction recognition,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition,
2020, pp. 4263–4272.
[4] Z. Zhang, S. Fidler, and R. Urtasun, “Instance-level segmenta-
tion for autonomous driving with deep densely connected MRFs,”
in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2016, pp. 669–677.
[5] A. Papazoglou and V. Ferrari, “Fast object segmentation in
unconstrained video,” in Proceedings of the IEEE International
Conference on Computer Vision, 2013, pp. 1777–1784.
[6] W. Wang, J. Shen, and F. Porikli, “Saliency-aware geodesic
video object segmentation,” in Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, 2015, pp.
3395–3402.
[7] A. Faktor and M. Irani, “Video segmentation by non-local
consensus voting,” in Proceedings of the British Machine Vision
Conference, 2014, pp. 8–20.
[8] W. Wang, J. Shen, R. Yang, and F. Porikli, “Saliency-aware
video object segmentation,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, vol. 40, no. 1, pp. 20–33, 2017.
[9] T. Brox and J. Malik, “Object segmentation by long term
analysis of point trajectories,” in European Conference on
Computer Vision, 2010, pp. 282–295.
[10] P. Ochs and T. Brox, “Object segmentation in video: a hierarchi-
cal variational approach for turning point trajectories into dense
regions,” in Proceedings of the IEEE International Conference
on Computer Vision, 2011, pp. 1583–1590.
[11] P. Ochs, J. Malik, and T. Brox, “Segmentation of moving objects
by long term video analysis,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 36, no. 6, pp. 1187–
1200, 2013.
[12] M. Keuper, B. Andres, and T. Brox, “Motion trajectory seg-
mentation via minimum cost multicuts,” in Proceedings of the
IEEE International Conference on Computer Vision, 2015, pp.
3271–3279.
[13] K. Fragkiadaki, G. Zhang, and J. Shi, “Video segmentation by
tracing discontinuities in a trajectory embedding,” in Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2012, pp. 1846–1853.
[14] D. Zhang, O. Javed, and M. Shah, “Video object segmentation
through spatially accurate and temporally dense extraction of
primary object regions,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2013, pp. 628–
635.
[15] Y. J. Lee, J. Kim, and K. Grauman, “Key-segments for video
object segmentation,” in Proceedings of the IEEE International
Conference on Computer Vision, 2011, pp. 1995–2002.
[16] C. Ventura, M. Bellver, A. Girbau, A. Salvador, F. Marques, and
X. Giro-i Nieto, “RVOS: End-to-end recurrent network for video
object segmentation,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2019, pp. 5277–
5286.
[17] W. Wang, X. Lu, J. Shen, D. J. Crandall, and L. Shao,
“Zero-shot video object segmentation via attentive graph neural
networks,” in Proceedings of the IEEE International Conference
on Computer Vision, 2019, pp. 9236–9245.
[18] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell,
“Zero-shot learning with semantic output codes,” in Advances in
Neural Information Processing Systems, 2009, pp. 1410–1418.
[19] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cre-
mers, and L. Van Gool, “One-shot video object segmentation,”
in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2017, pp. 221–230.
[20] L. L. Cloutman, “Interaction between dorsal and ventral pro-
cessing streams: Where, when and how?” Brain and Language,
vol. 127, no. 2, pp. 251 – 263, 2013.
[21] P. Mital, T. J. Smith, S. Luke, and J. Henderson, “Do low-level
visual features have a causal influence on gaze during dynamic
scene viewing?” Journal of Vision, vol. 13, no. 9, pp. 144–144,
2013.
[22] E. S. Spelke, “Principles of object perception,” Cognitive sci-
ence, vol. 14, no. 1, pp. 29–56, 1990.
[23] Y. Ostrovsky, E. Meyers, S. Ganesh, U. Mathur, and P. Sinha,
“Visual parsing after recovery from blindness,” Psychological
Science, vol. 20, no. 12, pp. 1484–1491, 2009.
[24] S. E. Palmer, Vision science: Photons to phenomenology. MIT
press, 1999.
[25] M. Wertheimer, “Laws of organization in perceptual forms.”
1938.
[26] P. Tokmakov, K. Alahari, and C. Schmid, “Learning motion
patterns in videos,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2017, pp. 3386–
3394.
[27] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and
A. Sorkine-Hornung, “Learning video object segmentation from
static images,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2017, pp. 2663–
2672.
[28] S. D. Jain, B. Xiong, and K. Grauman, “Fusionseg: Learning to
combine motion and appearance for fully automatic segmenta-
tion of generic objects in videos,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2017,
pp. 2117–2126.
[29] J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang, “Segflow: Joint
learning for video object segmentation and optical flow,” in
Proceedings of the IEEE International Conference on Computer
Vision, 2017, pp. 686–695.
[30] H. Xiao, J. Feng, G. Lin, Y. Liu, and M. Zhang, “Monet:
Deep motion exploitation for video object segmentation,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 1140–1148.
[31] S. Li, B. Seybold, A. Vorobyov, X. Lei, and C.-C. Jay Kuo,
“Unsupervised video object segmentation with motion-based bi-
lateral networks,” in European Conference on Computer Vision,
2018, pp. 207–223.
[32] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool,
M. Gross, and A. Sorkine-Hornung, “A benchmark dataset
and evaluation methodology for video object segmentation,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2016, pp. 724–732.
[33] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-
Hornung, and L. Van Gool, “The 2017 DAVIS challenge on video
object segmentation,” arXiv preprint arXiv:1704.00675, 2017.
[34] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari,
“Learning object class detectors from weakly annotated video,”
in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2012, pp. 3282–3289.
[35] S. Mathe and C. Sminchisescu, “Actions in the eye: Dynamic
gaze datasets and learnt saliency models for visual recognition,”
IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, vol. 37, no. 7, pp. 1408–1424, 2014.
[36] T. Zhou, S. Wang, Y. Zhou, Y. Yao, J. Li, and L. Shao, “Motion-
attentive transition for zero-shot video object segmentation,”
in Proceedings of AAAI Conference on Artificial Intelligence,
2020.
[37] F. Perazzi, O. Wang, M. Gross, and A. Sorkine-Hornung,
“Fully connected object proposals for video segmentation,” in
Proceedings of the IEEE International Conference on Computer
Vision, 2015, pp. 3227–3234.
[38] I. Endres and D. Hoiem, “Category independent object propos-
als,” in European Conference on Computer Vision, 2010, pp.
575–588.
[39] R. Yao, G. Lin, S. Xia, J. Zhao, and Y. Zhou, “Video object
segmentation and tracking: A survey,” ACM Transactions on
Intelligent Systems and Technology, vol. 11, no. 4, pp. 1–47,
2020.
[40] W. Wang, J. Shen, and L. Shao, “Video salient object detection
via fully convolutional networks,” IEEE Transactions on Image
Processing, vol. 27, no. 1, pp. 38–49, 2018.
[41] X. Lu, W. Wang, C. Ma, J. Shen, L. Shao, and F. Porikli, “See
more, know more: Unsupervised video object segmentation with
co-attention siamese networks,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2019,
pp. 3623–3632.
[42] Y.-T. Hu, J.-B. Huang, and A. G. Schwing, “Unsupervised
video object segmentation using motion saliency-guided spatio-
temporal propagation,” in European Conference on Computer
Vision, 2018, pp. 786–802.
[43] W. Wang, H. Song, S. Zhao, J. Shen, S. Zhao, S. C. H. Hoi,
and H. Ling, “Learning unsupervised video object segmentation
through visual attention,” in Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2019, pp.
3064–3074.
[44] P. Tokmakov, C. Schmid, and K. Alahari, “Learning to segment
moving objects,” International Journal of Computer Vision, vol.
127, no. 3, pp. 282–301, 2019.
[45] S. Li, B. Seybold, A. Vorobyov, A. Fathi, Q. Huang, and C.-C.
Jay Kuo, “Instance embedding transfer to unsupervised video
object segmentation,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2018, pp. 6526–
6535.
[46] T. Zhuo, Z. Cheng, P. Zhang, Y. Wong, and M. Kankanhalli,
“Unsupervised online video object segmentation with motion
property understanding,” IEEE Transactions on Image Process-
ing, vol. 29, pp. 237–249, 2019.
[47] X. Lu, W. Wang, M. Danelljan, T. Zhou, J. Shen, and
L. Van Gool, “Video object segmentation with episodic graph
memory networks,” in European Conference on Computer Vi-
sion, 2020.
[48] P. Tokmakov, K. Alahari, and C. Schmid, “Learning video
object segmentation with visual memory,” in Proceedings of
the IEEE International Conference on Computer Vision, 2017,
pp. 4481–4490.
[49] M. Faisal, I. Akhter, M. Ali, and R. Hartley, “Exploiting
geometric constraints on dense trajectories for motion saliency,”
in Winter Conference on Applications of Computer Vision, 2019.
[50] K. Simonyan and A. Zisserman, “Two-stream convolutional
networks for action recognition in videos,” in Advances in
Neural Information Processing Systems, 2014, pp. 568–576.
[51] C. Feichtenhofer, A. Pinz, and R. Wildes, “Spatiotemporal
residual networks for video action recognition,” in Advances in
Neural Information Processing Systems, 2016, pp. 3468–3476.
[52] C. Feichtenhofer, A. Pinz, and R. P. Wildes, “Spatiotemporal
multiplier networks for video action recognition,” in Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2017, pp. 4768–4777.
[53] H. Song, W. Wang, S. Zhao, J. Shen, and K.-M. Lam, “Pyramid
dilated deeper convlstm for video salient object detection,” in
European Conference on Computer Vision, 2018, pp. 715–731.
[54] H. Li, G. Chen, G. Li, and Y. Yu, “Motion guided attention
for video salient object detection,” in Proceedings of the IEEE
International Conference on Computer Vision, 2019, pp. 7274–
7283.
[55] C. Bak, A. Kocak, E. Erdem, and A. Erdem, “Spatio-temporal
saliency networks for dynamic saliency prediction,” IEEE
Transactions on Multimedia, vol. 20, no. 7, pp. 1688–1698,
2017.
[56] L. Jiang, M. Xu, T. Liu, M. Qiao, and Z. Wang, “Deepvs: A deep
learning based video saliency prediction approach,” in European
Conference on Computer Vision, 2018, pp. 625–642.
[57] W. Wang, J. Shen, J. Xie, M.-M. Cheng, H. Ling, and A. Borji,
“Revisiting video saliency prediction in the deep learning era,”
IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, vol. PP, pp. 1–1, 2019.
[58] Q. Lai, W. Wang, H. Sun, and J. Shen, “Video saliency
prediction using spatiotemporal residual attentive networks,”
IEEE Transactions on Image Processing, vol. 29, pp. 1113–
1126, 2019.
[59] K. Xu, L. Wen, G. Li, L. Bo, and Q. Huang, “Spatiotem-
poral cnn for video object segmentation,” arXiv preprint
arXiv:1904.02363, 2019.
[60] N. Srivastava, E. Mansimov, and R. Salakhudinov, “Unsuper-
vised learning of video representations using lstms,” in Pro-
ceedings of the International Conference on Machine Learning,
2015, pp. 843–852.
[61] G. Li, Y. Xie, T. Wei, K. Wang, and L. Lin, “Flow guided
recurrent neural encoder for video salient object detection,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 3243–3252.
[62] Y. Yu, J. Choi, Y. Kim, K. Yoo, S.-H. Lee, and G. Kim,
“Supervising neural attention models for video captioning by
human gaze data,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2017, pp. 490–498.
[63] O. Le Meur, P. Le Callet, D. Barba, and D. Thoreau, “A
coherent computational approach to model bottom-up visual
attention,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 28, no. 5, pp. 802–817, 2006.
[64] N. Bruce and J. Tsotsos, “Saliency based on information
maximization,” in Advances in Neural Information Processing
Systems, 2006, pp. 155–162.
[65] J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,”
in Advances in Neural Information Processing Systems, 2007,
pp. 545–552.
[66] J. M. Wolfe, K. R. Cave, and S. L. Franzel, “Guided search: an
alternative to the feature integration model for visual search,”
Journal of Experimental Psychology: Human perception and
performance, vol. 15, no. 3, p. 419, 1989.
[67] C. Koch and S. Ullman, “Shifts in selective visual attention:
towards the underlying neural circuitry,” in Matters of intelli-
gence. Springer, 1987, pp. 115–141.
[68] W. Wang, J. Shen, F. Guo, M.-M. Cheng, and A. Borji,
“Revisiting video saliency: A large-scale benchmark and a new
model,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2018, pp. 4894–4903.
[69] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine trans-
lation by jointly learning to align and translate,” Proceedings
of the International Conference on Learning Representations,
2015.
[70] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,”
in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 7132–7141.
[71] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, “CBAM: Con-
volutional block attention module,” in European Conference on
Computer Vision, 2018, pp. 3–19.
[72] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked
attention networks for image question answering,” in Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2016, pp. 21–29.
[73] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[74] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang,
and X. Tang, “Residual attention network for image classifi-
cation,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2017, pp. 3156–3164.
[75] S. Xie and Z. Tu, “Holistically-nested edge detection,” in
Proceedings of the IEEE International Conference on Computer
Vision, 2015, pp. 1395–1403.
[76] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and
A. L. Yuille, “Deeplab: Semantic image segmentation with
deep convolutional nets, atrous convolution, and fully connected
crfs,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[77] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel
matters–improve semantic segmentation by global convolutional
network,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2017, pp. 4353–4361.
[78] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “PWC-Net: Cnns
for optical flow using pyramid, warping, and cost volume,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 8934–8943.
[79] S. Caelles, J. Pont-Tuset, F. Perazzi, A. Montes, K.-K.
Maninis, and L. Van Gool, “The 2019 DAVIS challenge on
VOS: Unsupervised multi-object segmentation,” arXiv preprint
arXiv:1905.00737, 2019.
[80] Y. J. Koh and C.-S. Kim, “Primary object segmentation in
videos based on region augmentation and reduction,” in Pro-
ceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2017, pp. 7417–7425.
[81] M. Siam, C. Jiang, S. Lu, L. Petrich, M. Gamal, M. Elhoseiny,
and M. Jagersand, “Video object segmentation using teacher-
student adaptation in a human robot interaction (hri) setting,”
in International Conference on Robotics and Automation, 2019,
pp. 50–56.
[82] Z. Yang, Q. Wang, L. Bertinetto, W. Hu, S. Bai, and P. H. Torr,
“Anchor diffusion for unsupervised video object segmentation,”
in Proceedings of the IEEE International Conference on Com-
puter Vision, 2019, pp. 931–940.
[83] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-
CNN,” in Proceedings of the IEEE International Conference
on Computer Vision, 2017, pp. 2961–2969.
[84] J. Luiten, P. Voigtlaender, and B. Leibe, “Premvos: Proposal-
generation, refinement and merging for video object segmenta-
tion,” in ACCV, 2018, pp. 565–580.
[85] J. Luiten, I. E. Zulfikar, and B. Leibe, “Unovost: Unsupervised
offline video object segmentation and tracking,” in Winter Con-
ference on Applications of Computer Vision, 2020, pp. 2000–
2009.
[86] X. Huang, C. Shen, X. Boix, and Q. Zhao, “Salicon: Reducing
the semantic gap in saliency prediction by adapting deep neural
networks,” in Proceedings of the IEEE International Conference
on Computer Vision, 2015, pp. 262–270.
[87] N. Xu, L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. Price,
S. Cohen, and T. Huang, “Youtube-vos: Sequence-to-sequence
video object segmentation,” in European Conference on Com-
puter Vision, 2018, pp. 585–601.
[88] P. Linardos, E. Mohedano, J. J. Nieto, N. E. O’Connor, X. Giro-
i Nieto, and K. McGuinness, “Simple vs complex temporal
recurrences for video saliency prediction,” in Proceedings of
the British Machine Vision Conference, 2019.
[89] K. Min and J. J. Corso, “Tased-net: Temporally-aggregating spa-
tial encoder-decoder network for video saliency detection,” in
Proceedings of the IEEE International Conference on Computer
Vision, 2019, pp. 2394–2403.
[90] T.-W. Hui, X. Tang, and C. C. Loy, “Liteflownet: A lightweight
convolutional neural network for optical flow estimation,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 8981–8989.
[91] A. Ranjan and M. J. Black, “Optical flow estimation using a
spatial pyramid network,” in Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2017, pp.
4161–4170.