
Probabilistic Graph Attention Network with Conditional Kernels for Pixel-Wise Prediction

Dan Xu, Xavier Alameda-Pineda, Senior Member, IEEE, Wanli Ouyang, Senior Member, IEEE, Elisa Ricci, Member, IEEE, Xiaogang Wang, Senior Member, IEEE, Nicu Sebe, Senior Member, IEEE

Abstract—Multi-scale representations deeply learned via convolutional neural networks have shown tremendous importance for various pixel-level prediction problems. In this paper we present a novel approach that advances the state of the art on pixel-level prediction in a fundamental aspect, i.e. structured multi-scale feature learning and fusion. In contrast to previous works that directly consider the multi-scale feature maps obtained from the inner layers of a primary CNN architecture and simply fuse the features with weighted averaging or concatenation, we propose a probabilistic graph attention network structure based on a novel Attention-Gated Conditional Random Fields (AG-CRFs) model for learning and fusing multi-scale representations in a principled manner. To further improve the learning capacity of the network structure, we propose to exploit feature-dependent conditional kernels within the deep probabilistic framework. Extensive experiments are conducted on four publicly available datasets (i.e. BSDS500, NYUD-V2, KITTI and Pascal-Context) and on three challenging pixel-wise prediction problems involving both discrete and continuous labels (i.e. monocular depth estimation, object contour prediction and semantic segmentation). Quantitative and qualitative results demonstrate the effectiveness of the proposed latent AG-CRF model and the overall probabilistic graph attention network with feature conditional kernels for structured feature learning and pixel-wise prediction.

Index Terms—Structured representation learning, attention model, conditional random fields, conditional kernels, pixel-wise prediction

1 INTRODUCTION

The capability to effectively exploit multi-scale feature representations is considered a crucial factor for achieving accurate pixel-level predictions in both traditional [1] and CNN-based [2], [3], [4], [5], [6] approaches. Restricting our attention to deep learning-based solutions, existing methods [2], [4] typically derive multi-scale representations by adopting standard CNN architectures and directly considering the feature maps associated with different inner semantic layers. These maps are highly complementary: while the features from the shallow layers are responsible for predicting low-level details, the ones from the deeper layers are devoted to encoding the high-level semantic structure of the objects. Traditionally, concatenation and weighted averaging are very popular strategies to combine multi-scale representations (see Figure 2a). While these strategies generally lead to increased prediction accuracy compared to single-scale models, they severely simplify the complex structured relationships between multi-scale feature maps. The motivational cornerstone of this study is the following research question: is it worth modelling and exploiting complex relationships between multiple scales of deep representations for pixel-wise prediction?

• Dan Xu is with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. E-mail: danxu@cse.ust.hk.
• Xavier Alameda-Pineda is with the Perception Group, INRIA. E-mail: xavier.alameda-pineda@inria.fr.
• Wanli Ouyang is with the Department of Electrical and Information Engineering, the University of Sydney. E-mail: wanli.ouyang@sydney.edu.au.
• Xiaogang Wang is with the Department of Electronic Engineering, the Chinese University of Hong Kong. E-mail: xgwang@ee.cuhk.edu.hk.
• Elisa Ricci and Nicu Sebe are with the Department of Information Engineering and Computer Science, University of Trento, Italy. E-mail: {elisa.ricci, niculae.sebe}@unitn.it.

Manuscript received April 19, 2005; revised August 26, 2015.

Fig. 1. The proposed model targets multi-scale structured deep representation learning and can be applied to different pixel-wise prediction problems involving both discrete and continuous prediction variables, i.e. (a) monocular depth estimation on KITTI, (b) object contour detection on BSDS500, and (c) semantic segmentation on Pascal-Context. The first, second and third columns are the input RGB images, the ground truth and the predicted results, respectively.

Inspired by the success of recent works employing graphical models within deep CNN architectures [7], [8] for structured prediction, we propose a probabilistic graph attention network structure based on a novel Attention-Gated Conditional Random Fields (AG-CRFs) model, which allows learning effective feature representations at each scale by exploiting the information available from the other scales. This is achieved by incorporating an attention mechanism [9] seamlessly integrated into the multi-scale learning process in the form of gates [10]. Intuitively, the attention mechanism further enhances the learning of the multi-scale representation fusion by controlling the information flow (i.e. the messages) among the feature maps, thus improving the overall performance of the model. To further increase the learning capacity of the model, we introduce feature-dependent conditional kernels for predicting the attentions and the messages, making them conditioned on the related feature context rather than shared across all feature inputs. In contrast to previous works [7], [8] aiming at structured modelling at the prediction level, our model focuses on the feature level, which leads to a much higher flexibility when applied to different pixel-wise prediction problems involving both continuous and discrete prediction variables.

We implement the proposed AG-CRF as a neural network module and integrate it into a hierarchical multi-scale CNN framework, defining a novel probabilistic graph attention network structure, termed PGA-Net, for pixel-wise prediction. The hierarchical network is able to learn richer multi-scale features than conventional CNNs, and its representational power is further enhanced by the proposed conditional kernel AG-CRF model. We extensively evaluate the effectiveness of the proposed model on three different continuous and discrete pixel-wise prediction tasks (see Figure 1), i.e. object contour prediction, monocular depth estimation and semantic segmentation, and on multiple challenging benchmarks (BSDS500 [11], NYUD-V2 [12], KITTI [13] and Pascal-Context [14]). The results demonstrate that our approach is able to learn rich and effective deep structured representations, showing very competitive performance with respect to state-of-the-art methods on these tasks.

This paper extends our earlier work [15] by proposing a new feature-dependent conditional kernel strategy, further elaborating the related works, providing more methodological details, and significantly expanding the experiments and analysis by demonstrating the effectiveness on two additional popular pixel-wise prediction tasks (monocular depth estimation and semantic segmentation). Multi-scale deep features have been widely demonstrated to be effective (e.g. [2], [16]). The importance of our work lies in the joint probabilistic modelling of the relationships among multi-scale deep features using conditional random fields and the attention mechanism. To summarize, the contribution of this paper is threefold:

• First, we propose a structured probabilistic graph network for effectively learning and fusing multi-scale deep representations. We learn the multi-scale features using a probabilistic graphical attention model, which is a principled way of modelling the statistical relationships among multi-scale features.
• Second, we design an attention-guided CRF graph model which models the attention as gating for controlling the message passing among features of different scales. As the messages passed among feature maps are not always useful, the attention mechanism is specifically introduced to control the message passing flow among feature maps of different scales. The attention is incorporated directly in the probabilistic graphical model. We also introduce a conditional kernel strategy for feature-dependent attention and message learning.
• Third, extensive experiments are conducted on three distinct pixel-wise prediction tasks and on four different challenging datasets, demonstrating that the proposed model and framework significantly outperform previous methods integrating multi-scale information from different semantic network layers, and show very competitive performance on all the tasks compared with state-of-the-art methods. The proposed model is generic in multi-scale feature learning and can be flexibly employed in other continuous and discrete prediction problems.

The remainder of this paper is organized as follows. Sec. 2 introduces the related works; we then illustrate the proposed approach in Sec. 3, and in Sec. 4 we present the details of the model implementation in deep networks. The experimental evaluation and analysis are elaborated in Sec. 5. We finally conclude the paper in Sec. 6.

2 RELATED WORK

2.1 Pixel-wise Prediction

We review previous deep learning works on three important pixel-wise prediction tasks, i.e. contour detection, monocular depth estimation and semantic segmentation, on which we extensively demonstrate the effectiveness of the proposed approach.

Contour Detection. In the last few years several deep learning models have been proposed for detecting contours [2], [4], [17], [18], [19], [20] or crisp boundaries [21]. Among these, some works explicitly focused on devising multi-scale CNN models in order to boost performance. For instance, the Holistically-Nested Edge Detection method [2] employed multiple side outputs derived from the inner layers of a primary CNN and combined them for the final prediction. Liu et al. [20] introduced a framework to learn rich deep representations by concatenating features derived from all convolutional layers of VGG16. Bertasius et al. [18] considered skip-layer CNNs to jointly combine feature maps from multiple layers. Maninis et al. [4] proposed Convolutional Oriented Boundaries (COB), where features from different layers are fused to compute oriented contours and region hierarchies. However, these works combine the multi-scale representations from different layers adopting concatenation and weighted averaging schemes, without considering the dependencies between the features.

Monocular Depth Estimation. There are many recent works on monocular depth estimation based on deep CNNs [7], [22], [23], [24], [25], [26], [27], [28], [29]. Among them, Eigen et al. [30] proposed a multi-scale network architecture for the task by considering two cascaded networks, performing a coarse-to-fine refinement of the depth prediction. They further extended this framework to deal with multiple pixel-level predictions, such as surface normal estimation and semantic segmentation. Fu et al. [26] presented the DORN method, which casts monocular depth estimation as a deep ordinal regression problem. Lee et al. [29] designed a network module which utilizes multi-scale local planar guidance to learn more effective structure features for depth estimation. Wang et al. [23] introduced a CNN-CRF framework for joint depth estimation and semantic segmentation. The most related work to ours is [7], which introduced a continuous CRF model for end-to-end learning of depth regression with a

front-end CNN. Xu et al. [8] improved [7] by presenting a multi-scale continuous CRF model to learn the multi-scale predictions and their fusion. However, these approaches focus purely on modelling the structure of the predictions, therefore answering specific problems and leading to task-dependent models and associated architectures. In contrast, our work focuses on statistical modelling of the structured features, and is thus more flexible and can be applied to different continuous or discrete prediction tasks.

Fig. 2. An illustration of different schemes for multi-scale deep feature learning and fusion: (a) the traditional approach (e.g. concatenation, weighted averaging), (b) the proposed CRF implementing multi-scale feature fusion, and (c) the proposed Attention-Gated-CRF based approach.

Semantic Segmentation. As an important task in scene understanding, semantic segmentation has received wide attention in recent years. Long et al. [31] proposed a fully convolutional network for the task, which significantly improved the performance and reduced the number of network parameters. The dilated convolution [5], [32] was devised in order to obtain a larger receptive field for the features, further boosting the segmentation performance. OCNet [33] introduced an object-context pooling strategy based on affinity learning among pixels to capture a global context for feature refinement. Other main-stream directions mainly explored multi-scale feature learning and model ensembling [16], [34], designing convolutional encoder-decoder network structures [35], [36] and performing end-to-end structured prediction with CRF models [37], [38], [39]. A closely related work to ours in the literature is the GloRe approach [40], which utilizes a graph convolution model for learning generic representations for semantic segmentation. However, its modelling considers only a single scale and is not cast in a probabilistic graph formulation.

There are also existing works which explored joint deep learning of more than one pixel-wise prediction task [41], [42], [43], [44]. Specifically, Xu et al. [41] designed a PAD-Net architecture which learns multiple auxiliary pixel-wise tasks and presents a multi-task distillation module to combine the predictions from the different auxiliary tasks to help the more important final tasks. Vandenhende et al. [44] further improved PAD-Net by introducing a method for feature propagation across different multi-task representations. However, these works focus more on empirical design for learning the interactions between different pixel-wise tasks, while our model targets statistical probabilistic modelling for structured multi-scale feature fusion, which provides a theoretical explanation and is thus beneficial for a more effective deep module design.

2.2 Deep Multi-scale Learning

The importance of combining multi-scale information has been widely revealed in various computer vision tasks [2], [45], [46], [47]. For instance, Xie et al. [2] proposed a fully convolutional neural network with deep supervision for edge detection, which employs a weighted averaging strategy for the combination of multi-scale side outputs. Skip-layer networks are also very popular for learning multi-scale representations, where the features obtained from different semantic layers of a backbone CNN are combined in an output layer to produce more robust representations. Sun et al. [48] and Wang et al. [49] proposed the HRNet architecture, which aims to enhance high-resolution representations by aggregating representations at multiple resolutions from different network stages. To aggregate multi-scale feature contexts, dilated or à trous convolution structures [50] were devised, which can be embedded in different convolutional layers of a deep network to obtain a larger receptive field. Yang et al. [51] introduced DAG-CNNs to combine multi-scale features produced from different ReLU layers using an element-wise addition operation. Huang et al. [52] recently proposed a multi-scale network architecture using dense skip connections to pass features across different scales. However, in these works, the multi-scale representations or predictions are typically combined via simple concatenation or weighted averaging operations. We are also not aware of previous works exploring multi-scale representation learning and fusion within a probabilistic CRF graph framework. Besides, we also learn an attention mechanism acting as gating in the graph model to control the message passing between the feature variables.

2.3 Attention Models

Attention models [53] have been successfully exploited in deep learning for various tasks such as image classification [54], speech recognition [55], image caption generation [56] and language translation [57]. Fu et al. [58] recently proposed a dual attention model considering a combination of both spatial- and channel-wise attention for semantic segmentation. However, to our knowledge, this work is the first to introduce a structured attention model for both discrete and continuous prediction tasks. Furthermore, we are not aware of previous studies integrating the attention mechanism into a probabilistic (CRF) framework to control the message passing between hidden variables. We model the attention as gates [10], which have been used in previous deep models such as restricted Boltzmann machines for unsupervised feature learning [59], LSTMs for sequence learning [60] and CNNs for image classification [61].

However, none of these works explores the possibility of jointly learning multi-scale deep representations and an attention model within a unified probabilistic graphical model.

2.4 Structured Learning based on CRFs

Conditional random fields (CRFs) were widely used for probabilistic structured modelling in the pre-deep-learning era for a wide range of problems, such as object recognition [62], information extraction [63] and pixel-wise semantic labelling [64], [65], [66], [67]. Specifically, Boykov et al. [67] proposed a combinatorial graph cut algorithm integrating cues of boundaries, regions and shapes for semantic segmentation. Krähenbühl and Koltun [66] designed a fully-connected CRF model with Gaussian pairwise potentials and accordingly proposed an efficient approximate inference solution for the model. With the rapid progress of deep learning techniques, CRFs have also been utilized together with Convolutional Neural Network (CNN) architectures for learning structured deep predictions [7], [38], [39], [47]. Among them, Zheng et al. [39] first implemented CRF inference as a Recurrent Neural Network for end-to-end learning with any backbone CNN. There are also existing works exploring learning structured deep features with CRFs, in order to more flexibly adapt to different continuous and discrete tasks, such as human pose estimation [68] and monocular depth estimation [69]. These models have achieved great success; however, none of them considered simultaneously learning structured multi-scale representations and structured attention in a joint probabilistic graph formulation for pixel-wise prediction.

3 THE PROPOSED APPROACH

3.1 Problem Definition and Notation

Given an input image $\mathbf{I}$ and a generic front-end CNN model with parameters $\mathbf{W}_c$, we consider a set of $S$ multi-scale feature maps $\mathcal{F}=\{\mathbf{f}_s\}_{s=1}^{S}$. Being a generic framework, these feature maps can be the output of $S$ intermediate CNN layers or of another representation, thus $s$ is a virtual scale. The feature map at scale $s$, $\mathbf{f}_s$, can be interpreted as a set of feature vectors, $\mathbf{f}_s=\{\mathbf{f}^i_s\}_{i=1}^{N}$, where $N$ is the number of pixels. In contrast to previous works adopting simple concatenation or weighted averaging schemes [2], [16], we propose to combine the multi-scale feature maps by learning a set of latent feature maps $\mathbf{h}_s=\{\mathbf{h}^i_s\}_{i=1}^{N}$ with a novel Attention-Gated CRF model sketched in Figure 2. Intuitively, this allows a joint refinement of the features by flowing information between different scales. Moreover, since the information from one scale may or may not be relevant for the pixels at another scale, we utilise the concept of gate, previously introduced in the literature in the case of graphical models [70], in our CRF formulation. These gates are binary random hidden variables that permit or block the flow of information between scales at every pixel. Formally, $g^i_{s_e,s_r}\in\{0,1\}$ is the gate at pixel $i$ of scale $s_r$ (receiver) from scale $s_e$ (emitter), and we also write $\mathbf{g}_{s_e,s_r}=\{g^i_{s_e,s_r}\}_{i=1}^{N}$. Precisely, when $g^i_{s_e,s_r}=1$ the hidden variable $\mathbf{h}^i_{s_r}$ is updated taking (also) into account the information from the $s_e$-th layer, i.e. $\mathbf{h}_{s_e}$. As shown in the following, the joint inference of the hidden features and the gates leads to estimating the optimal features as well as the corresponding attention model, hence the name Attention-Gated CRFs.
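As a concrete illustration of the notation, the following Python sketch collects a set of $S=3$ observed feature maps $\mathcal{F}$ from intermediate layers of a front-end CNN. It is a minimal sketch using a torchvision ResNet-50 and torchvision's layer names as stand-ins for the intermediate layers; the actual multi-scale construction used by PGA-Net is described in Sec. 4.2.

import torch
import torchvision

# Minimal sketch: collect multi-scale observed features F = {f_s} from a
# front-end CNN. The backbone choice and layer names below are illustrative
# assumptions (torchvision's layer2/3/4 roughly correspond to res3/res4/res5).
backbone = torchvision.models.resnet50()

features = {}
def save(name):
    def hook(module, inputs, output):
        features[name] = output  # f_s with shape (B, C_s, H_s, W_s)
    return hook

backbone.layer2.register_forward_hook(save("res3"))
backbone.layer3.register_forward_hook(save("res4"))
backbone.layer4.register_forward_hook(save("res5"))

image = torch.randn(1, 3, 321, 321)  # input image I
backbone(image)
multi_scale_F = [features[k] for k in ("res3", "res4", "res5")]  # S = 3 scales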

3.2 Attention-Gated CRFs

Given the observed multi-scale feature maps $\mathcal{F}$ of image $\mathbf{I}$, the objective is to estimate the hidden multi-scale representation $\mathcal{H}=\{\mathbf{h}_s\}_{s=1}^{S}$ and, accessorily, the attention gate variables $\mathcal{G}=\{\mathbf{g}_{s_e,s_r}\}_{s_e,s_r=1}^{S}$. To do so, we formalize the problem within a conditional random field framework and write the Gibbs distribution as

$$P(\mathcal{H},\mathcal{G}\,|\,\mathbf{I},\Theta)=\exp\big(-E(\mathcal{H},\mathcal{G},\mathbf{I},\Theta)\big)/Z(\mathbf{I},\Theta),\qquad(1)$$

where $\Theta$ is the set of parameters and $E$ is the energy function. As usual, we exploit both unary and binary potentials to couple the hidden variables between them and to the observations. Importantly, the proposed binary potential is gated, and is thus only active when the gate is open. More formally, the general form¹ of the energy function writes:

$$E(\mathcal{H},\mathcal{G},\mathbf{I},\Theta)=\underbrace{\sum_{s}\sum_{i}\phi_h(\mathbf{h}^i_s,\mathbf{f}^i_s)}_{\text{unary potential}}+\underbrace{\sum_{s_e,s_r}\sum_{i,j}g^i_{s_e,s_r}\,\psi_h(\mathbf{h}^i_{s_r},\mathbf{h}^j_{s_e},\mathbf{K}^{i,j}_{s_e,s_r})}_{\text{gated pairwise potential}}\qquad(2)$$

The first term of the energy function is a classical unary term that relates the hidden features to the observed multi-scale CNN representations. The second term synthesizes the theoretical contribution of the present study, because it conditions the effect of the pairwise potential $\psi_h(\mathbf{h}^i_{s_e},\mathbf{h}^j_{s_r})$ upon the gate hidden variable $g^i_{s_e,s_r}$. Figure 2c depicts the model formulated in Eq. (2). If we remove the attention gate variables, it becomes a general multi-scale CRF as shown in Figure 2b.

Given that formulation, and as is typically the case in conditional random fields, we exploit the mean-field approximation in order to derive a tractable inference procedure. Under this generic form, the mean-field inference procedure writes:

$$q(\mathbf{h}^i_s)\propto\exp\Big(\phi_h(\mathbf{h}^i_s,\mathbf{f}^i_s)+\sum_{s'\neq s}\sum_{j}\mathbb{E}_{q(g^i_{s',s})}\{g^i_{s',s}\}\,\mathbb{E}_{q(\mathbf{h}^j_{s'})}\{\psi_h(\mathbf{h}^i_s,\mathbf{h}^j_{s'})\}\Big),\qquad(3)$$

$$q(g^i_{s',s})\propto\exp\Big(g^i_{s',s}\,\mathbb{E}_{q(\mathbf{h}^i_s)}\Big\{\sum_{j}\mathbb{E}_{q(\mathbf{h}^j_{s'})}\big\{\psi_h(\mathbf{h}^i_s,\mathbf{h}^j_{s'})\big\}\Big\}\Big),\qquad(4)$$

where $\mathbb{E}_q$ stands for the expectation with respect to the distribution $q$.

¹ One could certainly include a unary potential for the gate variables as well. However, this would imply that there is a way to set/learn the a priori distribution of opening/closing a gate. In practice we did not observe any notable difference between using or skipping the unary potential on $g$.

Fig. 3. An overview of the proposed Probabilistic Graph Attention Network (PGA-Net) for monocular depth estimation. The symbols C, D, M and L denote the convolution, the deconvolution, the max-pooling operation and the optimization loss, respectively. AG-CRF represents the proposed attention-gated CRF model with conditional kernels (CK) for structured multi-scale feature learning, which is fully differentiable and supports end-to-end training with a multi-scale CNN network. PGA-Net consists of two hierarchies: hierarchy 1 generates rich multi-scale features which are refined by AG-CRFs and then passed to hierarchy 2 for the final prediction.

Before deriving these formulae for our precise choice of potentials, we remark that, since the gate is a binary variable, the expectation of its value is the same as $q(g^i_{s',s}=1)$. By defining $M^i_{s',s}=\mathbb{E}_{q(\mathbf{h}^i_s)}\big\{\sum_{j}\mathbb{E}_{q(\mathbf{h}^j_{s'})}\{\psi_h(\mathbf{h}^i_s,\mathbf{h}^j_{s'})\}\big\}$, the expected value of the gate writes:

$$\alpha^i_{s,s'}=\mathbb{E}_{q(g^i_{s',s})}\{g^i_{s',s}\}=\frac{q(g^i_{s',s}=1)}{q(g^i_{s',s}=0)+q(g^i_{s',s}=1)}=\sigma\big(-M^i_{s',s}\big),\qquad(5)$$

where $\sigma(\cdot)$ denotes the sigmoid function. This finding is especially relevant in the framework of CNNs, since many attention models are typically obtained by applying the sigmoid function to features derived from a feed-forward network. Importantly, since the quantity $M^i_{s',s}$ depends on the expected values of the hidden features $\mathbf{h}^i_s$, the AG-CRF framework extends the unidirectional connection from the features to the attention model to a bidirectional connection, in which the expected value of the gate also allows refining the distribution of the hidden features.
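To make the link with conventional sigmoid attention concrete, the following toy PyTorch sketch evaluates Eq. (5) directly for arbitrary values of $M$; shapes and values are illustrative only.

import torch

# Toy illustration of Eq. (5): the expected gate is a sigmoid of the negative
# expected pairwise energy M (here one value per pixel, N pixels).
N = 4
M = torch.randn(N)          # M^i_{s',s}: expected pairwise potential per pixel
alpha = torch.sigmoid(-M)   # alpha^i_{s,s'} = sigma(-M^i_{s',s}), in (0, 1)

# alpha plays the role of a per-pixel attention weight gating the message
# passed from scale s' to scale s (see the convolutional form in Sec. 4.1).
print(alpha)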

3.3 AG-CRF Inference

In order to construct an operative model we need to define the unary and gated potentials $\phi_h$ and $\psi_h$. In our case, the unary potential corresponds to an isotropic Gaussian:

$$\phi_h(\mathbf{h}^i_s,\mathbf{f}^i_s)=-\frac{a^i_s}{2}\,\big\|\mathbf{h}^i_s-\mathbf{f}^i_s\big\|^2,\qquad(6)$$

where $a^i_s>0$ is a weighting factor.

The gated binary potential is specifically designed with a two-fold objective. On the one hand, we would like to learn and further exploit the relationships between hidden vectors at the same, as well as at different, scales. On the other hand, we would like to exploit previous knowledge on attention models and include linear terms in the potential. Indeed, this implicitly shapes the gate variable to include a linear operator on the features. Therefore, we choose a bilinear potential:

$$\psi_h(\mathbf{h}^i_s,\mathbf{h}^j_{s'})=\tilde{\mathbf{h}}^{i\top}_s\mathbf{K}^{i,j}_{s,s'}\tilde{\mathbf{h}}^j_{s'},\qquad(7)$$

where $\tilde{\mathbf{h}}^i_s=(\mathbf{h}^{i\top}_s,1)^\top$ and $\mathbf{K}^{i,j}_{s,s'}\in\mathbb{R}^{(C_s+1)\times(C_{s'}+1)}$, with $C_s$ being the size, i.e. the number of channels, of the representation at scale $s$. If we write this matrix as $\mathbf{K}^{i,j}_{s,s'}=(\mathbf{L}^{i,j}_{s,s'},\mathbf{l}^{i,j}_{s,s'};\mathbf{l}^{j,i\top}_{s',s},1)$, then $\mathbf{L}^{i,j}_{s,s'}$ exploits the relationships between hidden variables, while $\mathbf{l}^{i,j}_{s,s'}$ and $\mathbf{l}^{j,i}_{s',s}$ implement the classically used linear relationships of attention models. In other words, $\psi_h$ models the pairwise relationships between features with the upper-left block of the matrix. Furthermore, $\psi_h$ takes into account the linear relationships by completing the hidden vectors with the unity. In all, the energy function writes:

$$E(\mathcal{H},\mathcal{G},\mathbf{I},\Theta)=-\sum_{s}\sum_{i}\frac{a^i_s}{2}\,\big\|\mathbf{h}^i_s-\mathbf{f}^i_s\big\|^2+\sum_{s_e,s_r}\sum_{i,j}g^i_{s_e,s_r}\,\tilde{\mathbf{h}}^{i\top}_{s_r}\mathbf{K}^{i,j}_{s_r,s_e}\tilde{\mathbf{h}}^j_{s_e}.\qquad(8)$$
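The block structure of the kernel can be checked numerically. The following toy PyTorch sketch (arbitrary sizes and random values) verifies that the bilinear form of Eq. (7) decomposes into a pairwise term and the two linear, attention-like terms.

import torch

# Toy check of Eq. (7): with augmented vectors h~ = (h, 1) and K partitioned
# into blocks [[L, l], [l'^T, 1]], the potential decomposes into a pairwise
# term plus two linear terms (plus a constant).
Cs, Csp = 3, 5                      # channels at scales s and s'
h_s  = torch.randn(Cs)              # hidden feature h^i_s
h_sp = torch.randn(Csp)             # hidden feature h^j_{s'}

L    = torch.randn(Cs, Csp)         # pairwise block L^{i,j}_{s,s'}
l_s  = torch.randn(Cs)              # linear term l^{i,j}_{s,s'}
l_sp = torch.randn(Csp)             # linear term l^{j,i}_{s',s}

K = torch.zeros(Cs + 1, Csp + 1)
K[:Cs, :Csp] = L
K[:Cs, Csp]  = l_s
K[Cs, :Csp]  = l_sp
K[Cs, Csp]   = 1.0

h_s_aug  = torch.cat([h_s,  torch.ones(1)])   # \tilde h^i_s
h_sp_aug = torch.cat([h_sp, torch.ones(1)])   # \tilde h^j_{s'}

psi_full  = h_s_aug @ K @ h_sp_aug                          # \tilde h^T K \tilde h
psi_block = h_s @ L @ h_sp + h_s @ l_s + h_sp @ l_sp + 1.0  # block expansion
assert torch.allclose(psi_full, psi_block)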

Under these potentials, we can consequently update the mean-field inference equations to:

$$q(\mathbf{h}^i_s)\propto\exp\Big(-\frac{a^i_s}{2}\big(\|\mathbf{h}^i_s\|^2-2\,\mathbf{h}^{i\top}_s\mathbf{f}^i_s\big)+\sum_{s'\neq s}\alpha^i_{s,s'}\,\mathbf{h}^{i\top}_s\sum_{j}\big(\mathbf{L}^{i,j}_{s,s'}\bar{\mathbf{h}}^j_{s'}+\mathbf{l}^{i,j}_{s,s'}\big)\Big),\qquad(9)$$

where $\bar{\mathbf{h}}^j_{s'}$ is the expected a posteriori value of $\mathbf{h}^j_{s'}$.

The previous expression implies that the a posteriori distribution for $\mathbf{h}^i_s$ is a Gaussian. The mean vector of the Gaussian and the function $M$ write:

$$\bar{\mathbf{h}}^i_s=\frac{1}{a^i_s}\Big(a^i_s\mathbf{f}^i_s+\sum_{s'\neq s}\alpha^i_{s,s'}\sum_{j}\big(\mathbf{L}^{i,j}_{s,s'}\bar{\mathbf{h}}^j_{s'}+\mathbf{l}^{i,j}_{s,s'}\big)\Big),\qquad M^i_{s',s}=\sum_{j}\Big(\bar{\mathbf{h}}^{i\top}_s\mathbf{L}^{i,j}_{s,s'}\bar{\mathbf{h}}^j_{s'}+\bar{\mathbf{h}}^{i\top}_s\mathbf{l}^{i,j}_{s,s'}+\bar{\mathbf{h}}^{j\top}_{s'}\mathbf{l}^{j,i}_{s',s}\Big)\qquad(10)$$

which concludes the inference procedure. Furthermore, the proposed framework can be simplified to obtain traditional attention models. In most previous studies, the attention variables are computed directly from the multi-scale features instead of from the hidden variables. Indeed, since many of these studies do not propose a probabilistic formulation, there are no hidden variables and the attention is computed sequentially through the scales. We can emulate the same behavior within the AG-CRF framework by modifying the gated potential as follows:

$$\tilde{\psi}_h(\mathbf{h}^i_s,\mathbf{h}^j_{s'},\mathbf{f}^i_s,\mathbf{f}^j_{s'})=\mathbf{h}^{i\top}_s\mathbf{L}^{i,j}_{s,s'}\mathbf{h}^j_{s'}+\mathbf{f}^{i\top}_s\mathbf{l}^{i,j}_{s,s'}+\mathbf{f}^{j\top}_{s'}\mathbf{l}^{j,i}_{s',s}.\qquad(11)$$

This means that we keep the pairwise relationships between hidden variables (as in any CRF) and let the attention model be generated by a linear combination of the observed features from the CNN, as is traditionally done. The changes in the inference procedure are straightforward. We refer to this model as partially-latent AG-CRFs (PLAG-CRFs), whereas the more general one is denoted as fully-latent AG-CRFs (FLAG-CRFs). The potential defined for the PLAG-CRF model, i.e. (11), has an impact on the inference of both the hidden features and the attention gates. Indeed, since the linear term does not depend on the hidden features, but on the observations, the mean of the hidden features is computed independently of this linear term:

$$\bar{\mathbf{h}}^i_s=\frac{1}{a^i_s}\Big(a^i_s\mathbf{f}^i_s+\sum_{s'\neq s}\alpha^i_{s,s'}\sum_{j}\mathbf{L}^{i,j}_{s,s'}\bar{\mathbf{h}}^j_{s'}\Big).\qquad(12)$$

Likewise, the linear terms of the attention gate no longer depend on the hidden features but on the observations from the CNN:

$$M^i_{s',s}=\sum_{j}\Big(\bar{\mathbf{h}}^{i\top}_s\mathbf{L}^{i,j}_{s,s'}\bar{\mathbf{h}}^j_{s'}+\mathbf{f}^{i\top}_s\mathbf{l}^{i,j}_{s,s'}+\mathbf{f}^{j\top}_{s'}\mathbf{l}^{j,i}_{s',s}\Big).\qquad(13)$$

These are the two differences with respect to the inference of FLAG-CRFs. We will also introduce the implementation differences between the two versions in Sec. 4.

3.4 Feature-Dependent Conditional Kernels

In the model inference shown in (10), (12) and (13), the kernels $\mathbf{L}^{i,j}_{s,s'}$, $\mathbf{l}^{i,j}_{s,s'}$ and $\mathbf{l}^{j,i}_{s',s}$ are shared by all the input features. This property restricts the learning capacity of the model: one would like those kernels to depend on the features so as to capture feature-correlated context, which is particularly important for pixel-wise prediction tasks. We hence propose to learn feature-conditioned kernels instead of the previously shared kernels. In practice, each kernel is predicted from the input features using a linear transformation as follows:

$$\mathbf{L}^{i,j}_{s,s'}=\mathbf{W}_{L^{i,j}_{s,s'}}\,\mathrm{concat}(\bar{\mathbf{h}}^i_s,\bar{\mathbf{h}}^j_{s'})+\mathbf{b}_{L^{i,j}_{s,s'}},\quad \mathbf{l}^{i,j}_{s,s'}=\mathbf{W}_{l^{i,j}_{s,s'}}\,\bar{\mathbf{h}}^i_s+\mathbf{b}_{l^{i,j}_{s,s'}},\quad \mathbf{l}^{j,i}_{s',s}=\mathbf{W}_{l^{j,i}_{s',s}}\,\bar{\mathbf{h}}^j_{s'}+\mathbf{b}_{l^{j,i}_{s',s}},\qquad(14)$$

where $\mathrm{concat}(\cdot)$ denotes a concatenation operation. The symbols $\{\mathbf{W}_{L^{i,j}_{s,s'}},\mathbf{b}_{L^{i,j}_{s,s'}}\}$, $\{\mathbf{W}_{l^{i,j}_{s,s'}},\mathbf{b}_{l^{i,j}_{s,s'}}\}$ and $\{\mathbf{W}_{l^{j,i}_{s',s}},\mathbf{b}_{l^{j,i}_{s',s}}\}$ are the parameters of the linear transformations. With this concise modification of the model inference, we further clearly boost the performance of the model on different pixel-wise prediction tasks, as elaborated in the experimental part.
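A minimal PyTorch sketch of this prediction step is given below; 1×1 convolutions play the role of the linear transformations, and the channel sizes are illustrative assumptions. In the full model, the predicted maps then parameterize the message passing and attention estimation described in Sec. 4.1; only the prediction step is shown here.

import torch
import torch.nn as nn

# Sketch of feature-dependent kernel prediction (Eq. 14) in convolutional form:
# 1x1 convolutions stand in for the linear transformations W, b.
C = 64                                         # channels per scale (assumption)
pred_L  = nn.Conv2d(2 * C, C, kernel_size=1)   # L_{se->sr} from concat(h_se, h_sr)
pred_le = nn.Conv2d(C, C, kernel_size=1)       # l_{se->sr} from h_se
pred_lr = nn.Conv2d(C, C, kernel_size=1)       # l_{sr->se} from h_sr

h_se = torch.randn(1, C, 32, 32)               # latent features, emitter scale
h_sr = torch.randn(1, C, 32, 32)               # latent features, receiver scale

L_k = pred_L(torch.cat([h_se, h_sr], dim=1))   # conditional pairwise kernel map
l_e = pred_le(h_se)                            # conditional linear term (emitter)
l_r = pred_lr(h_sr)                            # conditional linear term (receiver)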

4 NETWORK IMPLEMENTATION

4.1 Neural Network Implementation for Joint Learning

In order to infer the hidden variables and learn the parameters of the AG-CRFs together with those of the front-end CNN, we implement the AG-CRF updates as a neural network. A detailed computing flow is depicted in Figure 4. The implementation consists of several steps:

• Conditional kernel prediction for the kernels $\mathbf{L}_{s_e\to s_r}$, $\mathbf{l}_{s_e\to s_r}$ and $\mathbf{l}_{s_r\to s_e}$, with $\mathbf{L}_{s_e\to s_r}\leftarrow\mathbf{W}_{L_{s_e\to s_r}}\otimes\mathrm{concat}(\mathbf{h}_{s_e},\mathbf{h}_{s_r})$, $\mathbf{l}_{s_e\to s_r}\leftarrow\mathbf{W}_{l_{s_e\to s_r}}\otimes\mathbf{h}_{s_e}$ and $\mathbf{l}_{s_r\to s_e}\leftarrow\mathbf{W}_{l_{s_r\to s_e}}\otimes\mathbf{h}_{s_r}$;
• Message passing from the $s_e$-th scale to the current $s_r$-th scale, performed with $\mathbf{h}_{s_e\to s_r}\leftarrow\mathbf{L}_{s_e\to s_r}\otimes\mathbf{h}_{s_e}$, where $\otimes$ denotes the convolution operation and $\mathbf{L}_{s_e\to s_r}$ denotes the corresponding convolution kernel;
• Attention map estimation $q(\mathbf{g}_{s_e,s_r}=1)\leftarrow\sigma\big(\mathbf{h}_{s_r}\odot(\mathbf{L}_{s_e\to s_r}\otimes\mathbf{h}_{s_e})+\mathbf{l}_{s_e\to s_r}\otimes\mathbf{h}_{s_e}+\mathbf{l}_{s_r\to s_e}\otimes\mathbf{h}_{s_r}\big)$, where $\mathbf{L}_{s_e\to s_r}$, $\mathbf{l}_{s_e\to s_r}$ and $\mathbf{l}_{s_r\to s_e}$ are convolution kernels and $\odot$ represents the element-wise product operation;
• Attention-gated message passing from the other scales and addition of the unary term: $\bar{\mathbf{h}}_{s_r}=\mathbf{f}_{s_r}\oplus\mathbf{a}_{s_r}\sum_{s_e\neq s_r}\big(q(\mathbf{g}_{s_e,s_r}=1)\odot\mathbf{h}_{s_e\to s_r}\big)$, where $\mathbf{a}_{s_r}$ encodes the effect of the $a^i_{s_r}$ for weighting the message and can be implemented as a $1\times1$ convolution. The symbol $\oplus$ denotes element-wise addition.

In order to simplify the overall inference procedure, and because the magnitude of the linear term of $\psi_h$ is in practice negligible compared to the quadratic term, we discard the message associated with the linear term. When the inference is complete, the final estimate is obtained by convolving all the scales. For the inference of the partially latent model, we only need to discard the corresponding terms in the computation of the messages in the second step, and replace the latent features with the observed features for the attention prediction in the third step.
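For illustration, the following condensed PyTorch sketch implements one such update between an emitter scale and a receiver scale. It uses shared (non-conditional) kernels, a single message direction and illustrative layer sizes, so it is a simplification under stated assumptions rather than the actual PGA-Net module.

import torch
import torch.nn as nn

class AGCRFStep(nn.Module):
    """Simplified sketch of one AG-CRF mean-field update between two scales
    (shared kernels; one emitter scale only)."""
    def __init__(self, channels=64, ksize=3):
        super().__init__()
        pad = ksize // 2
        self.L  = nn.Conv2d(channels, channels, ksize, padding=pad)  # pairwise kernel
        self.le = nn.Conv2d(channels, channels, ksize, padding=pad)  # linear term (emitter)
        self.lr = nn.Conv2d(channels, channels, ksize, padding=pad)  # linear term (receiver)
        self.a  = nn.Conv2d(channels, channels, 1)                   # unary weighting a_{sr}

    def forward(self, f_sr, h_sr, h_se):
        msg = self.L(h_se)                                            # message: L (x) h_se
        gate = torch.sigmoid(h_sr * msg + self.le(h_se) + self.lr(h_sr))  # attention map
        h_bar = f_sr + self.a(gate * msg)                             # gated message + unary
        return h_bar, gate

# Toy usage: two scales with matching spatial size (alignment is handled by PGA-Net).
step = AGCRFStep(channels=64)
f_sr = torch.randn(1, 64, 32, 32)   # observed features at the receiver scale
h_sr = torch.randn(1, 64, 32, 32)   # current latent estimate at the receiver scale
h_se = torch.randn(1, 64, 32, 32)   # latent estimate at the emitter scale
h_bar, attention = step(f_sr, h_sr, h_se)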

4.2 Exploiting AG-CRFs with a Multi-scale Network

PGA-Net Architecture. The proposed Probabilistic Graph Attention Network (PGA-Net), as sketched in Figure 3, consists of a multi-scale hierarchical network (MH-Net) together with the AG-CRF model described above. MH-Net can be constructed from a front-end CNN architecture such as the widely used AlexNet [71], VGG [72] or ResNet [73]. One prominent feature of MH-Net is its ability to generate richer multi-scale representations. In order to do so, we perform distinct non-linear mappings (deconvolution D, convolution C and max-pooling M) upon $\mathbf{f}_l$, the CNN feature representation from an intermediate layer $l$ of the front-end CNN. This leads to a three-way representation: $\mathbf{f}^D_l$, $\mathbf{f}^C_l$ and $\mathbf{f}^M_l$. Remarkably, while D upsamples the feature map, C maintains its original size and M reduces it; moreover, different kernel sizes are used for the three mappings so that they have different receptive fields, naturally yielding complementary inter- and multi-scale representations. $\mathbf{f}^C_l$ and $\mathbf{f}^M_l$ are further aligned to the dimensions of the feature map $\mathbf{f}^D_l$ by a deconvolution operation. The hierarchy is implemented in two levels. The first level uses an AG-CRF model to fuse the three representations of each layer $l$, thus refining the CNN features within the same scale. The second level of the hierarchy uses an AG-CRF model to fuse the information coming from multiple CNN layers.
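A minimal PyTorch sketch of this three-way decomposition is given below. Kernel sizes and channel counts are illustrative assumptions, and the final alignment is done here with bilinear interpolation for brevity, whereas the paper uses a deconvolution.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeWaySplit(nn.Module):
    """Sketch of MH-Net's decomposition of an intermediate feature map f_l into
    f_l^D (deconvolution), f_l^C (convolution) and f_l^M (max-pooling), with
    f^C and f^M re-aligned to the spatial size of f^D."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.conv   = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.pool   = nn.MaxPool2d(kernel_size=2, stride=2)
        self.align  = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, f_l):
        f_d = self.deconv(f_l)                 # upsampled branch
        f_c = self.conv(f_l)                   # same-resolution branch
        f_m = self.align(self.pool(f_l))       # downsampled branch
        # Re-align f^C and f^M to the spatial dimensions of f^D before AG-CRF fusion
        # (interpolation here; the paper uses a deconvolution for this step).
        f_c = F.interpolate(f_c, size=f_d.shape[-2:], mode="bilinear", align_corners=False)
        f_m = F.interpolate(f_m, size=f_d.shape[-2:], mode="bilinear", align_corners=False)
        return f_d, f_c, f_m

split = ThreeWaySplit(in_ch=256, out_ch=64)
f_d, f_c, f_m = split(torch.randn(1, 256, 40, 40))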

The proposed hierarchical multi-scale structure is general purpose and able to involve an arbitrary number of layers and of diverse intra-layer representations. It should also be noted that the proposed AG-CRF model is flexible enough to be applied in any multi-scale context within a deep network for structured representation refinement.

Fig. 4. The detailed computing flow of the mean-field updating of the proposed conditional kernel AG-CRF model. The symbol ⊗ denotes the convolution operation. The operations in green represent the conditional kernel prediction. The symbols ⊙ and ⊕ denote element-wise multiplication and addition, respectively. The symbols σ and c represent a sigmoid and a concatenation operation, respectively.

Fig. 5. Visualization of the learned attention maps in the proposed AG-CRF model. Our attention is a pixel-wise attention, i.e. it simultaneously learns both spatial- and channel-wise attention. We visualize the attention by uniformly sampling four attention channels of the attention map. The learned attentions capture distinct meaningful parts of the features for guiding the message passing. These attention maps are learned on the KITTI dataset for the task of monocular depth estimation.

End-to-End Network Optimization. The parameters of the model consist of the front-end CNN parameters $\mathbf{W}_c$, the parameters to produce the richer decomposition from each layer $l$, $\mathbf{W}_l$, the parameters of the AG-CRFs of the first level of the hierarchy, $\{\mathbf{W}^{\mathrm{I}}_l\}_{l=1}^{L}$, and the parameters of the AG-CRFs of the second level of the hierarchy, $\mathbf{W}^{\mathrm{II}}$, where $L$ is the number of intermediate layers used from the front-end CNN. In order to jointly optimize all these parameters we adopt deep supervision [2] and add an optimization loss associated with each AG-CRF module. We apply the proposed model to different pixel-wise prediction tasks, including contour detection, monocular depth estimation and semantic segmentation. For the contour detection task, as the problem is highly unbalanced, i.e. contour pixels are significantly fewer than non-contour pixels, we employ the modified cross-entropy loss function of [2]. Given a training data set $\mathcal{D}=\{(\mathbf{I}_p,\mathbf{E}_p)\}_{p=1}^{P}$ consisting of $P$ RGB-contour ground-truth pairs, the loss function $\ell$ writes:

$$\ell_{\mathbf{W}}=\sum_{p}\Big\{\beta\sum_{e^k_p\in E^+_p}\log P\big(e^k_p=1\,|\,\mathbf{I}_p;\mathbf{W}\big)+(1-\beta)\sum_{e^k_p\in E^-_p}\log P\big(e^k_p=0\,|\,\mathbf{I}_p;\mathbf{W}\big)\Big\},\qquad(15)$$

where $\beta=|E^+_p|/(|E^+_p|+|E^-_p|)$, $E^+_p$ is the set of contour pixels of image $p$ and $\mathbf{W}$ is the set of all parameters. For monocular depth estimation, we utilize an L2 loss for the continuous regression as in previous works [7], [46], and for semantic segmentation, we use a standard cross-entropy loss for multi-class classification as in [74]. The network optimization is performed via the back-propagation algorithm with stochastic gradient descent.
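For illustration, the following is a minimal PyTorch sketch of a class-balanced binary cross-entropy in the style of [2]. It follows the common HED convention of weighting the positive (contour) term by the fraction of negative pixels; this weighting convention is an assumption and should be adapted to the exact definition of β in Eq. (15).

import torch

def balanced_bce(pred, target, eps=1e-6):
    """Class-balanced binary cross-entropy in the style of HED [2].
    pred:   (B, 1, H, W) probabilities in (0, 1)
    target: (B, 1, H, W) binary contour ground truth
    The positive (contour) term is weighted by the fraction of negative pixels,
    a common convention for balancing the rare contour class (assumption)."""
    num_pos = target.sum()
    num_neg = target.numel() - num_pos
    w_pos = num_neg / (num_pos + num_neg)
    w_neg = num_pos / (num_pos + num_neg)
    loss = -(w_pos * target * torch.log(pred + eps)
             + w_neg * (1.0 - target) * torch.log(1.0 - pred + eps))
    return loss.mean()

# Toy usage with random predictions and a sparse contour map.
pred = torch.rand(2, 1, 64, 64)
target = (torch.rand(2, 1, 64, 64) > 0.95).float()
print(balanced_bce(pred, target))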

PGA-Net for Pixel-wise Prediction. After training the whole PGA-Net, the optimized network parameters $\mathbf{W}$ are used for the pixel-wise prediction task. Given a new test image $\mathbf{I}$, the $L+1$ classifiers produce a set of multi-scale prediction maps $\{\hat{\mathbf{E}}_l\}_{l=1}^{L+1}=\mathrm{PGA\text{-}Net}(\mathbf{I};\mathbf{W})$. Multi-scale predictions $\hat{\mathbf{E}}_l$ obtained from the AG-CRFs with elementary operations on the contour prediction task are shown in Fig. 6. Inspired by [2], we fuse the multiple scale predictions, obtaining an average prediction $\hat{\mathbf{E}}=\sum_l\hat{\mathbf{E}}_l/(L+1)$.

5 EXPERIMENTS

We demonstrate the effectiveness of the proposed approach through extensive experiments on several publicly available benchmarks, and on three different tasks involving both the continuous domain (i.e. monocular depth estimation) and the discrete domain (i.e. object contour detection and semantic segmentation). We first introduce the experimental setup and then present our results and analysis.

5.1 Experimental Setup

5.1.1 Datasets

BSDS500 and NYUD-V2 for object contour detection. For the object contour detection task, we employ two different benchmarks: the BSDS500 and the NYUD-V2 datasets. The BSDS500 dataset is an extended dataset based on BSDS300 [11]. It consists of 200 training, 100 validation and 200 testing images. The ground-truth pixel-level labels for each sample are derived considering multiple annotators. Following [2], [19], we use all the training and validation images for learning the proposed model and perform data augmentation as described in [2]. The NYUD-V2 dataset [12] contains 1449 RGB-D images and is split into three subsets, consisting of 381 training, 414 validation and 654 testing images. Following [2], in our experiments we employ images at full resolution (i.e. 560 × 425 pixels) both in the training and in the testing phases.

KITTI for monocular depth estimation. For the monocular depth estimation task, the KITTI dataset [13] is considered. This dataset is collected for various important computer vision tasks within the context of self-driving. It contains depth video data captured using a LiDAR sensor installed on a driving car. To have a fair comparison with existing works, we follow the standard training and testing split originally proposed by Eigen et al. [30]. There are in total 61 scenes selected from the raw data distribution. Specifically, we use in total 22,600 frames from 32 scenes for training, and 697 frames from the remaining 29 scenes for testing. The sparse ground-truth depth maps are obtained by reprojecting the 3D points captured by a Velodyne laser onto the left monocular camera as in [75]. The RGB images are down-sampled to a resolution of 621 × 188 from the original 1224 × 368 for training.

TABLE 1
BSDS500 dataset: quantitative results. All the compared methods use the official training and testing data. RCF [20] utilizes extra training data for model learning. Our model outperforms Deep Crisp Boundary [21] considering the same ResNet50 backbone. Res16x is an enhanced backbone based on ResNet50. 3S/5S indicates three/five feature scales are considered for the structured fusion.

Method | Backbone | ODS | OIS | AP
Human | - | .800 | .800 | -
Felz-Hutt [76] | - | .610 | .640 | .560
Mean Shift [77] | - | .640 | .680 | .560
Normalized Cuts [78] | - | .641 | .674 | .447
ISCRA [79] | - | .724 | .752 | .783
gPb-ucm [11] | - | .726 | .760 | .727
Sketch Tokens [80] | - | .727 | .746 | .780
MCG [81] | - | .747 | .779 | .759
LEP [82] | - | .757 | .793 | .828
DeepEdge [18] | AlexNet | .753 | .772 | .807
DeepContour [17] | AlexNet | .756 | .773 | .797
HED [2] | VGG16 | .788 | .808 | .840
CEDN [19] | VGG16 | .788 | .804 | .834
COB [4] | ResNet50 | .793 | .820 | .859
Deep Crisp Boundary [21] | ResNet50 | .803 | .820 | .871
Deep Crisp Boundary [21] | Res16x | .810 | .829 | .879
RCF [20] (not comp.) | ResNet50 | .811 | .830 | –
PGA-Net (fusion) (3S) | ResNet50 | .798 | .829 | .869
PGA-Net (fusion) w/ CK (3S) | ResNet50 | .799 | .831 | .872
PGA-Net (fusion) w/ CK (5S) | ResNet50 | .805 | .835 | .878

Pascal-Context for semantic segmentation. For the semantic segmentation task, we use the Pascal-Context dataset [14]. The Pascal-Context dataset augments Pascal VOC 2010 with pixel-level segmentation annotations and enlarges the number of semantic classes from the original 20 categories to more than 400. Following previous works [5], [74], we evaluate on the setting with the most frequent 59 classes plus the background class (60 classes in total). The remaining classes are masked out during training.

5.1.2 Evaluation Metrics

Evaluation protocol for object contour detection. During the test phase, standard non-maximum suppression (NMS) [83] is first applied to produce thinned contour maps. We then evaluate the detection performance of our approach according to different metrics, including the F-measure at Optimal Dataset Scale (ODS) and Optimal Image Scale (OIS), and the Average Precision (AP). The maximum tolerance allowed for correct matches of edge predictions to the ground truth is set to 0.0075 for the BSDS500 dataset, and to 0.011 for the NYUD-V2 dataset, as in previous works [2], [83], [84].

Fig. 6. Examples of predictions from different multi-scale features on BSDS500. The first column shows the input test images. The 2nd to 5th columns show the predictions from different multi-scale features. The last column shows the final contour map after standard NMS.

Evaluation protocol for monocular depth estimation. Following the standard evaluation protocol as in previous works [22], [23], [30], the following quantitative evaluation metrics are adopted in our experiments:

• mean relative error (rel): $\frac{1}{K}\sum_{i=1}^{K}\frac{|\tilde{d}_i-d^\star_i|}{d^\star_i}$;
• root mean squared error (rms): $\sqrt{\frac{1}{K}\sum_{i=1}^{K}(\tilde{d}_i-d^\star_i)^2}$;
• mean log10 error (log10): $\frac{1}{K}\sum_{i=1}^{K}\big\|\log_{10}(\tilde{d}_i)-\log_{10}(d^\star_i)\big\|$;
• scale-invariant rms log error, rms(sc-inv.), as used in [30];
• accuracy with threshold $t$: percentage (%) of $d^\star_i$ subject to $\max\big(\frac{d^\star_i}{\tilde{d}_i},\frac{\tilde{d}_i}{d^\star_i}\big)=\delta<t$, with $t\in\{1.25,\,1.25^2,\,1.25^3\}$;

where $\tilde{d}_i$ and $d^\star_i$ are the ground-truth depth and the estimated depth at pixel $i$, respectively, and $K$ is the total number of pixels in the test images.
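A compact NumPy sketch of these metrics is given below. Only valid pixels are assumed to be passed in, and the scale-invariant term is one common variant of the log error used in [30], so the exact formulation is an assumption.

import numpy as np

def depth_metrics(gt, pred):
    """Minimal sketch of standard monocular depth metrics (valid pixels only).
    gt, pred: 1-D arrays of ground-truth and predicted depths."""
    rel = np.mean(np.abs(gt - pred) / gt)                    # mean relative error
    rms = np.sqrt(np.mean((gt - pred) ** 2))                 # root mean squared error
    log10 = np.mean(np.abs(np.log10(gt) - np.log10(pred)))   # mean log10 error
    d = np.log(pred) - np.log(gt)                            # scale-invariant rms log error
    sc_inv = np.sqrt(np.mean(d ** 2) - np.mean(d) ** 2)      # (one common variant)
    ratio = np.maximum(gt / pred, pred / gt)                 # threshold accuracies
    delta = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return rel, rms, log10, sc_inv, delta

gt = np.random.uniform(1.0, 80.0, size=1000)      # toy depths in metres
pred = gt * np.random.uniform(0.9, 1.1, size=1000)
print(depth_metrics(gt, pred))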

Evaluation protocol for semantic segmentation. Following previous works and using the DeepLab evaluation tool, we report our quantitative results on the standard metrics of pixel accuracy (pixAcc) and mean intersection over union (mIoU) averaged over classes. For both metrics, higher is better. The background category is included in the evaluation as in previous works [74], [85].
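Both metrics can be computed from a class confusion matrix, as in the following minimal NumPy sketch (a toy 3-class example; the actual evaluation uses the DeepLab tool).

import numpy as np

def seg_metrics(conf):
    """Minimal sketch of pixAcc and mIoU from a confusion matrix `conf`,
    where conf[g, p] counts pixels with ground-truth class g predicted as p."""
    pix_acc = np.diag(conf).sum() / conf.sum()
    union = conf.sum(axis=0) + conf.sum(axis=1) - np.diag(conf)
    iou = np.diag(conf) / np.maximum(union, 1)
    return pix_acc, iou.mean()

# Toy usage with 3 classes (in the paper, 60 classes including background).
conf = np.array([[50, 2, 1],
                 [3, 40, 5],
                 [0, 4, 45]], dtype=np.float64)
print(seg_metrics(conf))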

5.1.3 Implementation Details

The proposed PGA-Net is implemented in the deep learning framework PyTorch. The training and testing phases are carried out on four Nvidia Tesla P40 GPUs, each with 24 GB of memory. The ResNet-50 and ResNet-101 networks pretrained on ImageNet [86] are used to initialize the front-end CNN of PGA-Net for the different backbone experiments. For computational efficiency, our implementation only employs three scales, i.e. we generate multi-scale features from three different semantic layers of the backbone CNN (i.e. res3d, res4f and res5c in ResNet). In our CRF model we consider dependencies between all scales. Within the AG-CRFs, the kernel size for all convolutional operations is set to 3 × 3 with stride 1 and padding 1. The weighting parameters $a^i_{s_r}$ are learned automatically using convolutional operations with a kernel size of 1 × 1. For the object contour detection, the initial learning rate is set to 1e-7 in all our experiments, and is decreased by a factor of 10 after every 10k iterations. The total number of iterations for BSDS500 and NYUD-V2 is 40k and 30k, respectively. The momentum and weight decay parameters are set to 0.9 and 0.0002, as in [2]. As the training images have different resolutions, we set the batch size to 1 and, for the sake of smooth convergence, update the parameters only every 10 iterations. For the monocular depth estimation and the semantic segmentation tasks, following previous works [74], [87], [88] for a fair comparison, the batch size is set to 8 and 16, respectively; the learning rate is set to 0.001 with a momentum of 0.9 and a weight decay of 0.0001, using a polynomial learning rate scheme as in [5], [74]. Regarding the overall training time, it takes around 8, 13 and 20 hours for contour detection on BSDS500, depth estimation on KITTI and semantic segmentation on Pascal-Context, respectively. Our model also achieves almost real-time inference (around 8 frames per second) for all three tasks.
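As an illustration of this optimization setup, the following PyTorch sketch configures SGD with a polynomial learning-rate decay of the kind used in [5], [74]; the decay power of 0.9 is an assumption (the common DeepLab-style setting), not a value stated here, and the parameters are hypothetical stand-ins.

import torch

# Minimal sketch of SGD with a polynomial learning-rate decay as in [5], [74].
params = [torch.nn.Parameter(torch.randn(10, 10))]  # stand-in for PGA-Net parameters
optimizer = torch.optim.SGD(params, lr=0.001, momentum=0.9, weight_decay=0.0001)

max_iter, power = 40000, 0.9  # power 0.9 is an assumption (DeepLab convention)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1.0 - it / max_iter) ** power)

# In the training loop, optimizer.step() is followed by scheduler.step() each iteration.
for it in range(3):
    optimizer.step()
    scheduler.step()
    print(optimizer.param_groups[0]["lr"])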

5.2 Experimental Results

In this section, we present the results of our evaluation, comparing the proposed model with several state-of-the-art methods on the three different tasks. We further conduct an in-depth analysis of our model to show the impact of the different components on the performance. Finally, we present some qualitative results and analysis of the model.

5.2.1 Comparison with state of the art methods.

Comparison on BSDS500 and NYUD-V2.

We ﬁrst consider

the BSDS500 dataset and compare the performance of our

approach with several traditional contour detection meth-

ods, including Felz-Hut [76], MeanShift [77], Normalized

Cuts [78], ISCRA [79], gPb-ucm [11], SketchTokens [80],

TABLE 2

Quantitative performance comparison on the NYUD-V2 dataset for the contour detection task with the official training/testing protocols. We

achieve better performance in terms of all the metrics than RCF [20],

which is the best performing method on BSDS500 with extra training

data. 3S/5S indicates three/ﬁve feature scales are considered.

Method Backbone ODS OIS AP

gPb-ucm [11] – .632 .661 .562

OEF [89] – .651 .667 –

Silberman et al. [12] – .658 .661 –

SemiContour [90] – .680 .700 .690

SE [91] – .685 .699 .679

gPb+NG [92] – .687 .716 .629

SE+NG+ [84] – .710 .723 .738

HED (RGB) [2] VGG16 .720 .734 .734

HED (HHA) [2] VGG16 .682 .695 .702

HED (RGB + HHA) [2] VGG16 .746 .761 .786

RCF (RGB + HHA) [20] VGG16 .757 .771 –
RCF (RGB + HHA) [20] ResNet50 .781 .793 –

PGA-Net (HHA) (3S) ResNet50 .716 .729 .734

PGA-Net (RGB) (3S) ResNet50 .744 .758 .765

PGA-Net (RGB+HHA) (3S) ResNet50 .771 .786 .802

PGA-Net (RGB+HHA) w/ CK (3S) ResNet50 .780 .795 .813

PGA-Net (RGB+HHA) w/ CK (5S) ResNet50 .785 .799 .816

MCG [81], LEP [82], and more recent CNN-based meth-

ods, including DeepEdge [18], DeepContour [17], HED [2],

CEDN [19], COB [4], and Deep Crisp Boundaries [21]. We

also report results of the RCF method [20], although they are

not comparable because in [20] an extra dataset (i.e. Pascal-Context), which is even larger than BSDS500, was used during

RCF training to improve the results on BSDS500. In this

series of experiments we consider PGA-Net with FLAG-

CRFs. The results of this comparison are shown in Table 1

and Figure 7a. PGA-Net obtains an F-measure (ODS) of 0.798, thus outperforming all previous methods. The improvement

over the second and third best approaches, i.e. COB and

HED, is 0.5% and 1.0%, respectively, which is not trivial

to achieve on this challenging dataset. Furthermore, when

considering the OIS and AP metrics, our approach is also

better, with a clear performance gap. By using the proposed


TABLE 3

Quantitative performance analysis of the proposed PGA-Net on the NYUD-V2 RGB dataset for the contour detection task. H2 indicates only

the second hierarchy is utilized; 3S and 5S indicate three and ﬁve

feature scales are considered for the structured fusion, respectively.

Method Backbone ODS OIS AP

Hypercolumn [16] (5S) ResNet50 .720 .731 .733

HED [2] (5S) ResNet50 .722 .737 .738

PGA-Net (baseline) (H2, 3S) ResNet50 .711 .720 .724

PGA-Net (w/o AG-CRFs) (3S) ResNet50 .722 .732 .739

PGA-Net (w/ CRFs) (3S) ResNet50 .732 .742 .750

PGA-Net (w/o deep supervision) (3S) ResNet50 .725 .738 .740

PGA-Net (w/ PLAG-CRFs) (H2, 3S) ResNet50 .731 .742 .743

PGA-Net (w/ PLAG-CRFs) (3S) ResNet50 .737 .749 .746

PGA-Net (w/ FLAG-CRFs) (3S) ResNet50 .744 .758 .765

PGA-Net (w/ FLAG-CRFs + CK) (3S) ResNet50 .751 .767 .778

PGA-Net (w/ FLAG-CRFs + CK) (5S) ResNet50 .754 .772 .781

strategy of conditional kernels, we further clearly boost the

performance of PGA-Net on all the three metrics (i.e. ODS,

OIS and AP) on this performance-saturated dataset. Besides, compared to Deep Crisp Boundaries, ours using 3 feature scales is comparable on the ODS metric while achieving better performance w.r.t. both the OIS and AP metrics when the same backbone architecture (i.e. ResNet50) is considered for both methods. When five feature scales are used for the structured fusion, ours outperforms Deep Crisp Boundaries on all the metrics with the ResNet50 backbone.
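The ODS and OIS numbers reported in these tables are the standard boundary F-measures, F = 2PR/(P + R), maximized over a single dataset-wide threshold (ODS) or per image (OIS). As a minimal sketch of the dataset-scale computation, assuming precision/recall values have already been accumulated over the whole test set for a list of thresholds:

import numpy as np

def ods_f_measure(precisions, recalls):
    # precisions, recalls: one dataset-wide value per threshold.
    p = np.asarray(precisions, dtype=float)
    r = np.asarray(recalls, dtype=float)
    f = 2 * p * r / np.maximum(p + r, 1e-12)   # F-measure per threshold
    best = int(np.argmax(f))
    return f[best], best                       # best F and its threshold index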

To conduct the experimental comparison on NYUDv2,

following previous works [2] we also consider three different

types of input representations, i.e. RGB, HHA [84] and RGB-

HHA data. The HHA data [84] are encoded depth feature

images, in which the three image channels are horizontal

disparity, height above ground, and angle of the pixel’s

local surface normal with the inferred direction of gravity,

respectively. The results corresponding to the use of both

RGB and HHA data (i.e. RGB+HHA) are obtained by

performing a weighted average of the estimates obtained

from two PGA-Net models trained separately on RGB and

HHA representations. As baselines we consider gPb-ucm [11],

OEF [89], the method in [12], SemiContour [90], SE [91],

gPb+NG [92], SE+NG+ [84], HED [2] and RCF [20]. On the NYUD-V2 dataset, our approach outperforms RCF [20], which obtained the best performance on BSDS500 with extra training data, by a clear performance gap; the experimental protocol is exactly the same for both methods on this dataset. All results are reported in Table 2 and Figure 7b.

Again, our ﬁnal model (PGA-Net w/ CK) signiﬁcantly

outperforms all previous comparison methods. In particular,

the increased performance with respect to HED [2] and

RCF [20] conﬁrms the beneﬁt of the proposed multi-scale

feature learning and fusion scheme.
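As a minimal sketch of the RGB+HHA fusion described above (the paper states only that a weighted average of the two models' estimates is taken; the equal weights below are an assumption):

import numpy as np

def fuse_rgb_hha(edge_rgb, edge_hha, w_rgb=0.5, w_hha=0.5):
    # edge_rgb, edge_hha: HxW contour probability maps in [0, 1] predicted by
    # the two separately trained PGA-Net models; the weights are hypothetical.
    fused = w_rgb * np.asarray(edge_rgb) + w_hha * np.asarray(edge_hha)
    return np.clip(fused / (w_rgb + w_hha), 0.0, 1.0)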

Comparison on KITTI.

The state of the art comparison on

the KITTI dataset for monocular depth estimation is shown in

Table 4. We compare with methods in both supervised and unsupervised settings. For the unsupervised setting,

the representative works such as Zhou et al. [93], Garg et

al. [75], Godard et al. [87], and Kuznietsov et al. [99] are

compared. For the supervised methods, we consider the very

competitive works such as Eigen et al. [30], Liu et al. [7],

Kuznietsov et al. [99], Gan et al. [27], DORN [26] and Lee et

al. [29]. Our approach also employs the supervised setting

using single monocular images in training and testing. As

shown in Table 4, our approach achieves top-level perfor-

mance compared with both the supervised and unsupervised

comparison methods. Specifically, DORN needs to assume the depth range during training, which differs from our continuous regression setting and is thus not directly comparable to ours. Besides, DORN specifically works on the predictions via an ordinal regression loss, while ours focuses on learning effective representations; the two are therefore complementary. By combining the proposed AG-CRF module with DORN, specifically with its multi-scale backbone and ordinal regression module, we achieve clearly better performance than DORN, which further confirms the effectiveness of the proposed AG-CRF model. Our approach also obtains the same level of performance as the best performing method, Lee et al. [29], using 3 feature scales. We outperform Lee et al. [29] when 5 feature scales are

further considered. More importantly, the proposed graph-

based approach obtains signiﬁcantly better results than the

CRF-based methods (i.e. MS-CRF [88] and Liu et al. [7]) on

the monocular depth estimation task.
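For reference, the error and accuracy columns of Table 4 follow the standard monocular depth metrics, computed over valid ground-truth pixels within the evaluation depth range; a sketch of their usual definitions is:

import numpy as np

def depth_metrics(pred, gt, max_depth=80.0):
    # Standard KITTI metrics: abs rel, sq rel, rmse, rmse(log) and delta accuracies.
    mask = (gt > 0) & (gt <= max_depth)          # valid pixels in the evaluation range
    pred, gt = pred[mask], gt[mask]
    thresh = np.maximum(gt / pred, pred / gt)
    d1, d2, d3 = [(thresh < 1.25 ** k).mean() for k in (1, 2, 3)]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log, d1, d2, d3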

Comparison on Pascal-Context.

We compare the proposed

PGA-Net with the most competitive methods on the Pascal-

Context dataset, including ASPP [102], PSPNet [103], Enc-

Net [74] and HRNet [48]. The experiments are conducted

on both ResNet-50 and ResNet-101 backbone networks. Our

PGA-Net is 2.24 and 3.05 points better on the mIoU metric

than the popular method (i.e. EncNet, which considers a

channel-wise attention for feature reﬁnement) with ResNet-

50 and ResNet-101 respectively. Compared with another

multi-scale method HRNet, which utilizes the multi-scale

aggregation from feature maps with different resolutions to

boost the ﬁnal performance, our PGA-Net obtains better per-

formance with a clear gap. Note that the backbone structure

HRNetV2-W48 used by HRNet has a bigger network capacity

than the ResNet101 backbone we used. Our approach also obtains the same level of performance as the best performing method OCR when considering the same ResNet101 backbone. By using five feature scales for the structured

fusion in the proposed model, our model outperforms OCR

(55.1 vs. 54.8 in terms of mIoU using the same ResNet101

backbone). More importantly, the core idea of OCR utilizing

the soft object regions is complementary to ours. We could

also observe that our graph-based method significantly outperforms DeepLab-v2 [5], which also utilizes a CRF model.
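The pixAcc and mIoU values reported for Pascal-Context follow the standard definitions computed from the per-class confusion matrix; a minimal sketch is:

import numpy as np

def seg_metrics(conf_mat):
    # conf_mat[i, j]: number of pixels with ground-truth class i predicted as class j.
    conf_mat = conf_mat.astype(np.float64)
    tp = np.diag(conf_mat)
    pix_acc = tp.sum() / conf_mat.sum()
    union = conf_mat.sum(axis=1) + conf_mat.sum(axis=0) - tp
    iou = tp / np.maximum(union, 1e-12)
    miou = iou[union > 0].mean()   # ignore classes absent from both GT and prediction
    return pix_acc, miou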

5.2.2 Model Analysis.

Baseline models.

To further demonstrate the effectiveness of

the proposed model and analyze the impact of the different

components of PGA-Net on the contour detection task, we conduct an ablation study considering the NYUD-V2 (RGB data) and the Pascal-Context datasets. We evaluate

the following baseline models: (i) PGA-Net (baseline), which

removes the ﬁrst-level hierarchy and directly concatenates

the feature maps for prediction, (ii) PGA-Net (w/o AG-

CRFs), which employs the proposed multi-scale hierarchical

structure but discards the AG-CRFs, (iii) PGA-Net (w/ CRFs),

which replaces our AG-CRFs with a multi-scale CRF model

without attention gating, (iv) PGA-Net (w/o deep supervi-

sion) obtained by removing intermediate loss functions in

PGA-Net, (v) PGA-Net with the proposed two versions of the

AG-CRFs model, i.e. PLAG-CRFs and FLAG-CRFs, and (vi)


RGB Image Eigen et al. [30] Zhou et al. [93] Garg et al. [75] Godard et al. [87] Ours GT Depth Map

Fig. 8. Qualitative examples of monocular depth prediction on the KITTI raw dataset. The comparison with other competitive methods, including Eigen et al. [30], Zhou et al. [93], Garg et al. [75] and Godard et al. [87], is presented. We perform bilinear interpolation on the sparse ground-truth

depth maps for better visualization.

TABLE 4

Quantitative comparison with the state of the art methods on the KITTI raw dataset for monocular depth estimation. The proposed PGA-Net achieves

top performance over all the competitors w.r.t. all the evaluation metrics. The standard training and testing sets split by Eigen et al. [30] are used. The

‘range’ denotes the ground-truth depth range used for evaluation, and ‘sup’ indicates whether ground-truth depth is used for supervision in training. ‘CK’ denotes the proposed conditional kernel strategy. The methods requiring video data are marked with ∗. The DORN method needs to assume the depth range in training, which differs from our continuous regression setting and is thus not directly comparable; it is also highly complementary to ours, since we work on learning effective representations while DORN focuses on the loss level. 3S/5S indicates three/five feature scales from the front-end CNN are considered for the structured fusion by the proposed model.

Method Setting Error (lower is better) Accuracy (higher is better)

range sup? rel sq rel rmse rmse (log) δ < 1.25 δ < 1.25² δ < 1.25³

Garg et al. [75] 80m No 0.177 1.169 5.285 - 0.727 0.896 0.962

Garg et al. [75] L12 + Aug 8x 50m No 0.169 1.080 5.104 - 0.740 0.904 0.958

Godard et al. [87] 80m No 0.148 1.344 5.927 0.247 0.803 0.922 0.964

Zhou∗et al. [93] 80m No 0.208 1.768 6.858 0.283 0.678 0.885 0.957

AdaDepth [94] 50m No 0.203 1.734 6.251 0.284 0.687 0.899 0.958

Pilzer et al. [95] 80m No 0.152 1.388 6.016 0.247 0.789 0.918 0.965

Wang & Lucey et al. [96] 80m No 0.151 1.257 5.583 0.228 0.810 0.936 0.974

DF-Net∗[97] 80m No 0.150 1.124 5.507 0.223 0.806 0.933 0.973

Zhan∗et al. [98] 80m No 0.144 1.391 5.869 0.241 0.803 0.933 0.971

Kuznietsov et al. [99] 80m No - - 4.621 - 0.852 0.960 0.986

Saxena et al. [100] 80m Yes 0.280 - 8.734 0.327 0.601 0.820 0.926

Liu et al. [7] 80m Yes 0.217 0.092 7.046 - 0.656 0.881 0.958

Eigen et al. [30] 80m Yes 0.190 - 7.156 0.246 0.692 0.899 0.967

Mahjourian∗[101] 80m Yes 0.163 1.240 6.220 0.250 0.762 0.916 0.968

MS-CRF [88] 80m Yes 0.125 0.899 4.685 - 0.816 0.951 0.983

Kuznietsov et al. [99] (supervised & stereo) 80m Yes - - 4.621 0.189 0.862 0.960 0.986

Gan et al. [27] 80m Yes 0.098 0.666 3.933 0.173 0.890 0.964 0.985

DORN [26] 80m Yes 0.072 0.307 2.727 0.120 0.932 0.984 0.994

Lee et al. (ResNet) [29] 80m Yes 0.061 0.261 2.834 0.099 0.954 0.992 0.998

PGA-Net (baseline) 80m Yes 0.152 0.973 4.902 0.176 0.782 0.931 0.975

PGA-Net (w/ CRFs) (3S) 80m Yes 0.140 0.942 4.823 0.171 0.793 0.941 0.979

PGA-Net (w/ PLAG-CRFs) (3S) 80m Yes 0.134 0.909 4.796 0.167 0.801 0.951 0.981

PGA-Net (w/ FLAG-CRFs) (3S) 80m Yes 0.126 0.901 4.689 0.157 0.813 0.950 0.982

PGA-Net (w/ FLAG-CRFs + CK) (3S) 80m Yes 0.118 0.752 4.449 0.181 0.852 0.962 0.987

PGA-Net (w/ FLAG-CRFs + CK + DORN) (3S) 80m Yes 0.063 0.267 2.634 0.101 0.952 0.992 0.998

PGA-Net (w/ FLAG-CRFs + CK + DORN) (5S) 80m Yes 0.060 0.258 2.595 0.097 0.954 0.993 0.998

PGA-Net w/ CK, which uses the proposed conditional kernel

strategy. We also consider as reference traditional multi-scale

deep learning models employing multi-scale representations,

i.e. Hypercolumn [16] and HED [2].
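Variant (iv) above removes the intermediate, deeply supervised losses; the full model instead optimizes the sum of the loss on the fused prediction and the losses attached to the per-scale side outputs, in the spirit of the sketch below (the cross-entropy loss and the equal weighting are assumptions made for the segmentation case):

import torch.nn.functional as F

def deep_supervision_loss(final_logits, side_logits_list, target, side_weight=1.0):
    # Loss on the fused prediction plus weighted losses on each side output;
    # side_weight is a hypothetical value, not taken from the paper.
    loss = F.cross_entropy(final_logits, target, ignore_index=255)
    for side_logits in side_logits_list:
        loss = loss + side_weight * F.cross_entropy(side_logits, target, ignore_index=255)
    return loss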

Analysis.

The quantitative results on different baseline

models are shown in Tables 3 and 5. The results clearly show the advantages of our contributions. The ODS F-measure of PGA-Net (w/o AG-CRFs) is 1.1% higher than PGA-Net (baseline), clearly demonstrating the effectiveness of the proposed hierarchical network and confirming our intuition that exploiting richer and more diverse multi-scale representations is beneficial, which can also be verified from the results on Pascal-Context shown in Table 5. Table 3 also shows that our AG-CRFs play a fundamental role for accurate detection, as PGA-Net (w/ FLAG-CRFs) leads to an improvement of 2.2% over PGA-Net (w/o AG-CRFs) in terms of ODS. Besides, we can also observe that PGA-Net (w/ PLAG-CRFs) also boosts the performance over PGA-Net (baseline) (.711 vs. .731 in terms of the ODS metric), where both use only the second hierarchy.



Fig. 9. Training curves of our approach and the baseline model (i.e. ours

with the proposed AG-CRFs disabled) in terms of both the mIoU and

PixAcc metrics on the Pascal-Context dataset. The number of overall

training epochs is 80.

TABLE 5

Quantitative performance analysis of the proposed PGA-Net with the

ResNet-50 backbone on Pascal-Context for the semantic segmentation

task. ‘CK’ denotes the proposed conditional kernel strategy. H2 denotes

only the second hierarchy is used. 3S, 4S and 5S indicate three, four

and ﬁve feature scales from different stages are considered, respectively.

Method Backbone pixAcc(%) mIoU(%)

Hypercolumn [16] (5S) ResNet50 75.96 47.88

HED [2] (5S) ResNet50 76.45 48.41

HRNet [48] HRNetV2-W48 - 54.00

PGA-Net (baseline) (H2, 3S) ResNet50 75.60 47.21

PGA-Net (w/o AG-CRFs) (3S) ResNet50 76.70 48.73

PGA-Net (w/ CRFs) (3S) ResNet50 77.10 49.12

PGA-Net (w/o deep supervision) (3S) ResNet50 76.90 48.92

PGA-Net (w/ PLAG-CRFs) (H2, 3S) ResNet50 77.41 49.22

PGA-Net (w/ PLAG-CRFs) (3S) ResNet50 77.91 50.13

PGA-Net (w/ FLAG-CRFs) (3S) ResNet50 78.49 51.01

PGA-Net (w/ FLAG-CRFs + CK) (3S) ResNet50 79.62 52.15

PGA-Net (w/ FLAG-CRFs + CK) (5S) ResNet50 80.42 53.25

PGA-Net (w/ FLAG-CRFs + CK) (4S) HRNetV2-W48 81.31 55.18

Finally, PGA-Net (w/ FLAG-CRFs) is 1.2% and

1.5% better than PGA-Net (w/ CRFs) in ODS and AP metrics

respectively, conﬁrming the effectiveness of embedding an

attention mechanism in the multi-scale CRF model. In Table 5,

the mIoU of PGA-Net (w/ FLAG-CRFs) is 1.89 points higher

than that of PGA-Net (w/ CRFs), further demonstrating the

advantage of the proposed attention mechanism. PGA-Net

(w/o deep supervision) decreases the overall performance of

our method by 1.9% in ODS, showing the crucial importance

of deep supervision for better optimization of the whole

PGA-Net. Comparing the performance of the proposed

two versions of the AG-CRF model, i.e. PLAG-CRFs and

FLAG-CRFs, we can see that PGA-Net (FLAG-CRFs) slightly

outperforms PGA-Net (PLAG-CRFs) in both ODS and OIS,

while bringing a signiﬁcant improvement (around 2%) in AP.

Finally, considering HED [2] and Hypercolumn [16], it is

clear that our PGA-Net (FLAG-CRFs) is signiﬁcantly better

than these methods. Importantly, our approach utilizes only

three scales while for HED [2] and Hypercolumn [16] we

consider ﬁve scales. By considering ﬁve scales, our model

obtains further improvement upon the results using three

scales for contour detection on the NYUD-V2 (see Table 3)

and for semantic segmentation on the Pascal-Context (see

Table 5). We also deploy the proposed AG-CRF model into

Fig. 10. Qualitative semantic segmentation results on the Pascal-Context

dataset. The representative Dilated FCN method [5] is compared.

a popular multi-scale network architecture, HRNet [48]. It is clear that our model can also boost its performance, further confirming that our model can be applied to different multi-scale contexts for effective feature fusion. From Tables 3 and 5, we can also observe the effectiveness of the

proposed conditional kernel strategy. PGA-Net (w/ FLAG-

CRFs + CK) clearly improves over PGA-Net (w/ FLAG-

CRFs) on the AP (1.3 points) for the contour detection and the

mIoU (1.1 points) for the semantic segmentation. We also plot

the training curves of the proposed approach and the baseline

on the Pascal-Context validation in Figure 9. As shown in the

ﬁgure, our approach consistently outperforms the baseline

model at each training epoch, furthering demonstrating the

effectiveness of the proposed AG-CRF model.

5.2.3 Qualitative Analysis.

Attention visualization.

Figure 5 and Figure 11 show exam-

ples of the learned attention maps in our proposed AG-CRF

model on KITTI and Pascal-Context, respectively. Our attention mechanism learns a multi-channel attention map, meaning that the attention map has the same number of channels as the feature map. We visualize four of the 256 channels (i.e. every 64-th channel). It can

be observed that the learned attention map could capture

the informative feature region from both the spatial and

channel dimension, which we believe is an important reason

for our model to effectively reﬁne the feature maps. Taking

the second row in Fig. 11 for instance, it is easy to observe

that the dog, the chair and the background are activated on

different channels, and for each speciﬁc channel, different

spatial regions are activated. The dark blue color marks the

activated parts/regions.
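For reference, the uniform channel sampling used for these visualizations (four of the 256 attention channels, i.e. every 64-th channel) can be reproduced with a few lines; the array layout below is an assumption:

import numpy as np

def sample_attention_channels(attention, step=64):
    # attention: array of shape (C, H, W), e.g. C = 256 in our visualizations.
    # Returns every `step`-th channel (channels 0, 64, 128, 192 for C = 256),
    # each normalized to [0, 1] for display.
    sampled = attention[::step]
    mins = sampled.min(axis=(1, 2), keepdims=True)
    maxs = sampled.max(axis=(1, 2), keepdims=True)
    return (sampled - mins) / np.maximum(maxs - mins, 1e-12)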

Prediction visualization.

The multi-scale predictions and

the ﬁnal prediction from the PGA-Net on contour detection


TABLE 6

Overall performance comparison with state of the art methods on the val

set of the PASCAL-Context dataset. Spatial context post-processing is

used in the pipeline of most methods except for [5], [39], [104]. Our full

model achieves the best results compared with both CRF-based and non-CRF-based approaches on the pixAcc and mIoU metrics. 3S and

5S indicate three and ﬁve feature scales are considered, respectively.

Method Backbone pixAcc% mIoU%

CFM (VGG+MCG) [105] VGG-16 - 34.4

FCN-8s [106] VGG-16 46.5 35.1

FCN-8s [31] VGG-16 50.7 37.8

DeepLab-v2 [5] VGG-16 - 37.6

BoxSup [107] VGG-16 - 40.5

ConvPP-8s [104] VGG-16 - 41.0

PixelNet [108] VGG-16 51.5 41.4

CRF-RNN [39] VGG-16 - 39.3

DeepLab-v2 + CRF [5] VGG-16 - 45.7

ASPP [102] ResNet-50 78.3 49.2

PSPNet [103] ResNet-50 78.6 50.6

EncNet [74] ResNet-50 78.4 49.9

EncNet [74] ResNet-101 79.2 51.7

HRNet [48] HRNetV2-W48 - 54.0

OCR [109] ResNet101 - 54.8

PGA-Net (w/ FLAG-CRFs+ CK) (3S) ResNet-50 79.6 52.2

PGA-Net (w/ FLAG-CRFs+ CK) (3S) ResNet-101 80.8 54.8

PGA-Net (w/ FLAG-CRFs+ CK) (5S) ResNet-101 81.2 55.1

Fig. 11. Visualization of the learned attention maps of the proposed AG-

CRF model on the Pascal-Context dataset. Our model learns pixel-wise

attention in both the spatial and channel dimensions. We visualize the attention in the same way as on KITTI, by uniformly sampling four channels of the attention map. The learned attentions are able to capture

different semantic regions to guide the message passing among features.

are shown in Figure 6. It can be observed that the multi-scale predictions are highly complementary to each other, confirming the initial intuition of modelling the multi-scale predictions in a joint CRF model for structured prediction and fusion. Figure 8 and Figure 10 show examples of monocular depth estimation on KITTI and semantic segmentation on Pascal-Context, respectively. Different state-of-the-art methods are compared in the figures. It is clear that our approach achieves qualitatively better results than these methods on both datasets.

6 CONCLUSION

We presented a novel multi-scale probabilistic graph attention network with conditional kernels for pixel-wise prediction. The proposed model introduces two main components, i.e. a hierarchical architecture for generating richer and complementary multi-scale feature representations, and an

Attention-Gated CRF model using conditional kernels for

robust feature reﬁnement and fusion. We demonstrate the

effectiveness of our approach through extensive experiments

on three different pixel-wise prediction tasks, including

continuous problems, i.e. monocular depth estimation, and

discrete problems, i.e. object contour detection and semantic

segmentation. Four challenging publicly available datasets, i.e. BSDS500, NYUD-V2, KITTI and Pascal-Context, are considered in our experiments. The proposed model achieved very competitive performance on all the tasks and datasets. The

proposed approach addresses a general problem, i.e. how

to learn rich multi-scale representations and optimally fuse

them. Therefore, we believe it may also be beneficial for other

continuous and discrete pixel-level prediction tasks.

REFERENCES

[1]

X. Ren, “Multi-scale improves boundary detection in natural

images,” in ECCV, 2008.

[2]

S. Xie and Z. Tu, “Holistically-nested edge detection,” in CVPR,

2015.

[3]

I. Kokkinos, “Pushing the boundaries of boundary detection using

deep learning,” in ICLR, 2016.

[4]

K.-K. Maninis, J. Pont-Tuset, P. Arbeláez, and L. Van Gool,

“Convolutional oriented boundaries,” in ECCV, 2016.

[5]

L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L.

Yuille, “Deeplab: Semantic image segmentation with deep convo-

lutional nets, atrous convolution, and fully connected crfs,” arXiv

preprint arXiv:1606.00915, 2016.

[6]

Y. Xu, D. Xu, W. Ouyang, and G. Zhao, “Structured modeling

of joint deep feature and prediction reﬁnement for salient object

detection,” in ICCV, 2019.

[7]

F. Liu, C. Shen, and G. Lin, “Deep convolutional neural ﬁelds for

depth estimation from a single image,” in CVPR, 2015.

[8]

D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe, “Multi-scale

continuous crfs as sequential deep networks for monocular depth

estimation,” in CVPR, 2017.

[9]

V. Mnih, N. Heess, A. Graves, et al., “Recurrent models of visual

attention,” in NIPS, 2014.

[10] T. Minka and J. Winn, “Gates,” in NIPS, 2009.

[11]

P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik, “Contour

detection and hierarchical image segmentation,” TPAMI, vol. 33,

no. 5, pp. 898–916, 2011.

[12]

N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor

segmentation and support inference from rgbd images,” in ECCV,

2012.

[13]

A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets

robotics: The kitti dataset,” IJRR, vol. 32, no. 11, pp. 1231–1237,

2013.

[14]

R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler,

R. Urtasun, and A. Yuille, “The role of context for object detection

and semantic segmentation in the wild,” in CVPR, 2014.

[15]

D. Xu, W. Ouyang, X. Alameda-Pineda, E. Ricci, X. Wang, and

N. Sebe, “Learning deep structured multi-scale features using

attention-gated crfs for contour prediction,” in NIPS, 2017.

[16]

B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Hyper-

columns for object segmentation and ﬁne-grained localization,” in

CVPR, 2015.

[17]

W. Shen, X. Wang, Y. Wang, X. Bai, and Z. Zhang, “Deepcontour:

A deep convolutional feature learned by positive-sharing loss for

contour detection,” in CVPR, 2015.

[18]

G. Bertasius, J. Shi, and L. Torresani, “Deepedge: A multi-scale

bifurcated deep network for top-down contour detection,” in

CVPR, 2015.

[19]

J. Yang, B. Price, S. Cohen, H. Lee, and M.-H. Yang, “Object contour

detection with a fully convolutional encoder-decoder network,”

in CVPR, 2016.

[20]

Y. Liu, M.-M. Cheng, X. Hu, K. Wang, and X. Bai, “Richer convolu-

tional features for edge detection,” arXiv preprint arXiv:1612.02103,

2016.


[21]

Y. Wang, X. Zhao, Y. Li, and K. Huang, “Deep crisp boundaries:

From boundaries to higher-level tasks,” TIP, vol. 28, no. 3, pp.

1285–1298, 2018.

[22]

D. Eigen and R. Fergus, “Predicting depth, surface normals

and semantic labels with a common multi-scale convolutional

architecture,” in ICCV, 2015.

[23]

P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. Yuille,

“Towards uniﬁed depth and semantic prediction from a single

image,” in CVPR, 2015.

[24]

A. Roy and S. Todorovic, “Monocular depth estimation using

neural regression forest,” in CVPR, 2016.

[25]

I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab,

“Deeper depth prediction with fully convolutional residual net-

works,” arXiv preprint arXiv:1606.00373, 2016.

[26]

H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep

ordinal regression network for monocular depth estimation,” in

CVPR, 2018.

[27]

Y. Gan, X. Xu, W. Sun, and L. Lin, “Monocular depth estimation

with afﬁnity, vertical pooling, and label enhancement,” in ECCV,

2018.

[28]

M. M. Puscas, D. Xu, A. Pilzer, and N. Sebe, “Structured coupled

generative adversarial networks for unsupervised monocular

depth estimation,” in 3DV, 2019.

[29]

J. H. Lee, M.-K. Han, D. W. Ko, and I. H. Suh, “From big to small:

Multi-scale local planar guidance for monocular depth estimation,”

arXiv preprint arXiv:1907.10326, 2019.

[30]

D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from

a single image using a multi-scale deep network,” in NIPS, 2014.

[31]

J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional

networks for semantic segmentation,” in CVPR, 2015.

[32]

F. Yu and V. Koltun, “Multi-scale context aggregation by dilated

convolutions,” arXiv preprint arXiv:1511.07122, 2015.

[33]

Y. Yuan and J. Wang, “Ocnet: Object context network for scene

parsing,” arXiv preprint arXiv:1809.00916, 2018.

[34]

L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to

scale: Scale-aware semantic image segmentation,” in CVPR, 2016.

[35]

H. Noh, S. Hong, and B. Han, “Learning deconvolution network

for semantic segmentation,” in ICCV, 2015.

[36]

V. Badrinarayanan, A. Handa, and R. Cipolla, “Segnet: A deep

convolutional encoder-decoder architecture for robust semantic

pixel-wise labelling,” arXiv preprint arXiv:1505.07293, 2015.

[37]

Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang, “Semantic image

segmentation via deep parsing network,” in ICCV, 2015.

[38]

A. Arnab, S. Jayasumana, S. Zheng, and P. H. Torr, “Higher order

conditional random ﬁelds in deep neural networks,” in ECCV,

2016.

[39]

S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su,

D. Du, C. Huang, and P. H. Torr, “Conditional random ﬁelds as

recurrent neural networks,” in ICCV, 2015.

[40]

Y. Chen, M. Rohrbach, Z. Yan, Y. Shuicheng, J. Feng, and

Y. Kalantidis, “Graph-based global reasoning networks,” in CVPR,

2019.

[41]

D. Xu, W. Ouyang, X. Wang, and N. Sebe, “Pad-net: Multi-

tasks guided prediction-and-distillation network for simultaneous

depth estimation and scene parsing,” in CVPR, 2018.

[42]

Z. Zhang, Z. Cui, C. Xu, Z. Jie, X. Li, and J. Yang, “Joint

task-recursive learning for semantic segmentation and depth

estimation,” in ECCV, 2018.

[43]

Z. Zhang, Z. Cui, C. Xu, Y. Yan, N. Sebe, and J. Yang, “Pattern-

afﬁnitive propagation across depth, surface normal and semantic

segmentation,” in CVPR, 2019.

[44]

S. Vandenhende, S. Georgoulis, and L. Van Gool, “Mti-net: Multi-

scale task interaction networks for multi-task learning,” in ECCV,

2020.

[45]

D. Xu, W. Xie, and A. Zisserman, “Geometry-aware video object

detection for static cameras,” in BMVC, 2019.

[46]

D. Xu, W. Ouyang, E. Ricci, X. Wang, and N. Sebe, “Learning

cross-modal deep representations for robust pedestrian detection,”

in CVPR, 2017.

[47]

D. Chen, D. Xu, H. Li, N. Sebe, and X. Wang, “Group consistent

similarity learning via deep crf for person re-identiﬁcation,” in

CVPR, 2018.

[48]

K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang,

W. Liu, and J. Wang, “High-resolution representations for labeling

pixels and regions,” arXiv preprint arXiv:1904.04514, 2019.

[49]

J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu,

M. Tan, X. Wang, et al., “Deep high-resolution representation

learning for visual recognition,” TPAMI, 2020.

[50]

L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L.

Yuille, “Semantic image segmentation with deep convolutional

nets and fully connected crfs,” in ICLR, 2015.

[51]

S. Yang and D. Ramanan, “Multi-scale recognition with dag-cnns,”

in ICCV, 2015.

[52]

G. Huang and D. Chen, “Multi-scale dense networks for resource

efﬁcient image classiﬁcation,” in ICLR, 2018.

[53]

P. Veliˇckovi´c, G. Cucurull, A. Casanova, A. Romero, P. Lio, and

Y. Bengio, “Graph attention networks,” in ICLR, 2017.

[54]

T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, “The

application of two-level attention models in deep convolutional

neural network for ﬁne-grained image classiﬁcation,” in CVPR,

2015.

[55]

J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio,

“Attention-based models for speech recognition,” in NIPS, 2015.

[56]

K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov,

R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image

caption generation with visual attention,” in ICML, 2015.

[57]

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.

Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”

in NIPS, 2017.

[58]

J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual

attention network for scene segmentation,” in CVPR, 2019.

[59]

Y. Tang, “Gated boltzmann machine for recognition under oc-

clusion,” in NIPS Workshop on Transfer Learning by Learning Rich

Generative Models, 2010.

[60]

J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evalua-

tion of gated recurrent neural networks on sequence modeling,”

arXiv preprint arXiv:1412.3555, 2014.

[61]

X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu, Y. Zhou,

B. Yang, Z. Wang, et al., “Crafting gbd-net for object detection,”

arXiv preprint arXiv:1610.02579, 2016.

[62]

A. Quattoni, M. Collins, and T. Darrell, “Conditional random

ﬁelds for object recognition,” in NIPS, 2005.

[63]

S. Sarawagi and W. W. Cohen, “Semi-markov conditional random

ﬁelds for information extraction,” in NIPS, 2005.

[64] J. Lafferty, A. McCallum, and F. C. Pereira, “Conditional random

ﬁelds: Probabilistic models for segmenting and labeling sequence

data,” in ICML, 2001.

[65]

X. He, R. S. Zemel, and M. Á. Carreira-Perpiñán, “Multiscale

conditional random ﬁelds for image labeling,” in CVPR, 2004.

[66]

P. Krähenbühl and V. Koltun, “Efﬁcient inference in fully con-

nected crfs with gaussian edge potentials,” in NIPS, 2011.

[67]

Y. Boykov and G. Funka-Lea, “Graph cuts and efﬁcient nd image

segmentation,” IJCV, vol. 70, no. 2, pp. 109–131, 2006.

[68]

X. Chu, W. Ouyang, X. Wang, et al., “Crf-cnn: Modeling structured

information in human pose estimation,” in NIPS, 2016.

[69]

D. Xu, W. Wang, H. Tang, H. Liu, N. Sebe, and E. Ricci, “Structured

attention guided convolutional neural ﬁelds for monocular depth

estimation,” in CVPR, 2018.

[70] J. Winn, “Causality with gates,” in AISTATS, 2012.

[71]

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet clas-

siﬁcation with deep convolutional neural networks,” in NIPS,

2012.

[72]

K. Simonyan and A. Zisserman, “Very deep convolutional

networks for large-scale image recognition,” arXiv preprint

arXiv:1409.1556, 2014.

[73]

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for

image recognition,” arXiv preprint arXiv:1512.03385, 2015.

[74]

H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and

A. Agrawal, “Context encoding for semantic segmentation,” in

CVPR, 2018.

[75]

R. Garg, G. Carneiro, and I. Reid, “Unsupervised cnn for single

view depth estimation: Geometry to the rescue,” in ECCV, 2016.

[76]

P. F. Felzenszwalb and D. P. Huttenlocher, “Efﬁcient graph-based

image segmentation,” IJCV, vol. 59, no. 2, 2004.

[77]

D. Comaniciu and P. Meer, “Mean shift: A robust approach toward

feature space analysis,” TPAMI, vol. 24, no. 5, pp. 603–619, 2002.

[78]

J. Shi and J. Malik, “Normalized cuts and image segmentation,”

TPAMI, vol. 22, no. 8, 2000.

[79]

Z. Ren and G. Shakhnarovich, “Image segmentation by cascaded

region agglomeration,” in CVPR, 2013.


[80]

J. J. Lim, C. L. Zitnick, and P. Dollár, “Sketch tokens: A learned

mid-level representation for contour and object detection,” in

CVPR, 2013.

[81]

J. Pont-Tuset, P. Arbelaez, J. T. Barron, F. Marques, and J. Malik,

“Multiscale combinatorial grouping for image segmentation and

object proposal generation,” TPAMI, vol. 39, no. 1, pp. 128–140,

2016.

[82]

Q. Zhao, “Segmenting natural images with the least effort as

humans,” in BMVC, 2015.

[83]

P. Dollár and C. L. Zitnick, “Structured forests for fast edge

detection,” in ICCV, 2013.

[84]

S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich

features from rgb-d images for object detection and segmentation,”

in ECCV, 2014.

[85]

B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba,

“Scene parsing through ade20k dataset,” in CVPR, 2017.

[86]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,

“Imagenet: A large-scale hierarchical image database,” in CVPR,

2009.

[87]

C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised

monocular depth estimation with left-right consistency,” in CVPR,

2017.

[88]

D. Xu, E. Ricci, W. Ouyang, X. Wang, N. Sebe, et al., “Monocular

depth estimation using multi-scale continuous crfs as sequential

deep networks,” TPAMI, vol. 41, no. 6, pp. 1426–1440, 2018.

[89]

S. Hallman and C. C. Fowlkes, “Oriented edge forests for

boundary detection,” in CVPR, 2015.

[90]

Z. Zhang, F. Xing, X. Shi, and L. Yang, “Semicontour: A semi-

supervised learning approach for contour detection,” in CVPR,

2016.

[91]

P. Dollár and C. L. Zitnick, “Fast edge detection using structured

forests,” TPAMI, vol. 37, no. 8, pp. 1558–1570, 2015.

[92]

S. Gupta, P. Arbelaez, and J. Malik, “Perceptual organization and

recognition of indoor scenes from rgb-d images,” in CVPR, 2013.

[93]

T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised

learning of depth and ego-motion from video,” in CVPR, 2017.

[94]

J. N. Kundu, P. K. Uppala, A. Pahuja, and R. V. Babu, “Adadepth:

Unsupervised content congruent adaptation for depth estimation,”

in CVPR, 2018.

[95]

A. Pilzer, D. Xu, M. Puscas, E. Ricci, and N. Sebe, “Unsupervised

adversarial depth estimation using cycled generative networks,”

in 3DV, 2018.

[96]

C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey, “Learning

depth from monocular videos using direct methods,” in CVPR,

2018.

[97]

Y. Zou, Z. Luo, and J.-B. Huang, “Df-net: Unsupervised joint

learning of depth and ﬂow using cross-task consistency,” in ECCV,

2018.

[98]

H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and

I. Reid, “Unsupervised learning of monocular depth estimation

and visual odometry with deep feature reconstruction,” arXiv

preprint arXiv:1803.03893, 2018.

[99]

Y. Kuznietsov, J. Stückler, and B. Leibe, “Semi-supervised deep

learning for monocular depth map prediction,” in CVPR, 2017.

[100]

A. Saxena, M. Sun, and A. Y. Ng, “Make3d: Learning 3d scene

structure from a single still image,” TPAMI, vol. 31, no. 5, pp.

824–840, 2009.

[101]

R. Mahjourian, M. Wicke, and A. Angelova, “Unsupervised

learning of depth and ego-motion from monocular video using

3d geometric constraints,” in CVPR, 2018.

[102]

L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking

atrous convolution for semantic image segmentation,” arXiv

preprint arXiv:1706.05587, 2017.

[103]

H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing

network,” arXiv preprint arXiv:1612.01105, 2016.

[104]

S. Xie, X. Huang, and Z. Tu, “Top-down learning for structured

labeling with convolutional pseudoprior,” in ECCV, 2016.

[105]

J. Dai, K. He, and J. Sun, “Convolutional feature masking for joint

object and stuff segmentation,” in CVPR, 2015.

[106]

J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional

networks for semantic segmentation,” in CVPR, 2015.

[107]

J. Dai, K. He, and J. Sun, “Boxsup: Exploiting bounding boxes to

supervise convolutional networks for semantic segmentation,” in

ICCV, 2015.

[108]

A. Bansal, X. Chen, B. Russell, A. Gupta, and D. Ramanan,

“Pixelnet: Representation of the pixels, by the pixels, and for

the pixels,” arXiv preprint arXiv:1702.06506, 2017.

[109]

Y. Yuan, X. Chen, and J. Wang, “Object-contextual representations

for semantic segmentation,” arXiv preprint arXiv:1909.11065, 2019.


Dan Xu

is an Assistant Professor in the Depart-

ment of Computer Science and Engineering at

HKUST. He was a Postdoctoral Research Fellow

in VGG at the University of Oxford. He received his Ph.D. from the Department of Computer Science at the University of Trento. He was also a research

assistant of MM Lab at the Chinese University of

Hong Kong. He received the best scientiﬁc paper

award at ICPR 2016, and a Best Paper Nominee

at ACM MM 2018. He served as Area Chair of

ACM MM 2020, ICPR 2020 and WACV 2021.


Xavier Alameda-Pineda

received M.Sc. de-

grees in mathematics (2008), in telecommunica-

tions (2009) and in computer science (2010) and

a Ph.D. in mathematics and computer science

(2013) from Université Joseph Fourier. Since

2016, he has been a Research Scientist at Inria Greno-

ble Rhône-Alpes, with the Perception team. He

served as Area Chair at ICCV’17, ICIAP’19 and ACM MM’19. He is the recipient of several

paper awards and of the ACM SIGMM Rising Star

Award in 2018.


Wanli Ouyang

received the PhD degree from the

Department of Electronic Engineering, The Chi-

nese University of Hong Kong. He is now a senior

lecturer in the School of Electrical and Information

Engineering at the University of Sydney, Australia.

His research interests include image processing,

computer vision and pattern recognition. He is a

senior member of IEEE.


Elisa Ricci

received the PhD degree from the

University of Perugia in 2008. She is an asso-

ciate professor at the University of Trento and a

researcher at Fondazione Bruno Kessler. She was previously a post-doctoral researcher at Idiap, Martigny, and Fondazione Bruno Kessler,

Trento. She was also a visiting researcher at

the University of Bristol. Her research interests

are mainly in the areas of computer vision and

machine learning. She is a member of the IEEE.


Xiaogang Wang

received the PhD degree in

Computer Science from Massachusetts Institute

of Technology. He is an associate professor in the

Department of Electronic Engineering at the Chi-

nese University of Hong Kong since August 2009.

He was an Area Chair of ICCV 2011 and 2015,

ECCV 2014 and 2016, ACCV 2014 and 2016.

He received the Outstanding Young Researcher

in Automatic Human Behaviour Analysis Award

in 2011, Hong Kong RGC Early Career Award in

2012, and CUHK Young Researcher Award 2012.


Nicu Sebe

is Professor with the University of

Trento, Italy, leading the research in the areas

of multimedia information retrieval and human

behavior understanding. He was the General Co-

Chair of the IEEE FG Conference 2008 and ACM

Multimedia 2013, and the Program Chair of the

International Conference on Image and Video

Retrieval in 2007 and 2010, ACM Multimedia

2007 and 2011. He was the Program Chair of

ICCV 2017 and ECCV 2016, and a General Chair

of ACM ICMR 2017. He is a fellow of the IAPR.