To appear in Multimedia Tools and Applications, October 2015.
Special Issue on Perception Inspired Video Processing
Compressed-Domain Correlates of Human Fixations in Dynamic Scenes
Sayed Hossein Khatoonabadi · Ivan V. Bajić · Yufeng Shan
Abstract In this paper we present two compressed-domain features that are highly indicative of saliency in nat-
ural video. We demonstrate the potential of these two features to indicate saliency by comparing their statistics
around human fixation points against their statistics at control points away from fixations. Then, using these fea-
tures, we construct a simple and effective saliency estimation method for compressed video, which utilizes only
motion vectors, block coding modes and coded residuals from the bitstream, with partial decoding. The proposed
algorithm has been extensively tested on two ground truth datasets using several accuracy metrics. The results
indicate its superior performance over several state-of-the-art compressed-domain and pixel-domain algorithms
for saliency estimation.
Keywords Compressed-domain video processing ·Visual saliency ·Human fixations
1 Introduction
Visual attention in humans is a set of strategies in early stages of vision processing that filters the massive stream
of data collected by eyes. To reduce the complexity of visual scene analysis, the Human Visual System (HVS)
selects visually salient regions by shifting the focus of attention across the scene [40]. Mimicking human attention
has become useful in a number of computer vision and image/video processing tasks. Computational models of
attention are finding applications in object recognition [15], object tracking [36], video abstraction [25], video
coding [11,12], video quality assessment [38], error concealment [13], data hiding [26], attention retargeting [37],
and so on.
The overwhelming majority of existing saliency models operate on raw pixels, rather than compressed images
or video. An excellent review of the state of the art on pixel-domain saliency estimation is given in [4,5]. In
addition, a few attempts have been made to use compressed video data, such as motion vectors (MVs), block
coding modes, motion-compensated prediction residuals, or their transform coefficients, in saliency modeling [34,
2,33,39,9]. A recent comparison of these methods is presented in [27]. The compressed-domain approach is typically
adopted for efficiency reasons, i.e., to avoid recomputing information already present in the compressed bitstream.
This work was supported in part by the Cisco Research Award CG# 573690 and NSERC Grant RGPIN 327249.
Sayed Hossein Khatoonabadi
Simon Fraser University
Burnaby, BC, Canada
E-mail: skhatoon@sfu.ca
Ivan V. Bajić
Simon Fraser University
Burnaby, BC, Canada
E-mail: ibajic@ensc.sfu.ca
Yufeng Shan
Cisco Systems, Boxborough, MA, USA
E-mail: yshan@cisco.com
The extracted data is a proxy for many of the features frequently used in saliency modeling. For example, the MV
field is an approximation to optical flow, while block coding modes and prediction residuals are indicative of
motion complexity. Furthermore, the extraction of these features only requires partial decoding of the compressed
video file, while the recovery of the actual pixel values is not necessary.
In this paper, we describe two video features called Motion Vector Entropy (MVE) and Smoothed Residual
Norm (SRN), both of which can be computed from the compressed video bitstream using motion vectors, block
coding modes, and transformed prediction residuals. We analyze the statistics of these features at the human
fixation points in video, and show that they are sufficiently different from their statistics at non-attended points,
which qualifies them as correlates of fixations. That is to say, these features are indicative of attended regions in
video.
Finally, using these two features, we build a simple visual saliency model for compressed video and compare
it against several state-of-the-art pixel-domain and compressed-domain saliency models, using several accuracy
metrics. The results indicate that the proposed model achieves higher accuracy in predicting fixations compared
to state-of-the-art models, even the pixel-domain ones. We also discuss the reasons for this improved accuracy.
A preliminary version of this work was presented in [28]. The present paper includes some new material, such as
saliency estimation assessment in terms of Jensen-Shannon Divergence (JSD) and an evaluation of individual
compressed-domain features as saliency predictors.
2 Background
Many computational models of visual saliency have been introduced during the past 25 years. In a gold-standard
saliency model, the so-called Itti-Koch-Niebur (IKN) model [24], visual saliency is computed based on center-
surround differences in various feature channels. Itti et al. further extended the model to account for more biolog-
ically plausible normalization [18] and temporal features [19,20].
Harel et al. [16] used graph topology to combine feature maps based on the similarity and distance between
two connected nodes in the graph. Seo and Milanfar [43] estimated saliency as the resemblance of each pixel
to its surroundings by using local steering kernels. Kim et al. [29] proposed a multiscale saliency detection al-
gorithm based on a center-surround strategy where the spatial saliency is computed by self-ordinal resemblance
measure while the temporal saliency is computed by the difference between temporal gradients of the center and
the surroundings. Garcia-Diaz et al. [10] proposed an adaptive sparse representation computed through adaptive
whitening of color and scale features to produce the saliency map.
The complexity of some of the saliency models for dynamic scenes is quite high, which has limited their
deployment in real-world applications. Reduction of complexity can be achieved by performing computation in
the compressed domain, using the data already computed by the encoder. Therefore, compressed-domain visual
saliency models are attractive due to their generally lower computational cost compared to their pixel-domain
counterparts.
For example, Ma and Zhang [34] used only MVs to estimate visual saliency. Their model takes two rules
into account: the sensitivity of HVS to a large-magnitude motion and the stability of camera motion. They further
added two new rules to reduce false detection [35]: the MVs of a moving object tend to have coherent angles, and
if the magnitudes of an object's MVs are large and their angles are incoherent, the motion information is not reliable.
Agarwal et al. [2] computed the saliency map based on a linear combination of three energy terms: normalized
motion magnitude, edge energy, and Spatial Frequency Content (SFC). Motion magnitude is obtained from the
MVs, while the edge energy and SFC are computed from the DCT coefficient values for each macroblock. This
model was further improved in [44] by using a center-surround mechanism that attempted to measure how different
the magnitude of a MV is from the MV magnitudes in its neighborhood through respective downsampling and
upsampling.
Liu et al. [33] introduced a motion center-surround difference model by constructing seven different resolution
levels according to MV sizes, and then computing the motion saliency at any level for the MV of a given block as
the average magnitude difference from 8-connected neighbors at the same level.
The model introduced by Muthuswamy and Rajan [39] consists of two steps for generating the final motion
saliency map. In the first step, spatial saliency map was constructed using low-frequency AC coefficients according
to center-surround differences. The motion saliency map was the result of refining the spatial saliency map by
using an accumulated binary motion map across neighboring frames. Each binary motion frame was obtained by
thresholding the magnitude of MVs, and the refining was carried out by element-wise multiplication. In the second
step, the dissimilarity of DC images among co-located blocks over the frames was calculated based on entropy.
The final saliency map was the product of the spatial and motion saliency maps.
Fang et al. [9] determined the static saliency from Intra-coded frames (I-frames) using features such as lumi-
nance, color and texture obtained from DCT coefficients of each 8×8 block. Meanwhile, MVs were extracted from
Predicted frames (P-frames) and Bidirectionally-predicted frames (B-frames) as the features of motion saliency.
For each feature, a conspicuity map was constructed using the center-surround difference mechanism based on the
feature integration theory of attention [47], where the saliency value is inversely proportional to the distance be-
tween center and surround. The models reviewed above are summarized in Table 2 and included in the experiments
described in Section 5.
3 Compressed-domain features
Typical video compression consists of motion estimation and motion-compensated prediction, followed by trans-
formation, quantization and entropy coding of prediction residuals and motion vectors. These processing blocks
have existed since the earliest video coding standards, getting more sophisticated over time. Our compressed-
domain features are computed from the outputs of these basic processing blocks. For concreteness, we shall focus
on the H.264/AVC coding standard [48], but the feature computation can be adjusted to other video coding stan-
dards, including the latest High Efficiency Video Coding (HEVC) [45]. Due to the focus on H.264/AVC, our
terminology involves 16 ×16 macroblocks, block coding modes (INTER, INTRA, SKIP) for various block sizes
(4 ×4, 8 ×8, etc.), and the 4 ×4 integer transform that we shall refer to as “DCT” although it is only an approxi-
mation to the actual Discrete Cosine Transform.
3.1 Motion Vector Entropy
Motion vectors (MVs) in the video bitstream carry important cues regarding temporal changes in the scene. A MV
is a two-dimensional vector (v_x, v_y) assigned to a block, which represents its offset from the best-matching block
in a reference frame. The best-matching block is found via motion estimation, often coupled with rate-distortion
optimization. The MV field can be considered as an approximation to the optical flow.
When a moving object passes through a certain region in the scene, it will generate different MVs in the
corresponding spatio-temporal neighborhood; some MVs will correspond to the background, others to the object
itself, and the object's MVs themselves may be very different from each other, especially if the object is flexible. On
the other hand, an area of the scene covered entirely by the background will tend to have consistent MVs, caused
mostly by the camera motion. From this point of view, variation of MVs in a given spatio-temporal neighborhood
could be used as an indicator of the presence of moving objects, which in turn shows potential for attracting
attention.
Before computing the feature that describes the above-mentioned concept, block processing in the frame
is performed as follows. SKIP blocks are assigned a zero MV, while INTRA-coded blocks are excluded from
analysis. Then all MVs in the frame are mapped to 4 ×4 blocks; for example, a MV assigned to an 8 ×8 block is
allocated to all four of its constituent 4 ×4 blocks, etc.
We define a motion cube as a causal spatio-temporal neighborhood of a given 4 ×4 block b, as illustrated in
Fig. 1. The spatial dimension of the cube (W) is selected to be twice the size of the fovea (2° of visual angle [19]), while the temporal dimension (L) is set to 200 ms. For example, for CIF resolution (352×288) video at 30 frames per second and the viewing conditions specified in [14], these values are W = 52 pixels and L = 6 frames.
To give a quantitative interpretation of MV variability within the motion cube, we use the normalized Motion
Vector Entropy (MV E), defined as
MVE(b) = -\frac{1}{\log N} \sum_{i \in H(\Theta(b))} \frac{n_i}{N} \log \frac{n_i}{N},   (1)
where Θ(b) is the motion cube associated with the 4 × 4 block b, H(·) is the histogram, i is the bin index, n_i is the number of MVs in bin i, and N = Σ_i n_i. The factor 1/log N in (1) serves to normalize MVE so that its maximum
Fig. 1 Motion cube Θ(b) associated with block b (shown in red) is its causal spatio-temporal neighborhood of size W × W × L.
value is 1, achieved when n_i = n_j for all i, j. The minimum value of MVE is 0, achieved when n_i = 0 for all i except one.
Histogram H(Θ(b)) is constructed from MVs of inter-coded blocks within the motion cube. Depending on the
encoder settings, such as search range and MV accuracy (full pixel, half pixel, etc.), each MV can be represented
using a finite number of pairs of values (v_x, v_y). Each possible pair of values (v_x, v_y) under the given motion
estimation settings defines one bin of the histogram. The cube is scanned and every occurrence of a particular
(v_x, v_y) results in incrementing the corresponding n_i by 1.
It has been observed that large-size blocks are more likely to be part of the background, whereas small-size
blocks, arising from splitting during the motion estimation, are more likely to belong to moving objects [3]. To
take this into account, during block processing, 4 ×4 INTER blocks are assigned random vectors from a uniform
distribution over the motion search range prior to mapping MVs from larger INTER and SKIP blocks to their
constituent 4 ×4 blocks. This way, a motion cube that ended up with many 4 ×4 INTER blocks during encoding
is forced to have high MVE.
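To make the construction concrete, the following NumPy sketch computes an MVE map on the 4 × 4 block grid under the assumptions stated in its docstring; the function name, the array layout, and the brute-force scan of each motion cube are illustrative choices, not the authors' MATLAB implementation.

```python
import numpy as np

def mve_map(mv_field, intra_mask, W_blocks=13, L_frames=6):
    """Normalized motion vector entropy (MVE), one value per 4x4 block.

    mv_field:   list of the last L_frames arrays of shape (H, W, 2) holding
                quarter-pel MVs already mapped to the 4x4 grid (SKIP blocks
                carry (0, 0); 4x4 INTER blocks carry the random vectors
                described in the text).
    intra_mask: list of boolean (H, W) arrays; True marks INTRA blocks,
                which are excluded from the histogram.
    W_blocks:   spatial extent of the motion cube in 4x4 blocks
                (52 pixels / 4 = 13 for CIF); L_frames: temporal extent.
    """
    H, W = mv_field[-1].shape[:2]
    half = W_blocks // 2
    out = np.zeros((H, W))
    for y in range(H):                      # brute-force scan; not optimized
        for x in range(W):
            votes = []
            for t in range(min(L_frames, len(mv_field))):
                ys = slice(max(0, y - half), min(H, y + half + 1))
                xs = slice(max(0, x - half), min(W, x + half + 1))
                mvs = mv_field[-1 - t][ys, xs].reshape(-1, 2)
                keep = ~intra_mask[-1 - t][ys, xs].reshape(-1)
                votes.append(mvs[keep])
            votes = np.concatenate(votes, axis=0)
            if len(votes) < 2:
                continue                    # no usable MVs: MVE stays 0
            # one histogram bin per distinct (vx, vy) pair; empty bins
            # contribute nothing to the entropy
            _, counts = np.unique(votes, axis=0, return_counts=True)
            N = counts.sum()
            p = counts / N
            out[y, x] = -(p * np.log(p)).sum() / np.log(N)   # Eq. (1)
    return out
```

Applied to every inter-coded frame, this yields one MVE value per 4 × 4 block, i.e., the MVE map used later in Section 4.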
3.2 Smoothed Residual Norm
Large motion-compensated prediction residual is an indication that the best-matching block in the reference frame
is not a very good match to the current block. This in turn means that the motion of the current block cannot be
well predicted using the block translation model, either due to the presence of higher-order motion or due to
(dis)occlusions. Of these two, (dis)occlusions often yield higher residuals. Moreover, (dis)occlusions are asso-
ciated with surprise, as a new object enters the scene or gets revealed behind another moving object, so they
represent a potential attractor of attention. Therefore, large residuals might be an indicator of regions that have the
potential to attract attention.
The “size” of the residual is usually measured using a certain norm, for example the ℓ_p norm for some p ≥ 0. In this paper we employ the ℓ_0 norm, i.e., the number of non-zero elements, since it is easier to compute than other popular norms such as ℓ_1 or ℓ_2. For any macroblock, we define the Residual Norm (RN) as the norm of the quantized transformed prediction residual of the macroblock, normalized to the range [0,1]. For the ℓ_0 norm employed in this paper, RN is

RN(m) = \frac{1}{256} \|m\|_0,   (2)

where m denotes the quantized transformed residual of a 16 × 16 macroblock.
Finally, the map of macroblock residual norms is smoothed spatially using a 3×3 averaging filter, temporally using a moving average filter over the previous L frames, and finally upsampled by a factor of 4 using bilinear interpolation. The result is the Smoothed Residual Norm (SRN) map, with one value per 4×4 block, just like the MVE map.
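A compact NumPy/SciPy sketch of this pipeline is given below, assuming the per-macroblock counts of non-zero quantized coefficients have already been parsed from the bitstream; the function name and the use of scipy.ndimage for filtering and linear upsampling are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import uniform_filter, zoom

def srn_map(residual_l0_history, L_frames=6):
    """Smoothed Residual Norm (SRN), one value per 4x4 block.

    residual_l0_history: list (oldest to newest) of (H_mb, W_mb) arrays
        holding, for each 16x16 macroblock, the number of non-zero
        quantized transform coefficients (its l0 "norm").
    """
    # Eq. (2): l0 norm normalized to [0, 1] (256 coefficients per macroblock)
    rn = np.stack([c / 256.0 for c in residual_l0_history[-L_frames:]])

    spatial = np.stack([uniform_filter(f, size=3) for f in rn])  # 3x3 averaging
    temporal = spatial.mean(axis=0)      # moving average over previous L frames
    return zoom(temporal, 4, order=1)    # linear interpolation, upsample by 4
```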
Fig. 2 Scatter plots of the pairs (control sample mean, test sample mean) in each frame, for MVE (left) and SRN (right). Dots above the diagonal show that feature values at fixation points are higher than at randomly selected points.
3.3 Discriminative power of the features
To assess how indicative of fixations the two above-mentioned features are, we use an approach similar to that of Reinagel
and Zador [42], who performed an analogous analysis for two other features – spatial contrast and local pixel
correlation – on still natural images. Their analysis showed that in still images, on average, spatial contrast is
higher, while local pixel correlation is lower, around fixation points compared to random points.
We follow an analogous approach, using two eye-tracking datasets, DIEM [1] and SFU [14], which contain
fixation points of human observers for a number of video clips. Each video is encoded using the FFMPEG library
(www.ffmpeg.org) with the QP value set to 30 and 1/4-pixel MV accuracy with motion estimation range of 16
pixels, and up to four motion vectors per MB. After encoding, transformed residuals (DCT), block coding modes
(BCMs) and MVs of P-frames were extracted and the two features were computed as explained above.
In each frame, feature values at fixation points are selected as a test sample, while feature values at non-fixation
points are selected as a control sample. Specifically, the control sample is obtained using a nonparametric bootstrap
technique [7] from non-fixation points in the video frame. Control points are sampled with replacement, multiple
times, with sample size equal to the number of fixation points. The average of the feature values over all bootstrap
(sub)samples is taken as the control sample mean.
The pair of values (control sample mean, test sample mean) is shown in Fig. 2 for each frame as a green dot.
The left scatter plot corresponds to MVE and the right scatter plot to SRN. From these plots, it is easy to see that, on average, MVE and SRN values at fixation points tend to be higher than MVE and SRN values at randomly selected non-fixation points. This suggests that MVE and SRN could be used as indicators of possible fixations in
video.
To further test this observation, the control and test sample in each frame are compared via a two-sample
t-test [30], with the null hypothesis being that the two samples come from populations with the same mean. A
separate test is performed for MVE and SRN. The two-sample t-test at the 0.1% significance level results in rejecting the null hypothesis for both MVE and SRN, in all sequences. Note that we have used a stricter 0.1%
significance level here, compared to the more conventional (and looser) 1% and 5% levels. The p-values obtained
for each video sequence are listed in Table 1, along with the percentage of frames where the test sample mean is
greater than the control sample mean.
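The per-frame comparison can be sketched as follows; this is a minimal illustration of the bootstrap-and-t-test procedure described above, in which the number of bootstrap resamples, the random seed, and the variance assumption of scipy.stats.ttest_ind are unstated details of the paper and therefore chosen arbitrarily here.

```python
import numpy as np
from scipy import stats

def compare_frame(feature_map, fix_pts, n_boot=100, seed=0):
    """Test vs. control comparison of feature values in one frame.

    feature_map: 2D array of MVE or SRN values (one per 4x4 block).
    fix_pts:     list of (row, col) fixation locations on the same grid.
    Returns (test sample mean, bootstrapped control sample mean, p-value).
    """
    rng = np.random.default_rng(seed)
    test = np.array([feature_map[r, c] for r, c in fix_pts])

    mask = np.ones(feature_map.shape, dtype=bool)
    for r, c in fix_pts:
        mask[r, c] = False                 # exclude fixated blocks
    control_pool = feature_map[mask]

    # bootstrap: resample |fixations| control points, multiple times,
    # with replacement, and average the resulting feature values
    boots = rng.choice(control_pool, size=(n_boot, len(test)), replace=True)
    control_mean = boots.mean()

    # two-sample t-test between the test sample and one control sample
    control = rng.choice(control_pool, size=len(test), replace=True)
    _, p_value = stats.ttest_ind(test, control)
    return test.mean(), control_mean, p_value
```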
Overall, these results lend strong support to the assertion that MVE and SRN are compressed-domain corre-
lates of fixations in natural video. In the remainder of the paper, we describe a simple approach to visual saliency
estimation using these two features, and then proceed to compare the proposed approach against several state-of-
the-art visual saliency models.
4 Proposed saliency estimation
The two compressed-domain features identified in Section 3 as visual correlates of fixations in video suggest that
a simple saliency estimate may be obtained without fully reconstructing the video. Fig. 3 shows the block diagram
of the proposed algorithm to estimate visual saliency. For each inter-coded frame, MVs, BCMs and DCT are
entropy-decoded from the video bitstream. MVs and BCMs are used to construct the MV E map, illustrated in the
upper branch in Fig. 3 for a frame from sequence Stefan. Transformed residuals are used to construct the SRN
Table 1 Results of statistical comparison of test and control samples. For each sequence, the p-value of a two-sample t-test and the
percentage (%) of frames where the test sample mean is larger than the control sample mean are shown.
              MVE                  SRN
#   Seq.      p           %        p           %
1   Bus       10^-59      100      10^-116     100
2   City      10^-13      77       10^-56      85
3   Crew      10^-53      91       10^-64      93
4   Foreman   10^-25      79       10^-17      59
5   Garden    10^-22      83       10^-56      90
6   Hall      10^-207     96       10^-218     95
7   Harbour   10^-61      90       10^-131     100
8   Mobile    10^-64      93       10^-86      90
9   Mother    10^-187     99       10^-167     100
10  Soccer    10^-73      97       10^-92      96
11  Stefan    10^-46      100      10^-69      100
12  Tempete   10^-359              10^-55      98
13  blicb     10^-50      98       10^-105     98
14  bws       10^-70      99       10^-97      94
15  ds        10^-49      97       10^-46      86
16  abb       10^-23      65       10^-68      75
17  abl       10^-121     95       10^-145     96
18  ai        10^-23      65       10^-59      78
19  aic       10^-100              10^-265     100
20  ail       10^-204     98       10^-198     98
21  hp6t      10^-38      97       10^-70      94
22  mg        10^-58      96       10^-74      97
23  mtnin     10^-54      94       10^-154     94
24  ntbr      10^-21      97       10^-42      93
25  nim       10^-36      86       10^-28      67
26  os        10^-105     93       10^-142     100
27  pas       10^-73      95       10^-92      98
28  pnb       10^-548              10^-14      58
29  ss        10^-137     100      10^-175     99
30  swff      10^-968              10^-27      65
31  tucf      10^-788              10^-49      95
32  ufci      10^-878              10^-35      94
Fig. 3 Block diagram of the proposed saliency estimation algorithm: MVs and BCMs entropy-decoded from the H.264/AVC bitstream pass through block processing, histogram construction and entropy computation to produce the MVE map (upper branch), while the DCT-domain residuals pass through residual-norm computation, filtering and resizing to produce the SRN map (lower branch); the two maps are fused into the final saliency map.
map, shown in the bottom branch in Fig. 3. Operations performed by various processing blocks were described
when the corresponding features were introduced in Section 3. The final saliency map is obtained by fusing the
two feature maps.
A number of feature fusion methods have been investigated in the context of saliency estimation [19,24].
The appropriate fusion method will depend on whether the features in question are independent, and whether
their mutual action reinforces or diminishes saliency. In our case, we note that MVE and SRN are somewhat
independent, in the sense that one could imagine a region in the scene with high MVE and low SRN, and vice versa. Also, their combined action is likely to increase saliency – when both MVE and SRN are large, the region is not only likely to contain moving objects (large MVE), but also contains parts that are surprising and not
Table 2 Saliency estimation algorithms used in our evaluation. D: target domain (cmp: compressed; pxl: pixel); I: Implementation (M:
Matlab; P: Matlab p-code; C: C/C++; E: Executable). In GBVS, the DIOFM channels were used.
#   Algorithm   First Author                        Year   D     I
1   MaxNorm     Itti [24] (ilab.usc.edu/toolkit)    1998   pxl   C
2   Fancy1      Itti [19] (ilab.usc.edu/toolkit)    2004   pxl   C
3   SURP        Itti [22] (ilab.usc.edu/toolkit)    2006   pxl   C
4   GBVS        Harel [16]                          2007   pxl   M
5   STSD        Seo [43]                            2009   pxl   M
6   SORM        Kim [29]                            2011   pxl   E
7   AWS         Garcia-Diaz [10]                    2012   pxl   P
8   PMES        Ma [34]                             2001   cmp   M
9   MAM         Ma [35]                             2002   cmp   M
10  PIM-ZEN     Agarwal [2]                         2003   cmp   M
11  PIM-MCS     Sinha [44]                          2004   cmp   M
12  MCSDM       Liu [33]                            2009   cmp   M
13  MSM-SM      Muthuswamy [39]                     2013   cmp   M
14  PNSP-CS     Fang [9]                            2014   cmp   M
easily predictable from previous frames (large SRN). Hence, our fusion involves both additive and multiplicative
combination of the MVE and SRN maps,

S = N(MVE + SRN + MVE ⊙ SRN),   (3)

where the symbol ⊙ denotes pointwise multiplication and N(·) indicates normalization to the range [0,1]. We
will refer to the proposed method as MVE+SRN.
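In code, the fusion of Eq. (3) is essentially a one-liner; the min-max scaling used below for N(·) is an assumption, since the paper only states that the result is normalized to [0,1].

```python
import numpy as np

def fuse(mve, srn, eps=1e-12):
    """Eq. (3): additive plus multiplicative combination of the two maps,
    followed by (assumed min-max) normalization to [0, 1]."""
    s = mve + srn + mve * srn        # elementwise product for the third term
    s = s - s.min()
    return s / (s.max() + eps)
```

Values near 1 then mark regions where the two features reinforce each other, consistent with the rationale given above.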
5 Experimental results
This section presents experimental evaluation of the proposed saliency estimation method and its comparison
with several state-of-the-art saliency models. The MATLAB code and data used in these experiments will be made
available online at http://www.sfu.ca/~ibajic/software.html upon acceptance of the paper.
5.1 Experimental setup
The proposed algorithm was compared with a number of state-of-the-art algorithms for saliency estimation in
video. These methods are listed in Table 2. For each algorithm, the target domain - pixel (pxl) or compressed (cmp)
- and the implementation details are also indicated. Only MCSDM was specifically designed for H.264/AVC,
while other cmp-domain algorithms were designed for earlier coding standards, for example MPEG-1, MPEG-2,
MPEG-4 SP and MPEG-4 ASP. Although, fundamentally, these coding standards all rely on the same type of
information - MVs and prediction residuals - minor modifications to some algorithms were necessary in order for
them to accept H.264/AVC input data, in particular 4 ×4 MVs.
Evaluation was carried out on the DIEM [1] and SFU [14] datasets mentioned in Section 3. The fixations of
the right eye in the DIEM dataset and the first viewing in the SFU dataset were used as the ground truth. For
the DIEM dataset, 20 sequences similar to those in [5] were chosen, and only the first 300 frames were used
in the experiments, to match the length of SFU sequences. Since the videos in the DIEM dataset are at various
resolutions, they were first resized to 288 pixels height, while preserving the original aspect ratio, resulting in five
resolutions: 352 ×288, 384 ×288, 512 ×288, 640 ×288 and 672 ×288.
Encoding was done using FFMPEG (www.ffmpeg.org) with QP ∈ {3,6, ..., 51}in the baseline profile. Se-
quences were encoded with the IPPP GOP structure, with the first frame coded as intra (I) and the remaining
frames coded predictively (P). Each MB has up to four MVs with 1/4-pixel accuracy and a motion
estimation range of 16 pixels. We avoided the use of I- and B-frames since many of the compressed-domain meth-
ods did not specify how to handle these frames. In principle, there are several possibilities for B-frames, such as
flipping forward MVs around the frame to create backward MVs (as in P-frames), or forming two saliency maps,
one from forward MVs and one from backward MVs, and then averaging them. However, to avoid speculation
and stay true to the algorithms the way they were presented, we used only P-frames in the evaluation.
A number of metrics have been used to evaluate the accuracy of visual saliency models with respect to gaze
point data [4,5,8,21,23,31]. Since each metric emphasizes a particular aspect of a model's performance, a collection of metrics was used in this study to keep the evaluation balanced when comparing the accuracy of the various al-
gorithms in predicting fixations. These metrics are Area Under receiver operating characteristic Curve (AUC) [46],
Normalized Scanpath Saliency (NSS) [41] and Jensen-Shannon divergence (JSD) [32]. All were corrected for cen-
ter bias and border effects [27] by sampling control points more often from the middle of the frame relative to the
frame boundaries.
AUC is computed from detection rates and false alarm rates at various threshold parameters. An AUC of 0.5
represents pure chance and an AUC of 1 represents perfect prediction. NSS is computed as the response value of
predicted saliency at fixation points [41]. A larger NSS reflects a greater correspondence between fixations and
saliency predictions. JSD is a Kullback-Leibler Divergence (KLD)-based metric that calculates the divergence
between two probability distributions. Specifically, for two discrete probability distributions Pand Q, JSD is
defined as [6]
JSD(P \| Q) = \frac{KLD(P \| R) + KLD(Q \| R)}{2},   (4)

where

R = \frac{P + Q}{2},   (5)

and

KLD(P \| Q) = \sum_{i=1}^{r} P(i) \cdot \log_b \frac{P(i)}{Q(i)},   (6)

where b is the logarithmic base and r indicates the number of bins in each distribution. Unlike KLD, JSD is symmetric in P and Q, and is bounded in [0,1] if the logarithmic base is set to b = 2 [32]. The value of JSD for
the saliency map that perfectly predicts fixation points will be equal to 1.
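As an illustration, a NumPy sketch of the NSS and JSD computations is shown below; the number of histogram bins r, the [0,1] value range assumed for the saliency maps, and the omission of the center-bias-corrected control-point sampling of [27] are simplifications made here, not choices stated in the paper.

```python
import numpy as np

def nss(sal, fix_pts):
    """NSS: mean saliency at fixation points after normalizing the map to
    zero mean and unit standard deviation."""
    z = (sal - sal.mean()) / (sal.std() + 1e-12)
    return np.mean([z[r, c] for r, c in fix_pts])

def jsd(fix_vals, ctrl_vals, r_bins=32):
    """Eqs. (4)-(6): base-2 Jensen-Shannon divergence between the histograms
    of saliency values at fixation points and at control points."""
    edges = np.linspace(0.0, 1.0, r_bins + 1)
    P = np.histogram(fix_vals, bins=edges)[0].astype(float)
    Q = np.histogram(ctrl_vals, bins=edges)[0].astype(float)
    P, Q = P / P.sum(), Q / Q.sum()
    R = (P + Q) / 2.0

    def kld(a, b):
        m = a > 0                       # 0*log(0) terms contribute nothing
        return np.sum(a[m] * np.log2(a[m] / b[m]))

    return (kld(P, R) + kld(Q, R)) / 2.0
```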
5.2 Comparison to the state-of-the-art
Figs. 4 and 5 illustrate the differences between the saliency predictions of various algorithms for sample video
frames from Stefan and advert-bbc4-library, respectively. In the figures, human maps were obtained by the con-
volution of a 2D Gaussian blob (with a standard deviation of 1° of visual angle) with the fixation points.
Fig. 6 shows the average AUC score (top figure) as well as NSS score (bottom figure) of various algorithms
across the test sequences. The sequences were encoded with QP =30, with the average Peak Signal-to-Noise
Ratio (PSNR) across encoded sequences being 35.85 dB. This is normally considered reasonable objective qual-
ity. The average AUC/NSS performance across sequences/algorithms is shown in the sidebar/topbar. Not surpris-
ingly, on average, pixel-domain methods perform better than compressed-domain ones. However, our proposed
compressed-domain method MVE+SRN tops all other methods, including pixel-domain ones, on both metrics.
Possible reasons for this are discussed in Section 6. Also note that the SRN feature, by itself, is a close second
behind MVE+SRN, while the MVE feature alone is the fourth-best predictor of saliency according to AUC and
the third-best according to NSS. Based on these results, the SRN feature itself shows excellent saliency prediction
capabilities. Including the MVE feature into saliency prediction helps on sequences with considerable amount of
motion, such as Soccer and Stefan, but in many cases, SRN is sufficient.
The performance of the various saliency models was also evaluated using a multiple comparison test [17]. For
each sequence, the average score of a given model across all frames is computed, along with the 95% confidence
interval for the average score. The model with the highest average score is a top performer on that sequence,
however, all other models whose 95% confidence interval overlaps that of the highest-scoring model are also
considered top performers on that sequence. The number of appearances among top performers for each model is
shown in Fig. 7. Again, pixel-domain methods tend to do better than compressed-domain ones, but our method
tops both groups, with some margin. As before, SRN comes in second in terms of both AUC and NSS, while MVE
is fourth according to AUC and shares the third spot according to NSS.
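The confidence-interval-overlap criterion described above can be sketched as follows; the normal-approximation 95% interval (z = 1.96) is an assumed detail, and the actual procedure in [17] may apply additional corrections.

```python
import numpy as np

def top_performers(scores, z=1.96):
    """scores: dict mapping model name -> array of per-frame scores on one
    sequence. Returns the models whose 95% confidence interval overlaps
    that of the highest-scoring model."""
    ci = {}
    for name, s in scores.items():
        s = np.asarray(s, dtype=float)
        half = z * s.std(ddof=1) / np.sqrt(len(s))   # CI half-width
        ci[name] = (s.mean() - half, s.mean() + half)
    best = max(ci, key=lambda n: sum(ci[n]) / 2.0)   # highest mean score
    best_lo = ci[best][0]
    # overlap <=> a model's upper bound reaches the best model's lower bound
    return [n for n, (lo, hi) in ci.items() if hi >= best_lo]
```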
It is worth mentioning that due to the very weak performance of AWS on sequences where most other methods
scored well in Fig. 6 (such as Mobile), the average AUC and NSS scores of AWS are not particularly high - the
average was dragged down by the low scores on these few sequences. However, in the multiple comparison test
in Fig. 7, AWS shows strong performance, because it is among top performers on many other sequences. These
Fig. 4 Sample saliency maps obtained by various algorithms for an example video frame from Stefan.
results corroborate the results of the comparison made in [5] (which was performed on DIEM and other datasets
but not the SFU dataset) where AWS was among the highest-scoring algorithms under study.
We also compare the distribution of saliency values at the fixation locations against the distribution of saliency
values at random points from non-fixation locations in Fig. 8 in terms of JSD. If these two distributions overlap
substantially, then the saliency model predicts fixation points no better than a random guess. On the other hand,
as one distribution diverges from the other, the saliency model is better able to predict fixation points. As seen in
Fig. 8, MVE+SRN generates a higher divergence between the two distributions than the other models do.
5.3 Sensitivity to the amount of compression
Since a compressed video representation always involves some amount of information loss, it is important to
determine the sensitivity of the compressed-domain saliency model to the amount of compression. Note that the
predictive power of MVs and DCT coefficients could change dramatically across the compression range. To study
this issue, we repeated the experiments described above for different amounts of compression, by varying the QP
parameter. The quality of the encoded video, measured in terms of PSNR, drops as QP increases due to the larger
amount of compression.
Fig. 9 shows how the average AUC and NSS scores change as a function of the average PSNR (across se-
quences), by varying QP ∈ {3,6,...,51}. Note that pixel-domain methods are also sensitive to compression -
they do not use compressed-domain information, but they are impacted by the accuracy of decoded pixel values.
Fig. 5 Sample saliency maps obtained by various algorithms for an example video frame from advert-bbc4-library.
Many compressed-domain methods (our MVE+SRN, but also PIM-ZEN, MSM-SM, etc.) achieve their best per-
formance at PSNR around 35 dB. This is excellent news for compressed-domain models because this range of
PSNR is thought to be very appropriate in terms of balance between objective quality and compression efficiency.
Overall, MVE+SRN achieves the highest accuracy across most of the compression range.
Based on the results in Fig. 9, it appears that around the PSNR value of 35 dB, compressed-domain features
MVE and SRN are both sufficiently informative and sufficiently accurate. As the amount of compression reduces
(i.e., quality increases) SRN becomes less informative, since small quantization step-size makes each residual
have a large ℓ_0 norm. At the same time, MVs may become too noisy, since rate-distortion optimization does not
impose sufficient constraints for smoothness. On the other hand, as the amount of compression increases (i.e.,
quality reduces), both MVE and SRN become less accurate. Both extremes are detrimental to saliency prediction.
5.4 Complexity analysis
To assess the complexity of various algorithms, processing time was measured on an Intel (R) Core (TM) i7
CPU at 3.40 GHz and 16 GB RAM running 64-bit Windows 8.1. The results are shown in Table 3. As expected,
compressed-domain models tend to require far less processing time than their pixel-domain counterparts. How-
ever, PNSP-CS is very computationally demanding because calculating the conspicuity of each block (center) in
the frame requires the computation of its dissimilarity and its distance from all other blocks in the frame (sur-
round). MAM and PMES apply an α-trimmed average filter within a 3-D spatio-temporal tracking volume in the
pre-processing step, which requires sorting. This process makes MAM and PMES more demanding in terms of
processing. On the other hand, the proposed method, implemented in MATLAB, required an average of only 30 ms
per CIF video frame. While this is slower than some of the other compressed-domain algorithms, it enables the
computation of saliency within the real-time requirements, even in MATLAB. The majority of the computational
cost in our method is due to entropy computations over the histograms of MVs in various motion cubes.
Fig. 6 Accuracy of various saliency algorithms over the two datasets according to (top) AUC and (bottom) NSS scores. Each 2D color map shows the average AUC/NSS score of each algorithm on each sequence. Topbar: average AUC/NSS score for each sequence, across all algorithms. Sidebar: average AUC/NSS score for each algorithm across all sequences. Error bars represent the standard error of the mean, σ/√n, where σ is the sample standard deviation of n samples. Sequences from the SFU dataset are indicated with a capital first letter.
6 Discussion and conclusions
We presented two video features, obtainable from the compressed video bitstream, that qualify as correlates of
fixations. Using these two features we constructed a simple visual saliency estimation method and compared it with
fourteen other saliency prediction methods for video. Some of these methods also made use of compressed-domain
features, while others operated in the pixel domain. Comparison was made using several established metrics on
two eye tracking datasets. The results showed that the proposed method outperformed all other methods, including
pixel-domain ones.
Fig. 7 The number of appearances among top performers, using AUC and NSS evaluation metrics.
Fig. 8 The frequencies of saliency values estimated by different algorithms at the fixation locations (narrow blue bars) and at random points from non-fixation locations (wide green bars), plotted against the number of human fixations. The JSD between the two distributions for each algorithm indicates how far they diverge from each other; panels are ordered, left to right and top to bottom, by decreasing JSD: MVE+SRN 7.72×10^-2, Fancy1 5.79×10^-2, MaxNorm 5.63×10^-2, AWS 4.17×10^-2, SURP 3.89×10^-2, PIM-ZEN 3.23×10^-2, STSD 3.01×10^-2, PIM-MCS 3×10^-2, PMES 2.54×10^-2, MSM-SM 2.53×10^-2, SORM 2.24×10^-2, GBVS 2.23×10^-2, MAM 1.26×10^-2, PNSP-CS 1.24×10^-2, MCSDM 1.13×10^-2.
Table 3 Average processing time (ms) per frame.
Algorithm   AWS    GBVS   PNSP-CS   MAM   PMES   SURP   STSD   Fancy1   SORM   MaxNorm   PIM-ZEN   MVE+SRN   MCSDM   PIM-MCS   MSM-SM
Time        1492   923    895       778   579    323    227    98       92     89        43        30        15      10        8
A natural question to ask is - how come a compressed-domain method, which seems to be restricted to a
relatively constrained set of data, can outperform pixel-domain methods in terms of saliency prediction? To answer
this question, one needs to realize that a compressed-domain method is not a stand-alone entity. Its front end is the
video encoder, an extremely sophisticated algorithm whose goal is to provide the most compact representation of
video. Surely, such compact representation contains useful information about various aspects of the video signal,
including saliency. A more operational explanation for the performance gap between our method and pixel-domain
Fig. 9 The relationship between the average PSNR and the models' accuracy according to AUC (left) and NSS (right).
methods is that none of these methods used optical flow (or its approximation, the MV field), even though it is
known that such information is a powerful cue for saliency. From this point of view, the results are perhaps not
surprising.
References
1. The Dynamic Images and Eye Movements (DIEM) project, http://thediemproject.wordpress.com
2. Agarwal G, Anbu A, Sinha A (2003) ”A fast algorithm to find the region-of-interest in the compressed MPEG domain”. Proc. IEEE
ICME’03, vol 2, pp. 133-136
3. Arvanitidou MG, Glantz A, Krutz A, Sikora T, Mrak M, Kondoz A (2009) ”Global motion estimation using variable block sizes
and its application to object segmentation”. Proc. IEEE WIAMIS’09, pp. 173–176
4. Borji A, Itti L (2013) ”State-of-the-art in visual attention modeling”. IEEE Trans. Pattern Anal. Mach. Intell., vol 35, no 1, pp.185-
207
5. Borji A, Sihite DN, Itti L (2013) ”Quantitative analysis of human-model agreement in visual saliency modeling: A comparative
study”. IEEE Trans. Image Process., vol 22, no 1, pp. 55-69
6. Dagan I, Lee L, Pereira F (1997) ”Similarity-based methods for word sense disambiguation”. Proc. European chapter of the Asso-
ciation for Computational Linguistics, pp. 56-63
7. Efron B, Tibshirani R (1993) ”An introduction to the bootstrap”. CRC press, vol 57
8. Einhäuser W, Spain M, Perona P (2008) ”Objects predict fixations better than early saliency”. Journal of Vision, vol 8, no 14
9. Fang Y, Lin W, Chen Z, Tsai CM, Lin CW (2014) ”A video saliency detection model in compressed domain”. IEEE Trans. Circuits
Syst. Video Technol., vol 24, no 1, pp. 27-38
10. Garcia-Diaz A, Fdez-Vidal XR, Pardo XM, Dosil R (2012) ”Saliency from hierarchical adaptation through decorrelation and
variance normalization”. Image and Vision Computing, vol 30, no 1, pp. 51-64
11. Guo C, Zhang L (2010) ”A novel multiresolution spatiotemporal saliency detection model and its applications in image and video
compression”. IEEE Trans. Image Process., vol 19, no 1, pp. 185-198
12. Hadizadeh H, Bajić IV (2014) ”Saliency-aware video compression”. IEEE Trans. Image Process., vol 23, no 1, pp. 19-33
13. Hadizadeh H, Bajić IV, Cheung G (2013) ”Video error concealment using a computation-efficient low saliency prior”. IEEE Trans.
Multimedia, vol 15, no 8, pp. 2099-2113
14. Hadizadeh H, Enriquez MJ, Bajić IV (2012) ”Eye-tracking database for a set of standard video sequences”. IEEE Trans. Image
Process., vol 21, no 2, pp. 898-903
15. Han S, Vasconcelos N (2010) ”Biologically plausible saliency mechanisms improve feedforward object recognition”. Vision Re-
search, vol 50, no 22, pp. 2295-2307
16. Harel J, Koch C, Perona P (2007) ”Graph-based visual saliency”. Advances in Neural Information Processing Systems, vol 19, pp.
545-552
17. Hochberg Y, Tamhane AC (1987) ”Multiple comparison procedures”. John Wiley & Sons, Inc.
18. Itti L, Koch C (2001) ”Feature combination strategies for saliency-based visual attention systems”. Journal of Electronic Imaging,
vol 10, no 1, pp. 161-169
19. Itti L (2004) ”Automatic foveation for video compression using a neurobiological model of visual attention”. IEEE Trans. Image
Process., vol 13, no 10, pp. 1304-1318
20. Itti L, Dhavale N, Pighin F (2004) ”Realistic avatar eye and head animation using a neurobiological model of visual attention”.
Optical Science and Technology, SPIE’s 48th Annual Meeting, pp. 64-78
21. Itti L, Baldi P (2005) ”A principled approach to detecting surprising events in video”. Proc. IEEE CVPR’05, vol 1, pp. 631-637
22. Itti L, Baldi PF (2006) ”Bayesian surprise attracts human attention”. Advances in Neural Information Processing Systems, vol 19,
pp. 547-554
23. Itti L, Baldi P (2009) ”Bayesian surprise attracts human attention”. Vision Research, vol 49, no 10, pp. 1295-1306
24. Itti L, Koch C, Niebur E (1998) ”A model of saliency-based visual attention for rapid scene analysis”. IEEE Trans. Pattern Anal.
Mach. Intell., vol 20, no 11, pp. 1254-1259
25. Ji QG, Fang ZD, Xie ZH, Lu ZM (2013) ”Video abstraction based on the visual attention model and online clustering”. Signal
Processing: Image Commun., vol 28, no 3, pp. 241-253
26. Khalilian H, Bajić IV (2013) ”Video watermarking with empirical PCA-based decoding”. IEEE Trans. Image Processing, vol 22,
no 12, pp. 4825-4840
27. Khatoonabadi SH, Bajić IV, Shan Y (2014) ”Comparison of visual saliency models for compressed video”. Proc. IEEE ICIP’14,
pp. 1081-1085
28. Khatoonabadi SH, Bajić IV, Shan Y (2014) ”Compressed-domain correlates of fixations in video”. Proc. 1st Intl. Workshop on
Perception Inspired Video Processing (PIVP’14), pp. 3–8
29. Kim W, Jung C, Kim C (2011) ”Spatiotemporal saliency detection and its applications in static and dynamic scenes”. IEEE Trans.
Circuits Syst. Video Technol., vol 21, no 4, pp. 446-456
30. Kreyszig E (1970) ”Introductory mathematical statistics: Principles and methods”. Wiley New York
31. Le Meur O, Baccino T (2013) ”Methods for comparing scanpaths and saliency maps: Strengths and weaknesses”. Behavior Re-
search Methods, vol 45, no 1, pp. 251-266
32. Lin J (1991) ”Divergence measures based on the Shannon entropy”. IEEE Trans. Inf. Theory, vol 37, no 1, pp. 145-151
33. Liu Z, Yan H, Shen L, Wang Y, Zhang Z (2009) ”A motion attention model based rate control algorithm for H.264/AVC”. The 8th
IEEE/ACIS International Conference on Computer and Information Science (ICIS’09), pp. 568-573
34. Ma YF, Zhang HJ (2001) ”A new perceived motion based shot content representation”. Proc. IEEE ICIP’01, vol 3, pp. 426-429
35. Ma YF, Zhang HJ (2002) ”A model of motion attention for video skimming”. Proc. IEEE ICIP’02, vol 1, pp. 129-132
36. Mahadevan V, Vasconcelos N (2013) ”Biologically inspired object tracking using center-surround saliency mechanisms”. IEEE
Trans. Pattern Anal. Mach. Intell., vol 35, no 3, pp. 541-554
37. Mateescu VA, Bajić IV (2014) ”Attention retargeting by color manipulation in images”. Proc. 1st Intl. Workshop on Perception
Inspired Video Processing (PIVP’14), pp. 15-20
38. Moorthy AK, Bovik AC (2009) ”Visual importance pooling for image quality assessment”. IEEE J. Sel. Topics Signal Process.,
vol 3, no 2, pp. 193-201
39. Muthuswamy K, Rajan D (2013) ”Salient motion detection in compressed domain”. IEEE Signal Process. Lett., vol 20, no 10, pp.
996-999
40. Niebur E, Koch C (1998) ”Computational architectures for attention”. Cambridge, MA: MIT Press, The Attentive Brain, chapter
9, pp. 163-186
41. Peters RJ, Iyer A, Itti L, Koch C (2005) ”Components of bottom-up gaze allocation in natural images”. Vision Research, vol 45,
no 18, pp. 2397-2416
42. Reinagel P, Zador AM (1999) ”Natural scene statistics at the center of gaze”. Network: Computation in Neural Systems, vol 10,
pp. 1-10
43. Seo HJ, Milanfar P (2009) ”Static and space-time visual saliency detection by self-resemblance”. Journal of Vision, vol 9, no 12,
pp. 1-27
44. Sinha A, Agarwal G, Anbu A (2004) ”Region-of-interest based compressed domain video transcoding scheme”. Proc. IEEE
ICASSP’04, vol 3, pp. 161-164
45. Sullivan GJ, Ohm J, Woo-Jin H, Wiegand T (2012) ”Overview of the high efficiency video coding (HEVC) standard”. IEEE Trans.
Circuits Syst. Video Technol., vol 22, no 12, pp. 1649-1668
46. Swets A (1996) ”Signal detection theory and ROC analysis in psychology and diagnostics: Collected papers”. Lawrence Erlbaum
Associates, Inc.
47. Treisman AM, Gelade G (1980) ”A feature-integration theory of attention”. Cognitive Psychology, vol 12, no 1, pp. 97-136
48. Wiegand T, Sullivan GJ, Bjontegaard G, Luthra A (2003) ”Overview of the H.264/AVC video coding standard”. IEEE Trans.
Circuits Syst. Video Technol., vol 13, no 7, pp. 560-576
... They proposed what is referred as operational block description length (OBDL) as a measure of saliency and cast it in a Markov random field (MRF) framework and measured the salient regions as eyefixations from various subjects. Khatoonabadi et al. [13] also explored a visual saliency model by analyzing novel features like motion vector entropy (MVE) and smoothed residual norm (SRN) at human fixation points and found that these features cluster around the human fixations and termed them as correlates of fixation. Residual DCT coefficient norm (RDCN) and OBDL were combined using a Gaussian model whose center was determined by the two features (i.e., RDCN and OBDL) and was demonstrated to be computationally efficient [17]. ...
... Residual DCT coefficient norm (RDCN) and OBDL were combined using a Gaussian model whose center was determined by the two features (i.e., RDCN and OBDL) and was demonstrated to be computationally efficient [17]. Although these compressed domain saliency models achieved good saliency prediction sometimes in methods like [13,16] better than pixel domain counterparts yet in this work, we show that there is scope for improvement in terms of both accuracy as well as computational time reduction, which forms the motivation of the work reported in this paper. ...
... However, as observed it incorrectly marks few people in the audience as salient. A similar observation can be made in MVE-SRN [13] where a large section of people in the audience are incorrectly marked as salient. The City sequence is analyzed next. ...
Article
Full-text available
This paper presents a novel compressed domain saliency estimation method based on analyzing block motion vectors and transform residuals extracted from the bitstream of H.264/AVC compressed videos. Block motion vectors are analyzed by modeling their orientation values utilizing Dual Cross Patterns, a feature descriptor that earlier found applications in face recognition to obtain the motion saliency map. The transform residuals are analyzed by utilizing lifting wavelet transform on the luminance component of the macro-blocks to obtain the spatial saliency map. The motion saliency map and the spatial saliency map are fused utilizing the Dempster–Shafer combination rule to generate the final saliency map. It is shown through our experiments that Dual Cross Patterns and lifting wavelet transform features fused via Dempster–Shafer rule are superior in predicting fixations as compared to the existing state-of-the-art saliency models.
... Paper [42] used the concept of information entropy to mark the foreground area through statistical MV. As shown in Fig. 6, it is a cube centered on a block b (shown in red) with a size of 4×4. ...
... It is worth noting that all the MVs in the frame are mapped to 4 × 4 blocks, which is convenient for statistics of motion information, since the macroblock size is different in video coding. The motion information MV in the causal spatiotemporal neighborhood of 4 × 4 blocks is used to calculate the MVE of this block [42]. As follow in (12) ...
Article
Full-text available
With the development of communication networks and the widespread use of mobile terminals, videos are increasingly shared and distributed among mobile users. During the process, the video is recompressed to a certain file size on the sending side and sent to the receiving side via the server. This makes robust video watermark with low complexity to resist recompression attacks become an important issue to address. The proposed video watermarking method in this article is specially designed for resisting recompression attacks when quantization parameter (QP) is greatly increased. In the proposed method, by using the texture information and motion information of video, the invariance of video content under different quantization parameters could be found to help improve the anti-recompression attack ability of video watermark. Moreover, the proposed framework does not use a location map which has security risk to locate the watermark. It aims at finding the optimal location of watermark embedding adaptively according to the feature of video content itself. The experiment results indicate that the proposed video watermarking method can not only achieve greater robustness against recompression attack but also effectively limit the degradation in video perceptual quality.
... DATASETS USED TO EVALUATE COMPRESSED-DOMAIN VISUALSALIENCY MODELS. Each participant watched each sequence twice, after several minutes ‡ Viewings for the left/right eye are available § A total of 250 subjects participated in the study, but not all of them viewed each video; the number of viewers per video was[35][36][37][38][39][40][41][42][43][44][45][46][47][48][49][50][51][52][53] ...
Preprint
Computational modeling of visual saliency has become an important research problem in recent years, with applications in video quality estimation, video compression, object tracking, retargeting, summarization, and so on. While most visual saliency models for dynamic scenes operate on raw video, several models have been developed for use with compressed-domain information such as motion vectors and transform coefficients. This paper presents a comparative study of eleven such models as well as two high-performing pixel-domain saliency models on two eye-tracking datasets using several comparison metrics. The results indicate that highly accurate saliency estimation is possible based only on a partially decoded video bitstream. The strategies that have shown success in compressed-domain saliency modeling are highlighted, and certain challenges are identified as potential avenues for further improvement.
... Rate control for H.264 videos to optimize the bits for salient regions utilizing multi-scale processing of motion vectors was carried out in [27]. Saliency modeling using features, namely motion vector entropy (MVE) and smoothed residual norm (SRN), was explored in [10]. They extended their work by identifying regions on a video frame that possess more bits and termed them as salient and utilized Operational block description length (OBDL) [12] as a measure of saliency. ...
Article
Full-text available
This paper presents a robust spatio-temporal saliency estimation method based on modeling motion vectors and transform residuals extracted from the H.264/AVC compressed bitstream. Spatial saliency is estimated by analyzing the detailed sub-band coefficients obtained by the wavelet decomposition of the luminance component of the macro-blocks, while temporal saliency is estimated by modeling the block motion vector orientation information using local derivative patterns. Dempster Shafer fusion rule is used to fuse the spatial saliency map and the motion saliency map to obtain the final saliency for a video frame. Extensive experimental validation along with comparative analysis with state-of-the-art methods is carried out to establish the proposed saliency method.
... Some works focus on extracting the saliency areas from encoded videos. In the framework of an H.264/AVC encoded sequence, in [12] the authors present, compressed-domain features based on the study of visual attention in humans. The first one is the Motion Vector Entropy, which is an quantitative measurement of MV variability. ...
Article
Full-text available
With the advent of smartphones and tablets, video traffic on the Internet has increased enormously. With this in mind, in 2013 the High Efficiency Video Coding (HEVC) standard was released with the aim of reducing the bit rate (at the same quality) by 50% with respect to its predecessor. However, new contents with greater resolutions and requirements appear every day, making it necessary to further reduce the bit rate. Perceptual video coding has recently been recognized as a promising approach to achieving high-performance video compression and eye tracking data can be used to create and verify these models. In this paper, we present a new algorithm for the bit rate reduction of screen recorded sequences based on the visual perception of videos. An eye tracking system is used during the recording to locate the fixation point of the viewer. Then, the area around that point is encoded with the base quantization parameter (QP) value, which increases when moving away from it. The results show that up to 31.3% of the bit rate may be saved when compared with the original HEVC-encoded sequence, without a significant impact on the perceived quality.
... The second group of models follow Bayesian approaches for computation of saliency map by the evaluation of dissimilarities between the a priori and the a postriori distributions of the visual features around each point by using KL divergence [8]. The third group includes compressed-domain visual saliency models which operate on the information found in a partial decoding of compressed video bitstream [20], [33] and extract features such as block-based motion vectors, prediction residuals, block coding modes, etc. Even though there are a few attempts using perceptual theories to build a saliency model, they are only limited to classical approaches but not tried to explain the deep learning based architectures. ...
Preprint
Full-text available
Deep neural networks have shown their profound impact on achieving human level performance in visual saliency prediction. However, it is still unclear how they learn the task and what it means in terms of understanding human visual system. In this work, we develop a technique to derive explainable saliency models from their corresponding deep neural architecture based saliency models by applying human perception theories and the conventional concepts of saliency. This technique helps us understand the learning pattern of the deep network at its intermediate layers through their activation maps. Initially, we consider two state-of-the-art deep saliency models, namely UNISAL and MSI-Net for our interpretation. We use a set of biologically plausible log-gabor filters for identifying and reconstructing the activation maps of them using our explainable saliency model. The final saliency map is generated using these reconstructed activation maps. We also build our own deep saliency model named cross-concatenated multi-scale residual block based network (CMRNet) for saliency prediction. Then, we evaluate and compare the performance of the explainable models derived from UNISAL, MSI-Net and CMRNet on three benchmark datasets with other state-of-the-art methods. Hence, we propose that this approach of explainability can be applied to any deep visual saliency model for interpretation which makes it a generic one.
Article
Visual saliency prediction plays an important role in Unmanned Aerial Vehicle (UAV) video analysis tasks. In this paper, an efficient saliency prediction model for UAV video is proposed based on spatial–temporal features, prior information, and the relationship between frames. It achieves high efficiency through a simplified network design. Since UAV videos usually cover a wide range of scenes containing various background disturbances, a cascading architecture module is proposed for coarse-to-fine feature extraction, in which a saliency-related feature sub-network is used to obtain useful clues from each frame and a new convolution block is designed to capture spatial–temporal features. This structure achieves strong performance and high speed within a 2D CNN framework. Moreover, a multi-stream prior module is proposed to model the bias phenomenon in viewing behavior for UAV video scenes. It can automatically learn prior information from the video context and can also incorporate other priors. Finally, based on the spatial–temporal features and learned priors, a temporal weighted average module is proposed to model the inter-frame relationship and generate the final saliency map, which makes the generated saliency maps smoother in the temporal dimension. The proposed method is compared with 17 state-of-the-art models on two public UAV video saliency prediction datasets. The experimental results demonstrate that our model outperforms the other competitors. Source code is available at: https://github.com/zhangkao/IIP_UAVSal_Saliency.
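A temporal weighted average of this kind can be illustrated with a simple recency-weighted fusion of per-frame saliency maps; the exponential decay weights below are an assumption standing in for the weights learned by the cited model.

```python
import numpy as np

def temporal_weighted_average(saliency_maps, decay=0.6):
    """Fuse the saliency maps of the last few frames.

    saliency_maps: list of 2D arrays, oldest first. More recent maps get
    larger weights (decay**0 for the newest frame, decay**1 for the one
    before it, and so on), which smooths the prediction over time.
    """
    n = len(saliency_maps)
    weights = np.array([decay ** (n - 1 - i) for i in range(n)])
    weights /= weights.sum()
    fused = sum(w * m for w, m in zip(weights, saliency_maps))
    return fused / (fused.max() + 1e-12)    # renormalize to [0, 1]

maps = [np.random.rand(36, 64) for _ in range(4)]   # per-frame saliency maps
print(temporal_weighted_average(maps).shape)
```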
Conference Paper
Full-text available
Attention retargeting in images is a concept in which the content or composition of the image is altered in an effort to guide the viewer's attention to a specific location. In this paper, we propose a method that modifies the color of a selected region in an image to increase its saliency and draw attention towards it. To avoid many of the issues present in existing approaches to attention retargeting, including high computational complexity and unnatural-looking modifications to the images, we make the case for adjusting hue while leaving all remaining color components fixed. By representing the hue as an angle in CIE L*a*b* color space, we may express an adjustment in hue as a rotation in this space. The optimal hue adjustment is the rotation that maximizes the dissimilarity of the hue distribution of the selected region relative to its surroundings. We apply our method to a set of natural images and confirm its effectiveness in guiding attention through eye-tracking. The naturalness of the results is evaluated in a separate set of subjective experiments.
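The hue adjustment described above amounts to rotating the (a*, b*) components of the selected region by a common angle; the sketch below shows such a rotation, with the conversion to L*a*b* and the search for the optimal angle omitted, and the example pixel values chosen arbitrarily.

```python
import numpy as np

def rotate_hue_lab(lab_region, theta):
    """Rotate hue by angle theta (radians) in CIE L*a*b* space.

    lab_region: array of shape (..., 3) holding L*, a*, b* channels.
    Lightness L* and chroma sqrt(a*^2 + b*^2) are preserved; only the hue
    angle atan2(b*, a*) changes, which keeps the edit natural-looking.
    """
    L, a, b = lab_region[..., 0], lab_region[..., 1], lab_region[..., 2]
    a_new = a * np.cos(theta) - b * np.sin(theta)
    b_new = a * np.sin(theta) + b * np.cos(theta)
    return np.stack([L, a_new, b_new], axis=-1)

region = np.array([[[60.0, 20.0, 10.0]]])   # a single L*a*b* pixel
print(rotate_hue_lab(region, np.pi / 4))
```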
Conference Paper
Full-text available
In this paper we present two compressed-domain features that are highly indicative of saliency in natural video. Their potential to predict saliency is demonstrated by comparing their statistics around human fixation points in a number of videos against the control points selected randomly away from fixations. Using these features, we construct a simple and effective saliency estimation method for compressed video, which utilizes only motion vectors, block coding modes and coded residuals from the bitstream, with partial decoding. The proposed algorithm has been extensively tested on two ground truth datasets using several accuracy metrics. The results indicate its superior performance over several state-of-the-art compressed-domain and pixel-domain algorithms for saliency estimation.