Presented at IEEE ICME 2011, Barcelona, Spain, July 2011.
SALIENCY-PRESERVING VIDEO COMPRESSION
Hadi Hadizadeh and Ivan V. Bajić
School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada
ABSTRACT
In region-of-interest (ROI) video coding, the part of the frame
designated as ROI is encoded with higher quality relative to
the rest of the frame. At low bit rates, coding artifacts in non-
ROI parts of the frame may become salient and draw user’s
attention away from ROI, thereby degrading visual quality.
In this paper we propose a saliency-preserving framework for
ROI video coding. This approach aims at reducing attention-
grabbing visual artifacts in non-ROI parts of the frame in or-
der to keep user’s attention on ROI. Experimental results in-
dicate that the proposed method is able to improve the visual
quality of ROI video at low bit rates.
Index Terms— ROI video coding, visual attention model, visual coding artifacts, saliency
1. INTRODUCTION
Video compression standards such as MPEG-4 and H.26x
have been developed to achieve high compression efficiency
simultaneously with high perceived visual quality [1], [2].
However, lossy compression techniques may produce various
coding artifacts such as blockiness, ringing, blur, etc., espe-
cially at low bit rates [3]. Several methods have been pro-
posed to detect and reduce coding artifacts [4], [5], [6].
Recently, region-of-interest (ROI) coding of video using
computational models of visual attention has been recognized
as a promising approach to achieve high-performance video
compression [7],[8]. The idea behind most of these meth-
ods is to encode a small area around the predicted attention-
grabbing (salient) regions with higher quality compared to
other less visually important regions. Such a spatial priori-
tization is supported by the fact that only a small region of 2°–5° of visual angle around the center of gaze is perceived with high spatial resolution, due to the highly non-uniform distribution of photoreceptors on the human retina [7].
Granting a higher priority to the salient regions, how-
ever, may produce severe coding artifacts in areas outside the
salient regions where the image quality is lower. Such ar-
tifacts may draw viewer’s attention away from the naturally
salient regions, thereby degrading the perceived visual qual-
ity. To mitigate this problem, in this paper, we introduce the
concept of saliency-preserving video compression as a new
paradigm for video compression which attempts to suppress
such attention-grabbing artifacts, and keep user’s attention on
the same regions that were salient before compression. At the
same time, some artifacts may be tolerated, so long as they do
not draw attention.
Using this concept, we propose a novel algorithm for
saliency-preserving video compression within a ROI coding
framework. In the proposed algorithm, the visibility of poten-
tial coding artifacts is predicted by computing the difference
between the saliency map of the original raw video frames
and the saliency map of the encoded frames. The quantization
parameters (QPs) of individual macroblocks (MBs) are then
adjusted according to the obtained saliency errors, so that the
total saliency error is reduced while the bit rate constraint is
satisfied. To achieve this goal, the problem is formulated as
a Multiple-Choice Knapsack problem (MCKP) [9]. Exper-
imental results indicate that the proposed method is able to
improve the visual quality of encoded video compared to the
conventional rate-distortion optimized (RDO) video, as well
as the conventional ROI-coded video.
Note that a visible artifact is not necessarily salient. A
particular artifact may be visible if the user is looking directly
at it or at its neighborhood, but may go unnoticed if it is non-
salient and the user is looking elsewhere in the frame. As
the severity of the artifact increases, it may become salient
and draw user’s attention to it. Although several methods
have been developed for detecting visible (but not necessar-
ily salient) artifacts [6], in our work, the concept of visual
saliency is used to minimize salient coding artifacts, i.e., those
coding artifacts that may grab user’s attention.
The paper is organized as follows. In Section 2, a recent
ROI-coding algorithm is described, followed by a brief sum-
mary of the MCKP problem. The proposed method is pre-
sented in Section 3. Experimental results are given in Section
4, and the conclusions are drawn in Section 5.
2. PRELIMINARIES
2.1. ROI Video Coding
In [10], a ROI bit allocation scheme was proposed for
H.264/AVC. In this scheme, after detecting the ROI, several
coding parameters including QP, macroblock (MB) coding
modes, the number of reference frames, accuracy of motion
vectors, and the search range for motion estimation, are adap-
tively adjusted at the MB level according to the relative im-
portance of each MB and a given target bit rate. Subsequently,
the encoder allocates more resources, such as bits and com-
putational power, to the ROI. In [10], the optimized QP value for each MB is obtained as

$$Q_p[i] = \frac{X_1[i]}{\dfrac{T[i]}{N-i} - X_2[i]} \, \sqrt{\mathrm{MAD}_{\mathrm{pred,adapt}}[i] \sum_{k=i}^{N} \frac{w[k]}{w[i]} \, \mathrm{MAD}_{\mathrm{pred,adapt}}[k]} \qquad (1)$$

where $T[i]$ is the number of bits remaining before encoding the $i$-th MB, $N$ is the total number of MBs in a frame, $X_1[i]$ and $X_2[i]$ are the first-order and zero-order parameters of the R-Q model [10], [11], $\mathrm{MAD}_{\mathrm{pred,adapt}}[i]$ is the adaptive mean absolute difference (MAD) prediction value, and $w[i]$ is the importance level associated with the $i$-th MB.
In order to combine this ROI coding approach with the concept of visual attention, we encode the input video based on the saliency maps produced by the Itti–Koch–Niebur (IKN) saliency model from [12]. The saliency map of each frame is first remapped to the range $[-0.5, 0.5]$. Then, the saliency value of the $i$-th MB, computed as the average saliency value of its pixels, is used as the importance level $w[i]$ of that MB. Given the target bit rate, the QP value of each MB can be obtained using (1). As in [10], the obtained QP value is further bounded to within $\pm 4$ of the QP value of the previously encoded MB in order to maintain visual smoothness and suppress blocking artifacts. Other ROI bit allocation schemes (e.g., [13]) can also be employed to find the initial set of QP values. The proposed method starts with a set of QP values obtained by ROI bit allocation, and then modifies them in a way that minimizes the saliency error.
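As a rough illustration of the MB-level importance computation, the following sketch (our own construction; the exact normalization applied before the remapping is not specified in the paper, so the min–max step here is an assumption) remaps a saliency map to $[-0.5, 0.5]$ and averages it over 16×16 blocks to obtain the per-MB importance levels $w[i]$:

```python
import numpy as np

def mb_importance(saliency_map):
    """Per-MB importance w[i]: remap a pixel-level saliency map to
    [-0.5, 0.5] (min-max normalization is an assumption) and average
    over 16x16 macroblocks."""
    s = saliency_map.astype(np.float64)
    s = (s - s.min()) / max(s.max() - s.min(), 1e-12) - 0.5
    H, W = s.shape[0] // 16, s.shape[1] // 16       # frame size in MBs
    return s[:H * 16, :W * 16].reshape(H, 16, W, 16).mean(axis=(1, 3))
```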
2.2. Multiple Choice Knapsack Problem
The Multiple Choice Knapsack Problem (MCKP) is a gener-
alization of the ordinary knapsack problem, where the set of
items is partitioned into classes. The binary choice of tak-
ing an item is replaced by the selection of exactly one item
out of each class [9]. Consider $K$ mutually disjoint classes $C_1, C_2, \ldots, C_K$ of items to be packed into a knapsack of capacity $c$. Each item $m \in C_k$ is associated with a profit $p_{km}$ and a weight $w_{km}$. The goal is to choose exactly one item from each class such that the profit sum is maximized while the total weight is kept below the capacity $c$. Let $x_{km}$ be the binary indicator of whether item $m$ is chosen in class $C_k$. The MCKP is formulated as follows [9]:

$$\begin{aligned}
\text{maximize} \quad & \sum_{k=1}^{K} \sum_{m \in C_k} p_{km} x_{km} \\
\text{subject to} \quad & \sum_{k=1}^{K} \sum_{m \in C_k} w_{km} x_{km} \le c, \\
& \sum_{m \in C_k} x_{km} = 1, \quad k = 1, 2, \ldots, K, \\
& x_{km} \in \{0, 1\}, \quad k = 1, 2, \ldots, K, \; m \in C_k.
\end{aligned} \qquad (2)$$
MCKP is an NP-hard problem [9]. However, an exact
solution can be obtained in a reasonable time for the problem
sizes encountered in the proposed method using the algorithm
proposed in [14].
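The paper solves the MCKP exactly with the algorithm of [14]. As a hedged illustration of what an exact solver involves (a plain pseudo-polynomial dynamic program, not the minimal algorithm of [14]), the following sketch solves (2) when the weights and the capacity are non-negative integers, which holds in our setting because the weights are bit counts:

```python
import math

def solve_mckp(profits, weights, capacity):
    """Exact DP for the MCKP in (2). profits[k][m] and weights[k][m] are the
    profit and integer weight of item m in class C_k; exactly one item per
    class is chosen, with total weight <= capacity. Returns (best_profit,
    choice) with choice[k] the chosen item, or (None, None) if infeasible.
    Time: O(K * capacity * max class size), i.e., pseudo-polynomial."""
    K, NEG = len(profits), -math.inf
    best = [NEG] * (capacity + 1)          # best[c]: max profit at weight c
    best[0] = 0.0
    parents = []                           # back-pointers, one table per class
    for k in range(K):
        new, par = [NEG] * (capacity + 1), [None] * (capacity + 1)
        for c in range(capacity + 1):
            if best[c] == NEG:
                continue
            for m, (p, w) in enumerate(zip(profits[k], weights[k])):
                if c + w <= capacity and best[c] + p > new[c + w]:
                    new[c + w] = best[c] + p
                    par[c + w] = (m, c)
        best = new
        parents.append(par)
    c = max(range(capacity + 1), key=lambda i: best[i])
    if best[c] == NEG:
        return None, None                  # no feasible selection exists
    total, choice = best[c], [0] * K
    for k in reversed(range(K)):           # walk the back-pointers
        choice[k], c = parents[k][c]
    return total, choice

# Toy instance: two classes (MBs), three items (QP offsets) each; profits are
# negated saliency errors, weights are bits, capacity is the bit budget.
print(solve_mckp([[-3.0, -1.5, -4.0], [-2.0, -2.5, -1.0]],
                 [[100, 140, 80], [90, 120, 150]], capacity=250))
# -> (-3.5, [1, 0])
```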
3. THE PROPOSED METHOD
We now present the proposed algorithm for saliency-
preserving video coding. In the sequel, capital bold letters (e.g., X) denote matrices, lowercase bold letters (e.g., x) denote vectors, and italic letters (e.g., x) represent scalars.
Consider an uncompressed video sequence consisting of $N$ frames $\{F_1, F_2, \ldots, F_N\}$, where each frame is $W$ MBs wide and $H$ MBs high. Let $B_T$ be the target number of bits we wish to spend on encoding these frames. For each frame $F_i$, a visual saliency map $S_i$ of the same size as $F_i$ is computed by a chosen visual attention model, in which the saliency of each 16×16 block (obtained as the average saliency of the pixels within the block) determines the visual importance of the corresponding MB in $F_i$. The proposed method consists of the following steps.
Step 1) The current frame $F_i$ is first encoded by a ROI encoder (e.g., the one described in Section 2.1) using its original saliency map $S_i$. The encoded frame is then decoded, and the saliency map of the decoded frame, $\tilde{S}_i$, is computed. Let $Q_i$ be the QP matrix of $F_i$ obtained by the ROI encoder, and $B_i$ be the matrix containing the actual number of bits spent on each MB of the encoded $F_i$. Both $Q_i$ and $B_i$ are of size $W \times H$ (in MBs). $Q_i(x, y)$ is the QP value of the MB at position $(x, y)$, and $B_i(x, y)$ is the number of bits of the MB at position $(x, y)$. The total number of bits $B_i$ spent on encoding frame $F_i$ is $B_i = \sum_{y=1}^{H} \sum_{x=1}^{W} B_i(x, y)$. Note that $B_T = \sum_{i=1}^{N} B_i$. Due to quantization, $S_i$ is, in general, different from $\tilde{S}_i$. Let $E_i = S_i - \tilde{S}_i$ be the saliency error matrix. We store $E_i$ and $B_i$ for subsequent optimization.
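In matrix form, the Step 1 bookkeeping amounts to the following (a sketch with stand-in random data for a 9×11-MB QCIF frame; in the real system the saliency matrices come from the attention model applied to the raw and decoded frames, and the bit matrix from the encoder):

```python
import numpy as np

rng = np.random.default_rng(0)
S_mb = rng.random((9, 11)) - 0.5                       # per-MB saliency of F_i (stand-in)
S_dec_mb = S_mb + 0.05 * rng.standard_normal((9, 11))  # saliency of decoded F_i (stand-in)
B_mb = rng.integers(50, 400, size=(9, 11))             # bits spent per MB (stand-in)

E_i = S_mb - S_dec_mb              # saliency error matrix E_i = S_i - S~_i
B_i = int(B_mb.sum())              # total bits B_i spent on frame F_i
L1_err = float(np.abs(E_i).sum())  # the L1-norm that Step 2 tries to reduce
```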
Our goal is to modify some elements of $Q_i$ such that if $F_i$ is re-encoded with the modified $Q_i$, the $L_1$-norm of its saliency error matrix $E_i$ is decreased. The QP values can be changed by offsets from a set $O = \{o_1, o_2, \ldots, o_M\}$ of size $M$, whose elements are positive or negative integers. We always set $o_1 = 0$, so that one of the options corresponds to not changing the QP value.
Due to spatial intra-prediction, the bit rates of neighboring MBs are dependent. Moreover, since saliency is computed over a neighborhood, the saliency value of an MB is affected by the QPs of its neighbors. Modeling such dependence is not an easy task. To overcome this difficulty, we use the following approach. Let $P$ be the set of all binary matrices (i.e., whose elements are either 0 or 1) of size $W \times H$ that have the following property: there are exactly two zeros between every two non-zero elements in both the horizontal and vertical directions. In total, there are 9 such matrices, i.e., $P = \{P_1, P_2, \ldots, P_9\}$. Let $K_n$ be the number of 1's in $P_n$. Each element of $P_n$ is identified with an MB in a frame. At any one time, we only change the QPs of MBs identified with 1's. Since any two non-zero elements of each $P_n$ are at least two positions apart, the dependence between the MBs corresponding to those non-zero elements is reduced, so we can change their QPs without significantly affecting the other MBs selected by the same $P_n$. As an illustration, the binary masks corresponding to $P_n$, $n = 1, 2, \ldots, 9$, are shown in Fig. 1 for a QCIF-resolution frame, which contains 9×11 MBs.
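The nine matrices can be generated mechanically as the nine phase shifts of a stride-3 grid, which guarantees that any two selected MBs are at least two MBs apart in each direction (a sketch; make_masks is our own helper name):

```python
import numpy as np

def make_masks(W, H):
    """The nine binary H x W (in MBs) matrices P_1..P_9: 1's on a stride-3
    grid, one mask per (row, column) phase shift."""
    masks = []
    for dy in range(3):
        for dx in range(3):
            P = np.zeros((H, W), dtype=np.int32)
            P[dy::3, dx::3] = 1
            masks.append(P)
    return masks

masks = make_masks(W=11, H=9)            # QCIF: 9 x 11 macroblocks
print([int(P.sum()) for P in masks])     # K_n: 12 in six cases, 9 in three
```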
Step 2) The current frame $F_i$ is then re-encoded in the following manner: first, a binary matrix $P_n \in P$ and a QP offset $o_m \in O$ are chosen. A new QP matrix $Q_i^{mn}$ is computed as $Q_i^{mn} = Q_i + o_m P_n$, where the superscript in $Q_i^{mn}$ indicates that the new QP matrix has been obtained with offset $o_m$ and binary mask $P_n$. All elements of $Q_i^{mn}$ are passed through a hard-limiter to ensure that all QP values lie in the range $[0, 51]$, as required by H.264/AVC. $F_i$ is then re-encoded with $Q_i^{mn}$, and its saliency map $\tilde{S}_i^{mn}$, saliency error matrix $E_i^{mn}$, and bit matrix $B_i^{mn}$ are stored. In this step, rate control is disabled to prevent further modification of the QP values by the encoder. Note that since $o_1 = 0$, $Q_i^{1n} = Q_i$, so $E_i^{1n} = E_i$ and $B_i^{1n} = B_i$, where $E_i$ and $B_i$ were computed in Step 1. Therefore, this procedure does not need to be performed for offset $o_1$, but is applied to all other offsets $o_j$, $j \ge 2$, using the selected $P_n$. At the end of this step, we obtain $M$ different QP values, saliency errors, and actual bit counts for each MB in $F_i$ for which the corresponding element in $P_n$ is non-zero. Let $G$ be the set of locations $(x, y)$ of such MBs in $F_i$. Since there are $K_n$ non-zero elements in $P_n$, $G$ also contains $K_n$ elements.
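For a given pair $(o_m, P_n)$, the QP update and hard-limiting reduce to one clipped matrix operation (a sketch for NumPy integer matrices):

```python
import numpy as np

def perturbed_qp(Q_i, P_n, o_m):
    """Q_i^{mn} = clip(Q_i + o_m * P_n, 0, 51): apply offset o_m only at the
    MBs selected by the binary mask P_n, hard-limited to the H.264/AVC
    QP range [0, 51]."""
    Q = Q_i.astype(np.int32) + int(o_m) * P_n.astype(np.int32)
    return np.clip(Q, 0, 51)
```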
Fig. 1. An illustration of the nine binary masks (three in each row) corresponding to $P_n$, $n = 1, 2, \ldots, 9$, for QCIF resolution. Black squares indicate the positions of 1's in $P_n$, while white squares indicate the positions of 0's. For QCIF resolution, the number of black squares is $K_n = 12$ in six cases, and $K_n = 9$ in three cases.

We now want to find the best QP offset for each MB in $G$ among the obtained $M$ options, such that if the chosen QP offsets are applied to $Q_i$, the $L_1$-norm of the saliency error is minimized while the resulting number of bits of the encoded frame remains at or below $B_i$. To achieve this goal, we model the problem as a Multiple-Choice Knapsack Problem (MCKP) [9]. Here, each class is one MB in $G$ (so we have a total of $K_n$ classes), and each item in a class is a QP offset (hence, $M$ items per class). We then consider a 2D window of size 3×3 (in MBs) around the $k$-th MB in $G$ ($k = 1, 2, \ldots, K_n$), and compute the total saliency error $e_{ikmn}^{tot}$ and the total number of bits $b_{ikmn}^{tot}$ within this window as follows:
$$e_{ikmn}^{tot} = \sum_{(x,y) \in N(k)} \left| E_i^{mn}(x, y) \right|, \qquad b_{ikmn}^{tot} = \sum_{(x,y) \in N(k)} B_i^{mn}(x, y), \qquad (3)$$
where $N(k)$ denotes the neighborhood around the $k$-th MB in $G$, and $(x, y)$ denotes the MB position within $F_i$. Note that, as mentioned earlier, when the QP of an MB is changed, not only do the saliency error and bits of that MB change, but the saliency error and bits of its neighbors may change as well. For this reason, we compute the total saliency error and the actual number of bits of all MBs within a window around the $k$-th MB, and treat them as a generalized saliency error and total bit count of the $k$-th MB.
The idea here is to cover the whole frame by non-overlapping windows around all MBs in $G$. Since there are at least two MBs between each pair of MBs in $G$, there is no gap between any pair of 3×3 windows surrounding the MBs in $G$. For some MBs in $G$, parts of the window may fall outside the frame; in such cases, we consider only the parts that are inside the frame. In other cases, the 3×3 window around the first or the last MB in a row or column might not touch the frame boundary (e.g., the black MB in the bottom-right corner of the first mask in Fig. 1); in such cases, the window is expanded up to the frame boundary. Covering the entire frame by such windows allows us to use $B_i$ as the capacity of the knapsack. Note that $\sum_{k=1}^{K_n} b_{ikmn}^{tot}\big|_{m=1} = B_i$, because for $m = 1$, $o_1 = 0$.
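Under the clipping-and-expansion rule just described, the windows partition the frame, so the per-window bit totals add up to $B_i$. The following sketch computes the window totals in (3) with that boundary handling (window_totals is our own helper; E and B are the per-MB saliency-error and bit matrices for one $(m, n)$ pair, and the absolute value reflects the $L_1$ criterion being minimized):

```python
import numpy as np

def window_totals(E, B, P_n):
    """For each MB selected by P_n, sum |saliency error| and bits over its
    3x3 MB window, stretching boundary windows so that the windows tile the
    whole frame (a sketch of eq. (3) plus the expansion rule above)."""
    H, W = E.shape
    rows = sorted(set(np.nonzero(P_n)[0]))
    cols = sorted(set(np.nonzero(P_n)[1]))

    def spans(centers, size):
        # half-open 3-wide span per center, stretched to cover [0, size)
        return {c: (0 if i == 0 else c - 1,
                    size if i == len(centers) - 1 else c + 2)
                for i, c in enumerate(centers)}

    rspan, cspan = spans(rows, H), spans(cols, W)
    e_tot, b_tot = {}, {}
    for r in rows:
        for c in cols:
            (r0, r1), (c0, c1) = rspan[r], cspan[c]
            e_tot[(r, c)] = float(np.abs(E[r0:r1, c0:c1]).sum())
            b_tot[(r, c)] = int(B[r0:r1, c0:c1].sum())
    return e_tot, b_tot
```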
Having computed $e_{ikmn}^{tot}$ and $b_{ikmn}^{tot}$, the negative of $e_{ikmn}^{tot}$ is considered as the profit ($p_{km} = -e_{ikmn}^{tot}$ in (2)), and $b_{ikmn}^{tot}$ is regarded as the weight ($w_{km} = b_{ikmn}^{tot}$ in (2)) of the $m$-th offset at the $k$-th MB within the $i$-th frame when using $P_n$. $B_i$ is set as the capacity of the knapsack ($c = B_i$ in (2)), and the MCKP is solved. MCKP chooses exactly one item (in our case, one offset) per class (in our case, per MB) such that the total profit is maximized (in our case, the saliency error is minimized) while the total weight (in our case, the bit count) remains at or below $B_i$.
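Putting the pieces together, the MCKP instance for one mask $P_n$ can be assembled as follows (a hypothetical glue sketch: e_tot[m] and b_tot[m] are the dictionaries returned by window_totals for offset $o_m$, solve_mckp is the dynamic-program sketch from Section 2.2, O is the offset set, and B_i the bit budget of the frame):

```python
mbs = sorted(e_tot[0].keys())     # the K_n MBs in G, in a fixed order
profits = [[-e_tot[m][k] for m in range(len(O))] for k in mbs]   # p_km
weights = [[b_tot[m][k] for m in range(len(O))] for k in mbs]    # w_km
_, choice = solve_mckp(profits, weights, capacity=B_i)
best_offset = {k: O[choice[j]] for j, k in enumerate(mbs)}       # chosen o_m per MB
```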
The obtained QP offsets are then applied to the original QP matrix $Q_i$, and the current frame is re-encoded with the updated QP matrix $Q_i^n$. Finally, the new saliency error matrix of the encoded frame is computed, and its $L_1$-norm $L_1^n$, as well as the obtained QP matrix $Q_i^n$, are stored. This procedure is repeated for each matrix $P_n$, $n = 1, 2, \ldots, 9$.
Step 3) At the end of Step 2, we have nine saliency error $L_1$-norms $L_1^1, L_1^2, \ldots, L_1^9$ and the corresponding QP matrices. The QP matrix whose saliency error $L_1$-norm is the smallest is chosen as the final QP matrix $Q_i^*$ for frame $F_i$. Finally, $F_i$ is encoded using $Q_i^*$, and the encoder moves on to the next frame.
Note that in the proposed algorithm, whenever the QP of an MB is changed, the rate-distortion optimized (RDO) mode decision [2] is employed to obtain the optimal prediction mode and MB type. Therefore, the total number of bits of each MB is computed after the RDO mode decision. Any potential underflow (overflow) in the number of bits is added to (subtracted from) the knapsack capacity of the subsequent frame, thereby preserving the total rate assigned during the initial ROI bit allocation. Algorithm 1 summarizes the proposed method. In our current implementation, each video frame is encoded $K = (M - 1) \times 9 + 1$ times (e.g., $K = 19$ for the $M = 3$ offsets used in Section 4), the saliency map of the frame is computed $K$ times, and the MCKP is solved nine times. Hence, in its current implementation, the proposed method is only suitable for offline applications. However, the multiple frame encodings and saliency computations could be avoided by using a suitable model of the relationship between QP values and saliency values; our current work is focused on developing such a model. As for solving the MCKP, in our simulations the algorithm from [14] takes an average of about 200 ms per frame (on an Intel Core 2 Duo processor at 3.33 GHz with 8 GB RAM).
4. RESULTS AND DISCUSSION
To evaluate the proposed method, we used three standard CIF sequences (Soccer, Crew, and Bus). All sequences were 100 frames long at 30 frames per second (fps) and were encoded using the JM 9.8 reference software [15]. Soccer and Bus were encoded at 50 kbps, and Crew was encoded at 100 kbps. We used these relatively low bit rates to make the coding artifacts more visible for display purposes. The GOP structure was set to IPPPP. The IKN model [12] was utilized to generate the saliency maps. In all experiments reported here, only three QP offsets were employed: $O = \{0, 1, -1\}$.
Input: raw frame F_i
Output: encoded frame F_i

    Encode F_i using the ROI encoder
    Compute E_i, B_i, and Q_i
    L_max = MAXINT
    foreach P_n in P do
        E_i^{1n} = E_i and B_i^{1n} = B_i
        foreach o_m in O \ {o_1} do
            Encode F_i using Q_i^{mn} = Q_i + o_m * P_n
            Compute E_i^{mn} and B_i^{mn}, and store them
        end
        Run MCKP using E_i^{mn} and B_i^{mn}, m = 1, 2, ..., M
        Encode F_i using the Q_i^n obtained by MCKP
        Compute the new E_i
        if L1(E_i) <= L_max then
            Q_i^* = Q_i^n
            L_max = L1(E_i)
        end
    end
    Encode F_i using Q_i^*

Algorithm 1: The proposed algorithm for saliency-preserving video coding
We compare the proposed saliency-preserving ROI (SP-ROI) coding to conventional ROI coding using three metrics. The first metric, $\Delta L_1(E)$, is computed as

$$\Delta L_1(E) = \frac{E_{SPROI} - E_{ROI}}{E_{ROI}}, \qquad (4)$$

where

$$E_{SPROI} = \frac{1}{N} \sum_{i=1}^{N} L_1\!\left(E_i^{SPROI}\right), \qquad E_{ROI} = \frac{1}{N} \sum_{i=1}^{N} L_1\!\left(E_i^{ROI}\right), \qquad (5)$$
$N$ is the total number of frames, and $E_i^{SPROI}$ and $E_i^{ROI}$ are the saliency error maps of the $i$-th frame encoded by the SP-ROI and ROI coding methods, respectively. This metric indicates how much the total saliency error of the video encoded by the SP-ROI method differs from that of the video encoded by the ROI coding method.
To measure the propensity of coding artifacts outside the ROI to draw the user's attention, we first binarize the saliency map of the original raw frames using a specific threshold in order to obtain an estimate of the location of the ROI. This threshold is set to the 75th percentile of the saliency map; all MBs whose saliency is larger than this threshold are considered part of the ROI. We then define two new metrics, $\Delta L_1(E')$ and $\Delta J$, where $\Delta L_1(E')$ is computed as $\Delta L_1(E)$ in (4), except that the saliency errors in (5) are taken only over the MBs outside the ROI. Meanwhile, $\Delta J$ is computed as $\Delta J = (J_{SPROI} - J_{ROI}) / J_{ROI}$, where $J_{SPROI}$ and $J_{ROI}$ are the fractions of pixels outside the ROI whose absolute quantization error exceeds the just-noticeable-difference (JND) threshold [16] of the corresponding pixels, in frames encoded by the SP-ROI and ROI coding methods, respectively. To compute the JND thresholds, we employed the spatial JND model proposed in [17] in the luminance (Y) channel. Note that the JND threshold determines the visibility threshold of a quantization error: a quantization error is visible if its magnitude is greater than the JND threshold.
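The ROI estimate and the relative metrics are simple to compute (a sketch; the function names are ours):

```python
import numpy as np

def roi_mask(saliency_map):
    """Binarize the original frame's saliency at its 75th percentile;
    MBs above the threshold constitute the ROI estimate."""
    return saliency_map > np.percentile(saliency_map, 75)

def delta_metric(value_sp_roi, value_roi):
    """Relative change of SP-ROI vs. ROI, as in (4); negative means the
    SP-ROI value (saliency error, or fraction above JND) is lower."""
    return (value_sp_roi - value_roi) / value_roi
```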
Table 1. The performance of SP-ROI relative to ROI video coding.

Sequence   ΔL1(E)     ΔL1(E')    ΔJ        ΔROI-PSNR
Soccer     -6.27%     -11.00%    -2.65%    -0.17 dB
Crew       -1.01%     -2.11%     -5.23%    -0.18 dB
Bus        -11.57%    -5.96%     -2.83%    -0.09 dB

Table 2. Average PSNR-Y of SP-ROI and ROI coding relative to RDO coding.

Method    Soccer     Crew       Bus
ROI       -0.08 dB   -0.22 dB   -0.10 dB
SP-ROI    -0.09 dB   -0.10 dB   -0.08 dB
Table 1 compares the SP-ROI method with the ROI coding method using the three aforementioned metrics, as well as the average peak signal-to-noise ratio (PSNR) of the Y component within the ROI. The values of $\Delta L_1(E)$ indicate that the saliency error of SP-ROI over the entire frame is lower than that of ROI coding, which was the design goal. $\Delta L_1(E')$ shows that outside the ROI, the saliency error with SP-ROI coding is lower, indicating that non-ROI regions are less likely to become salient after encoding. Finally, the values of $\Delta J$ show that the percentage of pixels outside the ROI whose quantization error is above the JND threshold is lower with SP-ROI coding than with conventional ROI coding. Overall, the proposed SP-ROI method reduces both the saliency and the visibility of coding artifacts compared to conventional ROI coding. This comes at the cost of a slightly reduced PSNR within the ROI, as indicated in the last column of the table.
In Table 2, the average PSNR performance of the SP-ROI and ROI coding methods is compared against RDO coding. These PSNR values were obtained by averaging the PSNR over all frames of the corresponding sequence. As seen from this table, the average PSNR of both SP-ROI and ROI coding is lower than that of RDO coding, as expected. However, as illustrated in the next example, both SP-ROI and ROI coding provide better visual quality than RDO. All of the above results were obtained after matching the bit rates of the ROI-coded and RDO-coded videos to the actual bit rate of the video encoded by the SP-ROI method to within ±0.1%. Tables 3 and 4 show, respectively, the average structural similarity (SSIM) index [18] and the average Video Quality Metric (VQM) value [19], computed over all frames of each sequence. As seen from these results, the proposed method provides higher visual quality, as measured by both of these metrics, compared to the conventional RDO and ROI methods.

Table 3. Average SSIM index.

Method    Soccer    Crew      Bus
RDO       0.6200    0.7871    0.4864
ROI       0.6231    0.7897    0.4846
SP-ROI    0.6378    0.7926    0.5100

Table 4. Average VQM values.

Method    Soccer     Crew       Bus
RDO       3.45169    2.55774    7.60255
ROI       3.47870    2.51918    7.60608
SP-ROI    3.79243    2.57792    7.71076
Fig. 2 compares the visual quality of the three methods on a sample frame of Soccer. As seen from these figures, the proposed SP-ROI coding improves the visual quality of the encoded frames by reducing the visibility of the coding artifacts. In particular, note that the coding artifacts around the ball and the player's feet have been reduced compared to both the RDO-coded and the ROI-coded frames. At the same time, the visual quality of the conventional ROI-coded frame is slightly better than that of the RDO-coded frame.
5. CONCLUSION
In this paper, we introduced the concept of saliency-
preserving video coding, and proposed a novel ROI coding
method that attempts to preserve the saliency of the original
video frames. Experimental results were presented using the
saliency model from [12], although the proposed method is
generic and can utilize any other visual attention model. The
results indicate that the proposed method is able to improve
the visual quality of encoded video compared to conventional
ROI and RDO video coding at low bit rates.
References
[1] M. Ghanbari, Video Coding: An Introduction to Standard
Codecs, London, U.K. : Institution of Electrical Engineers,
1999.
[2] I. E. G. Richardson, H.264 and MPEG-4 Video Compression:
Video Coding for Next-Generation Multimedia, NJ:Wiley,
2003.
[3] M. Yuen and H. R. Wu, “A survey of hybrid MC/DPCM/DCT video coding distortions,” Signal Process., vol. 70, no. 3, pp. 247–278, 1998.
[4] H. Liu, N. Klomp, and I. Heynderickx, “A perceptually relevant approach to ringing region detection,” IEEE Trans. Image Process., vol. 19, no. 6, pp. 1304–1318, Jun. 2010.
[5] M. Shen and C. J. Kuo, “Review of postprocessing techniques for compression artifact removal,” J. Vis. Commun. Image Rep., vol. 9, no. 1, pp. 2–14, 1998.
[6] S. Daly, “The visible difference predictor: an algorithm for the assessment of image fidelity,” in Digital Images and Human Vision, A. B. Watson, Ed. MIT Press, 1993, pp. 179–206.
[7] L. Itti, “Automatic foveation for video compression using a
neurobiological model of visual attention,” IEEE Trans. Image
Process., vol. 13, no. 10, pp. 1304–1318, 2004.
[8] Z. Chen, K. N. Ngan, and W. Lin, “Perceptual video coding: Challenges and approaches,” in Proc. IEEE International Conference on Multimedia and Expo (ICME’10), Jul. 2010, pp. 784–789.
[9] H. Kellerer, U. Pferschy, and D. Pisinger, Knapsack Problems,
Springer, 2004.
[10] Y. Liu, Z. G. Li, and Y. C. Soh, “Region-of-interest based resource allocation for conversational video communication of H.264/AVC,” IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 1, pp. 134–139, Jan. 2008.
[11] Y. Liu, Z. G. Li, and Y. C. Soh, “A novel rate control scheme for low delay video communication of H.264/AVC standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 1, pp. 67–78, Jan. 2007.
[12] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based
visual attention for rapid scene analysis,” IEEE Trans. Pattern
Anal. Machine Intell., vol. 20, pp. 1254–1259, Nov. 1998.
[13] J.-C. Chiang, C.-S. Hsieh, G. Chang, F.-D. Jou, and W.-N.
Lie, “Region-of-interest based rate control scheme with flexi-
ble quality on demand,” in Proc. IEEE International Confer-
ence on Multimedia and Expo (ICME’10), Jul. 2010, pp. 238–
242.
[14] D. Pisinger, “A minimal algorithm for the multiple-choice knapsack problem,” European Journal of Operational Research, vol. 83, pp. 394–410, 1994.
[15] “The H.264/AVC JM reference software,” [Online]. Available: http://iphome.hhi.de/suehring/tml/.
[16] C.-H. Chou and Y.-C. Li, “A perceptually tuned subband image coder based on the measure of just-noticeable-distortion profile,” IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 6, pp. 467–476, Dec. 1995.
[17] X. Yang, W. Lin, Z. Lu, E. Ong, and S. Yao, “Motion-
compensated residue preprocessing in video coding based on
just-noticeable-distortion profile,” IEEE Trans. Circuits Syst.
Video Technol., vol. 15, no. 6, pp. 745–752, Jun. 2005.
[18] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[19] M. Pinson and S. Wolf, “A new standardized method for objectively measuring video quality,” IEEE Trans. Broadcast., vol. 50, no. 3, pp. 312–322, Sep. 2004.
Fig. 2. An example of the visual quality of different methods.
(a) original frame, (b) saliency map of the original frame, (c)
RDO-coded frame, (d) ROI-coded frame, (e) SP-ROI-coded
frame, (f) saliency error map of the RDO-coded frame, (g)
saliency error map of the ROI-coded frame, (h) saliency error
map of the SP-ROI-coded frame.