Saliency-Guided Just Noticeable Distortion Estimation Using the Normalized Laplacian Pyramid

Hadi Hadizadeh, Atiyeh Rajati, and Ivan V. Bajić, Senior Member, IEEE
Abstract—The human visual system (HVS), like any other
physical system, has limitations. For instance, it is known that
the HVS can only sense the content changes that are larger than
the so-called just noticeable distortion (JND) threshold. Also, to
reduce the computational load on the brain, the visual attention
mechanism is deployed such that regions with higher visual
saliency are processed with higher priority than other less-salient
regions. It is also known that visual saliency has a modulatory
effect on JND thresholds. In this letter, we present a novel pixel-
wise JND estimation method that considers the interplay between
visual saliency and JND thresholds. In the proposed method,
the largest JND thresholds of a given image are found such
that the perceptual distance between the image and its JND
noise-contaminated version is minimized in a perceptual space
defined by the coefficients of the image in a normalized Laplacian
pyramid. Experimental results indicate that the proposed method
outperforms four of the latest JND models for static images.
Index Terms—just noticeable distortion, visual saliency
I. INTRODUCTION
It is known that the human visual system (HVS) cannot
sense small visual variations whose amplitude is below
the so-called just noticeable distortion (JND) threshold due
to several physical limitations in the eyes and the brain [1],
[2]. JND modeling is widely used for perceptual redundancy
estimation in images/videos for a variety of different applica-
tions such as image/video coding and transmission [3], quality
assessment [4], watermarking [5], etc. Perceptual redundancies
in visual contents may also be produced by the visual attention
(VA) mechanism of the human brain [6]. VA provides an
automatic mechanism for selection of particular aspects of a
visual scene that are most relevant to our ongoing behavior
while eliminating interference from less relevant data so as to
reduce the computational load on the brain [6].
In the literature, several models have been developed for
JND estimation in images and videos in both the pixel and
subband domain [7]–[13]. For instance, in [9], a JND model
was proposed based on measuring edge and texture masking
[1]. Wu et al. modeled the fact that the HVS is insensitive to irregular visual content, and introduced a structure-uncertainty-based JND model in [10]. An enhanced pixel-wise JND estimation method was recently proposed in [7] based on measuring pattern complexity and luminance contrast.

Copyright (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org. H. Hadizadeh and A. Rajati are with the Quchan University of Advanced Technology, Quchan, Iran, and I. V. Bajić is with Simon Fraser University, Burnaby, BC, V5A 1S6, Canada. The corresponding author is H. Hadizadeh (h.hadizadeh@qiet.ac.ir).
According to current knowledge, it is believed that VA can be driven by “visual saliency”, which is a measure of the propensity for drawing VA to a specific location in a scene [6]. A region is said to be visually salient if it possesses certain characteristics that make it stand out from its surrounding regions and draw attention [14]. The existing computational models of visual saliency for static images [14]–[16] are able to produce a
saliency map by which salient regions in a given image can
be predicted automatically.
It is known that visual saliency has a modulatory effect on
JND thresholds [5], [8], [17]. Specifically, it is known that JND
thresholds in attended (or very salient) regions are smaller than
JND thresholds in un-attended (or less salient) regions [17].
Hence, to better estimate visual redundancies, it is reasonable
to consider the interplay between JND thresholds and saliency.
None of the above-mentioned JND models consider the effect
of visual saliency on JND thresholds.
In the literature, there are very few existing JND models
that consider visual saliency. Notable methods include the
ones proposed in [5], [17]. In these two methods, the JND
thresholds are scaled by a set of fixed linear saliency mod-
ulation functions. Recently, a saliency-modulated JND model
was proposed in [8] in the DCT (discrete cosine transform)
domain, in which the JND thresholds estimated by a DCT-
based JND model are scaled by two non-linear modulation
functions based on the visual saliency of the pixels in the
given image. The results reported in [8] indicated that this
method outperforms [17] and [5]. Hence, we consider [8] as
the representative of the earlier saliency-based JND models.
In this letter, we present a novel JND estimation method,
which takes the visual saliency information into account.
In the proposed method, a differentiable saliency-weighted
perceptual image quality metric is first defined to measure
the perceptual difference between two images decomposed by
a normalized Laplacian pyramid (NLP) [18], in which some
biological mechanisms in the early visual system (e.g., the
center-surround filtering and local gain control [1]) are sim-
ulated. The employed metric is designed such that it assigns
larger weights to NLP coefficients corresponding to pixels with
higher visual saliency and vice versa. The JND thresholds
are then considered as an invisible noise in the sense that
if the pixel values of a given image are increased/decreased
by their corresponding JND thresholds, then the resultant
noisy image should not be distinguishable from the original
pristine image. Based on this assumption, the largest JND
thresholds of a given image are then adaptively estimated such
that the perceptual distance between the original image and
the JND noise-contaminated image is minimized while the
invisible noise energy (or equivalently the amplitude of the
JND thresholds) is maximized. Note that the method proposed
in [8] is a non-adaptive method because the parameters of
the saliency modulation functions in [8] are fixed for all
images, where the utilized parameters may not always be the
best for all kinds of images. However, our proposed method
estimates the JND thresholds in an adaptive manner based
on the image content and its visual saliency. Experimental
results indicate that the proposed method outperforms four
of the latest JND models for static images including [8]. To
the best of our knowledge, we are the first to use a saliency-
weighted perceptual image quality metric for automatic JND
estimation, and this is the main contribution of this letter.
This letter is organized as follows. In Section II, the
proposed method is presented. The experimental results are
given in Section III followed by conclusions in Section IV.
II. THE PROPOSED METHOD
Let J be a gray-scale image for which we wish to estimate the JND map M. Suppose that M is available, and we generate an image I such that I = J + M. If the JND thresholds in M are accurate, then we expect that the perceptual quality of I is equal or very close to the perceptual quality of J. In this case, I can be considered as a noisy version of J, where the injected noise is not visible. In practice, we are interested in finding the largest possible JND thresholds to detect the largest possible perceptual redundancies. In other words, the best M is the one that maximizes the MSE (mean-squared error, or the noise energy) between I and J while keeping the perceptual quality of I equal (or very close) to the perceptual quality of J. To estimate M, we define the following cost function:
$$Q(I\,|\,J) = (1-\lambda)\,D(I,J) - \lambda\,\mathrm{MSE}(I,J), \qquad (1)$$

where D(I, J) is an image quality metric that measures the perceptual distance between I and J, and 0 < λ < 1 is a constant by which the weight and scale of the two terms are controlled. The best M can then be estimated by M = Î − J, where Î is obtained by the following minimization problem:

$$\hat{I} = \underset{I}{\operatorname{argmin}}\; Q(I\,|\,J), \quad \text{s.t.}\ \forall j:\ G_{\min} \le I_j \le G_{\max}, \qquad (2)$$
where I_j denotes the value of I at location j, and G_min and G_max are the minimum and maximum possible gray levels (i.e., 0 and 255 in our case). Assuming that there are M pixels in I and J,

$$\mathrm{MSE}(I,J) = \frac{1}{M}\sum_{m=1}^{M}\left(I_m - J_m\right)^2.$$

Note that (1) is minimized when D(I, J) is minimized while MSE(I, J) is maximized. Generally, when MSE(I, J) increases, D(I, J) may either increase or remain unchanged because the added noise (distortion) with larger energy may still remain invisible. At the optimum point, the value of Q(I|J) may be negative.
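For illustration (not part of the original letter), a minimal Python/NumPy sketch of the cost in (1); the function and variable names are ours, distance_fn stands for any perceptual distance such as the saliency-weighted NLP distance defined below, and λ = 0.01 follows the value reported later in Section II:

```python
import numpy as np

def mse(I, J):
    """Mean-squared error between two images of equal size."""
    return np.mean((I.astype(np.float64) - J.astype(np.float64)) ** 2)

def cost_Q(I, J, distance_fn, lam=0.01):
    """Cost of Eq. (1): favor small perceptual distance and large (invisible) noise energy.

    distance_fn(I, J) is a perceptual metric such as the saliency-weighted NLP
    distance D(I, J); lam follows the value reported later in the letter.
    """
    return (1.0 - lam) * distance_fn(I, J) - lam * mse(I, J)
```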
For computing D(I, J), we seek an image quality metric that has the following properties: 1) it must measure the perceptual dissimilarity between two images; 2) it should be simple to compute and differentiable so that it can be easily used in an optimization loop; 3) the saliency information can be integrated into it.

Fig. 1. The flowchart of the normalized Laplacian pyramid [19].

For this purpose, we utilized the
Normalized Laplacian Pyramid (NLP) [18], which is a multi-scale nonlinear representation that mimics the operations of the retina and lateral geniculate nucleus in the HVS. In [19], it was shown that distances measured between two images represented in the perceptual space defined by the NLP are highly correlated with human judgments. In fact, as will be discussed in the sequel, the NLP distance has all the above-mentioned properties for D(I, J). For computing the NLP distance, the pixels in J are first transformed using a power-law transformation to get x as x = J^γ. This simulates the transformation of light to voltage in retinal photoreceptors. As shown in Fig. 1, the NLP is then recursively built from x as [18]:

$$x^{(l+1)} = D\!\left(L\,x^{(l)}\right), \qquad z^{(l)} = x^{(l)} - L\,U\!\left(x^{(l+1)}\right),$$

where the superscript (l) denotes the l-th level of the pyramid, D(·) and U(·) indicate down- and up-sampling by a factor of 2, respectively, and L denotes the filtering operation by a spatially separable 5-tap filter, (0.05, 0.25, 0.4, 0.25, 0.05), as in [18]. Within each frequency channel (l), each coefficient of z^(l) is divided by a weighted local sum of the element-wise amplitude of the coefficients plus a positive constant σ as [18]:

$$y^{(l)} = z^{(l)} \div \left(\sigma + H \ast \big|z^{(l)}\big|\right),$$

where ÷ and ∗ denote pixel-wise division and linear convolution, respectively, and H is a local weighting filter. In fact, this equation implements the divisive normalization process, which is widely used for describing the responses of neurons in different parts of the visual system [20], [21]. Assuming that there are L pyramid levels, the set of NLP coefficients {y^(l); l = 1, ..., L} provides a perceptual representation of x [19]. Inspired by [22], to measure the perceptual distance between two images I and J, we first compute the absolute differences between the NLP coefficients of the two images within each frequency channel as d_i^(l) = |y_i^(l) − ŷ_i^(l)|, where y_i^(l) and ŷ_i^(l) denote the i-th NLP coefficient of I and J in the l-th channel. We then use the summation model proposed in [22] to compute a single distance value as follows. First, the ℓ_α norm of the calculated differences within each channel is computed. The ℓ_β norm is then used to combine the obtained values across all channels:

$$D(I,J) = \left[\frac{1}{L}\sum_{l=1}^{L}\left(\frac{1}{N_l}\sum_{i=1}^{N_l}\left(d_i^{(l)}\right)^{\alpha}\right)^{\!\frac{\beta}{\alpha}}\right]^{\frac{1}{\beta}}, \qquad (3)$$

where N_l is the number of NLP coefficients in the l-th channel.
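A minimal NumPy/SciPy sketch of the NLP decomposition and the unweighted distance in (3) may help to make the construction concrete. The parameter values are those reported later in Section II; the boundary handling, the zero-insertion up-sampling filter, and the scaling of the input to [0, 1] are our own assumptions (the letter defers such details to [18]), and all names are ours:

```python
import numpy as np
from scipy import ndimage

# Parameters reported in Section II of the letter.
GAMMA, SIGMA, ALPHA, BETA, LEVELS = 0.38, 0.19, 2.0, 0.5, 6
K1 = np.array([0.05, 0.25, 0.4, 0.25, 0.05])
K2 = np.outer(K1, K1)                        # separable 5-tap low-pass filter L
H = np.array([[0.04, 0.05, 0.04],
              [0.05, 0.06, 0.05],
              [0.04, 0.05, 0.04]])           # local weighting filter H

def _lowpass(x):
    return ndimage.convolve(x, K2, mode='reflect')

def _upsample(x, shape):
    # Zero-insertion up-sampling followed by low-pass filtering scaled by 4;
    # one common Laplacian-pyramid choice (the letter refers to [18] for details).
    up = np.zeros(shape)
    up[::2, ::2] = x
    return ndimage.convolve(up, 4.0 * K2, mode='reflect')

def nlp_decompose(img):
    """Normalized Laplacian pyramid coefficients {y^(l)} of a gray-scale image."""
    x = np.power(img.astype(np.float64) / 255.0, GAMMA)   # retinal power law (the [0,1] scaling is our assumption)
    coeffs = []
    for _ in range(LEVELS):
        x_next = _lowpass(x)[::2, ::2]                     # x^(l+1) = D(L x^(l))
        z = x - _upsample(x_next, x.shape)                 # z^(l) = x^(l) - L U x^(l+1)
        y = z / (SIGMA + ndimage.convolve(np.abs(z), H, mode='reflect'))  # divisive normalization
        coeffs.append(y)
        x = x_next
    return coeffs

def nlp_distance(I, J):
    """Unweighted NLP distance of Eq. (3)."""
    yI, yJ = nlp_decompose(I), nlp_decompose(J)
    per_channel = [np.mean(np.abs(a - b) ** ALPHA) ** (BETA / ALPHA)
                   for a, b in zip(yI, yJ)]
    return np.mean(per_channel) ** (1.0 / BETA)
```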
This metric treats all spatial locations equally. However, to weight different spatial locations based on their visual importance (saliency), we propose to use a saliency-weighted version of d_i^(l) as follows: d_i^(l) = w_i^(l) |y_i^(l) − ŷ_i^(l)|, where w_i^(l)
denotes the normalized saliency value of the i-th pixel in J
down-scaled to the size of the l-th channel. To compute the
saliency information, any saliency model can be used. Here,
we use the saliency model proposed in [15]. Our motivation
for using this model was the promising results reported in [23].
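Continuing the sketch above (and reusing nlp_decompose, ALPHA, and BETA from it), the saliency-weighted variant simply rescales the coefficient differences by a saliency map down-scaled to each channel. Producing the saliency map itself (e.g., with SDSP [15]) is outside the sketch, and the use of bilinear ndimage.zoom for down-scaling is our assumption:

```python
def saliency_weighted_distance(I, J, S):
    """Saliency-weighted NLP distance: d_i^(l) = w_i^(l) |y_i^(l) - yhat_i^(l)|.

    S is a saliency map of J normalized to [0, 1] (e.g., produced by SDSP [15]);
    computing S itself is outside this sketch.
    """
    yI, yJ = nlp_decompose(I), nlp_decompose(J)
    per_channel = []
    for a, b in zip(yI, yJ):
        zoom = (a.shape[0] / S.shape[0], a.shape[1] / S.shape[1])
        w = ndimage.zoom(S, zoom, order=1)     # down-scale saliency to this channel
        d = w * np.abs(a - b)
        per_channel.append(np.mean(d ** ALPHA) ** (BETA / ALPHA))
    return np.mean(per_channel) ** (1.0 / BETA)
```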
To solve (2), we used the Adaptive Moment Estimation (Adam) algorithm [24], for which the derivative of Q(I|J) with respect to I_j can be analytically calculated as follows:

$$\frac{\partial Q(I|J)}{\partial I_j} = (1-\lambda)\,\frac{\partial D(I,J)}{\partial I_j} - \lambda\,\frac{\partial\,\mathrm{MSE}(I,J)}{\partial I_j}, \qquad (4)$$

where ∂MSE(I, J)/∂I_j = (2/M)(I_j − J_j), and
$$\frac{\partial D(I,J)}{\partial I_j} = \frac{1}{\beta}\,D(I,J)^{\,1-\beta}\,\frac{\partial}{\partial I_j}\!\left[\frac{1}{L}\sum_{l=1}^{L}\left(\frac{1}{N_l}\sum_{i=1}^{N_l}\left(d_i^{(l)}\right)^{\alpha}\right)^{\!\frac{\beta}{\alpha}}\right], \qquad (5)$$
where the derivative on the right-hand side is calculated as:

$$\frac{\partial}{\partial I_j}\!\left[\frac{1}{L}\sum_{l=1}^{L}\left(\frac{1}{N_l}\sum_{i=1}^{N_l}\left(d_i^{(l)}\right)^{\alpha}\right)^{\!\frac{\beta}{\alpha}}\right] = \frac{\beta}{\alpha L}\sum_{l=1}^{L}\left(\frac{1}{N_l}\sum_{i=1}^{N_l}\left(d_i^{(l)}\right)^{\alpha}\right)^{\!\frac{\beta}{\alpha}-1}\frac{\partial}{\partial I_j}\!\left[\frac{1}{N_l}\sum_{i=1}^{N_l}\left(d_i^{(l)}\right)^{\alpha}\right]. \qquad (6)$$
The derivative on the right-hand side of the above equation is:

$$\frac{\partial}{\partial I_j}\!\left[\frac{1}{N_l}\sum_{i=1}^{N_l}\left(d_i^{(l)}\right)^{\alpha}\right] = \frac{\alpha}{N_l}\sum_{i=1}^{N_l}\left(d_i^{(l)}\right)^{\alpha-1}\frac{\partial d_i^{(l)}}{\partial I_j}, \qquad (7)$$
where ∂d_i^(l)/∂I_j = w_i^(l) sgn(y_i^(l) − ŷ_i^(l)) ∂y_i^(l)/∂I_j, and, by the chain rule, ∂y_i^(l)/∂I_j = (∂y_i^(l)/∂z^(l)) (∂z^(l)/∂x_j^(l)) (∂x_j^(l)/∂I_j). We then calculate ∂y_i^(l)/∂z_k^(l) as follows:

$$\frac{\partial y_i^{(l)}}{\partial z_k^{(l)}} =
\begin{cases}
\dfrac{\sigma + q_i^{(l)} - H_{i,i}\,\operatorname{sgn}\!\big(z_i^{(l)}\big)\, z_i^{(l)}}{\big(\sigma + q_i^{(l)}\big)^{2}}, & k = i,\\[2ex]
\dfrac{-H_{i,k}\,\operatorname{sgn}\!\big(z_k^{(l)}\big)\, z_i^{(l)}}{\big(\sigma + q_i^{(l)}\big)^{2}}, & k \neq i,
\end{cases} \qquad (8)$$

where q_i^(l) is the value of H ∗ |z^(l)| at location i, and H_{i,k} is the value of H at location k, assuming that the center of H is at location i. We also obtain ∂z^(l)/∂x_j^(l) = t_j^(l), where t_j^(l) is the j-th column of T^(l), the matrix of the linear transformation performed by the Laplacian pyramid, i.e., z^(l) = T^(l) x^(l). In fact, T^(l) can be computed from Fig. 1; for details, please refer to [18]. Finally, we get ∂x_j^(l)/∂I_j = (1/γ) I_j^{1/γ − 1}.
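Putting the pieces together, a sketch of how (2) could be solved with Adam [24]. Here grad_Q is assumed to return ∂Q/∂I assembled from (4)–(8) (or from an automatic-differentiation framework); the step size, iteration count, and the projection of the box constraint by clipping are illustrative choices of ours, not specifications from the letter:

```python
import numpy as np

def estimate_jnd_map(J, S, grad_Q, lam=0.01, lr=0.1, steps=200, Gmin=0.0, Gmax=255.0):
    """Solve (2) with Adam [24] and return the JND map M = I_hat - J."""
    I = J.astype(np.float64).copy()
    m, v = np.zeros_like(I), np.zeros_like(I)
    b1, b2, eps = 0.9, 0.999, 1e-8           # standard Adam hyper-parameters
    for t in range(1, steps + 1):
        g = grad_Q(I, J, S, lam)             # dQ/dI from Eqs. (4)-(8)
        m = b1 * m + (1.0 - b1) * g
        v = b2 * v + (1.0 - b2) * g * g
        m_hat = m / (1.0 - b1 ** t)          # bias-corrected first moment
        v_hat = v / (1.0 - b2 ** t)          # bias-corrected second moment
        I = I - lr * m_hat / (np.sqrt(v_hat) + eps)
        I = np.clip(I, Gmin, Gmax)           # enforce the box constraint in (2) by projection
    return I - J                             # M = I_hat - J
```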
To estimate σ and H, α and β in (3), and γ, similar to [22], we optimized these parameters such that the Pearson linear correlation between the distance values predicted by (3) and the mean opinion scores (MOS) provided in the popular LIVE image quality assessment database [25] is maximized. The optimization procedure was the same as in [22]. We obtained σ = 0.19, α = 2, β = 0.5, and γ = 0.38. Also, we obtained H as the following 3 × 3 filter: [0.04 0.05 0.04; 0.05 0.06 0.05; 0.04 0.05 0.04]. We also experimentally found that λ = 0.01 and L = 6 achieve the best results for our purpose.
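A rough sketch of such a fitting procedure is given below; nlp_distance_with_params is a hypothetical variant of the earlier distance that exposes (σ, α, β, γ) as arguments, the filter H is kept fixed only for brevity, and the Nelder-Mead optimizer is our choice rather than the exact procedure of [22]:

```python
from scipy import optimize, stats

def fit_nlp_parameters(ref_imgs, dist_imgs, mos, x0=(0.19, 2.0, 0.5, 0.38)):
    """Choose (sigma, alpha, beta, gamma) to maximize Pearson correlation with MOS."""
    def neg_pearson(params):
        d = [nlp_distance_with_params(r, t, *params)      # hypothetical helper
             for r, t in zip(ref_imgs, dist_imgs)]
        rho, _ = stats.pearsonr(d, mos)
        return -abs(rho)                                  # maximize |correlation|
    res = optimize.minimize(neg_pearson, x0, method='Nelder-Mead')
    return res.x
```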
III. EXPERIMENTAL RESULTS
For evaluation of the proposed method, similar to [7], [8],
[10], a subjective experiment was performed to compare the
efficacy of the proposed method with the following four latest
JND models: [7] (EJND), [8] (SJND), [10] (Wu2013), [9]
(Liu2010). For this purpose, we used the 12 images from
[7] (named I1 to I12) for comparisons among different JND
models. These images are often used either for comparing
different JND models or for image quality assessment [7].
During the experiment, two JND noise-contaminated images of the same scene (one produced by the proposed method, and the other one produced by the other method being compared) were randomly juxtaposed on the right or left part of a screen with a mid-gray background. To produce a JND noise-contaminated image Ĵ out of a pristine image J with JND map M, similar to many existing works [7], [8], we used the following formula: Ĵ = J + η N ⊙ M, where N is a random noise pattern which takes −1 or +1 independently and with equal probability, ⊙ denotes pixel-wise multiplication, and η is the noise level adjuster. The reason for using N is to avoid creating a fixed artificial spatial pattern. In our experiments, the noise level for all JND models was adjusted such that the PSNR of all noise-contaminated images equals 26 ± 0.01 dB, so that the distortions are easier to see.
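A small sketch of this noise-injection step: since N is a ±1 pattern, MSE(Ĵ, J) = η² · mean(M²), so η can be chosen in closed form to hit the target PSNR. The final clipping to the valid gray-level range, which may perturb the PSNR very slightly, is our addition:

```python
import numpy as np

def contaminate(J, M, target_psnr=26.0, rng=None):
    """Inject JND-shaped noise: J_hat = J + eta * N (.) M with N in {-1, +1}."""
    rng = np.random.default_rng() if rng is None else rng
    N = rng.choice([-1.0, 1.0], size=J.shape)                  # equiprobable +/-1 pattern
    target_mse = 255.0 ** 2 / (10.0 ** (target_psnr / 10.0))   # PSNR = 10 log10(255^2 / MSE)
    eta = np.sqrt(target_mse / np.mean(M.astype(np.float64) ** 2))
    J_hat = J.astype(np.float64) + eta * N * M
    return np.clip(J_hat, 0.0, 255.0)
```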
A 17-inch LG T1710B monitor with a maximum brightness of 300 cd/m² and a resolution of 1024 × 768 pixels was used. The brightness and contrast of the display were set to 50%. The experiment was run in a quiet classroom with 24 naive subjects (15 males, 9 females) with normal or corrected-to-normal eyesight, aged between 18 and 23. The viewing environment and viewing conditions were set following the guidance of the ITU-R BT.500-11 standard [26]. The illumination in the room was in the range of 100–150 lux. The distance between the display and the subjects was fixed at 70 cm. Each participant was familiarized with the task before the start of the experiment via a short printed instruction sheet. The total length of the experiment for each participant was approximately 16 minutes.
Each image pair was shown for 10 seconds. After this presentation, a mid-gray blank screen was shown for 5 seconds. During this period, similar to [7], the subjects were asked to decide which image had better quality (left or right), and how much better it was, according to the following scoring rule: 0 (same quality), 1 (slightly better), 2 (better), 3 (much better). Participants did not know which image was obtained by which method. A randomly chosen half of the trials had the image produced by the proposed method presented on the left side of the screen, and the other half on the right side, in order to counteract side bias in the responses. This gave a total of 12 × 2 × 4 trials (duplicated to balance left and right presentation) for each subject. The obtained results are
shown in Table I. In this table, ‘Mean’ refers to the mean of the quality scores given by the subjects to each image, where positive Mean values indicate that the proposed JND model outperforms its relevant competitor. A two-sided Pearson’s chi-square (χ²) test [27] was used to examine the statistical significance of the results based on the number of collected votes for each model. The null hypothesis is that
there is no preference for either the proposed method or the other JND model. The p-value [27] is indicated in the table. In experimental sciences, as a rule of thumb, the null hypothesis is rejected when p < 0.05. When this happens in Table I, it means that the two images cannot be considered to have the same subjective quality at the 95% confidence level, since one of them has obtained a significantly higher quality score, and therefore seems to have better quality. In the table, cases with p > 0.05 are indicated in bold typeface. We did not use any outlier removal process in our experiments because, with a relatively small number of subjects, removing some of the responses would reduce the statistical power of the test.
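A sketch of this significance test on the collected vote counts is shown below; how tied votes (score 0) were counted is not detailed in the letter, so they are simply excluded here:

```python
import numpy as np
from scipy import stats

def preference_p_value(votes_proposed, votes_other):
    """Two-sided Pearson chi-square test of 'no preference' between two JND models."""
    observed = np.array([votes_proposed, votes_other], dtype=float)
    expected = np.full(2, observed.sum() / 2.0)   # null hypothesis: 50/50 split
    _, p = stats.chisquare(observed, expected)
    return p
```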
As seen from the results in Table I, the proposed method outperforms EJND on all images except for I1, I8, I9, and I10. Specifically, its performance is statistically the same as EJND on I1, as its p-value is greater than 0.05, while on I8, I9, and I10, the proposed method performs slightly worse than EJND. However, looking across all trials, we observe that the proposed method outperforms EJND with an average quality difference score of 0.38 and an overall p = 0.0045, which is a statistically significant result since the p-value is well below 0.05. Note that I8, I9, and I10 have complex content with multiple attention-grabbing objects, but the saliency maps produced by [15] show only one salient object. Hence, we believe that the lower performance of the proposed method on these images may be related to the inaccurate saliency maps. We also observe that the proposed method performs statistically better than SJND on all images (with a mean quality difference score of 0.53) except for I8, I9, and I10, where its performance is statistically the same as SJND. Compared to Wu2013 and Liu2010, we observe that the proposed method outperforms both of these methods on all images, with mean quality difference scores of 0.62 and 1.07, respectively, and the obtained results are all statistically significant. Fig. 2 shows a visual example comparing various JND models on I6 based on their JND noise-contaminated images at the same level of noise energy (PSNR = 19.06 dB). For this example, we intentionally used a low PSNR so that the distortions are easier to see at this scale. As seen from this figure, the proposed method achieves better perceptual quality compared to the other methods.
We also compared the computational complexity of the proposed method with the other JND models, implemented using their original Matlab code, on an Intel i7-3790K CPU at 4.00 GHz with 8 GB RAM, using a sample 512 × 512 image. The average execution times (in seconds) for Liu2010, Wu2013, EJND, SJND, and the proposed method were 0.51, 3.82, 0.57, 1.92, and 4.2 (1.1 s for saliency computation and 3.1 s for the cost function minimization), respectively. The proposed method is the slowest among these, but it enables more accurate JND estimation. Its speed can be increased by using a faster method for saliency computation and a faster algorithm for the cost function minimization.
TABLE I
THE RESULTS OF COMPARING THE PROPOSED METHOD WITH EACH OF THE FOUR LATEST JND MODELS ON THE 12 TESTED IMAGES.

        vs. EJND         vs. SJND         vs. Wu2013       vs. Liu2010
        Mean   p-value   Mean   p-value   Mean   p-value   Mean   p-value
I1      0.03   0.1489    0.12   0.0445    0.15   0.0312    0.45   0.0121
I2      0.69   0.0013    0.74   0.0009    0.57   0.0072    1.25   0.0003
I3      1.12   0.0001    1.45   0.0001    1.57   0.0001    2.4    0.0001
I4      1.04   0.0001    1.23   0.0001    1.32   0.0002    2.1    0.0001
I5      0.24   0.0094    0.11   0.0463    0.19   0.0401    0.45   0.0121
I6      0.63   0.0039    0.45   0.0121    0.78   0.0007    1.16   0.0004
I7      0.43   0.0180    0.76   0.0008    0.61   0.0042    1.09   0.0006
I8     -0.12   0.0193    0.08   0.0543    0.12   0.0445    0.44   0.0176
I9     -0.26   0.0014    0.03   0.1489    0.10   0.0469    0.29   0.0080
I10    -0.33   0.0001    0.01   0.7728    0.13   0.0431    0.17   0.0411
I11     0.27   0.0082    0.41   0.0192    0.67   0.0018    1.14   0.0004
I12     0.84   0.0008    1.04   0.0004    1.23   0.0003    1.89   0.0001
Avg     0.38   0.0045    0.53   0.0012    0.62   0.0007    1.07   0.0001

Fig. 2. Comparing various JND models based on their JND noise-contaminated image at the same level of noise energy. From top to bottom and left to right: original image, Liu2010, Wu2013, EJND, SJND, and the proposed method. Please zoom in to see the distortions better.

IV. CONCLUSIONS

In this letter, we presented a novel JND estimation method, which utilizes the visual saliency information of an image for a better prediction of JND thresholds. The main idea behind
the proposed method is that a JND noise-contaminated image (i.e., a noisy version of an image whose noise amplitude is equal to the image's JND thresholds) should be indistinguishable from the original pristine image. Hence, to estimate the JND thresholds of a given image, one can find the largest JND thresholds such that the perceptual distance between the image and its JND noise-contaminated version is minimized. For
this purpose, a saliency-weighted perceptual distance is first
defined in the normalized Laplacian domain. It is then used
in an optimization process to estimate the JND thresholds
based on the above-mentioned idea. The proposed method
was compared with the latest JND models in a subjective
experiment. The experimental results demonstrated that, on
average, the proposed method outperforms the compared
methods. Although the proposed method was presented for
grayscale images, it can easily be extended to color images. For example, for color images, D(I, J) can be defined as the mean of the NLP distances of the individual color channels, and MSE(I, J) can be computed over all the color channels. The new cost function can then be minimized using the same optimization procedure proposed for the grayscale images.
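As a brief sketch of this extension (reusing nlp_distance from the earlier sketch; the channel ordering and color space are left unspecified, as in the letter):

```python
def nlp_distance_color(I, J):
    """Mean of the NLP distances computed on the individual color channels."""
    return np.mean([nlp_distance(I[..., c], J[..., c]) for c in range(I.shape[-1])])
```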
REFERENCES
[1] A. B. Watson, Digital Images and Human Vision. The MIT Press, 1993.
[2] F. A. A. Kingdom, Psychophysics: A Practical Introduction. Academic Press, 2009.
[3] X. Yang, W. Lin, Z. Lu, E. Ong, and S. Yao, “Motion-compensated
residue pre-processing in video coding based on just-noticeable-
distortion profile,” IEEE Trans. Circuits Syst. Video Technol., vol. 15,
pp. 745–752, 2005.
[4] W. Lin and C. J. Kuo, “Perceptual visual quality metrics: A survey,”
J. Visual Communication and Image Representation, vol. 22, no. 4, pp.
297–312, 2011.
[5] Y. Niu, M. Kyan, L. Ma, A. Beghdadi, and S. Krishnan, “Visual saliency's modulatory effect on just noticeable distortion profile and its application in image watermarking,” Signal Process.: Image Comm., vol. 28, pp. 917–928, 2013.
[6] L. Itti, G. Rees, and J. K. Tsotsos, Neurobiology of Attention. Academic
Press, 2005.
[7] J. Wu, L. Li, W. Dong, G. Shi, W. Lin, and C.-C. J. Kuo, “Enhanced
just noticeable difference model for images with pattern complexity,”
IEEE Trans. Image Process. (to appear), 2017.
[8] H. Hadizadeh, “A saliency-modulated just-noticeable-distortion model
with non-linear saliency modulation functions,” Pattern Recognit. Lett.,
vol. 84, no. C, pp. 49–55, 2016.
[9] A. Liu, W. Lin, M. Paul, C. Deng, and F. Zhang, “Just noticeable difference for images with decomposition model for separating edge and textured regions,” IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 11, pp. 1648–1652, 2010.
[10] J. Wu, W. Lin, G. Shi, X. Wang, and F. Li, “Pattern masking estimation in image with structural uncertainty,” IEEE Trans. Image Process., vol. 22, no. 12, pp. 4892–4904, 2013.
[11] X. Zhang, W. Lin, and P. Xue, “Improved estimation for just-noticeable
visual distortion,” Signal Processing, vol. 28, pp. 795–808, 2005.
[12] A. Ahumada and H. Peterson, “Luminance-model-based DCT quan-
tization for color image compression,” Vision Visual Process. Digital
Display III, pp. 365–374, 1992.
[13] C.-H. Chou and Y.-C. Li, “A perceptually tuned subband image coder
based on the measure of just-noticeable-distortion profile,” IEEE Trans.
Image Processing, vol. 5, no. 6, pp. 467–476, 1995.
[14] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual
attention for rapid scene analysis,” IEEE Trans. Pattern Anal. Machine
Intell., vol. 20, pp. 1254–1259, Nov. 1998.
[15] L. Zhang, Z. Gu, and H. Li, “SDSP: A novel saliency detection method
by combining simple priors,” Proc. IEEE Int. Conf. Image Process., pp.
171–175, Sep. 2013.
[16] V. A. Mateescu, H. Hadizadeh, and I. V. Bajić, “Evaluation of several visual saliency models in terms of gaze prediction accuracy on video,” in IEEE QoEMC'12, in conjunction with IEEE Globecom'12, Dec. 2012.
[17] Z. Lu, W. Lin, X. Yang, E. Ong, and S. Yao, “Modeling visual attention's modulatory aftereffects on visual sensitivity and quality evaluation,” IEEE Trans. Image Process., vol. 14, pp. 1928–1942, 2005.
[18] P. J. Burt and E. H. Adelson, “The Laplacian pyramid as a compact image code,” IEEE Trans. Commun., vol. 31, pp. 532–540, 1983.
[19] V. Laparra, J. Balle, A. Berardino, and E. Simoncelli, “Perceptual image quality assessment using a normalized Laplacian pyramid,” Proc. IS&T Int. Symposium on Electronic Imaging, Conf. on Human Vision and Electronic Imaging, Feb. 2016.
[20] O. Schwartz and E. P. Simoncelli, “Natural signal statistics and sensory
gain control,” Nat. Neurosci., vol. 4, no. 8, pp. 819–825, 2001.
[21] D. Heeger, “Normalization of cell responses in cat striate cortex,” Vis. Neurosci., vol. 9, pp. 181–198, 1992.
[22] V. Laparra, J. M. Mari, and J. Malo, “Divisive normalization image quality metric revisited,” JOSA A, vol. 27, no. 4, pp. 852–864, 2010.
[23] L. Zhang, Y. Shen, and H. Li, “VSI: A visual saliency-induced index
for perceptual image quality assessment,” IEEE Trans. Image Process.,
vol. 23, no. 10, pp. 4270–4281, Oct. 2014.
[24] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimiza-
tion,” 3rd Intl. Conf. Learning Represent., 2015.
[25] H. R. Sheikh, K. Seshadrinathan, A. K. Moorthy, Z. Wang, A. C. Bovik,
and L. K. Cormack, “Image and video quality assessment research at
LIVE,” http://live.ece.utexas.edu/research/quality, 2014, [Online].
[26] ITU-R BT.500-11, “Method for the subjective assessment of the quality
of television pictures,” ITU, Tech. Rep., 2002.
[27] D. J. Sheskin, Handbook of Parametric and Nonparametric Statistical
Procedures. Chapman & Hall/CRC, 2007.