Video quality assessment by decoupling additive impairments and detail losses.
VIDEO QUALITY ASSESSMENT BY DECOUPLING ADDITIVE IMPAIRMENTS AND
DETAIL LOSSES
Songnan Li, Lin Ma, King Ngi Ngan
Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong SAR
ABSTRACT
This paper reviews existing methods for extending an image quality metric to a video quality metric. We find that three processing steps are usually involved: temporal channel decomposition, temporal masking, and error pooling. These steps are used to extend our previously proposed image quality metric, which separately evaluates additive impairments and detail losses, to a video quality metric. The resulting algorithm is tested on the subjective video database LIVE and shows a good performance in matching subjective ratings.
Index Terms— video quality assessment, distortion de-
coupling, human visual system, visual masking
1. INTRODUCTION
Since the human visual system (HVS) is the ultimate receiver of the video service, subjective viewing tests are considered to be the most reliable way to evaluate visual quality. However, subjective viewing tests are expensive and not feasible for on-line use, which makes them impractical for system design, quality monitoring, etc. Therefore, an accurate objective video quality assessment (VQA) algorithm, also called a video quality metric (VQM), is of fundamental importance to future multimedia applications.
It is customary to classify VQMs into three categories according to the availability of the reference: full-reference (FR), reduced-reference (RR), and no-reference (NR) metrics. In FR metrics, the reference is fully available and is assumed to have maximum quality. They can be applied where the reference is fully available, such as in image/video coding, watermarking, etc. RR metrics extract features from the reference video and transmit them to the receiver side, where they are compared against the corresponding features extracted from the distorted video. RR metrics mainly target quality monitoring. The features should be carefully selected to achieve both effectiveness and efficiency, i.e., to predict quality with high accuracy and a small overhead for feature representation. NR metrics require no reference and are therefore the most broadly applicable. For many no-reference applications, such as video signal acquisition, enhancement, etc., an NR metric is the only choice for on-line quality assessment. Not surprisingly, NR metric design is difficult because the input information is limited. Therefore, to ensure acceptable prediction performance, many NR metrics are designed to cope with specific artifacts, such as blocking, blurring, ringing, jitter/jerky motion, etc., sacrificing versatility for prediction accuracy. For a comprehensive overview of NR metrics, please refer to [1].
This work was partially supported by a grant from the Chinese University of Hong Kong under the Focused Investment Scheme (Project 1903003).
In this paper, we propose an FR video quality metric. It is an extension of our previously proposed image quality metric (IQM) [2], which separately evaluates detail losses and additive impairments for visual quality assessment. In Section 2, we briefly review our IQM and then discuss how to extend it to a VQM. Section 3 elaborates the implementation details. Section 4 shows the performance of the proposed VQM in matching subjective ratings. Section 5 provides the concluding remarks.
2. BACKGROUND
2.1. Spatial distortion measurement
Owing to the limited paper length, we refer the reader to [3] for an overview of image quality assessment. In our VQM, we adopt our previous work [2] to measure the spatial distortions. Instead of being treated indistinguishably, the spatial distortions are decomposed into detail losses and additive impairments. As the name implies, detail losses refer to the loss of useful information, which affects the content visibility. Additive impairments, on the other hand, refer to the redundant visual information which does not belong to the original image but appears in the distorted image. Their appearance distracts the viewer's attention from the useful picture contents, causing an unpleasant viewing experience. To assist understanding, an illustration is given in Fig. 1. In Fig. 1 (a), the distorted image is separated into the original image and the error image. Typically, HVS-model based IQMs try to simulate low-level HVS responses to the error image, treating these distortions as homogeneous. As shown in Fig. 1 (b), the proposed method further separates the distortions into detail losses and additive impairments. For JPEG compressed images, such as the one shown in Fig. 1, the additive impairment mainly appears to be blocky. In our implementation, we separate the distorted image into an additive impairment image and a restored image, as shown in Fig. 1 (c). The restored image exhibits the same amount of detail losses as the distorted image but is free of additive impairments. The detail loss can then be obtained by subtracting the restored image from the original image. In [4], the necessity of decoupling linear frequency distortions and additive noises, two concepts essentially similar to detail losses and additive impairments, was first advocated and proved to be useful for visual quality assessment. In our view, decoupling distortions into additive impairments and detail losses has several benefits. First, the content visibility can be more accurately quantified due to the extraction of detail losses. Second, a better spatial masking scheme can be designed. This will be explained in Section 3.4. Third, specific measures can be developed to associate detail losses or additive impairments with visual quality. This will be explained in Section 3.5.
2011 Third International Workshop on Quality of Multimedia Experience
978-1-4577-1334-7/11/$26.00 ©2011 IEEE
2.2. Extension from IQM to VQM
From the literature, we found that three processing steps are usually involved in extending an IQM to a VQM: temporal channel decomposition, temporal masking, and error/quality pooling.
Many existing VQMs decompose the video signal into multiple spatio-temporal frequency channels and then assign different weights to them according to, e.g., the contrast sensitivity function (CSF). It is believed that the early stage of the visual pathway separates visual information into two temporal channels: a low-pass channel and a band-pass channel, known as the sustained and transient channel, respectively. Several VQMs model this HVS mechanism by filtering the videos along the temporal dimension using one or two filters. Recently, Seshadrinathan et al. [5] proposed to use three-dimensional Gabor filters to decompose the video locally into 105 spatio-temporal channels, enabling the calculation of motion vectors from the Gabor outputs. Different from typical CSF weighting, in [5] each channel is weighted according to the distance between its center frequency and a spectral plane identified by the motion vectors of the reference video. Lee et al. [6] proposed to find the optimal weights for channels by optimizing the metric's predictive performance on subjective video databases.
Masking is another visual phenomenon critical for video quality assessment: the visibility of distortions is highly dependent on both the local spatial and temporal activities. Lukas et al. [7] used the derivative of the outputs of a spatial visual model along the time axis to measure the local temporal activities, which then serves as input to a nonlinear temporal masking function. This function was calibrated by fitting psychophysical data. Lindh et al. [8] extended a classical divisive normalization based masking model from the spatial to the spatio-temporal frequency domain. Chou et al. [9] proposed to measure temporal activity simply by calculating pixel differences between adjacent frames. They constructed a temporal masking function via a specifically designed psychophysical experiment. Similar temporal masking functions were adopted by a host of video quality metrics and JND models.
Fig. 1. An example of (a) separating the distorted image into the original image and the error image, (b) separating the error image into the detail loss image and the additive impairment image, (c) separating the distorted image into the restored image and the additive impairment image.
Pooling models the information integration which is believed to happen at the late stage of the visual pathway, and usually it is carried out by summation over all dimensions to obtain an overall quality score for an image or video. Wang et al. [10] used relative and background motions to quantify two terms: motion information content and perceptual uncertainty, which in the next step were used as weighting factors in the spatial pooling process. In TetraVQM [11], a degradation duration map is generated for each frame by analyzing the motion trajectory, and serves as a weighting matrix in spatial pooling. Ninassi et al. [12] proposed to take into account the temporal variation of spatial distortions in the temporal pooling process. They also considered the asymmetric human behavior in responding to quality degradation and improvement. This asymmetric human behavior was also modeled in [13].
Fig. 2. The framework of the proposed VQM. Blocks shown: wavelet transform, temporal filtering, decoupling, contrast sensitivity function, spatial masking, temporal masking, detail loss measure (DLM), additive impairment measure (AIM), combination of DLM and AIM, and temporal pooling.
We extend our IQM [2] to a VQM by incorporating all three processing steps mentioned above. To reduce computational complexity, the adopted decomposition, masking and pooling methods are simple and time efficient. In general, only the sustained channel is extracted after the temporal decomposition; temporal masking is calculated based on differences between adjacent frames; spatial pooling is formulated using Minkowski summation, and temporal pooling considers the aforementioned asymmetric human behavior. Implementation details will be given in the next section.
3. THE PROPOSED METHOD
The proposed video quality metric works with luminance only. Its framework is illustrated in Fig. 2. Detailed information on each processing component and the meaning of the notation will be given below.
3.1. Temporal filtering
As introduced in Section 2, it is believed that there are two temporal channels in the HVS, a low-pass one, known as the sustained channel, and a band-pass one, known as the transient channel. However, it is the sustained channel that carries most visual information. In [14], it is confirmed that a majority of distortions exists in the sustained channel. Therefore, as in [13], the proposed VQM uses the sustained channel only, to reduce the computational complexity. Specifically, both the original and distorted sequences are subjected to a 30Hz low-pass temporal filter. To reduce time delay, a three-tap infinite impulse response (IIR) filter is chosen [15]:
$$x_n = 0.8 \times x_n^{input} + 0.12 \times x_{n-1}^{input} + 0.08 \times x_{n-1} \tag{1}$$
where $x_n^{input}$ is either the $n$th frame of the original sequence ($o_n^{input}$) or that of the distorted sequence ($d_n^{input}$), and $x_n$ ($o_n$ or $d_n$) is the low-pass temporal filtering result.
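The recursion in (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `temporal_lowpass` is hypothetical, and the start-up state (the first frame is treated as if it had been repeated before the sequence began) is an assumption, since the text does not say how the filter is initialized.

```python
import numpy as np

def temporal_lowpass(frames):
    """Apply x_n = 0.8*in_n + 0.12*in_{n-1} + 0.08*x_{n-1} along time."""
    frames = [np.asarray(f, dtype=float) for f in frames]
    if not frames:
        return []
    # Assumed start-up state: behave as if the first frame had been repeated.
    prev_in = frames[0]
    prev_out = frames[0]
    filtered = []
    for cur in frames:
        out = 0.8 * cur + 0.12 * prev_in + 0.08 * prev_out
        filtered.append(out)
        prev_in, prev_out = cur, out
    return filtered
```

Because the three coefficients sum to one, a static sequence passes through unchanged; the same function would be applied to the original and the distorted luminance frames.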
3.2. Decoupling additive impairments and useful image
contents
Each local patch $r_i$ of the restored image can be decomposed into image components:
$$r_i = \sum_{s=0}^{S} r_i^s \tag{2}$$
where $i$ is the local index, $r_i^s$ indicates the component reconstructed by the wavelet coefficients of the $s$th subband (totally $S + 1$ subbands), and $s = 0$ indicates the approximation subband. The same decomposition can be applied to the original image and the distorted image to derive $o_i^s$ and $d_i^s$, respectively. For $s \in \{1,...,S\}$, the mean value of $r_i^s$/$o_i^s$/$d_i^s$ equals zero. In general, we intend to get $r_i^s$, $s \in \{1,...,S\}$, that is additive-impairment-free and exhibits the same amount of detail losses as $d_i^s$. To the first end, we make $r_i^s = k_i^s \times o_i^s$, where $k_i^s$ is between 0 and 1 to take into account the influence of the detail loss. To the second end, we maximize the similarity between $r_i^s$ and $d_i^s$ by setting the scale factor $k_i^s$. The similarity between $r_i^s$ and $d_i^s$ is measured by the sum of squared differences to facilitate its optimization. Thus, the similarity maximization is implemented as: $\min_{k_i^s \in [0,1]} \|r_i^s - d_i^s\|^2$. Given an orthonormal discrete wavelet transform (DWT), the following equations hold:
$$\begin{aligned}
\min_{k_i^s \in [0,1]} \|r_i^s - d_i^s\|^2
&= \min_{k_i^s \in [0,1]} \|\mathrm{DWT}[r_i^s - d_i^s]\|^2 \\
&= \min_{k_i^s \in [0,1]} \|\mathrm{DWT}[k_i^s \times o_i^s - d_i^s]\|^2 \\
&= \min_{k_i^s \in [0,1]} \|k_i^s \times \mathrm{DWT}[o_i^s] - \mathrm{DWT}[d_i^s]\|^2 \\
&= \min_{k_i^s \in [0,1]} \|k_i^s \times O_i^s - D_i^s\|^2
\end{aligned} \tag{3}$$
where $O_i^s$ and $D_i^s$ denote the DWT coefficients of $o_i^s$ and $d_i^s$, respectively. From (3), we can get the closed-form solution for the scale factor $k_i^s$:
$$k_i^s = \mathrm{clip}\left(\frac{\langle O_i^s \cdot D_i^s \rangle}{\|O_i^s\|^2}, 0, 1\right) \tag{4}$$
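The step from the minimization to the closed-form scale factor can be spelled out as follows. This is a short worked derivation under the assumption that the coefficient vector of the original, written here as $O$, is non-zero; $k$, $O$ and $D$ stand for the scale factor and the DWT coefficient vectors of one subband component of one patch.

```latex
\begin{aligned}
g(k) &= \|kO - D\|^2 = k^2\|O\|^2 - 2k\langle O, D\rangle + \|D\|^2, \\
\frac{dg}{dk} &= 2k\|O\|^2 - 2\langle O, D\rangle = 0
\;\Rightarrow\; k^\ast = \frac{\langle O, D\rangle}{\|O\|^2}.
\end{aligned}
```

Since $g$ is a convex quadratic in $k$, its minimizer over the interval $[0,1]$ is the unconstrained minimizer $k^\ast$ clipped to that interval, which gives the clipped ratio of the inner product to the squared norm.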
A simplification can be made: instead of using a vector of DWT coefficients, $O_i^s$ or $D_i^s$ can be represented by a single DWT coefficient. In this way, (4) is simplified to the division of two scalar values. In the following discussion, $n$ indexes each frame, $\lambda$ and $\theta$ index subband scale and orientation, respectively, and $\{i,j\}$ indexes the DWT coefficient position. A four-level Haar DWT is applied to the temporally low-pass filtered original and distorted frames ($o_n$ and $d_n$), generating the DWT coefficients $O_n(\lambda,\theta,i,j)$ and $D_n(\lambda,\theta,i,j)$. Based on the abovementioned simplification, scale factors of the high frequency subbands are given by:
$$k_n(\lambda,\theta,i,j) = \mathrm{clip}\left(\frac{D_n(\lambda,\theta,i,j)}{O_n(\lambda,\theta,i,j) + 10^{-30}}, 0, 1\right) \tag{5}$$
where the constant $10^{-30}$ is to avoid dividing by zero. Since intuitively the original mean luminance cannot be recovered from the distorted image, the approximation subband of the restored image is set equal to that of the distorted image. Eventually, the DWT coefficients of the restored image can be obtained by:
$$R_n(\lambda,\theta,i,j) = \begin{cases} D_n(\lambda,\theta,i,j), & \theta = 1 \\ k_n(\lambda,\theta,i,j) \times O_n(\lambda,\theta,i,j), & \text{otherwise} \end{cases} \tag{6}$$
where $\theta = 1$ indicates the approximation subband. Since DWT is a linear operator and the additive impairment image is given by $a_n = d_n - r_n$, DWT coefficients of $a_n$ can be calculated by:
$$A_n(\lambda,\theta,i,j) = D_n(\lambda,\theta,i,j) - R_n(\lambda,\theta,i,j) \tag{7}$$
Notably, different from our previous work [2], the decoupling algorithm described in this paper cannot handle contrast enhancement, for the purpose of reducing computational complexity.
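The per-coefficient decoupling described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `decouple` and the `is_approx` flag are hypothetical, and the inputs are assumed to be the coefficient arrays of one subband of the original and distorted frames, already produced by some Haar DWT implementation.

```python
import numpy as np

EPS = 1e-30  # constant from the text, used to avoid dividing by zero

def decouple(O, D, is_approx=False):
    """Split one distorted subband D into restored (R) and additive (A) parts.

    O and D are same-shaped coefficient arrays of the original and distorted
    frame. For the approximation subband the restored coefficients are simply
    the distorted ones, so the additive part is zero there.
    """
    O = np.asarray(O, dtype=float)
    D = np.asarray(D, dtype=float)
    if is_approx:
        R = D.copy()
    else:
        k = np.clip(D / (O + EPS), 0.0, 1.0)  # scalar scale factor per coefficient
        R = k * O
    A = D - R
    return R, A
```

For example, with an original coefficient of 4, a distorted value of 2 is treated entirely as detail loss (restored 2, additive 0), whereas a distorted value of 6 keeps the full original coefficient (restored 4) and assigns the excess 2 to the additive impairment.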
3.3. Contrast sensitivity function
HVS contrast sensitivity is the reciprocal of the contrast threshold, i.e., the minimum contrast value for an observer to detect a stimulus. It is found in psychovisual experiments that HVS contrast sensitivity depends on the characteristics of the visual stimulus, e.g., its spatial frequency, orientation, etc. The contrast sensitivity function (CSF) quantifies such dependences. The proposed VQM adopts the CSF used in [16]. It can be given by:
$$H(f,\theta) = \begin{cases} (a + b f_\theta)\exp[-(c f_\theta)^{1.1}], & f \ge d \\ 0.981, & \text{otherwise} \end{cases} \tag{8}$$
where $f$ denotes the radial spatial frequency in cycles per degree of visual angle, and the constants are $a = 0.049$, $b = 0.592$, $c = 0.228$, $d = 3.4$. According to [17], the nominal spatial frequency of each DWT coefficient in scale $\lambda$ can be given by:
$$f = \frac{\pi \times f_q \times d}{180 \times h \times 2^{\lambda}} \tag{9}$$
where $d$ is the viewing distance, $h$ is the picture height, and $f_q$ is the cycles per picture height. In the following experiments, we set the ratio of $d$ to $h$ to be 6. $f_\theta = f/[0.15p(\theta) + 0.85]$ accounts for the oblique effect, i.e., the HVS is more sensitive to the horizontal and vertical channels than the diagonal channels. $p(\theta) = 1$ for the vertical and horizontal DWT subbands, and $p(\theta) = -1$ for the diagonal DWT subband. As illustrated in Fig. 2, we simulate CSF processing for the original image and the two decoupled images. It is implemented by multiplying each DWT coefficient with its corresponding CSF value derived from (8) and (9).
Fig. 3. The weighting matrix w:
1/30 1/30 1/30
1/30 1/15 1/30
1/30 1/30 1/30
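The CSF weight of a subband can be sketched as below. The function names, the `diagonal` flag and the default `dist_to_height=6.0` argument are illustrative choices; the sketch assumes, following the text literally, that the piecewise condition compares the radial frequency f (not the orientation-adjusted frequency) against the constant 3.4. The value of the cycles-per-picture-height input is left to the caller, since the text does not state it.

```python
import math

A, B, C = 0.049, 0.592, 0.228  # CSF constants from the text
F_LOW = 3.4                    # below this frequency the CSF is held at 0.981

def nominal_frequency(fq, lam, dist_to_height=6.0):
    """Nominal frequency (cycles/degree) of DWT scale `lam`.

    fq is the cycles per picture height; dist_to_height is the ratio of
    viewing distance to picture height (6 in the experiments).
    """
    return math.pi * fq * dist_to_height / (180.0 * 2 ** lam)

def csf(f, diagonal=False):
    """CSF value for radial frequency f, with the oblique-effect adjustment."""
    p = -1.0 if diagonal else 1.0
    f_theta = f / (0.15 * p + 0.85)  # equals f for horizontal/vertical subbands
    if f >= F_LOW:
        return (A + B * f_theta) * math.exp(-((C * f_theta) ** 1.1))
    return 0.981
```

Each DWT coefficient of a subband would then be multiplied by the single `csf(...)` value computed for that subband's scale and orientation.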
3.4. Spatial and temporal masking
Spatial masking refers to the visibility threshold elevation of a target signal caused by the presence of a superposed masker signal. Traditional spatial masking methods use the original image to mask the distortions. However, artifacts may make the distorted image less textured than the original, especially for low-quality images where the contrasts of the textures or edges have been significantly reduced. In our metric, the restored image and the additive impairment image are decoupled, as illustrated in Fig. 1 (c). Since the two decoupled images are superposed to form the distorted image, the presence of one affects the visibility of the other. Therefore, in the proposed metric each image serves as the masker that modulates the intensity of the other. We use (10) to calculate the spatial masking thresholds:
$$TH_\lambda = m_i \times \sum_{\theta=1}^{3} (|M_{\lambda,\theta}| \otimes w), \quad i \in \{s,t\} \tag{10}$$
where $w$ is a 3×3 weighting matrix as shown in Fig. 3, $|M_{\lambda,\theta}|$ is the absolute DWT subband of the masker signal, operator $\otimes$ indicates convolution, and $TH_\lambda$ is the spatial masking threshold map for each of the three DWT subbands in scale $\lambda$. The factor $m_s$ can be used to alter the slope of the masking function. As in [2], $m_s$ is set to 1 for all subbands. We take the absolute value of the CSF-weighted DWT coefficients, i.e., $|R_n^{csf}|$ ($|A_n^{csf}|$), subtract from them the spatial masking thresholds measured by (10) using $A_n^{csf}$ ($R_n^{csf}$) as the masker, and clip the resultant negative values to 0. After the spatial masking, the DWT coefficients of the restored and additive impairment images can be represented by $R_n^{sm}$ and $A_n^{sm}$, respectively. As shown in Fig. 1, $S_n^{sm}$, which denotes detail losses, can be derived by subtracting $R_n^{sm}$ from $O_n^{csf}$, i.e., the CSF-weighted DWT coefficients of the $n$th original frame.
Temporal masking (TM) usually is modeled as a function of temporal discontinuity in intensity: the higher the inter-frame difference, the stronger the temporal masking effect. This method is also adopted by our VQM for computational simplicity. More precisely, the difference map between $O_n^{csf}$ and $O_{n-1}^{csf}$ is used as the masker to temporally mask the two types of spatial distortions: the detail losses ($S_n^{sm}$) and the additive impairments ($A_n^{sm}$). Eq. (10) is used to get the temporal masking threshold, and the masking process also follows the aforementioned three steps: taking absolute value, subtracting threshold and then clipping to zero. $m_t$ is set to 0.5 for all subbands. The value is determined by training, which will be introduced in Section 4.
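The threshold computation and the three masking steps (absolute value, threshold subtraction, clipping) can be sketched as below, using the weights of Fig. 3. The helper names are hypothetical, and the zero padding at subband borders is an assumption, since the text does not specify the boundary handling.

```python
import numpy as np

# 3x3 weighting matrix of Fig. 3: 1/30 everywhere, 1/15 in the centre.
W = np.full((3, 3), 1.0 / 30.0)
W[1, 1] = 1.0 / 15.0

def conv3x3_same(x, w):
    """Same-size 3x3 filtering with zero padding (assumed border handling).

    The kernel is symmetric, so correlation and convolution coincide here.
    """
    x = np.asarray(x, dtype=float)
    padded = np.pad(x, 1, mode="constant")
    rows, cols = x.shape
    out = np.zeros((rows, cols))
    for di in range(3):
        for dj in range(3):
            out += w[di, dj] * padded[di:di + rows, dj:dj + cols]
    return out

def masking_threshold(masker_subbands, m):
    """Threshold map of one scale: m times the sum over the three
    orientation subbands of the filtered absolute masker coefficients."""
    return m * sum(conv3x3_same(np.abs(M), W) for M in masker_subbands)

def apply_masking(target, threshold):
    """Take the absolute value, subtract the threshold, clip negatives to 0."""
    return np.maximum(np.abs(target) - threshold, 0.0)
```

For spatial masking the masker would be the other decoupled image with m = 1; for temporal masking it would be the difference between the CSF-weighted original coefficients of adjacent frames with m = 0.5.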
3.5. DLM, AIM and their combination
The additive impairment measure (AIM) and detail loss measure (DLM) are given by:
$$f_n^{AIM} = \frac{1}{N_p} \sum_{\lambda} \sum_{\theta} \Big[ \sum_{i,j \in center} A_n^{tm}(\lambda,\theta,i,j)^2 \Big]^{1/2}, \quad \theta \neq 1 \tag{11}$$
$$f_n^{DLM} = \frac{\sum_{\lambda} \sum_{\theta} \Big[ \sum_{i,j \in center} S_n^{tm}(\lambda,\theta,i,j)^2 \Big]^{1/2}}{\sum_{\lambda} \sum_{\theta} \Big[ \sum_{i,j \in center} O_n^{csf}(\lambda,\theta,i,j)^2 \Big]^{1/2}}, \quad \theta \neq 1 \tag{12}$$
where $\theta \neq 1$ means that we exclude the use of the approximation subband in spatial pooling, and $(i,j) \in center$ indicates that only the central region of each subband is used, which serves as a simple region of interest (ROI) model. Since additive impairments are relatively independent of the original image content, we assume that visual quality with respect to additive impairments can be predicted by analyzing their intensities without considering the original content. On the other hand, visual quality with respect to detail losses is supposed to be determined by the percentage of visual information losses. Therefore, in (11) and (12) the integrated distortion intensity is normalized by the pixel number $N_p$ and the original image content, respectively. It should be noted that $f_n^{DLM}$ is an approximate calculation of the percentage of visual information losses. Its low complexity makes the proposed metric time efficient.
$f_n^{AIM}$ and $f_n^{DLM}$ are combined by weighted summation:
$$f_n = w \times f_n^{AIM} + f_n^{DLM} \tag{13}$$
The weighting factor $w$ is determined by training, as will be introduced in Section 4. The training result is $w = 27.45$. Considering the typical range of $f_n^{AIM}$ (0 ∼ 0.01) and $f_n^{DLM}$ (0 ∼ 0.3), we can see that $f_n^{AIM}$ and $f_n^{DLM}$ are given similar weights.
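The spatial pooling and the weighted combination can be sketched as follows. The function names are hypothetical, each input is assumed to be a list of 2-D arrays holding the detail subbands of all scales (the approximation subband is left out), and the size of the central region (`keep=0.5`, i.e. the middle half in each dimension) is an assumption because the text does not give it.

```python
import numpy as np

W_AIM = 27.45  # weighting factor reported in the text

def center_crop(x, keep=0.5):
    """Central region of a subband; the fraction `keep` is an assumption."""
    rows, cols = x.shape
    r0 = int(round(rows * (1.0 - keep) / 2.0))
    c0 = int(round(cols * (1.0 - keep) / 2.0))
    return x[r0:rows - r0, c0:cols - c0]

def pooled_energy(subbands, keep=0.5):
    """Sum over subbands of the L2 norm of the central coefficients."""
    return sum(np.sqrt(np.sum(center_crop(np.asarray(b, dtype=float), keep) ** 2))
               for b in subbands)

def aim(a_tm, num_pixels, keep=0.5):
    """Additive impairment measure: pooled intensity per pixel."""
    return pooled_energy(a_tm, keep) / num_pixels

def dlm(s_tm, o_csf, keep=0.5):
    """Detail loss measure: pooled detail loss relative to original content."""
    return pooled_energy(s_tm, keep) / pooled_energy(o_csf, keep)

def frame_score(a_tm, s_tm, o_csf, num_pixels, keep=0.5):
    """Weighted sum of the two measures for one frame."""
    return W_AIM * aim(a_tm, num_pixels, keep) + dlm(s_tm, o_csf, keep)
```

Note that `dlm` divides by the pooled energy of the original frame, so a completely flat original region would need separate handling in a real implementation.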
3.6. Temporal pooling
We take the method used in [13] to perform the temporal pooling. In general, the asymmetric human behavior in responding to quality degradation and improvement is taken into account, that is, human observers are quick to criticize quality degradation and slow to respond to quality improvement. It is achieved by using the following temporal pooling equations:
$$f'_n = \begin{cases} f'_{n-1} + a_- \Delta_n, & \text{if } \Delta_n \le 0 \\ f'_{n-1} + a_+ \Delta_n, & \text{if } \Delta_n > 0 \end{cases} \tag{14}$$
$$s = \frac{1}{N} \sum_{n=1}^{N} f'_n \tag{15}$$
where $\Delta_n = f_n - f'_{n-1}$. As in [13], we set $a_- = 0.04$, and $a_+ = 0.5$.
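The asymmetric recursion and the final averaging can be sketched as follows. The function name is hypothetical, and the initial state (the smoothed score before the first frame is taken to equal the first frame score) is an assumption, since the text does not define it. The per-frame score is a distortion measure, so an increase means quality degradation and is tracked with the larger gain.

```python
def temporal_pool(frame_scores, a_minus=0.04, a_plus=0.5):
    """Asymmetric temporal pooling of per-frame distortion scores."""
    if not frame_scores:
        raise ValueError("frame_scores must not be empty")
    smoothed = frame_scores[0]  # assumed initial state
    total = 0.0
    for f_n in frame_scores:
        delta = f_n - smoothed
        # Distortion increase: react quickly. Decrease: react slowly.
        gain = a_plus if delta > 0 else a_minus
        smoothed = smoothed + gain * delta
        total += smoothed
    return total / len(frame_scores)
```

For instance, a jump in distortion from 0 to 1 moves the smoothed score halfway to 1 in a single frame, whereas a drop from 1 to 0 lowers it only to 0.96.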
4. EXPERIMENTS
In this section we present the predictive performance of the proposed VQM on the subjective video database LIVE [18]. LIVE consists of 10 reference videos and 150 test videos, each of which is distorted by one of four distortion types, i.e., H.264 compression, MPEG-2 compression, wireless transmission error, or IP transmission error, with various distortion intensities. The video resolution is 768 × 432. The database provides a subjective score (difference mean opinion score, i.e., DMOS) for each of the distorted sequences. The subjective scores are derived from subjective viewing tests. They are taken as the ground truth against which the metric outputs are compared to evaluate the predictive performance. It is customary to nonlinearly map the metric scores to ones that have a linear relationship with the subjective scores; owing to the limited paper length, we refer the reader to [18] for the motivation and implementation of this nonlinear mapping. After the nonlinear mapping, we use three objective criteria to measure the correlation between the subjective scores and the nonlinearly mapped objective scores: the Linear Correlation Coefficient (LCC), the Spearman Rank-Order Correlation Coefficient (SROCC), and the Root Mean Squared Error (RMSE). Higher LCC and SROCC values indicate stronger correlation, i.e., better metric performance, while a smaller RMSE value indicates better metric performance.
As mentioned in Section 3, two parameters, i.e., the temporal masking factor $m_t$ and the weighting factor $w$, are determined by training. The training set consists of 45 distorted videos from the LIVE video database, generated from three reference sequences, i.e., Station, Sunflower and Tractor, randomly chosen from the database. The training objective is to maximize the SROCC value of the proposed VQM on the training set. The other 105 distorted videos are used for performance evaluation. The proposed VQM is compared with PSNR, a standardized metric Video Quality Model (VQ model) [19], and a state-of-the-art metric MOVIE [5]. As shown in Tables 1 and 2, the proposed VQM demonstrates the best overall performance, and relatively good performance