RESEARCH Open Access
A packet-layer video quality assessment model
with spatiotemporal complexity estimation
Ning Liao* and Zhibo Chen
Abstract
A packet-layer video quality assessment (VQA) model is a lightweight model that predicts the video quality
impacted by network conditions and coding configuration for application scenarios such as video system planning
and in-service video quality monitoring. It is under standardization in ITU-T Study Group (SG) 12. In this article, we
first differentiate the requirements for VQA model from the two application scenarios, and state the argument that
the dataset for evaluating the quality monitoring model should be more challenging than that for system planning
model. Correspondingly, different criteria and approaches are used for constructing the test datasets, for system
planning (dataset-1) and for video quality monitoring (dataset-2), respectively. Further, we propose a novel video
quality monitoring model by estimating the spatiotemporal complexity of video content. The model takes into
account the interactions among content features, the error concealment effectiveness, and error propagation
effects. Experiment results demonstrate that the proposed model achieves robust performance improvement
compared with the existing peer VQA metrics on both dataset-1 and dataset-2. It is noted that on the more
challenging dataset-2 for video quality monitoring, we obtain a large increase in Pearson correlation from 0.75 to
0.92 and a decrease in the modified RMSE from 0.41 to 0.19.
Keywords: video quality assessment, quality of experience, packet-layer model, spatiotemporal complexity
estimation
1. Introduction
With the development of video service delivery over IP
networks, there is a growing interest in low-complexity
no-reference video quality assessment (VQA) models for
measuring the impact of transmission losses on the per-
ceived video quality. No-reference VQA model generally
uses only the received video with compression and
transmission impairment as model input to estimate the
video quality. No-reference model fits better with the
real-world situation where customers usually watch
IPTV or streaming video without the original video as
reference.
In ITU-T Study Group (SG) 12, there is a recent study
[1] on no-reference objective VQA models (e.g.,
P.NAMS [2], G.OMVS (Opinion Model for Video Streaming),
P.NBAMS [3]) considering impairment caused
by both transmission and video compression. In litera-
tures, depending on the inputs, the no-reference models
can be classified as packet-layer model, bitstream-level
model, media-layer model, and hybrid model, as shown
in Figure 1.
A media-layer model works with the pixel signal. Thus,
it can easily obtain content-dependent features that
influence video quality, such as texture-masking effects
and motion-masking effects. However, a media-layer
model usually needs special solutions (e.g., [4]) for locat-
ing the impaired parts in the distorted video because of
the lack of information on packet loss.
A packet-layer model (e.g., P.NAMS) utilizes various
packet headers (e.g., RTP header, TS header), network
parameters (e.g., packet loss rate (PLR), delay), and
codec configuration information as input to the model.
Obviously, this type of model can roughly locate the
impaired parts by analyzing the packet headers. How-
ever, how to take the content-dependent features into
account is a big challenge to this model.
A bitstream-level model (e.g., P.NBAMS, [5]) uses the
compressed video bitstream in addition to the packet
headers as input. Thus, it is not only aware of the location
* Correspondence: ning.liao@technicolor.com
Media Processing Laboratory, Technicolor Research & Innovation, Beijing,
China
Liao and Chen EURASIP Journal on Image and Video Processing 2011, 2011:5
http://jivp.eurasipjournals.com/content/2011/1/5
© 2011 Liao and Chen; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
of the loss-impaired parts of video, but also has access to
video-content feature and the detailed encoding para-
meters by parsing the video bitstream. It is supposed to be
more accurate than a packet-layer model at a cost of
slightly higher computational complexity. However, in the
case that video bitstream is encrypted, only packet-layer
model works.
Hybrid model uses the pixel signal in addition to the
bitstream and the packet headers to further improve
video quality prediction accuracy. Because the various
error concealment (EC) artifacts become available only
after decoding video bitstream into pixel signal, in prin-
ciple it can provide the most accurate quality prediction
performance. However, it has much higher computa-
tional complexity.
The packet-layer model, which primarily estimates the
video quality impairment caused by unreliable transmis-
sion, is studied in this article.
Two use cases of packet-layer VQA models have been
identified in ITU-T SG12/Q14: video system planning
and in-service video quality monitoring.
As a video system planning tool, parametric packet-
layer model can help to determine the proper video enco-
der parameters and network quality of service (QoS)
parameters. This can avoid over-engineering the applica-
tions, terminals, and networks while guaranteeing users
satisfactory QoE. ITU-T G.OMVS and G.1070 [6] for
videophone service are the examples of the video system
planning model.
For video quality monitoring application, usually opera-
tors or service providers need to ensure video quality ser-
vice level agreement by monitoring and diagnosing video
quality degradation caused by network issues. Since
packet-layer model is computationally lightweight, it can
be deployed in large scale along the media service chain.
The video quality model of ITU-T standard P.NAMS
(Non-intrusive parametric model for the Assessment of
performance of Multimedia Streaming) is specifically
designed for this purpose.
In general, two approaches can be followed in packet-
layer modeling. One is the parameter-based modeling
approach [6-9] and another is the loss-distortion chain-
based modeling approach [5]. The parameter-based
approach estimates perceptual quality by extracting the
parameters of a specific application (e.g., coding bitrate,
frame rate) and transmission packet loss, then building a
relationship between the parameters and the overall video
quality. Obviously, the parametric packet-layer model is by
nature consistent with the requirements of system plan-
ning. However, it predicts the average video quality over
different video contents. The coefficient table of this
model needs to change with the codec type and configura-
tion, the EC strategy of a decoder, the display resolution,
and the video content types. Noticeably, the models in
[6,8,9] were claimed to achieve a very high Pearson corre-
lation above 0.95, and the RMSE lower than 0.3 on the
5-point rating scale or 7 on the 0-100 rating scale, even if
the video content features were not considered in the
models. This motivated us to verify the results and look
into the ways of setting up training and evaluation dataset
on which the model performance directly depends.
The loss-distortion chain-based approach [5] has the merit
of accounting for error propagation, content features, and
EC effectiveness. Since an iterative process is generally
involved, it is suitable for quality monitoring, not for a
system planning model. Keeping the computational com-
plexity low, which is very important for in-service monitoring,
is one challenge for this approach. Another challenge is
to estimate the video content and compression informa-
tion at the packet layer. Our proposed model follows this
approach and addresses both challenges.
The main contributions of this article are in two
aspects. First, we differentiate the requirements for
packet-layer model from two application scenarios: video
Figure 1 Scope of the four types of VQA models. The columns are four types of input information to the models: general codec information (codec type, frame rate, bitrate, error concealment method, ...); packet headers (RTP, TS, and PES headers); the compressed video bitstream (quantization parameters, frame type, macroblock coding mode, motion vectors, ...); and the decoded video signal (various error concealment artifacts, ...).
system planning and video quality monitoring. We design
the respective criteria and methods to select the pro-
cessed video sequences (PVSs) for subjective evaluation
when setting up the subjective mean opinion score
(MOS) database. This helps us to explain why the above-
mentioned parametric packet-layer models had a high
performance even if the video content feature was not
taken into consideration. Furthermore, we state the argu-
ment that the dataset for evaluating the video quality
monitoring model should be more challenging than that
for video system planning model.
Second, we propose a novel quality monitoring model,
which has low complexity and fully utilizes the video
spatiotemporal complexity estimation at packet layer. In
contrast to the parametric packet-layer models, it takes
into consideration the interactions among video content
features, EC effects, and error propagation effects, thus
improving estimation accuracy.
The rest of the article is organized as follows. In Section
2, we review several literatures that motivated this study.
The novelty of this study is then discussed. In Section 3,
two different criteria and methods are used to set up
respective datasets for monitoring and planning scenarios.
In Section 4, the proposed VQA model is described.
Experimental results are discussed in Section 5. Conclu-
sions and future work are discussed in Section 6.
2. Related work
The recent studies [10-13] are related to the
idea of our proposed model. In [10,11], the contributing
factors to the visibility of artifacts caused by lost packet(s)
were studied; video quality metrics based on the visibility
of packet loss were developed in [12,13].
The factors to the visibility of a single packet loss were
studied in [10] for MPEG-2 compressed video. The top
three most important factors were the magnitude of over-
all motion which is the average across all macroblocks
(MBs) initially affected by loss, the type (I, B, or P) of the
frame (FRAMETYPE) in which packet loss occurred, and
the initial MSE (IMSE) of the error-concealed pixels.
Further, the visibility of multiple packet losses in H.264
video was studied in [11]. Again, the IMSE and the FRA-
METYPE are identified as the most important factors to
the visibility of losses. Besides, it was shown that the IMSE
is very different because of the different concealment stra-
tegies [11]. It can be seen that the accurate detection of
the initial visible artifacts (IVA) and the error propagation
effects are two important aspects to be considered in a
packet-layer VQA model. Furthermore, the different EC
effects should be considered when estimating the annoy-
ance level of IVA.
Yamada et al. [12] developed a no-reference hybrid
video quality metric based on the count of the MBs for
which the EC algorithm of a decoder is identified as
ineffective. Classifying lost MBs based on the error-
concealment effectiveness can be essentially regarded as
an operation to classify the visibility of the artifacts caused
by packet loss(s). Suresh [13] reported that the simple
metric of mean time between visible artifacts has an aver-
age correlation of 0.94 with subjective video quality.
There are two major novel points in our proposed
model. First, the IVA of a frame suffering from packet loss
and EC is estimated based on the EC effectiveness. Unlike
[12], the EC effectiveness is determined based on the
spatiotemporal complexity estimation with packet-layer
information; and the different EC effects are considered.
Second, the IVA is incorporated into an error propagation
model to predict the overall video quality. The estimate of
spatiotemporal complexity is employed to modulate the
propagation of the IVA in the error propagation model.
The performance gain resulting from the spatiotemporal
complexity-based IVA assessment and from using the
error propagation model is analyzed in the experiment
section.
3. Subjective dataset and analysis
As described above, the packet-layer video QoE assess-
ment model has two typical application scenarios, video
system planning and in-service video quality monitoring,
each of which has different requirements. The video
system planning model is for network QoS parameter
planning and video coding parameter planning, given a
target video quality. It predicts average perceptual qual-
ity degradation, ignoring the impact of different distor-
tion and content types on the perceived quality.
Therefore, it should predict well the quality of the loss-
affected sequences with large occurrence probability.
In contrast, the VQA model for monitoring purposes is
expected to give quality degradation alarms with high
accuracy and should be able to estimate as accurately as
possible the quality of each specific video sequence dis-
torted by packet losses. Correspondingly, the respective
subjective dataset for training and evaluating the plan-
ning model and the monitoring model should be built
differently. Further analysis of the PVSs in Sections 3.3
and 3.4 illustrates that the different EC effects and the
different error propagation effects are two of the most
important factors to the perceptual quality of packet-
loss distorted videos.
There are mutual influences between the perception
of coding artifacts and that of transmission artifacts
especially at low coding bitrate [14]. In our subjective
database, visible coding artifacts are excluded by setting
the quantization parameter (QP) to a sufficiently small
value. Only the video quality degradation caused by
transmission impairments is discussed in this article.
3.1 Subjective test
Video QoE is both an application-oriented and a user-
oriented assessment [15]. Viewers' individual interests,
quality expectations, and service experience are among
the contributing factors to the perceived quality. To
compensate for the subjective variance of these factors,
usually the MOS averaged over a number of viewers (called
subjects hereafter) is used as the quality indication of a
video sequence. Moreover, to minimize the variance of
subjects' opinions caused by these factors, the subjective test
should be conducted in a well-controlled environ-
ment; subjects should be well instructed about the task
and the video application scenario, which influences the
subjects' expectation of video quality.
The absolute category rating with hidden reference
method specified in ITU-P.910 [16] is adopted in our
experiment. It is a single stimulus method where a pro-
cessed video is presented alone. The five scales shown in
Figure 2 are used for evaluating the video quality.
Observers are instructed to focus on watching video
program instead of scrutinizing visual artifacts. Before
the subjective test, observers are required to watch 20
training sequences that evenly cover the five scales, and
to write down their understanding of the verbal scales
in their own words. Interestingly, most of the
descriptions of the five scales are heavily related to video
content, not merely to the amount of noticeable
artifacts as described in [17]. The descriptions can be
summarized as follows:
- Imperceptible: no artifact (or problematic area) can
be perceived during the whole video display period.
- Perceptible but not annoying: artifacts can be per-
ceived occasionally, but they do not affect the content of
interest, or they appear in the background for an
instant.
- Slightly annoying: a noticeable artifact appearing
in the region of interest (ROI) is identified, or noticeable
artifacts are detected at several instants even if
they do not appear in the ROI.
- Annoying: noticeable artifacts appear in the ROI sev-
eral times, or many noticeable artifacts are detected and
last for a long time.
- Very annoying: the video content cannot be understood
well due to artifacts, and the artifacts spread all over the
sequence.
Twenty-five non-expert observers are asked to rate the
quality of the selected 177 PVSs of 10 s. The scores given
by these subjects are processed to discard subjects who
are suspected to have voted randomly. Then for each
PVS, a subjective MOS and a 95% confidence interval
(CI) are computed using the scores of the valid subjects.
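The article does not spell out the CI computation; a minimal sketch of the per-PVS MOS and CI95 computation, assuming the common normal-approximation interval (1.96 σ/√n) over the valid subjects' ratings, is:

```python
import math

def mos_and_ci95(scores):
    """Mean opinion score and 95% confidence interval half-width
    for one PVS, from the ratings of the valid subjects."""
    n = len(scores)
    mos = sum(scores) / n
    # Sample variance of the individual ratings.
    var = sum((s - mos) ** 2 for s in scores) / (n - 1)
    # Normal-approximation 95% CI half-width: 1.96 * sigma / sqrt(n).
    ci95 = 1.96 * math.sqrt(var / n)
    return mos, ci95

mos, ci = mos_and_ci95([4, 5, 4, 3, 4, 4, 5, 4])
```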
As shown in Figure 3, for PVSs of middle quality, the
subjectivity variation is higher; for sequences of very
good or very bad quality, the subjects tend to reach a
more consistent opinion with high probability. This
observation is similar to the previous report in [14]. Since
the subjective MOS itself has statistical uncertainty
because of the abovementioned subjective factors, it is
reasonable to allow a certain prediction error (e.g., less
than CI95) when evaluating the prediction accuracy of an
objective model. Therefore, the modified RMSE [18]
described later in Equation 8 is used in our experiment.
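Equation 8 itself appears later in the article; as an illustration, the modified RMSE of [18] is commonly the epsilon-insensitive RMSE of ITU-T P.1401, in which prediction errors smaller than a PVS's CI95 go unpenalized. A sketch under that assumption (the degrees-of-freedom parameter d is illustrative):

```python
import math

def modified_rmse(mos, mos_pred, ci95, d=1):
    """Epsilon-insensitive RMSE: per-PVS prediction errors within the
    95% confidence interval of the subjective MOS are not penalized
    (cf. ITU-T P.1401)."""
    perr = [max(abs(m - p) - c, 0.0) for m, p, c in zip(mos, mos_pred, ci95)]
    n = len(perr)
    return math.sqrt(sum(e * e for e in perr) / (n - d))
```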
3.2 Select PVSs for dataset
Six CIF format video contents, which cover a wide range
of the spatial complexity (SC) index and temporal complexity
(TC) index [19], are used as original sequences, namely
Foreman, Hall, Mobile, Mother, News, and Paris. The six
sequences are encoded using an H.264 encoder with two
Figure 2 Five-point impairment scale of perceptual video
quality: 5, imperceptible; 4, perceptible but not annoying; 3,
slightly annoying; 2, annoying; 1, very annoying.
Figure 3 Standard deviations of MOSs; each point corresponds
to the standard deviation of the MOS of a PVS.
sequence structures, namely, IBBPBB and IPPP. The group of
pictures (GOP) size is 15 frames. A proper fixed QP is
used to keep the compressed video free of visible coding
artifacts. Each row of MBs is encoded as an individual
slice, and one slice is encapsulated into an RTP packet.
To simulate transmission error, the loss patterns gener-
ated at five PLRs (0.1, 0.4, 1, 3, and 5%) in [17] are used.
For each nominal PLR, 30 channel realizations are gener-
ated by starting to read the error pattern file at a random
point. Thus, for each original sequence, there are 150
realizations of packet loss corrupted sequences. Before
subjective evaluation test, we must choose some typical
PVSs from the large numbers of realizations.
Owing to the different requirements of planning and
monitoring scenarios, we choose the PVSs for subjective
test according to different criteria:
1. For each video content, select the PVSs that are
representatives of the dominant MOS-PLR distribu-
tion as done in [17];
2. For each video content, select the PVSs that cover
the MOS-PLR distribution widely by including the
PVSs of the best and the poorest quality at a given
PLR level, in addition to those representing the
dominant MOS-PLR distribution.
Actually, when we select the PVSs for the subjective test,
the subjective MOSs of the abovementioned 150
sequences are not yet available. The
objective measurement PSNR is used as a substitute for
MOS in the initial selection of PVSs; then the PVSs
selected in the initial round are watched and adjusted if
necessary to make sure that the subjective qualities of the
selected PVSs satisfy the above criteria. The PVSs chosen
by criteria-1 and criteria-2 are collectively named as data-
set-1 and dataset-2, respectively. Figure 4 shows the PLR-
MOS distribution and PSNR-MOS distribution of dataset-
1 and dataset-2. The PLR here is calculated as the ratio of
actually lost packets to the total transmitted packets for a
PVS. It can be seen that the PVSs in dataset-2 present
much more diverse relationship between PLR and subjec-
tive video quality than those in dataset-1. Because the
scales of "annoying" and "very annoying" are equally unac-
ceptable in real-world applications, we selected sequences
mostly with MOSs ranging from 2 to 5, as shown in
Figure 4a,b. It is noted that, in the subjective test, one
sequence scored one point for each video content is
included in each test session to balance the range of rating
scales, although they are not included in the datasets as
drawn in Figure 4.
In Figure 4c, the PLR-PSNR distributions for all six
video contents spread away from each other, whereas
in Figure 4a the PLR-MOS distributions for most
video contents are mixed together. This phenomenon
partially illustrates that the PSNR is not a good objective
measurement of video quality because it fails to take
into consideration the impact of video content feature
on human perception of video quality.
Figure 4b shows that PVSs present very different per-
ceptual qualities in dataset-2 even under the same PLR.
Taking the PLR of 0.86% as an example, the MOSs
vary from Grade 2 to Grade 4. PLR treats all lost data
as equally important to perceived quality, ignoring the
influence of content and compression on perceived qual-
ity. It may be an effective feature on dataset-1, as shown
in Figure 4a, but it is not an effective feature on dataset-2
for quality monitoring applications.
Unlike [6,8,9], our proposed objective model targets the
video quality monitoring application. The objective
model for monitoring purpose should be able to estimate
as accurately as possible the video quality of each specific
sequence distorted by packet loss. Correspondingly, the
dataset for evaluating the model performance should be
more challenging than that for planning model, i.e., the
proposed model should work well not only on dataset-1
but also on dataset-2.
3.3 Impact of EC
Both the duration and the annoyance level of the visible
artifacts contribute to the perceived video quality degrada-
tion. The annoyance level of artifacts produced by packet
loss depends heavily on the EC scheme of a decoder. The
goal of EC is to estimate the missing MBs in a compressed
video bitstream with packet losses, in order to provide a
minimum degree of perceptual quality degradation. EC
methods that have been developed roughly fall into two
categories: spatial EC approach and temporal EC
approach. In the spatial EC class, spatial correlation
between local pixels is exploited; missing MBs are recov-
ered by interpolation from neighbor pixels. In the tem-
poral EC class, both the coherence of the motion field and the
spatial smoothness of pixels along edges across block
boundaries are exploited to estimate the motion vector (MV) of
a lost MB. In the H.264 JM reference decoder, the spatial
approach is applied to conceal lost MBs of an intra-coded
frame (I-frame) using a bilinear interpolation technique;
the temporal approach is applied to conceal lost MBs of
inter-predicted frames (P-frames, B-frames) by estimating the
MV of a lost MB from the neighboring MBs' MVs. The
minimum boundary discontinuity criterion is used to
select the best MV estimate.
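The JM decoder's actual implementation is more elaborate; the following simplified sketch (hypothetical helper names, plain luma-sample arrays) illustrates the minimum boundary discontinuity criterion for selecting a concealment MV:

```python
def boundary_discontinuity(candidate_block, top_row, left_col):
    """Sum of absolute differences between the candidate block's outer
    pixels and the correctly received neighboring pixels."""
    cost = sum(abs(a - b) for a, b in zip(candidate_block[0], top_row))
    cost += sum(abs(row[0] - b) for row, b in zip(candidate_block, left_col))
    return cost

def conceal_mb(ref_frame, y, x, size, candidate_mvs, cur_frame):
    """Pick the candidate MV whose motion-compensated block best
    matches the received boundary pixels (minimum discontinuity)."""
    top_row = cur_frame[y - 1][x:x + size]
    left_col = [cur_frame[y + k][x - 1] for k in range(size)]
    best_mv, best_cost = None, float("inf")
    for dy, dx in candidate_mvs:
        block = [ref_frame[y + dy + k][x + dx:x + dx + size]
                 for k in range(size)]
        cost = boundary_discontinuity(block, top_row, left_col)
        if cost < best_cost:
            best_mv, best_cost = (dy, dx), cost
    return best_mv
```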
Visible artifacts produced by spatial EC scheme and by
temporal EC scheme are very different. In general, spatial
EC approach produces blurred estimates of the lost MB
as shown in Figure 5a, while the temporal EC approach
produces edge artifacts as shown in Figure 5b, if the
guessed MV is not accurate. The effectiveness of spatial
EC scheme is significantly affected by SC of the frame
with loss, while that of the temporal EC scheme is signifi-
cantly affected by motion complexity around the lost
area. In Figure 5c, although the fourth row of MBs is lost,
almost no visual quality degradation can be perceived
because of the stationary nature of the lost content.
Whereas, in Figure 5e, slightly noticeable artifacts appear
in the area near the mother's hand, because of the inconsis-
tent motion of the lost MBs and their neighboring MBs. In
Figure 5d, the second row of MBs is lost, but this results in
hardly noticeable artifacts, because the lost con-
tent has smooth texture.
3.4 Impact of error propagation
The duration of visible artifact depends on the error
propagation effects resulting from the inter-frame pre-
diction technique used in video compression. For the
same encoder configuration and channel conditions,
Figure 6 shows that the error propagation effects vary
significantly depending on different video contents, in
particular, on the SC and the TC of the video content.
For example, the 93rd frame, in which four packets are
lost, is a P-frame. Because the head moves significantly in the
ensuing frames of sequence foreman, the error in the
P-frame is propagated up to the 120th frame, which cor-
responds to about 1 s. Even if there is a correctly
received I frame at the 105th frame, the error is still
propagated to the 120th frame because of large motion,
two reference frames, and open GOP structure. In con-
trast, for sequence hall and mother having small motion,
propagated artifacts are almost invisible.
In general, an I-frame packet loss results in artifact
duration of GOP length, or even longer if open GOP
structure is used in compression configuration. The
more intra-coded MBs exist in inter-coded frames, the
more easily the video quality recovers from error, and
the shorter the artifact duration is. In general, the
Figure 4 The processed sequences selected by criteria-1 and criteria-2. (a) PLR-MOS of dataset-1 selected by criteria-1; (b) PLR-MOS of dataset-2 selected by criteria-2; (c) PLR-PSNR of dataset-1 selected by criteria-1; (d) PLR-PSNR of dataset-2 selected by criteria-2.
artifact duration caused by P-frame packet loss is less
than that by I-frame packet loss. However, the impact of
a P-frame packet loss can be significant, if large motion
exists in the packet and/or the packets temporally adja-
cent to it. The artifacts caused by a B-packet loss, if
noticeable, look like an instant glitch, because there is
no error propagation from B-frame and the artifacts last
merely for 1/30 s. When the motion in a lost B slice is
low, there are no visible artifacts at all.
4. VQA model with spatiotemporal complexity
estimation
Both the effects of EC and the effects of error propaga-
tion have a close relationship with the spatiotemporal
complexity of the lost packets and their spatiotemporally
adjacent packets. To improve the prediction accuracy of a
packet-layer VQA model in the quality monitoring case,
influence from video content property, EC strategy, and
error propagation should be taken into consideration as
much as possible. The proposed objective quality assess-
ment model is based on the video spatiotemporal com-
plexity estimation.
4.1 Spatiotemporal complexity estimation
For a video frame indexed as i, the parameter set π_i,
including the frame size s_i, the number of total packets N_i,total,
the number of lost packets N_i,lost, and the locations of lost
packets in the frame, is calculated or recorded. The location
of lost packets in a video frame is detected with the assis-
tance of the sequence number field of RTP header. To
identify different frames, the timestamp in RTP header is
used. The frame size includes both lost packet size and
received packet size. For a lost I-frame packet, its size is
estimated as the average of the two spatially adjacent I-
frame packets that are correctly received or equal to the
Figure 5 Illustration of EC effectiveness. (a) Artifacts produced by spatial EC technique; (b) artifacts produced by temporal EC technique in
area with camera pan; (c) no visible artifacts due to the stationary nature of the lost MBs; (d) very slightly noticeable artifacts produced by
spatial EC technique in area with smooth texture; (e) noticeable artifacts only in small area produced by temporal EC technique.
Figure 6 MSE per frame for different video sequences under
the same test condition.
size of the spatially adjacent I-frame packet if there is only
one spatially adjacent I-frame packet correctly received.
For a lost P-frame packet, its size is estimated as the aver-
age size of the two temporally adjacent collocated P-frame
packets that are correctly received. A similar method is used
to estimate the size of a lost B-frame packet.
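This neighbor-averaging size estimate can be sketched as follows (hypothetical function name; None marks a lost packet whose size is unknown, and the adjacent packets are assumed received):

```python
def estimate_lost_size(sizes, idx):
    """Estimate the size of lost packet `idx` from its received
    neighbors (spatially adjacent for I-frames, temporally collocated
    for P/B-frames): average both neighbors if received, otherwise
    copy the single received neighbor."""
    prev = sizes[idx - 1] if idx > 0 else None
    nxt = sizes[idx + 1] if idx + 1 < len(sizes) else None
    if prev is not None and nxt is not None:
        return (prev + nxt) / 2.0   # average of the two received neighbors
    if prev is not None:
        return prev                 # only one neighbor received
    return nxt
```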
The SC and the TC of a slice encapsulated in a
packet, which can be roughly reflected by the packet
size variation, are estimated using an adaptive threshold-
ing method as shown in Figure 7. In general, the I-frame
size is much larger than the P-frame size, and the P-frame
size is larger than the B-frame size. However, when the texture in
an I-frame is very smooth, the size of the I-frame is
small, depending on the QP used. In the extreme case
that the objects in a P-frame are almost stationary, the
size of the P-frame can be as small as that of a B-frame;
in the other extreme case, where the objects in a P- or B-
frame are rich in texture and diverse motion, the size of
the P- or B-frame can be as large as that of an I-frame.
In our database, each row of MBs is encoded as a slice;
therefore, each detected lost slice is classified with an SC
or TC level using adaptive thresholds.
For a P- or B-slice, if the slice size is larger than the
threshold Thrd_I, then the slice is classified as a high-TC
slice; otherwise, if the slice size is larger than the threshold
Thrd_P, then the slice is classified as a medium-TC slice;
otherwise, the slice is classified as a low-TC slice. The two
thresholds are adapted from the empirical equations
[20] below. The variable av_nbytes is the average frame
size in a sliding window, max_iframe is the
maximum I-frame size, and nslices is the number of
slices per frame.

Thrd_I = ((max_iframe × 0.995/4 + av_nbytes × 2)/2)/nslices    (1)

Thrd_P = (av_nbytes × 3/4)/nslices    (2)
For an I-slice, if its size is smaller than thrd_smooth, the slice is classified as a smooth-SC slice; otherwise, as an edged-SC slice. The threshold thrd_smooth is a function of the coding bitrate. In our experiment, thrd_smooth is set to 200 bytes for CIF format sequences coded with an H.264 encoder and QP equal to 28.
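Combining Equations 1 and 2 with the classification rules above, a sketch of the slice-complexity classifier might look like this (function names are ours; the 200-byte thrd_smooth default is the paper's CIF/H.264/QP-28 setting):

```python
def tc_class(slice_size, av_nbytes, max_iframe, nslices):
    """Classify a lost P/B-slice into a temporal-complexity (TC) level
    using the adaptive thresholds of Equations 1 and 2."""
    thrd_i = ((max_iframe * 0.995 / 4 + av_nbytes * 2) / 2) / nslices
    thrd_p = (av_nbytes * 3 / 4) / nslices
    if slice_size > thrd_i:
        return "high-TC"
    if slice_size > thrd_p:
        return "medium-TC"
    return "low-TC"

def sc_class(islice_size, thrd_smooth=200):
    """Classify a lost I-slice by spatial complexity (SC); in general
    thrd_smooth should be adapted to the coding bitrate."""
    return "smooth-SC" if islice_size < thrd_smooth else "edged-SC"
```

For example, with an average frame size of 10000 bytes, a maximum I-frame size of 40000 bytes, and 18 slices per frame, Thrd_I is about 832 bytes and Thrd_P about 417 bytes, so a 1000-byte lost slice is classified as high-TC and a 100-byte one as low-TC.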
4.2 Objective assessment model
The building block diagram of the proposed model is
shown in Figure 8. The packet information analysis block uses the RTP/UDP header information to derive a set of parameters π_i for each frame. These parameters and the encoder configuration information are used by the visible artifacts detection module to calculate the level of visible artifacts (LoVA) of each frame. The encoder configuration information includes the GOP structure, the number of reference frames, and error resilience tools such as the slicing mode and the intra refresh ratio. For a sequence of t seconds, we calculate the mean LoVA (MLoVA) and map it to an objective MOS value with a second-order polynomial function, which is trained using the least-squares fitting technique. The results in [13] showed that the simple metric of mean time between visible artifacts has an average correlation of 0.94 with MOS. Thus, simple averaging is used as the temporal pooling strategy in our model.
For the ith frame, the LoVA is modeled as the sum of the initial visible artifacts (IVA) V^0_i caused by the loss of the packets of the current frame and the propagated visible artifacts (PVA) V^P_i due to error propagation from the reference frames, as shown in Equation 3. It is assumed here that the visible artifacts caused by current-frame packet loss and by reference-frame packet loss are independent.

V_i = V^0_i + V^P_i   (3)
Figure 7 Illustration of the frame-by-frame slice complexity classification based on the adaptive thresholds. The 14th slice of foreman
bitstream coded with IPPP GOP structure.
The IVA V^0_i is calculated by

V^0_i = ( Σ_{j=1}^{N_{i,lost}} w^location_{i,j} × w^EC_{i,j} ) / N_{i,total}   (4)
Depending on the location of the lost packets in one frame, a different weight w^location_{i,j} is assigned to each lost packet (i.e., lost slice, because one coded slice is encapsulated in one RTP packet in our dataset). The location weight allows us to differentiate the slice containing the attention focus from the others. In experiments, we found that the contribution of the location weight to the performance gain is small compared with the EC and EP weights, so we simply set the location weight to 1. w^EC_{i,j} is the EC weight, which reflects the effectiveness of the EC technique. As discussed in Section 3.3, the visible artifacts produced by the temporal EC approach and the spatial EC approach are quite different and correspondingly present different levels of annoyance; the blurring artifacts of spatial EC are generally more annoying than the edged artifacts of temporal EC. Further, the EC effectiveness depends on the SC and the TC of the lost slices. A lost I-slice with smooth texture can be concealed well, with few visible artifacts, by the bilinear interpolation-based spatial EC technique. A lost P- or B-slice with zero MV, or with the same MV as its adjacent slices, can be recovered well with few noticeable artifacts by the temporal EC technique. It is reported in [10] that, when the IVA is above the medium level, increasing the distance between the current frame with packet loss and the reference frame used for concealment increases the visibility of the packet-loss impairment. Therefore, we applied different weights to P-slices of the IBBP GOP structure and those of the IPPP GOP structure. In summary, the weight w^EC_{i,j} is set according to the EC method used and the SC/TC classification, as in Table 1. As shown in Figure 5a,b, the perceptual annoyance of the artifacts produced by the spatial EC method and the temporal EC method is almost at the same level, so we applied the same weight to lost slices of the edged-SC type and those of the H-TC type. In the experiment, the values a_1 to a_5 are set empirically to 0.01, 1, 0.01, 0.1, and 0.3, in order to reflect the relative annoyance of the respective typical artifacts on an artifact scale ranging from 0 to 1.
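As a sketch, the IVA of Equation 4 with the Table 1 weights (a_1 to a_5 as given above) might be computed like this; the dictionary keys and function names are our own labels, not the paper's code:

```python
# EC weights a1..a5 from Table 1, with the empirical values given in the text.
EC_WEIGHT = {
    "smooth-SC": 0.01,   # a1: spatial EC conceals smooth texture well
    "edged-SC": 1.0,     # a2: spatial-EC blurring on edges is highly visible
    "L-TC": 0.01,        # a3: temporal EC hides near-static content
    "M-TC-IPPP": 0.1,    # a4
    "M-TC-IBBP": 0.3,    # a5
    "H-TC": 1.0,         # a2 reused: high temporal complexity
}

def initial_visible_artifacts(lost_slice_classes, n_total, w_location=1.0):
    """Equation 4: the IVA of frame i as the location- and EC-weighted
    fraction of lost slices (the location weight is 1 in the paper)."""
    return sum(w_location * EC_WEIGHT[c] for c in lost_slice_classes) / n_total
```

For a frame of 18 slices that loses one H-TC slice and one L-TC slice, the IVA is (1 + 0.01)/18 ≈ 0.056.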
The PVA is zero for an I-frame, because an I-frame is coded with intra-frame prediction only. For the inter-frame predicted P/B frames, the PVA V^P_i is calculated as

V^P_i = ( Σ_{j=1}^{N_{i,total}} E^prop_{i,j} × w^EP_i ) / N_{i,total}   (5)

E^prop_{i,j} denotes the amount of visible artifacts of the reference frames. Its value depends on the encoder configuration information, i.e., the GOP structure and the number of reference frames. Taking the IPPP structure with two reference frames as an example, E^prop_{i,j} is calculated as

E^prop_{i,j} = (1 − b) × V_{i−1,j} + b × V_{i−2,j}   (6)

where b is the weight for the propagated error from the respective reference frames. For our datasets, b = 0.75 for P frames, and b = 0.5 for B frames.
The weight w^EP_i modulates the propagation of the reference frames' artifacts into the current frame. The reference frames' artifacts may attenuate because of error-resilience tools such as Intra MB Refresh, or because more prediction residual is left in the ensuing frames. Whether more Intra-MBs are used or more prediction-residual information remains in the compressed bitstream of the current slice, the current slice will contain more bytes than a slice with fewer Intra-MBs and easy-to-predict content. Therefore, the value of w^EP is set according to the spatiotemporal complexity of the frame, as in Table 2. In the experiment, b_1 is set to 1, which means no artifact attenuation, and b_2 is set to 0.5, which means the visible artifacts attenuate by half.

Finally, the value of V_i is clipped to [0, 1]. The LoVA value of the frame is recorded in a frame queue, with frames placed in the queue according to their display order.
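A compact sketch of the per-frame recursion of Equations 3, 5, and 6 for the IPPP case follows; the function names and the per-slice bookkeeping are simplified assumptions, not the paper's implementation:

```python
def propagated_artifact(v_ref1, v_ref2, b, w_ep):
    """Equations 5-6 for one slice j (IPPP, two reference frames):
    blend the per-slice artifacts of the two reference frames with
    weight b, then attenuate by the complexity-based weight w_ep."""
    return ((1 - b) * v_ref1 + b * v_ref2) * w_ep

def frame_lova(v0_i, prop_terms, n_total):
    """Equation 3: total LoVA of frame i, clipped to [0, 1].
    prop_terms holds the per-slice propagated artifacts of Equation 5."""
    v = v0_i + sum(prop_terms) / n_total
    return max(0.0, min(1.0, v))
```

For example, with reference-slice artifacts 0.4 and 0.8, b = 0.75, and w_ep = 0.5 (the b_2 attenuation for high-TC frames), the propagated term is 0.35; adding severe current-frame loss can drive V_i past 1, where it is clipped.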
Figure 8 Building block diagram of the proposed model. (Blocks: packet information analysis → visible artifacts detection for each video frame → calculate mean LoVA → mapping MLoVA to a MOS value; inputs: packet-layer information and encoder configuration information, with a parameter set per frame; output: objective video quality value.)
Table 1 The value of w^EC_{i,j} depending on EC method and SC/TC classification

Spatial EC method:  Smooth-SC → a_1;  Edged-SC → a_2
Temporal EC method: L-TC → a_3;  M-TC & IPPP structure → a_4;  M-TC & IBBP structure → a_5;  H-TC → a_2
When the time interval of t seconds is reached, the algorithm calculates the mean LoVA by

MLoVA = ( (1/M) Σ_{i=1}^{M} V_i ) × f_r   (7)

where M is the total number of frames in t seconds and f_r is the frame rate of the video sequence.
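Under our reading of Equation 7 (the per-frame mean scaled by the frame rate f_r, i.e., visible artifacts per second of the t-second window), the temporal pooling reduces to:

```python
def mlova(frame_lovas, frame_rate):
    """Equation 7 as read above: mean per-frame LoVA over the t-second
    window (M = t * frame_rate frames), scaled by the frame rate."""
    return sum(frame_lovas) / len(frame_lovas) * frame_rate
```

For two frames with LoVA 0.1 and 0.3 at 25 frames/s, the pooled value is 0.2 × 25 = 5.0.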
5. Experimental results
First, we compare the correlation between the subjective MOS and some affecting parameters used in the existing packet-layer models. These parameters include the PLR [6], the burst loss frequency (BLF) [8], and the invalid frame ratio (IFR) [21]. In the existing work, these parameters and other video coding parameters, such as the coding bitrate and frame rate, are modeled together. To compare fairly the performance of the above parameters, which reflect transmission impairment, coding artifacts are avoided by properly setting the QP in our datasets.
Two metrics, the Pearson correlation and the modified RMSE shown in Equation 8, are used to evaluate performance. In the ITU-T test plan draft [18], it is recommended to take the modified RMSE as the primary metric and the Pearson correlation as informative. The purpose of the modified RMSE is to remove from the evaluation the possible impact of the uncertainty of the subjective scores. The per-sample error is described as:

Perror(i) = max(0, |MOS(i) − MOSp(i)| − CI95(i))   (8)

The final modified RMSE* is calculated as usual, but based on Perror, with the equation below:

rmse* = sqrt( (1/(N − d)) Σ_{i=1}^{N} (Perror(i))^2 )   (9)

where the index i denotes the video sample, N denotes the number of samples, and d the number of degrees of freedom. The degree of freedom d is set to 1 because we did not apply any fitting method to the predicted MOS scores before comparing them with the subjective MOS.
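A sketch of the modified-RMSE computation of Equations 8 and 9 (assuming, as in ITU-T practice, that the per-sample error is the prediction error beyond the 95% confidence interval of the subjective score):

```python
import math

def modified_rmse(mos, mos_pred, ci95, d=1):
    """Equations 8-9: per-sample error is the distance of the predicted
    MOS from the subjective MOS beyond its 95% confidence interval;
    pooling uses N - d degrees of freedom (d = 1 here, since no fitting
    is applied to the predicted scores)."""
    n = len(mos)
    p_err = [max(0.0, abs(m - p) - ci) for m, p, ci in zip(mos, mos_pred, ci95)]
    return math.sqrt(sum(e * e for e in p_err) / (n - d))
```

A prediction that falls inside a sample's confidence interval contributes zero error, so only statistically meaningful deviations are penalized.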
When evaluating the performance of the features on dataset-1 or dataset-2, the dataset is partitioned into a training sub-dataset and a validation sub-dataset in a 50%/50% proportion for the cross-evaluation process. The Pearson correlations and modified RMSEs in Tables 3 and 4 are the average performance over 100 runs of the cross-evaluation process.
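The cross-evaluation protocol can be sketched as follows; the split helper and the Pearson computation are generic building blocks (ours, not the paper's code), and the second-order polynomial fitting on the training half is left to the caller:

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def split_50_50(n_samples, rng):
    """One random 50%/50% train/validation partition of sample indices;
    repeated (e.g., 100 times) to average the performance metrics."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    half = n_samples // 2
    return idx[:half], idx[half:]
```

Each run fits the MLoVA-to-MOS mapping on the training indices and scores the Pearson correlation and modified RMSE on the validation indices.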
The results using least-squares curve fitting are shown in Table 3. From Figure 9, it can be seen that the correlation between the subjective MOSs and PLR/BLF/IFR reaches up to 0.94 on dataset-1, but is only about 0.75 on dataset-2. This shows that the features PLR/BLF/IFR are effective for video system planning modeling, but not for the quality monitoring model.

It can also be seen in Figure 9 that the metric proposed in our model, MLoVA, is more consistent with the subjective MOS. When a second-order polynomial function is used to fit the curve, the (correlation, RMSE) pair of predicted versus subjective MOS is (0.96, 0.12) on dataset-1 and (0.93, 0.17) on dataset-2. Figure 10 shows the predicted MOS compared with the subjective MOS. This demonstrates that the proposed model performs robustly on both datasets.
Second, the contributions of two factors, namely the EC effectiveness and the EP model, are quantified on dataset-2. If we set the weights for EC effectiveness to one in Equation 4 and ignore the propagated-artifacts term by setting it to zero in Equation 3, then the MLoVA regresses to the PLR, where all data losses are regarded as equally important to perceptual quality. As described in Section 3, the EC strategy employed at the decoder can hide the visible artifacts caused by packet loss to a degree that depends on the spatiotemporal complexity of the lost content. When the complexity-estimation-based EC weights are applied to calculate the IVA, while the propagated-error term is still ignored, it is shown in Figure 10b that the correlation of the mean IVA (MIVA) with the subjective MOS is 0.86, and the modified RMSE is reduced to 0.27. The performance is significantly improved compared with the PLR. Further, the improvement brought by incorporating the error propagation model of Equation 5 was evaluated. As we know,
Table 2 The value of w^EP_i depending on TC classification

L-TC & M-TC → b_1;  H-TC → b_2
Table 3 The correlation and modified RMSE between different artifact features and subjective MOS

Feature   RMSE* (Dataset-1 / Dataset-2)   Pearson correlation (Dataset-1 / Dataset-2)
PLR       0.1636 / 0.4094                 0.9397 / 0.7544
BLF       0.1622 / 0.4082                 0.9409 / 0.7558
IFR       0.2456 / 0.4185                 0.8973 / 0.7388
MLoVA     0.1158 / 0.1932                 0.9591 / 0.9174

Table 4 Quantitative analysis of the contribution from EC effectiveness estimation and EP model

Feature   RMSE* (Dataset-1 / Dataset-2)   Pearson correlation (Dataset-1 / Dataset-2)
PLR       0.1647 / 0.4095                 0.9396 / 0.7511
MIVA      0.1559 / 0.2897                 0.9408 / 0.8504
MLoVA_0   0.1478 / 0.2375                 0.9490 / 0.8929
MLoVA     0.1400 / 0.1909                 0.9516 / 0.9185
inter-frame prediction is used in video compression; as a result, the influence of an I-packet loss, or of a P-packet loss appearing early in a GOP, is quite different from that of a B-packet loss or of a P-packet loss appearing later in a GOP. By setting b_1 = b_2 = 1, we did not consider the error attenuation effects during propagation; we denote the corresponding result of Equation 7 as MLoVA_0. It can be seen that introducing the EP model and the complexity-estimation-based EP attenuation weight further improves the prediction accuracy on dataset-2.
Figure 9 Performance evaluation compared with existing metrics in dataset-1 and dataset-2.
6. Conclusion and future work
In this study, the different requirements of two applica-
tion scenarios of a parametric packet-layer model are
discussed. We provide the insight that different criteria
and methods should be used to select processed
sequences for subjective evaluation when setting up the
evaluation dataset. It is shown that the parameters PLR/
BLF/IFR used in existing models are effective for video
system planning modeling, but are not effective for
video quality monitoring applications.
Further, a model is proposed for the video quality monitoring scenario, taking into consideration the interaction between video content features, EC effects, and error propagation effects. It achieves much better performance on both types of datasets, for planning and for monitoring applications. The results also show that, for the encoding configurations given in this article, the packet-layer model, taking packet header information and encoder configuration information as inputs, is able to estimate video quality with enough accuracy for practical use. However, H.264 offers many error-resilience tools (e.g., flexible MB ordering) to combat video quality degradation under transmission losses, and different EC strategies may be employed at a decoder. A packet-layer model must therefore be tailored to the specific video application configuration.
For future study, the influence of the distribution of
visible packet losses on the overall perceived quality
will be studied.
Competing interests
The authors declare that they have no competing interests.
Received: 1 November 2010 Accepted: 22 August 2011
Published: 22 August 2011
References
1. Takahashi A, Hands D, Barriac V: Standardization activities in the ITU for a
QoE assessment of IPTV. IEEE Commun Mag 2008, 46(2):78-84.
2. ITU-T document, Draft terms of reference (ToR) for P.NAMS. 2009
[http://www.itu.int/md/meetingdoc.asp?lang=en&parent=T09-SG12-091103-
TD-GEN-0146].
3. ITU-T document, Draft Terms of Reference (ToR) for P.NBAMS. 2009
[http://www.itu.int/md/T09-SG12-110118-TD-GEN-0521].
4. Rui H, Li C, Qiu S: Evaluation of packet loss impairment on streaming
video. J Zhejiang Univ Sci 2006, A7:131-136.
5. Reibman AR, Vaishampayan VA, Sermadevi Y: Quality monitoring of video
over a packet network. IEEE Trans Multimedia 2004, 6(2):327-334.
6. Yamagishi K, Hayashi T: Video-quality planning model for videophone
services. Inf Media Technol 4(1):1-9.
7. Mohamed S, Rubino G: A study of real-time packet video quality using
random neural networks. IEEE Trans Circ Syst Video Technol 2002,
12(12):1071-1083.
8. Yamagishi K, Hayashi T: Parametric packet-layer model for monitoring
video quality of IPTV services. IEEE International Conference on
Communications 2008, 110-114.
9. Raake A, Garcia M-N, Moller S, Berger J, Kling F, List P, Johann J,
Heidemann C: T-V-model: parameter-based prediction of IPTV quality.
Proc ICASSP 2008, 1149-1152.
10. Kanumuri S, Cosman PC, Reibman AR, Vaishampayan VA: Modeling packet
loss visibility in MPEG-2 video. IEEE Trans Multimedia 2006, 8(2):341-355.
11. Reibman AR, Poole D: Predicting packet-loss visibility using scene
characteristics. Proceedings of the International Workshop in Packet Video
2007, 308-317.
12. Yamada T, Miyamoto Y, Serizawa M: No-reference video quality estimation
based on error-concealment effectiveness. IEEE Packet Video Workshop
2007, 288-293.
13. Suresh N: Mean time between visible artifacts in visual communications.
PhD thesis, Georgia Institute of Technology 2007.
14. Winkler S, Dufaux F: Video quality evaluation for mobile applications. Proc
VCIP 2003, 593-603.
15. Winkler S, Mohandas P: The evolution of video quality measurement:
from PSNR to hybrid metrics. IEEE Trans Broadcast 2008, 54(3):660-668.
16. ITU-T Rec. P.910, Subjective video quality assessment methods for
multimedia applications. Geneva; 2008.
17. Simone FD, Naccari M, Tagliasacchi M, Dufaux F, Tubaro S, Ebrahimi T:
Subjective assessment of H.264/AVC video sequences transmitted over a
noisy channel. Proc International Workshop on Quality of Multimedia
Experience (QoMEx) 2009, 204-209 [http://mmspl.epfl.ch/].
18. ITU-T document, Qualification test plan for P.NAMS. [http://www.itu.int/
md/meetingdoc.asp?lang=en&parent=T09-SG12-091103-TD-GEN-0150],
accessed on October 2009.
Figure 10 Subjective MOS versus predicted MOS by the proposed model on different datasets: (a) performance on dataset-1; (b) performance on dataset-2.
19. ITU-R Rec. BT.500-10, Methodology for the subjective assessment of the
quality of the television pictures. 2000.
20. Clark A: Method and system for viewer quality estimation of packet
video streams. 2009 [http://www.freepatentsonline.com/y2009/0041114.
html], U.S. Patent 2009/0041114A1.
21. Hayashi T, Masuda M, Tominaga T, Yamagishi K: Non-intrusive QoS
monitoring method for realtime telecommunication services. NTT Tech
Rev 2006, 4(4):35-40.
doi:10.1186/1687-5281-2011-5
Cite this article as: Liao and Chen: A packet-layer video quality
assessment model with spatiotemporal complexity estimation. EURASIP
Journal on Image and Video Processing 2011 2011:5.
... In ITU-T Study Group (SG) 12, there is a study [22] on the non-intrusive objective parametric and well-structured QoE assessment models (e.g., G.OMVAS [22], P.NAMS [23] and P.NBAMS [24] as planning, packet-layer and bit stream models, respectively) that can predict the perceptual impact of network impairments on video applications, considering the kind of impairment caused by both transmission and video compression issues [7,25]. The prediction is based on packet header information [26,27] and prior knowledge of the media stream [28]. ...
... The prediction is based on packet header information [26,27] and prior knowledge of the media stream [28]. However, in practice, existing solutions [22][23][24][25][26][27] have not been implemented and validated in wireless multimedia systems, where the mapping of packet/network information into MOS is required. MultiQoE follows the ITU-SG 12 recommendations, defines its specific input video/packet/network parameters, and validates an accuracy parametric video quality estimator solution for multimedia WMNs. ...
... The most popular objective quality inference techniques include PSNR [12], VQM [13], and SSIM [14]. Although attempts to assess coding quality have often focused on estimating the PSNR, the PSNR by itself, does not always correlate well with perceived quality of the HVS [25,28,29]. PSNR can only be computed once the image is received, which is not appropriate for real-time prediction systems [29][30][31]. ...
Article
Full-text available
Wireless Mesh Networks (WMNs) are increasingly deployed to enable thousands of users to share, create, and access live video streaming with different characteristics and content, such as video surveillance and football matches. In this context, there is a need for new mechanisms for assessing the quality level of videos because operators are seeking to control their delivery process and optimize their network resources, while increasing the user’s satisfaction. However, the development of in-service and non-intrusive Quality of Experience assessment schemes for real-time Internet videos with different complexity and motion levels, Group of Picture lengths, and characteristics, remains a significant challenge. To address this issue, this article proposes a non-intrusive parametric real-time video quality estimator, called MultiQoE that correlates wireless networks’ impairments, videos’ characteristics, and users’ perception into a predicted Mean Opinion Score. An instance of MultiQoE was implemented in WMNs and performance evaluation results demonstrate the efficiency and accuracy of MultiQoE in predicting the user’s perception of live video streaming services when compared to subjective, objective, and well-known parametric solutions.
... Este tipo de métodos presentan la desventaja que requieren la señal original y la señal distorsionada para su análisis, esto conlleva a que debe existir una sincronización entre emisor y receptor, lo cual es bastante complejo de lograr si se desea hacer mediciones en tiempo real, y adicionalmente algunos de los algoritmos mencionados arriba son computacionalmente complejos. Por su parte los métodos No Reference se ajustan mejor en entornos de video en tiempo real donde no se posee la señal de referencia, entre los modelos propuestos por la ITU-T se encuentran P.NAMS (Non-intrusive parametric model for the assessment of performance of multimedia streaming) [12] y P.NBAMS (Parametric non-intrusive bit stream assessment of video streaming quality) [13] y en [14] proponen un nuevo modelo donde se estima la complejidad espacio-temporal del contenido del video para calcular la QoE. ...
... Nuestro objetivo es contribuir a la solución del problema, proponiendo un modelo donde se involucren un número determinado de parámetros de QoS y un método No Reference con el fin obtener la medición de la QoE percibida por el usuario para el servicio de IPTV residencial. Este tipo de aproximación tiene varias ventajas, una de ellas es que no se requiere que el video se decodifique completamente para cada video bitstream y requiere menor ancho de banda para su monitoreo [26], otra es que este tipo de modelos se ajustan mejor en entornos en tiempo real, donde usualmente no se tiene la señal de video original como referencia [14]. ...
... CONCLUSIONES Y TRABAJO FUTURO El modelo a proponer utiliza un enfoque diferente a los planteados por los autores consultados, ya que nuestro modelo contribuirá a la solución del problema utilizando un número de parámetros de QoS superior permitiendo al proveedor de servicios obtener una visión más aproximada de QoE percibida por el usuario y sin tener como base pruebas datos de pruebas subjetivas realizadas a los usuarios. Se está en el proceso de definir qué algoritmo No Reference utilizar [12], [13] o [14], para empezar o obtener de los datos para la generación del modelo. Esta parte es la más crítica del proyecto, ya que después de elegir el algoritmo No Reference, este se tiene que integrar en el escenario de simulación en OPNET Modeler con el fin de analizar su comportamiento ante variaciones de los parámetros de QoS. ...
... A avaliação objetiva de qualidade de vídeo digital com métricas sem referência repre- senta um desafio, uma vez que esta deve ser realizada sem o auxílio de qualquer informação oriunda da fonte. Este assunto, embora incipiente na literatura, desperta grande interesse tanto industrial quanto acadêmico, pois essas métricas são adequadas para aplicações do mundo real, no qual os usuários tipicamente assistem a conteúdos audiovisuais transmitidos pela Internet ou por radiodifusão BOVIK, 2006;HEYNDERICKX, 2009;LIU et al., 2010;CHOI;LEE, 2011;KEIMEL et al., 2011aKEIMEL et al., , 2011bLIU et al., 2011;LIAO;CHEN, 2011;DOERMANN, 2012 definido segundo uma equação que incorpora as características espaço-temporais, ponderadas por parâmetros (coeficientes) otimizados durante a fase de treinamento. Além disso, outra abordagem é apresentada com o emprego de uma RNA e o algoritmo de aprendizado ELM. ...
... A avaliação objetiva de qualidade de vídeo digital com métricas sem referência repre- senta um desafio, uma vez que esta deve ser realizada sem o auxílio de qualquer informação oriunda da fonte. Este assunto, embora incipiente na literatura, desperta grande interesse tanto industrial quanto acadêmico, pois essas métricas são adequadas para aplicações do mundo real, no qual os usuários tipicamente assistem a conteúdos audiovisuais transmitidos pela Internet ou por radiodifusão BOVIK, 2006;HEYNDERICKX, 2009;LIU et al., 2010;CHOI;LEE, 2011;KEIMEL et al., 2011aKEIMEL et al., , 2011bLIU et al., 2011;LIAO;CHEN, 2011;DOERMANN, 2012 definido segundo uma equação que incorpora as características espaço-temporais, ponderadas por parâmetros (coeficientes) otimizados durante a fase de treinamento. Além disso, outra abordagem é apresentada com o emprego de uma RNA e o algoritmo de aprendizado ELM. ...
Thesis
Full-text available
O desenvolvimento de métodos sem referência para avaliação de qualidade de vídeo é um assunto incipiente na literatura e desafiador, no sentido de que os resultados obtidos pelo método proposto devem apresentar a melhor correlação possível com a percepção do Sistema Visual Humano. Esta tese apresenta três propostas para avaliação objetiva de qualidade de vídeo sem referência baseadas em características espaço-temporais. A primeira abordagem segue um modelo analítico sigmoidal com solução de mínimos quadrados que usa o método Levenberg-Marquardt e a segunda e terceira abordagens utilizam uma rede neural artificial Single-Hidden Layer Feedforward Neural Network com aprendizado baseado no algoritmo Extreme Learning Machine. Além disso, foi desenvolvida uma versão estendida desse algoritmo que busca os melhores parâmetros da rede neural artificial de forma iterativa, segundo um simples critério de parada, cujo objetivo é aumentar a correlação entre os escores objetivos e subjetivos. Os resultados experimentais, que usam técnicas de validação cruzada, indicam que os escores dos métodos propostos apresentam alta correlação com as escores do Sistema Visual Humano. Logo, eles são adequados para o monitoramento de qualidade de vídeo em sistemas de radiodifusão e em redes IP, bem como podem ser implementados em dispositivos como decodificadores, ultrabooks, tablets, smartphones e em equipamentos Wireless Display (WiDi).
... However, full-reference MSE does not perfectly match with perceptual quality, let alone the no-reference estimated MSE. The extent of visible distortion, as opposed to MSE, caused by packet loss was directly estimated in[19]–[21]. It was shown that the estimation of overall visible distortion can significantly improve the quality estimate accuracy. ...
... It was shown that the estimation of overall visible distortion can significantly improve the quality estimate accuracy. However,[19]and[20]are for packet-layer models, and[21], although for bitstream-layer model, only considered the slicing artifacts and assumed the compression artifacts were almost invisible in IPTV services. Freezing artifact is usually regarded as temporal quality degradation in the literature, and referred to as motion jerkiness and jitter in[22]–[28]. ...
... The decoding of B frames depends on the successful decoding of I and P frames. As noted in [26] , an I-frame packet loss results in a visible artifact duration of GoP length or even longer. The artifact duration caused by P frame packet loss is smaller, although it can be significant in specific cases. ...
Article
The major disadvantage of the Enhanced Distributed Channel Access (EDCA, the contention-based channel access function of 802.11e) is that it is unable to guarantee priority access to higher priority traffic in the presence of significant traffic loads from low priority users. This problem is enhanced by the continuously growing number of multimedia applications and the popularity of Wireless Local Area Networks (WLANs). Hence, solutions in scheduling multimedia traffic transmissions need to take into account both the Quality of Service (QoS) requirements and the Quality of Experience (QoE) associated with each application, especially those of urgent traffic, like telemedicine, which carries critical information regarding the patients’ condition. In this work, we propose an easy-to-implement token-based and self policing-based scheduling scheme combined with a mechanism designed to mitigate congestion. Our approach is shown to guarantee priority access to telemedicine traffic, to satisfy its QoS requirements (delay, packet dropping) and to offer high telemedicine video QoE while preventing bursty video nodes from over-using the medium.
... There are proposed several reference free [10], [11] video quality evaluation methods, but they are not yet standardized and have own shortcomings. ...
Article
In this paper a simple and robust method for estimation of distorted video quality, which is perceived by human observer in mobile video streaming applications, is proposed and assessed. Increasing bandwidth of mobile communication systems expand the variety of offered multimedia services such as video streaming. However, the quality of these services is very dependent on rapidly varying mobile communication conditions. Most widely used video quality estimation methods, such as Peak Signal to Noise Ratio (PSNR), Structural Similarity (SSIM), and Video Quality Metric (VQM) are based on the presence of full or reduced reference video. These methods could be used to assess video quality of video transmission system only during test stage and in the limited number of scenarios. In order to assess user experienced video quality in real conditions, methods with no reference must be employed. Such existing methods as video quality metric use bit-error rate that has low correlation with human perceived video quality. More precise methods usually are too complex and require too much processing power that cannot be tolerated in handheld mobile devices. In this paper it is shown that developed no reference low complexity video quality estimation method based on H.264/AVC video stream packet structure delivers estimate of received video quality comparable with results of subjective MOS tests.
... For instance, slicing degradations are more visible in the case of panning or complex movements than in the case of almost static-content. In the case of an encrypted stream, this content complexity may be captured by the frame sizes and frame types, as proposed in [64,43,42,103,104,101]. In particular, in the case of coding degradations, the I-frame sizes are used [43,42], reflecting the observation that high Iframe sizes indicate low content complexity at low bitrates. ...
Chapter
Full-text available
This chapter addresses QoE in the context of video streaming services. Both reliable and unreliable transport mechanisms are covered. An overview of video quality models is provided for each case, with a focus on standardized models. The degradations typically occurring in video streaming services, and which should be covered by the models, are also described. In addition, the chapter presents the results of various studies conducted to fill the gap between the existing video quality models and the estimation of QoE in the context of video streaming services. These studies include work on audiovisual quality modeling, field testing, and on the user impact. The chapter finishes with a discussion on the open issues related to QoE.
... This type of an approximation has various advantages: the complete decodification of the video for each video bitstream is not necessary; it requires a lower bandwidth for monitoring purposes [10] and these types of models are better suited for real time environments where the original video signal is not available to use as a reference [11]. Our proposal will be the first one to use a No Reference method to measure QoE in IPTV service as other authors [8] propose models using a subjective method to measure the QoE. ...
Conference Paper
Full-text available
This article presents advances in research to obtain a hybrid model which allows the quality of experience within the residential IPTV service to be measured. The contents include a description of background information about objective and hybrid methods, the problem that needs to be resolved and the methodology to be used during the project. Our hybrid model will use quality of service parameters and a NR (No Reference) algorithm to evaluate the quality of video. This approach is not based on the results of subjective tests put forward to users previously. I. INTRODUCTION The current growth in IPTV service leads us to forecast that by 2016 this service will represent 88% of all global Internet traffic [1]. As a result, at some point in the future, this situation could lead to congested networks, degradation in the level of service provided, and therefore, end user dissatisfaction. Based on the situation above, it is very important for service providers to be able to know the quality of experience (QoE) associated with the services offered. This knowledge allows for constant monitoring, thereby avoiding degradation in the level of service and subsequent user dissatisfaction. To measure QoE, the authors in [2] propose the following three methods: subjective methods, objective methods and hybrid methods. These will be discussed below. In subjective methods QoE is measured by means of surveys answered by a group of users. These surveys are generally conducted in a controlled manner following the guidelines proposed in [3][4][5][6][7]. MOS (Mean Opinion Score) is used as the scale of measurement. In objective methods there are two ways to classify the different algorithms or forms of measuring the quality of video. One of them depends on whether it is necessary or not to utilize a reference signal to measure QoE. The other depends on the type of analysis realized for the video stream. 
For measurements that use a reference signal, there are three categories [8]. Full Reference (FR): in this method, the original video signal that was transmitted, available at the receiver, is compared with the received signal to determine the quality of the video delivered to the user.
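The full-reference comparison described above is typically computed with a distortion measure such as PSNR. A minimal sketch in Python; the frame sizes and pixel values are illustrative only, not taken from the cited work:

```python
import numpy as np

def psnr(reference: np.ndarray, received: np.ndarray, max_val: float = 255.0) -> float:
    """Full-reference quality score: higher PSNR means the received
    frame is closer to the transmitted original."""
    mse = np.mean((reference.astype(np.float64) - received.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

# A uniform error of 10 gray levels gives MSE = 100, i.e. PSNR ~ 28.13 dB.
ref = np.full((16, 16), 128, dtype=np.uint8)
rec = np.full((16, 16), 138, dtype=np.uint8)
score = psnr(ref, rec)
```

In practice FR scores are computed per frame and pooled over the sequence; perceptual FR metrics weight errors by visibility rather than averaging them uniformly.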
Conference Paper
Owing to its lightweight measurement and lack of access to the media signal, the packet-layer video quality assessment model is highly preferable for, and widely utilized in, non-intrusive, in-service network applications. In this paper, a novel packet-layer model is proposed to monitor the video quality of Internet protocol television (IPTV). Apart from predicting the coding distortion caused by compression, the model highlights a novel loss-related scheme to predict the transmission distortion introduced by packet loss, based on the structural decomposition of the video sequence, the development of a temporal sensitivity function (TSF) simulating human visual perception, and the scalable incorporation of content characteristics. Experimental results demonstrate the performance improvement over existing models in cross-validation on various databases.
Article
Multimedia transmission over Mobile Ad-hoc Networks (MANETs) is crucial to many applications. However, MANETs pose several challenges, including transmitting large packets, minimizing delay, tolerating loss, and estimating buffer size. For effective multimedia transmission, delay should be minimized and packets should be received in the defined order. Existing standards such as 802.11b and 802.11e perform well in wireless networks but respond poorly to multimedia traffic in MANETs, especially in multi-hop networks. In this paper, we first establish the dependency of delay on buffer size and packet size, and then present a delay-optimization approach for multimedia traffic in MANETs. We use a knapsack algorithm for buffer management to simultaneously maximize in-order packets and minimize out-of-order packets. Our approach exploits the buffer internals and dynamically adjusts buffer usage so that a node transmits packets in the desired order to its successive nodes. Careful estimation of packet size and buffer size helps minimize delay, improves the capability of receiving packets in the correct order, and reduces out-of-order packets in the buffer at intermediate nodes. Our approach also limits the loss of multimedia data packets during transmission. We validate our approach with real-world examples using a network simulator.
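The knapsack-based buffer management described above can be illustrated with a standard 0/1 dynamic program over candidate packets; the packet sizes, values, and buffer capacity below are hypothetical, not taken from the paper:

```python
def select_packets(packets, capacity):
    """0/1 knapsack over (size, value) packets: keep the subset that
    maximizes total value (e.g. in-order packets) within buffer capacity."""
    dp = [0] * (capacity + 1)                      # dp[c] = best value at capacity c
    keep = [[False] * (capacity + 1) for _ in packets]
    for i, (size, value) in enumerate(packets):
        for c in range(capacity, size - 1, -1):    # iterate downward: 0/1, not unbounded
            if dp[c - size] + value > dp[c]:
                dp[c] = dp[c - size] + value
                keep[i][c] = True
    # Backtrack the chosen packet indices.
    chosen, c = [], capacity
    for i in range(len(packets) - 1, -1, -1):
        if keep[i][c]:
            chosen.append(i)
            c -= packets[i][0]
    return dp[capacity], sorted(chosen)

# Hypothetical packets as (size_in_buffer_cells, in-order value) pairs.
PACKETS = [(3, 4), (2, 3), (4, 5), (1, 1)]
best, chosen = select_packets(PACKETS, capacity=6)
```

A real scheduler would assign higher value to packets that restore sequence order and re-run the selection as the buffer state changes.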
Article
Full-text available
We consider the problem of predicting packet loss visibility in MPEG-2 video. We use two modeling approaches: CART and GLM. The former classifies each packet loss as visible or not; the latter predicts the probability that a packet loss is visible. For each modeling approach, we develop three methods, which differ in the amount of information available to them. A reduced-reference method has access to limited information based on the video at the encoder's side and has access to the video at the decoder's side. A no-reference pixel-based method has access to the video at the decoder's side but lacks access to information at the encoder's side. A no-reference bitstream-based method does not have access to the decoded video either; it has access only to the compressed video bitstream, potentially affected by packet losses. We design our models using the results of a subjective test based on 1080 packet losses in 72 minutes of video. Index Terms: packet-loss visibility, perceptual quality metrics
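The GLM approach above amounts to logistic regression on per-loss features. A minimal sketch with made-up coefficients and features (real models are fitted to subjective visibility data, and the feature set in the paper is richer):

```python
import math

# Hypothetical GLM coefficients, for illustration only: visibility
# probability from loss duration (frames) and mean motion magnitude.
INTERCEPT, B_DURATION, B_MOTION = -3.0, 0.15, 0.8

def visibility_probability(loss_duration_frames: float, motion: float) -> float:
    """Logistic GLM: P(visible) = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2)))."""
    z = INTERCEPT + B_DURATION * loss_duration_frames + B_MOTION * motion
    return 1.0 / (1.0 + math.exp(-z))

p_low = visibility_probability(2, 0.5)    # short loss, low motion
p_high = visibility_probability(30, 2.0)  # long loss, high motion
```

The CART variant would instead threshold the same features in a decision tree to output a hard visible/invisible label.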
Conference Paper
Full-text available
In this paper we describe a database containing subjective assessment scores relative to 78 video streams encoded with H.264/AVC and corrupted by simulating the transmission over error-prone network. The data has been collected from 40 subjects at the premises of two academic institutions. Our goal is to provide a balanced and comprehensive database to enable reproducible research results in the field of video quality assessment. In order to support research works on full-reference, reduced-reference and no-reference video quality assessment algorithms, both the uncompressed files and the H.264/AVC bitstreams of each video sequence have been made publicly available for the research community, together with the subjective results of the performed evaluations.
Conference Paper
Full-text available
This paper presents the results of a quality evaluation of video sequences encoded for and transmitted over a wireless channel. We selected content, codecs, bitrates and bit-error patterns representative of mobile applications, focusing on the MPEG-4 and Motion JPEG2000 coding standards. We carried out subjective experiments using the Single Stimulus Continuous Quality Evaluation (SSCQE) method on this test material. We analyze the subjective data and use them to compare codec performance as well as the effects of transmission errors on visual quality. Finally, we use the subjective ratings to validate the prediction performance of a real-time no-reference quality metric.
Article
This article describes a non-intrusive QoS (quality of service) monitoring method for real-time telecommunication services. The key point of this method is that the invalid packet ratio and the invalid frame ratio are introduced to represent the factors affecting speech and video quality, respectively. Moreover, we can estimate multimodal quality by considering the individual qualities and their interaction. The results of subjective tests showed that the estimation accuracy of these factors was sufficient for practical use. This method enables us to manage QoS on a call-by-call basis for interconnections among multiple network service providers with a variety of terminals and applications.
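The two factors introduced above can be computed directly from packet and frame counts observed at a monitoring point; the counts below are illustrative only:

```python
def invalid_ratios(total_packets, lost_packets, total_frames, impaired_frames):
    """Non-intrusive QoS factors: the fraction of packets that never
    arrived (or arrived too late to be used) and the fraction of frames
    those losses render invalid."""
    invalid_packet_ratio = lost_packets / total_packets
    invalid_frame_ratio = impaired_frames / total_frames
    return invalid_packet_ratio, invalid_frame_ratio

# Hypothetical counts over one monitored call.
ipr, ifr = invalid_ratios(total_packets=10000, lost_packets=50,
                          total_frames=1500, impaired_frames=90)
```

Note that the frame ratio is usually larger than the packet ratio: a single lost packet can invalidate a whole frame, and error propagation can extend the damage to following frames.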
Article
Videophone services over IP (Internet protocol) will become key services in the next-generation network (NGN). To provide a high-quality service for users, designing and managing the quality of experience (QoE) appropriately is extremely important. To do this, it is desirable to develop an objective quality assessment method that estimates subjective quality by using quality parameters of the videophone system. We propose a parametric planning model for assessing video quality affected by coding and packet loss. The results indicated that the quality estimation accuracy was sufficient for practical use, because the estimation errors of our model were equivalent to the statistical reliability of subjective assessment. Therefore, our model could be applied to the effective design of video quality for videophone applications and networks. As an example of using our model, we show video-quality planning for videophone services.
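Parametric planning models of this kind typically combine a bitrate-dependent coding-quality term with an exponential degradation under packet loss. The functional form and coefficients below are an illustrative assumption, not the authors' fitted model:

```python
import math

# Illustrative coefficients only; real parametric models fit these to
# subjective test data per codec, resolution, and loss-concealment type.
MOS_MAX, A_CODING, B_CODING, C_LOSS = 4.5, 3.6, 650.0, 12.0

def planning_mos(bitrate_kbps: float, packet_loss_pct: float) -> float:
    # Coding quality saturates toward MOS_MAX as bitrate grows ...
    coding_q = MOS_MAX - A_CODING * math.exp(-bitrate_kbps / B_CODING)
    # ... and packet loss degrades it exponentially toward MOS 1.
    return 1.0 + (coding_q - 1.0) * math.exp(-packet_loss_pct / C_LOSS)

mos_clean = planning_mos(2000.0, 0.0)   # no loss
mos_lossy = planning_mos(2000.0, 5.0)   # 5% packet loss
```

For planning, such a closed form lets an operator invert the model: pick a target MOS and solve for the bitrate and tolerable loss rate the network must sustain.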
Article
Video compression technologies are essential in video streaming applications because they save a great amount of network resources. However, compressed videos are also extremely sensitive to packet loss, which is inevitable in today's best-effort IP networks. We therefore consider accurate evaluation of packet-loss impairment on compressed video to be very important. In this work, we develop an analytic model to describe these impairments without reference to the original video (NR) and propose an impairment metric based on the model, which takes into account both impairment length and impairment strength. To evaluate an impaired frame or video, we design a detection-and-evaluation (DE) algorithm to compute the above metric value. The DE algorithm has low computational complexity and is currently being implemented in the real-time monitoring module of our HDTV-over-IP system. The impairment metric and the DE algorithm could also be used in adaptive systems or to compare different error concealment strategies.
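A sequence-level score combining impairment length and strength might be pooled as follows. This is a hypothetical simplification, not the paper's actual metric; event tuples and the frame count are made up:

```python
def sequence_impairment(events, num_frames):
    """Pool (length_in_frames, strength in [0, 1]) impairment events into
    one score: longer and stronger impairments dominate the result."""
    if not events:
        return 0.0
    total = sum(length * strength for length, strength in events)
    return total / num_frames  # normalize by clip length

# Two hypothetical impairment events in a 300-frame clip.
score = sequence_impairment([(5, 0.8), (2, 0.3)], num_frames=300)
```

A detection stage (the DE algorithm's role) would supply the event list by locating impaired regions and estimating their strength from the decoded frames.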
Conference Paper
The paper presents a parameter-based model for predicting the perceived quality of transmitted video for IPTV applications. The core model we derived can be applied both to service monitoring and to network or service planning. In its current form, the model covers H.264 and MPEG-2 coded video (standard and high definition) transmitted over IP links. The model includes factors such as the coding bitrate, the packet loss percentage, and the type of packet-loss handling used by the codec. The paper provides an overview of the model, of its integration into a multimedia model predicting audio-visual quality, and of its application to service monitoring. A performance analysis is presented, showing a high correlation with the results of different subjective video quality perception tests. An outlook highlights future model extensions.
Article
As digital communication of television content becomes more pervasive, and as networks supporting such communication become increasingly diverse, the long-standing problem of assessing video quality by objective measurements becomes particularly important. Content owners as well as content distributors stand to benefit from rapid objective measurements that correlate well with subjective assessments and, further, do not depend on the availability of the original reference video. This thesis investigates different techniques of subjective and objective video evaluation. Our research recommends a functional quality metric called Mean Time Between Failures (MTBF), where failure refers to video artifacts deemed to be perceptually noticeable, and investigates objective measurements that correlate well with subjective evaluations of MTBF. Work has been done to determine the usefulness of some existing objective metrics by noting their correlation with MTBF. The research also includes experimentation with network-induced artifacts and a study of statistical methods for correlating candidate objective measurements with the subjective metric. The statistical significance and spread properties of the correlations are studied, and a comparison of subjective MTBF with the existing subjective measure of MOS is performed. These results suggest that MTBF has a direct and predictable relationship with MOS, and that the two have similar variations across different viewers. The research is particularly concerned with the development of new no-reference objective metrics that are easy to compute in real time and that correlate better than current metrics with the intuitively appealing MTBF measure. The approach to obtaining greater subjective relevance has included the study of better spatial-temporal models for noise masking and test-data pooling in video perception.
A new objective metric, the 'Automatic Video Quality' (AVQ) metric, is described and shown to run in real time with a high degree of correlation with actual subjective scores, the correlation values approaching those of metrics that use full or partial reference. This metric does not need any reference to the original video and, when used on displayed MPEG-2 streams, calculates and indicates the video quality in terms of MTBF. Certain diagnostics, such as the amount of compression and network artifacts, are also shown.
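Once failures have been detected, the MTBF measure itself is straightforward to compute; the clip duration and failure timestamps below are illustrative:

```python
def mtbf(failure_times_s, duration_s):
    """Mean Time Between Failures over a clip: the observation time
    divided by the number of perceptually noticeable artifacts detected."""
    if not failure_times_s:
        return float("inf")  # no visible failure during the clip
    return duration_s / len(failure_times_s)

# Four hypothetical visible artifacts in a 120-second clip.
value = mtbf([12.5, 40.0, 41.2, 90.3], duration_s=120.0)
```

The appeal of the measure is its intuitive unit: "one visible artifact every 30 seconds" is meaningful to operators in a way an abstract quality score is not.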
Conference Paper
IPTV services will become key services in the next-generation network (NGN). To provide a high-quality service for users, designing and managing the quality of experience (QoE) appropriately is extremely important. To do this, developing an objective quality-assessment method that estimates subjective quality based on physical characteristics of the IPTV system is desirable. We propose a parametric packet-layer model for monitoring video quality of IPTV services. Our proposed model is useful as a network monitoring tool for assessing several video parameters that affect the quality of IPTV services. For constructing the parametric packet-layer model, we derived a relationship between video quality and quality parameters from a subjective quality assessment. The results indicated that cross-correlation was larger than 0.9, and the evaluation error was smaller than the statistical uncertainty of the value of subjective quality. Therefore, our proposed model could be applied to effective design, implementation, and management of IPTV services.
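Validation figures like the cross-correlation above 0.9 reported here are Pearson correlations between predicted and subjective scores. A minimal sketch with hypothetical MOS pairs:

```python
import math

def pearson(x, y):
    """Pearson correlation between predicted and subjective scores,
    the standard validation statistic for packet-layer models."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical (predicted MOS, subjective MOS) pairs for five conditions.
r = pearson([4.1, 3.5, 2.8, 2.0, 1.5], [4.3, 3.4, 2.9, 2.2, 1.4])
```

Evaluations usually report RMSE alongside correlation, since a model can rank conditions correctly (high r) while still being biased in absolute MOS.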