The Evolution of Video Quality Measurement: From PSNR to Hybrid Metrics
Symmetricom, San Jose, CA
Journal Article: IEEE Transactions on Broadcasting (impact factor: 2.24). 10/2008; DOI: 10.1109/TBC.2008.2000733
Stefan Winkler and Praveen Mohandas
Abstract—This paper reviews the evolution of video quality
measurement techniques and their current state of the art. We
start with subjective experiments and then discuss the various
types of objective metrics and their uses. We also introduce V-
Factor, a “hybrid” metric using both transport- and bitstream
information. Finally, we summarize the main standardization
activities, such as the work of the Video Quality Experts Group
(VQEG), and we take a look at emerging trends in quality
measurement, including image preference, visual attention, and
audiovisual quality.
I. INTRODUCTION
QUALITY of experience (QoE) has become a term commonly
used to describe the application- and user-oriented quality of
video and multimedia services. QoE actually encompasses many
different aspects – video quality is
just one of them, arguably one of the most important [1].
Unfortunately, quality in this context is a rather ill-defined
concept – we list just some of the numerous factors contribut-
ing to QoE here [2]–[4]:
• Individual interests of the viewer, such as favorite pro-
grams, which determine the level and focus of attention;
• Quality expectations of the viewer, for example a feature
film screened in a cinema vs. a short clip watched on a
mobile device;
• Video experience of the viewer, which also determines
quality expectations (once you have seen high-definition
content it’s hard to go back);
• Display type (CRT, LCD, etc.) and properties (size,
resolution, brightness, contrast, color, response time);
• Viewing setup and conditions, such as viewing distance
or ambient/exterior light;
• Quality and synchronization of the accompanying audio;
• Interaction with the service or display device (e.g. zap
time, remote control, electronic program guide).
As the wide variety and subjectivity of some of these factors
indicate, the measurement (and ultimately optimization) of the
quality of digital video systems is a highly complex problem.
Most of today’s quality metrics only account for a small
subset of the factors listed above and focus on measuring
the visual fidelity of the video in terms of the distortions
introduced by various processing steps (mainly compression
and transmission). Even if we constrain ourselves to this more
well-defined problem space, two challenging issues remain:
• Video systems are complex and consist of many components,
including capture and display hardware, converters,
multiplexers, codecs, streamers, routers, switches. All
of them process the video in some way, which can
potentially affect its quality.
• Visual perception is even more complex. If we are to measure
quality in a meaningful way, we need to understand
how people perceive video and its quality.
These two issues and the metrics addressing them are the
focus of this review.
S. Winkler and P. Mohandas are with Symmetricom, San Jose, CA 95131.
e-mail: swinkler@symmetricom.com
Manuscript received November 27, 2007; revised April 9, 2008.
The paper is organized as follows. Section II briefly intro-
duces subjective quality assessment, which forms the bench-
mark for objective metrics. Section III discusses objective
quality metrics, various classifications and some specific im-
plementations. Section IV introduces V-Factor as an example
of a hybrid metric. Section V reviews standardization activities
related to video quality. Section VI takes a look at some recent
trends in quality measurement, and Section VII concludes the
paper.
II. SUBJECTIVE QUALITY ASSESSMENT
Subjective experiments are the reference for multimedia quality;
they represent the most accurate method for
obtaining quality ratings. In subjective experiments, a number
of “subjects” (typically 15-30) are asked to watch a set of
video clips and rate their quality. The average rating over all
viewers for a given clip is also known as the Mean Opinion
Score (MOS).
Since each individual has different interests and expecta-
tions for video, the subjectivity and variability of the viewer
ratings cannot be completely eliminated. Subjective exper-
iments attempt to minimize these factors through precise
instructions, training and controlled environments. Yet it is
important to remember that a quality score is a noisy mea-
surement that is defined by a statistical distribution rather than
an exact number.
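The MOS and its statistical spread can be sketched as follows. This is a minimal illustration, not a procedure taken from the ITU recommendations: the ratings are hypothetical, the function name `mean_opinion_score` is ours, and the confidence interval uses a normal approximation (z = 1.96) where a Student-t quantile would be more appropriate for small panels.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.mean(ratings)
    # Sample standard deviation across viewers.
    sd = statistics.stdev(ratings)
    # Normal approximation; assumes ratings are roughly independent.
    half_width = 1.96 * sd / math.sqrt(n)
    return mos, half_width

# Hypothetical ratings from 15 viewers on a 1-5 scale for one clip.
ratings = [4, 3, 4, 5, 3, 4, 4, 2, 4, 3, 5, 4, 3, 4, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval together with the mean makes it explicit that the score is a noisy measurement rather than an exact number.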
There is a wide variety of subjective testing methods.
Psychophysics provides the tools for measuring the perceptual
performance of subjects [5], beginning with visibility thresholds
and just-noticeable differences (JNDs), which are most
suitable for small impairments. The ITU has formalized direct
scaling methods in various recommendations [6]–[8], which
are often used in practice for larger quality ranges. They
suggest standard viewing conditions, criteria for the selection
of observers and test material, assessment procedures, and data
analysis methods. Recommended testing procedures include
implicit comparisons such as Double Stimulus Continuous
Quality Scale (DSCQS), explicit comparisons such as Double
Stimulus Impairment Scale (DSIS), or absolute ratings such
as Single Stimulus Continuous Quality Evaluation (SSCQE)
or Absolute Category Rating (ACR). The procedure used for
a given experiment is generally selected as a function of the
application, the quality range, and the viewer tasks. More
details on subjective testing can be found in [9], for example.
Subjective experiments are invaluable tools for assessing
multimedia quality. Their main shortcoming is the requirement
for a large number of viewers, which limits the amount of
video material that can be rated in a reasonable amount of
time; they are neither intended nor practical for 24/7 in-service
monitoring applications. Nonetheless, subjective experiments
remain the benchmark for any objective quality metric.
III. OBJECTIVE QUALITY METRICS
Objective quality metrics are algorithms designed to charac-
terize the quality of video and predict viewer MOS. Different
types of objective metrics exist [10]. For the analysis of
decoded video, we can distinguish data metrics, which mea-
sure the fidelity of the signal without considering its content,
and picture metrics, which treat the video data as the visual
information that it contains. For compressed video delivery
over packet networks, there are also packet- or bitstream-based
metrics, which look at the packet header information and the
encoded bitstream directly without fully decoding the video.
Furthermore, metrics can be classified into full-reference, no-
reference and reduced-reference metrics based on the amount
of reference information they require. These classifications are
discussed next.
A. Data Metrics
The image and video processing community has long been
using mean squared error (MSE) and peak signal-to-noise ratio
(PSNR) as fidelity metrics (mathematically, PSNR is just a
logarithmic representation of MSE). There are a number of
reasons for the popularity of these two metrics. The formulas
for computing them are as simple to understand and implement
as they are easy and fast to compute. Minimizing MSE is also
very well understood from a mathematical point of view. Over
the years, video researchers have developed a familiarity with
PSNR that allows them to interpret the values immediately.
There is probably no other metric as widely recognized as
PSNR, which is also due to the lack of alternative standards
(cf. Section V).
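To make the relationship between the two metrics concrete, here is a minimal sketch of MSE and PSNR for 8-bit samples, with PSNR = 10 log10(peak^2 / MSE). The pixel values are hypothetical and the images are flattened to lists for brevity. The example also shows that moving the same errors to different pixels leaves the PSNR unchanged.

```python
import math

def mse(ref, test):
    """Mean squared error between two equally long sample sequences."""
    if len(ref) != len(test) or not ref:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)

def psnr(ref, test, peak=255):
    """PSNR in dB; peak is the maximum sample value (255 for 8-bit data)."""
    err = mse(ref, test)
    if err == 0:
        return math.inf  # identical signals
    return 10 * math.log10(peak ** 2 / err)

# Hypothetical 8-pixel "image": a flat region followed by a busy region.
ref = [120, 121, 119, 120, 200, 10, 220, 30]
err = [8, -8, 8, -8, 0, 0, 0, 0]
# Same error values, placed in the flat region (a) or the busy region (b).
test_a = [r + e for r, e in zip(ref, err)]
test_b = [r + e for r, e in zip(ref, reversed(err))]
print(psnr(ref, test_a), psnr(ref, test_b))
```

Both calls return the same value because the computation only sees the error magnitudes, not where they occur.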
Despite its popularity, PSNR only has an approximate rela-
tionship with the video quality perceived by human observers,
simply because it is based on a byte-by-byte comparison of the
data without considering what they actually represent. PSNR
is completely ignorant of things as basic as pixels and their
spatial relationship, or things as complex as the interpretation
of images and image differences by the human visual system.
Let’s look at the example shown in Figure 1. Both images
have the same PSNR, yet their perceived quality is very
different – it is hard to see anything wrong with Figure 1(a),
whereas the distortions are quite obvious in Figure 1(b). There
are two main reasons for this discrepancy, both of which are
closely linked to the way the human visual system processes
information:
• Data metrics are distortion-agnostic. Distortions may be
more or less apparent to the viewer depending on their
type and properties. The human visual system is not
sensitive to the high-frequency noise inserted into the
left image. The noise in the right image is a well-
localized, lower-frequency noise, whose pattern is much
more apparent.
• Data metrics are content-agnostic. Viewer perception
varies based on the part of the image or video where the
distortion occurs. The noise in the left image is contained
almost exclusively in the bottom region of the image,
where we already have a lot of image activity from the
content itself (edges, texture from the rocks and sea). The
image activity masks the distortion in this region. The
noise in the right image is contained in a region devoid of
content activity (the smooth sky). Because little masking
is present there, distortions stand out immediately.¹
Fig. 1, panels (a) and (b). Illustration of the influence of impairment type and image content on
the visibility of distortions (see text for details). Both images have identical
PSNR, yet their perceived quality is very different.
Using MSE and various modifications as a basis, a number
of additional data metrics have been proposed and evaluated
[11]. Although some of these metrics can predict subjective
ratings quite successfully for a given compression technique,
distortion type or scene content, they are not reliable for
evaluations across techniques. MSE was found to be an
accurate metric for additive noise, but it is outperformed by
vision-based quality metrics for coding artifacts [12].
The network quality of service (QoS) community has
equally simple metrics to quantify transmission errors, such
as bit error rate (BER) or packet loss rate (PLR). Again,
these are relevant for data links, where every bit and packet is
equally important, but not for video delivery. The reasons for
their popularity are similar to those given for PSNR above.
Problems arise when relating these measures to perceived
quality; they were designed to characterize data fidelity, but
again they do not take into account the content, i.e. the
meaning and thus the visual importance of the packets and
bits concerned. The same number of lost packets can have
drastically different effects on the video content, depending
on which parts of the bitstream are affected.
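The last point can be illustrated with a deliberately simplified sketch that counts how many frames of a group of pictures (GOP) are touched by a single loss. The assumptions are ours: a closed GOP in display order that starts with an I-frame, B-frames that are never used as references, and no error concealment. The GOP pattern and the function name `frames_affected` are hypothetical.

```python
def frames_affected(gop, lost_index):
    """Number of frames in the GOP impaired by a loss in frame lost_index.

    gop is a string of frame types in display order, e.g. "IBBPBBPBBPBB".
    """
    frame_type = gop[lost_index]
    if frame_type == "B":
        # Non-reference frame: the error does not propagate.
        return 1
    if frame_type == "I":
        # Every other frame in the GOP depends on the I-frame.
        return len(gop)
    # P-frame: the B-frames since the previous anchor use it as a
    # reference, and so do all later frames in the GOP.
    prev_anchor = max(
        (i for i in range(lost_index) if gop[i] in "IP"), default=-1
    )
    return len(gop) - prev_anchor - 1

gop = "IBBPBBPBBPBB"
for index in (0, 1, 3, 9):
    print(gop[index], index, frames_affected(gop, index))
```

Under these assumptions the packet loss rate is identical in all four cases, yet the number of impaired frames ranges from one to the entire GOP.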
¹ This is not only a spatial phenomenon; masking also occurs with high
temporal activity, such as high-motion scenes or scene cuts.
B. Picture Metrics
Due to the problems with simple data metrics outlined
above, much effort has been spent on designing better visual
quality metrics that specifically account for the effects
of distortions and content on perceived quality. The approaches
in metric design can be classified in two groups, namely a
vision modeling approach and an engineering approach [13].
The vision modeling approach, as the name implies, is
based on modeling various components of the human visual
system (HVS). HVS-based metrics try to incorporate aspects
of human vision deemed relevant to picture quality, such
as color perception, contrast sensitivity and pattern masking,
using models and data from psychophysical experiments [14].
Due to their generality, these metrics can in principle be used
for a wide variety of video distortions. HVS-based metrics date
back to the 1970s and 1980s, when Mannos and Sakrison
[15] and Lukas and Budrikis [16] developed the first image
and video quality metrics. Later well-known metrics in this
category are the Visual Differences Predictor (VDP) by Daly
[17], the Sarnoff JND (just noticeable differences) metric by
Lubin [18], van den Branden Lambrecht’s Moving Picture
Quality Metric (MPQM) [19], and the author’s own perceptual
distortion metric (PDM) [20].
The engineering approach on the other hand is based
primarily on the extraction and analysis of certain features or
artifacts in the video. These can be either structural image
elements such as contours, or specific distortions that are
introduced by a particular video processing step, compression
technology or transmission link, such as block artifacts. The
metrics look at how pronounced these features are in the
video to estimate overall quality. This does not necessarily
mean that such metrics disregard human vision, as they often
consider psychophysical effects as well, but image content and
distortion analysis rather than fundamental vision modeling is
the conceptual basis for their design.
The engineering approach has gained popularity in recent
years. The author’s own metrics [21] look for specific spatial
and temporal artifacts in the video, such as blockiness, blur
or jerkiness, which are then combined into an overall quality
prediction. Wang et al.’s Structural Similarity (SSIM) index
[22] computes the mean, variance and covariance of small
patches inside a frame and combines the measurements into
a distortion map. Motion estimation is used for a weighting
of the SSIM index of each frame in a video. Pinson and
Wolf’s VQM video quality metric [23] divides sequences into
spatio-temporal blocks, and a number of features measuring
the amount and orientation of activity in each of these blocks
are computed from the spatial luminance gradient. The features
extracted from test and reference videos are then compared
using a process similar to masking.
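As an illustration of the patch statistics mentioned for SSIM, the following sketch evaluates the SSIM expression on a single patch given as a flat list of samples. It is a simplification: the published index uses a weighted sliding window and pools a map over the frame, whereas this sketch uses unweighted statistics on one patch. The constants follow the commonly used choice C1 = (0.01 L)^2 and C2 = (0.03 L)^2 with L the peak value; the sample values are hypothetical.

```python
def ssim_patch(x, y, peak=255.0):
    """SSIM of two equally sized patches given as flat lists of samples."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Unbiased variance and covariance estimates over the patch.
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    # Small constants that stabilize the ratio for flat, dark patches.
    c1 = (0.01 * peak) ** 2
    c2 = (0.03 * peak) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

patch = [52, 55, 61, 66, 70, 61, 64, 73]
flattened = [sum(patch) / len(patch)] * len(patch)  # all structure removed
print(ssim_patch(patch, patch), ssim_patch(patch, flattened))
```

An undistorted patch scores 1, while the flattened patch scores lower even though its mean is unchanged, because the variance and covariance terms capture the loss of structure.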
C. Packet- and Bitstream-based Metrics
While a lot of effort in video quality measurement has
been devoted to evaluating compression artifacts from decoded
“base-band” video, there is also a growing interest in quality
metrics specifically designed to measure the impact of network
losses on video quality. This development is the result of in-
creasing video service delivery over IP networks, for example
Internet streaming or IPTV.
Because losses directly affect the encoded bitstream, such
metrics are often based on parameters that can be extracted
from the transport stream and the bitstream with no or little
decoding. This has the added advantage of much lower data
rates and thus lower bandwidth and processing requirements
compared to metrics looking at the fully decoded video. Using
such metrics, it is thus possible to measure the quality of
many video streams/channels in parallel. At the same time,
these metrics have to be adapted to specific codecs and
network protocols. “Hybrid” metrics use a combination of
packet information, bitstream or even decoded video as input.
Figure 2 illustrates the different classes of metrics and their
inputs.
Fig. 2. Classification of packet-based, bitstream-based, picture and hybrid
metrics (adapted from ITU-T). Diagram labels: packet headers, bitstream and
video signal as inputs; packet-based, bitstream-based, picture and hybrid
metrics as classes.
Examples of packet- and bitstream-based metrics include
the work of Verscheure et al. [24], who investigated the joint
impact of packet loss rate and MPEG-2 bitrate on video quality,
and that of Kanumuri et al. [25], [26], who used various bitstream
parameters such as motion vector length or number of slice
losses to predict the visibility of packet losses in MPEG-2 and
H.264 video. V-Factor, the metric introduced in Section IV
below, also belongs in this category.
D. Reference Information
Quality metrics are generally classified into full-reference,
no-reference and reduced-reference categories based on the
amount of information required about the reference video [13].
Full-reference (FR) metrics perform a frame-by-frame com-
parison between a reference video and the test video. They
require the entire reference video to be available, usually
in unimpaired and uncompressed form, which is quite a
heavy restriction on the practical usability of such metrics.
Furthermore, full-reference metrics generally impose a precise
spatial and temporal alignment of the two videos, so that
every pixel in every frame can be matched with its counterpart
in the other clip. Temporal registration in particular is quite
a strong restriction and can be very difficult to achieve in
practice, because of frame drops, repeats, or variable delay
introduced by the system under test. Aside from the issue of
spatio-temporal alignment, full-reference metrics usually do
not respond well to global shifts in brightness, contrast or
color, and require a corresponding calibration of the videos.
MSE/PSNR and HVS-based metrics typically belong to this
class.
No-reference (NR) metrics analyze only the test video,
without the need for an explicit reference clip. This makes
them much more flexible than FR metrics, as it can be difficult
or impossible to get access to the reference in some cases
(e.g. video coming out of a camera). They are also completely
free from alignment issues. The main challenge of NR metrics
lies in telling apart distortions from content, a distinction
humans are usually able to make from experience. NR metrics
always have to make assumptions about the video content
and/or the distortions of interest. With this comes the risk
of confusing actual content with distortions (as an example,
a chessboard could be interpreted as block artifacts under
certain conditions). The majority of NR metrics are based
on estimating blockiness [27], which is the most prominent
artifact of block-DCT based compression methods such as
H.26x, MPEG and their derivatives.
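A toy blockiness estimate shows both the idea and the pitfall described above. This sketch is not the method of [27]: it simply compares the average horizontal pixel step across assumed 8-pixel block boundaries with the average step elsewhere. The block size, the function name `blockiness` and the test images are assumptions made for illustration.

```python
def blockiness(image, block=8):
    """Ratio of boundary to interior horizontal activity (about 1 = no blocks).

    image is a list of rows of grayscale samples; only vertical block
    edges on a fixed grid aligned with column 0 are examined.
    """
    boundary, interior = [], []
    for row in image:
        for x in range(1, len(row)):
            step = abs(row[x] - row[x - 1])
            (boundary if x % block == 0 else interior).append(step)
    if not boundary or not interior:
        raise ValueError("image too narrow for the chosen block size")
    mean_boundary = sum(boundary) / len(boundary)
    mean_interior = sum(interior) / len(interior)
    # Adding 1 to both terms avoids division by zero in flat regions.
    return (mean_boundary + 1) / (mean_interior + 1)

ramp = [list(range(32)) for _ in range(4)]
chessboard = [[255 * ((x // 8) % 2) for x in range(32)] for _ in range(4)]
print(blockiness(ramp), blockiness(chessboard))
```

The smooth ramp scores 1, whereas the chessboard, which is legitimate content, scores as heavily blocky. This is exactly the confusion between content and distortion that NR metrics have to guard against.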
Reduced-reference (RR) metrics are a compromise between
FR and NR metrics. They extract a number of features from
the reference and/or test video, and the comparison of the two
clips is then based only on those features. Examples of features
are the amount of motion or spatial detail. This approach
makes it possible to avoid some of the assumptions and pitfalls
of pure no-reference metrics while keeping the amount of
reference information manageable. Reduced-reference metrics
also have alignment requirements, but they are typically less
stringent than for full-reference metrics, as only the extracted
features need to be aligned.
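The reduced-reference principle can be sketched with two simple per-frame features, one for spatial detail and one for motion. The feature definitions, function names and frames below are hypothetical and far cruder than those used in practical RR metrics. The point is only that the comparison operates on a few numbers per frame instead of on the pixels.

```python
def spatial_detail(frame):
    """Mean absolute horizontal pixel difference within one frame."""
    steps = [abs(row[x] - row[x - 1]) for row in frame for x in range(1, len(row))]
    return sum(steps) / len(steps)

def motion(prev, cur):
    """Mean absolute pixel difference between two consecutive frames."""
    diffs = [abs(a - b) for r0, r1 in zip(prev, cur) for a, b in zip(r0, r1)]
    return sum(diffs) / len(diffs)

def extract_features(frames):
    """One (detail, motion) pair per frame; motion is 0 for the first frame."""
    features = []
    for i, frame in enumerate(frames):
        moved = motion(frames[i - 1], frame) if i > 0 else 0.0
        features.append((spatial_detail(frame), moved))
    return features

def feature_distance(ref_features, test_features):
    """Mean absolute feature difference over temporally aligned frames."""
    total = 0.0
    for (d0, m0), (d1, m1) in zip(ref_features, test_features):
        total += abs(d0 - d1) + abs(m0 - m1)
    return total / len(ref_features)

reference = [[[0, 100, 0, 100]] * 2, [[100, 0, 100, 0]] * 2]
blurred = [[[50, 50, 50, 50]] * 2, [[50, 50, 50, 50]] * 2]
ref_features = extract_features(reference)
print(feature_distance(ref_features, extract_features(blurred)))
```

Only `ref_features`, two numbers per frame, would have to be sent to the measurement point, which is why frame-level rather than pixel-level alignment suffices.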
These three classes of metrics also have different operational
uses. FR metrics are most suitable for offline video quality
measurement such as codec tuning or lab testing, where
conditions can be well controlled, and where a detailed and
precise analysis of the video is more important than immediate
results. NR and RR metrics are better suited for monitoring
of in-service video systems, where real-time measurement and
alarm triggering are essential. RR metrics still require a back-
channel and access to the reference at some point.
IV. V-FACTOR
We now introduce a real-time video quality metric that uses
the transport stream and the bitstream as input. The method
does not require a reference and works at the packet level.
It combines network impairments with information obtained
from the video stream. The algorithm described here focuses
mainly on MPEG-2 and H.264 video streaming over IP
networks, but it can be adapted to other codecs and other
applications such as video conferencing as well.
A compressed video stream can be viewed as a sequence
of packets that are carrying video and audio information
along with data. De-multiplexing such streams is required in
order to identify the packets that carry video information.
As an example, assessing video packet loss from IP losses
directly will provide inaccurate measurements for an MPEG-
2 transport stream due to the fact that a given IP packet may
not contain any video data.
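A minimal sketch of this de-multiplexing step for an MPEG-2 transport stream carried over UDP: each 188-byte TS packet starts with the sync byte 0x47, and its 13-bit packet identifier (PID) tells which elementary stream it belongs to. The video PID 0x100 and the audio PID 0x101 are assumptions for the example (in a real stream they are signalled in the program tables), and `make_ts_packet` is a hypothetical helper that builds dummy packets.

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47
NULL_PID = 0x1FFF  # stuffing packets

def ts_pids(udp_payload):
    """Return the PID of every TS packet found in one UDP payload."""
    pids = []
    last_start = len(udp_payload) - TS_PACKET_SIZE
    for offset in range(0, last_start + 1, TS_PACKET_SIZE):
        packet = udp_payload[offset:offset + TS_PACKET_SIZE]
        if packet[0] != SYNC_BYTE:
            continue  # not aligned on a TS packet; skip it
        # PID: low 5 bits of byte 1 followed by all 8 bits of byte 2.
        pids.append(((packet[1] & 0x1F) << 8) | packet[2])
    return pids

def video_packets_in(udp_payload, video_pid):
    """Number of TS packets in the payload that carry the video PID."""
    return sum(1 for pid in ts_pids(udp_payload) if pid == video_pid)

def make_ts_packet(pid):
    """Hypothetical helper: a TS packet with an all-zero payload."""
    header = bytes([SYNC_BYTE, (pid >> 8) & 0x1F, pid & 0xFF, 0x10])
    return header + bytes(TS_PACKET_SIZE - len(header))

VIDEO_PID, AUDIO_PID = 0x100, 0x101  # assumed values for this example
mixed = b"".join(make_ts_packet(p) for p in
                 [VIDEO_PID] * 4 + [AUDIO_PID] * 2 + [NULL_PID])
no_video = b"".join(make_ts_packet(p) for p in [AUDIO_PID] * 6 + [NULL_PID])
print(video_packets_in(mixed, VIDEO_PID), video_packets_in(no_video, VIDEO_PID))
```

Losing the first UDP datagram costs four video TS packets, while losing the second costs none, although both count as one IP packet loss.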
The V-Factor² metric is based on deep packet inspection of
the video stream (see Figure 3). It analyzes the bitstream in
real time to collect static parameters such as picture size and
frame rate as well as dynamic parameters such as the variation
of quantization steps. Video quality prediction by the metric
is based on the following:
• The impact of video impairments due to the content char-
acteristics, the compression mechanism and bandwidth
constraints.
• The impact of network impairments such as jitter, delay
and packet loss on the video, including spatial and
temporal loss propagation.
Fig. 3. V-Factor inspects different sections of the video stream, namely
the transport stream (TS) headers, the packetized elementary stream (PES)
headers, and the video coding layer (VCL), in addition to the decoded video
signal. Diagram labels: TS headers, PES headers, video coding layer (VCL),
decoded video; loss & jitter analysis, timing analysis, VCL analysis,
picture metrics; V-Factor.
The underlying model used for the objective measurement
of video impairments is based on a paper by Verscheure et
al. [24], who proposed models for the impact of packet loss
rate, MPEG-2 quantizer scale and data rate on quality using the
moving picture quality metric (MPQM). We have generalized
these models to state-of-the-art codecs such as H.264, and
further enhanced them to take into account the complexity of
the video content. Network impairments are also analyzed
in real time in order to provide the model with packet loss
probability ratio (single loss, bursty loss) through a series of
hidden Markov models. The models were optimized for real-
time multi-channel assessment of video quality.
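The distinction between single and bursty losses can be sketched with plain run-length statistics over a per-packet loss trace. This is a simplification chosen for illustration and not the hidden Markov models referred to above; the trace, the function names and the dictionary keys are hypothetical.

```python
from itertools import groupby

def loss_runs(lost):
    """Lengths of consecutive-loss runs in a 0/1 trace (1 = packet lost)."""
    return [len(list(group)) for is_lost, group in groupby(lost) if is_lost]

def loss_profile(lost):
    """Fraction of packets lost in isolation and in bursts of two or more."""
    runs = loss_runs(lost)
    bursts = [run for run in runs if run > 1]
    return {
        "single_loss_ratio": sum(1 for run in runs if run == 1) / len(lost),
        "burst_loss_ratio": sum(bursts) / len(lost),
        "mean_burst_length": sum(bursts) / len(bursts) if bursts else 0.0,
    }

trace = [0, 0, 1, 0, 0, 1, 1, 1, 0, 0]
profile = loss_profile(trace)
print(profile)
```

Separating the two ratios matters because a burst of the same total size tends to hit a contiguous part of the bitstream, which a single overall loss rate cannot express.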
A. Bitstream Analysis
MQUANT (the quantizer scale on a macroblock basis in an
MPEG-2 video stream) provides a first approximation of how
video compression affects video quality. MQUANT was shown
to exhibit an approximately linear relationship to quality for
MPEG-2 clips [24].
We account for spatial and temporal image coding com-
plexity and the impact of packet loss on the spatial and
temporal content at the coding layer. Video quality without
any network impairments is influenced by video coding layer
(VCL) complexity. The VCL complexity is modeled using
quantizer values, motion vector information and intra/inter-
predicted frame/slice ratios.
² Parts of this technology are patent-pending.
As an example, videos with a lot of scene changes would
have a very high VCL complexity. The scene changes can
be detected in different ways: a Scene Information Message,
which labels pictures with scene identifiers; an instantaneous
decoding refresh, where all slices are intra-coded; or
intra-period changes resulting from I-slice insertion.
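One of these cues, the instantaneous decoding refresh (IDR), can be spotted without decoding any picture data. The sketch below scans an H.264 byte stream in Annex B format for start codes and reads the 5-bit `nal_unit_type` from the first byte of each NAL unit (1 = non-IDR slice, 5 = IDR slice, 6 = SEI, 7 = sequence parameter set, 8 = picture parameter set). The example stream is hand-made with dummy payload bytes, and the function names are ours.

```python
START_CODE = b"\x00\x00\x01"
IDR_SLICE = 5

def nal_unit_types(stream):
    """Return the nal_unit_type of each NAL unit in an Annex B byte stream."""
    types = []
    position = 0
    while True:
        position = stream.find(START_CODE, position)
        if position < 0 or position + 3 >= len(stream):
            break
        # NAL header byte: 1 forbidden bit, 2 bits ref_idc, 5 bits type.
        types.append(stream[position + 3] & 0x1F)
        position += 3
    return types

def idr_positions(stream):
    """Indices (in NAL unit order) of IDR slices."""
    return [i for i, t in enumerate(nal_unit_types(stream)) if t == IDR_SLICE]

# Dummy stream: SPS, PPS, IDR slice, non-IDR slice (payloads are fake).
stream = (b"\x00\x00\x00\x01\x67\xaa\xbb"
          b"\x00\x00\x00\x01\x68\xcc"
          b"\x00\x00\x01\x65\xdd\xee"
          b"\x00\x00\x01\x41\xff")
print(nal_unit_types(stream), idr_positions(stream))
```

Searching for the three-byte pattern also covers four-byte start codes, since those end in the same three bytes.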
Figure 4 depicts the video coding layer complexity analysis
for H.264 streams. The model performs bandwidth and band-
width variation analysis as well as an inspection of slices and
macroblocks in order to analyze the variation of the quantizer,
combined with a loss model.
Fig. 4. H.264 video coding layer (VCL) complexity model. The model
performs bandwidth and bandwidth variation analysis as well as an inspection
of slices and macroblocks, combined with a loss model. Diagram labels:
video coding layer (VCL); VCL bandwidth monitor, VCL slice / MB monitor,
slice / MB packet loss monitor; bandwidth model, complexity model, loss
model; codec-specific curve fit; V-Factor.
The VCL input is read from the Network Abstraction
Layer (NAL)³ or transport layer. The VCL packet size is
used to compute instantaneous and average bandwidth. The
bandwidth model is constructed using a 3-state (bandwidth
low/average/high) Markov model.
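A 3-state bandwidth model of this kind could be built along the following lines. This is a sketch under our own assumptions, not the V-Factor implementation: the state boundaries (within 20% of the average counts as "average"), the per-interval byte counts and the function names are all hypothetical.

```python
LOW, AVERAGE, HIGH = 0, 1, 2

def bandwidth_states(bytes_per_interval, tolerance=0.2):
    """Map each interval's VCL byte count to a low/average/high state."""
    average = sum(bytes_per_interval) / len(bytes_per_interval)
    states = []
    for size in bytes_per_interval:
        if size < average * (1 - tolerance):
            states.append(LOW)
        elif size > average * (1 + tolerance):
            states.append(HIGH)
        else:
            states.append(AVERAGE)
    return states

def transition_matrix(states, n_states=3):
    """Row-normalized counts of transitions between consecutive states."""
    counts = [[0] * n_states for _ in range(n_states)]
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state that was never left gets an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Hypothetical VCL bytes observed in eight consecutive intervals.
sizes = [1000, 1000, 400, 1000, 1800, 1000, 1000, 850]
states = bandwidth_states(sizes)
matrix = transition_matrix(states)
print(states)
print(matrix)
```

The matrix summarizes bandwidth variation compactly: large off-diagonal entries indicate a stream that frequently swings between rate regimes.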
For every macroblock, the VCL complexity model is run.
Macroblock and slice quantization parameters are read from
the NAL/transport layer by parsing the slice data inside the
VCL. A quantization transition probability matrix is computed for
the VCL complexity model, and the limiting state probabilities are
derived from it. VCL parameters are also monitored for scene transitions
and picture quality. Inter/intra macroblock types are analyzed
to determine scene transitions and quantization parameter.
Computing VCL complexity also follows a Markov process
similar to the one for bandwidth, but limited to two states.
The transition probabilities are derived from counters that
are incremented each time a certain macroblock/slice type is
detected.
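How counters lead to transition probabilities and limiting state probabilities can be sketched for a two-state chain. The counter values and the meaning of the two states are hypothetical, and every state is assumed to have been observed at least once. For two states the limiting distribution has the closed form pi_0 = p10 / (p01 + p10) and pi_1 = p01 / (p01 + p10), which the sketch cross-checks by repeated multiplication with the transition matrix.

```python
def transition_probs(counts):
    """counts[i][j] = number of observed transitions from state i to j."""
    probs = []
    for row in counts:
        total = sum(row)  # assumed non-zero: each state was visited
        probs.append([c / total for c in row])
    return probs

def limiting_two_state(probs):
    """Closed-form limiting probabilities of a two-state Markov chain."""
    p01 = probs[0][1]
    p10 = probs[1][0]
    return (p10 / (p01 + p10), p01 / (p01 + p10))

def limiting_by_iteration(probs, steps=200):
    """Numerical check: propagate a uniform start distribution."""
    n = len(probs)
    dist = [1.0 / n] * n
    for _ in range(steps):
        dist = [sum(dist[i] * probs[i][j] for i in range(n)) for j in range(n)]
    return dist

# Hypothetical counters, e.g. state 0 = low and state 1 = high complexity.
counts = [[80, 20], [40, 60]]
probs = transition_probs(counts)
pi = limiting_two_state(probs)
print(pi, limiting_by_iteration(probs))
```

The limiting probabilities give the long-run share of time spent in each state, independent of the state in which monitoring happened to start.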
The visual impact of packet losses on the video content is
expressed by combining the video complexity model with a
loss model. It includes an analysis of the part of the stream that
is lost and its impact on video quality. This allows the model
to distinguish between losses involving intra-/inter-predicted
macroblocks, I-slices, B-slices, P-slices, high motion, scene
cuts, etc.
The overall V-Factor value, which represents a MOS esti-
mate, is computed by a codec-specific curve fit equation using
inputs from the bandwidth model, the VCL complexity model
and the loss model.
³ The Network Abstraction Layer (NAL) is an intermediate layer between
the video coding layer and the transport layer. It was introduced in H.264 to
allow for more flexible packaging of the elementary streams.
B. Network Losses
Losses can occur either due to IP packet loss in the network,
or due to de-jitter buffer under-/over-flow. In parallel to the
VCL analysis described above, the content of each packet is
inspected in order to determine if the packet contains part
of a reference frame or slice, or a predicted frame, slice or
macroblock. This analysis is again codec-specific and produces
a first statistical model of the distribution of I, B, and P (or
SI and SP) frames/slices/macroblocks. Using counters from
the complexity analysis, a second statistical model of the
distribution of the quantizer values is produced that leads to
a combined model of the bandwidth and bandwidth variation
for a video stream. Furthermore, inter/intra macroblocks and
motion vectors are analyzed; if high motion loss is detected,
the loss factor is updated accordingly.
By tracking the inserted time stamps (depending on the en-
capsulation such as MPEG-2 Transport Stream or RTP) as well
as the time stamp carried by some packets, and comparing the
difference with a de-jitter buffer, we can produce a jitter model
that is used to assess the packet loss probability due to high
jitter. The system then assesses whether the computed loss
probability will affect a reference or non-reference frame/slice.
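The jitter comparison can be sketched as follows. We assume, for illustration only, that playout starts one buffer depth after the first packet arrives, so a packet is unusable when its delay relative to the first packet exceeds that depth; buffer overflow is not modeled. Timestamps and arrival times are hypothetical values in milliseconds.

```python
def late_packets(timestamps, arrivals, buffer_depth):
    """Indices of packets that arrive too late for a de-jitter buffer.

    timestamps: media time stamps carried in the stream
    arrivals:   local arrival times, same unit as the time stamps
    """
    # Transit offset of the first packet serves as the reference.
    base = arrivals[0] - timestamps[0]
    late = []
    for index, (stamp, arrival) in enumerate(zip(timestamps, arrivals)):
        delay_variation = (arrival - stamp) - base
        if delay_variation > buffer_depth:
            late.append(index)
    return late

timestamps = [0, 20, 40, 60, 80]
arrivals = [5, 26, 95, 66, 85]
late = late_packets(timestamps, arrivals, buffer_depth=30)
jitter_loss_ratio = len(late) / len(timestamps)
print(late, jitter_loss_ratio)
```

The indices of the late packets can then be mapped to the frame or slice type they carry, in the same way as for packets lost in the network.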
C. Encryption
When the video stream is encrypted, as is often the case in
commercial video distribution networks, the VCL Raw Byte
Sequence Payload (RBSP) segments are not decodable. This
imposes a severe limitation on computing video quality, both
for traditional metrics and for hybrid metrics, as the impact
of losses and loss propagation at the VCL layer cannot be
measured directly.
A possible solution to this problem is to perform monitoring
before encryption (e.g. at the video head-end) as well as
downstream where the video is encrypted. Video timing infor-
mation is obtained either from the Program Clock Reference
(PCR) or the Presentation/Decode Time Stamps (PTS/DTS)
from both the head-end and downstream locations; alterna-
tively, GPS/NTP time stamping for correlating head-end and
downstream information can be used. This timing information
along with VCL information from before encryption and loss
event/distribution information from downstream can be used
in a reduced-reference manner for correlating the effects of
IP packet loss on the quality of the video content even in an
encrypted environment.
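The correlation step can be sketched as a nearest-timestamp lookup. The record layout is hypothetical: we assume the head-end probe logs a (PTS, slice type) pair per picture and that each downstream loss event has been associated with a PTS value. PTS is a 33-bit counter of a 90 kHz clock, so distances are computed modulo 2^33; the tolerance of 1500 ticks is an arbitrary choice for the example.

```python
PTS_WRAP = 1 << 33  # PTS counts a 90 kHz clock in 33 bits

def pts_distance(a, b):
    """Shortest distance between two PTS values, allowing for wrap-around."""
    d = abs(a - b) % PTS_WRAP
    return min(d, PTS_WRAP - d)

def correlate_losses(head_end, loss_pts, tolerance=1500):
    """Slice type recorded at the head-end for each downstream loss event.

    head_end: list of (pts, slice_type) logged before encryption
    loss_pts: PTS values associated with downstream loss events
    Returns None where no head-end record lies within the tolerance.
    """
    matched = []
    for lost in loss_pts:
        pts, slice_type = min(head_end, key=lambda rec: pts_distance(rec[0], lost))
        matched.append(slice_type if pts_distance(pts, lost) <= tolerance else None)
    return matched

# Hypothetical 25 fps stream: 90000 / 25 = 3600 ticks per picture.
head_end = [(0, "I"), (3600, "B"), (7200, "B"), (10800, "P")]
print(correlate_losses(head_end, [3590, 10810, 50000]))
```

With such a mapping, the downstream monitor can weight each loss by the type of slice it hit without ever decrypting the payload.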
D. Results
Some V-Factor measurements are shown in Figure 5 to
demonstrate how the method combines transport and video
stream information to compute quality. This particular ex-
ample highlights how different losses and loss types (I-, P-
or B-slices) have a different impact on quality predictions, in
addition to the dependence on video content and complexity
characteristics such as the quantizer scale (MQUANT) and the
number of scene cuts during loss periods.
