Fine-grained subjective visual quality assessment for
high-fidelity compressed images
Michela Testolina,1 Mohsen Jenadeleh,1 Shima Mohammadi, Shaolin Su,
João Ascenso, Touradj Ebrahimi, Jon Sneyers, and Dietmar Saupe
Multimedia Signal Processing Group, EPFL, Lausanne, Switzerland
{michela.testolina,touradj.ebrahimi}@epfl.ch
Multimedia Signal Processing Group, Universität Konstanz, Konstanz, Germany
{mohsen.jenadeleh,shaolin.su,dietmar.saupe}@uni-konstanz.de
Instituto Superior Técnico, Lisbon, Portugal
{shima.mohammadi,joao.ascenso}@lx.it.pt
Media Technology Research Group, Cloudinary, Belgium, jon@cloudinary.com
1These authors contributed equally to this work.
Abstract
Advances in image compression, storage, and display technologies have made high-quality
images and videos widely accessible. At this level of quality, distinguishing between com-
pressed and original content becomes difficult, highlighting the need for assessment method-
ologies that are sensitive to even the smallest visual quality differences. Conventional sub-
jective visual quality assessments often use absolute category rating scales, ranging from
“excellent” to “bad”. While suitable for evaluating more pronounced distortions, these
scales are inadequate for detecting subtle visual differences. The JPEG standardization
project AIC is currently developing a subjective image quality assessment methodology for
high-fidelity images. This paper presents the proposed assessment methods, a dataset of
high-quality compressed images, and their corresponding crowdsourced visual quality rat-
ings. It also outlines a data analysis approach that reconstructs quality scale values in just
noticeable difference (JND) units. The assessment method uses boosting techniques on vi-
sual stimuli to help observers detect compression artifacts more clearly. This is followed by
a rescaling process that adjusts the boosted quality values back to the original perceptual
scale. This reconstruction yields a fine-grained, high-precision quality scale in JND units,
providing more informative results for practical applications. The dataset and code to re-
produce the results will be available at https://github.com/jpeg-aic/dataset-BTC-PTC-24.
Introduction
The rapid advancements in image coding technologies and increasing demands for
high-quality visual content have created a need for more sophisticated visual quality
evaluation methodologies. Traditional subjective quality assessment techniques, like
those presented in ITU-R Recommendation BT.500 [1] and reviewed in Part 1 of
the JPEG AIC standard [2], are often effective for evaluating images with low and
medium visual quality. However, when compared to quality scale reconstruction
from pair comparisons, they lack precision [3], and they fall short when used
to evaluate the visual quality of high-fidelity content, which requires distinguishing
images with subtle variations in visual quality [4]. For these reasons, the JPEG
Committee launched a new activity in 2021, known as JPEG AIC-3 [5], with the goal
of fine-grained quality assessment of high-fidelity compressed images.
This paper presents and evaluates the anticipated JPEG AIC-3 subjective quality
assessment methodology. In order to better distinguish and rank small-scale com-
pression artifacts, two techniques are adopted. The first is artifact boosting, which
emphasizes small differences between compressed and source images. The second is
a triplet comparison in which observers compare two such compressed and boosted
versions of the same source, which is also displayed. From the collected responses,
which identify the stimulus with the stronger perceived distortion, a scaling procedure
quantifies the perceived impairments on a linear scale expressed in just noticeable
difference (JND) units. The scaling model assumes a 75% probability of correctly
identifying the stimulus whose distortion is 1 JND unit stronger than that of the
stimulus it is compared with. Since the goal is to estimate the perceived
distortion of the original compressed images rather than the boosted ones, a rescaling
procedure is applied. For this purpose, a limited number of triplet comparisons are
carried out for a subset of the original compressed images, followed by a nonlinear
regression procedure that aligns the two sets of reconstructed scale values.
The proposed subjective quality assessment method was applied in a large-scale
crowdsourcing campaign using five source images compressed with five recent and
legacy compression methods yielding 250 decoded images (stimuli). About 440,000
triplet question responses were collected. The triplet responses, together with valuable
auxiliary information such as subject IDs and response times, the resulting quality
scale values, and the source code will be made publicly available.
Related work
Subjective image quality assessment has been widely studied and standardized by
organizations such as the International Telecommunication Union, with key recom-
mendations found in ITU-T P.910 [6] and ITU-R BT.500 [1]. The two most commonly
used methods for full-reference image quality assessment (IQA) are absolute category
rating with hidden reference (ACR-HR) and double stimulus continuous quality scale
(DSCQS). In ACR-HR, observers rate the quality of the source and the distorted
image separately, usually on a five-category scale ranging from “excellent” to “bad”.
In DSCQS, observers are shown both the reference and the test images side by side
or sequentially and are asked to rate the quality of each image on a continuous scale.
The difference of the two mean opinion scores gives the result, the differential mean
opinion score (DMOS).
The ISO/IEC 29170-2 standard, also known as JPEG AIC-2 [7], defines a flicker
test to classify compressed images as either visually lossless or visually lossy. The
boundary between lossy and lossless, as defined in this standard, can be understood
as thresholding at the JND in the flicker test condition. Methods are also available
to identify the distortion level or encoding parameter that produces a compressed
stimulus at the JND threshold. In lab studies of [8] and [9], comparisons between
sources and distorted images or videos are used with binary search algorithms to
estimate the JND thresholds per observer. Crowdsourced JND assessments using a
slider-based interactive selection of the JND with the flicker test have shown promise
for scalable subjective assessments of image quality [10]. However, all of the above
JND-based procedures can only distinguish between two stimuli if one of the encoded
images is visually lossless (i.e., below 1 JND) and the other is not.
The flicker test can be considered a boosting method that enhances observers’
sensitivity in visual quality assessment, particularly in the high-quality range.
Recent work has proposed zooming and artifact amplification as additional elements
of boosting [11]. Finally, the triple stimulus boosted pairwise comparison (TSBPC),
proposed in [12], follows a similar procedure and was adopted to create a large-scale
dataset for medium to visually lossless quality image compression. Elements from
these methods have been incorporated into the work presented here.
Proposed JPEG AIC-3 test methodologies
Boosted triplet comparison (BTC)
This methodology adopts boosting techniques, inspired by the work of [11], to allow
easier identification of subtle artifacts. The following boosting techniques are
considered:
Zooming: Images are first cropped to half the size in both dimensions and then
upscaled to the initial size using Lanczos resampling. Alternatively, the size of
the original images can be chosen to be small enough to allow zooming without
cropping.
Artifact amplification: The pixel-wise differences between the original and dis-
torted stimuli are linearly scaled in the three color channels separately with an
amplification factor of 2 (see the sketch after this list).
Flicker: The test images are temporally interleaved with the reference image
at a change rate of 10 Hz, i.e., in each cycle, the original image is shown for
100 ms, followed by the distorted image for 100 ms.
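For illustration, the following is a minimal sketch of the zooming and artifact
amplification steps, assuming Pillow and NumPy; the function name, the central crop
choice, and the 8-bit clipping are our assumptions, not the AIC-3 reference
implementation.

```python
# A minimal sketch of the zooming and artifact-amplification boosting steps,
# assuming Pillow and NumPy; names and the central crop are illustrative.
import numpy as np
from PIL import Image

AMPLIFICATION = 2.0  # amplification factor stated in the text

def boost(source: Image.Image, distorted: Image.Image) -> Image.Image:
    w, h = source.size
    # Zooming: crop to half the size in both dimensions, then upscale back
    # to the initial size with Lanczos resampling.
    box = (w // 4, h // 4, w // 4 + w // 2, h // 4 + h // 2)
    src = np.asarray(source.crop(box).resize((w, h), Image.LANCZOS), np.float32)
    dst = np.asarray(distorted.crop(box).resize((w, h), Image.LANCZOS), np.float32)
    # Artifact amplification: linearly scale the pixel-wise difference
    # in each color channel by the amplification factor.
    boosted = src + AMPLIFICATION * (dst - src)
    return Image.fromarray(np.clip(boosted, 0, 255).astype(np.uint8))
```

In the test interface, the flicker is then realized by temporally interleaving such
boosted stimuli with the source at the 10 Hz change rate described above.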
Triplets are denoted as (I_i, I_0, I_k), where I_i and I_k are two compressed versions of
the source image I_0. The two test stimuli I_i and I_k are displayed side-by-side, each
alternating with the source to create a flicker on both sides (Figure 1a). For each
such triplet, observers are asked to identify the stimulus with the strongest flicker
effect, giving an answer “Left”, “Right”, or “Not Sure”. In [13], it was shown that
providing an indecision response option reduces mental load while maintaining the
homogeneity of psychometric functions. Triplets are shown for 8 seconds and then
blanked for 3 seconds. Anytime during the 11 seconds, subjects can select their
answer, after which the next triplet is shown. The sequence of triplet questions is
randomized and different for each observer.
Plain triplet comparison (PTC)
PTC differs from BTC in a few key aspects. Most importantly, the decoded images
are left untouched and used in place of the boosted versions. Source images are shown
in-place with the test images, but the flicker is replaced by a toggle button that the
Figure 1: Sketch of the interface of the (a) BTC and (b) PTC experiments.
observer holds to switch between the decoded and the source image. A label indicates
whether the currently displayed images are distorted or the source. Before submitting
their answers, subjects are required to toggle at least once to make sure the original
image is shown. The time window for answering is 30 seconds, and the maximal toggle
frequency is limited to 2 Hz. With the in-place presentation of PTC, differences
between a compressed image and the source image appear at the same locations on the
display, thereby reducing the eye movement and short-term memory needed for their
detection compared to side-by-side presentation. Still, the quality scale derived
from PTC uses the plain decoded images and is regarded as the one that the
methodology must ultimately estimate.
Experimental setup
Test material
Five images from the JPEG AIC-3 dataset were selected to represent a diverse range
of image types and content and cropped to a size of 620 × 800 pixels, as presented in
Figure 2. These images had been compressed with five codecs, JPEG, JPEG 2000,
VVC Intra, JPEG XL, and AVIF at 10 bitrates each, corresponding approximately to
JND values equally spaced from 0.25 to 2.5, as determined by a pairwise comparison
experiment [14].
For the BTC experiment, all 10 distortion levels were considered, while for PTC,
only distortion levels numbered 0, 2, 4, 6, 8, and 10 were considered, where 0 indicates
the best visual quality (the source image) and level 10 corresponds to compressed
images with the strongest artifacts having a perceptual distance from the source of
approximately 2.5 JND.
Triplet comparisons and generation of batches
Four types of triplet questions were used in the experiment:
Same-codec comparisons: Triplet questions comparing decoded images with
different distortion levels but generated by the same codec.
Figure 2: Crops of the reference images adopted in the experiment: (a) SRC 00002,
(b) SRC 00006, (c) SRC 00007, (d) SRC 00009, (e) SRC 00010.
Cross-codec comparisons: Triplet questions comparing decoded images gen-
erated by two different codecs, used to help align the scales between images
compressed by different codecs.
Bias-checking comparisons: Forced choice experiments are often biased since the
temporal or spatial ordering of the alternatives for the response has a significant
influence on the result. The most direct test to check for such an ordering bias
in our experiment is by same-codec triplet questions that compare two identical
decoded images.
Trap questions: Triplets where one side (left or right) contains a decoded image
with the strongest distortion level 10, while the other side displays the original
image (level 0). The quality difference between the two images in these trap
triplets is intentionally clear, and they are used to identify unreliable workers
and determine whether to accept or reject assignments in the crowdsourcing
study.
All possible same-codec comparisons were included, i.e., both presentation orders of
all pairs among the 11 BTC stimuli and the 6 PTC stimuli, giving 110 and 30 triplets
per source and codec, respectively. Cross-codec comparisons
made up 20% of same-codec comparisons, resulting in one cross-codec comparison
for every five same-codec comparisons. The cross-codec comparisons were chosen
randomly targeting similar bitrates. The following table lists the total number of
different types of questions used for boosted and plain triplet comparisons.
       Same-codec   Cross-codec   Bias-checking   Trap questions   Total   Batch size
BTC        2750          550            100             200         3600        360
PTC         750          150             50             100         1050        105
For the BTC and the PTC experiment, the questions were randomly divided into
10 batches as follows. The bias-checking comparisons and the trap questions were
uniformly distributed into the batches. The same- and cross-codec questions were
joined before splitting them uniformly.
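As an illustration of this batch generation, the following minimal sketch distributes
the bias-checking and trap questions uniformly over the batches and splits the joined
same- and cross-codec questions; the data structures and the function name are
assumptions for illustration.

```python
# A minimal sketch of the batch generation described above; the question
# lists and the function name are illustrative assumptions.
import random

def make_batches(same_codec, cross_codec, bias_checking, traps,
                 n_batches=10, seed=0):
    rng = random.Random(seed)
    batches = [[] for _ in range(n_batches)]
    # Bias-checking comparisons and trap questions are distributed
    # uniformly over the batches.
    for pool in (bias_checking, traps):
        shuffled = pool[:]
        rng.shuffle(shuffled)
        for i, question in enumerate(shuffled):
            batches[i % n_batches].append(question)
    # Same- and cross-codec questions are joined before splitting uniformly.
    study = same_codec + cross_codec
    rng.shuffle(study)
    for i, question in enumerate(study):
        batches[i % n_batches].append(question)
    return batches
```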
Crowdsourcing study
Test subjects were recruited through Amazon Mechanical Turk (MTurk). The BTC
and PTC experiments were conducted separately, with several months in between,
and 778 and 352 assignments were made available for BTC and PTC, respectively.
In each campaign, a crowd-worker could take only a single assignment containing
one or two different batches of triplet questions. The batches were selected randomly
from the available pool of batches. The sequence of questions within each batch
was randomized for each worker. Ethical approval for the experimental procedures
and protocols was obtained from the Institutional Review Board of the University of
Konstanz.
Data analysis
The data analysis proceeded in four steps:
1. Filtering of reliable batches. The reliability of each batch was evaluated
based on same-codec triplet questions, where one test image was the source
(distortion level 0), and the other had the highest distortion (level 10). These
questions included all trap questions and some same-codec study questions. A
batch was deemed reliable if at least 70% of the responses were correct. Figures
3a and 3b show the histograms of the accuracy of batches for BTC and PTC.
            Raw data              After filtering
        Batches   Subjects     Batches   Subjects
BTC        1166        778         615        423
PTC         494        352         260        208
Figure 3c shows the responses collected for the bias questions in which there is
no difference between the compared stimuli. Clearly, there is a pronounced order
bias towards the response “Right”. After filtering out the unreliable batches,
the numbers of “Left” and “Right” responses are equal. This indicates that the
original order bias is due to unreliable subjects.
2. Reconstruction of scales. The BTC responses were used to reconstruct
(boosted) scale values for the corresponding compressed images from the five
codecs for each of the five source images. For this purpose, each “not sure” re-
sponse was split into half “left” and half “right”. All responses were interpreted
in the sense of two-alternative forced choice in pair comparisons. Then, the re-
construction proceeded by maximum likelihood estimation (MLE) of the scale
values in the corresponding Thurstonian Case V model, see [11, Sect. IV.B] for
more details and a comparison with other approaches. The scale values from
the PTC responses were computed in the same way. Note that for Thurstonian
reconstruction, it is usually assumed that the variance of the perceived difference
in stimuli quality is 1 when the two qualities are 1 unit apart. In order to convert
this scale into JND units, one divides the scale values by Φ⁻¹(0.75) ≈ 0.6745,
where Φ is the standard normal cumulative distribution function (a sketch of
this reconstruction follows the list).
3. Alignment of boosted scales. To rescale the boosted scores to the range of
perceptual quality without boosting, linear regression was used. In this regres-
sion, the JND scale values from BTC are the predictor variables, and those from
PTC are the target variables for a fit by the degree-two polynomial f(x) = ax + bx².
Figure 3: Histograms of batch accuracy for (a) BTC and (b) PTC, and (c) responses to
the bias questions.
Figure 4: Results of impairment scale reconstruction, aligned boosted scales, and
their 95% confidence intervals. Each row shows the plots for one of the five sources,
and columns correspond to the five codecs. The yellow shaded area is narrow and
indicates the confidence interval of the aligned scale values.
Figure 5: Psychometric functions and JND threshold determination with and without
artifact boosting. Shown is the ratio of correct responses to same-codec triplet
questions comparing the source (level 0) with each distortion level, averaged over
all sources and codecs. The JND threshold, according to the AIC-2 flicker test, is
between the two dotted lines. The dark-shaded region, therefore, is in the visually
lossless region, and the lightly shaded region is partly visually lossless.
4. Bootstrap for confidence intervals. n = 10,000 bootstrap samples of the
(filtered) BTC and PTC data were generated by resampling the responses of each
triplet question with replacement. The subsequent scale reconstructions gave n
values for each scale variable, which yielded the 95% confidence intervals
(sketched below).
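To make step 2 concrete, the following is a minimal sketch of a Thurstonian Case V
reconstruction by MLE, including the conversion to JND units; the count-matrix
interface, the zero anchor for the first stimulus, and the use of SciPy are
illustrative assumptions rather than the exact implementation of [11].

```python
# A minimal sketch of Thurstonian Case V scale reconstruction by maximum
# likelihood, with conversion to JND units. C[i, j] counts how often stimulus
# i was judged more distorted than stimulus j, with each "not sure" response
# contributing 0.5 to both C[i, j] and C[j, i]; the diagonal is zero.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def reconstruct_scales(C: np.ndarray) -> np.ndarray:
    n = C.shape[0]

    def neg_log_likelihood(free):
        s = np.concatenate(([0.0], free))      # anchor s[0] = 0
        p = norm.cdf(s[:, None] - s[None, :])  # P(i judged above j)
        return -np.sum(C * np.log(np.clip(p, 1e-12, 1.0)))

    res = minimize(neg_log_likelihood, np.zeros(n - 1), method="L-BFGS-B")
    scales = np.concatenate(([0.0], res.x))
    # Divide by the inverse normal CDF at 0.75 to express scales in JND units.
    return scales / norm.ppf(0.75)
```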
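A corresponding minimal sketch of the bootstrap in step 4 reuses reconstruct_scales
from the sketch above; responses_per_triplet and the helper build_count_matrix are
hypothetical names for the per-triplet response data and the counting step.

```python
# A minimal sketch of bootstrap confidence intervals: the responses of each
# triplet question are resampled with replacement, the scales are
# reconstructed per replicate, and percentiles give the intervals.
# build_count_matrix is an assumed helper turning responses into C.
import numpy as np

def bootstrap_ci(responses_per_triplet, build_count_matrix,
                 n_boot=10_000, alpha=0.05):
    rng = np.random.default_rng(0)
    samples = []
    for _ in range(n_boot):
        # Resample the responses of each triplet question with replacement.
        resampled = [rng.choice(r, size=len(r), replace=True)
                     for r in responses_per_triplet]
        samples.append(reconstruct_scales(build_count_matrix(resampled)))
    samples = np.asarray(samples)
    lower = np.percentile(samples, 100 * alpha / 2, axis=0)
    upper = np.percentile(samples, 100 * (1 - alpha / 2), axis=0)
    return lower, upper
```

Note that the scale reconstruction inside the loop dominates the running time of the
n = 10,000 replicates.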
Experimental results and discussion
The alignment of boosted scales can be performed using different procedures. The
transformation can be made by one unique polynomial for all sources and codecs, by
five polynomials (one per source, or one per codec), or by a different
transformation for each of the 25 source-codec pairs. The latter method yielded
the best fit w.r.t. the PTC data, however, at the expense of a higher number
of parameters (300 versus 252 or 260 for the other settings). For the performance
comparison, we therefore applied the Akaike information criterion that takes the
number of parameters into account besides the log-likelihood of the fitted models.
The result is that the best fit by taking a different polynomial for each source-codec
pair is not outweighed by the cost of the larger number of parameters. Therefore, the
option of one transformation per source-codec pair was selected and it was concluded
that the boosting transformation modeled by quadratic polynomials depends on the
source image as well as on the distortion type.
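As a sketch of this model selection under the simplifying assumption of Gaussian
residuals, one can fit each candidate setting by least squares and compare AIC
values; the names below are ours, and the paper's parameter counts additionally
include the 250 reconstructed scale values.

```python
# A minimal sketch of the quadratic alignment fit and AIC comparison,
# assuming a Gaussian residual model as a simplification; names are ours.
import numpy as np

def fit_alignment(btc, ptc):
    """Least-squares fit of ptc ~ a*btc + b*btc**2 (no intercept)."""
    X = np.column_stack([btc, btc ** 2])
    coef, *_ = np.linalg.lstsq(X, ptc, rcond=None)
    return coef, ptc - X @ coef

def aic(residuals, n_params):
    """AIC = 2k - 2 log L, with the Gaussian ML variance estimate."""
    n = len(residuals)
    sigma2 = np.mean(residuals ** 2)
    log_l = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * n_params - 2 * log_l
```

For the per-pair setting, the residuals of the 25 separate fits are pooled and the
count of polynomial coefficients rises to 50, against 2 for a single shared fit.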
Figure 4 shows an overview of the reconstruction and regression results for all
sources and codecs. Generally, the PTC scales (open circles) are within the confidence
intervals of the aligned boosted scale values. This shows that the assessment of the
visual quality of images with controlled boosting by zooming, artifact amplification,
and flicker can successfully be rescaled to match the scales obtained by assessing
image quality without boosting. Moreover, the boosted scales are, as expected, larger
than the unboosted ones by a factor of about 2. This means that the precision of
the aligned scales is also about twice as good as the precision obtainable without
boosting.
Figure 5 presents the psychometric functions of the proportions of correct re-
sponses for BTC and PTC that illustrate the JND thresholds with and without
boosting. For each tested distortion level, it shows the ratio of correct responses
to those same-codec triplet questions in which a compressed image was compared
with the source at level 0, averaged over all sources and codecs. For ex-
ample, for PTC at distortion level 4 on the right dotted line, the ratio is 0.75. By
definition, this corresponds to a perceived impairment magnitude of 1 JND. Thus,
this result confirms that, for distortion level 4, the corresponding decoded images were
indeed chosen as anticipated, with a distance of 0.25 JND per level. As expected, the
JND threshold for the more sensitive BTC methodology is closer to distortion level
2, indicated by the left dotted line. The threshold for visually lossless compression
as defined by JPEG AIC-2 for the flicker test should be between the two dotted lines
in the light green-shaded area, since this test is more sensitive than PTC, which
has a rate-limited manual toggle in place of flicker, and less sensitive than BTC,
which exploits additional boosting techniques besides flicker.
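To illustrate how such a threshold is read off the psychometric function, the
following sketch linearly interpolates the proportion of correct responses to find
the 75% point; the values are placeholders consistent with the PTC curve crossing
near level 4, not the measured data.

```python
# A minimal sketch of reading the JND threshold off an empirical psychometric
# function by linear interpolation; the values below are placeholders, not
# the measured data.
import numpy as np

levels = np.arange(11)  # distortion levels 0..10
proportion_correct = np.array([0.50, 0.56, 0.62, 0.69, 0.75, 0.80,
                               0.85, 0.89, 0.92, 0.95, 0.97])

# By definition, the 75%-correct point corresponds to 1 JND.
jnd_level = np.interp(0.75, proportion_correct, levels)
print(f"JND threshold at distortion level {jnd_level:.2f}")  # -> 4.00
```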
Conclusions and future work
A comprehensive dataset and framework for full-reference subjective quality assess-
ment of high-fidelity decoded images was presented, using boosted and plain triplet
comparisons. Boosting techniques are employed to enhance sensitivity to compres-
sion artifacts, thereby improving the perceptual ranking of test images. A smaller
set of plain triplet comparisons is used in a regression to rescale the boosted quality
scales to match those of the original stimuli. The results, provided in JND units,
offer information for practical applications that is not available from traditional
DMOS values. For example, this facilitates the estimation of satisfied user ratios in the most
relevant range from high to lossless visual quality. The dataset and code will be made
available online. JPEG AIC will continue to develop the proposed framework and
the methods for data analysis, aiming at an international ISO standard on subjective
quality assessment in the high-fidelity range.
Acknowledgments
This research is funded by the DFG (German Research Foundation) Project ID
496858717 and DFG Project ID 251654672 TRR 161. EPFL-affiliated authors
acknowledge support from the Swiss National Scientific Research project under grant
number 200020 207918. Instituto Superior Técnico-affiliated authors were supported
by FCT/MCTES through national funds under the project DARING with reference
PTDC/EEI-COM/7775/2020.
References
[1] Recommendation ITU-R BT.500-15, “Methodologies for the subjective assessment of
the quality of television images,” International Telecommunication Union, 2023.
[2] ISO/IEC TR 29170-1:2017, “Information technology – Advanced image coding and
evaluation – Part 1: Guidelines for image coding system evaluation.”
[3] Rafał K. Mantiuk, Anna Tomaszewska, and Radosław Mantiuk, “Comparison of four
subjective methods for image quality assessment,” Computer Graphics Forum, vol. 31,
no. 8, pp. 2478–2491, 2012.
[4] Michela Testolina, Davi Lazzarotto, Rafael Rodrigues, Shima Mohammadi, João
Ascenso, António M. G. Pinheiro, and Touradj Ebrahimi, “On the performance of subjec-
tive visual quality assessment protocols for nearly visually lossless image compression,”
in Proc. 31st ACM International Conference on Multimedia, 2023, pp. 6715–6723.
[5] Michela Testolina, Evgeniy Upenik, and Touradj Ebrahimi, “On the assessment of
high-quality images: advances on the JPEG AIC-3 activity,” in Applications of Digital
Image Processing XLVI. SPIE, 2023, vol. 12674, pp. 180–190.
[6] Recommendation ITU-T P.910, “Subjective video quality assessment methods for
multimedia applications,” International Telecommunication Union, 2008.
[7] ISO/IEC 29170-2:2015, “Information technology – Advanced image coding and
evaluation – Part 2: Evaluation procedure for nearly lossless coding.”
[8] Joe Yuchieh Lin, Lina Jin, Sudeng Hu, Ioannis Katsavounidis, Zhi Li, Anne Aaron, and
C-C Jay Kuo, “Experimental design and analysis of JND test on coded image/video,”
in Applications of Digital Image Processing XXXVIII, 2015, vol. 9599, pp. 324–334.
[9] Haiqiang Wang, Ioannis Katsavounidis, Jiantong Zhou, Jeonghoon Park, Shawmin Lei,
Xin Zhou, Man-On Pun, Xin Jin, Ronggang Wang, Xu Wang, Yun Zhang, Jiwu Huang,
Sam Kwong, and C.-C. Jay Kuo, “VideoSet: A large-scale compressed video quality
dataset based on JND measurement,” Journal of Visual Communication and Image
Representation, vol. 46, pp. 292–302, 2017.
[10] Hanhe Lin, Guangan Chen, Mohsen Jenadeleh, Vlad Hosu, Ulf-Dietrich Reips, Raouf
Hamzaoui, and Dietmar Saupe, “Large-scale crowdsourced subjective assessment of
picturewise just noticeable difference,” IEEE Transactions on Circuits and Systems
for Video Technology, vol. 32, no. 9, pp. 5859–5873, 2022.
[11] Hui Men, Hanhe Lin, Mohsen Jenadeleh, and Dietmar Saupe, “Subjective image
quality assessment with boosted triplet comparisons,” IEEE Access, vol. 9, pp. 138939–
138975, 2021.
[12] Jon Sneyers, Elad Ben Baruch, and Yaron Vaxman, “CID22: Large-scale subjective
quality assessment for high fidelity image compression,” TechRxiv, April 2023.
[13] Mohsen Jenadeleh, Johannes Zagermann, Harald Reiterer, Ulf-Dietrich Reips, Raouf
Hamzaoui, and Dietmar Saupe, “Relaxed forced choice improves performance of visual
quality assessment methods,” in 2023 15th International Conference on Quality of
Multimedia Experience (QoMEX). IEEE, 2023, pp. 37–42.
[14] Michela Testolina, Vlad Hosu, Mohsen Jenadeleh, Davi Lazzarotto, Dietmar Saupe,
and Touradj Ebrahimi, “JPEG AIC-3 Dataset: Towards defining the high quality
to nearly visually lossless quality range,” in 2023 15th International Conference on
Quality of Multimedia Experience (QoMEX). IEEE, 2023, pp. 55–60.
Joe Yuchieh Lin, Lina Jin, Sudeng Hu, Ioannis Katsavounidis, Zhi Li, Anne Aaron, and C-C Jay Kuo, "Experimental design and analysis of JND test on coded image/video," in Applications of Digital Image Processing XXXVIII, 2015, vol. 9599, pp. 324-334.