Video quality of video professionals for Video Assisted Referee
Kjell Brunnström (a, b), Anders Djupsjöbacka (a), Johsan Billingham (c), Katharina Wistel (c), Börje Andrén (a), Oskars Ozoliņš (a, d), Nicolas Evans (c)
(a) RISE Research Institutes of Sweden, Kista, Sweden
(b) Mid Sweden University, Sundsvall, Sweden
(c) Fédération Internationale de Football Association (FIFA), Zürich, Switzerland
(d) Royal Institute of Technology (KTH), Stockholm, Sweden
Changes in the footballing world’s approach to technology and
innovation contributed to the decision by the International Football
Association Board to introduce Video Assistant Referees (VAR).
The change meant that, under strict protocols, referees could use
video replays to review decisions in the event of a “clear and
obvious error” or a “serious missed incident”. This led to the need
for the Fédération Internationale de Football Association (FIFA) to
develop methods for quality control of VAR systems, which
was done in collaboration with RISE Research Institutes of
Sweden AB. One important aspect is the video quality. The
novelty of this study is that it is a user study
specifically targeting video experts, i.e., it measures quality as
perceived by professionals who work with video production as
their main occupation. An experiment was performed involving 25
video experts. In addition, six video quality models were
benchmarked against the user data and evaluated to show which of
the models provide the best predictions of perceived quality
for this application. Video Quality Metric for variable frame delay
(VQM_VFD) had the best performance for both formats evaluated
(1080p and 1080i), followed by Video Multimethod Assessment
Fusion (VMAF) and the VQM General model.
A TV broadcast passes through multiple quality-affecting steps from
the moment of filming until the video or TV program is aired.
The International Telecommunication Union (ITU) identifies three
distinct phases within the production and distribution process of
TV broadcasting [1]:
“Contribution – Carriage of signals to production centers
where post-production processing may take place.
Primary distribution – Use of a transmission channel for
transferring audio and/or video information to one or several
destination points without a view to further post-processing
on reception (e.g., from a continuity studio to a transmitter
Secondary distribution – Use of a transmission channel for
distribution of programs to viewers at large (by over-the-air
broadcasting or by cable television, including retransmission,
such as by broadcast repeaters, by satellite master antenna
television (SMATV), and by community based-network, e.g.,
community antenna television (CATV).”
Video quality assessment has matured in the sense that there are
standardized and commercial products, as well as established
open-source solutions, for measuring video quality objectively [2-5].
Furthermore, the methods for experimentally testing and evaluating
the Quality of Experience (QoE) [6, 7] of a video are widely
accepted in the research community and in the broadcasting
industry, and are based on standardized procedures [8-17].
The novelty of this research is that it conducted a user study
specifically targeting video experts, whereas the majority of
previous research has targeted end users or naïve users. Using
professionals whose main occupation is video production, the study
measured how they perceived the quality of the videos shown. In a
second step, six video quality models were benchmarked against the
data collected in the first phase of the research. This made it
possible to identify the video quality models that predict the
perceived video quality with a high degree of confidence.
Video quality user study
To measure the users’ opinion of the video quality, the Absolute
Category Rating (ACR) method with hidden references was used
[8-10]. This method uses a single-stimulus procedure: one video is
presented at a time, and the user is asked to rate each video
after it stops. In this study, the ratings were given via a voting
interface on the screen, which asked
the user to “judge the video quality of the video?”. The rating scale
used was the five-grade ACR quality scale:
5 Excellent
4 Good
3 Fair
2 Poor
1 Bad
To evaluate objective video quality models for the different
production formats used in the TV production of football games, the
subjects were asked to rate three different video formats:
Full size 1920x1080 video based on progressive source (1080p).
Full size 1920x1080 video based on interlaced source (1080i).
Quarter size 960x540 video based on interlaced source (540i).
The order in which the video formats were played to the
subjects, as well as the order of the video sequences within
each video format, was randomized for each subject. For the video
playback and randomization, the VQEGPlayer [32] was used. The
time required by the subjects to watch the videos and provide the
ratings for all three sessions was approx. 45 minutes in total, with
short breaks between sessions. The total time required for each
user, including instructions, visual testing, training, and pre- and
post-questionnaires, was about 1.5 hours.
There were 60 so-called Processed Video Sequences (PVSs) to be
evaluated per session. These consisted of 6 different source
sequences (SRCs), i.e., different content, each processed with
10 different error conditions. Each video was 10 seconds long and,
with an average estimated voting time of 5 seconds, a
trial took about 15 seconds.
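As a rough consistency check, the session length implied by these figures can be worked out directly. The sketch below is illustrative arithmetic only; the variable names are not from the study, and the 5-second voting time is the average estimate quoted above.

```python
# Illustrative arithmetic only; variable names are assumptions, not from the study.
SRC_COUNT = 6        # source sequences per format
CONDITIONS = 10      # error conditions applied to each source
CLIP_SECONDS = 10    # duration of each PVS
VOTE_SECONDS = 5     # average estimated voting time
SESSIONS = 3         # one session per video format

pvs_per_session = SRC_COUNT * CONDITIONS
trial_seconds = CLIP_SECONDS + VOTE_SECONDS
session_minutes = pvs_per_session * trial_seconds / 60
total_minutes = SESSIONS * session_minutes

print(pvs_per_session, session_minutes, total_minutes)  # 60 15.0 45.0
```

The result, about 15 minutes per session and 45 minutes for three sessions, agrees with the approximate viewing time reported for the experiment.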
Instructions were given in writing for the subjects to read, to
ensure that every subject received instructions that were as
similar as possible. Some explanations and background were given
verbally, especially in response to questions or uncertainty about
the task.
To create a controlled and uniform environment for the subjects,
the test room was set up to comply with the requirements of
ITU-R Rec. BT.500-13 [8]. A high-end consumer-grade 65” 4K TV
(Ultra HD, LG OLED65E7V) with a resolution of 3840 x 2160 pixels
was used for the experiments. As the videos used in
the experiment had a lower resolution (1920x1080 and 960x540)
than the screen, the video was displayed pixel-matched in the center
of the screen with a grey surround. The interlaced 1080i video was
deinterlaced in software; the deinterlacing of the TV was not
used. The viewing distance was 3H, i.e., 120 cm.
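The 120 cm figure can be related to the display geometry with a short calculation. The sketch below assumes that H is the height of the pixel-matched 1080-line picture (half the panel height), that the panel is 16:9 with a 65-inch diagonal, and that pixels are square. These are assumptions made for illustration, and the function name is hypothetical.

```python
import math

# Assumptions for illustration: 16:9 panel, 65-inch diagonal, square pixels.
# The function name is hypothetical and not taken from the study.
def picture_height_cm(diagonal_inch, panel_rows, picture_rows, aspect=(16, 9)):
    """Physical height of a pixel-matched picture shown on a larger panel."""
    w, h = aspect
    panel_height_cm = diagonal_inch * 2.54 * h / math.hypot(w, h)
    return panel_height_cm * picture_rows / panel_rows

h_1080 = picture_height_cm(65, panel_rows=2160, picture_rows=1080)
viewing_distance_cm = 3 * h_1080

print(round(h_1080, 1), round(viewing_distance_cm))  # 40.5 121
```

Under these assumptions 3H comes out at roughly 121 cm, which is consistent with the stated viewing distance.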
In the experiment, 25 Swedish-speaking video experts participated
as subjects.
Prior to the test, all viewers were screened for the following:
Visual acuity with or without corrective glasses (Snellen test).
Color vision (Ishihara test).
Of the 25 video experts, 23 were male and 2 were female. The
average age of the test users was 37.8 years, with a standard
deviation of 10 years. All subjects had good visual acuity, as
expected for such professionals: the average was 1.09/1.06
(right/left eye), with a standard deviation of 0.18/0.20, a maximum
of 1.4, and a minimum of 0.6 on one eye. About half of them wore
glasses or contact lenses. All had accurate color vision.
To rate the video quality, a set of six different source video
sequences was shown to the expert panel. The video formats
selected for the SRCs were:
1920x1080 progressive, 50 frames per second (1080p)
1920x1080 interlaced, 50 fields per second (1080i)
All SRCs were obtained as uncompressed video during live
football broadcast productions, as well as from the Swedish
Television (SVT) production Fairytale, which was produced for
research and standardization purposes [18]. From all collected
videos, clips with a length of 14 seconds were extracted.
There were 10 different video processing conditions per video
format (including the reference), and each condition was applied to
each SRC in each format, giving 60 processed video
sequences (PVSs) per format. All PVSs were 10 seconds long.
A summary of the video processing is the following:
1080p: H.264 (80 Mbit/s – 10 Mbit/s) and Motion JPEG (80
Mbit/s – 20 Mbit/s).
1080i: H.264 (50 Mbit/s – 10 Mbit/s), Motion JPEG (80
Mbit/s – 20 Mbit/s) and bad deinterlacing.
540i: H.264 (50 Mbit/s – 10 Mbit/s) and different scaling
algorithms (lanczos, bilinear and neighbor).
Objective video quality assessment methods
Objective video quality models were evaluated for their
performance on the video formats 1080p and 1080i. The methods
considered were:
Video Multimethod Assessment Fusion (VMAF)[3]
Video Quality Metric (VQM) – General model (ITU-T Rec.
Video Quality Metric (VQM) – (VQM_VFD)[2]
Peak Signal to Noise Ratio (PSNR) ITU-T Rec J.340[19] [20]
Structural Similarity Index (SSIM) [21] [20]
Visual Information Fidelity (VIF) [22] [20]
Video quality user study
The quality of the video clips is characterized by the Mean
Opinion Score (MOS), which is the mean over the ratings given by
the users:
$$\mathrm{MOS}_j = \frac{1}{N} \sum_{i=1}^{N} \mu_{ij}, \qquad j = 1, \dots, M$$
where $\mu_{ij}$ is the score of user i for PVS j, N is the number of
users, and M is the number of PVSs.
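A minimal sketch of this calculation is given below. The ratings matrix is hypothetical (rows are users, columns are PVSs) and the function name is chosen for illustration only.

```python
# Hypothetical ratings on the 1-5 ACR scale: rows are users (i), columns are PVSs (j).
ratings = [
    [5, 4, 2],
    [4, 4, 1],
    [5, 3, 2],
    [4, 5, 3],
]

def mean_opinion_scores(ratings):
    """Return one MOS per PVS: the mean over users of each column."""
    n_users = len(ratings)
    n_pvs = len(ratings[0])
    return [sum(row[j] for row in ratings) / n_users for j in range(n_pvs)]

mos = mean_opinion_scores(ratings)
print(mos)  # [4.5, 4.0, 2.0]
```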
The statistical analysis first applied a repeated-measures
Analysis of Variance (ANOVA), followed by a post-hoc analysis
based on Tukey's Honestly Significant Difference (HSD) test
[23, 24].
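To make the first step concrete, the sketch below computes the F statistic of a one-way repeated-measures ANOVA for a single within-subject factor. It is a simplified illustration under stated assumptions: the exact factor design and software used in the study are not described here, the data are hypothetical, the function name is invented for this example, and the Tukey HSD post-hoc step is not shown.

```python
def repeated_measures_f(ratings):
    """One-way repeated-measures ANOVA F statistic.

    ratings[i][j] is the score of subject i for condition j.
    Returns (F, df_conditions, df_error).
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of conditions
    grand = sum(sum(row) for row in ratings) / (n * k)
    cond_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    subj_means = [sum(row) / k for row in ratings]

    # Partition the total sum of squares into condition, subject and error parts.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_error = ss_total - ss_cond - ss_subj

    df_cond = k - 1
    df_error = (n - 1) * (k - 1)
    f_value = (ss_cond / df_cond) / (ss_error / df_error)
    return f_value, df_cond, df_error

# Hypothetical data: 3 subjects rating 3 conditions.
example = [[5, 4, 3], [4, 3, 2], [5, 3, 1]]
f_value, df_cond, df_error = repeated_measures_f(example)
print(round(f_value, 2), df_cond, df_error)  # 16.0 2 4
```

The p-value follows from the F distribution with (df_cond, df_error) degrees of freedom. Removing the between-subject variation from the error term is what distinguishes this from an ordinary one-way ANOVA.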
Figure 1 shows the video processing schemes and bitrates that
were applied to the SRCs for 1080p. The encoding
performed by Motion JPEG (MJPEG) is shown as a solid black curve
and the H.264 encoding as a dashed black curve. The MOS of the
reference is marked as a red line without tying it to the bitrate,
to avoid making the x-axis too long. The quality drops fast with
lower bitrates for MJPEG, whereas the quality for H.264 is
indistinguishable from the reference down to about 20 Mbit/s.
Figure 1. The mean quality for 1080p (y-axis) of the degradations taken over
all source video clips (SRCs) and users, divided into the different codecs used
(MJPEG in solid black curve and H.264 dashed black curve) and plotted
against the bitrate (x-axis). The MOS of the reference is marked as red line
without tying it to the bitrate to not make the x-axis too long.
A breakdown of the different processing schemes and bitrates
applied to the SRCs for 1080i is shown in Figure 2. The encoding
performed by MJPEG is shown as a solid black curve and the H.264
encoding as a dashed black curve. The MOS of the reference is
marked as a red line without tying it to the bitrate, to avoid
making the x-axis too long.
One error condition was a simple deinterlacing applied directly to
the uncompressed video; its MOS is drawn in the same
way as the reference, as a yellow line across the graph. This error
condition was clearly disliked by the users and received very
low ratings. The quality drops fast with lower bitrates for MJPEG,
whereas the quality for H.264 is indistinguishable from the
reference down to about 30 Mbit/s. In contrast to 1080p, however,
the quality at 20
Mbit/s is statistically significantly lower for 1080i (p = 0.03 <
Figure 2. The mean quality for 1080i (y-axis) of the degradations taken over all
source video clips (SRCs) and users, divided into the different codecs used
(MJPEG in solid black curve and H.264 dashed black curve) and plotted
against the bitrate (x-axis). The MOS of the reference is marked as red line
without tying it to the bitrate to not make the x-axis too long. Similarly, the
error conditions based on simple deinterlacing on an otherwise uncompressed
video are shown as a yellow line.
Objective video quality models evaluation
In the evaluation, we studied the overall performance as given
by the Pearson Correlation Coefficient (PCC) [25] and the Root Mean
Square Error (RMSE) [25] between the scores of the objective
model and the Difference Mean Opinion Scores (DMOS). The
DMOS was calculated by subtracting, for each user, the rating of the
reference from the rating of the distorted video. To get the values
onto the same scale as the Mean Opinion Scores (MOS), i.e., 1-5, the
following formula was used: difference score = 5 – (reference
score – distorted score). The PCC measures the linear relationship
between the model scores and the DMOS. As the relationship
is very often not linear, it is recommended to linearize the
dependency by fitting a third-order monotonic polynomial to the
data [25]. This usually improves the PCC somewhat, and it also
enables the calculation of the RMSE. A statistical hypothesis test
was also applied to the RMSE values. The null hypothesis, H0, is
that there is no statistical difference between two RMSE values,
and the alternative hypothesis, H1, is that there is a statistical
difference. The test was based on forming an F ratio: the
larger RMSE value squared divided by the smaller RMSE value
squared. The degrees of freedom are the number of points in the
RMSE calculation minus 4, due to the third-order monotonic
polynomial fit, i.e., 54 – 4 = 50 [25]. The Spearman Correlation
Coefficient (SCC) was also calculated.
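The quantities described above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used in the study: all data are hypothetical, the function names are invented for the example, the third-order monotonic polynomial fit is assumed to have been applied already to the model scores, and dividing by N minus the number of fitted parameters in the RMSE is an assumption that follows the degrees-of-freedom statement above.

```python
import math

def difference_score(reference_score, distorted_score):
    """Per-user difference score, shifted so that 'no difference' maps to 5."""
    # The value can exceed 5 if a user rates the distorted clip above the reference.
    return 5 - (reference_score - distorted_score)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def rmse(predicted, observed, fitted_params=4):
    """RMSE with N - fitted_params in the denominator (assumed convention)."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / (len(predicted) - fitted_params))

def f_ratio(rmse_a, rmse_b):
    """Larger squared RMSE divided by the smaller squared RMSE."""
    high, low = max(rmse_a, rmse_b), min(rmse_a, rmse_b)
    return (high / low) ** 2

# Hypothetical DMOS values and already-mapped model predictions.
dmos = [4.8, 4.5, 3.9, 3.1, 2.4, 1.6]
model_a = [4.7, 4.4, 4.0, 3.0, 2.6, 1.5]
model_b = [4.2, 4.9, 3.3, 3.6, 2.0, 2.1]

print(difference_score(reference_score=5, distorted_score=3))  # 3
print(round(pearson(model_a, dmos), 3), round(pearson(model_b, dmos), 3))
print(round(f_ratio(rmse(model_a, dmos), rmse(model_b, dmos)), 2))
```

The F ratio is then compared against the critical value of the F distribution for the chosen significance level, here with 50 degrees of freedom for each RMSE.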
The p-values of the statistical significance tests are shown in Table
1 and Table 2. VQM_VFD is significantly better than all other
models for 1080p and better than PSNR, SSIM, and VIF for 1080i.
VMAF is significantly better than PSNR and VIF for 1080p.
SSIM has very low performance for 1080i and is significantly
worse than all other models.
Table 1: P-values of statistical test on the difference in RMSE
based on ITU-T Rec. P.1401[25] for 1080p. Significant values
are marked with *, based on an alpha of 0.05 and the method of
Holm for multiple comparisons of 15 comparisons.
Model        VMAF     VQM_VFD     VQM_General  SSIM  PSNR
VQM_VFD      0.00014
VQM_General  0.22     < 0.0001 *
SSIM         0.0067   < 0.0001 *  0.042
PSNR         0.0034   < 0.0001 *  0.024        0.40
VIF          0.0040   < 0.0001 *  0.028        0.43  0.48
Table 2: P-values of statistical test on the difference in RMSE
based on ITU-T Rec. P.1401[25] for 1080i. Significant values are
marked with *, based on an alpha of 0.05 and the method of
Holm for multiple comparisons of 15 comparisons.
Model        VMAF     VQM_VFD     VQM_General  SSIM   PSNR
VQM_VFD      0.17
VQM_General  0.29     0.066
SSIM         0.00046  < 0.0001 *  0.0027
PSNR         0.044    0.0042 *    0.12         0.049
VIF          0.0343   0.0030 *    0.10         0.062  0.45
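The Holm method named in the table captions can be sketched as a step-down procedure over the sorted p-values. The code below is an illustrative implementation with hypothetical p-values, not the values from Table 1 or Table 2, and the function name is invented for this example.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure; returns one reject flag per input p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is tested at alpha/m, the next at alpha/(m-1), and so on.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained as well
    return reject

# Hypothetical p-values for four comparisons.
example = [0.001, 0.04, 0.03, 0.012]
print(holm_reject(example))  # [True, False, False, True]
```

With 15 comparisons and an alpha of 0.05, the smallest p-value is compared against 0.05/15, about 0.0033, and each subsequent threshold is slightly less strict.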
The performance of six different video quality models has been
evaluated for 1080p and 1080i. VQM_VFD had the best
performance for both formats, followed by VMAF and the VQM
General model. SSIM, PSNR, and VIF have similar performance,
which is lower than that of the other three models. SSIM has
particularly low performance for 1080i, mostly due to the
low-quality deinterlacing method, but the scatter plots show
that PSNR and VIF had similar problems.
