Multimedia Systems (2012) 18:445-457
DOI 10.1007/s00530-012-0262-4
REGULAR PAPER
Assessing the importance of audio/video synchronization
for simultaneous translation of video sequences
Nicolas Staelens · Jonas De Meulenaere · Lizzy Bleumers · Glenn Van Wallendael ·
Jan De Cock · Koen Geeraert · Nick Vercammen · Wendy Van den Broeck ·
Brecht Vermeulen · Rik Van de Walle · Piet Demeester
Received: 19 September 2011 / Accepted: 16 April 2012 / Published online: 3 May 2012
© Springer-Verlag 2012
Abstract Lip synchronization is considered a key
parameter during interactive communication. In the case of
video conferencing and television broadcasting, the dif-
ferential delay between audio and video should remain
below certain thresholds, as recommended by several
standardization bodies. However, further research has also
shown that these thresholds can be relaxed, depending on
the targeted application and use case. In this article, we
investigate the influence of lip sync on the ability to per-
form real-time language interpretation during video con-
ferencing. Furthermore, we are also interested in
determining proper lip sync visibility thresholds applicable
to this use case. Therefore, we conducted a subjective
experiment using expert interpreters, who were required
to perform a simultaneous translation, and non-experts. Our
results show that significant differences are obtained when
conducting subjective experiments with expert interpreters.
As interpreters are primarily focused on performing the
simultaneous translation, lip sync detectability thresholds
are higher compared with existing recommended thresh-
olds. As such, primary focus and the targeted application
and use case are important factors to be considered when
selecting proper lip sync acceptability thresholds.
Keywords Audio/video synchronization · Lip sync ·
Subjective quality assessment · Audiovisual quality ·
Language interpretation
1 Introduction
Perceived quality of audiovisual sequences can be influ-
enced by the quality of the video stream, the quality of the
Communicated by R. Steinmetz.
N. Staelens (✉) · N. Vercammen · B. Vermeulen · P. Demeester
Department of Information Technology, Ghent University,
IBBT, Ghent, Belgium
e-mail: nicolas.staelens@intec.ugent.be
N. Vercammen
e-mail: nick.vercammen@intec.ugent.be
B. Vermeulen
e-mail: brecht.vermeulen@intec.ugent.be
P. Demeester
e-mail: piet.demeester@intec.ugent.be
J. De Meulenaere · L. Bleumers · W. Van den Broeck
Studies on Media, Information and Telecommunication,
Free University of Brussels, IBBT, Brussels, Belgium
e-mail: jonas.de.meulenaere@vub.ac.be
L. Bleumers
e-mail: lizzy.bleumers@vub.ac.be
W. Van den Broeck
e-mail: wvdbroec@vub.ac.be
G. Van Wallendael · J. De Cock · R. Van de Walle
Department of Electronics and Information Systems,
Ghent University, IBBT, Ghent, Belgium
e-mail: glenn.vanwallendael@ugent.be
J. De Cock
e-mail: jan.decock@ugent.be
R. Van de Walle
e-mail: rik.vandewalle@ugent.be
K. Geeraert
Televic N.V., Izegem, Belgium
e-mail: k.geeraert@televic.com
audio stream, and the differential delay between the audio
and video (A/V synchronization) [14]. In the case of
interactive communication, such as video conferencing,
A/V synchronization is considered a key parameter [32]
and is more commonly referred to as lip synchronization
(lip sync) [6]. According to International Telecommuni-
cation Union (ITU)-T Recommendation P.10 [15], the goal
of lip sync is to ‘provide the feeling that the speaking
motion of the displayed person is synchronized with that
person’s voice’.
Several standard bodies such as the ITU, the European
Broadcast Union (EBU), and the Advanced Television
Systems Committee (ATSC) formulated a series of rec-
ommendations [1,5,11,13] concerning the maximum
allowed differential delay between audio and video to
maintain satisfactory perceived quality. However, further
research [3,7,28] has already pointed out that these rec-
ommendations can be relaxed in some cases, depending on
the targeted use case and application.
Similar to video conferencing, simultaneous translation
or language interpretation is also an example of interactive
video communication. In professional environments, such
as the European Parliament, interpreters usually reside in
specially equipped interpreter booths (see Fig. 2) during
the debates. Furthermore, these debates are recorded and
broadcast to the booths and also made available as live
video streams broadcast over the Internet. The content of
such a live video stream typically consists of close-up
views of the current active speaker and provides the
interpreters with additional non-verbal information (ges-
tures, facial expressions) which can facilitate the simulta-
neous translation.
In general, the existing recommended A/V synchroni-
zation thresholds are determined based on subjective
experiments conducted using non-expert users [12]. How-
ever, interpreters can be regarded as expert users since they
actively use the video stream while performing the simul-
taneous translation and also process the non-verbal infor-
mation from the video.
Recent studies have shown that non-experts are more
tolerant than experts during audiovisual quality
assessment [26]. Furthermore, context and primary focus
are also important factors to consider during quality
assessment [27]. Therefore, additional research is needed
to investigate whether the existing thresholds are also valid
in the expert use-case of language interpretation.
In this article, we are particularly interested in investi-
gating how delay between audio and video is perceived by
real interpreters and how this delay affects their ability to
perform simultaneous translations. Face-to-face interviews
were organized with interpreters to talk about the relative
importance of audio/video synchronization, the added
value of having visual feedback (next to the audio signal),
and which kind of (additional) information interpreters
usually use or require for performing simultaneous trans-
lation. Furthermore, we also conducted a subjective
audiovisual quality experiment during which the inter-
preters were asked to perform simultaneous translation of a
number of video sequences as they would do in real life.
After each sequence, the interpreters were questioned about
the audio/video delay and the overall audiovisual quality.
The results of the subjective test are then compared with
the results obtained during the face-to-face interviews. As a
last step, we also conducted the same subjective experi-
ment using non-expert users to compare the results con-
cerning audio/video delay visibility and annoyance with
the results of the expert users. In contrast with the inter-
preters, the non-expert users were not asked to perform a
simultaneous translation of the video sequences.
The remainder of this article is structured as follows. In
Sect. 2, we start by describing different techniques for
monitoring and measuring the differential delay between
audio and video. Furthermore, we also provide an overview
of already conducted research and existing standards
defining a wide range of acceptability thresholds related to
A/V synchronization. Based on this study, we highlight the
importance of our study presented in this article. For
obtaining ground-truth data, a subjective experiment has
been set up and conducted. This will be explained in more
detail in Sect. 3. Section 4 presents the results of this
subjective experiment, which we conducted using both
experts and non-experts. The differences in the results
obtained using these two targeted user groups are also
discussed in more detail in the same section. Finally, we
conclude the article in Sect. 5.
2 Monitoring and measuring audio/video
synchronization
In order to ensure and maintain synchronized audio and
video, several measurement and monitoring techniques
have been proposed in literature. Furthermore, research has
already been conducted to determine A/V synchronization
acceptability thresholds for several applications such as
video broadcasting and video conferencing. However, as
will be explained in more detail in the next sections, a wide
range of different thresholds has been identified, each of
which is dependent on the application.
2.1 Audio/video synchronization measurement
techniques
In many broadcast systems, ‘off-line’ measurement tech-
niques are used to maintain audio/video synchronization.
Presentation time stamps (PTS), for example, can be
embedded in MPEG transport streams to avoid A/V syn-
chronization drift. Similarly, comparison of SMPTE time
codes in audio and video signals can be used to synchro-
nize the audio and video signals. These time stamps or time
codes are often added after the video undergoes frame
synchronization, format conversion, and pre-processing. As
a result, delays or misalignment in these stages remain
uncompensated. Also, as time codes have no actual relation
to the signal, mistimed or misaligned information can lead
to a loss of A/V synchronization.
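As an illustration of such a timestamp-based check (not a method described in the paper), the following Python sketch compares the first audio and video presentation time stamps of a stream using ffprobe; the file name and the reported offset are hypothetical, and the sketch assumes ffprobe is available on the system.

# Sketch: read the first audio and video PTS of a stream with ffprobe and
# report the static offset between them. File name is illustrative only.
import subprocess

def first_pts(filename, stream):
    """Return the PTS (in seconds) of the first packet of the given stream."""
    out = subprocess.check_output([
        "ffprobe", "-v", "error",
        "-select_streams", stream,            # e.g. "v:0" or "a:0"
        "-show_entries", "packet=pts_time",
        "-of", "csv=p=0",
        "-read_intervals", "%+#1",            # read only the first packet
        filename,
    ])
    return float(out.decode().strip().splitlines()[0])

if __name__ == "__main__":
    video_pts = first_pts("broadcast.ts", "v:0")
    audio_pts = first_pts("broadcast.ts", "a:0")
    print(f"audio starts {(audio_pts - video_pts) * 1000:+.1f} ms relative to video")

Note that such a check only reveals a static container-level offset; as explained above, delays introduced after the time stamps are written remain invisible to it.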
A number of solutions have been proposed that can
overcome these limitations. In order for these techniques to
be useful in conferencing and broadcast environments, a
number of requirements should be met. As the synchroni-
zation errors can vary in time, it is important that the
measurement method responds to the A/V synchronization
in a dynamic way and in real time (in-service). Preferably,
the techniques should work for all types of audio and video
content, independent of the used format. Also, they should
be robust to modifications of the audio and video signals
that can occur during content distribution.
Roughly, three classes of methods can be distinguished
for dynamically measuring the A/V synchronization based
on the correspondence between both signals.
A first class exploits the relationship between acoustic
speech and the corresponding lip features (such as width
and height) and lip movements. In Mued et al. [19], a high
correlation between the estimated and measured visual lip
features was found. Evidently, such methods are con-
strained to video content where lip motion is present.
Second, watermarking solutions have been investigated
for A/V synchronization. Watermarking can, e.g., embed
information about the audio signal into the video stream.
The envelope of the audio signal is analyzed, from which a
watermark is generated. This watermark can be embedded
in the corresponding video stream. At a receiver point, the
video and audio streams and the watermark can be
observed to obtain a measure of the A/V synchronization.
One issue with this technique is that the watermark is not
necessarily robust to adaptation of the video and/or audio
signal, for example, when transrating, aspect ratio con-
version, or audio downmixing are applied.
In a third class of techniques, an A/V synchronization
fingerprint (also referred to as ‘signature’ or ‘DNA’) is
added to the audio and video signals. Features from both
signals are extracted and combined into an independent
data stream at a point where both signals are known to be
in-sync. Later on, this data stream can be used to measure
and maintain the A/V synchronization. Fingerprinting
exploits characteristic features of the video or audio (such
as luminance, transitions, edges, motion, etc.) and uses a
formula to condense the data into a small representation
[18], e.g., based on robust hash codes [8]. These hash codes
are sent in the data stream and ensure that small pertur-
bations in the audio and video features caused by signal
processing operations will not change the hash bits dras-
tically. At the detection point, signatures are again
extracted based on the received signals, and a comparison
is made between the generated and transmitted signatures
within a short time window. The output of the correlator
between both signatures will result in an estimated delay.
Real-time systems based on these techniques have been
described in [25,30].
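The delay-estimation step common to these fingerprinting systems essentially reduces to correlating two short feature streams. The Python sketch below is a minimal illustration, not the algorithm of [25] or [30]: it assumes one feature sample per video frame (25 Hz) and uses synthetic features, and a positive result means the audio features trail the video features.

# Sketch: estimate the audio/video differential delay by cross-correlating two
# feature streams sampled on a common 40 ms grid (25 samples/s). The features
# here are synthetic placeholders for the robust signatures described above.
import numpy as np

FRAME_RATE = 25.0   # feature samples per second (one per video frame)

def estimate_delay_ms(video_feat, audio_feat, max_lag_frames=12):
    """Return the lag (ms) of the audio features behind the video features."""
    v = (video_feat - video_feat.mean()) / (video_feat.std() + 1e-9)
    a = (audio_feat - audio_feat.mean()) / (audio_feat.std() + 1e-9)
    lags = list(range(-max_lag_frames, max_lag_frames + 1))
    scores = [np.dot(v[max(0, -l):len(v) - max(0, l)],
                     a[max(0, l):len(a) - max(0, -l)]) for l in lags]
    return lags[int(np.argmax(scores))] * 1000.0 / FRAME_RATE

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video_feat = rng.random(250)           # 10 s of per-frame features
    audio_feat = np.roll(video_feat, 4)    # simulate audio trailing by 4 frames
    print(estimate_delay_ms(video_feat, audio_feat))  # expected: 160.0 (ms)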
To secure interoperability of A/V synchronization
techniques, standardization initiatives have been started.
Recently, the SMPTE 22TV Lip Sync Ad Hoc Group
(AHG) has been studying the problem. The goal of this
AHG is the creation of a standard for audio and video
fingerprinting algorithms, transport mechanisms, and
associated recommended practices. An overview of their
activities is given in [29].
2.2 Audio/video synchronization perceptibility
thresholds
As mentioned in the introduction, several standard bodies
have already established a set of performance objectives
for audio/video synchronization which has resulted in dif-
ferent detectability and acceptability thresholds. According
to ITU-R Recommendation BT.1359, the thresholds for
detecting A/V synchronization errors are at +45 and
-125 ms [11], where a negative number corresponds with
audio delayed with respect to the video. The standard also
specifies that synchronization errors become unacceptable
in case the delay exceeds +90 or -185 ms. Recommen-
dation R37 of the EBU [5] defines that the end-to-end delay
between audio and video in the case of television programs
should lie between +40 and -60 ms. These thresholds are
lower compared with the detectability thresholds as spec-
ified in ITU-R Rec. BT.1359. The ATSC Implementation
Subcommittee (IS) 191 [1] argues that the recommenda-
tions from ITU-R Rec. BT.1359 are inadequate for digital
TV broadcasting and states that the differential audio/video
delay should remain between +15 and -45 ms to deliver
tightly synchronized programs. The same thresholds are
also recommended by the DSL Forum [4] and ITU-T
Recommendation G.1080 [13].
Due to the fact that these international standards propose
different audio/video synchronization thresholds, a lot of
research has been performed and is still ongoing to eval-
uate and identify lip sync thresholds for different applica-
tions and use cases.
Steinmetz [28] performed an in-depth analysis of the
influence of jitter and media synchronization on perceived
quality. The goal of his study was to identify the thresholds
at which lip sync becomes noticeable and/or annoying. The
test sequences consisted of simulated news broadcasts,
with a resolution of 240 × 256 pixels, in which delay up to
320 ms between audio and video was inserted. The
majority of the test subjects did not detect audio/video
delays up to 80 ms, whereas delays of more than 160 ms
were detected by nearly all subjects. Furthermore, these
thresholds are valid both for audio leading the video and
for video leading the audio. Results concerning the annoyance
of the perceived lip sync indicate that delays up to 80 ms
are acceptable for most of the subjects. When audio lags
the video by more than 240 ms or audio leads the video by
more than 160 ms, lip sync is perceived as distracting.
The interaction effect on perceived quality of providing
high-quality video of Quarter Common Intermediate For-
mat (QCIF) resolution (176 × 144 pixels) with accompa-
nying low-quality audio and vice versa has been studied in
[21] in the case of both interactive and passive communi-
cation. The authors conclude that video has a beneficial
influence on overall multimedia quality, which corresponds
with the findings from Garcia et al. [9]. Part of the study
also involved investigating the effect of lip sync on overall
multimedia quality. For the lip sync experiment, audio and
video were delayed up to 440 ms. Almost half of the test
subjects (45 %) did not detect synchronization errors when
the video stream was delayed with respect to the audio
stream. In the case the audio stream was delayed, only
24 % of the subjects indicated that no synchronization
error occurred. These results suggest that subjects are more
tolerant of audio leading the video. Further research
[20] has also pointed out that more attention to lip sync is
given during passive communication compared with active
communication. During the latter, subjects are more con-
centrated on the conversation itself.
During another multimedia synchronization study, sev-
eral CIF resolution (352 × 288 pixels) video sequences
were presented to the test subjects to quantify the effect of
A/V delay [7]. The quality of the audio and the video
stream remained constant during the experiment, only the
differential delay varied between -405 and +405 ms.
Subjects were only required to evaluate the audiovisual
quality of the presented sequences using a 5-grade scale.
Results show that, even in the case no delay is present in
the sequence, subjects never rated the sequences to be
excellent quality. Furthermore, sequences with an audio
offset of -40 ms were rated slightly better quality com-
pared with the case of no delay. Audio offsets between
-310 and +140 ms are all rated as of good quality.
Overall, audio lagging the video was perceived as less
annoying compared with audio leading the video, which is
in slight contrast with the results of Mued et al. [21], as
discussed above.
The absolute perceptual threshold for detecting audio/
video synchronization errors when audio is leading the
video is at 185.19 ms according to the results of Younkin
et al. [32]. This experiment did not include sequences in
which the audio was lagging, but the authors assume that
the detection threshold of audio lagging the video should
be higher.
An experiment similar to the one conducted by Steinmetz
[28] was repeated in [3], with a specific focus on mobile
environments. The authors argue that
different detection and annoyance thresholds may apply in
mobile environments due to the change in screen size,
viewing distance, and frame rate compared with the TV
viewing environment. As such, small-resolution (QCIF and
Sub-QCIF) low frame rate test sequences were used during
the experiment. The lip sync detection threshold, in the
case of audio leading the video, is at 80 ms. It must be
noted that a stricter evaluation method was used to deter-
mine this threshold compared with the results in [28]. In
the case of audio lagging the video, the detection threshold
appears to be content and frame rate dependent and varies
between -160 and -280 ms.
Figure 1 provides a graphical overview of the different
thresholds as identified by the international standards and
research findings described above. It is clear that each
application and use case scenario is characterized by dif-
ferent detectability thresholds. Furthermore, as the figure
also shows, the acceptability thresholds span a wide range
of allowable differential delay between the audio and the
corresponding video stream. Therefore, additional research
is needed to identify proper lip sync detectability thresh-
olds in the case of simultaneous translation of video
sequences and to investigate the relative importance of
providing visual feedback to the interpreters.
3 Subjective quality assessment of audio/video
delay during simultaneous translation
In order to collect ground-truth data concerning the visi-
bility, annoyance and influence of A/V delay in the case of
simultaneous translation, a subjective audiovisual quality
experiment has been set up and conducted using expert
interpreters. Furthermore, the experiment has also been
conducted with non-expert users to investigate whether
there are significant differences with the results obtained
using the interpreters as both user groups have a different
primary focus and expertise.
3.1 Experimental setup
Internationally standardized subjective audiovisual quality
assessment methodologies, such as the ones described in
ITU-T Recommendation P.911 [16] and ITU-T Rec. P.920
[17] include detailed guidelines on how to set up and
conduct such quality experiments. For the evaluation of
audiovisual sequences, these methodologies describe the
order in which the sequences must be presented to the test
subjects and propose different rating scales which can be
used by the subjects to assign a quality score to the cor-
responding sequence. Furthermore, the standards also pose
some stringent demands related to the viewing and listen-
ing conditions by specifying, amongst others, the viewing
distance between the test subject and the screen, the
luminance level of the screen, the overall room illumina-
tion, and the allowed amount of background noise. As
such, subjective quality experiments are usually conducted
in controlled environments.
Preliminary results in [24] show that subjects’ audiovi-
sual quality ratings are not significantly influenced when
conducting subjective experiments in pristine lab environ-
ments, compliant with the ITU recommendations, or on
location (e.g. in a company’s cafeteria with background
noise and different lighting conditions). This indicates that
the overall test room conditions, as specified in [16] and
[17], can be relaxed to some extent.
In previous research [27], we also investigated the
influence of conducting subjective quality assessment
experiments in real-life environments, where subjects are
not primarily focused on (audio)visual quality evalua-
tion. Our results show that impairment visibility and
annoyance are significantly influenced by subjects’ pri-
mary focus and that measuring quality of experience
(QoE) should ideally be performed in the most natural
environment corresponding to the video service under
test. The latter also complies with the definition of QoE
which states that the quality, as perceived subjectively by
the end-user, can be influenced by user expectations and
context [15].
In the case of performing simultaneous translations,
interpreters usually reside in special designated interpreter
booths as depicted in Fig. 2.
Based on the research findings mentioned above, we
also opted to conduct our subjective experiments in the
interpreter’s most natural environment by mimicking a
typical interpreter’s booth as much as possible. As such,
our assessment environment illustrated in Fig. 3 consists of
Fig. 1 Graphical representation of the different audio/video delay and lip sync detectability thresholds as identified by several standard bodies
and already conducted research
Fig. 2 Typical interpreters’ booth for performing simultaneous
translations
Fig. 3 Environmental setup as used during our subjective quality
assessment experiment in order to mimic a realistic environment (cf.
Fig. 2)
similar hardware as the one used in a professional envi-
ronment to ensure that our test subjects have a similar
experience compared with the real-life scenario.
As can be seen from Figs. 2 and 3, a display which
shows a live video stream with a close-up of the person
currently talking is also at the interpreter’s disposal.
3.2 Audiovisual subjective assessment methodology
During subjective audiovisual quality assessment, test
subjects watch and evaluate the perceived quality of a
number of video sequences. In general, two different types
of methodologies can be used for displaying the different
test sequences to the subjects.
First of all, sequences can be shown pairwise using a
double-stimulus (DS) methodology. In this case, two
sequences (usually the original version and an impaired or
degraded version of it) are first presented to the test sub-
jects after which they need to evaluate the quality differ-
ences between both sequences. As such, each test sequence
is always presented in relation to a reference sequence.
These methodologies are commonly used for evaluating
the performance of video codecs [12].
A second type of methodology, called single stimulus
(SS), presents the test sequences one at a time to the sub-
jects. Immediately after watching the video sequence,
subjects have to provide a quality rating. This means that
the quality of each sequence must be evaluated without the
use of an explicit reference sequence representing optimal
quality. A typical trial structure of an SS methodology is
depicted in Fig. 4.
It is clear that SS methodologies correspond more with
the way people watch video on their computer or on their
television [10,31]. This is also the case for the video
streamed to the interpreter booths. As such, we also used
the SS methodology to show the test sequences one after
another to the different subjects.
After watching each video sequence, subjects were
required to answer the following three questions:
1. Did you perceive any audio/video synchronization
issues?
2. Do you think audio was ahead with respect to the video
or vice versa?
3. How annoying does the audio/video synchronization
problem appear to you, on a scale from 1 to 5?
For the last question, subjects were presented with the
five-level impairment scale as depicted in Fig. 5.
In case the user did not perceive any audio/video syn-
chronization problem in the presented video sequence (thus
answering ‘no’ on the first question), questions 2 and 3
were automatically skipped.
As specified in ITU-T Rec. P.911, subjects also received
specific instructions on how to evaluate the different video
sequences. Furthermore, before the start of the real sub-
jective experiment, two training sequences were presented
to the subjects to get them familiarized with the subjective
experiment and the range of audio/video synchronization
issues they could expect. The audiovisual quality ratings
given to these two training sequences are not taken into
account when processing the results. A standard headset
was used for playback of the audio stream. During the
training sequences, the test subjects were allowed to reg-
ulate the volume of the headset.
As we are interested in assessing the influence of lip
synchronization errors on the ability to perform simulta-
neous translation of video sequences, the interpreters par-
ticipating in our subjective experiment were also required
to perform this task during sequence playout. As such, the
interpreters were mainly focused on the simultaneous
translation of the video sequences. It must be noted that
they were still aware of the possibility of audio/video
synchronization errors as this was stated at the beginning of
the trial. As already mentioned, the experiment was also
conducted using non-expert users. These were not required
to simultaneously translate the sequences and were there-
fore mainly focused on detecting audio/video delays.
As recommended in [12], the preferred viewing distance
between the screen and the test subjects should be around
seven times the screen height (H). However, as can be seen
from Fig. 2, interpreters sit closer to the screen than the
preferred viewing distance. Since we
are targeting a more realistic setup, we did not force our
test subjects to remain seated at a fixed viewing distance.
Fig. 4 Typical trial structure for an SS methodology [16], during
which sequences are presented one at a time and immediately
evaluated after watching
Fig. 5 Five-level impairment scale [16] used for collecting subjects’
responses concerning audio/video delay annoyance
The screen used for playback of the video sequences
was a standard 17-inch LCD panel with a resolution of
1,024 × 768 pixels.
3.3 Selection, creation and impairing of video
sequences
From Figs. 2 and 3, it can be seen that the content shown on the
displays in the interpreter booths typically consists of so-
called ‘talking head’ or ‘news’ sequences. These sequences
are characterized by a close-up of one or more persons talking
in front of the camera. Talking head sequences do not usually
contain a lot of background motion except for the person who
is in front of the camera. Examples of talking head MPEG-4
test sequences [23] include ‘Akiyo’, ‘News’, ‘Mother &
Daughter’ and ‘Silent’.
The source content we used for conducting our sub-
jective experiment consisted of a joint debate during a
plenary session of the European Parliament. During the
debate, the camera always took a close-up of the active
speaker. From that video content, of which we obtained the
original recordings, we then selected one speaker whose
native spoken language was English and who delivered a
continuous speech of about 5 min long.
ITU-R Recommendation BT.1359 [11] specifies that the
overall delay between audio and the corresponding video
track should fall within the range [-185, +90 ms] and that
the detectability thresholds are at -125 and +45 ms. In
this study, we want to evaluate how audio/video delay is
perceived by interpreters, who are experts in performing
simultaneous translation of video sequences, but not con-
cerning video quality. As such, their detectability and
acceptability thresholds may be different from the ones
recommended. Therefore, we inserted delay between the
audio and the video in the range of [-240, +120 ms]. The
source video content was captured at 25 frames/s at a
resolution of 720 × 406 pixels. For the experiment, the
delay step size was chosen to match the video frame rate
which implies that the delay varied in steps of 40 ms.
For inserting delay between the audio and the video, the
selected video sequence was first split into ten shorter clips,
each about 30 s long. This duration is slightly longer
compared with the sequence duration as recommended by
the ITU methodologies [16]. However, according to the
results in [28], clips of 30 s duration are needed to capture
the subjects' impression of audio/video synchro-
nization. We made sure that no cutting occurred in the
middle of a sentence. Then, the audio and the video track
were demuxed and additional delay was inserted in the
audio track. Finally, the audio and the video track were
remuxed back together. In this article, we are only inves-
tigating the influence of audio/video delay. Therefore, we
changed neither the quality of the video nor the audio
stream. As a result, the quality of the different processed
video sequences matched the quality of the original source
content. During the subjective experiment, the video
sequences were played back in the original order, one after
another. This way, we ensured that the natural flow of the
speech was not broken and that the conversation remained
logical to the interpreters.
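For readers who wish to reproduce this kind of processing, the demux/delay/remux step can be approximated with standard tooling; the paper does not state which software was used, so the sketch below (Python driving ffmpeg's -itsoffset option, with illustrative file names) is only one possible way to shift an audio track without re-encoding.

# Sketch: delay (offset_s > 0) or advance (offset_s < 0) the audio of a clip
# relative to its video using ffmpeg, copying both tracks without re-encoding.
# File names are illustrative; the paper does not specify the tool used.
import subprocess

def shift_audio(src, dst, offset_s):
    subprocess.run([
        "ffmpeg", "-y",
        "-i", src,                          # input 0: supplies the video track
        "-itsoffset", f"{offset_s:.3f}",    # timestamp offset for the next input
        "-i", src,                          # input 1: supplies the shifted audio
        "-map", "0:v:0", "-map", "1:a:0",
        "-c", "copy",
        dst,
    ], check=True)

# Example: produce the -200 ms condition of sequence 05 in Table 1
# (audio delayed by 200 ms with respect to the video).
shift_audio("clip05.mp4", "clip05_minus200ms.mp4", 0.200)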
A commonly used methodology for determining
detectability thresholds is the staircase method [2] which
would adaptively adjust (increase or decrease) the delay
between the audio and the video in consecutive video
sequences, depending on the subject’s responses. However,
using such a methodology, subjects can pick up the delay
behavior in the different sequences and anticipate their
responses [32]. Therefore, we randomly inserted the delay
in each video sequence. Furthermore, as we have a fixed
playout order, no adaptive re-ordering of the sequences is
possible. An overview of the delay inserted in each video
sequence is listed in Table 1.
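For comparison with the fixed, randomized delays of Table 1, the following is a minimal sketch of the adaptive staircase procedure referred to above [2]; the starting delay, step size, and stopping rule are arbitrary choices for illustration, and present_and_ask stands in for playing a clip with the given delay and recording the subject's yes/no detection response.

# Sketch: a simple 1-up/1-down staircase for estimating a lip sync detection
# threshold. All parameter values are illustrative, not taken from the paper.
import random

def staircase(present_and_ask, start_ms=240, step_ms=40, reversals_needed=6):
    delay, reversals, last_direction = start_ms, [], None
    while len(reversals) < reversals_needed:
        detected = present_and_ask(delay)
        direction = -1 if detected else +1        # decrease delay after a detection
        if last_direction is not None and direction != last_direction:
            reversals.append(delay)               # record the turning points
        last_direction = direction
        delay = max(0, delay + direction * step_ms)
    return sum(reversals) / len(reversals)        # threshold ~ mean of turning points

# Simulated subject with a true threshold of 160 ms and a 5 % false-alarm rate.
if __name__ == "__main__":
    estimate = staircase(lambda d: d >= 160 or random.random() < 0.05)
    print(f"estimated detection threshold: about {estimate:.0f} ms")

As the text notes, such an adaptive procedure lets subjects anticipate the delay pattern, which is why fixed, randomly ordered delays were used instead.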
4 Results
Using the subjective video quality assessment methodol-
ogy, as explained in Sect. 3.2, the expert users were pre-
sented with the ten different audiovisual sequences which
they were asked to simultaneously translate/interpret, just
as they would do in a normal real-life situation. After-
wards, we repeated exactly the same experiment using non-
expert users who were only required to evaluate the
audio/video synchronization of the sequences.
In this section, we first present the results obtained using
our interpreter test subjects. Then, we compare these
results with the findings from the non-experts.
4.1 Interpreters’ evaluation
Fifteen expert users, ten females and five males, participated in this experiment. The average age was 25, with a minimum age of 20 and a maximum age of 41. As recommended by ITU-T Recommendation P.911, at least 15 subjects should participate in the experiment. In the case of expert users, Nezveda et al. [22] even showed that a significantly lower number of subjects can be used.

Table 1 Inserted delay between audio and video in each video sequence (negative numbers imply that the audio is delayed with respect to the video)

Sequence   A/V delay (in ms)
01            0
02         -120
03           80
04          -80
05         -200
06           40
07         -160
08          120
09         -240
10          -40
In order to contextualize these participants and to
elaborate on the quantitative data, both interviews and
observational research were conducted. Prior to the
experiment, a short interview took place, ques-
tioning the participants about their experiences in inter-
preting, the use of video conferencing tools, what they
usually focus on while they interpret, the importance of
visual cues, and how they normally prepare an interpre-
tation session.
4.1.1 Interview to contextualize the interpreters
Of the test subjects, ten had at least 1 year of experience in
interpreting English to Dutch (and vice versa) and experi-
ence with performing simultaneous translations during
video conferencing. Their practical knowledge ranged from
exercises in class to actual interpreting at conferences.
In general, real-life interpreting is preferred to the use of
video conferencing tools as the latter may conceal con-
siderable contextual information. It is believed that limited
information about the speaker and the public impedes a
proper translation. In this respect, anticipating unexpected
events were recorded as well. In addition, it was also felt
that one is more dependent on the technological
functioning.
It was repeatedly indicated throughout the interviews
that the primary focus in (real-time) interpreting is
directed to the spoken word. As such, visual cues are only
of secondary importance. Still, the majority of the expert
users consider it helpful to have additional non-verbal
information provided in visual cues such as gestures,
facial expressions, and lip movements. It serves as a
comfort during their translation and it creates the setting
in which the speaker talks. On the other hand, actively
avoiding visual cues was also often cited, especially when
difficulties with translating are encountered (e.g. high
speech rate).
At the beginning of the experiment, the participants were
informed about the nature of the sequences they were about
to see. Consequently, no preparations on the subject could
be made. Normally, the specific vocabulary inherent to the
sector they are about to work for is thoroughly studied as
well as related documents and information about the
speakers. Lacking this information makes translating a
more demanding task. This could affect the translation
performance of the subjects, but because the main focus of
the research is lip synchronization, the effect on the results
is considered small.
4.1.2 Visibility and annoyance of audio/video
synchronization issues
After watching each individual video sequence, subjects
were required to indicate whether they perceived any
audio/video synchronization issues, rate the audiovisual
quality, and identify whether audio was leading the video
or vice versa.
In Fig. 6, the percentage of the expert subjects who
actually perceived the corresponding delay between the
audio and the video is depicted.
In general, almost none of the expert subjects detected
the desynchronization between audio and video (at most
one or two subjects), even in the case where the delay is up
to -240 ms. This can be explained by the fact that the
expert users are primarily focused on the simultaneous
translation of the audio track. As indicated during the pre-
interview, visual cues are only of secondary importance
and by some even actively avoided to focus solely on the
spoken content. The latter is especially the case when parts
of the conversation become more difficult to translate.
During the simultaneous translation of the different video
sequences, the interpreters are also actively communicat-
ing. Results in [20] indicate that less attention is given to
lip sync during active communication. According to the
results presented in Fig. 6, the delay between audio and
video may exceed the 160 ms threshold recommended by
Steinmetz [28].
Due to the low detection rates, there is no clear
difference concerning visibility of lip sync when audio is
delayed or ahead of the video signal. The graph shows that
the delay between the audio and the video can be more than
-240 or 120 ms before reaching a detection threshold of
100 %.
During the subjective experiment, we observed that the
participants mainly focused on the screen. Exceptionally,
some of them closed their eyes, looked away or even sat
Fig. 6 Percentage of expert users who did perceive lip sync issues
compared to the actual inserted delay
back for a while. Afterwards, they explained sometimes
having problems interpreting and translating the sequences,
caused by the high speech rate, the dense information,
uncertainty about a translation or in some cases the asyn-
chronicity between the audio and the video. The latter is
remarkable as the above graph shows that only a small
percentage of the experts actually perceived this
asynchronicity.
The overall average quality ratings given to the different
sequences, as shown in Fig. 7, remain high as only a small
percentage of the experts detect the A/V synchronization
issues.
Even when the delay goes up to -240 ms, the quality of
that particular video sequence is still not perceived as being
annoying (MOS > 4), similar to the results obtained in [7].
Analyzing the individual quality ratings given by the test
subjects to each video sequence showed that the quality
score drops on average by 1.3, with a standard deviation of
0.4, in case a lip sync problem is detected.
Finally, the interpreters were also asked to indicate, in
the case of an A/V synchronization issue, whether they
perceived the audio to be delayed with respect to the video
or vice versa. As the graph in Fig. 8 shows, very few
experts are able to correctly classify the relationship
between the video and the audio track.
It must be noted that the graph only takes into account
the subjects who actually detected the A/V synchronization
problem. As such, this graph should be closely inspected in
relation to the graph from Fig. 6 when interpreting the
results. For example, even though the classification accu-
racy is 100 % in the case of a delay of -200 ms, only one
of the test subjects actually detected this synchronization
issue.
In the case of a delay of -240 ms, 53 % of the subjects
detected the synchronization problem. However, only 38 %
of them could correctly detect that the audio was indeed
delayed with respect to the video. Further analysis of the
individual responses showed that subjects fail to identify
whether audio is ahead or delayed compared with the
video. Even when a particular subject identifies different
sync problems, he/she is not able to differentiate delayed
sound from delayed video. As such, similar to the question
whether they perceived a synchronization issue, subjects
are again trying to guess the answer.
Our results show a high correlation between the differ-
ent test subjects. It also clearly shows that, when inter-
preters are mainly focused on performing the simultaneous
translation, audio/video delay is not a primary concern to
them. Furthermore, the test subjects fail to combine real-
time interpretation with assessing the audiovisual quality of
the presented sequences. Even in the case of a severe dif-
ferential delay (C240 ms) between audio and video, syn-
chronization issues become only slightly detectable.
4.1.3 Post-experimental interview: extending
the quantitative data
Throughout the interviews it was recurrently indicated that
the audiovisual sequences used were demanding and
required high concentration. Interestingly, the reasons provided
mainly included factors associated with the content
of the sequences (e.g. high speech rate, vocabulary, or
diction) or with the participants themselves (lack of preparation), and only to
some extent the detected desynchronizations. Furthermore, the
participants assessed their performance worse than what
they normally achieve. The discrepancy between the low
detection rates and the encountered difficulties suggests that
the participants were highly involved in completing the
test, leaving little to no capacity to assess the (de-)syn-
chronization. This is supported by the expressed uncer-
tainty regarding their detections and whether audio or
video was leading. Furthermore, as the contextualization
interviews indicated, visual cues are secondary to auditory
Fig. 7 MOS scores given by the experts to the sequences with
inserted delay between audio and video
Fig. 8 Percentage of the experts who correctly determined whether
audio was leading video or vice versa, in case they perceived A/V
synchronization issues
cues, meaning that less attention is paid to the video in the
first place. Only when the delay reached -240 ms was the
desynchronization detected substantially more often. A
modest part of the participants expressed during the inter-
views that, when the desynchronization was perceived, it
did disturb them in completing their translation. The de-
synchronization amplified the difficulties one already had,
manifesting itself primarily as a loss of concentration. Yet,
the MOS scores indicate that none of the sequences were
considered annoying.
Despite the low detection rate, audio/video synchroni-
zation is often considered important. A correlation seems to
exist between the experienced difficulties and the allocated
weight to audio/video synchronization. The data suggest
that the more the difficulties encountered while translating,
the more the importance of synchronization is emphasized.
Quoting the participants, the maximal allowed delay varies
from none or milliseconds to not more than a few words.
Nevertheless, an impaired audio–visual stream was
recurrently preferred to a single audio track. As long as the
delay is not too high, nor too long, video is considered a
valuable asset as it provides the interpreter with a certain
comfort. Even in the case of this experiment, in which the
speaker showed few expressions or gestures, the video
was considered helpful by more than one participant.
4.2 Comparison with non-expert users
In this section, we investigate how the average end-users
perceive audio/video synchronization to see whether there
is a significant difference with respect to the interpreters.
Test subjects were asked to watch the same audiovisual
sequences as the interpreters and evaluate whether they
perceived any audio/video synchronization issues. In con-
trast to the expert interpreters, the non-expert users were
not asked to perform a simultaneous translation of the
speech. As a result, the non-experts are primarily focused
on detecting A/V synchronization issues.
A total of 24 non-expert users, aged between 24
and 34 years, participated in the subjective
experiment.
4.2.1 Detecting audio/video delay
Figure 9 shows the percentage of viewers who perceived
any kind of A/V synchronization problem, compared with
the actual delay inserted between the audio and the video
signal.
The graph clearly shows that delays up to one video
frame [-40, 40 ms] are not detected at all. This also cor-
responds with the A/V synchronization thresholds recom-
mended by the ITU [13], the ATSC [1], and the DSL
Forum [4]. Furthermore, when the audio is delayed by
240 ms compared with the video signal, all subjects also
detected the desynchronization. The detection threshold
shows more or less a linear behavior with respect to the
actual inserted delay. As can be seen in Fig. 9, a delay of
-160 ms is detected slightly more often than a delay of
-200 ms. However, based on a statistical Z test, we
found that there is no statistically significant difference between the
percentages of the subjects who perceived the delays of
-160 and -200 ms. In case the audio is 120 ms ahead of
the video signal, only 33 % of the subjects detected that the
audio and video were out of sync. This implies that the audio can lead the
video by more than 120 ms. Corresponding to
the results in [28], delays up to two video frames [-80,
80 ms] are only detected by a small number of subjects. An
interesting remark is that audio/video desynchronization is
apparently less detected when the audio is ahead of the
video which was also concluded by Mued et al. [21].
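The comparison of the detection percentages at -160 and -200 ms mentioned above is a standard two-proportion Z test. The sketch below shows the computation; the detection counts are hypothetical placeholders, since the paper reports percentages for the 24 non-expert subjects but not the exact counts per condition.

# Sketch: two-proportion Z test for comparing two detection rates. The counts
# used in the example are hypothetical, not taken from the paper.
from math import sqrt, erf

def two_proportion_z(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: both detection proportions are equal."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # normal approximation
    return z, p_value

# Hypothetical example: 16 of 24 detections at -160 ms versus 13 of 24 at -200 ms.
z, p = two_proportion_z(16, 24, 13, 24)
print(f"z = {z:.2f}, p = {p:.2f}")   # p well above 0.05: no significant difference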
Comparing the visibility of lip sync between the inter-
preters (Fig. 6) and the average end-users (Fig. 9) high-
lights the importance of the primary focus, similar to the
results obtained in [27]. Despite the fact that the inter-
preters were also asked to evaluate the A/V synchroniza-
tion, performing the simultaneous translation requires all
their attention.
In general, our results obtained using non-experts cor-
respond much more with results from already conducted
research.
4.2.2 Audiovisual quality ratings for sequences with
audio/video synchronization delays
When inspecting the MOS scores given to the different
video sequences, as depicted in Fig. 10, we notice that
delays up to two video frames are still rated as perfect quality,
which is consistent with the corresponding visibility
thresholds (see Fig. 9). Furthermore, delays up to 120 ms
are perceivable but not rated annoying (MOS > 4). These
Fig. 9 Percentage of non-expert viewers who perceived lip sync
issues compared to the actual inserted delay
results are similar to the different A/V synchronization
thresholds proposed by ITU-R Rec. BT.1359 [11] and
Steinmetz [28]. Test subjects also perceive delays of
-240 ms as annoying.
In accordance with our findings in the previous section,
audiovisual quality is rated slightly higher when the audio is
ahead of the video signal. However, this is not a significant
difference. Therefore, it cannot be assumed that sequences
with audio ahead of video are indeed less annoying compared
with the sequences in which the audio is delayed with respect
to the video.
On average, individual quality ratings drop by 1.5, with a
standard deviation of 0.3, in case a non-expert detects an A/V
synchronization problem. This is a slightly higher drop com-
pared with the interpreters because the non-experts are pri-
marily focused on audiovisual quality evaluation.
4.2.3 Identifying whether the audio stream is delayed
or ahead with respect to the corresponding video
track
In case the test subjects perceived an A/V synchronization
issue, they were also asked to indicate whether they perceived
the audio was delayed with respect to the video track or vice
versa. Figure 11 depicts the percentage of the subjects who
correctly determined whether the audio was delayed or was
ahead of the video. Note that only the results of subjects who
actually perceived an A/V sync issue are taken into account.
As the graph shows, it is difficult for the test subjects to
determine the exact relationship between the audio and the
video. Only a limited number of subjects are capable of
correctly detecting whether the audio leads the video or
vice versa. Even in the case of a delay of -240 ms, which
is detected by 100 % of the test subjects (cf.
Fig. 9), only 29 % of the subjects correctly identified that
the audio was delayed with respect to the video track. In
case of a delay of 80 ms, the plot shows that 100 % of the
subjects correctly classified the relationship between the
video and the audio track. However, it must be noted that
only one subject detected the A/V sync issue in this case. In
general, A/V sync becomes noticeable when the delay is
more than 120 ms (in both directions).
As such, similar to the evaluations done by the inter-
preters, the non-experts also fail to identify the direction of
the differential delay between audio and video.
5 Conclusions
As indicated during the pre-experimental interviews, visual
cues are only of secondary importance to the interpreters.
Having a challenging task to complete, as experienced by
several of the expert users, interpreters are primarily
focused on performing the simultaneous translation. As
such, detecting A/V desynchronization while interpreting a
conversation poses a great challenge to most of our test
subjects. Consequently, the majority of the interpreters do
not perceive lip sync problems when the differential delay
between audio and video remains below 240 ms. This
detection threshold is significantly higher compared with
the thresholds recommended by the different standard
bodies and already conducted research (see Fig. 1).
Both the experimental data and the post-experimental
interviews suggest a low importance of desynchronized
audio/video during simultaneous translation. Desynchro-
nization seems to amplify existing difficulties, rather than
causing difficulties by itself.
Despite the low detection rate and the high MOS scores,
only a minority considers A/V synchronization important.
Underlying this contradictory finding is the expectation
that desynchronized audio and video will hamper the task
of the interpreter eventually.
Fig. 10 MOS scores given by the non-experts to the sequences with
inserted delay between audio and video
Fig. 11 Percentage of the non-experts who correctly determined
whether audio was leading video or vice versa, in case they perceived
A/V synchronization issues
Conducting the same subjective experiment using non-
experts highlights the importance of the primary focus. It is
clear that lip sync is detected much more easily when subjects
are actively evaluating the audiovisual quality of the video
sequences.
In contrast with the research findings from the inter-
preters, the results concerning lip sync visibility and
acceptability obtained from our non-experts correspond to
the results from already conducted subjective studies and
with the recommendations from different standard bodies.
These differences, both in visibility and acceptability
thresholds between the interpreters (experts) and non-
experts, highlight the importance of considering
the targeted application and use case when determining
and investigating appropriate A/V synchronization
thresholds.
Acknowledgments The research activities that have been described
in this paper were funded by Ghent University, the Interdisciplinary
Institute for Broadband Technology (IBBT) and the Institute for the
Promotion of Innovation by Science and Technology in Flanders
(IWT). This paper is the result of research carried out as part of the
OMUS project funded by the IBBT. OMUS is being carried out by a
consortium of the industrial partners: Technicolor, Televic, Stream-
ovations and Excentis in cooperation with the IBBT research groups:
IBCN & MultimediaLab & WiCa (UGent), SMIT (VUB), PATS
(UA), and COSIC (KUL). Glenn Van Wallendael and Jan De Cock
would also like to thank the Institute for the Promotion of Innovation
through Science and Technology in Flanders for financially sup-
porting their Ph.D. and postdoctoral grant, respectively. The authors
would also like to thank Dr. Bart Defrancq, Lecturer and Coordinator
of the PP in Conference Interpreting at University College Ghent, for
his contributions to this work and support in acquiring the expert test
subjects.
References
1. ATSC IS-191: Relative timing of sound and vision for broadcast
operations (2003)
2. von Bekesy, G.: A new audiometer. Acta Otolaryngol. 35,
411–422 (1947)
3. Curcio, I.D., Lundan, M.: Human perception of lip synchroni-
zation in mobile environment. In: IEEE international symposium
on a world of wireless, mobile and multimedia networks (2007)
4. DSL Forum Technical Report TR-126: Triple-play services
quality of experience (QoE) requirements. DSL Forum (2006)
5. EBU Recommendation R37: The relative timing of the sound and
vision components of a television signal (2007)
6. Firestone, S., Ramalingam, T., Fry, S.: Voice and Video Con-
ferencing Fundamentals, chap. 7, pp. 223–255. Cisco Press
(2007)
7. Ford, C., McFarland, M., Ingram, W., Hanes, S., Pinson, M.,
Webster, A., Anderson, K.: Multimedia synchronization study.
Tech. rep., National Telecommunications and Information Admin-
istration (NTIA), Institute for Telecommunication Sciences (ITS)
(2009)
8. Fridrich, J., Goljan, M.: Robust hash functions for digital
watermarking. In: International conference on information tech-
nology: coding and computing (ITCC) (2000)
9. Garcia, M.N., Schleicher, R., Raake, A.: Impairment-factor-based
audiovisual quality model for IPTV: influence of video resolution,
degradation type, and content type. EURASIP J. Image Video
Process. 2011 (2011)
10. Huynh-Thu, Q., Garcia, M.N., Speranza, F., Corriveau, P., Raake,
A.: Study of rating scales for subjective quality assessment of
high-definition video. IEEE Trans. Broadcast. 57(1), 1–14 (2011)
11. ITU-R Recommendation BT.1359: Relative timing of sound and
vision for broadcasting (1998)
12. ITU-R Recommendation BT.500: Methodology for the subjective
assessment of the quality of television pictures (2009)
13. ITU-T Recommendation G.1080: Quality of experience require-
ments for IPTV services. International Telecommunication Union
(ITU) (2008)
14. ITU-T Recommendation J.148: Requirements for an objective
perceptual multimedia quality model. International Telecommu-
nication Union (ITU) (2003)
15. ITU-T Recommendation P.10/G.100 Amd 2: Vocabulary for
performance and quality of service (2008)
16. ITU-T Recommendation P.911: Subjective audiovisual quality
assessment methods for multimedia applications. International
Telecommunication Union (ITU) (1998)
17. ITU-T Recommendation P.920: Interactive test methods for
audiovisual communications. International Telecommunication
Union (ITU) (2000)
18. Kudrle, S., Proulx, M., Carrieres, P., Lopez, M.: Fingerprinting
for solving A/V synchronization issues within broadcast envi-
ronments. SMPTE Motion Imaging J., 47–57 (2011)
19. Mued, L., Lines, B., Furnell, S., Reynolds, P.: Acoustic speech to
lip feature mapping for multimedia applications. In: 3rd inter-
national symposium on image and signal processing and analysis
(ISPA), pp. 829–832 (2003)
20. Mued, L., Lines, B., Furnell, S., Reynolds, P.: The effects of lip
synchronization in IP conferencing. In: International conference
on visual information engineering, pp. 210–213 (2003)
21. Mued, L., Lines, B., Furnell, S., Reynolds, P.: The effects of
audio and video correlation and lip synchronization. Campus
Wide Inf Syst 20, 159–166 (2003)
22. Nezveda, M., Buchinger, S., Robitza, W., Hotop, E., Hummelb-
runner, P., Hlavacs, H.: Test persons for subjective video quality
testing: experts or non-experts? In: QoEMCS workshop at the
EuroITV, 8th European conference on interactive TV (2010)
23. Pereira, F., Alpert, T.: MPEG-4 video subjective test procedures and
results. IEEE Trans. Circuits Syst. Video Technol. 7(1), 32–51
(1997)
24. Pinson, M.H.: Impact of lab effects and environment on audiovisual
quality. VQEG_MM_2011_010_audiovisual quality repeatability,
Seoul (2011)
25. Radhakrishnan, R., Terry, K., Bauer, C.: Audio and video sig-
natures for synchronization. In: IEEE international conference on
multimedia and expo (ICME), pp. 1549–1552 (2008)
26. Speranza, F., Poulin, F., Renaud, R., Caron, M., Dupras, J.:
Objective and subjective quality assessment with expert and non-
expert viewers. In: Second international workshop on quality of
multimedia experience (QoMEX), pp. 46–51 (2010)
27. Staelens, N., Moens, S., Van den Broeck, W., Marien, I., Ver-
meulen, B., Lambert, P., Van de Walle, R., Demeester, P.:
Assessing quality of experience of IPTV and video on demand
services in real-life environments. IEEE Trans. Broadcast. 56(4),
458–466 (2010)
28. Steinmetz, R.: Human perception of jitter and media synchroni-
zation. IEEE J. Sel. Areas Commun. 14(1), 61–72 (1996)
29. Stojancic, M., Eakins, D.: Interoperable AV sync systems in the
SMPTE 22TV Lip Sync AHG: content-fingerprinting-based audio–
video synchronization. SMPTE Motion Imaging J., 47–57 (2011)
30. Terry, K., Radhakrishnan, R.: Detection and correction of lip-sync
errors using audio and video fingerprints. In: SMPTE annual tech
conference and expo (2009)
31. Winkler, S.: Digital video quality—vision models and metrics.
Wiley, New York (2005)
32. Younkin, A., Corriveau, P.: Determining the amount of audio–
video synchronization errors perceptible to the average end-user.
IEEE Trans. Broadcast. 54(3), 623–627 (2008)